Corresponding Author Prof. Peter Byass, Immpact, University of Aberdeen, Health Sciences Building, Foresterhill, Aberdeen AB25 2ZD, UK. Tel.: +44 1224 555850; Fax: +44 1224 555704; E-mail: email@example.com
Objectives To assess our experiences of using hand-held computers (personal digital assistants, PDAs) for direct data capture in a large community-based geo-referenced survey in rural Burkina Faso, highlighting benefits and lessons learnt from their use.
Methods A population-based geo-referenced survey of over 500 000 people was undertaken using PDAs with in-built GPS receivers and the resulting database analysed in terms of successful completion, error rates and interview durations.
Results Surveys were successfully completed for 84 861 households (98.3%) by 127 interviewers. The data input error rate was assessed at 0.24%, with more than half of the errors being made by fewer than 10% of the interviewers. Faster interviewers were not less accurate. Time-stamped and geo-referenced data allowed reconstruction of particular interviewer-day activities.
Conclusions Although the survey setting was challenging, the feasibility of using direct data capture on a large scale was well established. We learnt that, with more experience, we could have made better use of real-time entry and quality control checking procedures. The work involved in designing and setting up a complex survey on PDAs prior to data collection should not be underestimated.
Large-scale community surveys in developing countries are difficult and expensive, frequently requiring large teams of interviewers for house-to-house enquiries, considerable logistic support, and often long periods afterwards for data entry and cleaning. In many settings, established procedures for capturing, entering and controlling data quality (Rowan et al. 1987) have not changed markedly during the past 20 years, despite huge advances in information technology. Community surveys have typically involved capturing data on paper questionnaires at the household level and then using a nearby centre as a base for data entry and cleaning, but current technological advances and economic trends mean that hand-held computers or personal digital assistants (PDAs) are fast becoming a viable alternative (Gupta 1996; Diero et al. 2006). Proponents would argue that direct data capture at the point of interview can reduce error rates and speed up the cleaning process, and hence make databases available for analysis sooner (Epihandy 2006; Menda 2007), and may prove more cost-effective in the longer term. Possible counter-arguments are that there may be considerable logistic difficulties in operating high-tech equipment reliably under difficult field conditions, and that the risks of losing an entire unit, together with the data it might be holding, could outweigh the advantages. However, in the literature we have been unable to find any substantial evidence of adverse experiences or disadvantages of using direct data capture in fieldwork, although there is some evidence from institutional settings (Shelby-James et al. 2007).
Relatively few attempts have been made to carry out field data capture using different approaches as a comparative trial (Forster et al. 1991; Fletcher et al. 2003; Vivoda & Eby 2006), perhaps as this increases (apparently unnecessarily) the complexity of what may already be a difficult undertaking. Evidence on the use of PDAs in community-based surveys, particularly in Africa, is therefore sparse and often subjective. Even where non-comparative methods have been used, there is often very little analysis of parameters such as error rates, inter- and intra-observer variations and cost-effectiveness, so that objective comparisons of different experiences are difficult to make.
Immpact set out to undertake a community-based survey of maternal health outcomes in two districts of Burkina Faso in early 2006 (Hounton et al. 2008). The total target population amounted to approximately 500 000, in the remote south-eastern part of the country where infrastructure was generally weak. This paper describes the technology-based strategy using PDAs that was employed for data capture in this large geo-referenced epidemiological survey, to demonstrate both strengths and weaknesses, and to provide a practical resource base for other groups who might wish to implement a similar approach. We have considered issues of PDA hardware, software and utilisation, together with analyses of the collected data.
This study was conducted in Burkina Faso, one of the poorest countries in the world. Burkina Faso, in West Africa, is ranked 174/177 on the Human Development Index (United Nations Development Programme 2006). The population has a generally low level of education and poor health status. The climate is Sudanese with one rainy season from June to September. The study area comprised two remote and rural districts to the east of the country, mostly lacking electricity, running water, good roads and effective transport systems. More details of the study methodology are described elsewhere (Hounton et al. 2008).
Careful consideration was given at the planning stage of the survey as to whether or not direct data capture should be used. In many sub-districts, no reliable mains electricity would be available for charging battery-powered devices. On the other hand, it was particularly important to deliver a large dataset on a short timescale, and to avoid a lengthy data entry and cleaning phase after the survey. The target was to complete the field work in 3 months.
We decided to take the risk of using direct data capture in the hope of keeping to our targets. Geo-references for every household were an important part of the data collection, and, as we were concerned about arrangements for recharging equipment, we decided to use PDAs with integral GPS receivers to minimise unnecessary batteries and interface issues, while avoiding the possible errors inherent in transcribing GPS coordinates manually. We selected the Mio 168 PDA (Mio Technology), which combined these requirements satisfactorily and at reasonable cost (approximately US $350 per unit). Based on previous experience and the volume of data to be collected, we estimated a need for approximately 120 interviewers over a 3-month period of data collection, plus training, etc., and recruited school leavers who were competent in the local languages of the two districts respectively. The linguistic need to recruit locally meant that there was no possibility of finding large numbers of people experienced in this kind of interviewing, nor in the use of PDAs. However, mobile telephones are becoming commonplace in these communities and we were able to specify familiarity with using one as an advantage for recruitment. A considerable part of the training process was devoted to the use of the PDAs, particularly emphasising important concepts such as avoiding ‘delete’ functions and ensuring that all data were regularly and securely copied to the PDAs’ memory cards. The overall team of interviewers had exactly the same training process, but was then split between the two districts being surveyed. Each district had its own team of supervisors and data managers to download and collate the data from the PDAs. A Microsoft Access database held on a laptop in each district was automatically incremented every time a PDA was synchronised, thus building up an overall database.
Pendragon 4 software (Pendragon Software Corporation) was used with the PDAs and supervisors’ portable computers to program the required questionnaires and to handle data input and uploading into a Microsoft Access database.
Battery charging arrangements varied according to local circumstances. In some places mains electricity was available overnight. Elsewhere charging was carried out either from vehicles, using a cigarette-lighter adapter, or from 12 V motorcycle batteries which were small enough to be reasonably portable, and could either be charged centrally or using solar panels during the daytime. Regularly saving data from the PDAs’ internal volatile memory to non-volatile memory cards was an essential strategy for avoiding data loss, as memory cards could still be read externally in the event of a PDA losing power or failing.
The overall Immpact research proposal was approved by the Ministry of Health National Health Research Ethics Committee (Ouagadougou, Burkina Faso). The specific Evaluation and Evidence Research Group protocol was approved by Centre MURAZ (Bobo-Dioulasso, Burkina Faso) Institutional Review Board. Administrative authorizations were obtained at all levels of the administrative chain (Ministry of Health, Region Governorates, Regional Directorates of Health, National and Regional Hospital Directorates, Province High Commissioners, District Health Management Teams, Heads of Clinical Services in Hospitals and village community leaders).
Immpact’s community survey included a number of different modules, starting with a household census, and going on to further data relating to particular individuals and those who had recently died. The complete data collection phase was successfully completed in 15 weeks, just a little over our 3-month target. We experienced no technical problems with PDAs that seriously hampered data collection, although a total of 48/151 PDAs encountered some technical problems during the course of the survey, one being destroyed as a result of being run over by a vehicle.
For the purposes of this analysis, we are concentrating on the household census module, a form containing around 20 questions on household characteristics (e.g. construction, sanitation, access to health care, assets) as well as some auto-capture fields (e.g. handheld identifier, time-stamp at the start of the interview, GPS fix). Records were successfully captured for 86 376 households during February to May 2006, by 127 interviewers. A time-stamp was automatically recorded from the PDA system clock at the start of the interview, and the interview was designed to terminate by recording the GPS position of the household, which also automatically included a record of the system clock. All records carried valid time-stamp data, although for some unexplained reason nine PDAs changed system clock values from 2006 to 2003 briefly mid-survey (199 records, 0.23%). Overall, 1543/86 376 (1.8%) of time-stamp data were not between 06:00 and 18:00, possibly reflecting interviewers re-visiting or completing records in the evening. GPS positions were successfully recorded for 80 846 households (93.6%). Of these, 52 (0.06%) anomalously had positions recorded with no associated time-stamp. The missing GPS data possibly arose in cases where interviewers later reviewed records at a different location, or failed to wait for an adequate satellite position fix when capturing the GPS data.
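The time-stamp checks described above (records outside 06:00 to 18:00, and system clocks reporting 2003 instead of 2006) can be expressed as a simple per-record screen. The following is a minimal sketch in Python, not the procedure actually used in the survey; the function name `timestamp_flags` and the inclusive treatment of the 06:00 and 18:00 boundaries are assumptions made for illustration.

```python
from datetime import datetime, time

SURVEY_YEAR = 2006                             # data collection ran February to May 2006
DAY_START, DAY_END = time(6, 0), time(18, 0)   # boundaries treated as inclusive (assumption)

def timestamp_flags(start: datetime) -> list[str]:
    """Return warnings for one interview start time-stamp (hypothetical helper)."""
    flags = []
    if start.year != SURVEY_YEAR:
        flags.append("unexpected year")        # e.g. a system clock reset to 2003
    if not (DAY_START <= start.time() <= DAY_END):
        flags.append("outside 06:00-18:00")    # possible evening revisit or completion
    return flags
```

Run at each synchronisation, a screen of this kind would surface clock anomalies while the interviewer and the PDA are still at hand.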
The household interview was reported by the interviewers to have been successfully completed in 84 861 cases (98.3%). A valid duration for the household module (from start to GPS fix) was available in 76 650 cases (88.7%), with a mean time of 21.8 min and a median time of 17.8 min. The median duration decreased over the course of the survey and was slightly shorter for interviews that were not successfully completed, as shown in Table 1.
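Table 1. Mean and median duration for household interview module
Columns: n (%) with valid duration; Mean (95% CI); Median (95% CI)
n (%) entries in table order (row labels and the mean and median values are missing): 10 793 (14.1); 28 151 (36.7); 21 834 (28.5); 15 872 (20.7); 75 729 (98.8); 36 875 (48.1); 39 775 (51.9)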
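The duration of the household module was taken from the start time-stamp to the time recorded with the GPS fix. A minimal Python sketch of that calculation follows. The criteria for a "valid" duration are not stated in this paper, so the window used here (greater than zero and at most 180 minutes) is purely an assumption, as are the function names.

```python
from datetime import datetime
from statistics import mean, median

MAX_MINUTES = 180   # hypothetical upper limit for a plausible single interview

def duration_minutes(start, gps_fix):
    """Minutes from interview start to GPS fix, or None if not usable."""
    if start is None or gps_fix is None:
        return None                       # no GPS fix, so no end time
    minutes = (gps_fix - start).total_seconds() / 60
    return minutes if 0 < minutes <= MAX_MINUTES else None

def summarise(pairs):
    """Count, mean and median of valid durations in (start, gps_fix) pairs."""
    valid = [d for d in (duration_minutes(s, g) for s, g in pairs) if d is not None]
    return len(valid), mean(valid), median(valid)
```

Note that `summarise` raises `statistics.StatisticsError` if no pair yields a valid duration, which is a reasonable signal that the input needs inspection.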
Accuracy of data capture was evaluated using the field for the interviewers’ personal identification code, which was a three-digit number unique to each interviewer and which had to be entered by tapping the digits on the PDA screen at the start of each interview. Overall there were 209/86 376 errors made in this field, a rate of 0.24%. Many of these errors involved the substitution of a single digit by an adjacent one, or the transposition of digits.
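Errors of this kind can be categorised automatically once the intended code is known, for example from the handheld identifier recorded with each form (an assumption; the paper does not state how intended codes were established). The Python sketch below separates single-digit substitutions from transpositions of neighbouring digits. It does not model which keys are adjacent on the PDA screen, and the function name is hypothetical.

```python
def classify_code_error(entered: str, intended: str) -> str:
    """Classify an entered identification code against the intended one."""
    if entered == intended:
        return "correct"
    if len(entered) != len(intended):
        return "other"
    # Positions at which the two codes differ.
    diffs = [i for i, (a, b) in enumerate(zip(entered, intended)) if a != b]
    if len(diffs) == 1:
        return "substitution"             # one digit replaced by another
    if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
            and entered[diffs[0]] == intended[diffs[1]]
            and entered[diffs[1]] == intended[diffs[0]]):
        return "transposition"            # two neighbouring digits swapped
    return "other"
```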
This relatively large set of data allowed comparisons to be made in terms of accuracy, efficiency and completeness among the 127 interviewers. Errors in interviewers’ personal identification codes were not randomly distributed among interviewers, with 108/209 errors (51.7%) being made by 12/127 interviewers (9.4%). No identification code error was made by 64/127 interviewers (50.4%). Most of the identification code errors (186/209, 89.0%) occurred in Ouargaye district (error rate 0.48%).
Missing GPS fixes were also not randomly distributed between the districts or among interviewers. In Diapaga district, 5480/47 404 (11.56%) of interviews failed to secure a GPS fix, compared with 50/38 972 (0.13%) in Ouargaye. The reason for this difference is unclear. By interviewer, the GPS failure rate in Ouargaye ranged from 0/811 to 11/568 (0 to 1.9%), while in Diapaga the range was 2/760 to 732/734 (0.26% to 99.7%).
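The rates quoted in this section follow directly from the reported counts, and can be recomputed as a consistency check. The short Python sketch below uses only figures given in the text; the helper `pct` and the dictionary labels are introduced here for illustration.

```python
def pct(numerator: int, denominator: int, digits: int = 2) -> float:
    """Percentage rounded to the given number of decimal places."""
    return round(100 * numerator / denominator, digits)

TOTAL = 86_376                        # household records captured
DIAPAGA, OUARGAYE = 47_404, 38_972    # records per district

# The two district totals account for every record.
assert DIAPAGA + OUARGAYE == TOTAL

rates = {
    "ID code errors, overall": pct(209, TOTAL),
    "ID code errors, Ouargaye": pct(186, OUARGAYE),
    "missing GPS, Diapaga": pct(5480, DIAPAGA),
    "missing GPS, Ouargaye": pct(50, OUARGAYE),
    "GPS recorded, overall": pct(TOTAL - 5480 - 50, TOTAL, 1),
}
```

The resulting values agree with the percentages reported above, including the 80 846 households (93.6%) with a recorded GPS position.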
Two interviewers in Diapaga hardly ever achieved valid GPS fixes in their interviews, and therefore had to be excluded from consideration of interviewers’ median interview duration. The median interview times for the remaining 125 individual interviewers, correlated with the average individual number of households visited per day worked, are shown in Figure 1. The distribution of identification error rates by median interview time for the same 125 interviewers is shown in Figure 2.
It is also possible to use this kind of time-stamped and geo-referenced approach to data collection to reconstruct a particular interviewer-day. An example of this is shown in Figure 3, for one randomly selected interviewer-day.
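A reconstruction of this kind amounts to ordering one interviewer's records by time-stamp and measuring the elapsed time and distance between consecutive households. The Python sketch below illustrates the idea using the haversine great-circle formula; it is not the method used to produce Figure 3, and the coordinates and times in the example are invented.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0   # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def reconstruct_day(records):
    """Given (timestamp, lat, lon) tuples for one interviewer-day, return a
    list of (minutes elapsed, km travelled) between consecutive households."""
    ordered = sorted(records, key=lambda rec: rec[0])
    legs = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(ordered, ordered[1:]):
        legs.append(((t1 - t0).total_seconds() / 60, haversine_km(la0, lo0, la1, lo1)))
    return legs

# Invented example: two households visited 30 minutes apart.
day = [
    (datetime(2006, 3, 1, 8, 30), 11.00, 0.51),
    (datetime(2006, 3, 1, 8, 0), 11.00, 0.50),
]
legs = reconstruct_day(day)
```

Legs with implausibly long distances or very short intervals are natural candidates for supervisory follow-up.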
This somewhat forensic analysis of a large, directly captured database reveals grounds for both encouragement and concern. The overall conclusion is that it proved possible to undertake a large-scale household survey under difficult conditions and more or less within the required timeframe. Rates of data input error as measured above were low, and comparable to those previously reported using both paper and PDA approaches (Forster et al. 1991). However, supervision and monitoring during the survey did not address some important concerns, such as the lack of geo-referencing data being collected by certain interviewers.
Although Pendragon software is not specifically designed for epidemiological surveys, it proved reasonably adaptable. However, other more specialised alternatives are now becoming available (e.g. Epihandy 2006) which could have advantages for future work of this kind. Because of time pressure in setting up the survey tools, plus our lack of experience in formulating the most effective checks within Pendragon for incoming data in advance, some problems with data quality were not effectively controlled. We underestimated the amount of work required to transform questionnaires into the Pendragon format, which also requires a reasonable level of skill and experience, and as a result failed to maximise our use of some of the available options for input checking and other real-time quality control procedures. Village names, for example, were implemented as a text-entry field, but would have been better as a drop-down list to avoid ambiguities of spelling, etc. Combinations of input checks, plus quality control measures at the stage where data were downloaded to portable computers in the field, should have picked up concerns such as missing GPS data at an earlier and remediable stage.
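Checks of the kind described here could be run each time data are downloaded from a PDA. The Python sketch below is illustrative only: it is not Pendragon functionality, and the field names (`village`, `lat`, `lon`, `interviewer`) and the village reference list are hypothetical.

```python
from collections import Counter

VILLAGES = {"Village A", "Village B"}   # hypothetical reference list of valid names

def download_checks(record: dict) -> list[str]:
    """Return the problems found in one downloaded household record."""
    problems = []
    if record.get("village") not in VILLAGES:
        problems.append("village not in reference list")
    if record.get("lat") is None or record.get("lon") is None:
        problems.append("missing GPS fix")
    return problems

def gps_failure_share(records):
    """Share of records without a GPS fix, per interviewer code."""
    total, missing = Counter(), Counter()
    for rec in records:
        total[rec["interviewer"]] += 1
        if "missing GPS fix" in download_checks(rec):
            missing[rec["interviewer"]] += 1
    return {code: missing[code] / total[code] for code in total}
```

A per-interviewer summary such as `gps_failure_share` would have made interviewers who rarely achieved a GPS fix visible after their first few days of work.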
Some technical anomalies were observed, which did not affect the overall quality of the data but were nevertheless unexplained. These included temporary system clock changes in a few PDAs, and some incomplete GPS data strings. There is always the possibility that technical issues, such as powering down the PDA while data were still being written to the memory card, or other similar situations, could cause these effects in rare circumstances. In terms of longer-term use of PDAs in the field, it became clear that protective plastic screen covers were essential for preserving the quality of the screens.
Analyses of inter-observer accuracy and performance revealed a considerable range. Some interviewers clearly worked faster with the PDAs than others, though these were not necessarily those who covered the greatest number of households per day worked (Figure 1). However, those who carried out interviews relatively fast were generally also those who made the fewest input errors (Figure 2). Thus greater speed did not seem to be a driving factor for making more mistakes, but instead probably reflected varying levels of competence among interviewers. In surveys of this kind, where competence in local languages is an important factor, there are often not many options in terms of who can be recruited as interviewers. It is encouraging that school leavers in one of the world’s poorest societies were, in general, able to make a good job of interviewing using PDAs. The interviewers reacted favourably to being equipped with this technology, and felt it was also well-received by respondents, as has been noted elsewhere (Greene 2001). The observed inter-district differences in errors and inconsistencies, as well as inter-interviewer differences, suggest that data quality control is a matter to be addressed both at individual interviewer and supervisory levels.
Although a detailed cost-effectiveness analysis was not implemented, in broad terms the PDA/GPS equipment for direct data capture in this survey cost US $60 000, whereas a paper-based approach for the same survey would have required over one million printed pages, 100 hand-held GPS receivers and 20 additional desktop computers for data entry, amounting to a higher overall cost.
The PDAs and geo-referenced data offer good opportunities for quality control measures in this kind of work, which as yet have not been widely exploited (Dwolatsky et al. 2006). The possibility of reconstructing the activities of a particular interviewer on a particular day, not using data (s)he has directly entered or managed, but by means of automatically registered time-stamps and GPS data, offers significant possibilities for management and quality control (Figure 3). In principle, it very much reduces any possibility of data fraud. However, to ensure the viability of data for these purposes, it is necessary to ensure that if records are legitimately revisited, to complete or revise entries, then original time-stamp and geo-reference data must not be compromised. Our programming did not ensure this, and may have led to some cases inadvertently acquiring invalid or missing time and position data. This would also be particularly important if direct data capture were to be used in clinical trials to good clinical practice standards, in which an adequate audit trail needs to be demonstrated in all respects (Lampe & Weiler 1998; Koop & Mösges 2002). The high rate of in-flow of data with a large team of interviewers using PDAs means that procedures for quality control need to be well-designed in advance, something we did not entirely achieve.
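One way of protecting the original time-stamp and position when a record is legitimately revisited is to keep them separate from the editable answers, and to log each revision rather than overwrite earlier values. The Python sketch below illustrates that design under stated assumptions; it is not how the survey forms were programmed, and the class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class HouseholdRecord:
    created_at: datetime                  # auto-captured at first interview
    gps_fix: tuple                        # (lat, lon) auto-captured at first interview
    answers: dict = field(default_factory=dict)
    revisions: list = field(default_factory=list)

    def revise(self, when: datetime, changes: dict) -> None:
        """Apply changes to answers, logging the old and new value of each field.
        The original created_at and gps_fix are deliberately left untouched."""
        for key, new_value in changes.items():
            self.revisions.append((when, key, self.answers.get(key), new_value))
            self.answers[key] = new_value
```

Because every change carries its own time-stamp and the previous value, a structure of this kind also provides the beginnings of the audit trail that good clinical practice standards require.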
In conclusion, we believe the risk we took in deciding to use PDAs for this survey was justified. Our data were collected and became available in a timely fashion, with generally good quality. However, as no direct comparison with paper-based capture was intended or possible, we cannot reach any comparative conclusion as to data validity or quality. There were some aspects of the process of using PDAs which, in hindsight, could have been better planned and managed, and which could have resulted in better data quality. This was true both at interviewer and supervisor levels. However, the viability of PDA technology as a means of survey data capture, even in one of the world’s harshest environments, seems to have been well established in this survey.
This work was undertaken as part of an international research programme – Immpact, funded by the Bill & Melinda Gates Foundation, the UK Department for International Development, the European Commission and USAID. The funders have no responsibility for the information provided or views expressed in this paper. The views expressed herein are solely those of the authors.
Conflicts of interest
The authors have not declared any conflicts of interest.