Correspondence: EO Ohuma, Nuffield Department of Obstetrics & Gynaecology, University of Oxford, Women's Centre, Level 3, John Radcliffe Hospital, Headington, Oxford OX3 9DU, UK. Email firstname.lastname@example.org
Data management for the INTERGROWTH-21st Project combined a centralised and a decentralised system for the eight study centres, all of which used the same database and standardised data collection instruments, manuals and processes. Each centre was responsible for the entry and validation of its country-specific data, which were entered onto a centralised system maintained by the Data Coordinating Unit in Oxford. A comprehensive data management system was designed to handle the very large volumes of data. It contained internal validations to prevent incorrect and inconsistent values being captured, and allowed online data entry by local Data Management Units, as well as real-time management of recruitment and data collection by the Data Coordinating Unit in Oxford. To maintain data integrity, only the Data Coordinating Unit in Oxford had access to all eight centres' data, which were continually monitored. All queries identified were raised with the relevant local data manager for verification and, if necessary, correction. The system automatically logged an audit trail of all updates to the database with the date and name of the person who made the changes. These rigorous processes ensured that the data collected in the INTERGROWTH-21st Project were of exceptionally high quality.
INTERGROWTH-21st is a multicentre, multiethnic, population-based project, being conducted in eight health institutions (Brazil, China, India, Italy, Kenya, Oman, UK and the USA), with technical support from four global specialised units, to study growth, health and nutrition from early pregnancy to infancy. The project comprises three components: the Fetal Growth Longitudinal Study (FGLS), the Preterm Postnatal Follow-up Study (PPFS) and the Newborn Cross-Sectional Study (NCSS).
The primary objective of these studies is to develop new, international, ‘prescriptive’ standards to describe fetal, preterm and neonatal growth as well as nutritional status, and to relate these standards to neonatal health risk in eight geographically diverse populations. In brief, FGLS monitors and measures fetal growth clinically and by ultrasound in a population-based sample of ‘healthy’ mothers. PPFS follows preterm infants in FGLS who delivered at ≥26+0 but <37+0 weeks of gestation, to describe their postnatal growth pattern. NCSS is a cross-sectional study documenting the anthropometric measures – length, head circumference and weight at birth – plus neonatal morbidity and mortality rates in the population of all newborns who delivered at the study centres over a 12-month period. These studies are described in greater detail elsewhere.[1, 3]
The data management element of INTERGROWTH-21st was built into the study protocols to ensure a high quality of data collection, validation, data security and confidentiality. The study protocols and other project documents, including operation manuals used during the project, have been available on our website (www.intergrowth21.org.uk) from the outset. The design and conduct of the data management processes for such a multinational project benefited from the experiences of similar large-scale multicentre studies conducted by others, as well as of members of our team who have used online data management systems in developing countries.[5, 6] The construction of a focused, well-organised and transparent data management plan is essential for ensuring the validity and credibility of large-scale projects such as INTERGROWTH-21st.
In this paper, we describe the basic concepts and procedures applied in managing data for the INTERGROWTH-21st Project. It is one of a series of papers being published as a special supplement to BJOG: An International Journal of Obstetrics and Gynaecology describing the different components that relate to the processes and implementation of the INTERGROWTH-21st Project. It should be read in conjunction with the following papers: (1) The objectives, design and implementation of the INTERGROWTH-21st Project; (2) Ultrasound methodology used to construct the fetal growth standards in the INTERGROWTH-21st Project; (3) Standardisation and quality control of ultrasound measurements of fetal growth; (4) Anthropometric standardisation and quality control protocols for the construction of new international fetal and newborn growth standards; and (5) Statistical considerations for the development of prescriptive growth standards; among others that also appear in this supplement.
The INTERGROWTH-21st Project started recruiting in the UK centre in May 2009 and the other seven centres progressively followed in the same year. This was after successful completion of the preparatory phase and piloting of the FGLS data collection forms by all centres between January and April 2009.
Data collection instruments
All documentation and forms used for data collection in each of the three studies were prepared by the INTERGROWTH-21st Project Coordinating Unit (PCU) and the Data Coordinating Unit (DCU) in Oxford. The draft forms were translated and pretested at each centre during the pilot phase and introduced thereafter into the online data management system specifically developed for these studies by Medical Science Online (MedSciNet), a private company with extensive experience of large multicentre trials and observational studies (http://medscinet.com).
MedSciNet was asked to create a comprehensive data management system that allowed online data entry by the local Data Management Units (DMUs) at each centre and real-time data management and monitoring by the DCU in Oxford. All forms were integrated into the system and linked by a six-digit unique subject identifier (the first two digits representing the centre and the remaining four digits the participant), to avoid duplication in the data entry process and to facilitate internal consistency and data quality control mechanisms.
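The identifier scheme described above can be illustrated with a minimal Python sketch. It assumes the identifier is handled as a six-character string of digits; the names `SubjectId`, `parse_subject_id` and `find_duplicates` are hypothetical and are not taken from the MedSciNet system.

```python
from typing import NamedTuple


class SubjectId(NamedTuple):
    centre: str       # first two digits: study centre
    participant: str  # remaining four digits: participant within the centre


def parse_subject_id(raw: str) -> SubjectId:
    """Split a six-digit identifier into its centre and participant parts."""
    if len(raw) != 6 or not (raw.isascii() and raw.isdigit()):
        raise ValueError(f"expected six digits, got {raw!r}")
    return SubjectId(centre=raw[:2], participant=raw[2:])


def find_duplicates(ids):
    """Return identifiers that occur more than once, in first-seen order."""
    seen, dupes = set(), []
    for sid in ids:
        if sid in seen and sid not in dupes:
            dupes.append(sid)
        seen.add(sid)
    return dupes
```

Keeping the identifier as a string rather than an integer preserves leading zeros, so a centre code such as `03` is not silently shortened.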
Separate forms were developed for the three studies, so that they could be analysed independently; however, there is one form common to all three studies, which collects standardised pregnancy and delivery information at birth from mothers and their newborns. The forms were designed to ensure that all data collected addressed the specific aims of each study, avoiding the common temptation of collecting unnecessary information unrelated to the main aims of the study. There were clear instructions to limit the number of questions asked so as to ensure high data quality. A detailed description of the data collection system and quality control strategies for specific components is presented in another paper in this supplement.
General organisation of the data management system
A hybrid data management structure was adopted incorporating both a centralised and decentralised system for the eight study centres. The data management was decentralised so that each centre was responsible for the entry and validation of their country-specific data under the direction of the local DMU. All the local DMUs received support from the DCU in Oxford via the centralised coordination system.
Data access, security and confidentiality
Data integrity and security were maintained by assigning access rights to users in keeping with their duties, so that data entry personnel, local data managers and general users each had a different level of access. For example, local data managers could only view their own country's data and data entry staff could not delete records already saved in the system. Only the DCU in Oxford (two data managers and a statistician) had access to all eight centres' data, among other data access rights not available to the local DMUs. All user accounts were protected by passwords, which expired and had to be renewed every 3 months. Confidentiality was maintained by not collecting or storing any identifiable information on either the paper forms or the online data management system. The names of women enrolled in the study were therefore not recorded anywhere in the database; instead, each woman was represented by a unique, six-digit subject identifier. A paper list containing subject identifiers and their corresponding identifiable information was securely stored in a locked location only accessible by the field collection team.
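The access rules in this section can be sketched as simple role checks. This is an illustration only, not the system's implementation: the role names and function names are hypothetical, the deletion rights of roles other than data entry staff are an assumption (the text states only the restriction on data entry staff), and the 3-month password lifetime is approximated as 90 days.

```python
from datetime import date, timedelta
from enum import Enum


class Role(Enum):
    DATA_ENTRY = "data_entry"
    LOCAL_MANAGER = "local_manager"
    DCU = "dcu"


# Assumption: "every 3 months" is approximated here as 90 days.
PASSWORD_LIFETIME = timedelta(days=90)


def can_view(role: Role, user_centre: str, record_centre: str) -> bool:
    """The DCU may view every centre; other users only their own centre."""
    return role is Role.DCU or user_centre == record_centre


def can_delete_saved(role: Role) -> bool:
    """Data entry staff may not delete saved records.

    Assumption: other roles may; the source does not say so explicitly.
    """
    return role is not Role.DATA_ENTRY


def password_expired(last_changed: date, today: date) -> bool:
    """True once the password is older than the permitted lifetime."""
    return today - last_changed > PASSWORD_LIFETIME
```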
Organisation of the data sets
The data management (collection, cleaning and processing of the data, and creation of master files) for each of the three studies was managed separately (see Supporting Information Appendices S1–S3 for the individual study data collection flow models).
The database of the longitudinal study, FGLS, supported several data sets. (1) The screening data set described the first contact of all women screened, irrespective of whether they were enrolled in the study or not. Each woman in the screening data set could be identified by a unique combination of country code, antenatal clinic code and screening number. This identifier was then used to link the screening data set to (2) the maternal study entry data set, which collected information from enrolled women on maternal characteristics at study entry. At this point, they were allocated an FGLS subject number, which uniquely identified them on all subsequent antenatal visits. (3) The pregnancy and follow-up data set contained information, obtained by the study clinical staff, relating to the pregnancy and ultrasound follow-up visits. Actual ultrasound measurements taken at these visits were collected in a separate data set, as they were uploaded directly from the ultrasound machine to the data management system via USB sticks to avoid potential data transcription errors. Other data sets contained information on: (4) pregnancy and delivery (collected at birth), including newborn information on a pregnancy event summary component; (5) maternal referral (in the case of referral to another level of care, hospital or other medical admission during pregnancy); (6) fetal abnormalities detected during an ultrasound examination; (7) neonatal abnormalities detected during clinical neonatal examination; (8) any severe medical adverse events occurring during pregnancy (as required by the Data Monitoring and Safety Committee), and (9) known nonmicrobiological contamination such as pollution, radiation or any other toxic substances within the home and work-related environments (this information was collected for a 20% sample of women participating in FGLS at each centre and is described elsewhere in this supplement).
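As an illustration of how a composite screening identifier can link two data sets, the following Python sketch joins study entry rows to screening rows on country code, clinic code and screening number. The field names (`country`, `clinic`, `screening_no`) and the function name are assumptions made for the example, not the project's actual variable names.

```python
from typing import NamedTuple


class ScreeningKey(NamedTuple):
    country: str
    clinic: str
    screening_no: str


def _key(row: dict) -> ScreeningKey:
    return ScreeningKey(row["country"], row["clinic"], row["screening_no"])


def link_entry_to_screening(screening_rows, entry_rows):
    """Pair each study entry row with its screening row.

    Returns (linked, orphans): linked is a list of (screening, entry)
    pairs; orphans are entry rows with no matching screening record.
    """
    screened = {}
    for row in screening_rows:
        key = _key(row)
        if key in screened:
            raise ValueError(f"duplicate screening key: {key}")
        screened[key] = row
    linked, orphans = [], []
    for row in entry_rows:
        match = screened.get(_key(row))
        if match is None:
            orphans.append(row)
        else:
            linked.append((match, row))
    return linked, orphans
```

An entry row that appears in `orphans` would indicate an enrolled woman without a screening record, which is the kind of discrepancy that would be raised as a query.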
PPFS is a follow-up of all the preterm infants born to mothers in FGLS. Data for this study were organised into three data sets: (1) neonatal follow-up; (2) infant follow-up; and (3) infants' dietary intake data.
NCSS data were captured in one data set, which contained information collected at delivery on the mother's antenatal clinic details, delivery and anthropometric information about the newborn (head circumference, length and weight at birth).
Preparatory work and system set-up
Data entry was performed at each centre using the online data management system, which allowed the DCU to monitor the data in real-time without any delay due to the physical transfer of paper forms. Validations (ranges, logical values and internal consistency) were created to prevent the input of invalid values during the data entry process. All changes to the online data management system were automatically recorded with the date and name of the person who made the changes. Figure 1 illustrates the conceptual framework and set-up of the data management process.
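The three kinds of validation mentioned above (ranges, logical values and internal consistency) can be sketched as follows. The field names and numeric limits are purely illustrative assumptions and are not the limits used in the INTERGROWTH-21st system; the draft/final status mirrors the behaviour summarised in Table 1.

```python
from datetime import date
from typing import List

ALLOWED_YES_NO = {"yes", "no"}


def validate_form(form: dict) -> List[str]:
    """Return a list of validation messages; an empty list means the form passes."""
    errors = []

    # Range check (the limits are illustrative only).
    hc = form.get("head_circumference_mm")
    if hc is None or not (50 <= hc <= 450):
        errors.append("head_circumference_mm missing or out of range")

    # Logical-value check: only permitted codes are accepted.
    if form.get("ultrasound_done") not in ALLOWED_YES_NO:
        errors.append("ultrasound_done must be 'yes' or 'no'")

    # Internal consistency: a visit cannot precede enrolment.
    enrol, visit = form.get("enrolment_date"), form.get("visit_date")
    if enrol is None or visit is None or visit < enrol:
        errors.append("visit_date missing or earlier than enrolment_date")

    return errors


def save_status(form: dict) -> str:
    """Forms that fail validation are held as drafts rather than saved as final."""
    return "draft" if validate_form(form) else "final"
```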
The online data management system exactly matched the paper questionnaires (in English) with regard to the wording of questions and the order in which they were asked. In Brazil, China and Italy, the forms were translated centrally into the local language and independently back-translated into English to ensure that the content and interpretation of the questions remained unchanged.
Before the start of data collection, each centre's data manager participated in a 3-day training workshop in Oxford organised by the DCU and MedSciNet. This included presentations detailing the required procedures and exercises designed for training personnel in data entry, verification and validation. Exercises were organised to ensure that participants clearly understood the data management manual, which included a step-by-step guide to all requirements, processes and tasks to be followed by each local data manager. Attention focused on the data manager's responsibility to ensure timely, remote data entry and communication of queries to the DCU. Each data manager was provided with copies of the data management manual to train their own staff before data collection commenced.
Routine procedures at the local DMU
The data management team within each local DMU followed the processes outlined in the data management manual and described during the training meeting (Table 1). Standardised procedures were developed to guide data entry, query management and data control at the country level. Each data manager was responsible for controlling the quality of the data collected in their country by managing the training of their data entry staff, monitoring the data entry processes and responding to queries from the DCU relating to their centre's data.
Table 1. Overview of routine data management processes within local data management units in liaison with the Data Coordinating Unit in Oxford
Data collection
1. The local data management units (DMU) were responsible for the transfer of paper forms between the DMU, study clinics and hospitals
Ultrasound measurements and images
1. USB sticks were sent to local clinics and hospitals for daily backup of the ultrasound machine. Once the backup was received by the DMUs, the measurements were uploaded to the online data management system
2. A backup was taken from the ultrasound machine every 2 weeks. This was contained on a USB stick and sent by courier to Oxford for storage
3. A monthly backup was taken and kept on a hard drive at the study centre
Data entry
1. Once received, forms were entered onto the online data management system through personal ‘user’ log-in accounts
2. Where forms failed the validations, they were first saved as a draft
Query management
1. In the first instance, data entry staff checked the form against the database to correct any data entry errors. If these corrections fixed all validation errors the form was saved as a final version
2. The local data manager checked all forms saved as drafts and worked with the field collection staff to correct any forms with missing or inconsistent data. When corrections were necessary the paper copy was corrected so that the original answers were not obscured and an audit trail was maintained. The database was then corrected with a comment describing the error recorded for every field changed
3. If the form could not be corrected, the validations were overridden through the use of a personal ‘monitor’ log-in account, which was restricted to only one person per centre. The data manager at the Data Coordinating Unit (DCU) in Oxford was then informed of these instances
4. The data manager at the local DMU also responded to queries from the Oxford DCU. The processes outlined under query management (points 1 and 2) were repeated for all these queries
Data quality control
Paper forms were compared against the database for a list of subject numbers provided by the DCU. Data managers recorded the number of data entry errors found on a standard template and returned it to the DCU. These were then corrected on the online database
The importance of the initial data collection stage was emphasised to the local data managers to ensure that they provided their staff with the necessary knowledge to complete and enter the forms correctly and in a consistent manner. Procedures were also put in place to manage data editing by ensuring that paper forms were never altered in a way that obscured the original entry and by automatically generating an electronic audit trail of changes to the database. A combination of clear instructions printed on the data collection forms and validations built into the data management system helped to support routine procedures. For example, on the first page of the form booklet the process for correcting an error is described:
‘If you do make an error please cross it out and write the correct answer (and your initials) outside the box. Correction fluids should not be used.’
These instructions were translated into local languages where necessary.
Validation programmes built into the data management system were also standardised and used to identify data transcription and entry errors. These routine procedures for data collection, entry, query management and quality control are summarised in Table 1.
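The electronic audit trail described in this section, in which every change is logged with the date, the user and a comment, can be illustrated with a minimal sketch. The class and field names are hypothetical, and the real system's storage format is not described in this paper.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditedRecord:
    values: dict
    trail: list = field(default_factory=list)

    def update(self, field_name, new_value, user, comment, when=None):
        """Change one field and log who changed it, when, and why."""
        old_value = self.values.get(field_name)
        if old_value == new_value:
            return  # nothing changed, so nothing is logged
        self.values[field_name] = new_value
        self.trail.append({
            "field": field_name,
            "old": old_value,   # the original value is kept, never overwritten
            "new": new_value,
            "user": user,
            "comment": comment,
            "when": when or datetime.now(timezone.utc),
        })
```

For example, `record.update("birthweight_g", 3250, "jdoe", "transcription error")` changes the stored value and appends one entry to `record.trail`, so the earlier value remains recoverable in the same way that a crossed-out answer remains legible on the paper form.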
Routine procedures at the DCU in Oxford
In addition to consistency and completeness checks performed by local data managers, the DCU data managers and statistician developed routine validation programmes to perform overall checks on the data using the statistical software SAS version 9.2 (SAS Institute Inc., Cary, NC, USA) and Stata version 12 (StataCorp, College Station, TX, USA).
Summary and descriptive statistics were used to assess the data for completeness, consistency, duplicate records and potential outliers. The outputs of these programmes included frequencies, cross-tabulations, box plots, scatter plots and histograms, which were used to detect obvious errors in the data. Visual inspection of the distribution of the raw data using histograms, scatter plots and Bland–Altman plots was also employed to identify potential errors by reviewing outliers from ultrasound measurements. Examples include a Bland–Altman plot of the difference between two ultrasound measurements (taken in duplicate) of abdominal circumference against their average (Figure 2A) and a scatter plot of head circumference against gestational age (Figure 2B).
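The quantities plotted in a Bland–Altman graph can be computed as in the following Python sketch, which stands in for the SAS and Stata programmes actually used. Flagging pairs outside the conventional 95% limits of agreement (mean difference ± 1.96 SD) is an assumption made for illustration; the text describes visual review of outliers rather than a fixed rule.

```python
from statistics import mean, stdev


def bland_altman(first, second):
    """Return pair means, pair differences, bias and 95% limits of agreement."""
    if len(first) != len(second):
        raise ValueError("duplicate measurements must be paired")
    diffs = [a - b for a, b in zip(first, second)]
    means = [(a + b) / 2 for a, b in zip(first, second)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return means, diffs, bias, (bias - 1.96 * sd, bias + 1.96 * sd)


def flag_outliers(first, second):
    """Indices of pairs whose difference lies outside the limits of agreement."""
    _, diffs, _, (lower, upper) = bland_altman(first, second)
    return [i for i, d in enumerate(diffs) if d < lower or d > upper]
```

Plotting `diffs` against `means` gives the graph itself; the flagged indices identify the pairs of measurements that would be reviewed against the stored ultrasound images.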
Variables were cross-checked for consistency: for example, at the screening stage of FGLS, all women were asked if they had ever been diagnosed with, or treated for, threatened miscarriage, depression, rhesus disease, anaemia, sexually transmitted infections, or high blood pressure as these conditions are likely to compromise optimal fetal growth. For consistency, we checked that all women enrolled as eligible for the study (evaluated by a separate summary question) responded ‘no’ to all these conditions.
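This cross-check can be expressed as a short routine. The sketch below assumes each screening record is a dictionary with an `eligible` summary field and one yes/no field per condition; these field names are hypothetical.

```python
CONDITIONS = (
    "threatened_miscarriage",
    "depression",
    "rhesus_disease",
    "anaemia",
    "sexually_transmitted_infection",
    "high_blood_pressure",
)


def inconsistent_eligibility(records):
    """List women recorded as eligible who did not answer 'no' to every condition.

    Returns (screening_id, [offending fields]) pairs for follow-up as queries.
    A missing answer is treated as inconsistent, not as 'no'.
    """
    flagged = []
    for rec in records:
        if rec.get("eligible") != "yes":
            continue
        offending = [c for c in CONDITIONS if rec.get(c) != "no"]
        if offending:
            flagged.append((rec["screening_id"], offending))
    return flagged
```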
Routine reports on the data from each centre, such as weekly missing data reports, were run by the DCU data managers and sent to the respective centres for review. All queries lodged with centres were documented and their responses were stored electronically at the DCU. In addition to performing data quality checks, the DCU also produced monthly reports on recruitment accrual for the three studies (FGLS, PPFS and NCSS) for each centre. These compared actual recruitments against the expected recruitment, and produced a summary of the numbers of eligible and ineligible screened women per centre, with the reasons for ineligibility. These reports were also used to monitor recruitment, retention and compliance with the entry criteria for each study.
Each local DMU was required to send USB sticks of all ultrasound images to the DCU for storage and back-up every 2 weeks. The images were then transferred to the Institute of Biomedical Engineering (IBME) at the University of Oxford for permanent storage. A monthly comparison was made between data stored at IBME and data added to the online data management system to ensure that these all matched and that there were no subjects enrolled with missing ultrasound information or vice versa. The images were made available for audit and ultrasound quality control purposes. Also, individual follow-up visits with ultrasound measurements were checked against the pregnancy follow-up ultrasound forms to ensure that every visit had both an ultrasound form and the required ultrasound measurements.
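The monthly comparison between the image store and the online system amounts to a two-way set difference, as in this minimal sketch. It assumes both sources can be reduced to lists of subject identifiers; the function and key names are hypothetical.

```python
def reconcile(image_store_ids, database_ids):
    """Compare subjects with stored images against subjects in the database."""
    stored, entered = set(image_store_ids), set(database_ids)
    return {
        # enrolled subjects for whom no ultrasound images were received
        "in_database_without_images": sorted(entered - stored),
        # images received for subjects not found in the database
        "images_without_database_record": sorted(stored - entered),
    }
```

Both lists should be empty when the two sources agree; any identifier in either list would prompt a query to the relevant centre.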
Data quality control
INTERGROWTH-21st also implemented a data quality control process, which involved a detailed review of each centre's data for the three studies (FGLS, PPFS and NCSS). For FGLS specifically, a 10% random sample of all data was taken for each centre after 250 women were recruited (i.e. half the minimum recruitment target). All the variables were reviewed for completeness and accuracy by comparing data on the data management system with the paper forms.
Similarly, for NCSS, a 5% sample was regularly taken during the course of the study to ascertain the accuracy and completeness of data entry. Once a centre had recruited 3500 women (i.e. half the total recruitment target), a 5% sample was taken and all variables were reviewed and compared with the paper forms. This process was repeated in each centre on completion of recruitment. An error rate was calculated based on all variables to ensure that data entry errors were kept below 0.5% in each centre. If the error rate was above 0.5%, data entry personnel were retrained and a subsequent 5% sample of new recruits was taken to ensure their performance had improved.
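The sampling and error-rate calculation can be sketched as follows. The sketch assumes that the error rate is the number of fields found in error divided by the number of fields checked, which is one reading of "based on all variables"; the function names and the use of a seeded random generator are illustrative choices.

```python
import random


def draw_sample(subject_ids, fraction, seed=None):
    """Draw a simple random sample of subjects (e.g. fraction=0.05 for NCSS)."""
    ids = list(subject_ids)
    if not ids:
        return []
    k = max(1, round(len(ids) * fraction))
    return sorted(random.Random(seed).sample(ids, k))


def error_rate(fields_checked, fields_in_error):
    """Proportion of checked fields that differed from the paper form."""
    if fields_checked <= 0:
        raise ValueError("no fields were checked")
    return fields_in_error / fields_checked


def needs_retraining(rate, threshold=0.005):
    """True when the error rate exceeds the 0.5% limit."""
    return rate > threshold
```

For example, 8 discrepancies in 2000 fields checked gives a rate of 0.4%, which is within the limit, whereas 12 discrepancies in 2000 fields (0.6%) would trigger retraining and a further sample.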
As the sample size for PPFS was small, all the forms and variables for each preterm infant were reviewed for completeness, consistency and adherence to the protocols by both the DCU and the postnatal follow-up study group.
Prevention is the best form of data quality assurance. Therefore, within the INTERGROWTH-21st Project, a number of measures were adopted so as to increase the quality of the data and reduce the error rate. For example, validations incorporated into the online data management system reduced the risk of incorrect and inconsistent values being captured and the following measures completed the strategy: high-quality standardised procedures, face-to-face training sessions for all data managers and ultrasonographers, a dedicated INTERGROWTH-21st computer at each study site, a uniform data management system, and the use of ongoing reporting and monitoring tools by the DCU.
The continual monitoring processes developed for INTERGROWTH-21st allowed problems to be identified and corrected while the project was ongoing. For example, checks on the number of repeat anthropometric measurements taken at birth (i.e. head circumference, length and birthweight) at each site were used to assess protocol adherence. If the number was markedly different from the approximately 5% expected at each centre (i.e. <1% or >10%), then a follow-up was arranged. This ensured adherence to the protocol at all centres, as well as early detection and correction of nonadherence, which meant that systematic errors could not persist throughout the study. Further details are reported in another paper in this supplement.
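The repeat-measurement check reduces to comparing a proportion against two bounds, as in this short sketch. The function name and the choice of denominator (all newborns measured at a centre) are assumptions; the bounds of 1% and 10% are those given in the text.

```python
def repeat_rate_check(n_measured, n_repeated, low=0.01, high=0.10):
    """Return (rate, follow_up): follow_up is True when the proportion of
    repeat measurements falls below 1% or above 10%."""
    if n_measured <= 0:
        raise ValueError("no measurements recorded")
    rate = n_repeated / n_measured
    return rate, (rate < low or rate > high)
```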
A hybrid version of a centralised data management structure was employed to maintain tight control over the overall data management of the project. In this structure, the Oxford-based DCU had overall responsibility for data monitoring and validation. However, local DMUs were also required to handle each centre's own data collection and entry, perform initial data quality checks, and resolve queries originating from the Oxford-based DCU. In this way, consistent communication links were maintained between the DCU and the study sites, and delays in data entry were avoided. The structure yielded the benefits of a decentralised system, despite highly centralised management.
The international transfer of large amounts of ultrasound image data from the local DMUs to the DCU produced unanticipated challenges. Each DMU maintained daily back-ups, which were consolidated periodically for transfer to, and storage in, Oxford. Seven of our study sites collated these anonymised images every 2 weeks on USB sticks, which were delivered by courier to Oxford. This method proved effective as only one stick was lost in transit during the course of the study; fortunately, the ‘lost’ data were reconstituted from the back-up held locally at the DMU and re-sent successfully. One centre had to transfer images electronically through a Dropbox facility (www.dropbox.com), as customs procedures would have prevented the timely transfer of USB sticks to the UK. Clearly, even though logistical processes were standardised across countries, flexibility was sometimes required for such a project involving eight study centres across different continents.
In summary, building on our extensive experience with multicentre research, we have implemented a project-specific data management system that has produced a very strong and reliable database. All data collection forms and manuals are freely available from the INTERGROWTH-21st Project website for those interested in implementing similar large-scale studies, provided naturally that our contribution is cited and acknowledged. Lastly, the INTERGROWTH-21st Project team is more than happy to share our experiences with any researchers.
Disclosure of interests
Laima Juodvirsiene is an employee of MedSciNet, U.K. Ltd, London, UK. The remaining authors have no potential conflicts of interest to declare.
Contribution to authorship
EOO wrote the manuscript and all the authors read and approved the final version.
Details of ethics approval
The INTERGROWTH-21st Project was approved by the Oxfordshire Research Ethics Committee ‘C’ (reference: 08/H0606/139), and the research ethics committees of the individual participating institutions and corresponding health authorities where the Project was implemented.
Funding
This project was supported by the INTERGROWTH-21st Grant ID# 49038 from the Bill & Melinda Gates Foundation to the University of Oxford, for which we are very grateful. DGA is supported by a Programme grant from Cancer Research UK (C5529). ATP is supported by the Oxford Partnership Comprehensive Biomedical Research Centre with funding from the Department of Health NIHR Biomedical Research Centres funding scheme.
Acknowledgements
A full list of Members of the International Fetal and Newborn Growth Consortium for the 21st Century (INTERGROWTH-21st) and its Committees appears in the preliminary pages of this supplement.