Making research data publicly available is becoming the norm, as many donors, publishers, and institutions are adopting open data policies.
IFPRI’s recently revised research data management and open access (RDMOA) policy requires research data to be deposited in an open access repository as soon as possible and (1) no later than a year after data collection ceases, (2) within six months of the publication of a peer-reviewed information product that makes use of the research data, or (3) upon the completion of the research project—whichever of these comes first and notwithstanding any agreement, covenant, or provision to the contrary.
Many donors, including the Bill & Melinda Gates Foundation (BMGF) and the UK Department for International Development (DFID), require that data be made public immediately. Researchers should therefore keep in mind that their data will ultimately become a public asset and plan accordingly at all stages of the research cycle. Publicly shared datasets should include data files; documentation files that enable users to understand, analyze, and interpret the context, content, and meaning of the data; and a license that describes the provisions for reuse.
This guideline describes current standards and information required to publish datasets through the IFPRI Dataverse, which is IFPRI’s institutional data repository. It provides guidance for:
- Preparing data files,
- Preparing documentation,
- Ensuring respondents' confidentiality, and
- Preparing metadata
1. Preparing data files
Preparing data files for publishing from the beginning of a research study will save time and resources. We recommend creating data files based on the modules or sections of your survey instruments.
Do not put all your data into a single file. Very large data files (generally more than 20 variables) and very small ones (2 to 3 variables) are difficult to manage and to check for internal consistency.
Geospatial data files should be shared as a package that includes all the dependencies needed to open and use the data. These files tend to be large; we therefore recommend compressing the package using a compression tool such as Zip. Compressing is not recommended for other types of files.
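The packaging step can be sketched in Python with the standard zipfile module. This is a minimal illustration, not an IFPRI tool: the function name package_geodata and the folder layout are assumptions, and the shapefile sidecar extensions in the comment are only one example of dependencies that must travel together.

```python
import zipfile
from pathlib import Path


def package_geodata(folder, archive_path):
    """Zip every file under `folder`, keeping relative paths so that
    dependent files (for a shapefile: .shp, .shx, .dbf, .prj) stay together."""
    folder = Path(folder)
    archive_path = Path(archive_path).resolve()
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(folder.rglob("*")):
            # Skip directories and the archive itself if it lives inside `folder`.
            if path.is_file() and path.resolve() != archive_path:
                zf.write(path, arcname=path.relative_to(folder).as_posix())
    return archive_path


# Hypothetical usage:
# package_geodata("parcels", "parcels.zip")
```

Storing paths relative to the folder means the archive unpacks into the same layout the geospatial software expects.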
In general, when preparing data files for public sharing:
- Ensure that variable and value labels are consistent with the questionnaire.
- Remove all temporary, administrative, or dummy variables that were created for internal purposes.
- Provide weights as variables when applicable; do not apply weights in the data files.
- Ensure that there are no duplicate or redundant variables in the data files.
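Two of the checks above (internal-only variables and duplicate variables) can be partly automated. The following is a rough sketch that assumes the data file is a rectangular CSV with unique variable names; the prefixes in INTERNAL_PREFIXES are a hypothetical naming convention, not one prescribed by this guideline.

```python
import csv
import io

# Hypothetical naming convention for internal-only variables; adjust to your project.
INTERNAL_PREFIXES = ("tmp_", "temp_", "admin_", "dummy_")


def review_variables(csv_text):
    """Return (internal, duplicates) for a CSV supplied as text.

    internal   -- variable names that start with an internal-use prefix
    duplicates -- (first, later) pairs of variables holding identical values
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)
    columns = {name: tuple(row[i] for row in rows) for i, name in enumerate(header)}

    internal = [name for name in header if name.lower().startswith(INTERNAL_PREFIXES)]

    duplicates = []
    seen = {}  # column values -> first variable that held them
    for name in header:
        values = columns[name]
        if values in seen:
            duplicates.append((seen[values], name))
        else:
            seen[values] = name
    return internal, duplicates


sample = "hhid,age,age_copy,tmp_flag\n1,34,34,0\n2,51,51,1\n"
internal, duplicates = review_variables(sample)
```

A flagged pair is only a candidate: two variables can legitimately share values in a small file, so each hit still needs a manual look.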
We suggest running quality control tests to detect errors and inconsistencies in the data. Common techniques include spot-checking values in the data files; sorting data files by different fields to spot outliers and empty cells; and calculating summary statistics or plotting data to identify incorrect or extreme values. It is also important to check whether copyright permissions are needed for publishing.
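As one possible way to run the summary-statistics check, the sketch below reports empty cells and extreme values for a single numeric variable. The 1.5 x IQR fence is a common rule of thumb chosen here for illustration, not a threshold set by this guideline, and the function name quality_report is hypothetical.

```python
import statistics


def quality_report(values, fence=1.5):
    """Summarize one numeric variable given as a list of raw strings.

    Returns row positions of empty cells and of values outside
    [Q1 - fence * IQR, Q3 + fence * IQR], plus the minimum and maximum.
    Needs at least two non-empty values.
    """
    missing = [i for i, v in enumerate(values) if v.strip() == ""]
    numbers = [(i, float(v)) for i, v in enumerate(values) if v.strip() != ""]
    data = [x for _, x in numbers]

    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - fence * iqr, q3 + fence * iqr
    extreme = [i for i, x in numbers if x < low or x > high]

    return {"missing": missing, "extreme": extreme, "min": min(data), "max": max(data)}


# Hypothetical age variable with one empty cell and one implausible value.
ages = ["23", "25", "", "27", "29", "31", "33", "35", "250"]
report = quality_report(ages)
```

Flagged rows are prompts for review against the questionnaire, not values to delete automatically.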
[faqs_group id=3]
2. Preparing documentation
Documentation provides all the information necessary to understand, interpret, and use a dataset. It should describe the context, meaning, content, and structure of the data and how they were created. Good documentation ensures that the data can be searched, retrieved, understood, and interpreted in a meaningful way. The following information is needed to ensure good documentation of data.
[faqs_group id=4]
At a minimum, the following documentation is required for publishing data through an institutional data repository.
[faqs_group id=5]
Publicly available interviewer manuals, summary reports, and working papers help potential users to better understand and use the data, but they are not required for publishing data.
Documentation in ASCII can be submitted as is. If documentation files are in other electronic formats, such as Word or Excel, they should be converted into a PDF. PDFs are considered more appropriate than Word files for long-term preservation. Keep in mind, however, that some donors require making datasets available in nonproprietary formats such as CSV, JSON, or XML.
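Where a donor asks for nonproprietary formats, the same records can be written to CSV and JSON with Python's standard library. This is a minimal sketch: it assumes the records are already loaded as a list of dictionaries with identical keys, and the function name and file stem are hypothetical.

```python
import csv
import json
from pathlib import Path


def export_open_formats(records, out_stem):
    """Write the same records to <out_stem>.csv and <out_stem>.json (UTF-8)."""
    out_stem = Path(out_stem)
    csv_path = out_stem.parent / (out_stem.name + ".csv")
    json_path = out_stem.parent / (out_stem.name + ".json")
    fieldnames = list(records[0].keys())

    # newline="" lets the csv module control line endings on every platform.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)

    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

    return csv_path, json_path


# Hypothetical usage:
# export_open_formats([{"hhid": 1, "crop": "maize"}], "module_a")
```

Note that CSV stores every value as text, so variable types and value labels need to be described in the accompanying documentation.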
3. Ensuring respondents' confidentiality
Published datasets should not compromise the confidentiality of respondents. Data and documentation should be reviewed thoroughly for information that could identify respondents. Data and documentation should be checked for both direct and indirect identifiers.
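A first pass over variable names can help prioritize this review. The sketch below is an illustration only: the keyword lists are assumptions made for the example, not an official list of direct or indirect identifiers, and a name-based scan cannot replace reading the data and documentation.

```python
# Illustrative keyword lists (assumptions for this example, not an official list).
DIRECT_HINTS = ("name", "phone", "address", "email", "gps", "latitude", "longitude")
INDIRECT_HINTS = ("village", "birth", "occupation", "ethnic")


def flag_identifier_candidates(variable_names):
    """Group variable names that look like direct or indirect identifiers."""
    flagged = {"direct": [], "indirect": []}
    for var in variable_names:
        lowered = var.lower()
        if any(hint in lowered for hint in DIRECT_HINTS):
            flagged["direct"].append(var)
        elif any(hint in lowered for hint in INDIRECT_HINTS):
            flagged["indirect"].append(var)
    return flagged


variables = ["hhid", "resp_name", "phone_no", "village_code", "plot_area", "birth_year"]
candidates = flag_identifier_candidates(variables)
```

Substring matching is deliberately broad, so it will produce false positives; it will also miss identifiers hidden in free-text responses, which still need manual review.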
[faqs_group id=6]
4. Preparing metadata
Metadata is “data about data.” Sufficient study-level metadata is critical to understanding a study and its context. IFPRI datasets are published and shared through the Harvard Dataverse, and the metadata elements are adopted from the metadata templates that the Dataverse provides. These elements can be easily mapped to other schemas, such as DDI and Dublin Core.
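To make the idea of mapping between schemas concrete, here is a small sketch that converts a study-level record to Dublin Core elements. The input field names (title, author, keywords, and so on) are hypothetical stand-ins rather than the actual Dataverse template fields; the target names are standard Dublin Core elements.

```python
# Hypothetical study-level field names mapped to Dublin Core elements.
FIELD_TO_DC = {
    "title": "title",
    "author": "creator",
    "description": "description",
    "keywords": "subject",
    "production_date": "date",
    "license": "rights",
}


def to_dublin_core(study):
    """Map a study-level metadata dict to repeatable Dublin Core elements."""
    record = {}
    for field, value in study.items():
        element = FIELD_TO_DC.get(field)
        if element is None:
            continue  # no Dublin Core equivalent defined in this sketch
        values = value if isinstance(value, list) else [value]
        record.setdefault(f"dc:{element}", []).extend(values)
    return record


# Hypothetical study record.
study = {
    "title": "Household Survey 2015",
    "author": ["Researcher A", "Researcher B"],
    "keywords": ["agriculture", "nutrition"],
    "sampling_notes": "not mapped in this sketch",
}
dc_record = to_dublin_core(study)
```

Every Dublin Core element is repeatable, which is why each value is stored as a list even when the study supplies only one.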
[faqs_group id=8]