Who shares? Who doesn't? Bibliometric factors associated with open archiving of biomedical datasets


  • Effective February 1 2011, all copyrightable material in this work is released under a Creative Commons Attribution 3.0 License. All data in the article and supplementary material, interpreted inclusively, are available under a CC0 waiver; please attribute according to academic norms.


Many initiatives encourage investigators to share their raw research datasets in pursuit of increased research quality and efficiency. Despite these investments of time and money, we do not yet understand the impact of these initiatives. In this study, I use bibliometric methods to examine how often, and in what patterns, investigators publicly share their raw gene expression microarray datasets after study publication.

Automated methods were used to identify 11,603 published studies that created gene expression microarray data. At least 25% of these studies have datasets in one of the two predominant public databases for microarray data, with the rate increasing from 5% in 2001 to 35% in 2009. Fifteen factors that described authorship, funding, institution, publication, and domain environments were derived from 124 article attributes. Most factors were associated with the prevalence of data sharing (p<0.01). In particular, publishing in a journal with a relatively strong data sharing policy, receiving funding from many NIH grants, publishing in an open access journal, and having prior experience sharing gene expression data were associated with the highest data sharing rates. In contrast, increased first author age and experience, having no experience reusing data, and studying cancer and human subjects were associated with the lowest data sharing rates.
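
The kind of arithmetic behind these figures can be illustrated with a minimal sketch. It is not the study's pipeline: the records below are invented toy values, and the names `records`, `sharing_rate_by`, and `odds_ratio` are hypothetical. The sketch computes the fraction of studies with a shared dataset per group (for example, per publication year) and an odds ratio for a single binary factor, using the Haldane-Anscombe correction (adding 0.5 to each cell) so that empty cells do not cause division by zero.

```python
from collections import defaultdict

# Hypothetical toy records, invented for illustration (NOT the study's data):
# (publication_year, journal_has_strong_policy, dataset_shared)
records = [
    (2001, False, False),
    (2001, True, False),
    (2001, False, False),
    (2001, True, True),
    (2009, True, True),
    (2009, True, True),
    (2009, False, False),
    (2009, False, True),
]


def sharing_rate_by(records, key_index):
    """Return {group: fraction of studies in that group with a shared dataset}."""
    shared = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = rec[key_index]
        total[group] += 1
        shared[group] += int(rec[2])
    return {group: shared[group] / total[group] for group in total}


def odds_ratio(records, factor_index):
    """Odds of sharing when the factor is True versus False.

    Adds 0.5 to every cell of the 2x2 table (Haldane-Anscombe correction).
    """
    a = b = c = d = 0
    for rec in records:
        has_factor, is_shared = rec[factor_index], rec[2]
        if has_factor and is_shared:
            a += 1  # factor present, dataset shared
        elif has_factor:
            b += 1  # factor present, dataset not shared
        elif is_shared:
            c += 1  # factor absent, dataset shared
        else:
            d += 1  # factor absent, dataset not shared
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))


rates_by_year = sharing_rate_by(records, 0)
policy_odds_ratio = odds_ratio(records, 1)
```

An odds ratio above 1 would indicate that studies with the factor share data more often than studies without it. Assessing many correlated attributes at once, as in the analysis described here, calls for multivariate methods rather than one 2x2 table per factor.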

In second-order factor analysis, previously sharing gene expression microarray data was most positively associated with high data sharing rates, whereas publishing a study on cancer or human subjects was strongly associated with a lower probability of data sharing.

I hope these methods and results will contribute to a deeper understanding of data sharing behavior and eventually more effective data sharing initiatives.