Identifying and quantitatively coding 17 key characteristics of the SDRs made a reasonable quantitative analysis possible, aimed at a preliminary classification of the data. Relationships between characteristics (called variables in the analysis section) can be better described through the combined use of cluster analysis and logistic regression, the latter serving as a post-clustering explanatory tool. Identifying groups of similar SDRs through clustering facilitates studying the effect of group characteristics on group membership, sustainability, success, and future trends. It may also enable the development of common data management and stewardship plans or tools and help provide avenues for social networking across domains.
Grouping similar SDRs.
Cluster analysis was chosen as a foundation for exploratory analysis of the categorical data collected. It maximizes dissimilarity between groups, uncovering similarities among SDRs and showing whether they naturally partition in ways that the variables in our model can explain. If a few well-defined classes of SDRs emerged from the cluster analysis, and it was possible to measure how successful they were, this information could provide models for the successful creation and management of SDRs. If instead many different types emerged with little similarity, it would be difficult to develop such guidelines or generalizations. Cluster analysis was performed using Ward's method (Romesburg, 2004) on the 17 multinomial variables, using PROC DISTANCE to obtain distance measures, PROC CLUSTER to perform the clustering, and PROC TREE to produce a dendrogram (Johnson, 1967) for visualizing the results (SAS Version 8.1). The dendrogram in Figure 3 depicts the clustering results, showing individual SDRs (bottom) merging into progressively larger groupings (top): as the tree grows upward, smaller clusters of SDRs with similar characteristics merge into larger ones. A cluster that remains together for a long time (represented by a long line on the vertical axis) demonstrates a persistent set of similar characteristics. The y-axis value is the semipartial r-squared, a measure of the degree of differentiation between groups. Note how Clusters B, C, and D in the four-cluster solution join to form Cluster 2 in the two-cluster solution, shown as Cluster 1 and Cluster 2 on the dendrogram.
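The SAS workflow above (PROC DISTANCE, PROC CLUSTER with Ward's method, PROC TREE) can be approximated outside SAS. The following is a minimal Python sketch using scipy, with invented toy data standing in for the 17 coded variables; it is not the authors' code or data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Toy stand-in for 17 coded SDR characteristics across 100 repositories
# (0/1 indicators after dummy-coding); tiny jitter breaks ties in the
# invented binary data so the tree cuts cleanly.
coded = rng.integers(0, 2, size=(100, 17)) + rng.normal(0, 1e-6, size=(100, 17))

# Ward's method agglomerates clusters so as to minimize the increase in
# within-cluster variance at each merge (cf. PROC CLUSTER METHOD=WARD).
merges = linkage(coded, method="ward")

# Cut the tree into four clusters, mirroring the four-cluster solution.
labels = fcluster(merges, t=4, criterion="maxclust")
```

Plotting `scipy.cluster.hierarchy.dendrogram(merges)` would give the kind of tree shown in Figure 3, with merge heights playing the role the semipartial r-squared plays in the SAS output.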
Figure 3. Dendrogram comparing cluster formation results. The y-axis shows semipartial r-squared values, a measure of cluster differentiation; the x-axis shows the cluster groupings.
For further analysis, the four-cluster solution, depicted here as Clusters A, B, C, and D, was selected for its persistence (as indicated by the semipartial r-squared value on the y-axis) and for the resulting degree of differentiation. This number of clusters also allows a more detailed exploration of group composition. At this level of clustering, the four groups are shaded identically in Figure 3 and Table 6 for easy comparison.
Table 6. The listing of individual SDRs in Clusters A, B, C, and D.
To understand more about the composition of the four clusters, cluster membership was incorporated into the dataset and a simple logistic regression was performed (PROC LOGISTIC, SAS Version 8.1) on each variable. Comparing the Wald chi-square test statistic divided by the degrees of freedom for each variable (depicted in Figure 4) yields a measure of the relative strength of each variable's association with cluster membership, with each variable taken independently. As an exploratory analysis, this simple test helps describe the role each characteristic plays in distinguishing between the four clusters (A, B, C, and D) in Figure 3.
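For a single binary predictor, the kind of per-variable Wald statistic this analysis compares has a simple closed form: the logistic slope equals the sample log odds ratio of the 2x2 table of predictor against membership, with Woolf's standard error. A hedged Python sketch with invented counts (not the study's data):

```python
import math

def wald_chisq_per_df(a, b, c, d, df=1):
    """Wald chi-square / df for a binary predictor vs. cluster membership,
    from the 2x2 table:
                   member  non-member
    predictor=yes    a         b
    predictor=no     c         d
    """
    log_or = math.log((a * d) / (b * c))    # MLE slope = sample log odds ratio
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)   # its asymptotic (Woolf) standard error
    return (log_or / se) ** 2 / df

# Invented counts, e.g. GrantsContracts=yes against membership in one cluster
stat = wald_chisq_per_df(30, 10, 5, 25)
```

Repeating this for each variable and ranking by the resulting statistic reproduces, in miniature, the comparison Figure 4 makes across the 17 variables.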
Figure 4. Relative contribution of 17 analyzed variables to the four-cluster solution as shown in Figure 3. The y-axis value is the Wald chi-square divided by the degrees of freedom for each regressed variable.
*The HoldingSize variable most closely approximates the classification set out in the NSB 2005 report for research, community/resource, and reference level data collections.
**The HowBased variable most closely approximates the distinction set out in the NSB 2005 report between data collections as governmental, university based, or data federations (though here this is broken into the categories independent and aggregate).
***The research/community/reference variable included here, unlike that in the NSB 2005 report, is descriptive of how the overall organization functions. On some level, these are all “research” enterprises, some are particularly “community” centric, and a small number might view themselves as “reference” organizations.
Figure 4 shows the variables most responsible for group differentiation: GrantsContracts, MultipleSponsors, HoldingSize, PreservationPolicy, VirtuallyBased, RegistrationRequired, HowBased, AcceptSubmittedData, Centralized/Distributed, and InstrumentBased. Of somewhat lesser strength of association were Research/Community/Reference, Portal, NaturalScience, SubscriptionMembership, ScientificArea, BusinessType, and FreeinthePublicDomain.
The variables not used in the main analysis may also play a role in differentiation. The effects of the remaining variables, though, are harder to standardize and measure. In addition, it can be assumed that some of the effect of these variables has been inherently represented in the related variables included in the analysis.
To go beyond individual variable contributions (Figure 4) and group membership (Table 6), we examined group composition through a simple decomposition: identifying the majority value for each variable in each group (Table 7). Differences in majority values across the groups that were considered qualitatively meaningful are highlighted. From this analysis the descriptive "group titles" are obtained.
Table 7. Majority variable values for the four groups listed in Table 6.
Based on the information in Table 7, it is clear that some variables may be highly correlated, as in the cases of GrantsContracts, MultipleSponsors, AcceptSubmittedData, and InstrumentBased; this is certainly the case for the characteristics that establish Cluster A, the "Governmental" cluster. The variables GrantsContracts and MultipleSponsors also play roles in differentiating among the remaining Clusters B, C, and D. A subset of variables (including NaturalScience, SubscriptionMembership, and FreeinthePublicDomain) plays no role at all in the final clustering results, as their values remain consistent across all four groups, probably because the number of either yes or no values was too small to generate much of an effect. It is important to note that for the nominal variables with many response options, HowBased (4), Research/Community/Reference (3), ScientificArea (11), and BusinessType (10), any single response option predominating in a group is particularly indicative of that group's composition.
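The majority-value decomposition behind Table 7 is straightforward to sketch; the clusters, variables, and values below are invented placeholders, not the study's data:

```python
from collections import Counter

# (cluster, {variable: response}) records for a handful of toy SDRs
records = [
    ("A", {"GrantsContracts": "no",  "HowBased": "governmental"}),
    ("A", {"GrantsContracts": "no",  "HowBased": "governmental"}),
    ("A", {"GrantsContracts": "yes", "HowBased": "university"}),
    ("B", {"GrantsContracts": "yes", "HowBased": "independent"}),
    ("B", {"GrantsContracts": "yes", "HowBased": "independent"}),
]

def majority_values(records):
    """For each cluster, return the most common response to each variable."""
    by_cluster = {}
    for cluster, values in records:
        for var, val in values.items():
            by_cluster.setdefault(cluster, {}).setdefault(var, Counter())[val] += 1
    return {c: {v: counts.most_common(1)[0][0] for v, counts in vars_.items()}
            for c, vars_ in by_cluster.items()}

profile = majority_values(records)
```

Comparing `profile` across clusters, and flagging variables whose majority value differs between clusters, is exactly the kind of qualitative highlighting Table 7 presents.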
The clustering results depend on the variables included. Because the 17 variables used in this analysis were chosen for their expected importance and for the feasibility of collecting reasonably accurate and homogeneous data from SDR Web sites, a different selection of variables, or potentially more accurate data, could lead to different results. Several combinations of variables were tested to evaluate the robustness of this solution (details provided upon request), and the data collected were reviewed by at least 61% of the SDR administrators contacted. The alternative solutions showed only minor changes in group membership, as would be expected, and did not differ greatly from each other in semipartial r-squared values, suggesting some stability in the results.
Without an effective way to longitudinally sample the SDR Web sites over time, and with a small subset (22) having been created before 1985, the remaining 78 SDRs (with an inception date of 1985 or later) were studied in terms of mean variable responses over time. One year (1987) in the period from 1985 to 2008 had no observations. The cutoff date of 1985 was selected because it marked a time of remarkable growth in new technologies such as the personal computer and the Internet, and the advent of a variety of informatics disciplines. Although this coarsely samples the time period and may be confounded by nonrandom sampling of the SDRs, it may provide hints as to how some SDR features are changing over time (see details at http://ils.unc.edu/bmh/pubs/SDR_final_sheet.xlsx). Looking closely at the top 10 differentiating variables identified in Figure 4, SDRs with grant and contract support as well as multiple sponsors are on the increase. This may be due in large part to an increased tendency for major governmental agencies like the National Science Foundation (NSF) and the National Institutes of Health (NIH) to fund external entities for projects, perhaps contributing to the trend away from governmentally based SDRs toward more independent and aggregate SDRs. The number of university-based SDRs appears to have remained steady. These findings probably relate to the observation that SDRs in the holding size category 2=medium/broad appear to be increasing, along with perhaps an increasing trend in 1=small/less broad SDRs and a steady emergence of 3=large/more broad SDRs. There is also some evidence that distributed SDRs, those "housed in a set of physical locations and linked together electronically to create a single, coherent collection" (NSB, 2005), are on the rise.
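The mean-response-over-time summary used above amounts to a per-inception-year average of each coded variable. A minimal sketch for one yes/no variable, with invented years and values:

```python
from collections import defaultdict

# (inception year, yes=1 / no=0 for one coded variable) -- invented records
observations = [(1990, 1), (1990, 0), (1995, 1), (1995, 1), (2000, 1)]

totals = defaultdict(lambda: [0, 0])          # year -> [sum, count]
for year, value in observations:
    totals[year][0] += value
    totals[year][1] += 1

# Mean response per inception year; an upward drift in these means is the
# kind of trend described for GrantsContracts and MultipleSponsors.
mean_by_year = {year: s / n for year, (s, n) in sorted(totals.items())}
```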
The existence of a preservation policy, while fairly steady over the period, appeared to fluctuate a little, and it is unclear whether this represents a real pattern or an artifact of the data. There may be some evidence of a decline in SDRs that consider themselves primarily virtually based. This finding may result from a trend toward procuring dedicated staffing or organizational infrastructure to assist with data curation, stewardship, and management, changing the nature of an otherwise virtually based organization. Two other rising trends are notable: a registration requirement, which in some cases limits use to subscribers or members but more generally helps track usage of the data collections and other services of the SDRs; and a slight increase in the tendency both to provide data for export and to ingest submitted data.
A temporal review of an SDR's Web site reveals no clear beginning and end points in the SDR's life cycle, and little information on SDRs that may have existed but, for one reason or another, did not remain in existence. Observing the Web presence of SDRs over time is critical for tracking changes in characteristics, as well as potential measures of success or failure. The Internet Archive, which captures and archives Web pages over time, could be used for such a comparison. To investigate the feasibility of this approach, more than a dozen SDRs included in the study were examined using historical Web site data from the Internet Archive (1997–2007). The basic principle of the archive is to attempt to capture "snapshots" of Web sites by URL over time.
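The Wayback Machine addresses its snapshots with a timestamp-prefixed URL scheme, which is what makes before/after comparisons like the one described here possible. A minimal sketch (the CDIAC URL shown is illustrative only; the site's historical URL may differ):

```python
def wayback_url(timestamp, original_url):
    """Build a Wayback Machine snapshot URL. Timestamps use YYYYMMDD
    (optionally with hhmmss appended); the Machine redirects to the
    capture nearest the requested date."""
    return f"https://web.archive.org/web/{timestamp}/{original_url}"

# Illustrative only: an archived view of the CDIAC site near January 20, 2002.
snapshot_2002 = wayback_url("20020120", "http://cdiac.ornl.gov/")
```

Pairing such a URL with the current live site is the comparison shown in Figure 5; the retrieval failures noted below occur when no capture exists or the archived page is incomplete.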
Figure 5 shows a comparison of the Carbon Dioxide Information Analysis Center (CDIAC) Web site, using the Internet Archive Wayback Machine to obtain an archive taken January 20, 2002, and the current "live" snapshot taken September 24, 2009. Note how the older site is less user friendly, less ADA compliant, and more graphically intensive. Importantly, the search feature has changed over time from being a node to being a central and persistent element in the top right navigation. Notice also how the navigation has developed into a sophisticated structure, aimed at repeat use and displaying the wealth of information underlying the site. Of particular interest, a "Data Submission" element is now present that did not appear in the original version. Despite these changes, this Web site demonstrates more consistency than the bulk of the sites investigated using the Wayback Machine. For a number of the sites examined, data could not be retrieved for a variety of reasons, including retrieval failures, sites using tools that (perhaps inadvertently) block the archival process, inconsistent URLs over time, and incomplete data; in many cases, as in the CDIAC illustration, visual comparisons were unclear because of unavailable images or comparison-tool limitations. The Wayback Machine also does not necessarily capture many layers of information from a Web site, and problems with fixity are apparent for newer sites with more sophisticated back-end programming and dynamically driven pages. The potential to use the Wayback Machine for longitudinal analysis of SDRs is exciting, but, for this study, it was not possible to perform adequate comparisons using this methodology.
Figure 5. Sample use of the Internet Archive WayBack Machine to compare an archive taken January 20, 2002, and a “live” snapshot taken August 27, 2009, for the Carbon Dioxide Information Analysis Center.