Improper data practices erode the quality of global ecological databases and impede the progress of ecological research

The scientific community has entered an era of big data. However, with big data come big responsibilities, and best practices for how data are contributed to databases have not kept pace with the collection, aggregation, and analysis of big data. Here, we rigorously assess the quantity of data for specific leaf area (SLA) available within the largest and most frequently used global plant trait database, the TRY Plant Trait Database, exploring how much of the data were applicable (i.e., original, representative, logical, and comparable) and traceable (i.e., published, cited, and consistent). Over three‐quarters of the SLA data in TRY either lacked applicability or traceability, leaving only 22.9% of the original data usable compared with the 64.9% typically deemed usable by standard data cleaning protocols. The remaining usable data differed markedly from the original for many species, which led to altered interpretation of ecological analyses. Though the data we consider here make up only 4.5% of SLA data within TRY, similar issues of applicability and traceability likely apply to SLA data for other species as well as other commonly measured, uploaded, and downloaded plant traits. We end with suggested steps forward for global ecological databases, including suggestions for both uploaders to and curators of databases with the hope that, through addressing the issues raised here, we can increase data quality and integrity within the ecological community.

Large-scale and long-term monitoring initiatives (e.g., the National Ecological Observatory Network; Keller et al., 2008; Knapp et al., 2012; Magurran et al., 2010) have contributed to the growth of data quantity and accessibility. To organize, aggregate, store, and eventually utilize these data, there has also been a rapid growth of databases and repositories (e.g., Dryad, GBIF, GenBank; Farley et al., 2018), with some suggesting that researchers who fail to embrace these tools risk becoming "scientifically irrelevant" (Hampton et al., 2013). However, with increasing data contribution and usage comes the need for clear and standardized data management practices that the entire ecological research community should adopt.
Transparent curation of datasets and metadata, responsible sharing and citing practices, and appropriate data integration and meta-analytical approaches are but a few key responsibilities of individual researchers and the broader scientific community (Farley et al., 2018; Gallagher et al., 2020; Hampton et al., 2013; Wüest et al., 2020), particularly when it comes to contributing to easily accessible databases.
Several databases have become widely used in the plant ecology research community (e.g., FRED, sPlot; Bruelheide et al., 2019; Iversen et al., 2017), including the TRY Plant Trait Database (Kattge et al., 2011, 2020). TRY was created in 2007 as an open-access database with the goals of documenting plant functional diversity, introducing the broader plant biology community to the utility of traits, and providing access to data that may prove crucial in an era of global change (Kattge et al., 2011).
A so-called "database of databases," its inception was the amalgamation of numerous existing plant trait databases (e.g., GLOPNET, LEDA; Kleyer et al., 2008;Wright et al., 2004), with the hope that consolidating within a single database would simplify and improve accessibility (Kattge et al., 2020).As of 2020, the database has more submissions and downloads than any other trait database (Schneider et al., 2019) and possesses 12 million trait data points (e.g., physiological, morphological, anatomical, and phenological) for almost 280,000 species across the world (Kattge et al., 2020).
Several recent advancements in our understanding of plant form and function are a direct result of TRY (e.g., Bruelheide et al., 2018; Díaz et al., 2016), and the publication associated with the most recent version release (Kattge et al., 2020) has already been cited over 1000 times. The TRY Plant Trait Database has shaped, and will continue to shape, the field's understanding of worldwide plant biodiversity, ecosystem function, and responses to global change. As such, data submitted to TRY and used in regional to global analyses must be reliable. The importance of data quality and accessibility has been widely acknowledged for peer-reviewed ecological datasets (Ely et al., 2021; Fegraus et al., 2005; Gallagher et al., 2020; Keller et al., 2022; Michener, 2006; Poisot et al., 2013; Reichman et al., 2011; Roche et al., 2015; Schneider et al., 2019; Whitlock, 2011). However, less attention has been paid to the quality of the data within the largest and most frequently used ecological databases globally.
We conducted a systematic review of data quality in TRY using its most downloaded trait, specific leaf area (SLA, leaf area per unit dry leaf mass; Kattge et al., 2020), for four cosmopolitan plant groups of distinct taxonomy and function: conifers (gymnosperms, tree/shrub), Quercus (angiosperm, tree/shrub), Plantago (angiosperm, forb), and Poa (angiosperm, graminoid). For each data point, we determined whether it was: (1) applicable, that is, original, representative, logical, and comparable; and (2) traceable, that is, published, cited, and consistent (see the FAIR and ALCOA principles for similar standards; Rattan, 2018; Wilkinson et al., 2016). We used these criteria to create a rigorous data cleaning protocol and compared it with the standard cleaning protocol used by most researchers and suggested by TRY (Lam et al., 2022). With this, we asked: (1) how does the quality of data available from TRY differ between rigorous and standard cleaning protocols? (2) do ecological analyses and interpretations differ between rigorously and standardly cleaned data? and (3) what are the most common issues deteriorating data quality, and how can they best be addressed? Since TRY is open access, most of our investigation focuses on how data are uploaded and the consequences of poor data management practices, as the responsibility for upholding data quality standards for open-access databases should rest with the data uploaders (Table 1).

| Datasets overview
We downloaded SLA data from TRY (i.e., traits 3115, 3116, and 3117; petiole excluded, petiole included, and unspecified SLA, respectively) for all species in our four plant groups: conifers (including Araucariaceae, Cupressaceae, Pinaceae, and Podocarpaceae), Quercus, Plantago, and Poa.

TABLE 1 Definitions of terms frequently used throughout this paper.

BOX 1 The ins and outs of TRY

(1) Individuals who collect original data (i.e., collectors) may upload their own data to TRY (i.e., uploaders) without publishing them in a peer-reviewed journal, while (2) other collectors may upload their data to TRY after publishing the study associated with the data. In both situations, the collector is also the uploader. Alternatively, uploaders conducting meta-analyses may incorporate a collector's (3) unpublished and/or (4) published data into their own dataset and upload them to TRY with the rest of their meta-analysis data.
(5) Other uploaders still may upload a collector's published data into TRY without using it in their own separate analysis. Finally, (6) uploaders conducting meta-analyses may download data from TRY for their analysis and then upload their data back to TRY after publication.
Different data quality issues arise depending on how the data were uploaded to TRY, most of which fall into three overarching groups. (A) The first common issue is unpublished data. Data within TRY are uploaded as a specific trait (e.g., plant height, seed dry mass, and SLA). For unpublished data, collection methods cannot be determined beyond the trait name under which the data were uploaded. Data may be uploaded as multiple traits, or the trait they are uploaded as may be neither informative nor accurate, leaving uncertainty about how comparable unpublished data are to other data in the database. For example, SLA can be uploaded including the petiole, excluding the petiole, or with an undefined petiole status. In none of these cases are the methodologies for measuring leaf area specified, methods may not be standardized, and different methodologies can yield markedly different SLA values (Eimil-Fraga et al., 2015; Garnier et al., 2001). (B) The second common issue is data that were not uploaded by the collector. Data that are readily available online are rarely used in a single analysis. As such, many researchers download, use, and then re-upload others' data, which are often already in TRY, leading to extensive data duplication. Furthermore, the uploader, rather than the collector, receives citation credit for those data. (C) The third common issue is uncited data. Failing to cite others' data is plagiarism, and without reference to the primary literature it is impossible to determine the methodology or whether the data have been altered (see A).
What do data downloaded from TRY look like? Many of the six main TRY data entry pathways result in consistent data quality issues that are evident upon close inspection (second figure in Box 1; note that none of the datasets shown are real and the column names differ slightly from how they appear in TRY for ease of interpretation). When downloading data from TRY, the downloader is provided with both the uploaded data (1-6) and columns that provide metadata for each data point (A-E). These columns include information such as the (A) uploader's name, (B) dataset reference, (C) individual data point reference, (D) species name, and (E) trait value (e.g., SLA). Often, there will be a dataset reference but no reference for the individual data points. The dataset reference is typically either (1) unpublished, (2) a published paper for which the data were originally collected, or (3-6) a paper where previously collected data were (re)used (e.g., a meta-analysis). Although not available from the TRY data download, we also show here columns that help document data quality based on information available from the data point reference: the (F) SLA value in the data point reference and (G) collection methods used to obtain this measurement.
When datasets are composed of (1) unpublished or (2) published data uploaded by the collector, having only the dataset reference is sufficient, as the data point reference typically matches the dataset reference. A dataset with a published reference whose data are publicly available (either in the main text or Supporting Information) outside of TRY permits verification of the methods and data accuracy. This is necessary because a data point is sometimes unusable because the collection methodology is not comparable to field standards (e.g., SLA not measured on a projected leaf area basis) and/or the reported values in the database are not consistent with those in the reference. When datasets contain data from multiple sources (3-6), references become more convoluted. (3) Sometimes the dataset will include references for all individual data points, or (4) the dataset may leave individual data points uncited, generating uncertainty about where the data come from. Furthermore, (5) others might upload data that they neither collected nor used in a meta-analysis or other peer-reviewed research. Finally, (6) some datasets include data downloaded from TRY and re-uploaded, occasionally after being altered (e.g., uploading the global mean value for a species).
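To make this structure concrete, the sketch below (Python) shows how a downloader might flag data points whose source can only be assumed from the dataset reference. Following this box's convention, none of the values or names are real, and the column names are simplified stand-ins for the TRY export fields described above rather than the actual TRY headers.

```python
# Illustrative only: column names are simplified stand-ins for the TRY export
# fields described above (uploader, dataset reference, data point reference,
# species, trait value); they are not the actual TRY column headers.
import pandas as pd

records = pd.DataFrame({
    "uploader": ["A. Researcher"] * 4,
    "dataset_reference": ["Meta-analysis (2015)"] * 4,
    "datapoint_reference": ["Smith 2009", None, "Smith 2009", None],
    "species": ["Pinus elliottii", "Pinus elliottii",
                "Plantago major", "Quercus robur"],
    "sla": [4.1, 4.1, 28.6, 11.9],
})

# Data points without an individual reference inherit the dataset reference,
# which may or may not be their true source (pathways 3-6 above).
records["assumed_source"] = records["datapoint_reference"].fillna(
    records["dataset_reference"])
records["needs_verification"] = records["datapoint_reference"].isna()

# Identical species + value + source combinations hint at duplication.
records["possible_duplicate"] = records.duplicated(
    subset=["species", "sla", "assumed_source"], keep=False)
print(records)
```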
Flow of data into TRY.
Flow of data out of TRY.

Including data that are not applicable contributes to inaccurate ecological interpretations and violations of statistical assumptions. Duplicated data (i.e., multiple identical data points from the same original data source), in particular, contribute to data inflation and nonindependence, breaking assumptions of widely used statistical techniques (Noble et al., 2017). Improper classification of mean values or raw data can lead to mistakes in the treatment of values, as can the inclusion of outlier data. Furthermore, differences in measurement techniques can lead to dramatic differences in trait values (Eimil-Fraga et al., 2015; Garnier et al., 2001). Fundamentally, including inapplicable data in analyses increases the likelihood of poor data quality, errant results, and misleading conclusions. Therefore, data should be original, representative, logical, and collected with comparable methods.
For data points that did not have an individual reference, we initially assumed the dataset reference to be the source. Many data points within datasets were identified as uncited (i.e., they had only a dataset reference but were determined not to be from that source), and for these we were unable to determine a reference. For datasets that we suspected contained uncited data, we searched the literature to verify this assumption and found over 30 papers whose data had been uploaded to TRY without citation. Of the original 120 datasets, 26 contained no data that could be clearly linked to the uploader, including the four largest datasets, which comprised 51.4% (10,175) of all the data we downloaded from TRY. Although this does not necessarily mean the uploader provided no original data (some of the unpublished and/or uncited data within a dataset may have been collected by the uploader, and many datasets were not fully covered by our four plant groups), many uploaders appear not to have been involved in data collection. Only 33.7% of the data points (6679) appear to have been uploaded by the individual who collected them, showing that the credit (i.e., citations) received for uploading data to TRY mostly goes to individuals who did not collect the data.

| Standard data cleaning protocol
We cleaned the downloaded portion of the TRY database following the cleaning methodology suggested by TRY for data quality assurance (Figure 1; Kattge et al., 2020; Lam et al., 2022). Hereafter, this cleaning protocol will be referred to as the "standard" cleaning.
This protocol removed data if they were: (1) marked by TRY as duplicated data (i.e., data points identical to ones already uploaded to the database), (2) not a mean or a single observation (e.g., minimum and maximum values), or (3) marked by TRY as an outlier (i.e., more than three standard deviations away from the species, genus, or family trait mean). After this cleaning, 12,860 data points (64.9% of originally retrieved data) for 304 species (93.5% of the original species with data) remained. Of the removed data, 70.2% (4874) were removed for being duplicated, 4.1% (287) for not being a single observation or mean, and 25.7% (1783) for being outliers (Figure 1).
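As a minimal sketch of this standard protocol (not the code distributed with TRY; the column names `try_duplicate_flag`, `observation_type`, `species`, and `sla` are assumptions for illustration), the three removal steps might look like:

```python
import pandas as pd

def standard_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three standard removal steps described above."""
    # (1) drop rows TRY has already flagged as duplicates
    df = df[~df["try_duplicate_flag"]]
    # (2) keep only single observations or means
    df = df[df["observation_type"].isin(["single_observation", "mean"])]
    # (3) drop outliers more than three standard deviations from the species
    #     mean (TRY also flags outliers at the genus and family level)
    stats = df.groupby("species")["sla"].agg(["mean", "std"])
    z = (df["sla"] - df["species"].map(stats["mean"])) / df["species"].map(stats["std"])
    return df[z.abs().fillna(0) <= 3]
```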

| Rigorous data cleaning protocol
We cleaned the downloaded data a second time following our own more rigorous protocol, which involved tracing each data point back to its original source. Hereafter, this protocol will be referred to as the "rigorous" cleaning (Figure 1). While some of our data cleaning criteria overlapped with the standard cleaning criteria (see 1-3), we set stricter standards for data quality within those criteria.
FIGURE 1 Flow diagram of how data were cleaned using both the standard and the rigorous protocols. Each step for data removal is listed in the bottom left. Blue circles display the amount of data at the start and end of the cleaning process (labeled "Downloaded Data Points" and "Cleaned Data Points," respectively) or the number of each cleaning step (1-7). The size of each circle scales with the amount of data either removed or remaining. Different colored circles signify why data were removed (i.e., not applicable or not traceable, in red and orange, respectively), with the total amount of data removed for each step in the middle of the circle.

Specifically, we focused on removing data that were not applicable (i.e., original, representative, logical, and comparable) and traceable (i.e., published, cited, and consistent). For each of the 19,804 data points, we assessed seven criteria to determine whether they were: (1) original (i.e., not already uploaded; applicable), (2) a representative value of the measured individual or region (i.e., either a raw value or a local mean; applicable), (3) a logical value for that species, genus, or family (i.e., not an outlier; applicable), (4) comparable to other data (i.e., collected using the same methods; applicable), (5) published (i.e., found in the primary literature; traceable), (6) correctly cited (i.e., appropriately attributed to the initial data collector and associated primary literature; traceable), and (7) consistent with the value in the primary literature (i.e., the same as the referenced value; traceable).
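A sketch of how these seven criteria translate into usability flags is shown below. It assumes each data point has already been traced to its source and annotated by hand; none of the field names come from TRY itself.

```python
def assess_data_point(point: dict) -> dict:
    """Return applicability/traceability flags for one hand-annotated record."""
    flags = {
        "original":       not point["already_in_try"],                    # (1)
        "representative": point["value_type"] in ("raw", "local_mean"),   # (2)
        "logical":        not point["flagged_outlier"],                   # (3)
        "comparable":     point["methods_comparable"],                    # (4)
        "published":      point["in_primary_literature"],                 # (5)
        "cited":          point["correctly_cited"],                       # (6)
        "consistent":     point["matches_reference_value"],               # (7)
    }
    flags["applicable"] = all(flags[k] for k in
                              ("original", "representative", "logical", "comparable"))
    flags["traceable"] = all(flags[k] for k in ("published", "cited", "consistent"))
    flags["usable"] = flags["applicable"] and flags["traceable"]
    return flags
```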
We want to emphasize that this manual cleaning protocol is not an expectation for TRY data users. It took hundreds of hours despite representing only 4.5% of all SLA data in TRY, an unreasonable amount of work for data users to undertake to ensure high-quality data. Rather, this exercise highlights issues in how data are integrated and handled by the standard TRY cleaning protocols and many other ecological databases worldwide.

We manually identified duplicated data by tracing all data points to their original data source, which removed 42.7% (8453) of the data and represents the largest source of unusable SLA data for our chosen plant groups. Despite this also being a criterion in the TRY data cleaning protocol, we manually identified nearly twice as many duplicates as the standard cleaning protocol. Nearly all (88.3%; 7462) of the duplicated data were not uploaded by the data collector, meaning over half (56.9%) of contributed data not uploaded by collectors were duplicated. This demonstrates that uploading others' data often only benefits the uploader, who receives credit via citations without contributing new, original data for users. We then removed data that were not a local mean or raw value, following the TRY protocol, which removed an additional 0.7% (148) of the data. This ensured data were representative of an individual observation or local mean and comparable across measurements, which is needed for one of trait data's most common uses: trait-by-environment analyses (Legendre et al., 1997). We next removed data that were more than three standard deviations away from the species, genus, or family mean, as marked and suggested by TRY, which removed an additional 7.5% (1495) of the data. The amounts of data removed in steps two and three are slightly smaller than those removed by the same steps in the standard cleaning protocol because some of these data were already removed by our more rigorous duplication cleaning step. Finally, we removed conifer data that were not measured on a projected area basis or for which measurement methods were not specified, as methodological differences may result in incomparable data values. While SLA can be measured in multiple ways on broadleaved species (Garnier et al., 2001), we found that variation among SLA measurements is small compared with the variation between measurement types in conifers and other needle-leaved species (Figure 2a). This removed an additional 2.2% (437) of the data. Across these steps, data lacking applicability removed during our cleaning protocol composed 53.2% of the original SLA dataset for our specified plant groups downloaded from TRY (Figure 1).
Data lacking traceability were then removed. We first removed unpublished data uploaded by the collector (6.9%, 1359) and cited unpublished data not uploaded by the collector (5.4%, 1060). Uncited data were then removed (8.1%, 1614). We typically only considered a data point to be uncited if it came from a dataset that was a meta-analysis or composed of numerous other datasets and the data point had no reference assigned to it (see Appendix S1). We then removed data from cited, published references that were neither public (i.e., not available outside of TRY) nor uploaded by the collector, which cannot be verified as original data points or checked for accuracy (1.1%, 218). Finally, we removed data that were reported differently in TRY and the cited reference. Typically, when this occurred, the data in TRY were considerably different from those presented in the reference, presumably as a result of either unexplained data transformations or improper manual extraction from figures. There were also instances when the mean of the raw data uploaded to TRY differed from the mean presented in the cited reference by 2.3%-27.9%, presumably because only a subset of the data were used in the cited analyses or only a subset of the data was uploaded to TRY. Rarely, data were rounded to the nearest integer or were uploaded for the wrong species (e.g., Salix data uploaded for Picea abies). Removing data that were not consistent with values in the primary literature removed an additional 2.7% (489) of the data.
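The consistency check (criterion 7) amounts to a simple comparison between the uploaded value and the value reported in the cited reference, as in the sketch below; the tolerance is an illustrative choice, not a threshold prescribed by TRY.

```python
def is_consistent(try_value: float, reference_value: float,
                  rel_tol: float = 0.02) -> bool:
    """Flag uploaded values that differ from the cited reference value."""
    if reference_value == 0:
        return try_value == 0
    return abs(try_value - reference_value) / abs(reference_value) <= rel_tol

# For example, a species mean uploaded to TRY that differs from the published
# mean by 27.9% fails this check, while an exact match passes.
```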
In total, 23.9% (4740) of the downloaded SLA data were removed due to their lack of traceability (Figure 1).
In total, 77.1% (15,273) of the data were not usable by these criteria, with 53.2% removed due to lack of applicability and 23.9% due to lack of traceability (Figure 1). After rigorous cleaning, only 221 (68.0%) of the original species had usable data. While the total number removed for each criterion varied depending on the order in which we removed data (e.g., there would be fewer items flagged as duplicates if we took this step last rather than first), the total amount removed did not. Notably, our cleaning protocol removed 49.7% (compared with 35.1%) of the data when addressing the issues targeted in the TRY cleaning protocol (i.e., duplicates, observation type, and outliers). This discrepancy is mostly due to the large difference in duplicates removed between the two cleaning processes (4874 vs. 8166), illustrating the difficulty of automating the detection of duplicated data.

| Data quality impacts on ecological analyses and interpretation
For each species, we determined the sample size and mean SLA in each cleaned dataset (standard vs. rigorous) and the absolute change in mean SLA between the cleaned datasets. The standard cleaning protocol led to a median of 7.5 data points per species (mean = 42.3, range = 1-903) compared with 3.0 (mean = 20.4, range = 0-474) following the rigorous cleaning protocol (Figure S1). The mean SLA varied the least within the angiosperms, with the median absolute change in SLA ranging from 1.5% to 7.7% within the three groups, and with 6.4% of species having an absolute change over 25% (Figure 2b). Conifers varied substantially, primarily due to the different methods used to measure SLA (Figure 2b). The median absolute change in conifers was 9.1%, and almost one-fifth (19.8%) of the species had an absolute change over 25% (Figure 2b), suggesting that conifer SLA data used in prior analyses are not of comparable methodology and are therefore not applicable.
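A sketch of this per-species comparison is given below, assuming two DataFrames (`standard` and `rigorous`) that each hold `species` and `sla` columns; these names are placeholders rather than TRY export fields.

```python
import pandas as pd

def species_summary(standard: pd.DataFrame, rigorous: pd.DataFrame) -> pd.DataFrame:
    """Sample size and mean SLA per species, plus absolute % change in the mean."""
    s = standard.groupby("species")["sla"].agg(n_standard="size",
                                               mean_standard="mean")
    r = rigorous.groupby("species")["sla"].agg(n_rigorous="size",
                                               mean_rigorous="mean")
    out = s.join(r, how="outer")
    out["abs_change_pct"] = ((out["mean_rigorous"] - out["mean_standard"]).abs()
                             / out["mean_standard"] * 100)
    return out

# e.g., summary = species_summary(standard_df, rigorous_df)
#       summary["n_standard"].median(), summary["abs_change_pct"].median()
```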
FIGURE 2 How data quality impacts ecological interpretations. (a) Comparison of the effect of measurement type for a conifer (Pinus elliottii) and a broadleaved species (Plantago major). Data are separated by whether they were measured on a projected leaf area basis (lighter colors) or in another/unknown way (darker colors). The data used were cleaned using the standard protocol. Darker-colored vertical lines represent the means resulting from using the whole dataset, while lighter-colored vertical lines represent the means using data measured on a projected leaf area basis. For P. major, the difference in mean SLA was moderate (15.0%). However, for Pi. elliottii, the difference was 53.2%, and there was no overlap between the two measurement groups. Sample sizes for each species are displayed below the species name. (b) Comparison of how the number of species and mean SLA for each species varied between the standard and rigorous cleaning protocols. Each point represents an individual species, and the lines pair the mean SLA of each species between the cleaning protocols. The taxa mean and interquartile range are represented by boxplots. The numbers below the bars represent the number of species in each group and cleaning protocol. (c) A comparison of how an ecological analysis varies depending on the strictness of data inclusion. The relationship between SLA and drought tolerance in conifers is shown using the database cleaned with the standard protocol (left), the rigorous protocol (right), and both protocols (center). In the center panel, the lines pair the mean SLA of each species between the cleaning protocols. SLA, specific leaf area.
We ran a standard trait-by-environment analysis to test how conifer SLA varies with drought index, with the hypothesis that SLA becomes smaller (i.e., leaves are smaller and/or thicker) as drought index increases (McCulloh et al., 2023). We found support for this hypothesis using both datasets (p < .001); however, both the magnitude of the effect (−0.008 vs. −0.012) and the variance explained (0.078 vs. 0.147 in the standard and rigorously cleaned data, respectively) were greater using the more rigorously cleaned data (Figure 2c). While the direction of the effect remained consistent across datasets, data errors can contribute to incorrect linear model estimates, an issue that would likely be magnified in non-linear modeling (Cragg, 1994). Indeed, non-linear relationships between environmental drivers and biotic responses are common and contribute to misinterpretation in frequently used statistical techniques in ecology (e.g., the "arch effect" in ordination; Gauch, 1982; Goodall, 1954; Legendre & Legendre, 1998; Podani & Miklos, 2002). Additionally, rampant data duplication contributes to data inflation and nonindependence, breaking the typical assumptions of the most widely used statistical techniques (Noble et al., 2017). Currently, ecological analyses using TRY data appear, at best, not as powerful as they could be and, at worst, wrong.
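A sketch of this trait-by-environment comparison is shown below; the column names (`drought_index`, `sla`) and the use of an ordinary least-squares fit from scipy are illustrative assumptions, not the original analysis code.

```python
from scipy import stats
import pandas as pd

def sla_vs_drought(df: pd.DataFrame) -> dict:
    """Fit SLA against drought index and report slope, r-squared, and p-value."""
    fit = stats.linregress(df["drought_index"], df["sla"])
    return {"slope": fit.slope, "r_squared": fit.rvalue ** 2, "p_value": fit.pvalue}

# Running this once on the dataset cleaned with the standard protocol and once
# on the rigorously cleaned dataset allows the slopes and r-squared values to
# be compared directly, as in Figure 2c.
```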

| DATA QUALITY CHALLENGES AND NEEDS FOR GLOBAL ECOLOGICAL DATABASES
The issues we uncovered in our analyses generally reflect improper dataset uploading practices and inadequate curation of data within databases.
Importantly, our findings should not be viewed as a criticism of open-access data practices. Instead, we use this case study to illustrate necessary precautions when working with big data from global ecological databases, precautions that largely stem from issues of applicability and traceability. Fortunately, these issues can be addressed both proactively and retroactively. Proactively, uploaders must take the required steps to ensure that their data are applicable and traceable (Box 2). To address applicability, methods for data collection must be explicit in the primary literature, and if unpublished data are uploaded, metadata with methodology must be included. Without such metadata, individual data points are not comparable and cannot reasonably be included in the same analyses. For example, SLA varies systematically with ontogeny (Dayrell et al., 2018; Ye et al., 2022) and environmental conditions (Poorter et al., 2009; Wellstein et al., 2017) in ways that can influence statistical analyses and alter ecological interpretations of results (Lusk, 2004). Without consideration of these and other sources of variation in big data analyses, individual data points are incomparable, data applicability is eroded, and results are not properly caveated (Körner, 2003, 2017, 2018).
To proactively address traceability, citing each data point and verifying its accuracy are vital first steps. Prior to uploading another collector's data, uploaders should first verify whether these data are already in the database. Alternatively, researchers can contact the data's original collector to determine whether they have uploaded them. If not, original data collectors should be given the chance to upload their own data and provide the necessary metadata and an appropriate reference. We strongly recommend this option, as it allows data collectors to receive credit when their data are cited.
In general, we recommend that data from papers that compile data from original studies (e.g., meta-analyses) should not be uploaded to databases that are frequently used for reanalysis (e.g., TRY), as these often include duplicated and/or altered data and result in the uploader receiving credit (i.e., citations) for data collected by others (Kueffer et al., 2011). While open-access data for these kinds of studies are necessary for research transparency, data repositories with an emphasis on housing individual datasets (e.g., Dryad) are the most appropriate locations to store these data.

BOX 2 Recommended steps forward for global ecological databases
Uploaders:
• Do not upload data to a database if those data were previously downloaded from that same database and, in general, avoid uploading data that you did not collect yourself.
• Consider which database is most appropriate for your data given your specific study design and methodology (e.g., databases for reanalysis vs. repositories for data storage).
• Follow field-specific methodological protocols and ensure methods and metadata are clear, replicable, and publicly accessible regardless of database guidelines.

Curators:
• Provide and require a standard data uploading format (sensu the Environmental Data Initiative).
• Require all data points to be traceable (i.e., published, cited, and consistent across all data sources; if unpublished, citable methods and metadata must be provided to ensure data are applicable).
• Dedicate time, infrastructure, and resources to the proper management of data within databases (e.g., hire full-time research scientists to screen databases for unusable data and hold uploaders to database standards).

Downloaders:
• Consider using rigorous cleaning protocols prior to using database data for analyses, while acknowledging the amount of time and effort these protocols may take.
• Be mindful of interpretation errors when performing analyses on uncleaned data or even data cleaned with standard protocols.
One of the most powerful aspects of TRY is its ability to facilitate novel research and questions, which is undermined by the inclusion of duplicated data. TRY should therefore house exclusively original, raw data uploaded by the original data collector, rather than altered and duplicated data, which erode data quality to the detriment of those downloading from and using TRY.
Traceability may be further improved by limiting data in databases to those associated with published primary literature. We recognize that considering unpublished data unusable in our analysis is exclusionary. However, over one-third (39.6%, 7847) of the entire dataset does not have a published reference associated with it, and almost a third of these data (31.4%, 2465) appear to be uncited (i.e., plagiarized).
Of the unpublished data within our dataset, only 22.1% (1737) were uploaded by the collector and, of the rest, almost half (49.1%, 3000) were duplicates. We found altered (e.g., transformed or rounded) and incorrect (e.g., reported differently in TRY than in the primary data reference) data within TRY, raising uncertainty as to whether unpublished data are usable. Additionally, there are inadequate metadata to determine how unpublished data were measured, leaving uncertainty as to whether they are applicable as well. Outliers made up 16.0% of the unpublished data (compared with 10.1% of published data), emphasizing the role that data scrutiny (i.e., peer review) plays in ensuring that logical data are collected, uploaded, and eventually reused in large-scale meta-analyses and reviews. These issues reinforce that published data are necessary for quality assurance.
Retroactively, datasets must be managed by database curators (Box 2). An important first step to improve data applicability would be for curators to require thorough metadata upon data upload.
Data repositories (e.g., Knowledge Network for Biocomplexity; Jones et al., 2019), many of which are funded by government agencies (e.g., the National Science Foundation), have metadata requirements, highlighting the recognized value of metadata in best-practice data protocols.
To retroactively improve traceability, data not associated with the original collector should be checked by curators and removed when they were not uploaded by the collector. Curators should require that any uncited data be edited to include all associated references. It is critical to remember that failing to cite others' data is plagiarism and is no different from failing to cite others' ideas (Roig, 2015). We found multiple examples of datasets with uncited data but many citations in the peer-reviewed literature, rewarding (through citations) uploaders for data that are not their own.
Additionally, when data do not match the values found in the publication, curators should require uploaders to add information explaining why, whether it be the inclusion or exclusion of certain data or a data transformation. Again, we found multiple examples of data that appear to have been consistently, and perhaps intentionally, altered for the purposes of peer-reviewed publication and that do not match the data uploaded to TRY (or vice versa).
While we are optimistic that the scientific community can resolve many of the problems within databases, there are larger systematic issues within the field that are more difficult to address. Many of the current issues in data management stem from the incentive structure of databases and the pressure it puts on researchers. As researchers increasingly utilize publicly available databases to guide research questions and conduct analyses, the abundance of currently available data will influence the trajectory of future research and data collection. If researchers are presented with data within a particular subset of taxa or traits, this may further incentivize additional research into those subjects at the expense of less well-studied taxa or traits (sensu "consolidating" vs. "disruptive" science; Park et al., 2023). The scientific community also disproportionately celebrates those who upload data rather than those who collect the data. As a result, continual re-uploading of the same data artificially inflates data quantity.
Within the current "publish or perish" academic climate, there is much to gain from uploading others' data and becoming an author on some of the most cited papers in the field. While the scientific community values the important ecological insights gained from big data papers, citing the individual papers where original data come from, rather than just the databases that accumulate them, must be normalized (e.g., Joswig et al., 2022). In taking these and other steps, we may better celebrate those who collect the data, often early-career researchers who are much less established than those uploading data. As we enter this era of big data in ecology, the scientific community must continually work toward equitable data use and proper database management.