Assessing comparable bioconcentration potentials for nanoparticles in aquatic organisms via combined utilization of machine learning and toxicokinetic models

The toxicokinetic (TK) model‐derived kinetic bioconcentration factor (BCFk) provides a quantitatively comparable index to estimate the bioaccumulation potential of nanoparticles (NPs) that barely reach thermodynamic equilibrium in aquatic organisms, but experimental data are limited for various NPs. In the present study, a machine learning model was applied to offer reliable in silico predictions for the dynamic body burden of diverse NPs to derive corresponding parameters for the TK model. The developed eXtreme Gradient Boosting‐derived TK (XGB‐TK) model was applied to predict BCFk results for a broad range of metallic or carbonaceous NPs, with an appreciable prediction R2 of 0.96. The BCFk values were predicted based on a random combination of selected variable features, revealing that their bioaccumulation potential showed an overall negative correlation with NP density or organism size. By applying importance analysis and partial dependence plots, NP density and organism size were revealed to be the top essential features that impact the bioaccumulation potential. The conjunctively used XGB‐TK model enabled a prior comparison for diverse NPs and straightforward derivation on the dependency of features, which could also guide the bioaccumulation mechanism exploration and experimental condition formulation.


| INTRODUCTION
The foreseeable increasing production and application of nanoparticles (NPs) might induce rising existent varieties and amounts of NPs in aquatic environment, raising concerns about exposure and potential risks to organisms. [1][2][3][4] It is essential to study the uptake of these NPs by an aquatic organism as indispensable information to assess the toxic effects induced within the organism. 5,6 In the past decades, even though the uptake of NPs by typical organisms has been studied, there is still a huge knowledge gap on how NPs are taken up by aquatic organisms, 7 leading to their bioaccumulation potential incomparable and unpredictable. The bioconcentration factor (BCF), defined as the ratio of concentration in an organism to the concentration in water under thermodynamic equilibrium, was also applied to illustrate the bioconcentration of NPs as well. [8][9][10] However, experimental results have shown enormous discrepancies in the uptake and BCF of NPs in organisms, even in studies on the same type of NPs. [11][12][13] For example, the reported BCF values of Ag NPs ranged from 0.45 to as high as 8.75 × 10 4 L/kg, 14,15 which showed a wide range of 5 orders of magnitude. The incorrect use of steady-state bioconcentration models should be noted as one of the explanations for the inconsistent results, as the reduction of NP concentrations in exposed suspensions induced by precipitation 16 or dissolution 17 proceeded constantly. The common approach to experimentally determine BCF is unsuitable, as NPs in aquatic organisms barely reach thermodynamic equilibrium. These calculated BCF values could only be considered as instant enrichment of NPs, namely, instant BCF (BCF i ). This predicament put the determination of bioaccumulation potential for NPs into a time-consuming and error-incidental scenario during seeking assumed equilibrium conditions in laboratory. Thus, body burden (BB), which represents the NP concentration in an organism at a given time, is probably a better index to evaluate the uptake capacity. However, BB is dynamic and not able to represent the comparable degree of accumulation potential for various NPs and organisms. For a proper interpretation of the bioaccumulation potential, the toxicokinetic (TK) model, which relies on the measured chemical uptake and elimination rate, was used in the present study to predict BCF values under a pseudo-steady state. 18,19 This model was suitable to predict the kinetic BCF (BCF k ) of NPs that had a dynamic bioconcentration and a pseudo-steady state, which are not being widely used as yet. [20][21][22] The BCF k value provides a quantitatively comparable index to estimate bioaccumulation potential for specific NP that barely reaches thermodynamic equilibrium in aquatic organisms, enabling prior comparison among various NPs.
As two key inputs, experimentally measured uptake and elimination rate constants (k u and k e ), need to be determined for the use of the well-established TK model to predict BCF k values. The successful prediction of BCF k of TiO 2 NPs 10 and hematite NPs 23 by fitting these parameters for the TK model suggested that this method is an applicable approach to assess the bioconcentration potential of NPs. However, the data of these parameters are unavailable for numerous NPs of concern, which is extremely lack for risk assessment. The values of these parameters involving uptake and elimination are highly affected by the physicochemical properties of NPs, 24,25 the tested organism species, 26 and the experimental conditions. 27 Various effective factors have led to an exponential increase in laboratory work for data that are, nevertheless, often highly multidimensional. 28 Currently, in silico algorithms offer labor-saving methods to obtain the parameters and reveal the unknown dependencies of various factors with the TK model-calculated BCF k .
The various developed machine learning (ML) algorithms have shown enormous potential in predicting the uptake of NPs by training with experimental data sets for guiding chemical synthesis, 29,30 establishing structure-activity relationships, [31][32][33] and predicting biosafety indicators. [34][35][36][37] For instance, the random forest (RF) regression algorithm has shown good performance in predicting the formation of protein coronas, 35 immune responses, 38 and lung burden of NPs. 36 Very recently, 34 an artificial neural network was reported to predict the uptake and translocation of NPs in plants, having obtained accurate predictive results (R 2 > 0.8) as well. In a study of the crystallization propensity of metal-organic nanocapsules, eXtreme Gradient Boosting (XGBoost) afforded the highest prediction accuracy of >90% among the nine trained algorithms when trained using data sets consisting of multiple dimensions. 29 So far, use of ML algorithms for predicting the uptake potential of NPs in aquatic organisms has not yet been reported, which is largely due to the lack of training data that are directly reported as BCF values. In related studies, the uptake concentrations in the time course (C t ) are the main types of data reported, which need to be fitted in the TK model for deriving the kinetic parameters. Except for being utilized to directly predict toxicological results and parameters, 38 ML algorithms can be used to generate a broad set of NP concentrations in forecasting exposures, including enormous diversity of NPs, aquatic organism species, and exposure parameters. The variable features generated by TK models can be constructed to determine the BCF k of NPs. The utilization of ML algorithms to predict in silico data for further mathematical calculation or models in which experimental parameters are limited has been tested and demonstrated available in practice. By applying an RF algorithmbased quantitative structure-activity relationship (QSAR) model to generate in silico parameters for TK models, Dawson et al. 18 prioritized reliable risk assessment of various chemicals that required input parameters that were unavailable for the TK model, while in the current study, the approach using the ML model for filling data sets met the dynamic data needs of the TK model. Uptake of NPs over a time period could be predicted to provide the lacking data for establishing the TK model for each forecasting exposure with different exposure conditions, which might significantly improve the robustness of the TK model, compared with a direct prediction of the parameters, like k u and k e , for the TK model. 20 Although the high heterogeneity of quantitative and qualitative features of experimental conditions challenges the tolerance of algorithms to diverse input data, the comparison and assessment of algorithms are applicable for determination of appropriate ones to obtain accurate predictions and reveal the importance of features. In addition, the feature interpretation from ML output may pave the way for further investigations and standardization of experimental conditions.
In the present study, ML algorithms were used to explore the dynamic BB of metallic or carbonaceous NPs (CNPs) by aquatic organisms and their feasibility in predicting BCF k was evaluated by deriving parameters for TK models. The key features governing the variations of BCF values were identified, indicating the dependency of the parameters on the results. The utility of robust ML-derived TK models is promising in predicting the bioaccumulation potential of NPs with a wide range of features in diverse aquatic organisms before labor-costing experimental efforts, markedly improving the efficiency of assessing the exposure risk of tested or untested NPs. The straightforward determination of the dependency of factors, which is somewhat ignored by researchers, also guided bioaccumulation mechanism investigations and formulation of experimental conditions.

| Data extraction
Bioconcentration data of NPs in aquatic organisms were extracted from peer-reviewed publications, which were obtained by an extensive literature search using the ISI Web of Knowledge and Scopus databases with the searched topics of "nano*, uptake, bioaccumulation or bioconcentration, aquatic." A total of 40 searched papers were screened according to the following conditions: (i) full text was available; (ii) characterizations of NP and experimental conditions were detailed; (iii) uptake was reported as quantitative data like BCF values or concentrations in organisms; and (iv) the exposure was performed in aqueous environment. The reported BB values were included in the data sets with no further processing. For the papers that did not report exact BB values, the input log BB values were collected from the reported value or extracted from the published figures using the "Digitizer" tool provided by OriginLab. 36 In predicting BCF i , extracted BB values were transformed into BCF i as defined by OECD 305 39 using the following equation: where C t is the BB of NPs (μg/g) in the tested organisms at any sampling time and C w is the NP concentration in the exposed suspension (mg/L). To determine comprehensive relationships between exposure features and the bioconcentration of NPs in aquatic organisms, 24 features (see the figure in Section 3.3), which are usually controlled in exposure experiments, were considered effective factors and transferred to descriptors to establish the data sets. These features could be divided into nine material properties, seven animal properties, and eight exposure conditions. Finally, 577 data sets were extracted for establishing the model, which is detailed in the Supporting Information: Data set, and the distribution of various NPs and tested organisms is summarized and shown in the Supporting Information: Figure S1.

| Data processing
In screened publications, some inputs of descriptors for sampled data are missing due to the exclusion of important factors. For example, the body length of the tested organisms was not specifically reported in approximately 15 papers, while the surface charge, related to the agglomeration of NPs in suspension, was also missing in about 20 publications. The missing values were included by calculating the related parameters that could be obtained from other literature sources, that is, the body length of an organism could be determined using a reported coefficient of age to body length. Specifically, the specific surface area (SSA), which was reported as being relevant to the exposure risk, 40 was not considered in most of the screened publications. Thus, we, respectively, calculated the SSA values for spherical and rod-shaped NPs as follows: where ρ is the density of bulk metallic materials; d is the size of NPs; and l is the length of nanorods. For the features that are collected as characters, such as NP type, shape, surface modification, and tested organism, a coded transfer is needed to convert such variables into a form that can be recognized by ML algorithms. Here, a widely accepted "one-hot" encoding method was applied to deal with these disordered characteristic features. Due to the diversity of NP types (n = 11) and tested organisms (n = 8), the characteristics were not directly transferred using the one-hot method to avoid rapid expansion of the data sets. Classification descriptors were defined based on the intrinsic properties of features. As a numerical feature, "age" was also classified into new descriptors due to the wide differences in age ranges of various tested organisms. "NP types" was classified into "metallic NP (MNP)" (0/1) and "CNP"; "tested organisms" was classified into "Plankton," "Invertebrate," "Vertebrate," and "Carapace"; and "age" was classified into "Neonate," "Juvenile," and "Adult." To reduce the biases caused by the wide ranges of BCF values under various exposure conditions, the data were converted into values with less span by logarithmic transformation as log BCF.

| ML models
In the present study, a total of 10 different ML algorithms, that is, logistic regression, Gaussian Naïve Bayes, k-nearest neighbors, support vector machine, decision tree, RF, adaptive boosting, XGBoost, multilayer perceptron, and least absolute shrinkage and selection operator (LASSO), were trained to construct models and predict BB or BCF i of NPs. Detailed information about the ML algorithms is presented in the Supporting Information. The XGBoost model was determined as the ML algorithm for further in silico data expansion, owing to its good performance in prediction (see the Supporting Information for a detailed description). In classification training, the data sets of BCF values were classified as "bioaccumulative" (log BCF > 3.3) and "nonbioaccumulative" (log BCF < 3.3), according to the definition by REACH. 41 For predicting consecutive in silico BB (C t ), extracted BB or concentration values were input for regression analysis. The XGBoost model was used to directly predict the values of log BB and log BCF of NPs in each tested organism and further modeling. Collected data were shuffled and divided into training (70%) and test data sets (30%). Classification or regression models were generated by applying algorithms to the data sets. The generated models were then implemented in the test data sets to assess the prediction performance. A fivefold grid-search cross-validation (fivefold Grid-SearchCV) strategy was implemented in the study to minimize the possible overfitting and choose the best hyperparameters of the models. The LabelEncoder method was used to convert nominal descriptors into variables in data sets before constructing the models. All models in this study were constructed using Python 3.8 with the scikit-learn package.

| Evaluation metric
The model performance in classification was evaluated using a 10-fold cross-validation process and an external validation process. In the 10-fold cross-validation, the data sets were randomly divided into 10 equal subsets, where nine of them were used to train the model and the remaining subset was used to test the model performance. The validation was repeated 10 times. To ensure the reliability of the ML workflow and model results, the evaluation indicators for classification models included accuracy, precision, recall, F1, receiver-operating characteristic (ROC) curves, and the area under the ROC curve. The detailed calculation methods are presented in the Supporting Information. The correlation coefficient R 2 and the root mean square error (RMSE) were used as evaluation metrics to determine the regression performances of the models. The importance of all selected features was evaluated using the trained algorithms by determining the increase in the mean square error (MSE) and arranged according to the important values of features. Next, the model was iteratively fitted with decreasing input numbers of feature subsets. The prediction accuracy of each fitted model with a decrease in the most important features or the least important features was evaluated to optimize the model.

| TK model integration and evaluation
To achieve BCF prediction within a TK model context, an ML algorithm was utilized to predict in silico BB values in the time course (C t ) according to extracted BB data sets. The prediction outputs were set as inputs for establishing TK models. A one-compartment kinetic model was applied and fitted by the reported and predicted uptake concentrations in time course. The model was described as follows: where C is the instant NP concentration in the tested organism (mg/g); t is the exposure time (h −1 ); k in is the influx rate constant (L/(g·h)), described as the total uptake of NPs to the corresponding concentration in the medium; C w is the NP concentration in the medium (mg/L) and assumed to be constant; and k out is the efflux rate constant (h −1 ), reflected by the elimination rate of accumulated NPs from the organisms. After the integral transformation, BB values at time t(C t ) can be expressed as Under the condition where the equilibrium state is reached, the BB value at steady-state C ss = k in C w /k out BCF k can be calculated by Thus, the BCF value could be calculated as the ratio of the influx rate constant and the efflux rate constant using the first-order kinetics fitted TK model. The k in and k out values were obtained using the least-squares method. The predicted BCF k values within the context of the TK model that depends on the k in and k out values were evaluated by regression analysis among the reported results, the ML-derived TK model, and conventional TK models.

| Analysis of heterogeneous data
The data extraction of the bioconcentration of NPs in aquatic organisms was performed using the process described in Section 2.1. A total of 577 data points (Supporting Information: Figure S1) were extracted from 40 publications related to the bioconcentration of NPs and included in this study. The detailed referred publications are listed in the Supporting Information. The raw data related to the physiochemical properties of NPs (zeta potential, hydrodynamic diameter, density, size, SSA, carbon-based particle, metal-based particle, surface modification, shape), features of organisms (body length, body weight, plankton, vertebrate, invertebrate, shell, life stage-adult, life stage-neonate, life stagejuvenile), and experimental conditions (exposure time, exposure concentration, temperature, pH, dissolved oxygen, natural organic matter [NOM]) obtained through data mining are listed in the Supporting Information: Data set. To avoid the disruption of modeling performance, unrecognizable characteristic features (NP type, shape, surface modification, and tested organism) were transferred to numerical frequency using the one-hot method. 36 The quantitative variables (n = 12) and qualitative features (n = 12) were included in the modeling as statistical values (mean, frequency, and distribution range) to implement models.
The 24 features included in the present study covered the main issues related to the uptake behavior of NPs in aquatic organisms. 25,42 It is noteworthy that protein corona and toxicology of NPs were not included in current features. It has been found that protein corona forms on the surface of NPs as hard and soft layers might have an influence on the biological effects of NPs, including cytotoxicity, uptake, biodistribution, and biotranslocation. 43 However, the formation and type of protein corona were complex and variable in biological exposures, making it challenging to operate the ML algorithm, with the corona structure being included as one of the features. In the present study, the main focus was on the intrinsic properties of NPs for reliable prediction. In addition, the formation of a corona was shown to be dictated by the intrinsic physiochemical properties of NPs, 35 ensuring the robust operation of the current algorithm without NP corona as one of the features. Toxicological effects of NPs might alter the bioaccumulation as well by alleviating activity, disturbing ingestion behavior, or even inducing death. 44 However, the type of effects such as inflammation, apoptosis, or reactive oxygen species are complex and variable in biological exposures, which severely disrupt the robust operation of the ML algorithm on inclusion in the model. In addition, the structure-activity relationship between physiochemical properties of NP (size, zeta potential, shape, etc.) and effects has been well established. 36,38,45 Thus, the intrinsic properties of NPs included in current features can also ensure the robust operation of the ML model and accurate prediction without toxicology as one of the features.
Linear regression was used before to fit the data to explore the relationships between each descriptor and the corresponding BB values, as shown in the Supporting Information: Figure S2. The rather low R 2 (<0.5) and quite high RMSE (>100) values shown in the fitting indicated that a single variable could not accurately reflect the trend of uptake behaviors of NPs. Also, the wide range of log BCF values of different NPs in various organism species suggested high heterogeneity of the data, as shown in the Supporting Information: Figure S3, making it difficult to analyze the data and draw an unbiased conclusion by laboratory work or traditional linear regression. The highly heterogeneous bioconcentration potentials of NPs were mainly due to the diversity of NP types (n = 11) and organism species (n = 7). NPs that have been widely used or proven to be of high risk were studied by more researchers to obtain more data. For example, TiO 2 NP accounted for 94 data sets, while the number of data sets of AuNP was only 4 (Supporting Information: Figure S1). This diversity in the number of data sets also leads to an imbalance of NP type, decreasing the robustness in modeling, 28 while the most tested organism was Daphnia magna, a typical filter feeder, which was found in 53.6% of all data sets (Supporting Information: Figure S1), resulting in major bias in the uptake rate and amount. To decreases the biases caused by the imbalance of discrete characteristic features, classification descriptors were defined before transferring by the "one-hot" method for multiple descriptive features (NP types and organism species). NPs were classified as carbonaceous and metallic due to the significant differences in physicochemical properties. Organism species were classified as plankton, invertebrate, vertebrate, and carapace, mainly based on their different living habits, feeding modes, and moving abilities.

| Direct prediction on dynamic BB
The regression of direct prediction on log BB values by XGBoost was performed and is shown in Figure 1A. As a widely utilized model for disaggregated data, the gradient boosting regression tree model-based XGBoost algorithm, which is useful for both classification and regression tasks, achieved an impressive prediction performance, with an R 2 of 0.98 and an RMSE of 0.19. Prediction results showed the advantage of XGBoost in dealing with multiple dimensional parameters associated with disaggregated data for extrapolating to a suppositional exposure. This ML model succeeded in avoiding the complicated internal interactions between parameters, which might irregularly influence the prediction performance of regression models relying on mathematically depicted relationships. 36,46 Figure 1B presents the prediction results of the BCF i value (log BCF i ), which were obtained by inputting BCF values transferred from BB (Equation (1); BCF = C t /C w ). Here, the concentration of NP in the exposure medium (C w ) was assumed to be constant due to the following reasons: (1) Short exposure period: In most of the studies where we collected uptake data sets, the exposure duration was set as 72 h or shorter for an observed steady state. The concentration of tested NPs in this period generally slightly decreased by several percentages, which could be neglected, compared with the increase of BB by orders of magnitudes in organisms. (2) Renewal of exposure suspension: In reported exposure that lasted for a long term, the exposure suspension was renewed daily by following the principle suggested by OECD 202 to minimize the disturbance to the uptake process. (3) Experimentally measured sedimentation of NPs: The external literature was explored for NPs that lacked sedimentation information. In a short exposure period of less than 72 h, a small sedimentation/settle rate (<30%, typically) was found under similar conditions in aquatic exposure studies for Au NP, 47 ZnO NP, 42 TiO 2 NP, 48,49 and CeO 2 NP. 50 However, the predicted BCF i was considered to be higher than the measured values in a real exposure, since the deposition of NP from exposure suspension tends to constantly occur for most NPs. 51,52 Thus, BCF i could not represent the bioconcentration potential of a specific NP due to the difficulty in determination of the equilibrium state, but might depict its dynamic enrichment in the tested organism in an environmental system, which is normally open and dynamic. 21 The predicted BCF i in time course for a simulated exposure is plotted in the Supporting Information: Figure S7 to show its instant enrichment during the exposure. Computational determination of the maximum degree of enrichment (BB max and BCF imax ) was performed to identify the utmost threat that an exposed organism encountered. For example, in a reported exposure scenario 53 in which Danio rerio was exposed to a few-layer graphene suspension with a concentration of 75 μg/g, the measured maximum BB and BCF i at 48 h were found to be 29.37 μg/g and 391.63 L/kg, respectively, while the predicted BB max and BCF imax at 80 h were found to be 44.50 μg/g and 593.33 L/kg, respectively. This prediction also provided a bioconcentration profile with higher resolution (fewer intervals) and longer exposure term, guiding the formulation of parameters, like exposure time and sampling interval, for the experimental study.

| Feature importance analysis
The complex interaction among features makes quantification of their relative importance difficult by research or using traditional linear regression models. In the present study, the XGBoost model was applied to measure the significance of each feature or descriptor by tracking the change in the error (increase in MSE) on the out-of-box data in the permutation feature importance analysis. 54 The feature importance investigation could well reveal the key physicochemical properties of NPs and exposure conditions that are responsible for bioaccumulation behavior. 45 The relative scores for the feature importance generated from the MSE increase, as shown in Figure 2A. It appeared that the top six features far outweighed others in determining log BB values in bioconcentration prediction. These features included material properties, animal properties, or exposure conditions, two for each, suggesting the multiple influences on the model. By decreasing the top descriptors stepwise for model training, the accuracy of the XGBoost model gradually decreased with the suppressed R 2 and elevated RMSE, as shown in Figure 2B. When the number of descriptors decreased from 24 to 17, the R 2 decreased from 0.92 to 0.86 and the RMSE increased from 0.30 to 0.50, indicating the indispensable roles of these features in enabling the optimal operation of the current model. The accuracy test with deletion of the most unimportant features (Supporting Information: Figure S8 The density of NPs showed the greatest impact on BB, which might be beyond the intuition of researchers in this field. Commonly, properties such as size and surface charge (zeta potential) are considered to impact the uptake of NPs by altering the agglomeration state. 42 However, on comparing across various NPs, density might be found to play an important role in the BB value, which is applied as a mass quantity. 55 Other than organic chemicals, bioaccumulation of NP was actually determined by the mass transport process of NP in the intestinal tract of the organism, where internalized NP is mainly distributed, 26,53 while the process was proven to be dominated by NP density-determined sedimentation. 56 Thus, density is, therefore, a particularly important property governing bioaccumulation, though the extent and detailed mechanism of the impact are still unclear. Density was often overlooked in bioaccumulation studies due to its indirect relationship with the uptake results. The use of the ML model provided a straightforward way to identify the important features that may be ignored due to knowledge bias and ambiguous evidence. Partial dependence plots were drawn and are shown in the Supporting Information: Figure S9 to illustrate the nonlinear dependence of predicted BB on density. Complex turbulence displayed in dependence with the variable density further indicated the great disturbance of this feature to the performance of the model.
The size of the tested organism showed a significant influence on the uptake prediction owing to the high rank of two animal properties (body weight and body length). Body weight was directly related to the calculation of BB, while the body length might determine the volume for the accommodation of NPs. 9,26 It is noteworthy that the different types of organisms with huge variations in body weight and body length scored much less in this analysis (ranking below 11th), avoiding the overestimation of the importance of organism size. It is well known that exposure conditions impact the uptake. Exposure concentration and exposure time directly affect the intensity and duration of interaction between organisms and NPs, thus governing the BB. Partial dependence plots were drawn and are shown in Figure 2C as well to visualize the variation of BB with exposure time. The exposure time has shown a complex marginal effect on predicted BB values for exposure up to 48 h, suggesting that the observed steady state was reached fast in the studies. However, more attention should be paid to this pseudoequilibrium state due to the turbulence shown in the follow-up dependence, suggesting the necessity for more careful identific of the equilibrium state. This dependence model could be utilized to determine the actual steady state in a long-term prediction for experimental studies.

| Predicting bioconcentration potential using the XGBoost-derived toxicokinetic (XGB-TK) model
To assess the bioconcentration potential of NPs in an exposure assumed not to have reached equilibrium, a onecompartment toxicokinetic model was incorporated into the ML model for predicting BCF k . 57 In the current study, the XGBoost model was utilized for predicting BB values in the time course (C t ) as in silico data sets within the context of the TK model. Figure 3A shows example sets of predicted results for known exposure, with the corresponding reported data presented in the Supporting Information: Figure S10, and forecasting exposure (red plots). The first-order kinetic regression fitted the predicted data sets well for deriving TK parameters (k in and k out ), with an impressive value of R 2 greater than 0.90, as shown in Table 1. BCF k was thus calculated for each exposure scenario. Here, TK parameters were directly obtained using a MATLAB-driven least-squares method after fitting the first-order kinetic equation, 58 ensuring that k in (uptake rate; k u ) and k out (elimination rate; k e ) values could fulfill the scenario of bioaccumulation, simultaneously including uptake and depuration processes. In a typical uptake-elimination experiment, the k out was solely measured by transferring the exposed organism to a pure depuration medium, generally water. The measured k e in the clean medium may not be incorporated into the TK model (Equation (5)) to calculate k in due to the space exclusion effect. 59,60 In a daphnia model, the presence of diets or NOM significantly accelerated the depuration rate of graphene NP, compared with the process in freshwater. 59 This effect was also found in zebrafish 53,60 and snail 61 models. In the TK model, the uptake and depuration simultaneously induce the bioaccumulation of NP. The continuous uptake of NP aggregates will undoubtedly push the accumulated NP in backend, increasing the efflux rate of NP. Thus, the measured k e was probably underestimated to derive BCF in the context of the TK model and consequently yielded overestimated BCF. The values of the predicted BCF k in the present study appeared to be lower than the values obtained using the reported TK model. 10 The measured log BCF k value of TiO 2 NP in D. magna of 5.61-6.03 at an initial exposure concentration of 1 mg/L was higher than the XGB-TK predicted value of 4.76 in the same exposure scenario. Depuration was conducted after an extremely short accumulation time (2 h), which led to quite limited uptake of TiO 2 NP in daphnia. The measured k e is positively related to the concentration in the organism, according to the first-order decay model. The k e was thus underestimated, leading to a higher BCF value than that in the present study.
Although the kinetic model appeared to be a suitable method to predict the bioconcentration of NPs, the practical use of this time-saving approach requires caution due to the space exclusion effect, which was particularly true for NPs. A mathematical method was suggested for deriving rate constants, which showed good fitting to the kinetic model, which included the complex interaction between uptake and elimination processes ( Figure 3A-D). The regression of log BCF k values was as shown in Figure 3E, clearly indicating that the XGB-TK model achieved excellent prediction accuracy on train and test data sets, with R 2 of 0.97 and 0.96, respectively. Randomly selected reported data sets, which included the real measurements that are independent of the modeled data sets, had been excluded from the training sets to construct the XGBoost model to further test the  accuracy of extrapolating prediction based on a combination of external variables. The excellent prediction performance improved the capability of the XGB-TK model to predict bioconcentration potential for a broad range of NPs and aquatic organisms. The prediction results for each combination of NPs and organisms, which are included in the present data sets, are summarized in Figure 4, illustrating the comparable bioconcentration potential of a broad range of NPs. The organism species were assigned to the x-axis based on body size. The overall tendency of BCF with increasing animal body size has been shown to be negative roughly, indicating that smaller organisms had greater NP bioconcentration potential. This sizedependent tendency has been observed in some studies, 9,26 although this has not been systematically investigated as yet. 21 As the second and third place features in importance analysis (Figure 2A), weight and size (body length) simultaneously impact the uptake of NPs significantly. We proposed that the proportion of body size to body weight might be an index can explain this tendency. As a material that does not undergo phase partition, NPs need space to remain in the dominant distributed tissue, generally the intestinal tract. An organism that has larger intestinal volume relative to body weight probably has a larger BB, which is a body weight-normalized concentration. Besides, feeding habitats that significantly impact the uptake rate also contribute to bioaccumulation. As a typical filter feeder with a small body size, D. magna and A. salina showed obviously greater BCF for NPs, while the bigger fish had a rather lower bioconcentration potential for NPs due to the weak capability to drink water along with consumption of food. 26 The significant impact of animal species and size on the bioconcentration potential revealed by the current XGB-TK model indicate the necessity of using standard organisms with similar sizes to conduct a bioaccumulation study for NPs.
The total predicted log BCF k values across all species showed comparable bioaccumulation potential for F I G U R E 4 Summary of the prediction results for each combination of nanoparticles (NPs) and organisms. A random combination of features was assigned as input to run the prediction, with three replicates for each combination. The organism species were assigned to the x-axis according to body size. The NPs were arranged by density.
various NPs, suggesting an overall higher bioconcentration potential for CNPs than MNPs. In the context of the current XGB-TK model, CNPs were concluded to be "bioaccumulative," with log BCF k values higher than 3.3, based on the definition of REACH, 41 while nearly all MNPs could be regarded as "nonbioaccumulative" materials due to the lower predicted log BCF k values than 3.3. The only exception was TiO 2 NPs, which were shown to be bioaccumulative in filter feeders with a small body size. The XGB-TK model-predicted results can definitely reflect the overall tendency of trained data sets, showing good agreement between the observed and predicted BCF. 20 On comparison of the bioaccumulation of various NPs by reviewing the calculated BCF using the TK model, it was found that C 60 was much more bioaccumulative than several MNPs such as ZnO NP, AgNP, and CuO NP. However, Bjorkland et al. 62 proposed low bioaccumulation potential for carbon nanotubes, which showed the highest bioconcentration potential in the present prediction. This discrepancy was largely due to the different benchmarks. The trophic transfer and absorption of the materials that have been reviewed for bioaccumulation assessments were not included in the present model for the quantitative prediction of BCF k values. The uptake and elimination rates showed that the index (prediction results for a known exposure are shown in Table 1) well modeled the process in an exposure scenario, which could reflect the bioaccumulation potential. It was still unclear why CNPs had higher bioconcentrations than MNPs because this index is related to various factors. We speculated that dissolution might be a very important factor in the low bioaccumulation of MNPs, while the CNPs usually did not undergo this transformation. For CuO, Ag, and ZnO, the dissolution in exposure media or organism could be as high as 80%, 63 which was not to be counted in as NP bioaccumulation. As the most impactful feature suggested by importance analysis, NP density definitely determined the bioconcentration potential. In Figure 4, each NP was arranged by density for showing the section of log BCF k . NPs with larger densities generally showed lower bioconcentration potential. For instance, the log BCF k for Au NP with the highest density is located at the bottom of the rendered image, while the lighter TiO 2 NP with low density occupied the top area in the tested MNPs. DeLoid et al. 56 suggested performing accurate in vitro dosimetry of NPs by determination of the effective density and diameter of the formed agglomerates because the mass transport of NPs in organisms is determined by the density of the suspending fluid. In fact, the bioaccumulation of a specific NP was intrinsically determined by the mass equilibrium between the inflow and the outflow of the suspending NP, which implies the important role of NP density. However, the mechanistic explanation for the impact of NP density on bioconcentration potential is still largely unknown due to the general exclusion of density as a significant influencing factor on the BCF value. Although the currently used model was also unable to explain the density-induced alterations in bioconcentration due to the limited interpretation ability of the ML model, their comprehensive predicting results uncovered the hidden relationship that may lead to the future investigation in current area.

| CONCLUSION
Limited bioaccumulation data, the high cost of animal studies, and large knowledge gaps for ascertaining the equilibrium state make bioconcentration assessment of NPs challenging. To comparatively interpret the bioconcentration potential of a broad range of NPs, the ML algorithm was used in conjunction with the TK model to predict the quantitative index of BCF k . The well-extrapolated performance of the current XGB-TK model, with an appreciable prediction R 2 of 0.96 for external test data sets, guaranteed reasonable estimates of bioconcentration potential with limited reported accumulation information. Predicting in silico BB using the XGBoost algorithm using a random combination of features provided an applicable strategy to expand data sets for the TK model that was feasible to depict the bioaccumulation process of NPs under an unreachable steady state. 20 Benefiting from the interpretation ability of the ML model to determine the relative contribution of each feature to the predicted results, the significant impact of NP density was revealed. However, the sole utilization of the XGBoost algorithm is unable to explain the mechanism. Considerable efforts still need to be made to boost the interpretability of the model for mechanism exploration. The development of a new ML framework or combined utilization of the ML model and a conventional model to determine the interaction between features can contribute to the identification of an internal mechanism for complex systems. The predicted BCF k results for a broad range of NPs broadly summarized the bioconcentration potential of commonly studied NPs, which enabled a comparison of these NPs before experimental design. We also speculated that ML models and the TK model could also be used in combination to predict the bioconcentration potential of other untested NPs and emerging particulate pollutants such as (sub) microplastic.