Predicting PM2.5 Concentrations Across USA Using Machine Learning

Economic growth, air pollution, and forest fires in some states in the United States have increased the concentration of particulate matter with a diameter less than or equal to 2.5 μm (PM2.5). Although previous studies have tried to observe PM2.5 both spatially and temporally using aerosol remote sensing and geostatistical estimation, their accuracy was limited by coarse resolution. In this paper, the performance of machine learning models in predicting PM2.5 is assessed with linear regression (LR), decision tree (DT), gradient boosting regression (GBR), AdaBoost regression (ABR), XGBoost (XGB), k-nearest neighbors (K-NN), long short-term memory (LSTM), random forest (RF), and support vector machine (SVM) using PM2.5 station data from 2017 to 2021. To compare the accuracy of the nine machine learning models, the coefficient of determination (R2), root mean square error (RMSE), Nash-Sutcliffe efficiency (NSE), root mean square error ratio (RSR), and percent bias (PBIAS) were evaluated. Among all nine models, the RF (100 decision trees with a max depth of 20) and support vector regression (SVR; nonlinear kernel, degree 3 polynomial) models were the best for predicting PM2.5 concentrations. Additionally, comparison of the PM2.5 performance metrics showed that the models had better predictive behavior in the eastern United States than in the western United States.

that machine learning finds generalizable predictive patterns. Prediction is aimed at forecasting unobserved outcomes or future behavior using historical data. Recently, Bazoukis et al. (2021) conducted a systematic survey of machine learning and conventional methods on health data sets and concluded that machine learning methods performed better than conventional methods for the task. Furthermore, machine learning algorithms often outperform traditional statistical methods, which require larger quantities of historical data to discern the relation between explanatory and target variables with similar accuracy (Makridakis et al., 2018, 2020).
In this study, we focused on using both machine learning and time-series methods. Prediction time is an essential factor that needs to be fully considered and optimized in a PM 2.5 prediction model. A practical and reliable PM 2.5 predictor needs not only short-term forecasting to prevent PM 2.5 disasters in time (Liu et al., 2022) but also long-term forecasting to prevent pollution, guide control measures in cities, and revise policies pertaining to air pollution. Time-series models predict future events by using past trends (Ni et al., 2017). With the advent of machine learning, time-series modeling can better learn the dynamic relations between variables over time and thus better explain what caused previous trends. While most of the aforementioned papers used nonlinear and linear regressions to correlate and predict PM 2.5 levels over a single region, this study examines the entire United States and predicts PM 2.5 concentrations over several regions using machine learning models (Peng, 2015; Xu et al., 2018). This paper aimed at better predictability of PM 2.5 values throughout different areas of the United States, as it is of great importance to know the future concentration of air pollutants such as PM 2.5, which can cause adverse health risks such as premature death, respiratory illnesses, and ischemic heart disease (Apte et al., 2018). A research paper that investigated PM 2.5 concentrations in China reported a problem similar to the one we faced when collecting PM 2.5 concentration data in the United States, namely the insufficient coverage of the ground stations measuring PM 2.5 (Song et al., 2022). However, at the time of this study, the United States had no counterpart to FY-4A, a group of Chinese geostationary weather satellites that allows the capture of PM 2.5 data with high temporal and spatial resolution (Song et al., 2022). As mentioned above, satellite image data can fill the gaps where
station data's coverage is limited and can increase the accuracy of statistical and machine learning models. Another study predicting PM 2.5 in regions of China, "Obtaining vertical distribution of PM 2.5 from CALIOP data and machine learning algorithms", highlights the use of machine learning models, especially the Extra Trees (ET) model, which is similar to the random forest (RF) model in that it incorporates multiple decision trees (B. Chen et al., 2022). A significant point of agreement between that study and ours is that an ensemble of decision trees achieved the best performance in predicting PM 2.5 concentrations, albeit in entirely different regions of the world: the United States and China.
Earlier studies explored their data with a limited number of statistical models; in this study, however, we used nine machine learning models to find the best estimate of PM 2.5 concentrations over a specified period. The data sets were split into two portions: 80% and 20% as training and testing data sets, respectively. The training data sets were used to train the models, whereas the testing data sets were used to evaluate the performance of the trained models. Since multiple machine learning models were evaluated, we used metrics such as root mean square error (RMSE) and mean absolute error (MAE) to compare them and cross validation to determine the best hyperparameters. In addition, our research took a novel approach to PM 2.5 concentration research by examining concentrations over the USA, as opposed to China, where many existing PM 2.5 studies have been conducted. This paper presents the predictions of PM 2.5 over different states in the USA. The data collection process and the different machine learning techniques applied in the context of time-series predictions are described in Section 2. Results and discussion are given in Section 3, and finally, the overall conclusions drawn from the present study are presented in Section 4.
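The split-and-tune workflow described above (80/20 train/test split, cross validation over hyperparameters) can be sketched as follows. The features and targets here are synthetic stand-ins, not the station data used in the study:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Hypothetical predictor matrix and PM2.5-like target (synthetic data).
rng = np.random.default_rng(42)
X = rng.random((200, 4))
y = X @ np.array([3.0, -2.0, 1.5, 0.5]) + rng.normal(0, 0.1, 200)

# 80/20 train/test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Cross validation over a small hyperparameter grid, on the training set only.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [10, 20]},
    cv=3, scoring="neg_root_mean_squared_error")
search.fit(X_train, y_train)

# RMSE on the held-out 20%, used for model comparison.
rmse_test = float(np.sqrt(np.mean((search.predict(X_test) - y_test) ** 2)))
```

The grid and estimator choices here are illustrative; the paper does not report its exact search ranges.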

Ground PM 2.5 Measurements
Daily PM 2.5 observational data were collected from January 2015 to December 2021 through the OpenAQ air quality database (https://openaq.org/) using calls to its Application Programming Interface (API). The data were derived from 1,081 stations around the USA. These ground-level PM 2.5 concentrations were collected according to national standards, and after data preprocessing, the data integrity exceeded 97%. These preprocessed PM 2.5 concentration data served as the dependent variable of the models. In this paper, the daily PM 2.5 concentration data captured by the 1,081 ground monitoring stations were aggregated into monthly and seasonal series to observe patterns at monthly and seasonal scales. There were small gaps in the PM 2.5 data captured at certain stations, and we applied linear interpolation to fill them. The gaps arose because stations can be sparsely located, so ground-level monitoring of PM 2.5 is not always continuous between regions (Lin et al., 2015). This can be further observed in Figure 1, which displays the ground-level monitoring sites over the United States; the eastern part of the US has a smaller number of sites than the western part, and the central US has very few stations. One observed trend was that remote areas (areas on the outskirts of major cities) had much lower PM 2.5 concentrations than urban areas, consistent with the scarcity of anthropogenic sources of air pollution in those areas.
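The gap filling and monthly aggregation steps described above can be sketched with pandas; the ten-day series below is hypothetical, not actual OpenAQ data:

```python
import numpy as np
import pandas as pd

# Hypothetical daily PM2.5 series (μg/m³) for one station, with short gaps.
idx = pd.date_range("2017-01-01", periods=10, freq="D")
pm25 = pd.Series([8.0, 9.0, np.nan, np.nan, 12.0, 11.0, 10.0, np.nan, 9.0, 8.5],
                 index=idx)

# Linear interpolation fills the internal gaps, as in the preprocessing step.
filled = pm25.interpolate(method="linear")

# Aggregate the daily values to monthly means (monthly/seasonal scales).
monthly = filled.resample("MS").mean()
```

The same `resample` call with a quarterly frequency would produce the seasonal series.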

K-Nearest Neighbors
The k-nearest neighbors (K-NN) model was one of the earliest machine learning models to be devised. It works by placing an unclassified data point (a data point whose category/class we do not know) into the class to which most of that point's k nearest neighbors belong (Alfeilat et al., 2019; Cover & Hart, 1967). The performance of the K-NN model depends on the distance measure (Euclidean distance or Manhattan distance, to name a couple) and the hyperparameter k, which can be fine-tuned via cross validation. Although in most cases the preferred distance is Euclidean, the choice of the hyperparameter k is especially important, since too small a value of k leads to overfitting, while too large a value can cause underfitting. After the distances are measured, the distances between the unclassified data point and the classified data points are sorted, and the k smallest distances are chosen (S.-H. Wu et al., 2008). We then look at the classes of these k nearest data points and assign the unclassified data point the class that most of them share.
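The distance-sort-and-vote procedure just described can be written out directly; the 2-D points and labels below are hypothetical:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Assign x_new the majority class among its k nearest training points,
    using Euclidean distance, as described above."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # distance to every point
    nearest = np.argsort(dists)[:k]                  # indices of k smallest
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote

# Hypothetical two-class data: one cluster near the origin, one near (5, 5).
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
label = knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3)
```

For regression (as in PM 2.5 prediction), the vote is replaced by averaging the k neighbors' target values.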

Random Forest
RF is a machine learning algorithm proposed by L. Breiman (2001); it uses ensemble learning by combining multiple decision trees, each of which uses the classification and regression tree (CART) algorithm for learning. Each tree in the RF is created with a combination of bootstrapped sampling, random feature selection, and majority voting (or averaging), which helps create an ensemble of diverse and accurate decision trees (Park & Kim, 2019). This ensemble approach can handle complex data and reduces the risk of overfitting that may occur with a single decision tree. The total number of parameters in the RF model can be estimated by considering all decision trees in the ensemble (total parameters = number of trees × parameters per tree; L. Breiman, 1996). RF also ensures the data in each region are partitioned as uniformly as possible by dividing the variable space into smaller subspaces (L. Breiman, 2001; Franklin, 2005). Pairing RF with bootstrap resampling yields better performance, as this resampling method extracts k random samples (with replacement) from the original training set to better equalize data among classes that have less training data. RF works by using many classifiers (the multiple decision trees constructed) and taking the majority class decided by those trees. The primary reason RF is widely used in research and industry is that taking the majority class from multiple decision trees greatly reduces the error relative to relying on a single tree.
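A minimal sketch of an RF regressor with the configuration reported in the abstract (100 trees, max depth 20); the data are synthetic, not the study's station data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic features and target standing in for the PM2.5 predictors.
rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 0.5, 300)

# 100 trees with max depth 20, per the abstract; each tree is grown on a
# bootstrap sample with random feature selection (scikit-learn defaults),
# and predictions are averaged across trees.
rf = RandomForestRegressor(n_estimators=100, max_depth=20,
                           bootstrap=True, random_state=0)
rf.fit(X, y)
preds = rf.predict(X[:5])
```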

XGBoost
XGBoost is an optimized gradient boosting algorithm made efficient by parallelization and is used for both classification and regression tasks (Chen & Guestrin, 2016). It is considered a boosting algorithm because it combines multiple decision trees as weak learners, which are learned using gradient descent, an algorithm that minimizes the loss of the objective function being learned. Since the boosting algorithm uses gradient descent, it is considered a gradient boosting algorithm, and since the model itself uses out-of-core and distributed computing, XGBoost tends to be an efficient gradient boosting implementation (Friedman, 2001). The framework incorporates several key features that make it popular and highly effective for various machine learning problems.

Long Short-Term Memory
Long short-term memory (LSTM) excels at predicting from time-series data and can learn longer-term dependencies than other models suited to the same kind of data, such as the recurrent neural network (RNN); one of its great advantages is that it deals with the vanishing gradient problem (Alahi et al., 2016; Kong et al., 2019). Unlike the RNN, the LSTM can capture long-term dependencies, an ability essential for predicting future PM 2.5 concentrations, since the concentrations vary with time and form a time series. An LSTM works by discarding information the algorithm considers irrelevant (forget gate), receiving new information through the input gate, and forwarding the updated information through the output gate to the next LSTM unit, which undergoes the same process. The LSTM mitigates the vanishing gradient problem by replacing the repeated multiplication of very small gradient terms with an additive cell-state update, which also helps it retain essential long-term information from the past, something an RNN can fail to do.
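The gate mechanism described above can be summarized with the standard LSTM update equations, where $\sigma$ is the logistic sigmoid, $\odot$ denotes elementwise multiplication, $x_t$ is the input, $h_t$ the hidden state, and $c_t$ the cell state:

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(additive cell update)}\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The additive form of the cell update $c_t$ is what lets gradients flow over long horizons without repeated shrinking multiplications.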

Decision Tree
Decision trees are a widely used machine learning algorithm owing to their simplicity and good performance with a smaller number of classes and less complex data (Quinlan, 1986). They are a graphical representation of all possible solutions to a decision-making problem, where each node represents a decision and each branch represents an outcome or a decision that leads to further nodes or leaves (Hastie et al., 2009). The tree structure begins with a root node, which can be further split into new nodes (each denoting a class). The decision tree model decides whether to split by comparing the mean squared error (MSE) of the prediction before and after a candidate split; if the MSE is lower after the split, the split is performed (Quinlan, 1986).
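The MSE-based split rule can be made concrete with a small function that scans candidate thresholds on one feature; the sample data are hypothetical:

```python
import numpy as np

def best_split(x, y):
    """Pick the threshold on feature x that most reduces prediction MSE,
    mirroring the split rule described above."""
    base_mse = np.mean((y - y.mean()) ** 2)          # MSE before any split
    best = (None, base_mse)
    for t in np.unique(x)[:-1]:                      # candidate thresholds
        left, right = y[x <= t], y[x > t]
        # Weighted MSE of the two children after the candidate split.
        split_mse = (len(left) * np.mean((left - left.mean()) ** 2) +
                     len(right) * np.mean((right - right.mean()) ** 2)) / len(y)
        if split_mse < best[1]:                      # split only if MSE drops
            best = (t, split_mse)
    return best

# Two clearly separated groups: the best split should land between them.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 5.5, 5.2, 20.0, 21.0, 19.5])
threshold, mse_after = best_split(x, y)
```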

Gradient Boosting Regression
Gradient boosting regression combines multiple weak learners into one model using the boosting method, which yields better accuracy than any of the individual weak learners (Friedman, 2001). Gradient boosting regression works by fitting a series of regression trees to the data, where each subsequent tree is trained to correct the errors of the previous ones (Hastie et al., 2009). As the "gradient" in its name suggests, it uses gradient descent to minimize the loss of the model being learned. For regression tasks, the loss function is typically the MSE; the classification variant of the model instead uses the log loss cost function.
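The fit-the-residuals idea can be written out by hand in a few lines: each shallow tree is fit to the residuals (the negative gradient of the squared-error loss) left by the current ensemble. The data are synthetic:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear target.
rng = np.random.default_rng(2)
X = rng.random((200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.1, 200)

# Hand-rolled gradient boosting with squared error: start from the mean,
# then repeatedly fit a shallow tree to the current residuals and take a
# small gradient-descent step in function space.
lr, pred = 0.1, np.full_like(y, y.mean())
for _ in range(100):
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, y - pred)               # fit the current residuals
    pred += lr * tree.predict(X)        # shrunken correction step

mse_start = np.mean((y - y.mean()) ** 2)
mse_final = np.mean((y - pred) ** 2)
```

In practice one would use `sklearn.ensemble.GradientBoostingRegressor`, which adds subsampling, early stopping, and other refinements.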

Support Vector Regression
Support vector regression (SVR) is used to find a hyperplane in the feature space that best approximates the relationship between the input variables and the corresponding target variables. The hyperplane is defined by a set of support vectors, which are the data points that have the most significant impact on its location and orientation (Drucker et al., 1997; Hastie et al., 2009). SVR is commonly applied to time-series prediction tasks and was considered quite a novel forecasting approach when first used for them, because SVR had usually been used to create nonlinear boundaries via kernel functions (Bao et al., 2016). In forecasting tasks, the model is trained independently on the same training data with different targets, and the function it learns can be linear or nonlinear through kernel functions (Suykens & Vandewalle, 1999). The learned function is then used for prediction and is usually evaluated with either RMSE or MAE.
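A minimal SVR sketch with the kernel configuration reported in the abstract (nonlinear, degree 3 polynomial); the data and remaining hyperparameters are illustrative:

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic cubic target, a case a degree-3 polynomial kernel suits well.
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, (200, 1))
y = X[:, 0] ** 3 + rng.normal(0, 0.05, 200)

# Degree-3 polynomial kernel, per the abstract; C and epsilon are
# assumed values for illustration.
svr = SVR(kernel="poly", degree=3, C=10.0, epsilon=0.01)
svr.fit(X, y)
preds = svr.predict(X)
rmse = float(np.sqrt(np.mean((preds - y) ** 2)))
```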

AdaBoost Regressor
AdaBoost is a popular boosting technique that combines boosting with weak learners, which lets it require less training data than other boosting approaches (Freund & Schapire, 1997). The main difference between AdaBoost and other boosting approaches, and its advantage, is that it uses relative error rather than absolute error as its loss. Using relative error lets AdaBoost compare performance among its predictions directly, rather than chasing a lower overall value (which would need more data points to reach a lower loss). The AdaBoost regressor "adapts" by fitting to the data set and adjusting its sample weights in proportion to the error of its current prediction, which reduces bias (as the fit moves closer to the desired values) and reduces variance, as the predictions in the final model are closer together (due to the relative-error comparison among predictions).
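A short AdaBoost.R2-style regression sketch; the default weak learners in scikit-learn are shallow decision trees, and the `loss="linear"` option reweights samples by their error relative to the largest error in each round. The data are synthetic:

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

# Synthetic stand-in data.
rng = np.random.default_rng(4)
X = rng.random((200, 3))
y = 6 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.2, 200)

# Sample weights adapt each round in proportion to relative prediction
# error; 'linear' is the relative-error loss discussed above.
ada = AdaBoostRegressor(n_estimators=50, loss="linear", random_state=0)
ada.fit(X, y)
preds = ada.predict(X[:5])
```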

Linear Regression
Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. The model "learns", or fits to the data, by minimizing the sum of squared vertical distances (residuals) between the fitted line or surface and the desired values (training data). This gives the best-fit line or surface, which in this paper's case can then be used to predict future PM 2.5 values with time as the independent variable. Linear regression is widely used in various fields for predictive modeling, relationship analysis, and forecasting (Berk, 2020).
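With time as the independent variable, the fit-and-extrapolate step is a one-liner; the trend and values below are hypothetical:

```python
import numpy as np

# Hypothetical monthly mean PM2.5 values with a mild downward trend.
t = np.arange(24, dtype=float)                 # months as the time variable
pm25 = 12.0 - 0.1 * t + np.random.default_rng(5).normal(0, 0.3, 24)

# Ordinary least squares minimizes the sum of squared vertical residuals.
slope, intercept = np.polyfit(t, pm25, deg=1)
forecast_next = slope * 24 + intercept         # extrapolate one month ahead
```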
Before applying machine learning methods to this time-series data, we needed to ensure stationarity, a preprocessing prerequisite. Stationarity is required before applying machine learning models to time-series data because it eliminates properties that change with time, leaving only the innate statistical properties and relationships between variables. To achieve this, we removed the seasonal and annual variations from all the states' data for the entire time period. Figure 2 shows an architectural flow diagram for building the best-performing PM 2.5 prediction model, displaying the pipeline for training the various machine learning models and then selecting the best one.
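One simple way to remove a seasonal component, in the spirit of the deseasonalizing step described above, is to subtract each calendar month's long-term mean; the series here is synthetic:

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with an annual (seasonal) cycle plus noise.
idx = pd.date_range("2017-01-01", "2020-12-31", freq="D")
rng = np.random.default_rng(6)
series = pd.Series(10 + 3 * np.sin(2 * np.pi * idx.dayofyear / 365.25)
                   + rng.normal(0, 0.5, len(idx)), index=idx)

# Subtract each calendar month's long-term mean to strip the seasonal
# cycle, one simple step toward stationarity.
monthly_means = series.groupby(series.index.month).transform("mean")
deseasonalized = series - monthly_means
```

The study's actual procedure (removing both seasonal and annual variations) may differ in detail; this only illustrates the principle.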

Evaluation Parameters
The errors between the estimated and real values were evaluated using different statistical metrics such as RMSE, mean absolute percentage error (MAPE), and Nash-Sutcliffe efficiency (NSE). The RMSE is the square root of the mean of the squared differences between the predictions and the desired values. Because the RMSE shrinks as predictions approach the desired values, a lower RMSE is desired; it means that more predicted values are close to the desired values. Another way to evaluate models is R 2, also known as the coefficient of determination, an indicator of the agreement between the predicted and desired values ranging from 0 to 1. A greater R 2 value is desired because stronger agreement means the predicted and desired values are more closely related and more likely to be similar (Santhi et al., 2001; Van Liew et al., 2003). The MAE is similar to the RMSE, but instead of the average of the squared differences between the predicted and desired values, we take the average of the absolute differences. By the same reasoning as for the RMSE, a smaller MAE is optimal, since a smaller difference between predicted and desired values means smaller error. The MAPE is the ratio between the errors and the observations (the desired values, to keep the language consistent) expressed as a percentage; since we want the errors to be minimal, the lower the MAPE, the better the accuracy of the model (G. Chen et al., 2018). The root mean square error ratio (RSR) is the ratio of the RMSE to the standard deviation of the observed data (the desired values; Stajkowski et al., 2020), and it is classified into four intervals: very good (0.0 ≤ RSR ≤ 0.50), good (0.50 < RSR ≤ 0.60), acceptable (0.60 < RSR ≤ 0.70), and unacceptable (RSR > 0.70) (Khosravi et al., 2018). The NSE is an evaluation metric that measures the magnitude of the residual variance relative to the variance in the data (Nash & Sutcliffe, 1970). Model performance with respect to the NSE is rated as follows: very good (0.75 < NSE ≤ 1.0), good (0.65 < NSE ≤ 0.75), satisfactory (0.50 < NSE ≤ 0.65), and unsatisfactory (NSE ≤ 0.50). The percent bias (PBIAS) is the absolute value of the sum of the differences between the predicted and desired values, divided by the sum of the desired values and multiplied by one hundred to obtain a percentage (Malik et al., 2018; Nury et al., 2017). Since the numerator, the absolute value of the sum of the differences between predicted and desired values, is the determining factor, the lower it is the better; thus, a lower PBIAS denotes a better model. In practice, for research and industrial purposes, PBIAS is classified into four ranges: very good (PBIAS < ±10), good (±10 ≤ PBIAS < ±15), satisfactory (±15 ≤ PBIAS < ±25), and unsatisfactory (PBIAS ≥ ±25).
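In their standard forms, the metrics described above can be written as follows, using the notation defined below:

```latex
\begin{aligned}
\mathrm{RMSE} &= \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_{pi}-x_{oi}\right)^{2}}, &
\mathrm{MAE} &= \frac{1}{N}\sum_{i=1}^{N}\left|x_{pi}-x_{oi}\right|,\\[4pt]
\mathrm{MAPE} &= \frac{100}{N}\sum_{i=1}^{N}\left|\frac{x_{oi}-x_{pi}}{x_{oi}}\right|, &
\mathrm{NSE} &= 1-\frac{\sum_{i=1}^{N}\left(x_{oi}-x_{pi}\right)^{2}}{\sum_{i=1}^{N}\left(x_{oi}-\bar{x}_{o}\right)^{2}},\\[4pt]
\mathrm{RSR} &= \frac{\mathrm{RMSE}}{\sigma_{\mathrm{obs}}}, &
\mathrm{PBIAS} &= 100\cdot\frac{\sum_{i=1}^{N}\left(x_{oi}-x_{pi}\right)}{\sum_{i=1}^{N}x_{oi}},
\end{aligned}
```

where $\bar{x}_{o}$ is the mean of the observed values and $\sigma_{\mathrm{obs}}$ their standard deviation; these are the conventional definitions consistent with the verbal descriptions above.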
where N refers to the number of data points; x oi and x pi are the observed (desired) and predicted daily PM 2.5 concentrations, respectively.
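A compact implementation of these metrics, in their standard forms and matching the verbal definitions above, might look like this; the sample observed/predicted values are hypothetical:

```python
import numpy as np

def evaluate(obs, pred):
    """Compute RMSE, MAE, NSE, RSR, and PBIAS for observed (desired)
    and predicted values, per the definitions above."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    err = pred - obs
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    nse = 1 - np.sum(err ** 2) / np.sum((obs - obs.mean()) ** 2)
    rsr = rmse / np.std(obs)                      # RMSE over observed std
    pbias = 100 * np.sum(obs - pred) / np.sum(obs)
    return {"RMSE": rmse, "MAE": mae, "NSE": nse, "RSR": rsr, "PBIAS": pbias}

metrics = evaluate([10.0, 12.0, 8.0, 9.0], [9.5, 12.5, 8.0, 9.5])
```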

Monthly Variations of PM 2.5
Before proceeding to train models on the PM 2.5 concentrations, we first discuss the monthly means of PM 2.5 concentration to better understand the structure of the PM 2.5 data and potentially adjust the hyperparameters of the machine learning models. Figure 3 displays USA monthly anomalies and quantiles for 4 years, derived from the daily PM 2.5 values. The monthly anomalies were calculated by expressing each month's value as a percentage of the long-term monthly mean and then subtracting 100, which centers the average at 0 while keeping the values in percent form. To examine the data more closely, we calculated the minimum, maximum, first quartile, median, third quartile, and interquartile range for each month of the entire time period (see Figure 3). Inland locations in the USA had the highest levels of PM 2.5 in 2018, and these levels declined by approximately 20% in 2019. Secondary particle formation from the oxidation of NO x, SO x, and NH 3 gases results in the accumulation of PM 2.5 in the air (South Coast Air Quality Management District, 2017). A drastic change can be observed in the PM 2.5 concentrations before and during the pandemic years: PM 2.5 levels in the spring and summer months, particularly toward the end of summer (August) and early fall (September), were higher in the years before the pandemic than during it.
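One way to compute percentage anomalies of this kind (value as a percentage of the long-term monthly mean, minus 100) is sketched below; the monthly series is synthetic:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly mean PM2.5 values for four years.
idx = pd.date_range("2017-01-01", periods=48, freq="MS")
monthly = pd.Series(10 + np.random.default_rng(7).normal(0, 1, 48), index=idx)

# Percentage anomaly relative to each calendar month's long-term mean,
# minus 100 so the average anomaly is centered on zero.
clim = monthly.groupby(monthly.index.month).transform("mean")
anomaly_pct = monthly / clim * 100 - 100
```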
Among the observations from the monthly anomalies: they were greatest in 2018 compared with the other years; positive anomalies were observed most frequently in August of that year, whereas negative anomalies were observed more in September than in any other month of the same year. Since 2018 preceded the onset of COVID-19, PM 2.5 concentrations were higher than in the other years throughout the USA. Liu et al. (2022) likewise noticed that PM 2.5 concentrations had fallen by about 33.3% in February and March 2020 in all provinces of China. Another notable observation is that PM 2.5 values were higher in the eastern USA than in the western USA. The decrease in PM 2.5 levels was moderate in urban areas and smaller in rural areas, probably because PM 2.5 levels in rural areas were lower even before COVID-19 owing to lower air pollution overall. This is supported by Shakoor et al. (2020), who stated that reported COVID-19 cases and deaths were significantly correlated with air pollution in both China and the USA. That study concluded that the restricted human and anthropogenic activities during the lockdown caused by this novel pandemic disease resulted in a significant improvement in air quality by reducing the concentrations of environmental pollutants (Shakoor et al., 2020).

Evaluation of Model Performance
The nine machine learning models can describe the daily variations of the observed and predicted PM 2.5 concentrations; this is shown in Figures 4 and 5, where the blue curve represents the observed PM 2.5 concentrations (the desired values to predict) and the red curve represents the predicted PM 2.5 concentrations (our estimates). Although we generated time-series plots for each of the states, we display only one state from each part of the USA: California for the western side (Figure 4) and New York for the eastern side (Figure 5). This made it easier to visualize and represent the widely occurring patterns in the states along each side of the United States, since those states behaved very similarly to one another. It is worth mentioning that PM 2.5 concentrations can still vary spatially and temporally within the selected eastern and western regions. The western states face greater challenges in predicting PM 2.5 due to the influence of wildfires, while the eastern states may face issues related to industrial emissions or urban populations. In addition, PM 2.5 can exhibit significant regional variation due to factors such as industrial activities, population density, and geographical features. All the models also show the seasonal variability of PM 2.5 levels; most notably, PM 2.5 concentrations are nearly 2.3% and 1.8% lower in the spring and summer than in the autumn and winter for California and New York, respectively. These variations are due to atmospheric circulation during those seasons. Inherently, because the PM 2.5 data are noisier where or when air pollution is greater, the predictions for autumn and winter are comparatively less accurate than those for spring and summer.
Figures 6 and 7 display scatter plots of the observed versus estimated daily PM 2.5 concentrations for California and New York, respectively, during the period of observation using the different machine learning models. The scatter plots show a positive linear relationship between the two variables. Since the points lie approximately on a straight line, the differences between the observed and predicted values are close to zero, and hence there is a strong correlation between the two. Tables 1 and 2 present the performance and statistical metrics estimated for New York and California. The metrics of all models in Table 1 are for New York: RF, with R 2 = 0.899, MAE = 2.122, and RMSE = 3.121, has less error than the other models. The model with the next lowest error is the support vector machine, with R 2 = 0.857, MAE = 2.145, and RMSE = 3.125 (see Table 2).
The machine learning models that performed best on the PM 2.5 data were the SVR and RF models. However, in both of these models, the PM 2.5 predictions/estimations were 4.5% less accurate for California and the other western states than for New York and the other eastern states. This again is attributed to air pollution being more severe in the western states during the summers of the chosen time period because of the forest fires that occurred. Furthermore, sulfate concentrations could be a major influence on PM 2.5 levels, and although they appear to decrease from the east to the west of the US, there were higher amounts in California (Meng et al., 2018). The performance of the machine learning models across states was good at most sites throughout the USA, since 73% of them show an R 2 greater than 0.62 and only 10% show an R 2 value less than 0.3. Other evaluation metrics also reflect this good performance, such as an average RMSE of less than 4.5 μg/m 3 in 70% of the states. However, there are patterns in where the models deviate from the ground truth PM 2.5: estimates tend to be too low where PM 2.5 concentrations are high and too high where concentrations are low. This phenomenon was observed not only in our models' performance on US data but also by Zhan et al. (2017) for PM 2.5 concentrations in parts of China. At both extremes, in areas where PM 2.5 levels are very high and in areas where they are very low, training data are scarce because records of extreme concentrations are few. There are also differences in model performance among states, which could be due to individual states' weather patterns and the varying amounts of pollutants present in different seasons. Ghahremanloo et al. 
(2022) mentioned that PM 2.5 changes markedly across states because of air temperature. Increased temperature can also result in more biogenic emissions, which in turn leads to higher PM 2.5 concentrations (Ghahremanloo et al., 2021). For example, our results corroborated Ghahremanloo et al. (2021)'s finding that PM 2.5 levels peaked in the summer in Texas, since humidity and higher temperatures catalyze the formation of nitrate and sulfate from nitrogen dioxide and sulfur dioxide. We analyzed the models' performance on the states chosen to represent the western and eastern sides of the United States: California and New York, respectively. The performance of the best model (RF) was strong: California's R 2, RMSE, and MAE were 0.77, 3.051 μg/m 3, and 2.233 μg/m 3, respectively; New York's R 2, RMSE, and MAE were 0.899, 3.121 μg/m 3, and 2.12 μg/m 3, respectively. From the RF model's results, we observed that California's PM 2.5 concentration values and biases were slightly higher than New York's. This trend extends to other states on both sides of the United States, as average error values are 1.2% lower in the eastern states than in the western states. Each of the chosen evaluation metrics was produced for all the machine learning models used, and from these results the RF and SVR models performed best; that is, the RF and SVR models best predicted the PM 2.5 concentrations in any region of the US.
Comparing the performance of the SVR model to the RF model, we observed that on average the R 2 of the SVR model is higher by 5%, although its other metrics are worse. For both models, the biases are approximately 15% lower in the eastern states than in the western states. This could be due to an external factor, which we discerned as higher sulfate concentrations from the ship emissions produced in the Los Angeles and Long Beach areas, which account for a fourth of all cargo container traffic in the United States (http://www.dot.ca.gov; Vutukuru & Dabdub, 2008). Another factor behind weaker model performance is that air pollution is worse during winter and autumn than in spring and summer, leading to less accurate PM 2.5 estimates. Among the nine machine learning models used, RF and SVR were the only ones that gave desirable results even in the mildest air pollution cases. The worst-performing of the nine models was the LSTM, which we found neither reflected the variations of PM 2.5 concentrations nor estimated them accurately. A Taylor diagram can display multiple metrics in one plot and summarizes the relative model performance across several states' PM 2.5 model outputs (in the case of this paper). Fundamentally, this diagram characterizes the statistical relationship between two fields; here, the two fields are the observed and estimated PM 2.5 values. In this paper, we use "observed" for the measured values from the PM 2.5 ground stations and "predicted" for the values simulated by a machine learning model. Figure 8 shows Taylor diagrams illustrating the standard deviation and correlation of the SVR and RF models, respectively, for each state. The position of each number on the plot quantifies how close the predicted and observed PM 2.5 values are; for example, state 50 has a correlation of 0.78 (correlation between predicted and observed PM 
2.5 values).The dotted line contours in both figures represent normalized standard deviation values; in the case of state 50, the value is centered at around 1.65.Intuitively, the closer the predicted values for PM 2.5 are to the observed dotted line, the lower the prediction error.Most of the state values are quite close where some are slightly further from the observed values, and interestingly the predicted PM 2.5 values which are not as close to the observed values tend to be greater than the observed PM 2.5 values.

Conclusion
In this paper, we presented the prediction of PM 2.5 concentrations over the USA with various machine learning algorithms and compared their performance. Since machine learning models are a promising approach for analyzing large data sets, we used linear regression, decision tree, gradient boosting, AdaBoost, XGBoost, K-NN, LSTM, RF, and SVR to better predict PM 2.5 levels.
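A comparison of this kind can be set up uniformly in scikit-learn, since most of the listed models share the same fit/predict interface. The sketch below uses synthetic regression data as a stand-in for the station feature table, and the RF and SVR hyperparameters stated earlier (100 trees with max depth 20; degree-3 polynomial kernel); the remaining choices are illustrative defaults:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.metrics import r2_score

# Synthetic stand-in for the PM2.5 feature table used in the study.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "LR": LinearRegression(),
    "DT": DecisionTreeRegressor(random_state=0),
    "RF": RandomForestRegressor(n_estimators=100, max_depth=20, random_state=0),
    "KNN": KNeighborsRegressor(),
    "SVR": SVR(kernel="poly", degree=3),
}
# One held-out R2 score per model, computed the same way for each.
scores = {name: r2_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
```

The boosting models (GBR, ABR, XGB) and the LSTM slot into the same loop, the latter behind a thin wrapper, which is what makes the nine-way comparison straightforward.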
The accuracies of the machine-learning-based methods vary across the USA's states, but the performance of RF (California: R 2 = 0.77, NSE = 0.817, PBIAS = 7.022, and RSR = 0.355; New York: R 2 = 0.899, NSE = 0.811, PBIAS = 2.989, and RSR = 0.331) and SVR (California: R 2 = 0.71, NSE = 0.897, PBIAS = 7.027, and RSR = 0.424; New York: R 2 = 0.857, NSE = 0.280, PBIAS = 3.011, and RSR = 0.338) was better than that of the other examined models. However, model performance also depended on factors such as climate and other region-specific features that vary by season, which affected the consistency of model accuracy. Across California and New York, the RF and SVR models' R 2 scores were between 0.71 and 0.899, RMSE scores ranged between 3.05 and 3.714, NSE values ranged between 0.811 and 0.899, PBIAS ranged between 2.989 and 7.027, and RSR scores ranged between 0.331 and 0.424. These metrics showed that RF and SVR were robust machine learning models for this task, most likely because of the sheer amount of data available.
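For reference, the three hydrology-style metrics quoted above have standard definitions; a minimal implementation (note that PBIAS sign conventions differ across papers, and the study's convention is not specified):

```python
import numpy as np

def nse(obs, pred):
    # Nash-Sutcliffe efficiency: 1 is a perfect fit; 0 means the model
    # is no better than predicting the mean of the observations.
    return 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)

def pbias(obs, pred):
    # Percent bias; positive here when the model underestimates on average.
    return 100.0 * np.sum(obs - pred) / np.sum(obs)

def rsr(obs, pred):
    # RMSE normalized by the standard deviation of the observations.
    return np.sqrt(np.mean((obs - pred) ** 2)) / np.std(obs)

obs = np.array([10.0, 12.0, 9.0, 14.0, 11.0])
pred = np.array([10.5, 11.2, 9.8, 13.1, 11.9])
```

Lower RSR and |PBIAS| and higher NSE indicate better fit, which is why the RF and SVR columns dominate the comparison tables.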
Our study's main task of predicting PM 2.5 concentrations can contribute to limiting human health exposure risks and aiding epidemiological studies of air pollution. This is especially relevant given the onset of COVID-19 and the greater aggravation of symptoms in areas with higher PM 2.5 concentrations. With predictions that outperform statistical methods, machine learning can serve as an essential tool for decision-makers devising sound PM 2.5 policies. For better results in the future, real-time measurements of the chemical composition of PM 2.5 would help estimate PM 2.5 concentrations more accurately.
In order to better estimate PM 2.5 concentrations around the world in the future, we need features such as GDP per capita, urbanization data, and other atmospheric parameters to train machine learning models. Within this paper's area of interest, the United States needs more ground monitoring stations: there are only about 1,000 in such a large nation, so the coverage is quite sparse (Figure 1). Some states, especially in the central United States, face this problem more than others, and it remains an important issue to address because understanding the spatial and temporal distribution of PM 2.5 in each region is helpful, particularly in rural areas, which have little data in general. This lack of data contributes to some of the error in models predicting PM 2.5 . In addition, the machine learning models could be trained online to accommodate new PM 2.5 concentration data as it is added by the ground monitoring stations, improving their predictive ability.

Acknowledgments

The first author (PPV) acknowledges the Jet Propulsion Laboratory (JPL) for the opportunity provided through their summer internship program. Author JHJ conducted research at the Jet Propulsion Laboratory, California Institute of Technology, under contract with NASA. We sincerely acknowledge the open air quality group for providing the PM 2.5 station data used in this study.

Figure 1. Locations of PM 2.5 monitoring sites over the USA.

Figure 2. Architectural flow diagram of the proposed PM 2.5 prediction model using various machine learning models.

Figure 3. Monthly anomalies and quantiles for the observed period (2018-2021) using daily PM 2.5 values over the United States.

Figure 8. Taylor diagrams illustrating the standard deviation and correlation of the SVR and RF models, respectively, for each state.

Table 2. Different Model Metrics for California State