Hourly and Daily PM2.5 Estimations using MERRA-2: A Machine Learning Approach

Health and environmental hazards related to high pollutant concentrations have become a serious issue from the perspectives of public policy and human health. The objective of this research is to improve the estimation of grid-wise PM 2.5 , a criteria pollutant, by reducing systematic bias in estimating PM 2.5 empirically from speciation provided by MERRA-2 using a ML approach. We present a unique application of machine learning (ML) for estimating hourly PM 2.5 concentrations at grid points of Modern-Era Retrospective analysis for Research and Applications version 2 (MERRA-2). The model was trained using various meteorological parameters and aerosol species simulated by MERRA-2 and ground measurements from Environmental Protection Agency (EPA) air quality system (AQS) stations. monitors. The ML approach signiﬁcantly improved performance and reduced mean bias in the 0-10 μ g m -3 range. We also used the Random Forest ML model for each EPA region using one year of collocated datasets. The resulting ML models for each EPA region were validated and the aggregate data set has a Pearson correlation of 0.88 (RMSE = 4.8 μ g m -3 ) and 0.82 (RMSE = 5.8 μ g m -3 ) for training and testing, respectively. The correlation (and RMSE) increased to 0.89 (4.0), 0.95 (1.6), 0.94 (1.1) for daily, monthly, and yearly average comparisons. The results from initial implementation of the ML model for global region are encouraging but require more research and development to overcome challenges associated with data gaps in many parts of the world.


Introduction
Deteriorating air quality has emerged as a major concern of the twentieth century since it impacts global climate change and health hazards, and significantly influences socioeconomic policies.Air pollution claims three times more lives than road traffic accidents globally (Myllyvirta, 2020;World Health Organization, 2018).In addition, a recent report from the Centre for Research on Energy and Clean Air (CERA) estimates economic costs resulting from air pollution to be 2.9 trillion USD (Myllyvirta, 2020).According to the report, in 2018, PM2.5 (mass concentrations of fine particle less than 2.5 µm in aerodynamic diameter) caused 4.5 million premature deaths and was responsible for 1.8 billion days of work absence.In another report, the EPA (US EPA, 2015) estimated that after implementation of the Clean Air Act of 1970, 2.3 million premature deaths were avoided in the next 40 years.The report also estimated that the cost to benefit ratio was $1 to $30 for moderate estimates; the high estimate was $1 to $90, and the low estimate was $1 to $3.These estimates imply that economic benefits exceeded costs by at least a factor of 3. Therefore, developing policies specific to environmental pollutants is of paramount importance, and estimating the quantity and impact of pollutants is a necessary step.
With recent advancements in measurement techniques, remote sensing observations, ground and surface observations, determining the atmosphere's composition (physical and chemical) and its constituents (ozone, SO2, PM2.5, NOx etc.) has become easier.Despite these advancements, generating a spatially complete dataset is still challenging.It is also difficult to collect in-situ measurements in remote geographical locations.Deploying a uniformly and densely distributed network of stations is not cost effective, and such a network would be demanding to manage.Satellite Remote sensing has its own limitations such as measuring under cloud cover (Christopher & Gupta, 2010), retrieving data in snowy regions, scanning complex surfaces, and detecting heavy aerosols layers near sources (Hoff & Christopher, 2009).Significant effort and resources have been devoted to developing systems that improve grid-wise estimation of atmospheric constituents and the state of the atmosphere.However, these systems have been limited by the quality and amount of data (Ghahremanloo et al., 2021) and by the performance of numerical and/or chemical transport models (Pouyaei et al., 2020).Addressing these limitations incorporates both in-situ measurements and remote sensing products into data assimilation techniques (Bocquet et al., 2019;Jung et al., 2019), which are limited by the availability of the respective data products.When combined with numerical modeling and data assimilation, these measurement methods provide the true state of the atmosphere at any given point in space and time.Over several decades, studies have reported significant advances in measuring and assessing surface features (Mulla, 2013) as well as forecasting and managing air quality (Mhawish et al., 2018).Satellite remote sensing data sets have contributed essential data pertaining to the global distribution (Christopher and Gupta, 2020;Lee et al., 2016;Martin, 2008), evolution (Q.Zhang et al., 2012), and transport (Kim et al., 2017;Y. Wu et al., 2018) of atmospheric pollutants.As such, methods to estimate surface concentrations of PM2.5 between in-situ measurement stations are essential to address the accuracy limitations of model simulations and spatial limitations of air quality stations.
Despite the efforts made, establishing a reliable and spatially complete dataset is still a challenge, primarily due to the lack of adequate ground-based observations, missing data due to cloud cover in passive satellite remote-sensing, and uncertainties in methods and outputs from methods.In this study we aim to address these issues and develop a ML model that can estimate hourly PM2.5 at MERRA-2 grid resolution by training data-driven models for the United States.To do this, several ML models were developed similar to the Gupta et al., (2021) model.The models were trained using in-situ measurements and MERRA-2 generated meteorology and aerosol diagnostic parameters.The developed models were then evaluated and selected based on their statistical performance.Once an ML model was selected, two different analyses were performed: regional analysis, wherein the selected model was optimized for the regions defined by the EPA, and comparative analysis with empirically estimated PM2.5 from MERRA-2.

Data and Methods
The proposed methodology focuses on the estimation of the hourly PM2.5 concentration based on the hourly time-averaged meteorology and aerosol products from MERRA-2. Figure 1 shows the process flow diagram of the proposed methodology.The methodology includes data retrieval from MERRA-2 and ground stations, spatial and temporal collocation, and model training.To determine the best model, various ML models were trained and evaluated based on several statistical parameters.The ML models were trained using meteorology and aerosol diagnostics from MERRA-2 reanalysis runs as input and spatio-temporal collocated ground measurements as output.The detail methodology is discussed in further sections.

Models
In this study, several ML models and a numerical model based on reanalysis were used and are described in detail in the following sections.
MERRA-2 also produces 3-hourly analyses of gridded aerosol diagnostics on a global scale by assimilating bias corrected-AOD at 550nm.The AOD values are retrieved from various satellite sources including Moderate Resolution Imaging Spectroradiometer (MODIS), Advanced Very High Resolution Radiometer (AVHRR), and Multi-angle Imaging SpectroRadiometer (MISR) and the surface-based AErosol RObotic NETwork (AERONET) for data assimilation in MERRA-2 (Randles et al., 2017).It is important to note that due to various data availability issues, specific sensor's data assimilation into MERRA2 varies by years.For example, MODIS data assimilation were only after year 2000 and MISR data assimilation were discontinued after year 2015 (Randles et al., 2017).The aerosol components (dust, sea salt, black carbon, organic carbon, and sulfate) are derived from simulating the Goddard Chemistry Aerosol Radiation and Transport (GOCART) model (Chin et al., 2002;Colarco et al., 2010) coupled with the GEOS atmospheric model.Since MERRA-2 provides both meteorological parameters and aerosol diagnostic species at a comparatively fine resolution of approximately 50km× 65km (0.5° latitude and 0.625° longitude) on a global scale, we used these derived products for this study.In addition, PM2.5 can be calculated using aerosol diagnostic products of MERRA-2, shown in equation 1 below (Buchard et al., 2016b;Malm et al., 1994Malm et al., , 2011)).

Machine Learning Models
From equation 1, the PM2.5 can be estimated at grid level, but the relationship has several missing terms, like nitrate and ammonium concentrations, (Buchard et al., 2016b(Buchard et al., , 2017) ) that may constitute the total PM2.5.Also, the equation assumes that OC, BC, and SO4 sizes are less than 2.5 µm.Due to these issues and model uncertainties, biases (low or high) can be introduced in PM2.5 estimations.The semi-volatile and volatile species can bias PM2.5 toward low or it may be high because of size growth due to soluble organic/inorganic species (Malm et al., 2011).Since ML models can help reduce systematic biases by handling complex non-linear relationships between meteorology and surface aerosols, we compared several regression and ML models as discussed below for estimating hourly PM2.5.
The regression models used in this study were Ordinary Least Square (OLS), Ridge, and Lasso regression.The ML models used were Stochastic Gradient Descent (SGD), k-nearest neighbor (KNN), Adaptive Boosting (AdaBoost), Gradient Boost (GB), Extreme Gradient Boost (XGB), Support Vector Machine (SVM), and Random Forest (RF).The OLS model minimizes the residual sum of squares between inputs and output to fit a linear model with coefficients and intercepts.Alternatively, Ridge and Lasso Regression regularize model by imposing penalty (Friedman et al., 2010;Koh et al., 2007;Rifkin and Lippert, 2007) In SGD, a linear model is fitted using an SGD learning method with penalties to minimize the loss function (Y. A. LeCun et al., 2012).
KNN estimates a target (output) by interpolating the nearest neighbors of the target in the training set.In boosting methods like AdaBoost, GB, and XGB, the process depends on boosting the performance of weak learners, or the decision trees with single splits.In AdaBoost, more weight is given to observations that are difficult to classify.In the GB method, weak learners are boosted by optimization (minimizing loss) using gradient descent-like methods.XGB is an efficient and computationally fast implementation of the GB method.SVM is a supervised learning technique in which the model makes a discrete estimation based on best fit.The best fit line represents a hyperplane containing the maximum number of points.Unlike other regression models, SVM doesn't try to minimize the loss function.Instead, it attempts to predict the best fit based on the distance between the hyperplane and boundary line.RF is a decision tree-based model that ensembles various models (trees) for estimations.It fits several decision trees on various subsamples of a dataset and improves accuracy by averaging and controlling overfitting.

Data and Pre-processing
The models were trained for the Continental United States (CONUS), Alaska, Puerto Rico, and Hawaii.Figure 2 shows the number of valid pairs of spatiotemporally collocated hourly observed, and MERRA-2 parameters at PM2.5 monitoring stations (AirNow).The reanalysis data was obtained from MERRA-2 for hourly averaged meteorology and aerosol components for the year 2018.Table 1 lists all the parameters from MERRA-2 used in this study.In addition, the parameters listed in Table 1, sun-earth distance, solar zenith angle, latitude and longitude were also used as the input parameters for the model.These additional parameters accounts for change in season (sun-earth distance), diurnal cycle (solar zenith angle) and geographical location (latitude/longitude) in ML model.The parameters were selected based on a previous study by Gupta et al., (2021).The importance of meteorology factors such as temperature, humidity, surface condition, and vegetation is well known and critical in the formation and transport of aerosols via pressure, wind speed and direction (Gupta & Christopher, 2009;Pandis, 2004;Seinfeld & Pandis, 2016).The aerosols that represent surface PM2.5 were selected from the MERRA-2 aerosol diagnostic product.Additionally, to represent the total aerosols in the atmospheric column, total aerosol optical thickness (AOT) measured at 550nm from MERRA-2 was selected.Hourly concentration surface measurements of PM2.5 for 2018 were obtained from the AirNow network managed by the EPA.Several state, city, and county level data collection agencies share their data of various air quality parameters to the EPA which prepares, and quality checks the data.For this study, 882 unique monitoring sites were selected based on collocated data availability.The spatial collocation is performed by selecting the nearest MERRA-2 grid to the EPA monitoring station by spherical distance (< 38.5 kms; Gupta et al., 2021).Due to the coarse resolution of MERRA-2, some of the stations were assigned the same grid point.Temporal collocation was accomplished by selecting the same hour of data from both EPA monitor and MERRA-2 outputs.The final integrated datasets included 2,128,126 valid collocated data points.After collocating data, multiple ML models were trained with only 10% of data points which were randomly selected (212,813 of 2,128,126 valid data points).The data was further split into a 70:30 ratio for model fitting and validation.Several regression and ML models were trained with 70% of the selected data (148969), and the remaining 30% (63844) was used for testing.The reduced sample size was only used to select the best-performing algorithm.The most successful ML model was evaluated further using full datasets.

Results and Discussion
To evaluate the models discussed in this study, a 10-fold cross validation algorithm was used.In this algorithm, 10 different sets of 9:1 train-test splits were created such that each set was unique, and when all the "test" samples in the 10 sets were combined gives the whole dataset.Each set was evaluated for various statistical performance parameters like root mean squared error (RMSE), mean bias (MB), Pearson linear correlation coefficient (R), slope of the fitted line, percentage error, and computational time.In addition to comparing the ML models, the performance was also evaluated compared to MERRA-2 estimated PM2.5 (Eq.1).

Evaluation of ML Models
All the ML models were trained and evaluated on several statistical parameters.returned low RMSE of 2.61 µg m -3 and 7.15 µg m -3 for training and testing, respectively.The computation time was less than 10 minutes, which was higher than most models, but RF outperformed all other ML models in other statistical metrics.The linear models like; OLS, Ridge, Lasso and SGD; it was expected to have a lower performance metrics due to non-linearity of PM2.5.
RF performed better than boosting methods like Adaboost, graideint boost and XG, because it tries to reduce the variance instead of bias, which in the case of PM2,5 (having diurnal variation) improved the overall accuracy.Since, RF model is an ensemble of multiple predictive decision tree, it effectively reduces biases in estimation and hence have improved the performance over other techniques.Although, there are some inconsistencies in performance between training and testing in RF models, but 10-fold cross validation further improved the consistency (table 3).Therefore, for this study, we selected RF for further evaluations.
Once the RF model was selected, 10-fold validation was performed with all valid data (2,128,126 data points).From this data, 10 sets (folds) of training and testing with a 9:1 split were prepared.
Each testing set contained a unique amount of data and, when combined, included all data.Table 3 lists the performance statistics of all 10-folds.All folds performed similarly on all evaluation metrics, suggesting that the model was robust and could be used for any dataset combinations.

Model Performance on EPA regions
In the next phase of RF model evaluation, the whole dataset was divided into 10 geographical regions assigned by the EPA (US EPA, 2020).Supplementary figure S1 shows each EPA region.Although Puerto Rico, Hawaii, and Alaska were assigned to regions 2, 9, and 10 respectively, they were evaluated separately as they are not part of CONUS.The RF model was trained with 70% of the data points from each region separately and tested on the remaining 30% of data with six different combinations of number of estimators (N), maximum depth (D), and minimum sample leaf (m), which are sklearn RF parameters.The best combination for each region was selected to further evaluate the RF model.Atlanta).The performance of the ML model in estimating primary and secondary pollutants (like ozone, nitrogen dioxide, PM, etc.) is affected by several meteorological and anthropogenic factors such as urbanization, diurnal variability, wind speed and direction, and boundary layer height (vertical mixing) (Sayeed, Choi, et al., 2021).Since each EPA region in the US has its distinct anthropogenic (emission, urbanization etc.), geographic (geo location, elevation etc.) and meteorological (seasonality, diurnal variability) features and characteristics, it is challenging for a generalized model to have a similar performance in all regions.It is also evident from figure 2 and supplementary figures S3 and S4 that regions with the larger concentrations range of PM2.5 performed better than regions with lower concentrations ranges.This suggests that the ML model is sensitive to concentration variability, performing better in regions with high PM2.5 compared to regions with uniform PM2.5.This is possibly due to the skewness of data (number of data-points with low concentrations far exceeds the high concentrations), while the higher representations of highs make the performance better, its lower representation makes the performance comparatively poor.This is an expected behavior of regression or ML based models where data distribution often force model to estimate mean better than extreme values (low and high) (Gupta et al., 2021;Gupta & Christopher, 2009;Ma et al., 2021).It was also noted that, in regions with less variability, the RF model performed better in areas with less than 30,000 data points, suggesting that adding more data to a less variable region makes the model susceptible to errors possibly due to overfitting.
Next, we present results of inter-comparison of training and testing separately while combining data from all the regions (i.e., Table 4).Figure 4a shows density scatter plot of aggregated data points from all the regions together.The left panel shows training datasets while right panel is for testing datasets.The scatter plot demonstrates overall good correlations with consistent performance between training (r=0.88) and testing (r=0.82).The mean bias is close to zero in both datasets with RMSE changes from 4.8 µg m -3 in training to 5.8 µg m -3 in testing datasets.The slope values of (~0.7 and 0.62) implies overall underestimation of PM2.5 by ML models.We further evaluated the ML model performance by combining data from training and testing and presenting as diagnosis and prognosis errors in Figure 4 (b) and (c) respectively.The complete collocated dataset was sorted according to PM2.5, binned into bins with equal number of collocations (1000 data points), and then mean, median and standard deviation of each bin was calculated.Orange dots and line show the mean.The green dots and line show the median.The grey shaded area indicates one standard deviation of each bin.The horizontal black dotted line denotes zero difference.Positive values of bias indicate that estimated PM2.5 is higher than observed AirNow PM2.5 and vice-versa.The figure 4(a) shows that ML model overestimates PM2.5 values when observed values are lower than 10 µg m -3 otherwise it underestimates for the remaining observed range of PM2.5.The negative bias is more stable for values lower than 100 µg m -3 but then decreases exponentially for the higher PM2.5 values.This behaviors of bias with observed PM2.5 suggest a manifestation of data distribution with high density of low concentration values, produces better training for the ML model and  consistent with our previous findings (Gupta et al., 2021).Figure 4 (c) show bias as a function of estimated PM2.5 with more flat behavior for estimated PM2.5 values smaller than 100 µg m -3 but then underestimates.The diagnosis and prognosis error analysis are as expected and suggests that applying more advanced data sampling methods such as oversampling (Vu et al., 2022) may require to achieve better results for complete range of observed PM2.5.

Comparison of RF model with MERRA-2 estimates
The RF model estimates were compared to MERRA-2 estimated PM2.5.PM2.5 is not a direct product of MERRA-2, rather it was calculated empirically using equation 1.Additionally, MERRA-2 has the advantage of providing data everywhere, even in places and times when observations are not available.However, MERRA-2 data is representative of a ~50x50 km 2 grid box area and has limitations due to uncertainties in emissions (Buchard et al., 2016a) Figure 5a&b shows the density scatter plot comparison of PM2.5 estimated by MERRA-2 (equation 1) and the RF model (MERRA-2-ML) respectively.The scatter plots are plotted with aggregate of all data (training and testing combined) with observations as x-axis and estimations as y-axis.The dotted black line represents the 1:1 line.These clearly show the larger scatter in MERRA-2 PM2.5 when comparing with ground monitors, with slope value of 1.15, mean bias of ~ 4.1 µg m -3 and RMSE of 28.7 µg m -3 .The plot shows that the MERRA-2 estimates have high positive bias with low correlation (r=0.38).This is also evident in supplementary figure S5; during summer and spring, MERRA-2 estimates exhibit large biases, and MERRA-2-ML successfully reduced those biases.MERRA-2 shows a better agreement with Interagency Monitoring of Protected Visual Environments (IMPROVE) sites as compared to EPA-AQS sites as IMPROVE sites are generally located in rural remote areas having less emission variability (Buchard et al., 2016a).During fall and winter, the MERRA-2 overestimation was reduced, but the RF model still performed better.Figure 5 c, d, & e shows the scatter density plot for the RF model (MERRA-2-ML) where y-axis represents the estimation and x-axis represents the observations.The narrow spread and concentration of high density of points near 1:1 line suggest that the estimated PM2 .5 is in good agreement with observed PM2.5. Figure S6 shows the regional daily mean PM2.5 concentrations that were observed and the concentrations modeled by MERRA-2 and MERRA-2-ML.
During the 2018 California wildfire season, very high PM2.5 concentrations were observed in August through September and again in November in regions 8, 9, and 10.MERRA-2 estimated high concentrations for these periods, but the estimates exhibited large overestimations with days exceeding observations by 75 µg m -3 (see figure S6).In contrast, the MERRA-2-ML (RF) model did not overestimate by 6 µg m -3 except for four days in Puerto Rico where bias was between 6-13 µg m -3 .Figure 5b shows the daily, monthly, and yearly scatter plots for observed and estimated PM2.5.As the averaging window increases, the correlation between observations and estimations also increases.The negative bias in lower range (<10 µg m -3 ) of observed PM2.5 remains an issue even for larger time averages.

Assessment of Western US fires using estimated PM2.5
To evaluate the RF model performance on the out-of-box scenarios, the model was tested by estimating hourly PM2.5 from August 20-27, 2020.The previously trained models (using 2018 data) for each EPA region were used to estimate the hourly PM2.5 at each grid corresponding to MERRA-2 for the CONUS.This time period was selected because, during this period, several fires were reported in the western US. Figure 6 shows a series of corrected reflectance RGB images from Suomi NPP Visible Infrared Imaging Radiometer Suite (VIIRS) satellite.The superimposed red dots are fire locations detected by VIIRS (https://worldview.earthdata.nasa.gov/).The second column shows observed PM2.5 from the EPA AirNow network, and the last column includes maps of estimated PM2.5 using MERRA-2-ML.The data corresponds to 13:00 local time for a) Aug 20, 2020, b) Aug 22, 2020, c) Aug 24, 2020, and d) Aug 26, 2020.The RGB images (column 1) shows fire locations (red dots) and smoke (grayish haze type looking feature in the image) transport across the western US.The AirNow network specifically in CA observed high PM2.5 (>60 µg m -3 ) values and corresponds to visible thick smoke plume seen in RGBs.It is important to note here that visible smoke in RGB can be located at any height in the atmosphere and not necessary affect surface air quality.By comparing the RGB and PM2.5 maps, it appears that spatial patterns in estimated MERRA2-ML PM2.5 matches with larger smoke plumes.The RF model was able to capture the movement of the plume from west to east during August 20-26, 2020.RF model results show a high concentration of PM2.5 east and northeast of the recorded fire on Aug 20, 2020, which is also evident from the VIIRS image as haze.The plume can be seen migrating from California in the west to the northern US, crossing over the Rocky Mountains, and then moving down toward the central US.A similar rise in PM2.5 is also seen in these regions, as estimated by the RF model.The ability of MERR2-ML to map the smoke plumes and provide corresponding surface PM2.5 in areas with sparse PM2.5 ground measurements can be very useful in time of increasing wildfires in the US.The refined MERRA2-ML data can also be used to inter-compare with other regional model outputs for both analysis and forecasting.

Global Implications and Challenges
The new ML models presented here are designed to work only over the Continental United States (and perhaps other regions with similar environmental conditions).Implementing the similar ML for the global regions may be challenging and requires more extensive research development.In the CONUS we had a relatively good dense network of ground stations to work with: 800+ AirNow stations with distribution across various EPA regions.We were able to train the ML models by using relationships in input aerosols components and meteorological parameters from MERRA-2 to estimate surface PM2.5.The lack of ground monitors, varying amount of data availability, change in aerosol type, varying meteorological conditions, and aerosol transport in other parts of the world will make accurate estimation of PM2.5 from MERRA-2 a more complex problem.However, because the results over CONUS have been so encouraging, we decided to train a similar RF model for entire globe using same one year of collocated datasets using more than 3000+ ground station.The ground data for global regions were collected from OpenAq platform (openaq.org).We had total six million plus collocated data points to train and test a ML model.Figure 7 presents the comparison between observed PM2.5 with ML estimated PM2.5 from MERRA-2 for the entire global data set, combining data for both training and testing of ML model (Random Forest).The results are surprisingly good.The global ML model has mean bias close to zero, with slope of 1.02 and RMSE of 6.4 µg m -3 globally.These initial results and analysis are very encouraging, but more in-depth analysis for specific regions will be required to better understand the performance and limitation of ML approach before applying it operationally at global scale.

Conclusions
In this study, we developed ML models to estimate hourly PM2.5 using aerosol and meteorological parameters from MERRA-2 reanalysis for United States.Ten different ML algorithms were tested to select the best algorithm to accurately model the surface PM2.5.Comparing these ML algorithms showed that the Random Forest algorithm performed best and computationally faster than other ML algorithms like SGD, KNN, Adaboost, Gradient Boost and XGB, and linear regression models like Lasso and Ridge.The RF model correlation values were ~25% better for training and ~10% better for testing compared to KNN, which was the next best model.The RF model was further evaluated with 10-fold cross validations.The cross validations were performed by dividing all valid data into ten sets.The sets had similar performance metrics and average correlations.Averaging all ten data sets, the overall correlation was 0.82, and the RMSE was 5.70 µg m -3 with mean bias close to zero.To further improve performance, separate regional RF models were developed with a combination of model parameters.
Regional RF model performance improved from east to west.The best performing regions were regions 9 and 10, Alaska, Puerto Rico, and Hawaii.While variability in observed PM2.5 is the likely reason for favorable results in regions 9 and 10, better performance in Alaska, Puerto Rico, and Hawaii is likely due to fewer data points.Variability in large data sets produces a greater number of decision trees in the RF model which improves the performance.Attaining higher estimation accuracy for data sets with less variability can be achieved by using less data.However, using fewer data points might also induce more error in out-of-box scenarios.Therefore, larger data sets are preferable to smaller data sets despite having lower accuracy.This is one of the limitations of RF models.
The RF model was also compared with MERRA-2 empirical estimation of PM2.5.The MERRA-2 grid estimations (larger spatial area average) largely overestimated as compared the observed PM2.5 at point location by ground monitors.While the ML model has small and uniform bias throughout the day and in all seasons, MERRA-2 exhibits comparatively higher overestimation between 07:00 UTC and 13:00 UTC.MERRA-2 also produces a slightly lower mean of overestimation during winters, but all other seasons the mean was greater than 0. Since the ML model was better at estimating PM2.5 than empirical calculations from MERRA-2, it can be extended to other years and grids of MERRA-2 reanalysis while taking account in change in assimilation datasets into MERRA2 datasets.This could build a spatiotemporal trend for further research and model development.Another application of such algorithm could be bias-correction of PM2.5 forecasts from the GEOS-FP model at regional scales (Gupta et al., 2021).
To further determine the robustness of the developed model, a special case of western US fires during August 20-26, 2020, was studied.The regionally trained model was used to estimate PM2.5 at each MERRA-2 grid location for the period.The estimated values accurately represented the plume seen in VIIRS true color images and observed PM2.5, suggesting the RF model can perform well in the out-of-box scenarios as well.

Figure 1 :
Figure 1: Schematic diagram of the methodology used for estimation of PM2.5.The color code of boxes represents the major steps in model development.The boxes in green represents data pre-processing; the blue boxes represent model selection, and the orange box represents model fitting.

Figure 2 :
Figure 2: Locations of PM2.5 measurement stations in the United States binned by number of valid observations in the year 2018.Each circle represents the location of a measurement station, and the color represents the number of valid observations in the year 2018.

Figure 3 :
Figure 3: Regional performance of Random Forest model on test dataset; Station-wise.a) Pearson correlation of RF model trained by region; b) MB of RF model trained by region; c) RMSE of RF model trained by region.

Figure 4
Figure 4: a) Scatter plot of estimated PM2.5 by RF (aggregate of region models): left panel-training, right paneltesting.Black dotted line represents 1:1 line whereas, blue dotted line represents linear fit to the data; b) Mean bias (µg m -3 ) in estimated PM2.5 by RF (aggregate of region models) as function of observed PM2.5.c) Mean bias (µg m -3 ) in estimated PM2.5 by RF (aggregate of region models) as function of estimated PM2.5.[Bias and RMSE in µg m -3 ]

Figure 5 :
Figure 5: Scatter plot comparison of: a) MERRA-2 estimated PM2.5; b) ML model (RF) estimated PM2.5; c) Daily estimated PM2.5 for MERRA-2-ML of all stations combined; d) Monthly estimated PM2.5 for MERRA-2-ML of all stations combined and; e) Yearly estimated PM2.5 for MERRA-2-ML of all stations combined; The color bar represents the number density.The black and blue dotted lines represent the 1:1 line and line of linear fit for the data respectively.Units are in µg m -3 .

Figure 7 .
Figure 7. Density scatter plot of observed and ML estimated MERRA2 surface PM2.5 over global region.In this plot, data from both training and testing are combined.Solid black line is 1:1 line.Units are in µg m -3 .

Table 1 :
List of parameters from MERRA-2 reanalysis used as input to train ML models.
Table2lists ML performance statistics for correlation, MB, RMSE, and computational time.The regression models (OLS, Ridge, Lasso) were fitted within a fraction of second and had similar RMSE and MB compared to other methods, but these models demonstrated very low correlation and high percentage error.KNN and SVM had low percentage error and better correlation than regression methods, but the computational time was much longer.GB and XGB had high percentage error and correlations of 0.66 for training and 0.61 for testing data.Adaboost was the least successful ML model.RF exhibited the highest correlations: 0.97 for training and 0.69 for testing.RF also

Table 2 :
Performance metrics of all ML models used in the study.

Table 3 :
Performance of all 10-folds on testing sets.

Table 4
88±14.06 µg m -3 ) compared to other regions (8.45±8.62 µg m -3 ).Supplementary figureS2shows the frequency distribution of observed PM2.5 by region.Puerto Rico and region 8, which includes Colorado, Utah, Montana, Wyoming, and North and South Dakota, have correlations ~0.88 for training and ~0.80 for testing.In the Northeast, regions 1 and 2 had training correlations of 0.76 and 0.77, respectively, and about ~0.70 for testing.The lowestperforming regions were from the Midwest (region 7: Kansas City) and Southeast (region 4: helps in better training of model due to uniform distribution of data-points both spatially and temporally.Additionally, even though Hawaii and Alaska have lesser number of data-points it was amongst the best performing region due to uniform PM2,5 variability throughout the region.Region 9, which includes California, Nevada, and Arizona, had the highest RMSE due to greater PM2,5 concentrations (10.

Table 4 :
Regional comparison of Random Forest model with best set of training parameters for each region.