Inferring the main drivers of SARS-CoV-2 global transmissibility by feature selection methods

Identifying the main environmental drivers of SARS-CoV-2 transmissibility in the population is crucial for understanding current and potential future outbursts of COVID-19 and other infectious diseases. To address this problem, we concentrate on the basic reproduction number $R_0$, which is not sensitive to testing coverage and represents transmissibility in an absence of social distancing and in a completely susceptible population. While many variables may potentially influence $R_0$, a high correlation between these variables may obscure the result interpretation. Consequently, we combine Principal Component Analysis with feature selection methods from several regression-based approaches to identify the main demographic and meteorological drivers behind $R_0$. We robustly obtain that country's wealth/development (GDP per capita or Human Development Index) is the most important $R_0$ predictor at the global level, probably being a good proxy for the overall contact frequency in a population. This main effect is modulated by built-up area per capita (crowdedness in indoor space), onset of infection (likely related to increased awareness of infection risks), net migration, unhealthy living lifestyle/conditions including pollution, seasonality, and possibly BCG vaccination prevalence. Also, we argue that several variables that significantly correlate with transmissibility do not directly influence $R_0$ or affect it differently than suggested by naive analysis.


Introduction
Despite the unprecedented worldwide campaign of mass immunization, due to the relatively slow vaccine rollout and to the appearance of new, more contagious (Tegally et al., 2020), and maybe even more deadly SARS-CoV-2 strains (Mallapaty, 2021), COVID-19 still takes its toll on human lives, stifles the world economy, and forces the majority of countries to keep unpopular lockdowns. In the absence of a prompt solution to the first pandemic of the century, the goal to identify the main environmental and demographic parameters that influence the dynamics of infection transmission remains as important as ever.
Investigating which factors influence the rate of COVID-19 expansion is a highly involved matter, primarily due to many possibly relevant modes of SARS-CoV-2 transmission. It is generally undisputed that COVID-19 transmission occurs by droplets between people in close proximity (Rahimi et al., 2021). Consequently, various factors that influence the frequency of interpersonal contacts and intensity of social mixing between people arriving from different/distant areas will necessarily strongly impact the rate of the epidemic spread. However, other means of transmission may also play important roles. It has been established that contact with surfaces in the immediate environment of infected persons (or with objects used by them) can lead to infection. Also, there is an increasing body of evidence that the airborne transmission of COVID-19 cannot be neglected and, moreover, that small air pollutant particles may aid the long-range virus transmission (Coccia, 2020b;Rahimi et al., 2021). Some authors even suggest the relevance of contaminated wastewater in the infection spread (Rahimi et al., 2021). Such various modes of transmission open up the possibility not only of direct influence of numerous environmental factors (e.g., temperature, humidity, wind speed, air pollution) (Sarkodie and Owusu, 2020) but also of nontrivial interactions between these factors (for example, the interplay between the pollution, wind speed and population density is discussed in (Coccia, 2021b)). It must be also taken into account that the level of personal susceptibility to the given virus can strongly affect the rate of an epidemic, and thus a comprehensive study must also consider more subtle demographic/medical variables (Notari and Torrieri, 2020).
We recently published a comprehensive study of the correlation of 42 different demographic and weather parameters with COVID-19 basic reproduction number R0 across 118 world countries . R0 is a well-established epidemiological measure of virus transmissibility. Its major advantage is independence on the testing policy/capacity and intervention measures, which can be highly variable (and almost impossible to consistently control) between different countries . In , we selected all the countries that exhibited regular exponential growth in the case numbers before the introduction of intervention measures , from which their R0 values can be reliably extracted. Tracking a wide range of countries allows achieving a maximal variability in the dataset, i.e., a maximal possible range in the values of analyzed variables, as another advantage of this study. This dataset will be used as a starting point in this work.
Our present goal is to go a step further than this previous study by applying and combining various advanced data analysis methods. Namely, while ) covered a broad scope of variables and countries, it focused on establishing pairwise correlations between R0 and each of the studied factors, ignoring the fact that many of these variables are highly mutually correlated. This is most obvious in the case of the weather parameters such as e.g. temperature and UV radiation (which both reflect the local climate in a similar way and follow comparable seasonal trends), but also in the case of many demographic parameters, e.g. the strong positive correlation between the Human Development Index (HDI) and cholesterol levels. Based on pairwise correlations alone, it is thus hard to estimate which of these variables might be truly influencing the spread of the disease, to what extent, and in which direction.
Therefore, in the present paper, our task is to go beyond mere pairwise correlations and, by using a combination of principal component analysis (PCA) and four different regression-based approaches, establish which of these factors significantly affect R0 (by at the same time controlling for the impact of other factors). To achieve this, the number of variables necessary to explain the virus transmissibility needs to be reduced to only a few without losing predictiveness. However, this is not the only challenge, because of variable redundancy. In particular, one may select different combinations of variables accounting together for a similar proportion of variance in the virus transmissibility, which presents an ambiguity that is difficult to resolve. Consequently, there is a challenge to narrow down the possibilities and illuminate important contributions of the seemingly small differences between highly correlated variables. Notably, numerous studies examined the correlations of several selected (Lin et al., 2020;Ran et al., 2020; or many different Hassan et al., 2021;Salom et al., 2021) sociodemographic and meteorological factors with the magnitude of the COVID-19 epidemic. However, only a few studies tried to select a handful of key factors whose combination can explain a large portion of the variance between regions (Allel et al., 2020;Coccia, 2020d;Gupta and Gharehgozli, 2020;Notari and Torrieri, 2020). Even a smaller number of studies included data from multiple countries (Allel et al., 2020;Notari and Torrieri, 2020). Furthermore, a similar effort to relate the rate of exponential case growth to a smaller set of variables by PCA was made (Notari and Torrieri, 2020;Notari, 2021). To conduct a thorough investigation and reach robust conclusions, we partially decorrelate 24 different factors and apply several specialized feature selection methods independently. While disentangling the various effects on the epidemic spread is undeniably a challenging task, by comparing the results obtained by different methods one can identify the variables that are selected by all or most of them, which provides valuable evidence in favor of their true significance.
The main idea of this study is to develop a novel approach, which can robustly identify the most important predictors of R0. The development of such an approach will i) provide a straightforward solution to the known problem of selecting important among the highly correlated variables, ii) enable a better understanding of which environmental and demographic variables may dominantly and/or independently influence the progression of the COVID-19 epidemics, and what is the direction of this influence. With these goals, we organize the study as follows: 1. The variables are first naturally split into two groups. The first group comprises six meteorological parameters, sampled and averaged (for each country) during the initial stage of the local epidemic outbreak: air temperature (T), precipitation (PR), specific humidity (H), ultra-violet radiation index (UV), air pressure (P), and wind speed (WS). Eighteen (broadly-speaking) demographic parameters form the second group: human development index (HDI), percentage of the urban population (UP), gross domestic product per capita (GDP), amount of the built-up area per person (BUAPC), percentage of refugees (RE), net migration (i.e., the number of immigrants minus emigrants, I-E), infant mortality (IM), median age (MA), long-term average of PM2.5 pollution (PL), prevalence and severity of COVID-19 relevant chronic diseases in the population (CD), average blood cholesterol level (CH), the prevalence of raised blood pressure (RBP), the prevalence of obesity (OB), the prevalence of insufficient physical activity among adults (IN), BCG immunization coverage (BCG), alcohol consumption per capita (ALC), smoking prevalence (SM), and the delay of the epidemic onset (ON). 2. Due to strong correlations between parameters within each group (as well as across the groups, but at a lower extent), on each group, we will perform the PCA (Jolliffe, 2002). This step will allow us to notably reduce the dimensionality, i.e., proceed to work with a smaller number of (mostly) uncorrelated variables. Such dimensionality reduction will significantly simplify the further analysis and improve the reliability of the results. 3. The linear regression analysis will next be performed in four independent ways, ranging from our custom-developed to more formal regression-based approaches, to select important variables. In our custom-developed approach, multiple linear regressions are applied, first separately to demographic and meteorological principal components (PCs) to narrow down the number of relevant PCs within each of the two groups, before doing overall linear regression with the remaining PCs to assess their importance in explaining R0. A major advantage of such analysis is an intuitive understanding of the data structure and its relation to R0. This analysis is next independently redone by more formal feature selection methods, commonly employed in bioinformatics and systems biology: Stepwise regression and regressions utilizing both regularization and variable selection -Lasso (Least Absolute Selection and Shrinkage Operator) and Elastic net (Tibshirani, 1996;Zou and Hastie, 2005;Hastie et al., 2009). Lasso and Elastic net are regressions based on regularization, which can shrink coefficients exactly to zero, allowing variable selection. Such feature selection is quite important, as the variables that do not affect the response (R0) may introduce significant noise in the model. This would lead to overfitting (high variance), which is exacerbated by a small/limited dataset and correlated predictors (as applicable here). Note that regressions with regularization are not ordinary fits, e.g. hyperparameters that control coefficient shrinkage have to be carefully tuned through cross-validation. Overall, this comprehensive analysis will ensure the consistency and robustness of the reported results. 4. Finally, an intuitive interpretation of the obtained results will be presented. This will permit a much more specific understanding of COVID-19 transmissibility, by focusing on the main driving factors behind the disease spread in the population.

Sample and data
Data for demographic and meteorological parameters were assembled as described in . Briefly, the data correspond to six meteorological and eighteen demographic variables outlined above. The differences between this dataset and the one used in  is the following: IMS (Social security and health insurance coverage), Prevalence of ABO and Rhesus blood groups, and Ambient levels of different pollutants (NO2, SO2, CO, PM2.5, PM10) are not used in this analysis, as they contain too many missing values. Instead of the pollutant levels measured from air pollution monitoring stations during the epidemic's exponential growth (available for only ~40 countries), we use the yearly average PM2.5 pollutant levels in 2017 (World Bank, 2020b). Also, we consider GDP per capita (GDPpc), taken from (World Bank, 2020a), as a more direct (average) indicator of a country's economic wealth/productivity.
There are no missing values in the meteorological data, while we substitute the missing values in the demographic data (which were sparse for the used variables) with the median values of the respective variables. The discarded variables have at least 30% missing values (occurring for blood groups) and going up to 72% (for CO). On the other hand, for the variables that we retained, the missing values are sparse. Specifically, the maximal fraction is 6% (for refugees).
Basic reproduction number (R0), i.e., a measure of SARS-CoV-2 transmissibility in a fully susceptible population and in the absence of intervention measures (social distancing, quarantine), was also taken from , where it was inferred from non-linear dynamics modeling. Overall, demographic data, meteorological data and R0 were assembled for 118 different countries from which we could reliably infer R0. We used the following criteria to select the countries for which R0 was inferred and consequently used in the further analysis: i) We initially selected 165 world countries, with the following criteria: more than 1000 tests performed by 12.06.2020, with the ratio between performed tests and detected cases larger than 5. This accounts for reliable testing capacity, i.e., in this way, we avoid countries with low testing capacity. ii) R0 is inferred from the regime which has reached the deterministic limit, i.e., the starting time is such that the number of detected cases is ~10 or more. Each country is processed/inspected manually to determine the time period that is used in the analysis (i.e., exactly which interval corresponds to the exponential growth) -this is necessary as we found that this interval significantly varies for different countries. The manual inspection corresponds to recognizing a straight line on a semi-logarithmic plot, which is straightforward and reproducible. iii) Clear exponential growth is observed for at least seven days, corresponding to the straight line in the number of detected cases vs. time on the log scale.
In this way, we avoid the cases where too small (or irregular) testing is preventing reliable inference of R0. Two `filters' insure this: The first is a condition i) which directly accounts for those countries with too small testing capacity. Secondly, the countries with `irregular' testing are also filtered by condition iii), as they exhibit an irregular behavior in the case count numbers, rather than regular exponential growth, which is robustly observed across other countries. Finally, condition ii) ensures that stochastic fluctuations (and possible very early testing inconsistencies) do not influence results, while the manual inspection for each country ensures adequacy of the time interval used in the analysis.

Measures of variables
Several variables, particularly among demographic data, show a significant deviation from normality when visually inspected. Such deviations generate large outliers and would significantly impact the necessary normality of the model error residuals. We consequently transform the data (where necessary) to make the resulting distributions closer to normal, by using standard transformations that reduce the right and left skewness. We chose the strength of the applied transformations (e.g., square root, cubic root, or log) so that skewness of the transformed distribution is as close as possible to zero. Applied transformations are provided in Table 1. Each transformed variable whose direction was changed by the transformation was taken with a minus sign, so that the original and the transformed variable are oriented in the same direction, allowing for easier result interpretation.
After transformations, the remaining (now sparse, less than 2% of the dataset) outliers were removed by substituting them with the median of each variable; the outliers were identified as having more than three scaled median absolute deviation (MAD) from the (transformed) variable median. Removal of the outliers is important, as they may substantially (both quantitatively and qualitatively) obscure the multivariate regression analysis, including regressions with regularization (Hastie et al., 2009).
The dimensionality of the transformed data was next reduced and the data decorrelated through PCA (Jolliffe, 2002). PCA was done separately for demographic and meteorological variables to allow for a more straightforward interpretation of the obtained PCs. Since different variables are expressed in different units and correspond to diverse scales, each variable in the dataset was standardized (the mean subtracted and divided by the standard deviation) before PCA. For both datasets, we retained as many PCs (starting from the most dominant one) as needed to (cumulatively) explain >85% of the data variance. It was inspected that PCs reasonably follow a normal distribution (as expected, based on the transformation of the original variables). Few remaining outliers were then substituted by medians. For easier interpretation of PCs and their contribution to R0, each PC was oriented in the same direction as the variable with which it has a maximal magnitude of Pearson correlation (i.e., when needed, the sign of the PC was flipped to render the positive sign of this correlation).

Data analysis
Custom regression analysis was done by applying multiple linear regression (PC regression) to only demographic PCs (Hastie et al., 2009). Only linear terms were included in the regression to allow straightforward interpretation, i.e., selection of PCs that significantly affect R0. Significant PCs were selected as those appearing in the regression with P<0.05, where the significance in the regression was estimated in the standard way (through F-statistics) (Alexopoulos, 2010). The same regression was then repeated with only meteorological PCs, and those significant in explaining R0 were retained. Finally, multiple linear regression was performed with all retained demographic and meteorological PCs. The significant PCs from this last step were recognized as PCs relevant for the R0 explanation. Before regression, each PC was standardized so that coefficients obtained in the regression provided a measure of the variable importance in explaining R0. For both the custom analysis and stepwise regression, OLS (Ordinary Least Squares) were used as the regression metrics.
Stepwise regression was used to select PCs that significantly affect R0. In Stepwise regression, as well as in LASSO and Elastic net described below, all PCs (demographic and meteorological) were included in the regression. Briefly, starting from a constant model, at each step a term is added to the model if its significance (calculated with F-statistics) meets the condition P<0.05 (Pope and Webster, 1972).
Only linear terms are added to the model (i.e., interaction and quadratic terms are not considered) to allow for straightforward interpretation which PCs significantly affect R0. All PCs are standardized before regression so that contributions of the terms (PCs) in the model can be assessed by the magnitude of the regression coefficient.
Lasso regression (Tibshirani, 1996;Hastie et al., 2009) was used to implement L1 regularization. As needed with the Lasso regularization, all PCs were standardized before regression, which allowed direct comparison of the coefficients obtained by the regression. The value in Lasso was treated as the hyperparameter, i.e., min value was determined through cross-validation, so that MSE (Mean Squared Error) on the testing set was minimal. A total of 100 values were put on the grid, corresponding to the geometric sequence, where the largest value produces all zero terms. Note that larger corresponds to a sparser model, i.e., a smaller number of non-zero components in the regression, while the small limit corresponds to OLS regression. To obtain the maximally sparse model, SE = min + SE, where SE corresponds to the standard error of MSE obtained by crossvalidation, was used. 1000 cross-validations were performed -in each repetition, 20% of the data were randomly selected for the testing set, with the remainder used for training. All non-zero terms, and the corresponding coefficients obtained through Lasso, were reported.
Elastic net regression was used to implement a combination of L1 and L2 regularization (Zou and Hastie, 2005). Analogously to our Lasso analysis, i.e., as needed due to regularization, all PCs were standardized. In the regression, both and were treated as hyperparameters, i.e., their optimal values were found by cross-validation. Cross-validation was repeated 1000 times. In each repetition, testing and training sets were formed in the same way as for Lasso. and values were put on a grid consisting of 100 and 100 values. values on the grid were chosen uniformly in the range [0,1]approaching zero corresponds to Ridge (L2) regression, and 1 corresponds to Lasso regression. For each value, values were chosen as described for the Lasso regression. For each repetition of crossvalidation, a combination of α and λ that leads to the minimal MSE was chosen. and values in ( , ) pairs from each cross-validation run were then standardized so that and values are on the same scale and centered to the origin of the − plane. ( min , min ) was then chosen as the ( , ) point closest to the origin. With this ( min , min ) value, the model was then retrained on the entire dataset. All non-zero terms and the corresponding regression coefficients were reported.

Results
PCA was first applied to the dataset consisting of 18 demographic and health factors for 118 countries. Cumulative data variance that is explained jointly by the first n PCs is shown in Figure 1A (with n represented on the x-axis). In particular, Figure 1A shows the first PC alone already accounts for 45% of the variance, while the first 9 PCs (PC1 -PC9), which we retain in further analysis, explain more than 85% (precisely, 89%). To obtain a basic interpretation of these nine PCs, we related each PC with the original (transformed) variable it is most correlated with. The corresponding associationswith the values of correlations coefficients presented on the y-axisare shown in Figure 1B (however, one should have in mind that some PCs are highly correlated with more than one original variable, as we discuss in more detail below). Among all principal components, the PC1 and the PC5 have the highest correlation coefficients (close to 1) with individual demographic factorsthe HDI and the BCG immunization coverage, respectively. Moderately high correlation coefficients (~0.75) characterize the relations between the PC2 and the prevalence of smokers, and the PC3 and the percentage of refugees, while the coefficient values of ~0.5 were obtained for the correlations of the PC4, the PC6, the PC7, the PC8 and the PC9 with, respectively, the prevalence of obesity, the prevalence of insufficient physical activity, the amount of the built-up area per person, the percentage of refugees, and the epidemic onset. In particular, the first PC, accounting alone for the largest portion of the variance in the demographic data, is almost perfectly correlated with the Human Development Index (Fig. 1C). On the other hand, the HDI variable itself strongly correlates with several other demographic variables (Fig. 1D), most prominently with per capita GDP, infant mortality, and cholesterol levels. As elaborated in the Discussion section, such extremely high correlations will eventually preclude us from differentiating between the separate effects of each of these variables on R0. On the other hand, the prevalence of obesity, the built-up area per person, and the epidemic onset are significantly correlated with the HDI (Fig. 1D), and thereby the PC1 (Fig. 1C), but they are markedly featured also in separate principal components (Fig. 1B), namelythe PCs 4, 7 and 9. This will help us to infer whether their specific (additional) contributions to the variance in the data (apart from that along the PC1) impact the virus transmissibility.
For meteorological factors for 118 countries was reduced similarly as for the demographic dataset. PCA generated six uncorrelated, orthogonal principal components. Thereby, the first PC alone explains 62% of the variance, while the first three PCs (PC1-PC3) capture 95%, which is significantly above the targeted 85% of the total variance ( Fig. 2A). Pairwise correlations showed that the retained three PCs have the highest correlations with the temperature, the wind speed, and the air pressure, respectively (Fig. 2B), where the correlation of PC1 with the temperature is close to 1 (Figs. 2B and 2C). There are also notable correlations of the temperature with humidity, the levels of UV radiation, and precipitation (Fig. 2D). Therefore, PC1 presents seasonality, i.e., a set of mutually correlated meteorological variables that can be related to yearly weather changes. PCA, thus, effectively separated the impacts of seasonality (PC1), the wind speed (through the PC2), and the air pressure (through the PC3). The variables determining the PC1 are also correlated with the HDI. These inter-dataset correlations are not resolved at this level by our PCA and represent the trade-off that allows interpreting the PCs more easily within each of the two smaller, thematic groups of factors. After PCA, we applied the linear regression analysis using four different methods, as explained in Methods. The first, "custom" method included the additional step of "preselecting", i.e., further narrowing down the number of PCs that will enter the final regression analysis. The multiple linear regression, applied on the group of 9 demographic PCs, selected 1 st , 4 th , 7 th and 9 th component as the most relevant predictors of R0 (the remaining 5 PCs appeared in the linear regression with p values above 0.05 threshold and were consequently excluded from the further analysis, see Table 2. Analogously, the "preselection" of meteorological PCs singled out the 1 st component as the only statistically relevant predictor of R0 from this group (see Table 3). The multiple linear regression was then applied on these five selected PCs (four demographical and one meteorological) and yielded a regression model with the corresponding linear coefficients represented in Figure 3A (see also  Fig 3A, the demographic PC1 has the most dominant influence on R0a robustly obtained result throughout all four methods (see below). We have already related each of these four PCs with the dominantly correlated variable ( Figure 1B), but a more detailed interpretation of the results is obtained if all significant correlations (not just the dominant one) are taken into account. In addition to the very high correlation with HDI, demographic PC1 is also highly positively correlated with GDP, cholesterol levels, median age, and percentage of the urban population, while it is highly negatively correlated with infant mortality and the prevalence of chronic diseases ( Figure 4A). Such strong correlations with HDI, GDP, IM, MA, and UP show that this component indeed expresses an overall, both social and financial, the prosperity of the country (which seemingly also goes hand in hand with high average cholesterol levels and low prevalence of COVID-19 relevant chronic diseases). Similarly, by considering the correlations of demo PC4 with all demographic variables, we see that this component is significantly positively correlated not only with obesity but also with smoking, physical inactivity, and air pollution ( Figure 4B)in other words, with major indicators of an unhealthy lifestyle and living conditions. Apart from its correlation with the BUAPC parameter, the component demo PC7 is also significantly positively correlated with net migration ( Figure 4C). In the case of the demo PC9 component, its only significant correlation is with the onset variable. Results of the custom method can therefore be summarized as follows: the country's prosperity, as well as unhealthy living conditions and lifestyle, tend to increase the value of R0, while the larger built-up area per person and the later epidemic outbreak tend to slow the spread of the disease. Also, the results seem to indicatevia demo PC7 componenta surprising diminishing effect of the net migration on the rate of epidemic progress (though the sign of this variable may not be easy to interpret, as the net migration is a difference of two quantities).
Equivalently to Figure 3A, Figures 3B, 3C, and 3D represent the results of, respectively, Stepwise, Lasso, and Elastic net regression. Results (and the corresponding graph) of the Stepwise method almost coincide with the results of our custom methodin spite that in the Stepwise regression (as well as in Lasso and Elastic net methods), there is no intermediate "preselection" step.
Lasso results, shown in Figure 3C, find two additional PCs as relevant: demo PC5 and meteo PC1 (in addition to demo PC1, demo PC4, demo PC7, and demo PC9). The component demo PC5, appearing in Lasso results with a small negative coefficient, is significantly correlated only with the BCG variable, hinting at possible beneficial effects of BCG vaccination. Meteorological principal component meteo PC1 reflects seasonality (see above). In addition to supporting the conclusions of the custom and stepwise methods, the Lasso method thus also implicates a significance of seasonality changes and (to some extent) BCG vaccination in reducing the rate of SARS-CoV2 spread.
The results of the Elastic net method, shown in Figure 3D, are again a bit more restrictive. While further bolstering our confidence in the importance of demo PC1, demo PC7, and demo PC9, these results also reinforce that the seasonal weather variables influence the COVID-19 epidemic (in agreement with the Lasso method) but, for the first time, we do not find an indication of the relevance of the unhealthy lifestyle and living conditionsas revealed by the absence of demo PC4 component in Figure 3D.
Finally, as much as the PCs appearing in Figure 3 are important, the absence of the remaining PCs in the results can be of comparative significance for some of our conclusions. For example, we note that PCs highly correlated with the urban population, alcohol consumption, and chronic diseases do not show up as relevant in any of the methods used. While it is true that these variables are moderately correlated with demo PC1, absence in the results of additional PCs tied with these variables supports the view that these variables are not directly influencing R0 value, but only via indirect relation to the country's prosperity.

Discussion
Our goal was to identify the most predictive factors influencing the risk of the SARS-CoV-2 virus spreading in a population in the absence of any epidemic mitigation measures. Since many potentially relevant factors strongly correlate with each other, we divided them into two groupsmeteorological and sociodemographicand applied the PCA to the variables in each group. In this way, we were able to decorrelate variables within each group while still retaining intuitive interpretation for the new variables (demographic and meteorological PCs) used in further analysis. While a similar approach was proposed in (Notari and Torrieri, 2020;Notari, 2021), dimensionality reduction and predictor decorrelation through PCA were, in our study, combined with different variable selection and regularization techniquesnamely, Lasso and Elastic net -to select PCs that are most predictive of R0 for COVID-19 epidemics. Examining correlations of these PCs with the original variables allowed pinpointing the main drivers of COVID-19 transmissibility. Therefore, this approach, which is reminiscent of the analysis of complex data in systems biology and bioinformatics, to our knowledge represents the most thorough effort to disentangle true COVID-19-related effects from indirect correlations.
Three principal components are robustly selected as the most important predictors by all the methods. Of these, the prosperity of the country has the most significant influence on R0: the spread of the epidemic is faster in economically more developed countries. Specifically, this is the most dominant PC from the demographic group of variables, which is by far most important in explaining R0, and very strongly correlated with HDI (Pearson's correlation coefficient r=0.95) and GDP (r= 0.94)therefore effectively reflecting prosperity and wealth. The second PC is dominantly related to the built-up area per person (BUAPC), and the third with the epidemic onset, where the increase of these reduces the infection spread. We also robustly obtained (by three out of four methods) that unhealthy living conditions/lifestyle is another important factor that exacerbates the epidemic -this PC is dominantly (and consistently positively) correlated with obesity, physical inactivity, smoking, and air pollution. The group of four weather conditions, represented by seasonality, was selected by two independent methods. One is the Elastic net, which is well adapted for selecting among correlated variables (Zou and Hastie, 2005;Hastie et al., 2009) -note that correlations between meteorological and demographic PCs were not abolished by our approach. The PC dominantly correlated with BCG immunization appears only in Lasso regression. In favor of our results, we notice that the analysis of 13 selected variables conducted by Notari and Torrieri (2020) pointed to several similar factors as significant, such as the epidemic starting date, temperature, and the factors characteristic of general poor health and, thus, lower resistance to the virus infection (i.e., smoking, obesity and lung cancer as related to the chronic diseases).
HDI (alternatively, GDPpc) shows the highest correlation with the first demographic PC, which singles out this variable as the main index quantifying the virus transmissibility risk. Higher HDI leads to a higher rate of social contacts and more intense population mixing (especially over long distances), as high HDI is strongly related to high GDPpc implying intensive economic activity, trade, and transportation, including large-distance flights (Allel et al., 2020;Gangemi et al., 2020). While it is true that the intensity of social binding might often be even higher in low-income societies, such interpersonal relations are prevalently of spatially-local character and may less contribute to a large scale epidemic than the intense business activity (trade, transport, tourism, education), which connects and mixes people over large distances/areas. Thus, in our view, much higher contact frequency (especially of contacts that mix individuals over large distances) in societies with higher HDI is likely the main cause behind the dominant role of the first demographic PC in explaining R0.
There is, however, no reason to expect this relation of HDI with R0 to uniformly hold across the entire range of HDI (prosperity PC) values: indeed, analysis performed in (Notari and Torrieri, 2020) shows that GDP (strongly correlated to HDI and our prosperity PC) ceases to be a significant predictor of COVID-19 transmission rate once the low-income countries are excluded. One possible interpretation is that above a certain threshold of HDI, there is a focus on fully developed modern societies. The impact of further increasing country prosperity on the frequency of social contacts then becomes overshadowed by other factors. This was obtained in our recent study of the rate of COVID-19 transmission in USA states , where different factors, in the first place pollution, become dominant in that dataset.
This explanation of the impact of prosperity PC is still a hypothesis, which should be tested by further investigating the relation of HDI to some more direct measures of frequency of actual physical contact between individuals (unfortunately, such measures are not readily available) (Bontempi et al., 2020). We think that the obtained strong association of the PC closely related with GDP/HDI with R0 is an important result (allowing a better understanding of epidemic risks), regardless of whether our interpretation (entirely) captures the true underlying causes of this connection.
An advantage of our approach is that it is based on the analysis of R0 rather than other measures used as transmissibility proxies. The most commonly used measure, confirmed case counts, strongly depends on the number of performed tests, which is generally much higher in high-GDPpc countries, so the analysis would become strongly influenced by testing policies. For example, in (Allel et al., 2020), the importance of HDI for predicting cumulative case counts was noted. However, this perceived effect may be due to the lack of testing in lower-income countries, as already noted in (Notari, 2021), rather than genuine HDI influence. Our results are, on the other hand, insensitive to the testing capacity differences. Namely, since our R0 estimation procedure relies on the scale-invariant slope of the case growth curve logarithm (in the distinct early exponential phase) , it does not depend on the percentage of the infected individuals that get tested/diagnosed (i.e., multiplying daily case counts by arbitrary number will not affect R0). As long as this percentage is roughly constant (i.e., testing is performed consistently) during the (relatively short) examined period, our R0 estimate remains valid irrespectively of whether the country must reserve testing kits only for symptomatic cases, or the country has the means to (preventively) test asymptomatic individuals. On the other hand, if testing policies/criteria happen to irregularly change during the observed period (e.g. due to a sudden lack of testing resources), this would be accounted for (i.e., filtered out) as described in Methods. Therefore, our analysis indeed strongly suggests that HDI/GDPpc are the main/genuine predictors of COVID-19 spread in the population.
Many correlations previously reported between SARS-CoV-2 transmissibility and various weather, sociodemographic, and health factors [see e.g. Salom et al., 2021)] may be captured by HDI. From our results, one can note that several demographic factors significantly correlate with both HDI/GDPpc and the first demographic PC, but are not noticeably related with other demographic PCs (4,5,7,9) that significantly contribute to R0. These demographic factors can be further divided into two groups using the correlation of BUAPC with HDI as the reference. The percentage of the urban population, the prevalence of alcohol consumption, and chronic diseases, which have similar (just somewhat higher) correlations with HDI compared to BUAPC, comprise the first group. Their absence from the independent PCs significantly related with R0 (in contrast to BUAPC that prominently appears in the demographic PC7), indicates that they do not have independent effects on R0. Consequently, their significant correlation with R0  is likely due to their generic correlation with HDI, rather than a consequence of the independent effect that they exhibit on R0. This result is especially interesting for the percentage of the urban population, whose relation with R0 is sometimes taken for granted (Carozzi, 2020). It also explains the previously obtained negative correlation of the prevalence of chronic diseases with R0, where one might expect the opposite, as it is generally known that people with chronic diseases are seriously affected by COVID-19 (Zheng et al., 2020). We can now claim that this result is due to a generically lower incidence of chronic diseases in more developed countries (i.e., due to their significant negative correlation with HDI) rather than a direct effect on R0.
A significant positive correlation with the first demographic PC is also exhibited by the net economic immigration (the difference between immigrants and emigrants), population median age, infant mortality, and the average blood cholesterol level. However, their correlation with HDI is very high (in distinction to the three factors mentioned before), i.e., visibly higher compared to the correlation of BUAPC with HDI. So, even though they do not appear in demographic PCs that significantly contribute to R0 other than PC1, we cannot make any reliable conclusion about their direct effect on R0 based on our analysis. Therefore, it is relevant to discuss evidence from other sources, i.e., possible mechanisms for distinguishing their direct influence on R0. Regarding infant mortality, a mechanism of its direct contribution to R0 is hard to imagine, so its involvement in PC1, and high negative correlation with R0, is almost certainly an indirect consequence of this variable being a proxy of HDI (Ruiz et al., 2015). On the other hand, the median age and the blood cholesterol level are real contenders for direct R0 modifiers, as mechanisms for their contribution to COVID-19 transmissibility have been proposed. Aging is generally associated with the weakening of the immune response to infectious diseases making the elderly more susceptible to the viruses like the SARS-CoV-2 (Pawelec and Larbi, 2008). Additionally, many of them due to some chronic diseases take ACE inhibitors and angiotensinreceptor blockers which cause an increased expression of ACE2 serving as a receptor for the SARS-CoV-2 virus entry (Shahid et al., 2020). Their residing in care homes, which is particularly common in high-income countries, also well suits the spreading of the infection (Kapitsinis, 2020). Similarly, high cholesterol levels can increase susceptibility to the infection by SARS-CoV-2 through systemic adverse effects on the immune and inflammatory responses and through direct implication in the virus life cycle (especially at the level of its endocytosis). To that end, statins, blocking cholesterol synthesis, were proposed for usage in COVID-19 treatment, which is supported by studies showing that previous statin usage is associated with a milder pneumonia outcome in the case of several other viral infections (Frost et al., 2007;Schmidt et al., 2020).
Other demographic PCs (4,5,7,9) that are selected to have a significant influence on R0 are by construction independent (decorrelated) from PC1. Variables associated with these PCs can be interpreted as effects on R0 independent from those related to PC1. These variables then importantly identify corrections to the main effect of HDI/GDPpc. Specifically, these are indoor area available to an individual and the net immigration (demographic PC7), the delay in the epidemic onset, which is associated with more awareness of the virus threat (demographic PC9), the prevalence of unhealthy lifestyle and environment (demographic PC4), and the weather seasonality (meteorological PC1). The slower spread of the virus with a larger built-up area per capita, as an independent and significant R0 predictor, is an interesting and new result, though intuitively plausible. It can be understood as having a less crowded indoor space (where the virus transmission dominantly happens) so that people are less exposed to each other and the virus. For example, on the Diamond Princess cruise ship, both the population density and R0 were estimated as four times greater than those in Wuhan (Rocklöv and Sjödin, 2020). On the other hand, a correlation of the virus transmissibility with the large territory population density is weakly established in the literature, whereby it seems that one should rather seek a correlation with a local population density, directly determining the number of contacts that an individual can make (Garland et al., 2020;Diao et al., 2021).
A positive contribution to the transmissibility is also made by the principal component strongly correlated with the onset variable, representing the number of days from February 15th to the epidemic's start in a particular country. The importance of the delay in the epidemic onset may be due to the psychological effect of hearing the news about the spread of COVID-19 in other countries (Khajanchi et al., 2020). Namely, the longer the epidemic was growing outside of a particular country, the larger impact this had on its people to change their usual behavior to prevent the infection, which could slow down the virus transmission even before the introduction of the official intervention measures .
Another distinguished principal component appears to encompass multiple indicators of an unhealthy lifestyle and environmentspecifically, the prevalence of obesity, physical inactivity, and smoking, together with the level of air pollution. We obtained that all these factors promote virus transmission. It is well established that they can impair immune function and adversely affect different organ systems. Furthermore, their association with mechanisms specifically facilitating the infection by the SARS-CoV-2 virus has been proposed (Domingo and Rovira, 2020;Heidari-Beni and Kelishadi, 2020;Haddad et al., 2021). Notably, the association of air pollution with COVID-19 transmission has been shown in several studies (Bashir et al., 2020;Coccia, 2020b;Milicevic et al., 2021). Two of the remaining relevant PCs are strongly determined by temperature (and/or three other highly related weather factors) and the prevalence of BCG vaccinated children, respectively. Although not selected by all the methods, the weather component seems important as it was chosen by the Elastic net algorithm (in addition to Lasso), which is specifically designed to deal with (highly) correlated variables, and yet it did not exclude this PC despite its correlation with the first demographic PC. Moreover, a decrease of the transmissibility with the temperature increase appears as a robust result in COVID-19 literature (Haque and Rahman, 2020;Rosario et al., 2020;Sarkodie and Owusu, 2020), although conflicting conclusions are also present (Xie and Zhu, 2020;Islam et al., 2021;Srivastava, 2021). Higher temperatures may shorten the period of virus viability in aerosols, enhance the immune system functioning, and/or impact the time that people spend together in poorly ventilated indoor spaces (Notari, 2021). Since temperature is highly positively correlated with the intensity of UV radiation, humidity, and the level of precipitation, we cannot exclude the possibility that some of these other factors are in a significant causal relationship with virus transmissibility. Importantly, some experimental findings support the inactivating effects of high temperature, humidity, and UV radiation on SARS-CoV-2 and related viruses (Casanova et al., 2010;Chan et al., 2011;Heilingloh et al., 2020;Sagripanti and Lytle, 2020;van Doremalen et al., 2020). Anyhow, our results suggest the dependence of virus transmissibility on seasonal weather variations related to our meteorological PC1 component. On the other hand, while some authors suggest the importance of wind speed in virus dissemination (Coccia, 2020b;Coccia, 2020a;Sarkodie and Owusu, 2020;Islam et al., 2021), our study could not confirm this connectionas reflected by the absence of the wind-related principal component meteo PC2 in our results. This might be a consequence of a strong interplay between wind speed and pollution effects (Coccia, 2021b) that our study is not suited to detect. The last demographic principal component occurred as important only in Lasso regression. It, however, closely follows the extent of BCG vaccination, which is known to provide some protection against various respiratory tract infections through the induction of the trained immunity (O'Neill and Netea, 2020), so BCG immunization may influence the SARS-CoV-2 spread, although, according to our results, to a lesser extent than the other discussed factors.
Our study is also an example of how assessing the effect of one factor while controlling for the presence of other relevant variables can change the obtained conclusions. We will illustrate this with four examples, where we obtained qualitatively different conclusions, compared to single-variable correlation analysis : built-up area per capita (BUAPC), net migration, air pollution, and raised blood pressure. BUAPC showed an absence of a significant correlation with R0 , which is a consequence of canceling the two effects. The first is its direct effect on R0 (exhibited through demographic PC7), which tends to slow the spread of COVID-19 in a population. The second is collinearity with PC1, which reflects a generic correlation of BUAPC with GDPpc, caused by more construction (higher built-up area) per capita with the increase in GDPpc. Our combination of PC and regression analysis revealed this non-trivial conclusion, which cannot (even qualitatively) be obtained from the pairwise correlation analysis.
Similar reasoning, though perhaps harder to understand intuitively, applies to net migration. Net migration is also significantly positively correlated with HDI (and consequently also with PC1), reflecting a generic tendency of immigrants to flow to countries with higher GDPpc. The direct effect of net immigration (exhibited through PC7) is harder to intuitively understand, as I-E negatively contributes to R0, so that faster spread (at least in the initial phase of the epidemic) appears to be associated with a higher number of emigrants. As these are economic migrations (to be distinguished from the movement of refugees), possibly the part of the emigrants returned to their countries with the pandemic's start. The significant effect of net immigration on R0 (inferred through our analysis) is again highly non-trivial and in the opposite direction from the positive pairwise correlation of R0 with I-E. Refuges (i.e., percentage of refugee population by country) exhibit high correlations only with PC3 and PC8, where neither significantly contributes to R0. There is also no significant pairwise correlation of refugees with R0, which robustly shows that this variable does not affect transmissibility.
Pollution negatively contributes to demographic PC1 (with the corresponding negative correlation with HDI) and positively to demographic PC4. The pairwise correlation between the pollution and R0 is negative (-0.31), which is counterintuitive, as it is generally expected that higher pollution should increase COVID-19 transmissibility. This is, however, an artifact of negative correlation between pollution and HDI, while its genuine (direct) effect on R0 is reflected through PC4. Our analysis, therefore, revealed the direct effect of long-term air pollution on transmissibility, which is consistent with previously published observations that it can damage the respiratory system and reduce resistance to infections (Domingo and Rovira, 2020;Fattorini and Regoli, 2020) but opposite to naive pairwise correlation analysis.
Finally, raised blood pressure also shows a statistically significant (counterintuitively negative) correlation with R0. However, in addition to PC1, raised blood pressure shows a notable correlation only with PC2, which does not significantly affect R0. This indicates that the negative correlation of this variable with R0 is a consequence of its generically negative correlation with HDI, instead of a direct effect on COVID-19 transmissibility.

Conclusion and Outlook
Numerous studies tried to assess the correlations of different factors with the SARS-CoV-2 virus transmissibility Notari and Torrieri, 2020;Salom et al., 2021), but the next step should be predicting the environmental risk of the high spreadability in a certain population (Allel et al., 2020;Coccia, 2020d;Gupta and Gharehgozli, 2020). Specifically, a relatively small number of the most influential meteorological and demographic factors should be selected for a predictive risk measure that is accurate enough and practical for use. Such risk assessment is very useful in guiding the future strategies of imposing epidemic mitigation measures (Coccia, 2021a).
We here demonstrated that taking into account joint effects of different factors can point to qualitatively different conclusions about their influence on the virus transmissibility than considering them individually (as in ). Utilizing a combination of PCA and feature selection techniques, we were able to disentangle with high confidence which variables independently (and significantly) influence the rate of the infection spread, and which have an only indirect influence or no influence at all (here found for alcohol consumption, chronic diseases, percentage of the urban population, raised blood pressure and refugees).
While PCA brings clear advantages to regression analysis, such as working with a smaller number of variables and abolishing collinearity, the main disadvantage is harder interpretation in terms of original variables. We were, however, able to interpret PCs that significantly affect R0, so that the main driving factors behind COVID-19 transmissibility are i) the country's wealth/development level, corrected by the available indoor space per person and net immigration), ii) pollution levels and some of the unhealthy living factors, iii) spontaneous behavior change due to developing epidemics, iv) weather seasonality and v) possibly (marginally) BCG vaccination. These conclusions, and the direction of the corresponding effects, crucially depend on the more complex analysis performed here.
Certain limitations of this study should also be addressed. When the alignment between certain variables is too high, even the analysis performed here cannot differentiate between the factors genuinely affecting R0 and mere accidental correlations. In such cases, further, specifically designed (such as targeted epidemiological) studies are needed. For example, based on this analysis alone and due to the very high correlation between the cholesterol levels and HDI/GDP, it cannot be excluded that cholesterol is a contributing factor to the observed significance of the PC1 component, in addition to the country's prosperity that likely mimics the contact rate in population (as a crucial disease transmission property). For this reason, our research suggests that a separate study of cholesterol levels in the COVID-19 context (e.g., by measuring cholesterol blood levels along with PCR tests) could be, potentially, of high value since a hypothetical unexpected discovery of inherent cholesterol importance could lead to novel treatments of SARS-CoV-2 infection. Similarly, studies that disentangle the effect of the overall country's prosperity from the intrinsic effects of median age on R0 would be also quite welcome. Furthermore, some of the variables that were not included in our study, but were found as significant by other studies (Notari and Torrieri, 2020), and are closely related to HDI, such as lung cancer prevalence (Youlden et al., 2008), or frequency of tourist arrivals (Anggraeni, 2017), may also contribute to the strong association of HDI with R0, which we observe. Another problem might occur due to possible untrivial interactions between different factors: for example, the effects of pollution on the airborne COVID-19 transmission can be significantly dependent on the wind speed (Coccia, 2020b;2021b). Unfortunately, our study is not well-suited to detect such effects (while, in principle, these effects could be included through higher-order terms in regression, this might lead to overfitting). A non-trivial future task may be a systematic attempt to unify (and standardize, to the extent possible) different approaches and variables used in the studies of the environmental factor effects on COVID-19 transmissibility.
Some suggestions that could improve COVID-19 mitigation policies may already follow from this study. Deliberately reducing (in the long run) the overall population prosperity, which we identified as the most important predictor of R0, in an attempt to indirectly reduce the frequency of social contacts is impractical. However, the widely implemented lockdowns, proven to inhibit the infection spread, are the measures that reduce the frequency of interpersonal contacts, but also economic activity, so the epidemic behavior seems consistent with the predicted HDI dependence. Our conclusions about the importance of HDI, as a predictor of R0, could be further tested by studies of epidemiological relevance of higher resolution HDI-analogs, such as Subnational HDI (SHDI) or City Development Index (CDI). If HDI and GDP parameters are confirmed to dominantly influence R0 values simply since they highly and naturally correlate with the frequency of social contacts (as we anticipate to be the case), identifying this as one of the major factors is not without implications. That is, recognizing the importance of this parameter can help make better predictions of the disease dynamic and locate in advance high-risk spots/areas.
Moreover, since we identified the built-up area per person as another relevant factor, future urban policies should put more emphasis on avoiding very high population concentrations (Coccia, 2020c). The relevance of the onset variable may indicate the importance of timely informing the public about all the risks and perils of the incoming epidemic. More obviously, pointing to unhealthy living conditions and lifestyle as a significant driver of COVID-19 pandemic, underscores the importance of measures to reduce obesity, smoking prevalence, physical inactivity, and air pollution (Coccia, 2021a), also by subsidizing energy from renewable resources (Coccia, 2020c). Overall, the results presented here emphasize the importance of understanding environmental factors that impact transmissibility for both COVID-19 and other (including potential future) infectious diseases.