Principal component analysis of polar cap convection

Abstract

[1] We apply Principal Component Analysis (PCA), a statistical technique, to examine underlying patterns of polar cap convection and illustrate potential applications of PCA-based dimension reduction. Two principal components are identified: the first mode (PC1) is related to “uniform variation” of the flow speed at all MLTs and is primarily governed by IMF Bz. The second mode (PC2) is related to “dawn-dusk asymmetry” and is predominantly driven by IMF By. PCA gives the relative variance contribution of the two modes: PC1 accounts for ∼42% of the total variance and PC2 for ∼17%, which is about 40% of that from PC1. Due to the orthogonality of the principal components, the degree of dawn-dusk asymmetry can be represented by |P2|, where P2 is the component value obtained when the observed data are projected along PC2. We identified P2 as proportional to IMF By, which leads to stronger dawn flows for By > 0 and stronger dusk flows for By < 0. The same primary modes are found regardless of the IMF Bz orientation, implying that they are intrinsic properties of the average polar cap convection.

1. Introduction

[2] It is well known that the strength and pattern of high latitude ionospheric convection are primarily controlled by the interplanetary magnetic field (IMF) [e.g., Maezawa, 1976]. Convection is enhanced for more negative IMF Bz, while IMF By drives a dawn-dusk asymmetry in the global convection pattern [e.g., Heppner and Maynard, 1987]. Also, solar wind dynamic pressure and ULF power in the IMF can drive substantial enhancements of the convection strength and the polar cap potential drop [e.g., Boudouridis et al., 2005; Kim et al., 2009, 2011; Lyons et al., 2009].

[3] As the entire polar cap convection is governed by the input solar wind conditions and the internal dynamics of the ionosphere-magnetosphere system, flow speeds at all MLTs may vary collectively with some structure. Thus, one way to understand the evolution of polar cap convection is to presume that convection data at a given time can be represented by a few primary modes that retain the main features of the data. Principal Component Analysis (PCA) is a useful statistical tool for evaluating this, since it identifies underlying patterns that are hard to find in high-dimensional data and allows the original variables to be replaced with a smaller number of independent variables associated with the primary modes, resulting in dimension reduction.

[4] A similar technique called the Method of Natural Orthogonal Components (MNOC) has been successfully applied to the analysis of geophysical data [e.g., Sun et al., 2008; Baker et al., 2003, and references therein]. Most of the previous studies with MNOC focused on the identification of a set of orthogonal components and their physical interpretation. PCA, on the other hand, is frequently used for dimension reduction. Thus, in this study, we examine underlying modes of the polar cap convection and illustrate potential applications of dimension reduction with the leading principal convection components.

2. Description of Principal Component Analysis and Data Preparation

[5] PCA finds an orthogonal transformation of the coordinate system in which the original data can be described by a new set of independent variables called principal components. The transformation usually involves centering (zero mean) and rescaling of the observed variables, and it can be expressed as (see Chatfield and Collins [1989] for a detailed mathematical derivation of the method):

$$X = C + S\,R\,P, \qquad P = R^{T} S^{-1}\,(X - C) \tag{1}$$

where X is the original data, C the mean value of X, S the scale factor, R the transform matrix, and P the coordinate values in the new coordinate system. R is composed of the normalized eigenvectors of the covariance matrix of the original data; these eigenvectors are also referred to as principal components. Each eigenvalue is equal to the variance explained by the corresponding principal component. In many cases, the summed variance of a few leading PCA modes is large enough to describe a substantial portion of the total variance. In that case, we can drop the PCA modes with small variance, achieving dimension reduction and improving tractability for further analysis without losing much information. This study conducts PCA using the free statistics package R, whose official website is http://www.r-project.org.
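As an illustration of the transformation described above, the following Python/NumPy sketch centers and rescales a data matrix, diagonalizes its covariance matrix, and projects the data onto the eigenvectors. It is not the R code used in this study: the function name `pca_standardized`, the use of the sample standard deviation as the scale factor S, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def pca_standardized(X):
    """PCA of X (rows = observation times, columns = variables)."""
    C = X.mean(axis=0)                  # mean of each variable
    S = X.std(axis=0, ddof=1)           # assumed scale factor: sample std. dev.
    Z = (X - C) / S                     # centered, rescaled data
    cov = np.cov(Z, rowvar=False)       # covariance of Z (correlation matrix of X)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]   # sort modes by decreasing variance
    eigvals, R = eigvals[order], eigvecs[:, order]
    P = Z @ R                           # component values in the new coordinates
    return C, S, R, eigvals, P

# Synthetic stand-in for 12 MLT-bin flow speeds (not real SuperDARN data):
# a common signal shared by all bins plus independent noise in each bin.
rng = np.random.default_rng(0)
common = rng.normal(size=(500, 1))
X = 300.0 + 80.0 * common + 40.0 * rng.normal(size=(500, 12))

C, S, R, eigvals, P = pca_standardized(X)
explained = eigvals / eigvals.sum()     # fraction of total variance per mode
```

Because the variables are rescaled to unit variance, the eigenvalues sum to the number of variables (12 here), and `explained` gives each mode's share of the total variance.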

[6] We collected convection flow speeds at magnetic latitudes >75° observed by the SuperDARN radars [Greenwald et al., 1995] and the 1-min ACE data time shifted to just upstream of the magnetopause nose for the years 2001 to 2005. For each measurement time, two-dimensional velocity vectors are obtained by fitting an expansion of the global pattern of electrostatic potential in terms of spherical harmonics to an ensemble of measured line-of-sight velocity vectors and vectors from a statistical model that is keyed to the IMF [Ruohoniemi and Baker, 1998]. In order to use only the most accurate velocities, we have selected those fitted vectors that are at positions having an actual radar measurement. The flow speeds were binned into 2-hr MLT intervals (i.e., 0–2, 2–4 MLT, …) and then averaged over each bin. Furthermore, as high-frequency temporal variations are not of interest here, the MLT-bin averages were smoothed with a 22-min running-average window that was shifted every 6 min. Thus, in our analysis, X in equation (1) corresponds to a 12-component vector of flow speeds. The PCA process employed in this study presumes a complete data set, that is, a finite flow speed value is assigned to every MLT bin at a given time. Such complete observations were available for ∼10% of the examined 5-year period, and this complete set is used for the PCA. Note that there is typically more complete coverage of convection during moderately active times, so our results may be more representative of patterns for active conditions.
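The binning and smoothing steps described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the 22-min window is taken to start at the first sample and is labeled by its center time, and the 2-min sample cadence in the usage lines is invented for the example.

```python
import math

def mlt_bin(mlt_hours):
    """Index (0..11) of the 2-hr MLT bin: 0 for 0-2 MLT, 1 for 2-4 MLT, ..."""
    return int(mlt_hours % 24 // 2)

def bin_average(samples):
    """Average (mlt_hours, speed) samples per bin; NaN marks an empty bin."""
    sums, counts = [0.0] * 12, [0] * 12
    for mlt, speed in samples:
        b = mlt_bin(mlt)
        sums[b] += speed
        counts[b] += 1
    return [s / c if c else math.nan for s, c in zip(sums, counts)]

def is_complete(bins):
    """True only if every MLT bin has a finite value (required for the PCA)."""
    return not any(math.isnan(v) for v in bins)

def running_average(times_min, values, window=22.0, step=6.0):
    """Mean of the values inside each window; windows advance by `step` minutes."""
    out = []
    start = times_min[0]
    while start + window <= times_min[-1] + 1e-9:
        inside = [v for t, v in zip(times_min, values)
                  if start <= t < start + window and not math.isnan(v)]
        mean = sum(inside) / len(inside) if inside else math.nan
        out.append((start + window / 2, mean))   # (window center, average)
        start += step
    return out

# Usage with made-up numbers: three samples, then a 2-min cadence series.
bins = bin_average([(1.0, 100.0), (1.5, 200.0), (5.0, 300.0)])
times = list(range(0, 60, 2))
smoothed = running_average(times, [float(t) for t in times])
```

Only time steps for which `is_complete` returns true would enter the PCA, which mirrors the complete-data requirement stated in the text.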

3. Principal Modes of the Polar Cap Convection and Their Applications

[7] Pair-wise correlation coefficients between the 12 MLT-bin averages show strong correlation between nearby MLTs, indicating a collective behavior. The correlation coefficient gradually decreases with increasing MLT difference, leading to the minimum correlation between the dawn (4–8 MLT) and dusk (14–18 MLT) region, which implies that the flow speeds in the two regions may vary independently and, at times, can vary in an opposite direction.

[8] Underlying patterns of this data set can be identified from PCA, and the first three principal components are presented in Figure 1a, which shows the eigenvector coefficients of each mode in the original coordinates (i.e., MLT bins). The eigenvector coefficient for each MLT bin indicates how important that MLT-bin measurement is for the particular mode. Note that the overall sign of each eigenvector is arbitrary. The first mode (PC1) corresponds to a uniform increase or decrease of convection speed at all MLT bins, as indicated by the same sign and relatively similar magnitude of the coefficients. The second mode (PC2), whose coefficients have opposite signs and extreme values at dawn and dusk, represents an intrinsic dawn-dusk asymmetry. The third mode (PC3) is similar to PC2, but the extremes appear at noon and midnight, indicating a day-night asymmetry. Eigenvectors of higher modes (not shown) have a more complicated dependence on the original coordinates. Note that all the modes are mutually orthogonal, so their component values are uncorrelated with each other.

Figure 1.

(a) Eigenvector coefficients of the first three principal components. (b) Eigenvalue (or variance) of each mode.

[9] To illustrate the relative contribution of each mode to the total variation, the bar length in Figure 1b indicates the eigenvalue (or variance) of each mode. The sum of the eigenvalues is equal to 12. PC1 explains ∼42%, PC2 ∼17%, and PC3 ∼9% of the total variance, followed by higher modes with small variances. There are a number of ways to determine the number of significant principal components. Based on the Kaiser rule [Kaiser, 1960], which recommends retaining only components with eigenvalues of at least one, we find that the first three components are above the noise level. However, considering its marginal variance, we exclude PC3 from the rest of our analysis.
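The arithmetic behind these numbers can be checked with a few lines of Python. The variance fractions below are the approximate values quoted above; the variable names are illustrative only.

```python
n_vars = 12                      # number of MLT bins; the eigenvalues sum to this

# Approximate variance fractions quoted in the text
fractions = {"PC1": 0.42, "PC2": 0.17, "PC3": 0.09}

# Eigenvalue of each mode = variance fraction x sum of eigenvalues
eigenvalues = {mode: f * n_vars for mode, f in fractions.items()}

# Kaiser rule: keep modes whose eigenvalue is at least one
kaiser_modes = [mode for mode, ev in eigenvalues.items() if ev >= 1.0]

# Smallest variance fraction that still passes the Kaiser rule
kaiser_threshold = 1.0 / n_vars

# Variance of PC2 relative to PC1
pc2_to_pc1 = fractions["PC2"] / fractions["PC1"]
```

With 12 standardized variables, the Kaiser threshold corresponds to a variance fraction of 1/12, or about 8.3%. PC3, at ∼9% (eigenvalue ≈ 1.08), passes only narrowly, while the ratio of PC2 to PC1 is about 0.40.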

[10] To identify the key solar wind drivers for PC1 and PC2, we compute correlation coefficients between the component values of the two modes (i.e., P in equation (1), which represents the projection of the original data along the PC modes) and the IMF and solar wind parameters. As shown in Table 1, PC1 is primarily governed by −Bz, and to a lesser extent by the magnitude of By and the solar wind speed, while PC2 is predominantly correlated with IMF By, consistent with previous results [e.g., Heppner and Maynard, 1987]. Note that the correlation is examined with IMF |By| for PC1, but with By itself for PC2. As PC2 is a purely independent mode, the degree of dawn-dusk asymmetry can be represented by |P2|, where P2 is the component value obtained when the observed data are projected along PC2, and we identified that it is roughly proportional to IMF By through the relationship P2 ≈ −0.16 × By. Note the negative slope. When this relationship is combined with the eigenvector coefficients of PC2 (note the signs in Figure 1a) in equation (1), we obtain stronger flows in the dawn sector for By > 0 and stronger flows in the dusk sector for By < 0, and the flow speed difference between the two sectors increases roughly linearly with the IMF By magnitude. Note that IMF By contributes to both PC1 and PC2, which reduces the variance contribution from PC2 relative to that from PC1.

Table 1. Correlation Coefficients
                   −Bz     By or |By|   Vsw     Nsw
PC1 (uniform)      0.42    0.27         0.15    −0.07
PC2 (dawn-dusk)    0.06    0.48         0.01    −0.02

[11] To help visualize the PC modes, Figure 2 shows the average flow speed at each MLT for various IMF By and Bz conditions. In the figure, one can intuitively understand PC1 as a global modulation of the average level, i.e., the flow speeds at all MLTs vary in the same direction (increase or decrease) in response to IMF changes. Similarly, one can recognize PC2 as the dawn-dusk asymmetry, i.e., an asymmetry of the flow speed between the dawn and dusk sectors, with the relative convection strength between the two sectors predominantly determined by the sign of IMF By. Figure 2 clearly illustrates stronger dawn flows for By > 0 and stronger dusk flows for By < 0, and shows that the dawn-dusk asymmetry is more pronounced for By > 0.

Figure 2.

Average flow speed at each MLT for various IMF By and Bz conditions: (a) for By > 0, (b) for By < 0.

[12] Next, we evaluate the goodness of the flow speed approximation with the two PCA modes. We first compute the transformed coordinate values, P in equation (1), from the observed variables at any given time, and then retain only the P values for PC1 and PC2. The corresponding approximate value X′ is computed from equation (1), and the mean percent error of this approximation is calculated as the ratio of the average error, |X − X′|, to the average of X. The mean percent error when employing only PC1 is ∼20% at noon/midnight and ∼25% at dawn/dusk. Incorporating PC2 does not reduce the error at noon/midnight, but the error at dawn/dusk is substantially reduced to ∼17%. Thus, by employing the two modes, we can approximate the flow speed itself with up to 83% mean accuracy. In the following, we present two demonstrative examples of possible applications of the two-PCA-mode based approach to polar cap convection analysis.
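The truncation and error measure just described can be sketched as follows. This is a hedged Python/NumPy illustration, not the code used in the study: the function name, the sample standard deviation as the scale factor, and the synthetic data (a common signal plus a dawn-dusk-like pattern plus noise) are assumptions.

```python
import numpy as np

def truncated_percent_error(X, n_modes=2):
    """Approximate X with the leading PCA modes; return X' and the per-bin error (%)."""
    C = X.mean(axis=0)
    S = X.std(axis=0, ddof=1)                   # assumed scale factor
    Z = (X - C) / S
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    R = eigvecs[:, np.argsort(eigvals)[::-1]]   # modes sorted by decreasing variance
    P = Z @ R                                   # component values of every mode
    # Keep only the first n_modes component values and transform back
    X_approx = C + S * (P[:, :n_modes] @ R[:, :n_modes].T)
    # Mean percent error: average |X - X'| relative to the average of X, per bin
    err = 100.0 * np.abs(X - X_approx).mean(axis=0) / X.mean(axis=0)
    return X_approx, err

# Synthetic 12-bin data (illustrative only)
rng = np.random.default_rng(0)
n = 400
common = rng.normal(size=(n, 1))
asym = rng.normal(size=(n, 1)) * np.cos(2 * np.pi * (np.arange(12) - 3) / 12)
X = 500.0 + 80.0 * common + 50.0 * asym + 30.0 * rng.normal(size=(n, 12))

X1, err1 = truncated_percent_error(X, n_modes=1)
X2, err2 = truncated_percent_error(X, n_modes=2)
```

In this synthetic case, comparing `err1` with `err2` bin by bin shows where the second mode helps most, which is the kind of comparison reported above for noon/midnight versus dawn/dusk.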

3.1. Flow Speed Interpolation Using Two PCA Modes

[13] As the PCA modes describe correlation information between the flows at different MLTs, it may be possible to make a reasonable guess of missing values using the component values, P, in equation (1) obtained from incomplete information. If we know the flow speed at k (>2) MLT bins, we can estimate the remaining (12 − k) flow speeds using the two-mode approximation, P1 and P2, for which we can write from equation (1)

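$$X_i \approx C_i + S_i\,(R_{i1} P_1 + R_{i2} P_2), \qquad i \in \{\text{the } k \text{ observed MLT bins}\} \tag{2}$$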

This is an over-determined linear system for P1 and P2, which we solve by least squares. Once P1 and P2 are calculated, the missing values of X are calculated using equation (2).
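A minimal sketch of this least-squares step is given below, assuming that the mean C, scale factor S, and eigenvector matrix R have already been obtained from the complete data set. The function name `fill_missing`, the NaN convention for missing bins, and the synthetic C, S, and R used in the usage lines are hypothetical.

```python
import numpy as np

def fill_missing(x_obs, C, S, R, n_modes=2):
    """Estimate missing bins (NaN) of x_obs from the leading PCA modes."""
    known = ~np.isnan(x_obs)
    if known.sum() <= n_modes:
        raise ValueError("need more observed bins than retained modes")
    # One equation per observed bin: x_i - C_i = S_i * sum_j R_ij P_j
    A = S[known, None] * R[known, :n_modes]
    b = x_obs[known] - C[known]
    P, *_ = np.linalg.lstsq(A, b, rcond=None)   # least-squares P1, P2
    x_model = C + S * (R[:, :n_modes] @ P)      # two-mode value at every bin
    # Keep the observations where they exist, model values elsewhere
    return np.where(known, x_obs, x_model), P

# Usage with synthetic quantities (not the eigenvectors of Figure 1a)
rng = np.random.default_rng(1)
R, _ = np.linalg.qr(rng.normal(size=(12, 12)))  # any orthonormal basis
C = np.full(12, 300.0)
S = np.full(12, 80.0)
P_true = np.array([1.5, -0.7])
x_true = C + S * (R[:, :2] @ P_true)            # exactly two-mode data

x_obs = np.full(12, np.nan)
x_obs[[0, 4, 8]] = x_true[[0, 4, 8]]            # bins 0-2, 8-10, 16-18 MLT known
x_filled, P_est = fill_missing(x_obs, C, S, R)
```

Since the synthetic vector is built from exactly two modes, three observed bins are enough to recover it; with real data the fit is only approximate and the residual at the observed bins indicates how well the two modes describe that time step.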

[14] One example of such a reconstruction using our 5-year complete set is shown in Figure 3a, where we assume that data at a certain time are available only for the three MLT bins (here 0–2, 8–10, and 16–18) marked by black squares in the plot. Although the piecewise linear interpolation (green), the cubic spline interpolation (blue), and the PCA two-mode reconstruction (red) show roughly the same percentage error, the PCA-reconstructed series retains much of the structure of the original data (black), such as the minimum at ∼6–8 MLT and the maximum at ∼15 MLT.

Figure 3.

(a) An example of missing value approximation using the two PCA modes; observed data (black), linear interpolation (green), cubic spline interpolation (blue), and PCA reconstruction (red). (b) (top) Time series of the reconstructed flow speed for 4–6 MLT (red) and observed data (black). (bottom) Number of missing values at each time.

[15] Next, we applied the reconstruction technique to an actual flow speed time series. The black solid line in Figure 3b (top) represents the observed flow speed for 4–6 MLT from Jan. 14 to Jan. 19, 2003. The gaps in the line indicate missing observations, which cover about 12% of the total time interval plotted. At each time step, the number of MLT bins with missing values ranges from 0 to 8, with a mean of 2.3, as shown in Figure 3b (bottom). The red line in Figure 3b (top) presents the data reconstructed using the two-mode approximation, now without data gaps, showing overall reasonable agreement with the observed data. Unlike conventional interpolations (such as piecewise linear or cubic spline), which fill in data using only local information, the two-mode approximation yields physically meaningful filling of missing values, because the PCA modes represent underlying physical modes reflected in the statistical coherence of the observed data. On the other hand, the PCA-based modeling does not reproduce well spatially or temporally localized abrupt changes such as the ones indicated by gray arrows. Note that, while PCA facilitates description and modeling of the common modes with better statistical quality than studying each observed variable separately, localized abrupt changes are not well reflected in the PCA modes. Thus, a hybrid approach may be promising: first describe the collective motion with the two PCA modes, and then analyze local characteristics directly wherever the difference between the observed flow speed and the two-mode predicted value is large.

3.2. Simple Linear Regression With Solar Wind

[16] The PCA technique may give another option for statistical modeling of the flow speed response to solar wind input. Instead of building a model between solar wind input and flow speed directly, we may first build a model between solar wind input and the PCA components, and then compute the flow speed using equation (1). For the first stage, we employ a simple linear model of the response of the two PCA modes:

display math

where j = 1, 2. The two-stage approach can be beneficial because of the dimension reduction and the enhanced input sensitivity, i.e., a stronger solar wind correlation is observed for the two PCA modes than for any of the individual MLT-bin average flow speeds.
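The first stage of such a two-stage model, together with the R-squared measure of its goodness of fit, might be sketched as below. The predictors chosen here (−Bz and |By| for PC1, By for PC2) are an assumption motivated by Table 1 and are not necessarily the terms of the model actually fitted in this study; the function name and the synthetic series are likewise illustrative.

```python
import numpy as np

def linear_fit_r2(predictors, y):
    """Ordinary least-squares fit of y on the predictors plus a constant term."""
    A = np.column_stack([np.ones(len(y)), predictors])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    total = y - y.mean()
    r2 = 1.0 - (resid @ resid) / (total @ total)   # coefficient of determination
    return coef, r2

# Synthetic stand-ins for the IMF and the PCA component values (not real data)
rng = np.random.default_rng(2)
n = 1000
bz = rng.normal(size=n)
by = rng.normal(size=n)
p1 = 0.8 * (-bz) + 0.3 * np.abs(by) + rng.normal(size=n)
p2 = -0.16 * by + 0.3 * rng.normal(size=n)

# Stage 1: one linear model per retained PCA mode
coef1, r2_1 = linear_fit_r2(np.column_stack([-bz, np.abs(by)]), p1)
coef2, r2_2 = linear_fit_r2(by, p2)
```

In the second stage, the predicted P1 and P2 would be converted back to flow speeds at all MLT bins through equation (1), so only two regressions are needed instead of one per MLT bin.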

[17] In order to compare the goodness of fit, Table 2 presents R-squared (Rsq) for the fits from the direct modeling and from the two-stage modeling. The same linear modeling is conducted separately for data sets of three IMF cases, i.e., North (+) Bz, South (−) Bz, and all Bz. Note that for the +Bz and −Bz data sets we find the same PC1 and PC2 modes; their precise coefficients differ from those shown in Figure 1a for the all Bz case, but their eigenvalues are not significantly different, as shown in Figure 4a. This indicates that the collective evolution of the MLT-bin average flow speeds retains the same underlying pattern regardless of the IMF Bz orientation, while the average flow vector pattern can differ with the sign of IMF Bz. The first row of Table 2 is obtained from the direct approach: we fit the linear model to the flow speed at each MLT bin and take the best Rsq value. The second row shows Rsq for PC1 and PC2. The Rsq is quite small for all the fits, which is possibly due to the use of a simple linear model. However, what is worth noting here is that for all three IMF cases, the PCA fitting gives a result that is nearly equivalent to, or better than, the direct fitting.

Table 2. Goodness of Fit
                            Rsq (+Bz)     Rsq (−Bz)     Rsq (all Bz)
Best Fit for direct model   0.16          0.17          0.22
PC1, PC2                    0.19, 0.23    0.21, 0.21    0.32, 0.23

Figure 4.

(a) Eigenvector coefficients of the first three modes and variance of each mode for (left) North Bz and (right) South Bz. (b) Rsq for the direct fitting for the all Bz case (black), with the x-axis label denoting a bin number, and Rsq for the PCA mode fitting (red), with the x-axis label denoting a mode number.

[18] The reason why the fit is better for PC1 and PC2 can be understood from Figure 4b, where the black line denotes Rsq for the direct fitting for the all Bz case at each MLT bin, with the bin number as the x-axis label, and the red line denotes Rsq for the PCA mode fitting, with the x-axis label denoting the mode number. Note that the direct approach gives a more-or-less uniform Rsq across the MLT bins. On the other hand, the PCA Rsq is higher for the first two modes and sharply decreases to a negligible value for the other modes. This again indicates that the flow speeds have collective motions which act as a common driver with a similar statistical behavior. Therefore, using the first two modes, we obtain the PCA mode fitting with greater statistical confidence, which comes at the cost of marginal statistical significance for the higher modes.

4. Summary and Conclusion

[19] The purpose of this paper is to describe a PCA of polar cap convection and illustrate its potential applications. Two primary modes of polar cap convection have been identified, which correspond to a uniform modulation of the global convection and a dawn-dusk asymmetry, respectively. The first mode explains ∼42% of the total variance, and the second mode explains about 40% of the variance accounted for by the first mode. PCA also identified a linear relationship between the degree of dawn-dusk asymmetry and IMF By, which leads to stronger dawn flows for By > 0 and stronger dusk flows for By < 0, with the flow speed difference between the two sectors increasing roughly linearly with the By magnitude. Demonstrative examples suggest that the PCA-mode based approach with dimension reduction may be a useful tool for the analysis of polar cap convection, such as estimating missing values or modeling the average response of the flow speed to solar wind input. We also found that the same modes exist regardless of the IMF Bz orientation, implying that they are intrinsic properties of the average polar cap convection.

[20] A similar analysis could be applied to a database that is also binned in magnetic latitude and covers a larger region of the ionosphere, in order to identify primary underlying latitudinal-longitudinal structures (if any) of the ionospheric convection. Achieving dimension reduction may allow PCA-mode based modeling of the global ionospheric convection. For a more reliable analysis, the effect of measurement error on PCA will need to be properly treated, for example by using the method called Maximum Likelihood Principal Component Analysis [e.g., Wentzell and Lohnes, 1999].

Acknowledgments

[21] This work was supported in part at UCLA by NASA grant NNX09AJ72G and NSF grant AGS1042255. The OMNI solar wind data were obtained from the GSFC/SPDF OMNIWeb interface at http://omniweb.gsfc.nasa.gov.

[22] The Editor thanks two anonymous reviewers for assisting in the evaluation of this paper.