Engaging uncertainty in hydrologic data sets using principal component analysis: BaNPCA algorithm



[1] Principal component analysis (PCA) is the most widely used method for dimensionality reduction, data reconstruction, feature extraction, and data visualization in the geosciences. In its standard form, however, PCA treats all data points alike, even when their associated measurement errors vary in both space and time. Using sea surface temperature (SST) data as a backdrop, a Bayesian variant of noisy principal component analysis (BaNPCA) was developed to incorporate observation uncertainty when performing PCA. The algorithm was first assessed using synthetic data sets. Comparison of BaNPCA results with current PCA techniques showed that BaNPCA has lower data reconstruction error; that is, for a given number of principal components, it explains more variance in the SST data. Using the automatic relevance determination method, BaNPCA correctly identified the appropriate number of principal components in the data. BaNPCA also showed distinct advantages over existing methods in filling missing values in the data. In addition, the principal vectors extracted by BaNPCA were found to be smoother and more representative of large-scale signals such as the El Niño–Southern Oscillation and the Pacific Decadal Oscillation. To classify extreme states of All India summer monsoon rainfall, we used robust optimization that takes the principal components (PCs), along with the uncertainty computed by the BaNPCA algorithm, as inputs, thus engaging uncertainty in the data. Results from this study demonstrate the value of utilizing the uncertainty information available with hydrologic data sets.
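
The idea of letting per-observation uncertainty inform a low-rank decomposition can be illustrated with a minimal sketch. The code below is not the BaNPCA algorithm: it swaps in a plain, non-Bayesian weighted low-rank fit solved by alternating least squares, and it assumes that a standard deviation `sigma` is known for every data value. The function names, the `ridge` stabilizer, and the use of the unweighted column mean for centering are all illustrative assumptions.

```python
import numpy as np


def pca_reconstruct(X, k):
    """Rank-k reconstruction of X with standard PCA (SVD of the centered data)."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean + (U[:, :k] * s[:k]) @ Vt[:k]


def weighted_lowrank_reconstruct(X, sigma, k, n_iter=50, ridge=1e-6):
    """Rank-k fit that down-weights noisy entries (illustrative, not BaNPCA).

    Approximately minimizes sum(((X - mean - A @ B.T) / sigma) ** 2) by
    alternating least squares, starting from the standard PCA solution.
    Centering uses the plain column mean to keep the sketch simple.
    """
    w = 1.0 / sigma**2                      # inverse-variance weights
    mean = X.mean(axis=0)
    Xc = X - mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    A = U[:, :k] * s[:k]                    # scores, shape (n, k)
    B = Vt[:k].T.copy()                     # loadings, shape (p, k)
    reg = ridge * np.eye(k)                 # small ridge keeps the solves stable
    for _ in range(n_iter):
        for i in range(X.shape[0]):         # update scores of each sample
            Bw = B * w[i][:, None]
            A[i] = np.linalg.solve(B.T @ Bw + reg, Bw.T @ Xc[i])
        for j in range(X.shape[1]):         # update loadings of each variable
            Aw = A * w[:, j][:, None]
            B[j] = np.linalg.solve(A.T @ Aw + reg, Aw.T @ Xc[:, j])
    return mean + A @ B.T


if __name__ == "__main__":
    # Synthetic rank-2 signal with noise whose size varies entry by entry.
    rng = np.random.default_rng(0)
    n, p, k = 200, 30, 2
    truth = rng.normal(size=(n, k)) @ rng.normal(size=(k, p))
    sigma = rng.uniform(0.1, 2.0, size=(n, p))
    X = truth + sigma * rng.normal(size=(n, p))

    for name, rec in [("standard PCA", pca_reconstruct(X, k)),
                      ("weighted fit", weighted_lowrank_reconstruct(X, sigma, k))]:
        rmse = np.sqrt(np.mean((rec - truth) ** 2))
        print(f"{name}: RMSE against noise-free signal = {rmse:.3f}")
```

Because the weighted fit starts from the standard PCA solution and each update lowers the ridge-regularized weighted misfit, its uncertainty-weighted residual should come out below that of standard PCA. A Bayesian treatment such as the one described above would additionally yield posterior uncertainty on the components and a principled choice of the number of components, neither of which this sketch attempts.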