Improved Forecast Skill Through the Assimilation of Dropsonde Observations From the Atmospheric River Reconnaissance Program

Landfalling atmospheric rivers (ARs) over the western US are responsible for ∼30%–50% of the annual precipitation, and their accurate forecasts are essential for aiding water management decisions and reducing flood risks. Sparse coverage of conventional observations over the Pacific Ocean, which can cause inadequate upstream initial conditions for numerical weather prediction models, may limit the improvement of forecast skill for these events. A targeted field program called AR Reconnaissance (Recon) was initiated in 2016 to better understand and reduce forecast errors of landfalling ARs at 1–5 days lead times. During the winter seasons of 2016, 2018, and 2019, 15 Intensive Observation Periods (IOPs) sampled the upstream conditions for landfalling ARs. This study evaluates the impact on forecast accuracy of assimilating these dropsonde data. Data denial experiments with (WithDROP) and without (NoDROP) dropsonde data were conducted using the Weather Research and Forecasting model with the Gridpoint Statistical Interpolation four‐dimensional ensemble variational system. Comparisons between the 15 paired NoDROP and WithDROP experiments demonstrate that AR Recon dropsondes reduced the root‐mean‐square error in integrated vapor transport (IVT) and inland precipitation for more than 70% of the IOPs, averaged over all forecast lead times from 1 to 6 days. Dropsondes have improved the spatial pattern of forecasts of IVT and precipitation in all 15 IOPs. Significant improvements in skill are found beyond the short range (1–2 days). IOP sequences (i.e., back‐to‐back IOPs every other day) show the most improvement of inland precipitation forecast skill.

Despite the growing awareness of the key roles of landfalling ARs in extreme weather and water events, operational NWP models still have a significant margin for improvement when forecasting the landfalling metrics (Martin et al., 2019; Wick et al., 2013). To improve the sampling of ARs and the forecast skill of landfalling events over the western US, the Atmospheric River Reconnaissance field program (AR Recon, Ralph et al., 2020) was undertaken over several winter seasons to collect more observational data within and near ARs over the Northeastern Pacific. AR Recon collected dropsonde and ancillary data in dynamically active regions where moisture convergence, precipitation, and latent heating influence frontogenesis (Lackmann, 2002) and cyclogenesis (Davis, 1992; Zhang & Ralph, 2021), which eventually modify downstream precipitation. Since 2018, observation targeting based on sensitivity analysis methods has been implemented using adjoint techniques (Demirdjian et al., 2020; Doyle et al., 2014, 2019; Errico, 1997; Reynolds et al., 2019) and ensemble-based methods (Ancell & Hakim, 2007; Chang et al., 2013; Torn & Hakim, 2008; Zhang et al., 2007; Zheng et al., 2013). Ensemble-based sensitivity analyses emphasize synoptic-scale differences between ensemble members that are associated with significant weather features at the initial time or an earlier forecast time. This technique has been used successfully in past field campaigns, particularly for hurricanes and severe weather events (Majumdar, 2016; Romine et al., 2016). The moist adjoint sensitivity calculations available from the Naval Research Laboratory (NRL) Coupled Ocean-Atmosphere Mesoscale Prediction System (COAMPS) are demonstrated to be robust in identifying valid areas for targeted observations of ARs (Doyle et al., 2014; Reynolds et al., 2019).
Most of the previous targeted missions showed an overall positive impact on forecast skill for a variety of high-impact weather events (Aberson, 2010; Feng & Wang, 2019; Joly et al., 1999; Langland, 2005; Langland et al., 1999; Pu et al., 2008; Romine et al., 2016; Schindler et al., 2020; Szunyogh et al., 2000; Weissmann et al., 2011), specifically in short-range (<3 days) forecasts. For example, dropsonde observations collected from the North Atlantic Waveguide and Downstream Impact Experiment (NAWDEX) campaign reduced mean forecast error in the 500-hPa geopotential height by 1%-3% over Europe and the North Atlantic domain with the European Centre for Medium-Range Weather Forecasts (ECMWF) global model (Schindler et al., 2020). Other missions showed neutral impact on model forecast skill (Tong et al., 2018). Lavers et al. (2018) used the independent AR Recon dropsonde data to show that the water vapor flux forecast errors in ARs using the ECMWF Integrated Forecasting System (IFS) are most strongly associated with the initial conditions in 850 hPa wind, which is typically near the top of the planetary boundary layer, and that the amplitude of the model errors is about 20% of the mean observed water vapor flux. Stone et al. (2020) found that AR Recon soundings collected during the winter of 2018 have a significant positive impact in the Navy Global Environmental Model (NAVGEM) using a method called the Forecast Sensitivity Observation Impact (e.g., Lorenc & Marriott, 2014), and that the per-observation impact of AR Recon soundings in NAVGEM is more than double that of the North American radiosonde network. A recent study by Zheng, Delle Monache, Wu, et al. (2021) found that AR Recon can fill observation gaps from near the surface to the middle troposphere by providing high-quality in-situ observations within and near ARs, where all-sky radiances either have degraded quality or have been rejected by the data assimilation system due to heavy precipitation.
This enhances observations over regions with the highest adjoint sensitivity for downstream 1-2-day precipitation forecasts. These works argue that high-vertical-resolution profiles from AR Recon dropsondes can fill critical data gaps, bring new information to operational model analyses, and have the potential to improve forecasts of landfalling ARs.
The goal of this work is to investigate the impact of assimilating these AR Recon data on the model forecast skill for landfalling ARs and their associated inland precipitation in a regional implementation of the Weather Research and Forecasting (WRF, Skamarock et al., 2008) model called West-WRF that is tailored to predictions over the western US (Martin et al., 2018). A set of data denial experiments based on three winter seasons of observations for 15 IOPs (Table S1 in Supporting Information S1) from AR Recon (2016, 2018, and 2019) were conducted to examine the impacts of this data set (Zheng, Delle Monache, Wu, et al., 2021). Specifically, this study explores how AR Recon data ingested by the model modify the initial conditions in the Northeastern Pacific Ocean and affect forecasts of ARs and precipitation over the western US. Below are the three motivational questions:
• How do AR Recon data modify initial conditions in the presence of ARs in the regional West-WRF model?
• What is the impact of AR Recon data on the model forecast skill for ARs?
• Have the AR Recon data improved the overall precipitation forecast skill over the US West?
Section 2 describes the model configuration and experiment setup. Section 3 investigates the data impact in a case study in February 2018. Section 4 quantifies the overall data impact on the forecasts of ARs for the complete data set using several forecast and precipitation metrics. The discussions in Section 5 examine the variation of impact across cases and offer interpretations of the results. Section 6 summarizes the main conclusions from the study and discusses future work.

Model Configuration
The model we used for this study is a research version of the Center for Western Weather and Water Extremes (CW3E) operational West-WRF model, which is tuned to optimize predictions for the western US region (Martin et al., 2018). West-WRF is based on the Advanced Research WRF (WRF ARW) model, version 3.9.1.1. West-WRF adds value to forecast fields compared to the global forcing model, especially for the orographic precipitation and horizontal transport of moisture that are frequently associated with landfalling ARs (see Martin et al., 2018 for more details). Two domains were configured using horizontal resolutions of 9 and 3 km (denoted D01 and D02 in Figure 1a). Both domains are configured with 48 layers in the vertical with a model top at 10 hPa. The main configuration is as follows: (a) new Thompson microphysics scheme (Thompson et al., 2008); (b) Yonsei University planetary boundary layer scheme (Hong et al., 2006); (c) Rapid Radiative Transfer Model for long-wave and short-wave radiation (Chou & Suarez, 1994; Mlawer et al., 1997; Matsui et al., 2018); (d) Noah-MP land surface model (Niu et al., 2011); and (e) Grell-Freitas scheme (Grell & Freitas, 2014) for cumulus parameterization (only for the outer 9-km domain). The National Centers for Environmental Prediction (NCEP) Global Forecast System (GFS) operational analysis/forecast products at 0.25° × 0.25° lat-lon grids are used for initial and boundary conditions of the outer domain D01. The inland precipitation for the 3-km domain is validated across all AR Recon Intensive Observation Periods (IOPs) over a common domain as shown in Figure 1b. The verification region for integrated water vapor (IWV) and integrated vapor transport (IVT) spans 105°-165°W and 15°-60°N. IVT and IWV are calculated following the definitions proposed by Zhu and Newell (1998), in which g is the gravitational acceleration, q is the specific humidity, V_h represents the horizontal wind vector, and p_sfc is the surface pressure.
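As an illustration of how these quantities are commonly evaluated on pressure-level data, the sketch below assumes the standard forms IVT = |(1/g) ∫ q V_h dp| and IWV = (1/g) ∫ q dp, integrated from p_sfc to an upper limit of 300 hPa (the limit stated in the Verification Data section). The function name, the trapezoidal rule, and the value of g are assumptions made for this example and are not taken from the West-WRF code.

```python
import numpy as np

G = 9.80665  # gravitational acceleration (m s^-2); assumed standard value


def _trapz(y, x):
    """Trapezoidal integral of y over x for 1-D arrays."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


def ivt_iwv(p_hpa, q, u, v, p_top_hpa=300.0):
    """Return (IVT in kg m^-1 s^-1, IWV in kg m^-2) for a single column.

    p_hpa : pressure levels in hPa, ordered from the surface upward
    q     : specific humidity (kg kg^-1) on those levels
    u, v  : horizontal wind components (m s^-1) on those levels
    Levels above p_top_hpa are ignored; include a level at p_top_hpa
    if the integral should end exactly there.
    """
    p_hpa, q, u, v = (np.asarray(a, dtype=float) for a in (p_hpa, q, u, v))
    keep = p_hpa >= p_top_hpa
    p_pa = p_hpa[keep] * 100.0  # hPa -> Pa
    q, u, v = q[keep], u[keep], v[keep]
    # Pressure decreases along the array, so dp < 0; the minus sign
    # makes the column integrals positive.
    qu = -_trapz(q * u, p_pa) / G
    qv = -_trapz(q * v, p_pa) / G
    iwv = -_trapz(q, p_pa) / G
    return float(np.hypot(qu, qv)), iwv


# Idealized column: constant humidity and a purely zonal wind.
levels = np.linspace(1000.0, 300.0, 15)
ivt, iwv = ivt_iwv(levels, np.full(15, 0.005), np.full(15, 10.0), np.zeros(15))
```

Summing the zonal and meridional fluxes separately before taking the magnitude keeps the result a vector-integrated transport rather than an integral of wind speed.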

Data Assimilation System and the Experiment Design
For this study, we used the Gridpoint Statistical Interpolation (GSI) hybrid four-dimensional ensemble-variational (4D-EnVar) data assimilation system (Kleist & Ide, 2015; Wang & Lei, 2014). This hybrid method combines the flow-dependent background error-covariance matrix with the static error-covariance matrix, and 4D indicates that it can generate a 4D analysis trajectory to allow for the appropriate assimilation of observations that are not taken at the same time. The ensemble input for the hybrid 4D-EnVar utilizes dynamically downscaled forecasts at a 9-km horizontal resolution forced by the 20-member NCEP Global Ensemble Forecast System (GEFS) and the 20-member Canadian Ensemble Forecasts from the Meteorological Service of Canada (MSC), both of which are available at 0.5-degree horizontal grid spacing. Data assimilation was performed for a 6-h time window centered at 0000 UTC for each AR Recon IOP (Table S1 in Supporting Information S1). The first guess data are the hourly outputs from West-WRF initialized from 1800 UTC on the previous day. Seven separate instances in time are used for the first guess and ensemble input during the 6-h assimilation window to capture the fast-moving nature of ARs. The ensemble input from the aforementioned global models was initialized from 1200 UTC on the previous day, which allows for the growth of perturbations as a reliable representation of flow-dependent forecast errors.
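The timing of the assimilation window can be made concrete with a short sketch. It assumes the seven instances are spaced hourly across the 6-h window (consistent with the hourly first-guess output) and uses the IOP7 analysis time of 0000 UTC 3 February 2018 as an example; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def window_slots(analysis_time, half_width_h=3, step_h=1):
    """Times of the background instances in a window centered on analysis_time."""
    start = analysis_time - timedelta(hours=half_width_h)
    n_slots = (2 * half_width_h) // step_h + 1
    return [start + timedelta(hours=i * step_h) for i in range(n_slots)]


def lead_hours(init_time, valid_times):
    """Forecast lead (h) of each valid time for a run started at init_time."""
    return [int((t - init_time).total_seconds() // 3600) for t in valid_times]


analysis = datetime(2018, 2, 3, 0)              # 0000 UTC analysis time
slots = window_slots(analysis)                  # 2100 UTC to 0300 UTC, hourly
fg_init = analysis - timedelta(hours=6)         # first guess: 1800 UTC previous day
ens_init = analysis - timedelta(hours=12)       # ensemble: 1200 UTC previous day
fg_leads = lead_hours(fg_init, slots)           # 3 to 9 h
ens_leads = lead_hours(ens_init, slots)         # 9 to 15 h
```

Under these assumptions the first guess enters the window at 3-9 h lead and the ensemble at 9-15 h lead, which is the extra spin-up time the text credits with allowing perturbations to grow.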
The observations assimilated for each IOP include conventional observations in the PrepBUFR file from the NCEP Global Data Assimilation System (GDAS), remotely sensed refractivity from GNSS Radio Occultation (RO), and atmospheric motion vector winds (AMV, Velden et al., 2005). The observations of temperature, humidity, and horizontal wind from the AR Recon flight-level data and dropsondes are included in the GDAS PrepBUFR file. Note that satellite radiances are not assimilated in this analysis. A similar observational network was used in previous studies to investigate data impact for other campaign programs such as the Mesoscale Predictability Experiment (MPEX, Romine et al., 2016).
Two sets of parallel experiments were performed: (a) WithDROP, which assimilated the PrepBUFR data (including AR Recon flight-level and dropsonde profiles), GPS RO refractivity, and AMVs, and (b) NoDROP, which assimilated the same observations as WithDROP except for the AR Recon data, which were removed from the PrepBUFR data. Assimilation occurs on the 9-km domain only, which provides the initial conditions for the 3-km domain, and subsequent forecasts are initialized on both the 9-km and 3-km grids from that assimilation procedure. Initial condition and boundary condition data for both experiments can be accessed from . Extended 144-h forecasts are produced on both the 9-km and 3-km grids beginning at the 0000 UTC initialization for each IOP. Note that cycling of data was not performed in these numerical experiments so that a clean impact from each individual AR Recon flight can be seen. In other words, any differences between these two parallel simulations are predominantly attributable to the ingestion of the AR Recon data. While cycling allows observations from prior cycles to inform the current analysis, these regional modeling experiments do not assimilate data from the entire global observing system. Therefore, boundary condition errors, particularly errors over the southern boundary, can interact with the dropsonde impacts after a few cycles, interfering with the interpretation of the dropsonde signals and making them hard to separate from these model errors. Hodyss and Majumdar (2007) specifically discussed the contamination of data impacts in a global model by initially small mesoscale instabilities. With the non-cycled runs, we try to avoid contamination by accumulated boundary errors that might complicate the conclusions about observation impacts.
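The data denial design amounts to running the same assimilation twice with observation sets that differ only in the AR Recon data. The sketch below illustrates that pairing on a toy observation list; the `Obs` record and the platform labels are hypothetical stand-ins and do not reflect the actual PrepBUFR report types or the GSI interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obs:
    platform: str   # hypothetical label, e.g. "radiosonde", "dropsonde", "amv"
    variable: str   # e.g. "t", "q", "u", "v", "refractivity"
    value: float


# Platforms treated as AR Recon data in this illustration (an assumption).
AR_RECON_PLATFORMS = frozenset({"dropsonde", "flight_level"})


def build_experiments(observations):
    """Return the observation sets for the paired WithDROP/NoDROP runs."""
    with_drop = list(observations)
    no_drop = [o for o in observations if o.platform not in AR_RECON_PLATFORMS]
    return {"WithDROP": with_drop, "NoDROP": no_drop}


obs = [
    Obs("radiosonde", "t", 271.3),
    Obs("dropsonde", "q", 0.0042),
    Obs("flight_level", "u", 31.0),
    Obs("gnss_ro", "refractivity", 251.0),
    Obs("amv", "v", -8.5),
]
experiments = build_experiments(obs)
```

Because everything else (first guess, ensemble, boundary conditions, model configuration) is held fixed, the difference between the two resulting forecasts isolates the effect of the withheld observations.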

Verification Data
To verify the initial conditions and the forecasts of different experiments from the West-WRF/GSI system, the fifth ECMWF Reanalysis (ERA5) product (Hersbach et al., 2020;Simmons et al., 2020) is chosen. IVT and IWV are analyzed to characterize the kinematic and moisture signatures of ARs. Both variables are integrated vertically from the surface to 300 hPa (Cordeira et al., 2013). Additionally, key meteorological variables such as temperature, horizontal winds, and moisture are also investigated.
The comparisons with ERA5 will be presented for forecast validation. The overall verification results with the NCEP GFS Final (FNL) Analysis (NCEP, 2015) are generally consistent with those obtained with ERA5 (not shown). Though ERA5 is not the "truth," the model and assimilation system used to generate this product are acknowledged to be the state of the science, and ERA5 has outperformed other reanalysis products overall for other weather phenomena (e.g., Mahto & Mishra, 2019). Note that the ERA5 product assimilated temperature and wind collected during AR Recon in 2018 and 2019. The Stage-IV hourly precipitation product (Lin & Mitchell, 2005) at 4 km horizontal resolution was utilized to validate the precipitation forecast skill over the verification domain (110-125°W, 30-50°N; Figure 1b).

Synoptic Overview of IOP7
The case we are investigating in detail is an AR that made landfall over the Pacific Northwest from February 3 to 4, 2018. This case was chosen for three reasons: (a) it was a classic landfalling AR that brought precipitation to the Pacific Northwest, where the close relation between a landfalling AR and inland precipitation is well established in the literature (e.g., Rutz et al., 2014); (b) this IOP sampled most of the sensitive targets detected by the NRL adjoint model, whereas in some other cases the targets were too far away for an aircraft to sample or only a limited number (one or two) of aircraft were available; (c) this IOP had one NOAA G-IV and two Air Force C-130 aircraft so that impacts for observations from all three aircraft could be represented. Note that this case is a prototype IOP that represents positive impact IOPs. Results for all IOPs with positive impacts and neutral/negative impacts are summarized and discussed in Sections 4 and 5.
At 1200 UTC February 2 (Figure 2a), a large, broad AR was located between a deep Aleutian low centered near 167°W, 43°N and a high pressure system near 131°W, 38°N. The maximum IVT was ∼1,100 kg m⁻¹ s⁻¹ near 146°W, 43°N, with the upper-level North Pacific jet core ∼5° longitude west of the IVT core. At this time, the leading edge of this AR reached Vancouver Island (VI). A nearby low-pressure system was located off the coast of British Columbia (BC) with its associated frontal system extending further inland. Twelve hours later (Figure 2b), the magnitude of the AR over the primary IVT core had reduced by ∼100 kg m⁻¹ s⁻¹. In contrast, on the northeastern side, the AR had strengthened from a peak value of 450 to 800 kg m⁻¹ s⁻¹, forming a secondary AR core offshore of VI. The leading edge of the AR had a plume of IVT extending further inland over Oregon and Washington. Its associated extratropical cyclone moved to the north of VI with a central pressure of 1,008 hPa. A total of 86 dropsondes were deployed for assimilation at 0000 UTC February 3. The dropsonde flights transected the AR structure six times (Figure 2b). At 1200 UTC February 3 (Figure 2c), the AR continued making landfall and the IVT orientation became more northwesterly, directing more moisture further inland from the northeastern Pacific. A short-lived trough of low pressure formed in the Pacific Northwest and, together with the enhanced IVT, created moisture convergence in the lower troposphere that was favorable for inland precipitation. Secondary frontal waves started to develop downstream of the Aleutian low at 0000 UTC February 4 (Figure 2d). Even though the intensity of the broad AR weakened overall, the leading edge of the AR over the Pacific Northwest remained similar in amplitude and extension. After 12 h (Figure 2e), the AR split into two parts, the main part had a core IVT of ∼800 kg

Innovation and Analysis Increment
The impact of a given observation type depends on the observation errors and the difference between the given observations and the first guess. The latter is often referred to as the "innovation." In this study, we define the "innovation" as the difference between the first guess and an observation, which shows the model errors. Accordingly, we define the "analysis residual" as the difference between the model analysis and an observation. Figure 3 shows the innovation and analysis residual for the dropsonde observations used for this case in the WithDROP experiment. Model state vectors were converted to dropsonde observation space so that the model first guess and analysis can be compared directly to the observations. Compared to the dropsondes, the first guess was colder below the 300 hPa pressure level, with a mean difference of −1 K near 950 hPa (Figure 3a), while a warm difference of ∼0.5 K was observed above the 300 hPa pressure level. The largest root-mean-square error (RMSE) was also found near 950 hPa, and the RMSE in the upper levels exceeded the assigned observation errors, suggesting the dropsondes contained new information for the model background. After assimilating the observations that included AR Recon data, the temperature mean difference from 200 to 800 hPa was completely removed and the maximum bias near 950 hPa was reduced by 40% (Figure 3b). The RMSE was reduced by ∼40%-50% from 200 to 700 hPa and by ∼15%-30% below the 700 hPa pressure level (Figure 3b). The specific humidity showed that the first guess was moister than the observations below the 500 hPa pressure level, and the assimilation process removed most of the bias except for the layer at 600-700 hPa, where the maximum difference existed (Figures 3c and 3d). The first guess RMSE for humidity peaked from 550 to 850 hPa and decreased by ∼20%-40% after the assimilation (Figures 3c and 3d).
After assimilation, over 75% of the wind differences (Figures 3e-3h) were removed except for the southeasterly difference at the lowest levels (950-1,000 hPa). The RMSE of wind was maximized near 800 hPa in the lower troposphere and near 250 hPa in the upper troposphere (Figures 3e and 3g). It is noteworthy that the RMSE was much larger than the observation error throughout the troposphere for horizontal winds, indicating that there was valuable new information contained in the wind observations from dropsondes. The largest RMSE reduction attributable to assimilation was 40%-60% from 200 to 500 hPa for horizontal winds (Figures 3f and 3h). The lower troposphere exhibited 15%-30% reduction of RMSE for winds below 500 hPa (Figures 3f and 3h).
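The profile statistics discussed above can be sketched as follows, using the sign convention of this study (model minus observation) for both the innovation and the analysis residual. The binning by pressure layer, the function name, and the sample numbers are assumptions made for illustration; they are not the values plotted in Figure 3.

```python
import numpy as np


def departure_stats(model_at_obs, obs, p_hpa, bin_edges_hpa):
    """Mean difference and RMSE of (model - obs) within pressure layers.

    model_at_obs : model values already interpolated to observation space
    obs          : observed values at the same locations
    p_hpa        : observation pressures (hPa)
    bin_edges_hpa: increasing layer edges (hPa); layers are [lo, hi)
    Returns a list of (lo, hi, mean_difference, rmse) tuples.
    """
    diff = np.asarray(model_at_obs, dtype=float) - np.asarray(obs, dtype=float)
    p = np.asarray(p_hpa, dtype=float)
    stats = []
    for lo, hi in zip(bin_edges_hpa[:-1], bin_edges_hpa[1:]):
        in_layer = (p >= lo) & (p < hi)
        if not in_layer.any():
            stats.append((lo, hi, float("nan"), float("nan")))
            continue
        d = diff[in_layer]
        stats.append((lo, hi, float(d.mean()), float(np.sqrt(np.mean(d ** 2)))))
    return stats


# Made-up temperatures (K) at two dropsonde levels near 950 hPa.
obs_t = [280.0, 281.0]
first_guess_t = [279.0, 280.0]   # first guess colder than observed
analysis_t = [279.6, 280.6]      # analysis drawn toward the observations
pressures = [950.0, 940.0]
edges = [900.0, 1000.0]

innovation = departure_stats(first_guess_t, obs_t, pressures, edges)
residual = departure_stats(analysis_t, obs_t, pressures, edges)
```

Comparing the two sets of statistics layer by layer gives the percentage reductions in mean difference and RMSE that the text reports for temperature, humidity, and wind.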

Impact on Thermodynamic Variables and IVT
To examine how the dropsonde observations modify the initial conditions, we have analyzed the difference between WithDROP and NoDROP. Figure 4 shows a representative cross-section along a flight track made from the cold side of the AR to the warm side to sample the AR sector (Cobb et al., 2021;Ralph et al., 2017;Zheng, Delle Monache, Wu, et al., 2021). The NoDROP experiment showed a moist error overall with respect to ERA5 from 700 to 900 hPa, specifically near regions with large vertical and horizontal humidity gradients such as 150°-152°W and 145°-146°W (Figure 4a). Note that ERA5 and the model output have different resolutions, which might have a slight impact on how gradients are represented at each grid. The lowest level (∼975 hPa) in NoDROP showed weak dry bias west of 147°W and moist bias east of it. The assimilation of dropsonde data corrected the initial conditions by reducing the moist error overall from 600 to 975 hPa (Figures 4b and 4c). For example, the correction near 151.5°W at 850 hPa was ∼60% of the initial moist error in NoDROP and near 145°W at 600 hPa was ∼50% (Figures 4a and 4c). However, the initial weak moist error near 148.5°W from 500 to 600 hPa, where the dry intrusion exists, was amplified in WithDROP by 40%-50% (Figures 4a and 4b).
The NoDROP experiment overall showed weaker wind speed than the ERA5 near the upper-level jet (ULJ) core and the lower-level jets (LLJs), but it showed stronger wind between 145°W and 150°W near 275 hPa (Figure 4d). The assimilation of dropsonde data reduced the positive wind error right below the ULJ core by ∼30% and the positive error near the LLJs by ∼20% (Figures 4e and 4f). However, the stronger wind error near 275 hPa was not significantly modified (Figures 4e and 4f). The most striking impact from dropsonde data was seen in temperature (Figures 4g-4i). A coherent region of sloping cold bias of 1-4 K in the NoDROP run prevailed along the path from 275 to 400 hPa, while a region of warm error of comparable amplitude was just above it at 200 hPa (Figure 4g). The cold error was also seen in the lower troposphere near 600 and 900 hPa (Figure 4g). The assimilation of dropsondes corrected ∼50% of the bias in the upper level from near 200 to 500 hPa, ∼30%-70% of the bias from 550 to 700 hPa, and ∼20%-35% of the bias from 750 to 925 hPa.
In addition to the standard meteorological variables, the layer IVT cross-section was also made along the same path to represent the dynamical and moisture structure of the AR. Figure 5a shows the layer IVT difference between NoDROP and ERA5. For this case in the ERA5 reanalysis, 20 kg m⁻¹ s⁻¹ of layer IVT represents the threshold for the 20th percentile of the AR moisture and wind (Figure 5a) and is used to define the limits of the AR. The layer IVT was highest in one branch of the LLJ around 148.5°W at 800 hPa and a secondary IVT LLJ was near 152.5°W. The NoDROP run underestimated IVT in both LLJs around 800 hPa (Figure 5a) and overestimated the layer IVT between the two jets near 151°W at the same level. The largest overestimation for layer IVT was over the southern flank of the AR near 750 hPa and above the stronger jet in the AR core between 500 and 600 hPa (Figure 5a). The dropsonde data corrected both the underestimated IVT near the AR core and the overestimated IVT above the southern flank by ∼20%-30% (Figures 5b and 5c).
The total IVT amplitude in NoDROP was generally overestimated within the domain except near the areas with magnitude greater than 750 kg m⁻¹ s⁻¹ (Figure 5d). The dropsondes reduced the IVT amplitude along the AR flanks and increased the IVT value near the AR core (Figures 5e and 5f). The absolute error predominantly decreased by ∼30%-75% along both northern and southern flanks of the AR, attributable to the assimilation of AR Recon dropsondes (Figure 5g).

Impact on Temperature and IVT
The forecast of this case was initialized at 0000 UTC February 3, 2018. During the AR Recon operations, the major targeting guidance for this IOP was to improve the forecasts of the landfalling AR and its associated precipitation over the Pacific Northwest from 1200 UTC February 3 to 1200 UTC February 4 based on the areas of highest sensitivity identified using the NRL adjoint sensitivity tool. Therefore, we assessed the impact of the dropsondes on the 24-h forecast for IVT valid at 0000 UTC February 4, which is in the middle of the targeted precipitation window. At the initial time, the 300-hPa temperature in the NoDROP experiment showed a cold bias above the AR region (Figures 4c and 6a) that seemed to be associated with the uncertainty in simulating the upper-level clouds (Figure S1 in Supporting Information S2). The dropsonde data helped reduce the cold error by ∼50% near 151°W, 45°N and over the eastern part of the AR (Figures 6c and 6e). The dropsondes from the western flight track created a dipole structure of temperature anomaly (Figure 6e). In the 24-h forecast, the cold error in NoDROP was amplified and moved eastward close to the North American coast (Figure 6b). The WithDROP run showed ∼30%-50% less cold error than the NoDROP run (Figures 6d and 6f). Over Washington, the cold error of ∼1.5 K was reduced to ∼0.7 K (Figures 6b, 6d, and 6f), where the precipitation impact was also clearly observed (see next section).
The initial temperature difference between the NoDROP experiment and the ERA5 at 850 hPa was not as large as that at 300 hPa (Figure S2a in Supporting Information S2) over the oceanic AR region. A cold error of ∼0.5-2 K was seen in NoDROP near 152°W, 48°N and 142°W, 43°N (Figure S2a in Supporting Information S2), where the WithDROP run showed 30%-50% less error (Figure S2c in Supporting Information S2).
After 24 h, the NoDROP run continued to show a cold error over the Northeastern Pacific Ocean (Figure S2b in Supporting Information S2). Note that a warm error extended from VI to western Washington over the landfalling part of the AR shown in Figure 2c. In the WithDROP run, half of this warm error was removed by the assimilation of dropsondes at the initial time (Figures S2d and S2f in Supporting Information S2), leading to less warm air convergence over western Washington and increased stability. Consequently, this contributed to the reduced precipitation in the WithDROP run around this forecast time.
For IVT, the NoDROP run showed large forecast errors around both the cold and warm sides of the landfalling AR (Figure 7a). The IVT amplitude over the inland penetration of the AR was over-forecasted in NoDROP (Figures 7a and S3a in Supporting Information S2). The WithDROP run had ∼50% less error over VI and the west of Washington (Figures 7b, 7c, S3a, and S3b in Supporting Information S2). The WithDROP run also reduced the negative bias near 147°W, 52°N by ∼25% and the overprediction errors near 143°W, 46°N by ∼50% (Figures 7b-7d), where a cold front can be identified (not shown).

Impact on Precipitation
The maximum precipitation associated with the landfalling AR occurred between 1200 UTC February 3 and 1200 UTC February 4 over northwestern Washington in both the Stage-IV data and the forecasts. The maximum precipitation in Stage-IV was about 65 mm near 121.6°W, 47.7°N (Figure 8a). A second precipitation center was further south near 121.9°W, 46.9°N (Figure 8a). The NoDROP experiment generally overpredicted the amount of precipitation within the domain, and displaced the center of the heaviest precipitation seen in Stage-IV ∼100 km southward, creating a peak in the precipitation difference (shown in pink in Figure 8d) with overestimation error of ∼40 mm near 122.0°W, 47.0°N (Figures 8b and 8d). The precipitation in the WithDROP experiment was more realistic and closer to Stage-IV than that in NoDROP in both intensity and distribution (Figures 8c and 8e). The overprediction error was almost completely corrected near the peak in precipitation, which can be attributed to the assimilation of dropsonde data. The RMSE averaged over the inland precipitation common domain (Figure 1b) was reduced by 12.8% during 12-36 h.
The overprediction of precipitation in the NoDROP run was associated with the colder cloud tops at the upper levels (Figure 6b) and the enhanced water vapor transport from offshore over the leading edge of the AR (Figures 7a and S3a in Supporting Information S2) that can be traced back to initial condition errors (Figures 6a and S2c in Supporting Information S2). Dropsonde observations reduced the cold bias in the cloud top and corrected the IVT overestimation at the initial time, leading to a significant improvement in the regional precipitation forecast. Zheng, Delle Monache, Wu, et al. (2021) showed that the operational observation system has difficulty in measuring the lower to middle troposphere within an AR, indicating that a greater impact from dropsondes would be expected in the lower to middle troposphere. The analysis here demonstrated that these data can also significantly correct the upper-level temperature error, specifically near the cloud tops over an AR, and that this correction is important for the accurate prediction of precipitation as well. Hanna et al. (2008) found that lower cloud-top temperature is associated with increasing rainfall intensity. In this case, the cold bias of cloud-top temperature, extending inland from offshore of the Pacific coast, partly caused an overestimation of the precipitation over western Washington; after the cold bias was reduced by assimilating the dropsonde data, the overpredicted precipitation was reduced by ∼50% near Mount Rainier.

Impact on IVT and IWV
To systematically investigate the impact of AR Recon data, we computed the differences in forecast skill of water vapor transport and inland precipitation for all IOPs between the WithDROP and NoDROP experiments. At the initial time, all 15 IOPs showed improved IVT representation based on the RMSE reduction metric with respect to the ERA5 reanalysis, with a median improvement of ∼3.8% and a 75th percentile improvement of ∼8.2% (Figure 9a). The median RMSE reduction reached a maximum of ∼4.8% for the 12-h forecast and remained positive but decreased through 72 h. The RMSE reduction was overall neutral from 78 to 102 h, with the 90-96 h forecasts showing small negative median values and more degraded IOPs than improved IOPs. Beyond 108 h, the RMSE reduction metrics were positive, with over two-thirds of the IOPs improving from 126 to 144 h. The data impact on the spatial pattern of IVT, represented by the correlation between the NoDROP error and the difference between NoDROP and WithDROP, was positive across all 15 IOPs at all forecast times (Figure 9b), indicating that the dropsondes improved the initial conditions in all IOPs, even if the improvement was too small to change the RMSE by much in some IOPs. The error-difference correlation was maximized during the 6-12 h forecasts, with values between 0.2 and 0.5. The trend of impact based on the error-difference correlation was generally consistent with that for the RMSE reduction: a decreasing positive impact from the initial time to ∼60 h, an impact close to neutral from 90 to 102 h, and a positive impact again from 114 to 144 h (Figure 9b). Figures 9a and 9b show that the impact on IVT did not decrease linearly with forecast time. Instead, a second positive impact emerged at the later lead times.
For IWV, the median values across the 15 IOPs were positive for all lead times (Figure 10a). This is different from the IVT validation, which only showed a positive impact in 12 out of the 15 IOPs throughout the 6-day forecasts (Figures 9c and 9d). Since IVT is a combination of humidity and wind while IWV is a pure humidity variable, improvements at more lead times in IWV indicate that the dropsondes may have a larger positive impact on humidity than on wind for 2-4-day forecasts. This can be explained by the results of Zheng, Delle Monache, Wu, et al. (2021), which demonstrate that AR Recon dropsondes provide a higher percentage of the available direct humidity observations than of the available wind observations. The significant data gaps for direct humidity observations allow the dropsondes to play a more important role in correcting the errors and improving the humidity forecasts. The spatial error-difference correlations for IWV are positive at all time steps for all the IOPs, again suggesting that the improvement of the spatial pattern error is more robust than that of the intensity error for IWV (Figure 10).
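The two verification metrics used above can be illustrated with a minimal Python sketch. It assumes that the fields have been flattened to lists of grid-point values on a common grid, that the "NoDROP error" is NoDROP minus the verifying analysis, and that the "difference" is NoDROP minus WithDROP; the sign conventions, the absence of area weighting, the function names, and the sample IVT values are all illustrative assumptions rather than the exact procedure used in this study.

```python
import math

def rmse(forecast, truth):
    """Root-mean-square error between two equally sized sequences."""
    return math.sqrt(sum((f - t) ** 2 for f, t in zip(forecast, truth)) / len(truth))

def rmse_reduction_pct(nodrop, withdrop, truth):
    """Percent RMSE reduction of WithDROP relative to NoDROP (positive = improvement)."""
    base = rmse(nodrop, truth)
    return 100.0 * (base - rmse(withdrop, truth)) / base

def pearson(x, y):
    """Pearson correlation coefficient; returns NaN if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def error_difference_correlation(nodrop, withdrop, truth):
    """Correlation between the NoDROP error and the NoDROP-minus-WithDROP difference.

    A positive value means the dropsondes moved the forecast toward the truth.
    """
    error = [n - t for n, t in zip(nodrop, truth)]
    difference = [n - w for n, w in zip(nodrop, withdrop)]
    return pearson(error, difference)

# Hypothetical IVT values (kg m-1 s-1) at five grid points, for illustration only.
truth = [300.0, 450.0, 600.0, 520.0, 380.0]
nodrop = [330.0, 500.0, 640.0, 560.0, 370.0]
withdrop = [315.0, 470.0, 625.0, 540.0, 372.0]

print(rmse_reduction_pct(nodrop, withdrop, truth))
print(error_difference_correlation(nodrop, withdrop, truth))
```

Under these conventions the correlation can be positive even when the RMSE reduction is small, which is consistent with the spatial-pattern metric being the more robust of the two.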

Impact on Precipitation Forecasts
One important goal of assimilating AR Recon data is to improve the precipitation forecast over the US West Coast. Figure 11 shows the data impact on 24-h accumulated precipitation for all IOPs over the common inland precipitation verification domain (Figure 1b) with Stage-IV precipitation as the ground truth for RMSE. For the first 24 h, improvement was found in the precipitation RMSE over the domain in 13 out of 15 IOPs, with a median value of 3.2% (Figure 11a). Positive impacts on IVT/IWV may not always produce    (Figure 11a). For the error-difference correlation, the median value was positive for all the lead times, demonstrating that the improvement of the spatial correlation was greater than the RMSE improvement (Figure 11b) with the impact being consistently positive for both metrics. The median value was positive in 11 out of 15 IOPs using RMSE metrics while 3 IOPs showed negative values (Figure 11c). Although positive median values are observed in the majority of cases, it is noteworthy that negative values exist at some forecast times for all the IOPs. The spatial correlation generally showed improved skill at more lead times than the RMSE metric based on the IOP grouping (Figure 11d).
The Fractions Skill Score (FSS), a neighborhood method, was also employed to objectively assess the spatial skill of precipitation forecasts at different scales when comparing the NoDROP and WithDROP experiments. Unlike traditional methods, which only compare forecast precipitation to observed precipitation in the same grid box, neighborhood methods match forecast precipitation that is near the grid point of observed precipitation (Ebert, 2009; Roberts & Lean, 2008; Schwartz et al., 2009). The calculation of FSS requires that the predicted and observed precipitation be on the same grid. Specified thresholds (P, e.g., P = 10 mm) are selected to define the precipitation events and a radius of influence (R, e.g., R = 20 km) is chosen. FSS is computed for both NoDROP and WithDROP outputs with Stage-IV precipitation as the observed precipitation field. The forecast is considered perfect if FSS is 1 and to have no skill if FSS is 0.
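As an illustration of the neighborhood approach, the following Python sketch computes FSS in the general form given by Roberts and Lean (2008), that is, one minus the ratio of the mean-square difference of the neighborhood fractions to its worst-case reference value. It is a simplified sketch under stated assumptions: a square window measured in grid points stands in for the radius of influence R, windows are truncated at the domain edges, and the function names and the tiny 5 × 5 fields are hypothetical.

```python
def neighborhood_fractions(field, threshold, half_width):
    """Fraction of points >= threshold in a square window around each grid point.

    `field` is a list of rows; the window spans +/- half_width grid points and is
    truncated at the domain edges (a simplification of this sketch).
    """
    ny, nx = len(field), len(field[0])
    binary = [[1.0 if value >= threshold else 0.0 for value in row] for row in field]
    fractions = [[0.0] * nx for _ in range(ny)]
    for j in range(ny):
        for i in range(nx):
            total, count = 0.0, 0
            for jj in range(max(0, j - half_width), min(ny, j + half_width + 1)):
                for ii in range(max(0, i - half_width), min(nx, i + half_width + 1)):
                    total += binary[jj][ii]
                    count += 1
            fractions[j][i] = total / count
    return fractions

def fss(forecast, observed, threshold, half_width):
    """Fractions Skill Score: 1 is a perfect forecast, 0 is no skill.

    Returns NaN when neither field exceeds the threshold anywhere, because the
    score is undefined for an event that does not occur.
    """
    pf = neighborhood_fractions(forecast, threshold, half_width)
    po = neighborhood_fractions(observed, threshold, half_width)
    mse = sum((f - o) ** 2 for rf, ro in zip(pf, po) for f, o in zip(rf, ro))
    ref = sum(f ** 2 for row in pf for f in row) + sum(o ** 2 for row in po for o in row)
    if ref == 0.0:
        return float("nan")
    return 1.0 - mse / ref

# Hypothetical 24-h precipitation (mm): the forecast maximum is displaced by one grid point.
observed = [[0.0] * 5 for _ in range(5)]
forecast = [[0.0] * 5 for _ in range(5)]
observed[2][2] = 12.0
forecast[2][3] = 12.0

print(fss(forecast, observed, threshold=10.0, half_width=0))  # grid-point matching only
print(fss(forecast, observed, threshold=10.0, half_width=1))  # 3 x 3 neighborhood
```

The displaced maximum receives no credit with grid-point matching but partial credit once a neighborhood is used, which is the behavior that motivates the method. The NaN branch corresponds to nearly dry cases, for which a meaningful FSS cannot be computed.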
Using a 5-mm minimum precipitation threshold, the FSS values from the NoDROP and WithDROP runs are very close in the 24-h forecast (Figure 12a). However, a small (<5%) but significant positive impact was seen in 11 out of the 15 IOPs. Only three IOPs showed a very small (<1%) degraded impact. IOP3 showed a very small FSS because little rain fell over the region and the sample size is too small to compute a meaningful FSS value. Using a 10-mm threshold, the FSSs in both runs are smaller than those for the 5-mm threshold (Figure 12c). IOPs 7, 9, and 15 show more FSS improvement in WithDROP than in NoDROP (Figure 12c). FSS for the accumulated precipitation from 72 to 96 h was also investigated. With a 5-mm threshold, NoDROP showed an overall decrease in skill when compared to the 0-24 h forecast, particularly for IOPs 1, 5, 7, 9, and 13-15 (Figure 12b). However, FSS was improved by ∼10%-60% with dropsondes assimilated in IOPs 1, 5, 7, 13, and 15 (Figure 12b), indicating that dropsondes are particularly important in poorly forecasted AR cases. With the 10-mm threshold, FSS in NoDROP decreased significantly when compared to that with the 5-mm threshold in IOPs 7, 11, and 14-15 (Figure 12d). The difference in FSS between WithDROP and NoDROP for precipitation with a threshold greater than 10 mm shows an even greater positive impact from dropsondes than was seen using the 5-mm threshold, specifically in IOPs 7, 10-11, and 14 (Figure 12d).
Figure 11. RMSE and error-difference correlation for common area verification for precipitation based on the forecast hour (a and b) or IOPs (c and d).
The FSS results highlight that dropsondes are more important when the precipitation forecast has poorer skill, specifically at higher precipitation thresholds. Note that for the "degraded" cases, the differences in FSS between WithDROP and NoDROP are typically negligible (i.e., less than 2% of the NoDROP FSS).
Figure 12. … (c) same as (a) but for P = 10 mm; (d) same as (b) but for P = 10 mm. Horizontal gray dashed lines are the "zero lines" for both y-axes on each panel. The vertical bars on the red line represent the bounds of the 95% bootstrap confidence interval (CI) based on FSS differences between WithDROP and NoDROP. For the differences between WithDROP and NoDROP to be statistically significant, the bounds of the 95% CI must not contain zero; otherwise, the differences are insignificant. Green, magenta, and gray filled squares are the statistically significantly improved, significantly degraded, and neutral IOPs, respectively.
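The significance rule described in the caption above (a 95% bootstrap CI that must exclude zero) can be sketched in Python as follows. This is a generic percentile bootstrap of the mean difference; the quantity being resampled, the number of resamples, the function names, and the sample values are assumptions made for illustration and are not taken from this study.

```python
import random

def bootstrap_ci(differences, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired WithDROP-minus-NoDROP differences."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(differences)
    means = []
    for _ in range(n_resamples):
        sample = [differences[rng.randrange(n)] for _ in range(n)]  # resample with replacement
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

def classify(ci):
    """Label an IOP as improved, degraded, or neutral from its CI bounds."""
    lower, upper = ci
    if lower > 0:
        return "improved"   # the whole interval lies above zero
    if upper < 0:
        return "degraded"   # the whole interval lies below zero
    return "neutral"        # the interval contains zero

# Hypothetical per-sample FSS differences (WithDROP minus NoDROP) for one IOP.
fss_differences = [0.020, 0.030, 0.010, 0.040, 0.025, 0.015, 0.035]
print(classify(bootstrap_ci(fss_differences)))
```

The three labels correspond to the green, magenta, and gray squares described in the caption.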

Impact of Consecutive Flights and/or Multiple Flight Paths
Results from Section 4 demonstrated an overall positive impact on AR-specified variables and precipitation from assimilating AR Recon dropsonde data, even though the positive impact is not uniform. The dependence of dropsonde impact on the weather situation and the lead time is consistent with the conclusions from the literature (e.g., Langland et al., 1999;Romine et al., 2016;Stone et al., 2020). Table S1 in Supporting Information S1 summarizes the synoptic conditions for each IOP and the overall impact on IVT and precipitation.
We found that during a sequence of consecutive flights, defined here as IOPs flown less than 3 days apart, the positive impact on precipitation over the US West Coast was larger in the subsequent missions and was maximized in the last mission of a sequence (IOPs 2, 7, 12, and 15 in Figures 11c, 11d, 12c, and 12d; Table S1 in Supporting Information S1). In contrast, the first missions in a sequence had a relatively larger impact on IVT and a smaller impact on precipitation over the West Coast, particularly if only one aircraft was employed or if the aircraft flew over a remote area far from the coast (IOPs 5, 8, and 11 in Figures 12c and 12d; Table S1 in Supporting Information S1).
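The sequence definition above can be made concrete with a short Python sketch that groups IOP dates into sequences whenever successive IOPs are less than 3 days apart. The dates below are hypothetical placeholders and are not the actual AR Recon IOP dates; the function name is also illustrative.

```python
from datetime import date

def group_sequences(iop_dates, max_gap_days=3):
    """Group IOP dates into sequences whose successive gaps are < max_gap_days."""
    sequences = []
    for current in sorted(iop_dates):
        if sequences and (current - sequences[-1][-1]).days < max_gap_days:
            sequences[-1].append(current)  # continues the current sequence
        else:
            sequences.append([current])    # starts a new sequence
    return sequences

# Hypothetical IOP dates, for illustration only.
iop_dates = [
    date(2030, 1, 10), date(2030, 1, 12), date(2030, 1, 14),
    date(2030, 1, 25),
    date(2030, 2, 3), date(2030, 2, 5),
]

sequences = group_sequences(iop_dates)
# Last mission of each multi-flight sequence, where the precipitation impact was largest.
last_missions = [seq[-1] for seq in sequences if len(seq) > 1]
print(last_missions)
```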
The enhanced impact on precipitation from consecutive flights could be associated with the first guess and boundary conditions that inherited accumulated benefits from pre-existing AR Recon missions in the forcing data. We hypothesize that assimilating the targeted observations in the current cycle amplifies the positive impact from the first guess, which has information from upstream sampling in the previous cycle. The accumulated impact is greatest toward the end of the sequential flight period. The effect is strong enough to dominate despite the cold start for each IOP. Consistent results were implied in previous studies (e.g., Schindler et al., 2020; Stone et al., 2020). Specifically, Stone et al. (2020), who assimilated the dropsondes in a global model along with data from the entire global observing system, reported that the largest total observational impact was from observations collected on February 3, 2018, which is IOP7 in this work and the last mission of a sequence of flights from January 27 to February 3 during the 2018 AR Recon.
Ongoing research is focused on examining this hypothesis with varying flight scenario experiments. In addition, multiple aircraft in a single IOP also increase the positive impact because the first-order impacts from dropsondes, like forecast errors, are flow-dependent. For example, assuming there are two flight paths along 130°W and 150°W (e.g., the IOP on January 29, 2018 in Figure 13), the impact from the eastern flight path reaches the West Coast first and then decays after 1 day. Depending on the phase speed of the weather system, the impact from the western flight path may advect toward the coast, bringing an interval of higher dropsonde impact (e.g., Figure 12b, IOP5) on subsequent days. When a Rossby wave packet is present over the western flight path, the impact could also propagate at the group velocity of the wave packet (Majumdar et al., 2010;Zheng et al., 2013). Nevertheless, cycled data denial experiments based on a large number of AR Recon cases are needed to confirm the impact of consecutive flights and multiple aircraft.

Targeting Tools, Data Assimilation Techniques, and Model Physics
The impact of targeted observations on model analysis and forecasts can depend on the detection of targeted regions during the flight planning stage (Ancell & Hakim, 2007; Bishop et al., 2001; Buizza et al., 2007; Majumdar et al., 2002; Wu et al., 2009). For example, if target areas are not correctly selected, the sampling of the real-time initial condition sensitivity areas will not be enhanced, thereby leading to negligible impact on downstream forecasts. The primary tools employed during AR Recon operations are the adjoint sensitivity tool from COAMPS (Errico, 1997; Doyle et al., 2014, 2019; Reynolds et al., 2019) and ensemble sensitivity analysis (Ancell & Hakim, 2007; Chang et al., 2013; Torn & Hakim, 2008; Zhang et al., 2007; Zheng et al., 2013) based on the ECMWF, GEFS, and MSC ensembles. These tools quantify how a forecast at a valid time over a verification domain is dependent on the initial state variables or the atmospheric state at an earlier time.
Both methods have their strengths and limitations. Neither the COAMPS adjoint nor the ensemble analysis employs the same forecast system as the West-WRF/GSI system used in this study. The adjoint sensitivity was based on analyses and forecasts from COAMPS, which provides an accurate, efficient, and physically meaningful method to identify the most sensitive areas. However, the initial condition perturbations that influence the downstream precipitation the most in COAMPS are not necessarily the same perturbations needed in West-WRF. Another limitation of adjoint sensitivity is that it is a deterministic method and therefore does not take into account the probabilistic guidance for forecast uncertainty. On the other hand, the ensemble sensitivity analysis considers the forecast uncertainty across ensemble members and provides a flexible and probabilistic tool to identify the sensitive areas but is limited by the quality of the ensemble. Ideally, if the ensemble is calibrated, the spread would reliably characterize the uncertainty over a number of cases. However, the ensemble spread in current operational ensemble products is far from perfect, due to ensemble size, observation constraints, and ensemble generation methods (e.g., Zheng et al., 2019). The limitations of the targeting tools mean that the AR Recon dropsondes are not guaranteed to be in the optimal locations for improving forecast skill in a particular model. For this reason, a variety of tools were used in flight planning to maximize confidence in identifying the sensitive areas.
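To make the ensemble-based approach concrete, the following Python sketch computes the sensitivity in the regression form used in the ensemble sensitivity literature (e.g., Ancell & Hakim, 2007): the slope obtained by regressing a scalar forecast metric J onto an earlier state variable x across ensemble members, cov(J, x)/var(x). The five-member ensemble, the variable names, and the values are hypothetical, and the operational tools apply this to full model fields, usually with normalization and significance testing that are omitted here.

```python
def ensemble_sensitivity(metric, state):
    """Regression slope dJ/dx = cov(J, x) / var(x) across ensemble members.

    `metric` holds one forecast-metric value per member (e.g., area-mean
    precipitation at the valid time); `state` holds the same members' values of
    an earlier state variable at one grid point (e.g., initial-time IVT).
    """
    n = len(metric)
    j_mean = sum(metric) / n
    x_mean = sum(state) / n
    cov = sum((j - j_mean) * (x - x_mean) for j, x in zip(metric, state)) / (n - 1)
    var = sum((x - x_mean) ** 2 for x in state) / (n - 1)
    return cov / var

# Hypothetical five-member ensemble, for illustration only.
state_point_a = [1.0, 2.0, 3.0, 4.0, 5.0]   # state variable at grid point A
state_point_b = [5.0, 1.0, 4.0, 2.0, 3.0]   # state variable at grid point B
forecast_metric = [2.0 * x + 1.0 for x in state_point_a]  # metric tied to point A by construction

sens_a = ensemble_sensitivity(forecast_metric, state_point_a)
sens_b = ensemble_sensitivity(forecast_metric, state_point_b)
print(sens_a, sens_b)
```

Grid points with large absolute sensitivity are candidate targeting regions. Because the estimate depends entirely on the ensemble members, it inherits the limitations in ensemble size and spread discussed above.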
Another contributing factor for dropsonde impact is the data assimilation technique (Majumdar, 2016). The dropsonde data used for this study are similar to what has been used for the NCEP GFS model, which are quality controlled data at reduced levels and only include a small portion of the raw dropsonde profiles, particularly near the PBL (see Figure 14d of Zheng, Delle Monache, Wu, et al., 2021 for more details). Preliminary tests (not shown) using data assimilation techniques tailored to assimilate dropsondes suggested that if the dropsonde release time information was not considered during assimilation (for example, when using 3DVar or hybrid 3DVar), the impact of dropsonde assimilation was degraded. This is particularly important for fast-moving atmospheric flows, which is also consistent with the findings of Zhang and Pu (2020) for hurricane forecasts. Parameter settings such as the localization scales and the observation error assignment will also influence the data impact. The choice of ensemble input, used to derive the flow-dependent error covariance, is also critical to the data impact. One strength of this set of experiments is the use of the combined GEFS and CMC ensemble. AR Recon data typically modified the IVT amplitude by 10%-40% (Figures 13 and S4 in Supporting Information S2). We think the overall positive impact we saw from the initial time is partly attributable to the use of the multi-ensemble, which improved the ensemble error-spread skill compared with an individual ensemble. In follow-up work, we will investigate whether the EnKF will further improve the data impact compared with the hybrid 4DEnVar.
Errors in the NWP model employed are another limiting factor for dropsonde impact, including the errors in the model first guess and the model physics. On one hand, if the model is perfect and the initial condition has little deviation from the dropsondes, the room for improvement from dropsonde assimilation is limited. For example, Hamill et al. (2013) showed that the Winter Storm Reconnaissance data did not significantly improve forecast skill and considered that the limited impact could be due to the advanced assimilation system employed as well as a denser observation network, leaving little room to improve upon the numerical model. On the other hand, if the deficiency in the model is too large and dominates the error growth with increasing forecast time, the correction of initial conditions might be negligible beyond certain forecast times or for certain weather systems. IOPs 4 and 13 appear to be such examples (Figures 13 and S4 in Supporting Information S2, Table S1 in Supporting Information S1). Although both IOPs showed improvement in the initial conditions and the short-term forecast (<18 h), the prediction for 24-48 h precipitation was not improved, partly due to the complex topography and cloud physics. Lavers et al. (2018) found that wind errors could be the major source of the IVT uncertainty in the ECMWF IFS and implied that the wind component of IVT might be more challenging to correct than the humidity component in NWP models, which might partly explain why we see more improvements in IWV than in IVT. The degree to which the practical predictability can be improved requires further investigation through, for example, perturbed initial conditions, multi-physics experiments, and studies of more cases to separate initial-condition dominated errors from model-physics dominated errors.
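The distinction between the humidity and wind contributions can be illustrated with the standard column definitions, IWV = (1/g) ∫ q dp and IVT = (1/g) |∫ q V dp|, where q is specific humidity and V is the horizontal wind vector. The Python sketch below evaluates both with trapezoidal integration for a single hypothetical profile; the pressure levels, the 1000-300 hPa integration range, the profile values, and the function names are illustrative assumptions and are not taken from this study.

```python
import math

G = 9.80665  # standard gravity (m s-2)

def _integrate(values, pressure_pa):
    """Trapezoidal integral of `values` over pressure (Pa), independent of level order."""
    total = 0.0
    for k in range(len(values) - 1):
        total += 0.5 * (values[k] + values[k + 1]) * abs(pressure_pa[k + 1] - pressure_pa[k])
    return total

def iwv(q, pressure_pa):
    """Integrated water vapor (kg m-2): depends on humidity only."""
    return _integrate(q, pressure_pa) / G

def ivt(q, u, v, pressure_pa):
    """Integrated vapor transport magnitude (kg m-1 s-1): depends on humidity and wind."""
    qu = _integrate([a * b for a, b in zip(q, u)], pressure_pa) / G
    qv = _integrate([a * b for a, b in zip(q, v)], pressure_pa) / G
    return math.hypot(qu, qv)

# Hypothetical AR-like profile, for illustration only.
pressure_pa = [100000.0, 92500.0, 85000.0, 70000.0, 50000.0, 30000.0]
q = [0.010, 0.009, 0.007, 0.004, 0.001, 0.0001]   # specific humidity (kg kg-1)
u = [10.0, 15.0, 20.0, 25.0, 30.0, 35.0]          # zonal wind (m s-1)
v = [10.0, 12.0, 15.0, 15.0, 10.0, 5.0]           # meridional wind (m s-1)

base_iwv, base_ivt = iwv(q, pressure_pa), ivt(q, u, v, pressure_pa)

# A 10% wind error changes IVT but leaves IWV untouched.
windy_ivt = ivt(q, [1.1 * a for a in u], [1.1 * a for a in v], pressure_pa)
# A 10% humidity error changes both IWV and IVT.
moist_q = [1.1 * a for a in q]
moist_iwv, moist_ivt = iwv(moist_q, pressure_pa), ivt(moist_q, u, v, pressure_pa)

print(base_iwv, base_ivt, windy_ivt, moist_iwv, moist_ivt)
```

Because wind errors enter IVT but not IWV, an analysis that corrects humidity more effectively than wind would be expected to show more consistent improvement in IWV, in line with the interpretation offered above.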
One caveat is that the assimilated data did not include satellite radiances. Zheng, Delle Monache, Wu, et al. (2021) showed that most of the clear-sky radiance data were not used under AR conditions, leaving data gaps in the lower to middle troposphere. The all-sky radiance data for the GFS operational system were rejected as well if precipitating clouds were present, leaving observing gaps over the precipitating areas. Furthermore, assimilated microwave radiances were assigned large observation errors and often had limited impact on forecasts over the Northeastern Pacific (Zhu et al., 2016, 2019). Extensive coverage of AMV winds was assimilated in our experiments to reflect a portion of the information from satellite observations. Nevertheless, future work will focus on the assessment of dropsonde impact in the context of all-sky radiance data with the regional model and on the improvement of data assimilation techniques for data-sparse regions.

Conclusions
Landfalling ARs have a large impact on both hazardous weather and the management of valuable water resources over the western US. Significant observational gaps still exist over the upstream Northeastern Pacific, leading to uncertainty in initial conditions that affects forecast skill for these AR events. Atmospheric River Reconnaissance (AR Recon, Ralph et al., 2020) was initiated in 2016 as a way to improve the forecast skill for the western US. The analysis presented here investigated the overall impact of the dropsonde observations collected during AR Recon in 2016, 2018, and 2019 using a GSI 4DEnVar system with an implementation of the WRF model tailored for the western US named West-WRF (Martin et al., 2019).
Paired data denial experiments were performed, including WithDROP runs that assimilated dropsondes with other conventional and remotely sensed data, and NoDROP runs with the same assimilated observations as the WithDROP experiments except for the removal of AR Recon dropsonde data. The difference between these paired runs showed a statistically significant impact from AR Recon flight-level and dropsonde data, especially in cases that were found after the fact to have lower-skill forecasts. In a representative case study initialized at 0000 UTC February 3, 2018, the assimilation of dropsonde data significantly reduced the moisture bias in the critical layer of the AR (∼700-950 hPa) and the cold bias in the cloud top aloft around 300 hPa, and thereby reduced the integrated vapor transport (IVT) bias by 30%-75% along the flanks of the AR at the initial time. The benefit of initial condition corrections in the WithDROP experiment propagated to the downstream areas, which clearly eliminated a localized precipitation error centered near Mount Rainier. The precipitation forecast error over western Washington was reduced by ∼50% at a lead time of 12-36 h. The case study demonstrated that dropsonde assimilation can not only improve the lower-level water vapor transport within the AR but can also improve the cloud temperature in the upper levels, both contributing significantly toward improving precipitation predictions.
Results based on the 15 Intensive Observation Periods (IOPs) of AR Recon showed that dropsondes improved the spatial pattern forecast for both IWV and IVT in all 15 IOPs for all forecast times out to day 6 over the extended Northeastern Pacific and Western US domain. Improvement for IVT and IWV based on the reduction of the median RMSE value was seen in over 80% of the IOPs and 85% of the forecast times out to day 6. Consistent forecast improvement in IVT and IWV was found in the first 3 days with the peak impact occurring at 6-18 h. The impact on accumulated 24-h precipitation over the US West Coast was generally consistent with the IVT/IWV but slightly smaller. It was found that the improvement for precipitation forecasts was maximized during subsequent consecutive flights, suggesting that consecutive missions can enhance the impact on forecasting downstream West Coast precipitation. The precipitation improvement based on the Fractions Skill Score tended to be greater in cases with large forecast error, such as IOPs 5, 7, and 15 for the forecast times of 72-96 h. Precipitation with a threshold greater than 10 mm showed an even greater positive impact from dropsondes than was seen using the 5-mm threshold, which suggests that AR Recon data are more important in poorly forecasted medium to heavy precipitation events than for light precipitation and good forecasts.
This study sheds light on how initial condition corrections associated with AR Recon dropsonde assimilation can improve forecast skill over the Northeast Pacific and the US West Coast with the advanced 4DEnVar method and a well-tuned ensemble as flow-dependent background errors. Given these promising results, future work will include assessing the impact in experiments with the full observational system, concentrating on events with poor forecast skill.

Data Availability Statement
The experiments were completed at the San Diego Supercomputer Center under the research allocation ATM150010. The West-WRF data can be accessed via UC San Diego Library Digital Collections from https://doi.org/10.6075/J0445MMG. The AR Recon dropsonde profiles have been made available publicly and interactively via the web interface at https://cw3e.ucsd.edu/arrecon_data. ERA5 verification data can be retrieved from https://cds.climate.copernicus.eu/#!/search?-text=ERA5ERA-5&type=dataset. Stage-IV precipitation data can be accessed via https://doi.org/10.5065/D6PG1QDD.