Evaluation of MODIS and Himawari‐8 Low Clouds Retrievals Over the Southern Ocean With In Situ Measurements From the SOCRATES Campaign

Aircraft observations collected during the Southern Ocean Cloud Radiation Aerosol Transport Experimental Study in January‐February of 2018 are used to evaluate cloud properties from three satellite‐imager datasets: (1) the Moderate Resolution Imaging Spectroradiometer level 2 (collection 6.1) cloud product, (2) the CERES‐MODIS Edition 4 cloud product, and (3) the NASA SatCORPS Himawari‐8 cloud product. Overall the satellite retrievals compare well with the in situ observations, with little bias and modest to good correlation coefficients when considering all aircraft profiles for which there are coincident MODIS observations. The Himawari‐8 product does, however, show a statistically significant mean bias of about 1.2 μm for effective radius (re) and 2.6 for optical depth (τ) when applied to a larger set of profiles with coincident Himawari‐8 observations. The low overall mean‐bias in the re retrievals is due in part to compensating errors between cases that are non‐ or lightly precipitating, with cases that have heavier precipitation. re is slightly biased high (by about 0.5–1.0 μm) for non‐ and lightly precipitating cases and biased low by about 3–4 μm for heavily precipitating cases when precipitation exits near cloud top. The bias in non‐ and lightly precipitating conditions is due to (at least in part) having assumed a drop size distribution in the retrieval that is too broad. These biases in the re ultimately propagate into the retrieved liquid water path and number concentration.

Cloud properties, such as clouds effective radius (r e ), optical depth (OD, τ), liquid water path (LWP), and cloud droplet number concentration (N d ) are central in understanding the physics of MBL clouds and their radiative effect. Visible and infrared observations from geostationary and polar-orbiting satellites have long been used to retrieve MBL cloud microphysical characteristics and for the study of SO clouds, cloud-aerosol interactions, and for the evaluation of global models (e.g., Bodas-Salcedo et al., 2016;Haynes et al., 2011;McCoy et al., 2015;Meskhidze & Nenes, 2006;Vergara-Temprado et al., 2018). However, the accuracy of satellite retrievals over the SO is questionable, as satellite retrievals have been infrequently evaluated against in situ measurements in this region, due in part to the remoteness of the region and a paucity of in situ measurements. The validation, empirical relationships, and Apriori data used in satellite retrieval algorithms are mostly based on data collected in the Northern Hemisphere and might not be applicable over the SO. In general, low-level SO clouds are thought to be more frequently multilayered, mixed-phase, and contain more supercooled liquid water than in the Northern Hemisphere, conditions which pose significant challenges for satellite retrievals (Huang et al., 2014;Morrison et al., 2011).
A direct evaluation of satellite cloud retrievals can be made using in situ measurements from aircraft, and many such studies have been done over the years, including in recent years for the Southeast Pacific (King et al., 2013;Min et al., 2012;Painemal & Zuidema, 2011) and Northeastern Pacific (Noble & Hudson, 2015). There have been a few cases where in situ measurements have been collected from aircraft over the SO. Four transects over the SO were made during the HIAPER Pole-to-Pole Observations (HIPPO) experiment (Wofsy, 2011). HIPPO confirmed the existence of extensive supercooled liquid water in the region, but collected insufficient data to directly evaluate coincident satellite microphysical retrievals. More recently, in situ measurements from 20 flights were made over the SO to the west and south of Tasmania (43-45°S, 145-148°E) during the austral winter between 2013 and 2015 by Ahn et al. (2017). These flights focused on the microphysical properties of low-level clouds, which were found to be commonly precipitating, patchy, and mixed-phase. Ahn et al. (2018, hereafter A18) compared in situ observations from 11 of these flights to cloud products from Moderate Resolution Imaging Spectroradiometer (MODIS) and found an overestimation of r e in comparison with in situ measurements. In addition to, providing estimates of the biases of satellite cloud retrievals, these evaluation studies also provide insights on the error sources, which can be further used to improve the satellite retrieval algorithm. Satellite retrievals invoke assumptions about cloud structure and microphysics, and errors arise when these assumptions are violated in the real world. Some error sources explored by past studies include but not limited to: (1) subpixel inhomogeneity and three-dimensional radiative effects, that is, clouds are not plane-parallel and the scattering of light from clouds is frequently not well modeled using one-dimensional radiative transfer model as is assumed in the retrieval (e.g., Marshak et al., 2006;Zhang et al., 2012); (2) assumptions about the shape of the cloud droplet size distribution (DSD, e.g., Hansen, 1971;Platnick et al., 2017); (3) assumption about the cloud vertical structure (e.g., Bennartz et al., 2007;Borg & Bennartz, 2007;Wood & Hartmann, 2006) and; (4) satellite viewing geometry and solar zenith angle (e.g., Grosvenor & Wood, 2014;Maddux et al., 2010). Moreover, the in situ measurements that are used to evaluate satellite retrievals also have uncertainties.
More recently, SO Cloud Radiation Aerosol Transport Experimental Study (SOCRATES) collected airborne in situ measurements over the SO (McFarquhar et al., 2020). During SOCRATES, NSF deployed the Gulfstream-V (GV) research aircraft to Hobart, Tasmania from January to February of 2018. From Hobart, the GV flew a total of 15 research flights over the SO as far as 62°S, sampling aerosol, cloud and precipitation properties in situ, as well as remotely with a W-band cloud radar and high spectral resolution lidar. SOCRATES provides an opportunity to evaluate satellite cloud products and retrieval assumptions during the austral summer over the SO. In this study, we expand upon the earlier evaluations of cloud properties from satellites and evaluate low altitude cloud microphysical properties retrieved from satellites using airborne in situ measurements collected during SOCRATES. After describing the datasets and methods in Section 2; in Section 3, we compare the satellite retrievals of effective radius (r e ), OD (τ), LWP, and cloud droplet number concentration (N d ) from three datasets that are based on observations from MODIS (Platnick et al., 2003) and Himawari-8 (geostationary weather satellite; Bessho et al., 2016) with in situ measurements. This is followed by Section 4, a more detailed examination of the retrieval assumptions and other factors that are responsible for differences between the satellite retrievals and in situ observations. Discussion and concluding remarks are given in Sections 5 and Section 6, respectively. SOCRATES specifically targeted stratocumulus, primarily overcast or closed-cell stratocumulus, that reside in the cold sectors of the low-pressure system, and the SOCRATES data do not represent a meteorologically unbiased set of conditions. Nonetheless, stratocumulus clouds are a significant fraction of all SO low clouds (Wood, 2012), and our focus on these clouds during SOCRATES results from a recognition that these clouds lay at the heart of difficulties that many models are having in simulating SO climate. Based on the results from other regions, one expects that satellite retrievals for these relatively spatially homogenous clouds should work well (e.g., Painemal & Zuidema, 2011;hereafter PZ11). In Section 5, we discuss conditional sampling issues and how results obtained here are related to previous evaluation studies over the Pacific and over the SO (e.g., A18; Zhao et al., 2020) in more detail.

SOCRATES Flights and In Situ Measurements
During SOCRATES, the GV was equipped with a suite of instruments measuring aerosol, cloud, and thermodynamic variables. A total of 15 research flights were flown over the SO, which are marked by the black lines in Figure 1. Typically, the GV sampled clouds with several modules in each flight, which consists of a combination of ramp ascents and descents, as well as level (fixed altitude) legs above, below, and in cloud. Here, we focus on vertical profiles of cloud microphysical properties constructed from flight segments, where the aircraft completely ascended or descended through low-altitude clouds, and multiple low-level cloud layers were occasionally present.
KANG ET AL.  This study uses the cloud microphysical properties measured by the particle-sizing instruments, as listed in Table 1. SOCRATES GV data are available via the Earth Observing Laboratory data archive (https://data. eol.ucar.edu/project/552), with links to specific datasets given in Table 1. Here we rely primarily on (i) the Cloud Droplet Probe (CDP): an optical instrument that measures the concentration of cloud droplets in 30 size bins with diameters ranging from 2-50 µm, by measuring the light forward scattered by individual cloud droplets, as they pass through a laser beam oriented across the aircraft flight direction, (ii) the two-dimensional stereo (2DS) probe: an optical array probe that records the images of hydrometeors using two orthogonal laser beams that cross in the middle of the sample volume and measures particle size based on the shadow (blockage of the lidar beam) for particle diameters (maximum dimension if irregular) ranging from about 10 to 1,280 µm with a 10 µm bin-width as they cross the optical array, and (iii) the two-dimensional cloud (2DC) probe: an optical array probe that measures hydrometeors ranging from 37.5 to 1612.5 µm with 25 µm bin-width.
We combined DSD from the CDP with that from the 2DS (or 2DC) to calculate cloud microphysical properties. The in situ effective radius is calculated from the merged DSD from all the CDP bins (which includes particles up to 50 µm) and 2DS bins larger than 50 µm, the same approach as used by King et al. (2013). More details and uncertainties associated with instruments and the spectra-merging processes are discussed in Section 4.4. Specifically, r e is computed as the ratio of the third to the second moment of a DSD (1) where r e is the effective radius at a given time, r i the droplet radius of each bin, n i the droplet concentration (#/cm 3 ) per bin, and N the total number of the bins. Cloud droplet number concentration (N d ) is computed as: Cloud OD (τ) is calculated by vertically integrating the volume extinction coefficient (β) where the extinction efficiency Q e is assumed to be 2.
where the  w is the density of liquid water. We calculate LWP by integrating LWC from the cloud base to the cloud top. Following Wood et al. (2011) and PZ11, cloud top and cloud base are defined as the highest and lowest altitude with LWC greater than 0.03 gm −3 . The calculated value of LWP is not sensitive to this threshold.
In order to identify the cases (vertical profiles) where the precipitation is present, we calculate LWP for the droplets with diameters larger than 50 μm from 2DS probe (i.e., the vertically integrated precipitation liquid water path), which we will denote as precipitation water path (PWP). Following the definition of King et al. (2013), clouds profiles are categorized into three groups: nonprecipitating (PWP ≤ 2 g m −2 ), lightly precipitating (2 g m −2 < PWP ≤ 10 g m −2 ), or heavily precipitating (PWP > 10 g m −2 ). The phase of the clouds is determined using the ice phase fraction μ ice (Korolev et al., 2017), defined as μ ice = IWP/(IWP + LWP), where ice water path (IWP) is the vertical integral of ice water content (IWC) obtained from the 2DS and only includes ice particles ≥200 µm, as derived by Wu and McFarquhar (2019). Later in the article, we will discuss the implications of this restriction. Figure 2 shows the verticals profiles of in situ r e , LWC, and N d as a function of the normalized height (position within cloud normalized such that 1 is cloud top and 0 is cloud base). Here only profiles of single-layered clouds are shown (meaning profiles with multiple low-level clouds layers are not included). The thin lines are from individual aircraft penetrations (dots shown in Figure 1), while the thick line and purple shading shows the average profile and standard error, respectively. The standard error is the standard deviation divided by the square root of the number of profiles and is provided to give a sense for the one-sigma (66%) uncertainty in the average.
On average, both r e and LWC increase roughly linearly with height, while N d remains relatively constant with height, which is what one expects for cloud under an adiabatic assumption. We note that while r e increases linearly with height, the value of r e is not especially small at cloud base, and the total change in r e (on average) is only a few microns. LWC near the cloud top deviates from a linear increasing LWC, which may be due to entrainment, but could also be due to the aircraft passing through clouds with a horizontally varying cloud top height or cases where a thin-cloud-layer exists above a thicker-layer that is not resolved by the aircraft sampling. The data used here are sampled at 1 Hz, which is roughly equivalent to a horizontal sampling distance of 137 m, while ascents and descents rates were typical about 5-7 m s −1 , yielding a vertical resolution of about 6 m. The thickness of cloud layers varied from 88 to 2,421 m, with multilayer clouds often featuring multiple thin layers. The light lines in Figure 2 shows that individual aircraft penetrations KANG ET AL.
10.1029/2020EA001397 5 of 30 Purple shadings indicate the standard error. Spikes in the effective radius plot, in particular, occur when the Cloud Droplet Probe (CDP) records only a few small (∼10 µm) cloud-size droplets, but some precipitation size particles are present from two-dimensional stereo (2DS).
do not always show an adiabatic-like profile, but of course the aircraft profiles we are creating are not necessarily sampling individual updrafts or downdrafts within a cloud, and any individual profile does not represent the actual profile of cloud properties at any specific location. Nonetheless, the horizontal distance sampled by the aircraft is roughly consistent with the 1-to-few pixel size being used in the satellite retrievals.
In Section 3, we compare satellite retrieved r e , τ, LWP, and N d with in situ values from individual profiles.
In that analysis, the in situ cloud drop size distribution are first aggregated from the top of the cloud to the point where the OD reaches one (unless noted otherwise), and average DSD are used to compute r e . Since both τ and LWP are integrated quantity, they are computed over the entire cloud layer by vertically integrating LWC and , respectively. Despite of the variability in the cloud field, this approach is still reasonable to estimate the mean τ and LWP of the cloud field. The mean value of N d for each profile is used (rather than the value near cloud top) to reduce sampling uncertainty. In the plots in Section 3, the variability of the in situ r e is shown by the standard deviations of the values over the top 1 OD of the cloud, and the variability of the in situ N d is shown by the standard deviations of the values taken over the cloud profile. In order to estimate the uncertainty associated with the LWP and τ, we fit a set of lines to individual profiles that bound the vertical variations in LWC and , with details given in the supporting information.

Satellite Products and Collocation
MODIS level 2 Collection 6.1 cloud products from Aqua platform (MYD06) are evaluated in this study. Detailed descriptions for the MODIS cloud product can be found in Platnick et al. (2017). Data are available at https://ladsweb.modaps.eosdis.nasa.gov/. The product includes cloud microphysical information with 1-km resolution at nadir (directly below the satellite) based on a bi-spectral method using a non-absorbing visible-wavelength channel and one absorbing shortwave infrared channel, following the approach developed by T. Nakajima and King (1990). MODIS provides three sets of retrievals, based on three different absorbing channels at 1.6, 2.1, and 3.7 µm. During the evaluation, we mainly focus on the retrievals using the 3.7 µm channel since r e retrievals from this band are expected to be less influenced by three-dimensional effects (more on this later in the document) and often show the best agreement with the in situ measurements (e.g., King et al., 2013). Comparisons between 1.6 and 2.1 µm are also discussed in Section 3.2. In general, the comparison of satellite retrieved r e and in situ measured r e requires consideration of the vertical penetration of the photons into the cloud. As reported in the previous studies (King et al., 2013;T. Y. Nakajima et al., 2010;Platnick, 2000;Zhang & Platnick, 2011), one expects that r e retrieval at 3.7 μm is more sensitive to the cloud droplets near cloud top due to the stronger absorption (smaller penetration depth), while r e retrievals at 1.6 and 2.1 μm are more representative for the droplets deeper into the clouds due to the relatively weaker absorption (larger penetration depth). Thus (as described in Section 2.1), we calculated in situ cloud top r e by averaging the cloud DSD over 1 OD at the top of the cloud. We also have examined the impact of using a threshold of two and three ODs, but found this had little effect on the results. In addition to MYD06 product, we also used MODIS level 3 MYD03 product for the geolocation fields, and MODIS level-1B data set MYD02QKM for the calibrated radiances to calculate the heterogeneity index (Section 4.1).
In addition to, the operational MODIS retrievals, we also evaluated the CERES-MODIS Cloud Product Retrieval Edition 4 (Minnis et al., 2020;Trepte et al., 2019). This retrieval product is produced by the CERES team at NASA Langley and is used in generating CERES radiative flux products (Kato et al., 2013). Although CERES-MODIS, pixel level data are not publicly available (publicly available data are limited to gridded level 3 products), we include in the supplementary material (Table S2) the mean of the CERES-MODIS retrievals collocated with the aircraft vertical profiles, which is used in all of the analysis presented here. While the microphysical properties of low clouds are also based on the bi-spectral technique, the underlying codes were developed independently and apply different techniques to account for absorption due to above cloud water vapor and different criteria to identify low clouds and when to apply the bi-spectral retrieval. CERES-MODIS algorithm processes MODIS radiance data with every other scanline and every 4th pixel from the original MODIS 1-km resolution (i.e., 339 pixels per scanline, instead of 1,354 pixels).
This study also evaluates cloud retrievals produced by the NASA SatCORPS group based on Himawari-8 observations (Smith and Minnis, 2018;Minnis et al., 2020;Trepte et al., 2019). Data are available at https://data. eol.ucar.edu/dataset/552.027. Himawari-8 is a Japanese geostationary meteorological satellite launched in October 2014. The SATCORPS Himawari-8 retrievals have 2-km resolution at nadir and are available every 10-min during GV aircraft flight dates. Details of the cloud property retrieval methodology are given in Minnis et al. (2011Minnis et al. ( , 2008 and largely follow the approach used for CERES-MODIS and use near-infrared imagery at 3.9 μm. All three satellite products provide retrievals for r e and τ, based on one-dimensionalradiative transfer calculations and the satellite retrieved r e and τ can be used to derive LWP. The formulation varies depending on the assumed vertical structure (profile shape) of the cloud LWC and r e . For a vertically homogeneous cloud having a constant LWC and r e with altitude (Borg & Bennartz, 2007), one obtains: while for an adiabatically stratified cloud having a linearly increasing LWC and r e with altitude (Wood & Hartmann, 2006), one obtains: where  w is the density of water, and again the extinction efficiency Q e is assumed to be 2. These two expressions differ by a constant factor, with the vertically homogenous assumption giving a 20% larger LWP. Both MODIS and CERES-MODIS operational algorithm calculate and provide LWP based on Equation 5, while SatCORPS Himawari-8 data set includes LWP calculated with the Equation 6. Previous studies (e.g., King et al., 2013) have found that LWP computed using the vertically homogeneous formulation is usually positively biased for marine stratocumulus, which is not surprising given the overall adiabatic-like profiles of oceanic boundary layer clouds (Seethala & Horváth, 2010). We likewise find that this assumption provides a better match with the observations, and in the later discussion use Equation 6 assuming adiabatically stratified clouds except where specifically stated otherwise.
Although, N d is not provided in any of the three satellite products, it can likewise be derived from passive satellite observations using r e and τ assuming a one-dimensional cloud and following the assumption that clouds have an adiabatic-like profile, in which LWC increase linearly from cloud base to cloud top, given by (Bennartz, 2007;Grosvenor et al., 2018): which is basically the product of the ratio of  1/2 and 5/2 e r and a constant C. The constant C is determined by several parameters, with  e Q 2, and k, c w , and f ad given by: The k parameter is a measure of the droplet spectrum width and is given by the third power of the ratio between volume radius (r v ) to r e, where . k is often assumed to be a constant with a value of 0.8 in retrievals. Later in this study, we calculate k values using the in situ r v and r e values for the cloud profiles. c w is the rate of increase of LWC with height (i.e., the condensation rate), which is a weak function of temperature and pressure, and is often assumed to be a constant ranges from 1 to 2.5 g m −3 km −1 (Al-brecht et al., 1990;Min et al., 2012). Again, later in this study, we examine the mean and variability of c w . c p = 1,004 J K −1 kg −1 is the specific heat of dry air at constant pressure, L v = 2.5 × 10 6 J kg −1 is the latent heat of vapourization, Γ d and Γ m are the dry and moist adiabatic lapse rate, respectively. f ad is called the adiabaticity factor and describes how close the observed cloud is to a true adiabatic cloud layer (while still assuming the liquid water content increases linearly with height). Typically, this factor is assumed to be a constant 0.8 (e.g., Bennartz et al., 2007). Again, later in this study, we calculate f ad values for the cloud profiles.
In comparing satellite retrievals with in situ measurements, a key step is collocation. We used the coordinates of the aircraft at the time when the aircraft crosses cloud top and found the corresponding satellite pixels surround this location. Here, we averaged the satellite pixels around the location of the in situ profile within a 5 pixels × 5 pixels box for MODIS, 3 pixels × 3 pixels box for CERES-MODIS, and 3 pixels × 3 pixels box for Himawari-8. Changing the size of the averaging box by a factor of 2 has a negligible impact on the results. While in most cases, all of the satellite retrievals within these boxes correctly identified the cloud as low-level (<3 km) liquid clouds; in a few cases, there were some scattered high clouds in the vicinity. In our box averages, we include only those satellite pixels which are identified as low-level liquid clouds by the retrieval algorithms. We did reject a few cases (match up points) because the apparent cloud-top-height (CTH) did not match the in situ aircraft measurement within 1 km (satellite reported CTH > 3 km). While there are not a sufficient number of the poor CTH cases to quantify errors for these cases, we note that all of the satellite imager retrievals assume single layer clouds. Situations in which an optically thin high-altitude (ice) clouds overlays an optically thicker low-altitude (liquid) clouds is a long-standing problem for imager-based retrievals, but filtering for CTH < 3 km appears to be satisfactory for the present analysis.
The in situ aircraft measurements and satellite retrievals do not necessarily occur simultaneously. In our analysis, we account for the time offset by adjusting the box location for cloud advection, and we set a maximum time offset between the in situ and satellite data to be 1 h for MODIS (and CERES-MODIS) and 10 min for Himawari-8. Specifically, we account for the distance clouds traveled by averaging the in situ measured wind speed near the cloud top, an approach which is similar to that employed by PZ11.
After the above filtering and processing, there remained 20 in situ cloud profiles (from eight flights) closely aligned with Aqua MODIS overpasses, and 51 profiles (from 14 flights) closely aligned with Himawari-8 products. In total, 53 in situ cloud profiles are used in this study and statistics are provided in Table 2. The circles marked on Figure 1 show the location of these profiles, and Table S1 in the supporting information lists in situ properties for each profile.

In Situ and Satellite Retrievals Comparisons
In this section, satellite retrievals from MODIS, CERES-MODIS, and Himawari-8 are compared with the in situ measurements of τ, r e , LWP, and N d . Statistics summarizing the comparison between the in situ and three satellite products are provided in Tables 3-5, respectively. We begin the analysis with τ and r e , after which we focus on LWP and N d , which are derived from τ and r e .
KANG ET AL. Notes. N is the number of the data points. For each cell, value in front of parentheses is the statistics for all the collocated profiles, while the first, second, and third values are that for nonprecipitating, lightly precipitating, and heavily precipitating cases, respectively.

Cloud Optical Depth
Figure 3 compares the in situ derived τ with satellite-retrieved τ. The vertical bars show the standard deviation of τ for the pixels within the collocated satellite match-up box that is used for averaging (see Section 2). In many cases, the vertical bars exceed five, showing that there is typically a large horizontal variability in τ on the satellite pixel-scale. MODIS τ 3.7 correlates well with in situ values (R = 0.91), having a mean bias of only 0.1. CERES-MODIS τ also correlates wells, R = 0.91, having a mean bias of 1.5, which is not statistically different from zero at a 95% confidence level (as the one-sigma uncertainty in the mean, that is the standard error, is about 1, while bias that is 2 times smaller than the standard error is not significant at 95% level of confidence). Himawari-8 τ is not well correlated with R = 0.79 and a mean bias of 2.6, which is nominally significant at 95% confidence. However, the Himawari-8 data yield about the same as the mean bias as CERES-MODIS, when restricted the 18 cases common to all three datasets (Figure 3d), with R = 0.86 and a mean bias = 1.88. Thus, the overall lower performance suggested by the full set of Himawari-8 match-ups is due to having more difficult cases. In particular, there are more cases with multiple low-level cloud layers (that is, multiple layers below 3 km; gray filled dots) in the Himawari-8 set, and in general, cases which are more spatially variable (notice the larger vertical uncertainty bars in panel c). As will be discussed further KANG ET AL. . Satellite N d is retrieved with typically assumed constants (k = 0.8, and f ad = 0.8), Nd_obs-mean-k_fad is retrieved by setting k and f ad to the mean of the in situ values, and Nd_obs_cbc_k_fad by using case-by-case in situ value of k and f ad . During the N d retrieval, condensation rate(c w ) is calculated using the satellite-retrieved cloud top temperature and pressure.    in Section 5, a mean bias near 2.5 with a one-sigma certainty of near 1 is reasonably good performance and is consistent with expectations based on previous studies. Tables 3-5 also lists the statistics for cases in different precipitation regimes (nonprecipitating, lightly precipitating, and heavily precipitating), and we will discuss these results in more detail in the context of the LWP retrieval in Section 3.3, after examining the effective radius.

Effective Radius
The comparison between satellite-derived r e and in situ r e is shown in Figure 4. Here, the in situ r e is derived from the merged spectrum of CDP and 2DS. MODIS r e3.7 correlates well with in situ r e (R = 0.9) and has a mean bias of 0.0 μm. In spite of being for the same set of cases, perhaps surprisingly the correlation between CERES-MODIS and in situ r e is not quite as good as that of MODIS, with R = 0.78. Nonetheless, the mean bias of CERES r e is small at 0.2 μm and not significantly different from zero at the 95% level of confidence.
As for Himawari-8, the overall results are similarly good with the correlation between retrieved r e and in situ r e being 0.84, though the retrieved r e are generally larger than in situ r e , with a mean bias of 1.2 μm (which is significantly different from zero at 95% level of confidence, assuming the data to be normally distribution). However, as was the case for OD, the difference in the Himawari-8 bias is due to additional cases analyzed, and the bias reduces to −0.29 μm when restricted to the set of cases common to all three retrievals (see Figure 4d).
KANG ET AL.
10.1029/2020EA001397 11 of 30 As shown by the red symbols in Figure 4, larger negative errors are associated with some heavily precipitating cases (PWP > 10 gm −2 ), while most non-and lightly precipitating cases have a small positive bias.
To demonstrate further how the error in r e retrieval is related to the presence of precipitation, in Figure 5 the r e retrieval error is plotted as a function of PWP. For MODIS r e retrievals (panel a), there is a large negative bias associated with four cases, all of which have a PWP greater than 12 g m −2 . The same four cases are also negatively biased in CERES-MODIS and Himawari-8 retrievals. When considering all cases, there is more variability in the Himawari-8 retrieval error when PWP is greater than about 10 g m −2 than when PWP is less than about 10 g m −2 . The mean bias of Himawari-8 r e for these heavily precipitating cases is small (−0.3 μm), while the bias for non-and lightly precipitation is 1.4 μm. When restricted to the 18 cases common to all three datasets, the mean bias for the non-and lightly precipitating cases is similar and statistically significant in all three datasets, with values of 0.78, 1.52, and 0.62 μm for MODIS, CERES-MODIS, and Himawari-8 r e retrievals, respectively. The presence of precipitation is clearly an important factor, and this will be explored in greater depth in Section 4.
As mentioned in Section 2.2, MODIS r e retrievals are also available based on observations at 1.6 µm and 2.1 µm in addition to 3.7 µm. The difference in the three MODIS r e retrievals is influenced by the different absorption in different bands, with the photon penetration depth being largest at 1.6 µm and smallest at 3.7 µm. Figure 6a shows a comparison between all three MODIS r e retrievals with in situ r e . Both r e2.1 and r e1.6 correlate well with in situ r e , with R = 0.83 and R = 0.84, respectively, being slightly smaller than that of r e3.7 (R = 0.91). As is also shown in Figure 6, there is one case (marked by cross) with an unusually large difference among the three channels. This difference is likely due to the inhomogeneity of cloud scene, as will be discussed in Section 4.1. When this case is excluded, the mean bias in r e2.1 and r e1.6 (taken across all cases) is only 0.30 μm and 0.17 μm, respectively. However, as the case for r e3.7, there is marked variation with amount of precipitation. Similar to Figure 5, circles in Figures 6c and 6d show retrieval error of r e2.1 and r e1.6 as a function of PWP. Overall, a positive bias still exists for non-and lightly precipitating cases, and the four cases associated with large negative bias in r e3.7 (Figure 5a) continue to show a negative bias in re 2.1, though to a smaller extent.
To compare the MODIS r e retrievals from different wavelengths, Figure 6b shows the difference between r e3.7 and r e2.1 (or r e1.6 ) as a function of PWP. In general, r e2.1 is larger than r e3.7 with most points (orange circles) having a positive difference (located above the zero line), and this positive difference becomes more obvious for the heavily precipitating clouds. A similar positive difference is found for r e1.6 -r e3.7 , with r e2.1 typically being closer to r e3.7 than r e1.6 . As the amount of precipitation tends to increase with depth into the cloud, the increase in particle size for the precipitating cases is consistent with the expectation since photons at 2.1 μm can penetrate deeper into the cloud than at 3.7 μm. This does not explain, however, why r e2.1 or r e1.6 is larger for the nonprecipitating cases (where one might expect the opposite behavior), suggesting that factors other than vertical variation, penetration depth, and precipitation are important in the difference.
KANG ET AL.
10.1029/2020EA001397 This result, is consistent with previous studies that show r e2.1 (or r e1.6 ) tend to be larger than r e3.7 (e.g., PZ11; King et al., 2013). Figure 7 shows a comparison between in situ LWP and satellite-derived LWP, calculated using Equation 6, which assumes clouds are adiabatic. MODIS LWP correlates well with in situ LWP (R = 0.83) and has a mean bias of only 1.6 g m −2 , while for CERES-MODIS, R = 0.79 and the mean bias is 12.2 g m −2 (which is not significantly different from zero at the 95% level). For Himawari-8 using 51 cases, R = 0.64 and mean bias is 16.1 g m −2 (not significant at 95%), with better performance for single-layered cases in Himawari-8 retrievals (R = 0.8 and mean bias = 15.8 g m −2 ), and with similar performance to MODIS when restricting to the set of cases common to all three satellite retrievals (Figure 7d).

Cloud Liquid Water Path
In the literature, satellite-derived values for LWP are sometimes obtained by assuming a vertically homogeneous cloud (Equation 5), rather than an adiabatic cloud (Equation 6). Tables 3-5 provide mean bias and other error statistics using this alternative formulation. As one might expect given that the in situ profiles of LWC (see Section 2) do show an adiabatic-like profile, the adiabatic formulation for LWP produces better overall results, whereas the vertically homogeneous assumption results in a statistically significant overestimation of LWP.
KANG ET AL.

10.1029/2020EA001397
13 of 30  (d), the color code is the same that in Figure 5 and earlier figures. The two vertical dashed lines in panels (b) and (c) denote the thresholds of 2 and 10 gm −2 used to define non-and lightly precipitating categories.
One expects a positive error in r e , or τ (meaning the retrieved value is too large) will result in a positive error in LWP (regardless of which of the two LWP formulations is used), and indeed we find this to be true, as shown in Figure 8. For all three satellite products, the bias in LWP is positively correlated with bias in r e , with the R of 0.52, 0.69, and 0.59, respectively, and positively correlated with bias in τ, with the R of 0.92, 0.88, and 0.91. Note that there are more black and blue points (associated with non-and lightly precipitating profiles) in the upper right quadrant in Figures 8b and 8c. In Section 3.2, it was noted that non-and lightly precipitating cases have a small positive (satellite > in situ) mean bias in r e in all three retrieval datasets. Likewise, the OD for the non-and lightly precipitating cases is also slightly biased in the CERES-MODIS and Himawari-8 datasets, as is evident in Figures 8e and 8f, which show fewer points in lower left quadrant than upper right quadrant (see also Tables 4 and 5). Together the positive bias in r e and τ creates a small (but statistically significant) bias of 19.22 and 21.58 g m −2 in the LWP. In the operational MODIS MYD06 product, on the other hand, there is no significant LWP bias associated with non-and lightly precipitating cases; and these points have a mean bias of only 2.93 g m −2 . This is because the bias in r e is countered by a small compensating error in τ of about −0.6 for MODIS for lightly precipitating cases (note the points in lower left of Figure 8d). The small bias of −0.6 is not itself statistically significant, and so it is ambiguous as to whether this compensation is coincidental. If coincidental, one expects that MODIS LWP would also have a small bias in LWP for non-and lightly precipitating clouds given that it appears to have a similar bias in r e , but based on the data we have, all we can conclude is that there is no significant bias in LWP.
KANG ET AL.

10.1029/2020EA001397
14 of 30 While there is no statistically significant bias associated with the heavily precipitating cases (red circles), there is considerable variability with these cases having largest positive and negative errors in r e , τ, and LWP. The standard error (uncertainty in the mean) is greater than 20 g m −2 for the heavily precipitating cases in all three datasets. In particular, the handful of cases identified as having large negative error in r e (retrieved r e is too small) have the largest underestimate in LWP. Figure 9 compares the satellite-derived N d with the in situ values. When considering all comparison points, the MODIS, CERES-MODIS, and Himawari-8 N d retrievals are biased by only −9.1, −7.9, and −1.6 #/cm −3 , respectively. These biases are not significantly different from zero at the 95% level of confidence and are small or modest relative to the overall mean of 86.8 #/cm −3 ( Table 2). As was the situation for LWP (discussed above in Section 3.3), the impact of precipitation on the bias in N d retrievals is complicated by the correlation between errors in r e and τ and is somewhat different in each of the three datasets and also depends to amount of precipitation present. In all three satellite datasets, the errors in r e and τ tend to cancel out, producing relatively little bias in N d . The only statistically significant bias we find are for the lightly precipitating category, where MODIS and CERES-MODIS retrievals have underestimated the N d by about 30-40 #/cm −3 ,and Himawari-8 retrievals have underestimated the N d by −17.7 #/cm −3 (from an overall mean of about 100 #/cm −3 ). We note that the correlation between the retrieved and in situ values is poorer for N d (ranging from 0.49 to 0.77) than for r e , τ, and LWP. At the end of this section, we examine in more details the effect of random errors (variability from profile-to-profile) in the retrieved N d .

Cloud Droplet Number Concentration
KANG ET AL.

10.1029/2020EA001397
15 of 30 Perhaps equally importantly, we find a large bias error in cases with multiple low-level clouds for Himawari-8, with a mean bias of 23.4 #/cm −3 . There are only 10 cases where multiple low-level clouds are present, but the difference is significant because these cases have smaller droplet concentration (mean value about 52.4 #/cm −3 with a mean-absolute deviation of 27 #/cm −3 ). The MODIS and CERES-MODIS retrievals include only one such multilayer case, and we cannot directly assess if the results would be similar for multilayer cases for these two datasets, but given the similar physical basis of the retrievals, it seems likely that the MODIS-based retrievals would have similar difficulty. Unfortunately, it is difficult to identify when multiple low-level cloud layers are present from satellite VIS-IR imagery alone; however, other measurements such as CALIPSO lidar backscatter might be used to detect the presence of such layers in combined retrievals algorithms. When multilayer clouds are removed from the set of cases examined, the three datasets have similar mean biases of −9.7, −8.6, and −7.7 #/cm −3 for MODIS, CERES-MODIS, and Himawari-8, respectively.
The above assessment for N d is based on an assumed value for k of 0.8, f ad of 0.8, and using c w value calculated using Equation 9 with satellite retrieved cloud top temperature and pressure. Using in situ measurements, f ad can be calculated using Equation 10. Doing so, we find a mean value of 0.74 for the 43 single-layered profiles. Likewise, the k factor can be calculated using the SOCRATES data based on Equation 8. The value for k is generally not constant over the depth of the clouds, but typically is larger toward cloud top because the DSD is narrower. In Figure 10, we plot histograms of the calculated k factors near cloud top (integrated extinction from cloud top less than 1) and for all vertical levels. The averaged k for cloud top is 0.76 ± 0.08, which is slightly larger than averaged k for the whole cloud layer 0.73 ± 0.09. Using both the KANG ET AL.

10.1029/2020EA001397
16 of 30 mean cloud top k of 0.76 and mean value f ad of 0.74 has little net effect on the retrieval, with the resulting mean bias for N d from MODIS, CERES-MODIS and Himawari-8 becoming -8.1, −7.0, and −0.5 #/cm −3 . We also find that, if one uses values of k and f ad obtained from the in situ data on case-by-case basis for the N d retrieval, there is likewise little change in the mean bias (−7.2, −5.9, and 0.1 #/cm −3 ). The small net change in the bias occurs because the impact of decreasing f ad opposes (or compensates) for the effect of decreasing k in Equation 7. That is, what is important for the retrieval is the ratio sqrt (f ad )/k, which remains nearly constant and produces no net bias (systematic error) in the retrieval.

Uncertainty Analysis for N d
Following Grosvenor et al. (2018) and Bennartz (2007), one can estimate the contribution of random errors (or uncertainty) in input variables in Equation 7 to the random error in N d , using a Gaussian error propagation formulation as shown in Equation 11. The derivation assumes the input errors are normally distributed and uncorrelated with each other.
In short, the expected fractional error in N d would be given by square root of the sum of the squares of the fractional errors in the input terms on the right-hand side of Equation 11. For each input variable, we have calculated the fractional error for the inputs using the case-by-case (profile-by-profile) SOCRATES single-layered collocated profiles. For example, for Himawari-8, we approximate  e r as the standard deviation of (retrieved r e -in situ r e ) which equals 1.93 μm and e r as the mean in situ value of 11.73 μm, and so  with terms on the right-hand side of Equation 11 as input values. As one might intuitively expect from Equations 7 and 11, errors in N d are sensitive to changes in r e since r e is raised to the power of 5/2. Our estimates show that error in r e is indeed the largest source for N d error, with highest relative error contribution, followed by error in τ. As for assumed constants, variability in c w , k and f ad can also contribute to N d error, but based on variability observed during SOCRATES the impact is smaller than that of r e , though we note the SOCRATES samples data are limited to summertime stratocumulus. One might notice that the sum of the expected percent fractional error doesn't    . Nonetheless, it seems safe to conclude that error in r e have a relatively large impact on the uncertainty in the N d retrieval as compared with other sources, with a total (case-to-case) uncertainty between about 40% and 55%.

Error Analysis
Satellite imager retrievals examined in this article invoke several assumptions about cloud structure and microphysics, and errors are likely to arise when these assumptions are violated in the real world. In this section, we focus on errors in the effective radius retrieval, which are arguably the most statistically robust errors identified in Section 3, to assumptions in the bi-spectral retrieval, as well as examine some uncertainties in our analysis approach. Specifically, in Section 4.1, we examine errors related to the assumption of horizontally homogeneous (i.e., plane-parallel or one-dimensional) clouds. The bi-spectral retrieval also assumes that the shape of the cloud DSD can be represented by a simple function with a single mode. In the case of the MODIS, CERES-MODIS, and Himawari-8 bi-spectral retrievals examined in this article, a modified gamma distribution with a fixed effective variance is assumed. Larger liquid droplets absorb more SWIR radiation than smaller droplets, and at its core, the bi-spectral technique is using the difference in absorbed radiation (between the visible and SWIR) to determine particle size. In simple terms, the larger the droplets are (on average), the larger the absorption is, and the smaller the ratio of SWIR reflectance to visible wavelength becomes. The retrieval therefore also has some sensitivity to the width of the DSD. In Sections 4.2, we show that when there is large contribution from larger precipitating droplets near cloud top, these cases are associated with significant underestimate in the effective radius, and in Section 4.3, we examine errors associated with the assumed width for the size distributions for the non-and lightly precipitating cases. Last, in Section 4.4, we discuss uncertainties related to the in situ probes and analysis technique.

Horizontal Inhomogeneity
Standard cloud remote sensing techniques rely on two basic assumptions: First, clouds are assumed to be plane-parallel and homogeneous within each satellite pixel. Second, pixels are assumed independent and the net horizontal radiative transport between pixels is neglected. Standard cloud remote sensing techniques rely on two basic assumptions: First, clouds are assumed to be plane-parallel and homogeneous within each satellite pixel. Second, pixels are assumed independent and the net horizontal radiative transport between pixels is neglected. Standard cloud remote sensing techniques rely on two basic assumptions: first, clouds are assumed to be plane-parallel and homogeneous within each satellite pixel. Second, pixels are assumed independent and the net horizontal radiative transport between pixels is neglected. This bias increases with pixel size as the amount of subscale inhomogeneity is increasing Standard cloud remote sensing techniques rely on two basic assumptions: first, clouds are assumed to be plane-parallel and homogeneous within each satellite pixel. Second, pixels are assumed independent and the net horizontal radiative transport between pixels is neglected.
The three satellite-imager datasets evaluated in this study are based on a bi-spectral technique, which assumes clouds are horizontally homogeneous (i.e., plane-parallel or one-dimensional). Of course, in reality, the cloud fields often exhibit significant horizontal variability, and the breakdown of the one-dimensional assumption can lead to systematic errors during the retrieval (e.g., Marshak et al., 2006;Zhang et al., 2012).
To assess the impact of horizontal inhomogeneity on the retrieval error, we examine the relationship between heterogeneity in the satellite visible imagery and errors in effective radius using the H σ index (Liang et al., 2009), defined as: which is the ratio of the standard deviation to the mean of the reflectance within the domain. Our adoption of this metric stems from previous research suggesting that clouds with H σ < 0.3 are sufficiently homogeneous that errors due to one-dimensional assumption are likely small is this situation, while larger values of H σ associated with more heterogeneous cloud fields have significant retrieval biases (Zhang et al., 2012;Zhang & Platnick, 2011). For MODIS, we calculated  H using the MODIS radiance (MYD02QKM product) at 0.86 μm for the same 5 × 5 pixel analysis box used in the comparisons in Section 3. Similarly, we calculate  H for Himawari-8 reflectance at 0.8 μm for using the same 3 × 3 pixel analysis box. The MODIS radiance is observed at 250 m (nadir) resolution, which is finer than the 1 km grid used for the MODIS cloud property retrievals. The results shown here are based on the 250 m data, but we find our results do not differ appreciably if the radiance data are first reduced to 1 km resolution. Figure 11 shows the error in the retrieved r e from MODIS and Himawari-8 as a function of H σ . Overall most points have a value for H σ smaller than 0.3, and there is no clear dependence in the biases for these points. However, there are a few points with H σ > 0.3. For the one case with H σ ∼ 0.7, there is a large difference in the three r e retrievals from MODIS (based on different SWIR bands), which motivated us to remove this point from the analysis in Section 3.2. For this heterogeneous point, the MODIS 3.7 μm band retrieval has the least error, which is consistent with Zhang and Platnick (2011), and other studies that have suggested that this band is less susceptible to three-dimensional effects. For Himawari-8, there are two cases with H σ > 0.5 that show relatively large error in r e . Overall, most of the cases we evaluated are relatively homogenous with no dependence on H σ , which suggests that horizontal heterogeneity is not a dominant source of r e error for our evaluation result. We also examined whether errors in retrieved τ show any dependence on H σ since previous studies suggested that the retrieved τ can be smaller than the actual τ due to heterogeneity . We found that retrieved τ error likewise shows no clear dependence on H σ for our cases (figure not shown).

The Presence of Precipitation at Cloud Top
The presence of precipitation can significantly impact the r e retrieval. Minnis et al. (2004) and Zhang (2013) show that the presence of precipitation can result in underestimation in retrieved r e. In Section 3.2, we find that r e is underestimated for some (but not all) heavily-precipitating cases. To further assess the contribution of the droplets larger than 50 µm, we calculated the ratio of mean LWC over the top 1 OD of the cloud for droplets with diameters >50 µm (i.e., precipitation water content, PWC) and droplets with diameters < 50 µm (i.e., cloud water content, CWC). Figure 12 shows difference between satellite retrieved r e and in situ r e as a function of this ratio PWC/CWC. timation of r e was found for three cases with large contributions from precipitation particles (ratio > 0.2). This demonstrates that it is not the presence of precipitation in the cloud (characterized by PWP) but the presence of precipitation near cloud top that is important. We note that if we ignore particle larger than 50 µm and calculate in situ r e only from the CDP, the difference between the satellite retrieved r e and in situ r e (showing as crosses in Figure 12) are smaller for these three heavily-precipitating cases, but the satellite retrieved r e still shows underestimation, especially for two of the cases, demonstrating that this effect is not an artifact resulting from the merging of the CDP and 2DS (more on this in Section 4.4).

Droplet Size Distribution Width (for Non-and Lightly Precipitating Clouds)
Satellite retrievals algorithms typically make assumptions regarding the shape of cloud DSD. The MODIS, CERES-MODIS, and Himawari-8 retrievals examined here assume a modified gamma distribution which can be written as: (Hansen, 1971) where r is the droplet radius, N 0 is a constant, and e v is effective variance given by (Hansen, 1971) r r r n r dr v r r n r dr For the gamma distribution one can show that k = (1 − v e ) (1 − 2v e ). Thus, the width of the DSD can be assessed using v e or k factor. In the retrievals, MODIS assumes a modified gamma distribution with a fixed variance v e of 0.1 (Platnick et al., 2017), as do CERES-MODIS and Himawari-8 (W. L. Smith, personal communication, 2020). v e = 0.1 corresponds to k = 0.72. Of course, the actual DSD may not be well approximated by a gamma distribution with v e of 0.1, and this will impact the retrieved r e (Arduini et al., 2005).
In order to, explore the width of the cloud DSD with respect to precipitation amount, we plot the in situ estimated k factor as a function of PWP in Figure 13a. Here k is calculated using Equation 8 and no assumption regarding the shape of the DSD is made. Consistent with our earlier analysis and focus on values needed for the retrieval, here the k factor is determined by averaging the observed DSD over 1 OD at the cloud top, and for simplicity, we only consider single-layered clouds. The k factor tends to decrease (the distribution becomes broader) with increasing PWP. The mean k factor for nonprecipitating, lightly precipitating and KANG ET AL.
10.1029/2020EA001397 20 of 30 Figure 12. (a) MODIS r e3.7 error as a function the ratio of mean liquid water content (LWC) over the top 1 OD of the cloud for droplets with diameters >50 µm (i.e., precipitation water content, PWC) and droplets with diameters <50 µm (i.e., cloud water content, CWC). (b) and (c) are the same as (a) except for CERES-MODIS and Himawari-8 r e . Only cases with H σ < 0.3 are considered here. Colors and symbols are same that in Figure 3, with open circles representing the difference between satellite retrieved r e and in situ value calculated using merged DSD from Cloud Droplet Probe (CDP) and two-dimensional stereo (2DS) , while cross represent the difference between retrieved r e and in situ r e calculated using the CDP only.
heavily precipitating cases is 0.80, 0.77, and 0.70. In particular, the observed DSD width of the nonprecipitating and lightly precipitating cases is narrower than the assumed value (that is, k is greater than 0.72). We likewise calculated v e for those nonprecipitating and lightly precipitating cases using Equation 14 with the cloud DSD from CDP probe averaged over 1OD. The mean value of v e is for these nonprecipitating and lightly precipitating cases is about 0.068, which is narrower than the assumed v e = 0.1 in the retrieval.
We discussed the impact of bias and uncertainty in the k factor on N d in Section 3.4. A quantitative assessment of the impacts of uncertainty (or bias) in k (or v e ) on the r e retrieval is more difficult and arguably requires detailed radiative transfers calculations using a variety of values for k (or v e ). However, we can gauge the impact of the droplet width on r e retrieval based on result published by PZ11, who examined the impact of the distribution width on the r e retrieval using a log-normal σ (σ log ). We estimated σ log of the in situ measured DSD using a least squares minimization. We opted to use a minimization approach to obtain a best fit for a log-normal distribution to the bulk of the observed cloud particles and to minimize the impact of unusually small or large particles (outliers in the data), which we found to significantly broaden the estimated σ log . Details are given in the supplementary material.
As shown in Figure 13b, σ log is negatively correlated with k because broader DSD means smaller k and larger σ log , while a narrower DSD means larger k and smaller σ log . The σ log for non-precipitating and lightly-precipitating of single-layered cases averaged over the top 1 OD is 0.16. PZ11 undertook radiative transfer simulations to understand how the retrieved r e is impacted when the true value σ log is smaller than the value assumed in the radiative transfer calculation. They found that when actual σ log is smaller than the assumed value of 0.35 (equivalent to v e = 0.1), the retrieved r e is also larger and would be overestimated (biased high). Specifically, PZ11 compared the retrieved r e assuming σ log = 0.35 and 0.2 and found retrieved was r e overestimated by as much as 0.58 μm. This result is broadly similar to result published by Chang and Li (2001), who found that a change of ±0.15 in σ log resulted change of about ±1 μm in the mean of the r e retrievals (starting from a nominal value of 0.35 for σ log with  e r 10 μm).
We concluded, therefore, that much of the positive-bias in r e for the non-and lightly precipitating cases (shown in Section 3.2 to range from 0.5 to about 1.0 μm) is likely due to having an assumed effective variance that is a bit too large, or stated more generally, an assumed DSD in the retrieval which is too wide for the SO clouds observed during SOCRATES. As a caveat, we note that the solar and view geometries in this study are not identical to those in previous studies that examine the width of the DSD and its impacts on the retrieval. We do not expect this is a significant factor for the solar and view geometry during SOCRATES, as the profiles were taken during the Southern Hemisphere summer primarily in the afternoon when the sun is reasonably high with a solar zenith angle less than 60°. Nonetheless, the above conclusion should perhaps be quantified using full radiative transfer calculations for the precise conditions observed during SOCRATES, and more generally evaluated over the range of solar and view geometries encountered over the SO to more fully assess the impact, especially as regards possible seasonal impacts. Such is beyond the scope of the present study and is left as a topic for future work.

Uncertainty due to Instrumentation
In the preceding analysis, we calculated in situ r e from the DSD obtained by merging measurements from the CDP and 2DS. Specifically, we used all the CDP bins (includes particles up to 50 µm) and combine it with the DSD from the 2DS for bins larger than 50 µm, the same approach as used by King et al. (2013). We have also explored merging the CDP and 2DC, as well as a second alternative (ALT) approach for merging the CDP and 2DS, in which we use the DSD from CDP for bins smaller than 25 µm, the DSD from the 2DS for bins larger than 50 µm, and use the larger values between the two probes for the intermediate bin (25-50 µm). Figure S3 in the supplementary material shows an example of the CDP, 2DS, and 2DC spectra and the result merged DSD. Note. Since 2DC probe is not available for research flight RF02, only data from other flights with collocated profiles are considered here for comparison. For each cell, value in front of parentheses is the statistics for all the collocated profiles, while the first, second and third, value are that for nonprecipitating, lightly precipitating, and heavily precipitating cases.  research flight RF02, only data from other flights are considered. For the 19 profiles available for MODIS, mean in situ r e calculated using different probes or merging methods varies from 11.54 µm with the CDP only to 13.21 µm with CDP+2DS (ALT). To visualize the difference of in situ r e, Figure 14 shows box plots of in situ r e for cases collocated with different sensors. Overall, CDP+2DS (ALT) gives largest in situ r e in all precipitation regimes. In situ r e from CDP+2DS (King) is smaller than CDP+2DS (ALT) because counts in the intermediate bin (25-50 µm) from the CDP are typically smaller than that from 2DS. In situ r e from CDP+2DC tend to be smaller than that from CDP+2DS, and close to that from CDP only, as counts from 2DC bins are usually smaller than that from 2DS.
Naturally, the impact of using different probes or merge approach is much more important for the heavily precipitating cases than for the non-or lightly precipitating cases. Nonetheless, even for the non-or lightly precipitating cases, using the CDP+2DS (ALT) merging increases the r e and can (at least partially) offset the estimated error (see Figure 15). Taking the MODIS r e3.7 as an example, for light-precipitating cases, the mean error in MODIS r e3.7 is about 0.98, 0.98, 0.71, and 0.07 µm when compared with in situ r e from CDP, CDP + 2DC, CDP + 2DS (King), and CDP + 2DS (ALT), respectively. Using CDP + 2DS (ALT) appears to eliminate the bias for the lightly precipitating cases. The bias for nonprecipitating cases is, while not eliminated, reduced from 1.27 µm with CDP only to 0.66 µm with CDP + 2DS (ALT). However, the bias for heavily precipitating cases gets worse, going from 0.34 µm estimated using CDP + 2DC to -2.41 µm using CDP + 2DS (ALT).
Thus, regardless of how we merged the CDP and 2DS data, there is a fundamental difference in the bias for the different precipitating categories. If one calculates the bias across all precipitating categories, the CDP + 2DS (ALT) formulation produces the smallest error but does so only by balancing the errors across the different categories. This same pattern is weaker in the CERES-MODIS and Himawari-8, but is qualitatively similar.
Past studies (e.g., King et al., 2013) suggest that the counts in the CDP below 50 µm are more reliable, and we have therefore focused on using CDP + 2DS (King) formulation in our analysis. But we note there is a measurement issue here that needs to be addressed for future field campaigns, specifically that efforts are needed to reduce the uncertainty in measured number concentration for particles between about 20 and 100 µm.

Summary and Discussion
Satellite retrievals of cloud properties have been widely used to study clouds over the SO, but our confidence in these retrievals has been limited by a lack of verification. In this study, cloud properties observed from aircraft during the SOCRATES in January-February of 2018 are used to evaluate retrievals of cloud properties for SO stratocumulus based on the widely used visible shortwave infrared bi-spectral technique. In par-KANG ET AL. ticular, three data sets are examined: (i) the Moderate Resolution Imaging Spectroradiometer (MODIS) level 2 (collection 6.1) cloud product, (ii) the CERES-MODIS edition 4 product, and (iii) the NASA SatCORPS Himawari-8 product. Our analysis focused on the use of vertical profiles of cloud properties constructed from individual aircraft penetrations through the stratocumulus. Analysis of the cloud vertical structure shows that SO stratocumulus have an adiabatic-like structure on average. Moreover, the stratocumulus examined were largely closed-cell (or at least overcast) and homogeneous (with heterogeneity index <0.3).
When the effective radius (r e ) is evaluated, we find a small positive bias in r e for non-or lightly precipitating cases of about 0.5-1 µm in all three datasets (satellite retrievals are slightly too large), though we caution that this bias is somewhat sensitive to whether and how the CDP measurements are merged with 2DS (see Section 4.4). This small positive r e is due (at least in part) to the assumed DSD width being too wide in the retrievals. In the retrievals, the DSD is assumed to be a modified-gamma distribution with an effective variance (v e ) of 0.1, which is larger than the value calculated from in situ measurements for non or lightly precipitating cases of 0.068. Previous studies of polarimetric data have also suggested that v e for the marine clouds is likely to be narrower than that is assumed in the satellite retrievals (e.g., Benas et al., 2019;Di Noia et al., 2019). We also find that the width of DSD increases (the k factor decreases) as the PWP increases, suggesting it might be possible to parameterize this relationship as part of a combined imager-radar retrieval, in which the radar would constrain the PWP. Collectively, cases with relatively heavy precipitation (PWP > 10 gm −2 ) have a negative bias (opposite in sign to the non-and lightly precipitating cases). Not all heavily precipitating cases are negatively biased but rather large biases occurred when significant precipitation was found near cloud top (PWC/CWC > 0.2). In these few cloud-top-precipitation cases, biases in r e ranged from about −2 to −6 μm (satellite retrieved values are too small -see Figure 12). As with the bias for nonprecipitating clouds, the bias for the cases with heavy precipitation is not qualitatively dependent on whether and how the CDP and 2DS data are merged in the calculation of r e , but quantitatively the size of the bias does depend on whether and how 2DS data are merged. However, a key point is that regardless of how we merged the probe data, we cannot simultaneously reduce the magnitude of the bias in both the nonprecipitating and heavily precipitating cases. Reducing the positive bias in the nonprecipitating cases (making the in situ r e larger on average by merging the data such that it maximizes the potential contribution from precipitation) also makes the magnitude of the negative bias in the heavily precipitating cases larger; and vice versa, reducing the magnitude of the negative bias in the heavily precipitating cases (by making in situ r e smaller by minimizing the contribution of precipitation) makes the positive bias in the nonprecipitating cases larger.
As for the cloud OD (τ), CERES-MODIS and Himawari-8 are found to have a small positive bias in τ of about 2-3 to (satellite retrievals are too large) for non-and lightly precipitating cases. This bias is close, but not significant at the 95% level of confidence. On the other hand, MODIS (MYD06) do not appear to be biased for these cases and instead was found to have a small negative bias for lightly precipitating clouds.
Satellite retrievals of LWP are derived based on τ and r e with an assumption about the cloud vertical structure. LWP retrievals based on the assumption of an adiabatic cloud structure compare well with the in situ observations and are unbiased when averaged over all cases, while the assumption of a constant profile in LWC results in a significant overestimate in the LWP (∼+20%). For non-and lightly precipitating cases, the small positive bias in r e and τ for CERES-MODIS and Himawari-8 combine to produce a statistically significant bias of about +20 g m −2 in the LWP for these cases. On the other hand, MODIS LWP was not biased by its small positive bias in r e because of the small compensating bias in τ (about −0.6) for the same lightly-precipitating cases. Heavily precipitating cases do not show a significant bias in τ or LWP for any of the three data sets. However, in all three datasets, there is larger variability associated with the heavily precipitating cases, with these cases having both the largest positive and largest negative errors in r e , τ, and LWP. In particular, the handful of cases identified as having large negative errors in r e (due to significant precipitation near cloud top) had the largest underestimate in LWP.
We also used in-situ measured N d to evaluate satellite retrievals of N d , which is derived using a formulation based on τ and r e . This formulation assumes the cloud is sub-adiabatic, meaning the total LWP is smaller than that for a true adiabatic cloud by a factor f ad , but the LWC still increases linearly with altitude about cloud base, while N d is constant. The formulation also depends on the DSD width (expressed via the k factor) and condensation rate (that depends on pressure and temperature). Overall, the N d retrieval works rea-sonably well for our SO cases, as long as one uses the condensation rate that is appropriate for the SO (and this can be estimated reasonably well from the cloud top temperature and pressure). Errors in r e and τ tend to cancel out producing relatively little bias in N d . The profile-to-profile uncertainty based on ∼5 km × 5 km spatial averages of the N d retrieval was found to be 40%-55%, driven primarily by errors in r e (see Table 6).
Using assumed values of 0.8 for both f ad and the k causes little bias in the retrieval because there is a cancellation of error between f ad (observed mean = 0.74) and k (observed mean at cloud top = 0.76). However, using k and f ad on a case-by-case basis does improve the correlation between the retrieved and in situ N d . With respect to our precipitation classification, the only statistically significant bias in N d that we find is in the lightly precipitating category, where MODIS and CERES-MODIS retrievals have underestimated the N d by about 30 to 40 #/cm −3 , and Himawari-8 retrievals have underestimated the N d by 17.7 #/cm −3 (from an overall mean of about 100 #/cm −3 ). Perhaps more problematically, we also found a bias of about 23.4 #/ cm −3 in the N d retrieval for cases featuring multiple low-level (<3 km) clouds for profiles collocated with Himawari-8. The presence of optically thin layers with low droplet concentrations was found in 10 of the 53 profiles with collocated Himawari-8 data. Only one such case occurred in the set of cases with collocated MODIS data.
Overall, our results broadly agree with the past evaluation studies of the MODIS bi-spectral retrievals technique for overcast stratocumulus. For instance, PZ11 reported that the MODIS retrieved r e2.1 was overestimated by 15%-20% (mean bias of 2.08 μm) in comparison with cloud top r e using 20 profiles (from mostly non-and lightly precipitating subtropical stratocumulus) over the Southeast Pacific (to the west of South America) during The VAMOS Ocean-Cloud-Atmosphere-Land Study (VOCALS), while Min et al. (2012) reported a mean bias of 1.75 μm using 17 nonprecipitating cases from VOCALS. While we focused on the r e3.7 , we likewise find the MODIS r e2.1 is overestimated, though by a slightly smaller amount of ∼10% (mean bias of 1.12 μm) for 12 homogeneous non-and lightly precipitating cases.
Closer to our region of study, A18 evaluated MODIS retrievals in wintertime stratocumulus over the SO near Tasmania. Like us, A18 finds that MODIS underestimates the r e of heavily precipitating clouds and overestimates the r e of nonprecipitating clouds, and like us A18 identify the width of the drop size distribution as a possible factor impacting MODIS retrieval. However, A18 found an overestimation of r e2.1 by ∼13 μm on average for nonprecipitating clouds. While a variety of factors contribute to this rather large r e bias (see discussion A18), the major factor is likely to be the broken and patchy nature of the clouds they observed, which were primarily open-cell or disorganized stratocumulus. The MODIS and Himawari-8 bi-spectral retrievals are based on an assumption of one-dimensional radiative transfer and are known to work poorly for broken and spatially heterogenous clouds and to substantially overestimate r e on average for broken clouds (e.g., Marshak et al., 2006). A18 did include two flights with overcast (closed-cell) stratocumulus. According to their Table 1, the average in situ r e for these two cases were 8.6 and 7.5 μm (which is consistent with the smaller values we observed during SOCRATES for nonprecipitating clouds), while the MODIS retrieved values of r e3.7 are near 12.6 μm on both flights (which is within the range we found for nonprecipitating clouds but toward the high side), resulting in a bias of 4-5 μm (which is several μm bigger than our bias for this cloud type). Our SOCRATES cases included only one nonprecipitating case with a bias larger than 4 μm, and this case was one of our cases with a relatively large cloud heterogeneity index. Thus, we speculate that the somewhat larger bias found by A18 for their overcast cases might be a consequence of cloud heterogeneity. We note that A18 do report a heterogeneity index for their cases, but the index they use is the standard MODIS product index which looks at variability of 250m pixel radiances within each 1 km pixel used in the OD retrieval, and does not characterize the variability of the larger scene or collocation box used in the analysis. We also note that the observations A18 use in their analysis are not restricted to the region near cloud top. One expects the r e in non-precipitating clouds will be smaller below cloud top and this might well have reduced the magnitude of the in situ estimates (and increase the apparent bias) by a few microns.
Very recently, Zhao et al. (2020), hereafter Z20, evaluated MODIS and Himawari-8 r e using SOCRATES measurements for a subset of the flights that we have analyzed. Their results differ from ours in several key respects. Their analysis was based on two approaches: (1) measurements taken when the aircraft was flying horizontally (level legs) that are nominally within about 200 m of cloud top and (2) vertical profiles created from aircraft ramps through the cloud (which is similar to our study). Based on the horizontal flight data, Z20 report a mean bias with Himawari-8 of 4.39 μm for liquid phase clouds and 2.24 μm for mixed-phase clouds (see their Figure 4), while for MODIS r e3.7 they report a bias of about 2 μm for both liquid and mixed-phase clouds (see their Figure 7). It is not clear from their manuscript whether the comparison for Himawari-8 is based on only CDP or the combination of CDP + 2DS (while their MODIS comparison is clearly based on the combination) which might explain some of the difference between their Himawari-8 and MODIS results, but more importantly, in both comparisons the collocated in situ data with Himawari-8 never has an r e value greater than about 11.5 μm and the in situ data collocated with MODIS never has a value for r e larger than about 9.4 μm. This fundamentally differs from what we find. We frequently find in situ values for r e are larger than 12 μm for profiles that contain precipitation (which is common place) and this seems consistent with previous studies. We note that in their analysis of aircraft profile data, Z20 find their profiles (1) have vertical mean values for r e that is larger than the average for their horizontal flight legs and (2) the profile values near cloud top suggest a bias for Himawari-8 that is near (or below) 1 μm (see their Figures 6). As such, their vertical profile data are consistent with our results and inconsistent with the horizontal flight data they present. We speculate that when creating their 10s horizontal leg data averages that periods with low or no condensate (with small values of r e ) or perhaps drop-outs in the data have somehow biased the 10s averages. In general, we suggest that averages of r e should either (1) be weighted by liquid-water-content or total number concentration or (2) better yet, a single DSD should be summed (generated) from the measured counts for the full averaging period to calculate r e and other parameters that characterize the distribution from this single DSD.
As noted above, Z20 subdivide their results between liquid and mixed-phase clouds. They identify mixedphase, as those where the ratio of liquid water content from the CDP (where presumably all CDP observed particles are assumed to be liquid) divided by the total condensed water (estimated from measurements by a Closed-Path Hygrometer, CLH-2) is less than 0.85. We suggest that the approach used by Z20 is problematic because it relies on measurements from two different instruments, where each measurement has a nominal uncertainty of 10 to 15%, and the instruments can (and do) have different response times and sensitivities to icing in supercooled environments. This means that the measurement uncertainty alone can easily cause the ratio of liquid-to-total condensate to be less than 0.85. In fact, we have been unable to reproduce Z20's results in this regard and find that in many of our aircraft profiles LWC for the CDP is greater than TWC from the CLH-2 such that the ratio has unphysical values greater than one. Consequently, we have examined the ratio of ice-to-total condensate for precipitation based on the 2DS only, whose imagery has been processed following Wu and Mcfarquhar (2019) to identify ice particles >= ∼ 200 um. Whereas Z20 find that the majority of the cloud is mixed-phase, we find that only 4 out of 53 of our profiles contain even 10% ice from the perspective of the 2DS ( Figure S4). Of course, it could well be the case that numerous small-ice particles are present and the 2DS-only estimate that we use is substantially underestimating the contribution of ice. But one expects those small ice particles will very rapidly grow in size via the Wegener-Bergeron-Findeisen process, such that (while our 2DS-only) estimate might underestimate the mass of ice, we would detect its presence. Overall, we find no distinction between cases that contained large-ice from those without large-ice, in any significant way, for any choice of the ice-mass-fraction. Ultimately Z20 conclude that phase does not matter (bias is about the same for liquid and mixed-phase), and in this sense we agree. Nonetheless, we do not believe the majority of the cloud should be considered mixed-phase. At present, evaluation of cloud phase (across the full range of SOCRATES instruments) remains an ongoing area of research by SOCRATES instrument teams, and more work is needed to understand the performance of instruments under the challenging conditions encountered.

Conclusions
We conclude there is a consistent pattern between studies which show there are small but statistically significant biases associated with the MODIS and Himawari-8 bi-spectral retrievals of r e for overcast SO stratocumulus as compared with in situ aircraft measurements, even when comparisons are appropriately restricted to near cloud top observations. At least here, and in A18, the bias depends significantly on precipitation within the cloudy column, and we conclude that the presence of precipitation near cloud top (not just within cloud) is of particular importance. We find the bias for non-or lightly precipitating stratocumulus to be consistent with (if a bit smaller) than those identified during VOCALS for subtropical stratocumulus and find (as other studies have) that this bias is due (at least in part) to the width (shape) of the assumed drop size distribution, where a distribution that is too broad for nonprecipitating marine stratocumulus has been assumed in the retrieval. In general, precipitation is associated with wider distributions, and the observed DSDs is not always well characterized using a monomodal log-normal or gamma size distribution (see supplementary material). The biases in τ are less robust and typically not statistically significant at the 95% level of confidence, depending on data set and precipitation category. Errors in r e and τ propagate into the retrieved LWP and N d in somewhat complex ways, as errors in the r e and τ are correlated (again depending on the presence of precipitation). A summary is given in Section 5 with more detailed discussions given in Sections 3.3 and 3.4. In general, we find the bias and case-to-case uncertainty in the satellite N d retrievals is smaller than one might expect simply from the bias and random errors in r e because of this correlation.
We stress the SOCRATES measurement were collected in the afternoon and during the SH summer where solar zenith angles are less than 60° (conditions under which theoretical studies suggest the bi-spectral retrieval should work well for homogeneous clouds). Thus, we are not surprised to find the bi-spectral retrieval works similarly well during SOCRATES as has been found with subtropical stratocumulus. We suggest that additional research should be undertaken using detailed radiative transfer simulations to model and better understand how variations in the DSD and its vertical and horizontal structure are likely to effect retrievals at larger solar zenith angles typical of the SO in other seasons and other times of day and more generally suggest that additional measurements should be collected during the SO winter.
In Section 4.4, we examined briefly the impact of combining data from the CDP, 2DS, and 2DC probes has on our analysis. The agreement between satellite retrievals and in situ r e shows some dependence on the choices of in situ probes and merging methods. Several evaluation studies (e.g., King et al., 2013;Platnick & Valero, 1995;Witte et al., 2018) have considered the uncertainty of in situ measurement of r e . For instance, Witte et al. (2018) compared the r e measured by phase Doppler interferometer (PDI) with MODIS r e2.1 and revisited the evaluation studies over the Pacific (e.g., Noble & Hudson, 2015; PZ11) using different instruments during the same three campaigns. Witte et al. (2018) found no apparent systematic bias (mean bias of −0.22 µm) in retrieved r e2.1 . Indeed, as we show in Section 4.4, we can merge the CDP and 2DS data in such a way that there is little overall bias in the r e , but this result is obtain by balancing the errors between non-, lightly-and heavily precipitating case, and there is a fundamental difference in the bias for the different precipitating categories. In short, as these studies highlight, there are significant uncertainties associated with in situ measurements, and a continued need for improved in situ measurements.