The NASA DC-8 and P-3B aircraft flew within about a kilometer or less of each other on three occasions during the Transport and Chemical Evolution Over the Pacific (TRACE-P) campaign in order to intercompare similar measurements on the two aircraft. The first and last intercomparisons were in relatively remote marine environments during transits to and from Asia. The first began with a boundary layer measurement followed by an ascent to 3 km. The second set of intercomparisons was at a fixed altitude of about 5.2 km off the coast of Japan, also in relatively clean air. Finally, the third measurement began at 5.3 km and then descended into the boundary layer. A number of measurements were compared, with the best agreement observed for the most abundant compounds such as CO2 and CH4 and with very good agreement for CO, O3, and j values. Other comparisons, including measurements of the same compounds on both aircraft and measurements of the same compound by two different instruments on the DC-8, varied over a wide range: from quite good for PAN, NO, HNO3, and H2O, to reasonable agreement for OH, HO2, CH2O, acetone, and methylethylketone, to generally poor for NO2, SO2, PPN, acetaldehyde, and methanol. The comparison results, particularly those for the fast 1-s CO and O3 measurements, suggest that credible intercomparisons can be made using two aircraft in close proximity for relatively long lifetime and stable compounds. Much new understanding can also be gained from measurements of more reactive and generally shorter lifetime compounds, but additional improvements are needed to make such studies as meaningful as those of longer lifetime compounds. Comparisons such as these, made as a component of a larger field campaign, have the advantage that they test the actual instrument configuration used during the field study and they require no additional instrument installation and testing.
 The intercomparison of many different measurement techniques for a variety of important atmospheric compounds during a field campaign can provide much needed insight into how well various measurements are being made under real world conditions. The importance of intercomparisons has long been recognized by the NASA GTE program, which has played a leadership role in designing and sponsoring several such studies [Hoell et al., 1993, 1990; Beck et al., 1987]. These have, in general, been formal blind or double-blind intercomparison campaigns that were separate from more broadly based chemistry and transport research campaigns. The present intercomparison differs in several ways from previous comparisons, but also tries to address new and ever more complex measurement and validation concerns. Former NASA intercomparisons were designed to evaluate which instrument or measurement technique provided the most accurate, precise, and sensitive measurement. An isolated comparison of instruments, of course, does not by itself provide this information, but in conjunction with the use of common calibration standards, ancillary measurements, modeling, and a wide range of natural variability of the compound of interest, much can be learned about relative sensitivity, precision, and to a lesser extent, absolute accuracy. Such intercomparisons were fairly competitive, and at least to some extent, aimed at identifying the best instrument to be used in future NASA missions. While this need still exists, the campaign scene has become more complex, requiring additional comparison opportunities. The NASA GTE program now often flies two aircraft simultaneously during a field campaign, each with its own specialty that complements the mission goals, but still with many of the same measurement capabilities on each aircraft. 
In addition, Transport and Chemical Evolution Over the Pacific (TRACE-P) and future missions are planned to be joint with aircraft from other agencies and with satellites which also possess overlapping measurement capabilities. If models are to combine data from multiple aircraft and even multiple agencies' aircraft and satellites in a meaningful way, biases in measurements between platforms must first be identified and, in the future, minimized or removed. As the number of compounds being measured expands, and in many cases the number of techniques used to study each of them expands as well, with each technique often having its own inherent advantages and disadvantages for a given mission or platform, the problem of intercomparison becomes more difficult. In many cases, there may be no best measurement technique for a given compound or set of compounds. If several different measurements of the same compound can be compared during a field campaign, a diverse set of techniques is probably even preferable, because if they all agree (using very different measurement schemes), the combined set of measurements as a unit becomes far less susceptible to interferences and, in some cases, calibration errors.
 There are a growing number of measurements that have very special inlet needs because of surface interactions and/or air speed and altitude dependent sampling. Intercomparison of measurements of such compounds and particles requires that the exact inlet, sample line, and instrument configuration used in a field campaign also be used for an intercomparison study. It also follows that changes and improvements of sampling technique may invalidate intercomparison results. There is therefore a need to intercompare every new measurement configuration used in a field campaign. Thus the goal of present and future intercomparisons may be less to identify the best measurement technique and more to evaluate biases between instruments and platforms on a campaign-by-campaign basis, initially correcting for these biases and in future campaigns minimizing and/or eliminating them. In some cases, this may require discontinuing the use of certain techniques that (1) are inconsistent with other measurements and model predictions and cannot be verified by independent means or (2) are too insensitive or slow to answer the questions they are meant to address. In most cases, however, it means intercomparing calibration standards, identifying interferences, finding inlet losses or enhancements, and determining the real world range of altitudes, speeds, temperatures, humidities, etc., over which measurements can consistently be made within some predetermined error limits.
 The present intercomparison is the first attempt within a GTE mission to address these broader issues. The intercomparison was completely informal with open data sharing throughout the study period. It involved three 0.5–1.5 hour comparison periods which were part of a larger field campaign. The advantage of this approach is that it compares instruments in the same configuration in which they are used during the mission and helps to identify instrument malfunctions during the field campaign. Comparisons can be made under a range of conditions typical of those encountered in the measurement campaign. There is no additional setup time or cost for installing instrumentation on the aircraft beyond that already required for the field campaign. The limitations of this approach are that only a short period of time is available to conduct the comparison portion of the study (this is a much larger problem for long integration time measurements). There is also some uncertainty as to whether the two aircraft are sampling the same air mass throughout the intercomparison period, and even when they are, the time that various features are sampled may vary by a few seconds from aircraft to aircraft. The advent of several 1-s chemical measurements and the large number of simultaneous measurements on the two aircraft, however, has greatly reduced, but certainly not eliminated, the latter concern for close aircraft proximity as will be discussed shortly. The present intercomparison was also not blind, but rather encouraged full sharing of data to address any concerns that arose. Considering that this was the first attempt at this type of intercomparison and very much a learning experience, it was very reasonable that the study was not blind. 
There is nothing, however, that prevents this same strategy from being applied to a blind study in the future by collecting data submitted in a blind fashion shortly after an intercomparison flight and then disclosing the data after their formal collection to all interested in its use for scientific planning and model intercomparisons.
 If measurements on two aircraft as different as the P-3B and the DC-8 can be successfully intercompared over a wide range of altitudes, then it would appear that intercomparison with most other research aircraft would also be possible in the future. A second successful set of intercomparisons was carried out between the NASA P-3B and the NCAR C-130 near the end of TRACE-P and the beginning of Aerosol Characterization Experiment-off the coast of Asia (ACE-Asia). These intercomparisons largely involved aerosol instrumentation and are discussed in detail by K. Moore et al. (A comparison of similar aerosol measurements made on the NASA P-3B, DC-8, and NSF C-130 Aircraft during TRACE-P and ACE-Asia, submitted to Journal of Geophysical Research, 2003) and Y. Ma et al. (Intercomparisons of airborne measurements of aerosol ionic chemical composition during TRACE-P and ACE-Asia, submitted to Journal of Geophysical Research, 2003). A very successful intercomparison of SO2 instruments on the same two aircraft is discussed by Thornton et al.
 The results of the present intercomparisons provide insight into how to combine measurements from the two aircraft involved in TRACE-P into a single merged data set. In some cases, measurements from the two aircraft are essentially indistinguishable, while in others there are distinct differences that need to be acknowledged and possibly even compensated for in model comparisons. In the noncompetitive spirit in which this intercomparison was performed and because there are still concerns about exactly how similar the sampled air masses were, persistent differences and trends in data were pointed out, along with detection limit problems and time resolution concerns, but individual techniques were not critically reviewed. The details of individual measurements are contained in the many papers in this special section.
 The purpose of this article is only to provide a summary of possible biases when combining data from various measurement techniques, provide additional information to evaluate the uncertainty or lack thereof that might be encountered in the use of these data, and present a brief analysis of pitfalls and benefits that can be derived from future mission-based intercomparison studies. It is also hoped that the present summary of results will encourage open exchange to foster a better understanding of discrepancies and not focus counterproductive emphasis on value judgments, particularly for this very informal and somewhat experimental comparison.
2. Intercomparison Details
 The DC-8 and P-3B flew in close proximity to each other on three occasions for the purpose of intercomparing similar measurements on both platforms during the TRACE-P mission. The intercomparisons varied in length from slightly shorter than a half hour to a little longer than an hour and a quarter, and over an altitude range of about 0.16 to 5.3 km. All three of these intercomparisons were conducted in fairly unpolluted air masses. The first was conducted in the boundary layer and during a climb up to 3 km from about 14°N latitude and 140°–143°E longitude on a transit flight from Guam to Hong Kong. The second was conducted at a fixed altitude of about 5.2 km off the coast of Japan at about 33°N latitude and from 137.5°–141°E longitude. The final and by far the longest intercomparison began with a fixed altitude flight at 5.3 km and then gradually descended into the boundary layer for a fixed altitude flight at 0.2 km. These flights covered a latitude range from 22.5°N–25°N and a longitude range from 152°W–146°W on the return transit flight out of Hawaii. Figure 1 shows the altitude of both aircraft throughout the three intercomparison periods. It also shows the approximate distance between the two aircraft, which may be uncertain by about 0.1 km. The intercomparison periods are largely defined by aircraft altitude, since the largest chemical differences would be expected to occur in the vertical direction. Also, differences in the horizontal direction along the flight path will show up as changes as a function of time with 0.1 km corresponding to less than a second time shift. Of course, the flight paths of the two aircraft will not necessarily intersect all of the same air masses, and even when they do, they will intersect the transition region between air parcels at a random angle. So, a time period equal to or even several times larger than that required to travel a distance equal to the aircraft spacing might be required to reach the same air mass. 
This is still, however, only on the order of seconds to a few tens of seconds in the extreme case, and most of the instruments being compared acquire data at a similar or slower rate. Most of the comparison data to be discussed will be from the 1-min merge files. While uncertainties in the vertical direction should be much smaller than in the horizontal direction, differences of tens of meters are still quite possible. Therefore no attempt will be made to adjust measurements based on altitude or position. As will be observed in the following section, most of the discrepancies that will be discussed correspond to differences over a significant portion of an intercomparison flight or, in many cases, the entire flight, and do not appear to be related to small differences in the time or altitude that an aircraft intercepted an air mass change. There is, however, one 2 or 3 min long exception in which both NO and OH varied in a consistent manner on each aircraft, but which differed between aircraft by a factor of 2–3. This was also a period in which observations on both aircraft showed a rapid change in NO (and only small changes in CO and O3), probably indicating that the aircraft were traversing a relatively recently emitted plume. Since these measurements were in the boundary layer with the aircraft a few kilometers apart, it is quite possible that a ship plume or other local pollution source might have influenced the two aircraft somewhat differently. Since the large NO discrepancies only persisted for about three 1 min data points, the data from 0123 to 0125 LT during the first intercomparison flight have been removed. These are the only data removed for the three intercomparison flights because they show the only obvious air mass difference. This is not to say that the remainder of the intercomparison was flown in identical and uniform air parcels. 
There were certainly lesser variations that were presumably encountered throughout the flights; however, as shown in section 3, such brief differences will probably not significantly change the comparison result. Also the agreement between several of the very rapid chemical measurements, particularly O3 and CO, adds greatly to the credibility of this two-aircraft intercomparison.
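The statement that a 0.1 km along-track offset corresponds to less than a second can be checked with simple arithmetic. The sketch below is only illustrative: the helper name `time_shift_seconds` is hypothetical, and the two airspeeds are assumed round numbers, not values taken from the TRACE-P flight records.

```python
def time_shift_seconds(separation_km, airspeed_m_s):
    """Time needed to fly a given along-track separation at a given airspeed."""
    return separation_km * 1000.0 / airspeed_m_s


# Assumed, illustrative true airspeeds in m/s (not from the mission data).
for speed in (120.0, 230.0):
    shift = time_shift_seconds(0.1, speed)
    print(f"0.1 km at {speed:.0f} m/s corresponds to {shift:.2f} s")

# A separation of about 1 km at the same assumed speeds gives a shift of
# several seconds, which is still short compared with the 1-min merge interval.
for speed in (120.0, 230.0):
    print(f"1.0 km at {speed:.0f} m/s corresponds to {time_shift_seconds(1.0, speed):.1f} s")
```

Under these assumed speeds the shift stays well below the 1-min averaging time of the merge files used for most of the comparisons.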
 A second type of intercomparison was also possible. A few compounds were measured on the DC-8 by two different instruments, including formaldehyde and several oxygenated hydrocarbons. The measurements of these compounds can thus be compared throughout the entire mission, and since they are all on the same platform, there are no air mass similarity issues. In the 1-min data set, there are a few points out of the whole data set (the points around 0445 LT on 21 March 2001) that have been removed because they are several times higher than all other measurements for several oxygenated hydrocarbons. If left in, these points would require rescaling figures, impose an unrealistic bias on fitting routines, and would, for several compounds, represent by far the largest single absolute data discrepancy. Also, the meaning of measurements during a single rapid plume crossing event by two instruments with very different cycle times, one as long as 180 s, is very limited.
 The above timing issue is not isolated to dramatic plume crossings, but is ubiquitous throughout the intercomparisons; it simply gets worse in large abrupt gradients. The major problems are differences in integration times. For example, PAN measurements on the P-3B with an integration time of 1–2 s measured once each 150 s are compared to 120-s integration time measurements on the DC-8. This could easily result in significant measurement differences purely due to timing (location) differences. In fact, PAN measurements compared quite well, but this is probably because the two aircraft comparisons were generally conducted in remote and relatively uniform air masses. This situation also existed for measurements involved in the DC-8 only comparisons, which were conducted throughout TRACE-P and included many rapid air mass changes. An example is the methanol data of H. Singh et al. (Oxygenated organic chemicals in the Pacific troposphere during TRACE-P, submitted to Journal of Geophysical Research, 2003, hereinafter referred to as Singh et al., submitted manuscript, 2003), hereinafter referred to as Singh data, which are reported as a 180-s integration typically every 7 min, while Apel et al. report data (hereinafter referred to as Apel data) with integration times from 6 to 100 s and times between samples typically 240–350 s or longer. In the latter case, discrepancies were much larger and it is not clear how much of the observed difference is due to measurement timing (location) and how much is due to instrumental measurement differences and response rates to transients. This concern will be discussed in more detail by Apel et al. There is also a concern that some instruments are not run on a consistent time base with other instruments. This added some additional uncertainty when comparing results but is an area where improvements can be made in future campaigns.
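The effect of mismatched integration times can be illustrated with a small synthetic example. The sketch below is not based on TRACE-P data: the 1 Hz series, the plume size, and the helper name `windowed_means` are all assumptions chosen to mimic a 1–2 s sample taken every 150 s alongside a 120-s integration.

```python
def windowed_means(series, start, period, width):
    """Average `width` consecutive 1-s samples, beginning a new window every `period` s."""
    out = []
    t = start
    while t + width <= len(series):
        window = series[t:t + width]
        out.append((t, sum(window) / width))
        t += period
    return out


# Synthetic 1 Hz mixing ratio (arbitrary units): a uniform background of 100
# with a 60-s plume of 300 starting at t = 140 s. Values are purely illustrative.
series = [100.0] * 600
for t in range(140, 200):
    series[t] = 300.0

fast = windowed_means(series, start=0, period=150, width=2)    # short sample every 150 s
slow = windowed_means(series, start=0, period=120, width=120)  # continuous 120-s integration

print("short-integration samples:", fast)
print("long-integration samples: ", slow)
```

In this construction the short sample at 150 s falls entirely inside the plume, while the overlapping 120-s window averages plume and background together, so the two "instruments" report different values for the same air even though both are perfectly accurate. In a uniform air mass the two sampling schemes agree, which is consistent with the good PAN agreement noted above.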
3. Two-Aircraft Intercomparison Results
 The results of this intercomparison can be divided into four categories, beginning with measurements of parameters that typically agreed with each other within a percent or two. These are typically measurements of long-lived and relatively abundant compounds or photolysis frequencies. A second group of measurements commonly agreed with each other at about the 10% level and generally within quoted error limits of the measurement. The third group of measurements also typically agreed with each other within the quoted error limits, but their uncertainties were sufficiently large that model comparisons using these measurements should acknowledge the differences observed on the two different aircraft platforms. This group contains only OH and HO2/RO2, compounds with very short atmospheric lifetimes and low concentrations. Finally, there were several measurements that did not agree within their quoted error limits and commonly disagreed by a factor of 2 or more.
 The measurements that fall into the very good agreement category include those of photolysis frequencies and O3, CO, CO2, and CH4 concentrations. All of these measurements do, however, have several elements in common. In each case, the values measured during the two-aircraft intercomparison periods were far above the detection limits of the instruments, the instrument and measurement techniques for a given parameter were essentially the same on both aircraft, and the same principal investigator was responsible for similar measurements on both aircraft. The latter two circumstances were not, in general, the case for measurements in the subsequent categories.
Figures 2 and 3 show comparisons between photolysis frequency measurements determined by actinic flux spectroradiometry [Shetter et al., 2003; Lefer et al., 2003] on the P-3B versus those on the DC-8 for j(O1D) and j(NO2), respectively. Agreement is typically well within the stated uncertainty (±10% and ±8%, respectively) of these measurements. Some of the minor scatter that is observed, for example, in jO3 around 10–15 × 10−6 and 45–50 × 10−6 and one point at about 70 × 10−6 in the O3 plot, appears to be caused by transients in jO3, probably due to clouds that would not necessarily be expected to be measured in a similar manner on both aircraft. These are particularly prevalent in the first intercomparison and during the last half of the third intercomparison. Unlike the chemical measurements that will be discussed throughout the remainder of this paper, observed similarities in local chemical fluctuations provide little insight into variations in j values caused by more distant clouds. Fortunately, the observed agreement is quite good despite fluctuations in j, and the primary lesson from Figures 2 and 3 is that it is preferable not to intercompare measurements in areas of broken cloudiness above or below the aircraft. Along with the scatterplot of data, two other lines are included. The dashed line is a bivariant fit to the data given in the TRACE-P data archive (which can be found at http://www-gte.larc.nasa.gov/trace/TP_dat.htm) which is then weighted using the relative uncertainties given in the TRACE-P data table [Kleb and Scott, 2003a, 2003b]. The solid line uses the same weighting but is forced through the origin. Use of the latter line provides important insight because essentially all TRACE-P measurements have some means of obtaining a zero measurement value, and thus the data would be expected to converge to a line through the origin. This line is particularly useful for determining the slope of data that are taken over a small dynamic range far from the origin. 
More caution should typically be exercised in using the slope of the dashed line, particularly in figures where the origin is not even shown. When data extend over a relatively wide dynamic range, however, significant deviation in slope between the two lines and a large intercept may indicate potential measurement nonlinearity, an interference, or an unrealistic background measurement. In Figures 2 and 3, both lines have a slope so close to 1.0 compared to the stated uncertainty that they provide little additional insight. These two photolysis frequencies are shown as a sample of a much larger number of derived j values, also with accuracies in the 8–10% range. Agreement is at a similar level for these other photolysis frequencies and thus an order of magnitude more figures would add little additional information.
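The difference between an unconstrained fit and a fit forced through the origin can be illustrated with a short sketch. The exact fitting routine used for the figures is not specified here, so the code below substitutes Deming regression (a standard errors-in-both-variables fit) with a single assumed error-variance ratio `delta`; the function names and the sample values are hypothetical and chosen only to resemble data over a narrow dynamic range far from zero.

```python
import math


def deming_slope(sxx, syy, sxy, delta):
    """Closed-form Deming slope from second moments.

    delta is the assumed ratio var(y error) / var(x error); sxy must be nonzero.
    """
    diff = syy - delta * sxx
    return (diff + math.sqrt(diff * diff + 4.0 * delta * sxy * sxy)) / (2.0 * sxy)


def fit_unconstrained(x, y, delta=1.0):
    """Deming fit with a free intercept; returns (slope, intercept)."""
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n
    sxx = sum((a - xm) ** 2 for a in x)
    syy = sum((b - ym) ** 2 for b in y)
    sxy = sum((a - xm) * (b - ym) for a, b in zip(x, y))
    slope = deming_slope(sxx, syy, sxy, delta)
    return slope, ym - slope * xm


def fit_through_origin(x, y, delta=1.0):
    """Deming fit forced through the origin (uses uncentered moments); returns slope."""
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    sxy = sum(a * b for a, b in zip(x, y))
    return deming_slope(sxx, syy, sxy, delta)


# Hypothetical paired values over a narrow range far from zero.
x = [370.0, 371.0, 372.0, 373.0, 374.0]
y = [370.3, 370.8, 372.4, 372.9, 374.1]

slope_free, intercept = fit_unconstrained(x, y)
slope_origin = fit_through_origin(x, y)
print(f"unconstrained: slope={slope_free:.4f}, intercept={intercept:.2f}")
print(f"through origin: slope={slope_origin:.5f}")
```

With data clustered far from the origin, small scatter can move the unconstrained slope and intercept noticeably, while the origin-constrained slope is dominated by the ratio of the mean values. This is the behavior the text describes when it recommends caution with the dashed-line slope for narrow dynamic ranges.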
Figures 4 and 5 show measurements of CH4 and CO2 on the P-3B versus the DC-8 by Sachse et al. , Bartlett et al.  and Vay et al. , respectively. Both were measured using infrared diode laser absorption and in both cases the agreement is extremely good with slopes equal to 1 to within better than 0.1% (stated uncertainties are ±1% and ±0.25 ppmv for CH4 and CO2, respectively). Again the dashed line represents an unconstrained bivariant fit which is included in Figures 4 and 5 only for consistency. It has much less meaning for data over such a narrow dynamic range and so far from zero. Figures 4 and 5 show the best fit of all the data intercompared and also include the two most abundant (ppmv range) and longest-lived species that are intercompared. Extremely good comparisons were also obtained for O3 measured by Avery et al.  using chemiluminescence and CO measured by differential infrared absorption by the Sachse group [Sachse et al., 1987, 1991; C. Mari et al., The effect of clean warm conveyor belts on the export of pollution from East Asia, submitted to Journal of Geophysical Research, 2003] which are shown in Figures 6 and 7, respectively. Both had slopes within 1% of the 1:1 line and essentially all scatter was within the ±5% and ±2% accuracy quoted for O3 and CO, respectively. The bivariant fit slope of 1.096 may, in part be explained by the P-3B and DC-8 sampling slightly different air masses toward the end of that intercomparison period. This possibility is discussed below. These compounds have a shorter atmospheric lifetime than CH4 and CO2, on the order of days to months rather than years, and variations of a factor of 2 in concentration can commonly be observed in adjacent air masses, typically due to nonuniform mixing downwind of enhanced source regions. 
Both of these measurements also provide data at 1 Hz which, because of their high precision and the degree of agreement, can be used to better understand the relation between the air masses in which each aircraft was flying. Figure 8 shows a plot of O3 measured on both aircraft as a function of time for a period of time starting just before the two aircraft came into close proximity and extending throughout the third intercomparison period. Note that as the two aircraft approach each other the air masses in which they are flying have fairly different O3 concentrations and that once together (∼1804 LT) they both observe very similar O3 and very similar structure in O3 concentrations as they descend through several very different layers. An equally impressive demonstration of rapid time response and high precision is observed using the measurement of CO in Figure 9. Figures 6–9 provide evidence that both aircraft are sampling from a fairly similar part of the same air mass. However, notice that after ∼1850 LT the O3 measurements on the two aircraft diverge slightly (∼1–2 ppb). Interestingly, over this same time period, a similar divergence in the CO values (∼2–4 ppb) can be seen between the aircraft. These slightly different air masses (∼2–4 ppb CO difference) affect the CO values around 150 ppb at the high end of the regression of Figure 7, resulting in a larger bivariant fit slope. Decreasing the P-3B values by ∼2–4 ppb with respect to the DC-8 values places this cluster of CO values nearer the 1:1 slope. On the other hand, in the case of the O3 regression, this period of the 1–2 ppb difference in O3 measurements occurs approximately midrange in the O3 values (∼55 ppb), resulting in a much smaller impact on the bivariant fit slope. Figure 10 shows a 200-s period of time encompassing the largest peaks in both Figures 8 and 9 (this is the peak which goes off scale in Figure 8). 
Note that not only do the structures look very similar for the same compound, but a time lag on the order of a few seconds for the DC-8 can be observed in both O3 and CO measurements. The inclusion of these similar, rapid, and highly precise measurements on both aircraft adds greatly to our confidence that two-aircraft intercomparisons can be made highly credible particularly for longer lifetime compounds while still requiring only a small amount of additional mission resources. One brief exception discussed in the previous section in which NO and the associated OH varied in coincidence with each other but differently between the two aircraft despite similar O3 and CO values still suggests some degree of caution. As more measurements are added and the speed of existing measurements increases (particularly for relatively short-lived compounds), chemical differences such as that noted above, which should, in general, be expected to occur a small fraction of the time unless the measurements are completely collocated, will be more easily identified.
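A time lag of a few seconds between two 1 Hz records, such as the one noted for O3 and CO, can be estimated by finding the shift that maximizes their correlation. The sketch below is a minimal illustration on synthetic data: the function names, the triangular peak, and the imposed 3-s delay are all assumptions, not values derived from the TRACE-P records.

```python
def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    um, vm = sum(u) / n, sum(v) / n
    cov = sum((a - um) * (b - vm) for a, b in zip(u, v))
    var_u = sum((a - um) ** 2 for a in u)
    var_v = sum((b - vm) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


def best_lag(ref, other, max_lag):
    """Return (lag, r): the number of samples by which `other` trails `ref`
    that gives the highest correlation over the overlapping portion."""
    best = (None, -2.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            u, v = ref[:len(ref) - lag], other[lag:]
        else:
            u, v = ref[-lag:], other[:len(other) + lag]
        r = pearson(u, v)
        if r > best[1]:
            best = (lag, r)
    return best


# Synthetic 1 Hz record: flat background of 50 with a triangular peak near t = 100 s.
ref = [50.0] * 200
for t in range(90, 111):
    ref[t] = 50.0 + 30.0 * (1.0 - abs(t - 100) / 10.0)

# Second record sees the same structure 3 s later (assumed delay).
delay = 3
other = [50.0] * delay + ref[:len(ref) - delay]

lag, r = best_lag(ref, other, max_lag=10)
print(f"estimated lag: {lag} s")
```

Applying the same procedure to two independent tracers (for example O3 and CO) and obtaining a similar lag would support the interpretation that the offset reflects when each aircraft crossed the feature rather than an instrument response difference.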
 All of the measurements shown in Figures 2–7 introduce such a small potential error into model comparisons between aircraft compared to the uncertainties introduced by other measured quantities that any minor differences can probably be ignored in most cases. It is not clear that this is the case for the next set of measurements.
Figure 11 shows a comparison of NO measurements made on the P-3B by chemiluminescence by Kondo et al.  versus those on the DC-8 by two-photon laser induced fluorescence by D. Tan et al. (On the NOx budget in Asian outflow: Results from TRACE-P, submitted to Journal of Geophysical Research, 2003, hereinafter referred to as Tan et al., submitted manuscript, 2003), Bradshaw et al. , and Crawford et al. . The slope of the bivariant fit through the origin is 0.902 with a similar slope for the unconstrained fit and a near-zero intercept. Thus there appears to be a systematic difference with the DC-8 instrument measuring about 10% higher on average but still with a significant amount of scatter. The dark dotted lines provide an approximate upper and lower bound for expected data scatter for an average slope of 1 (the 1:1 line ± the square root of the sum of the squared errors for both instruments plus the detection limits) using the errors given in the TRACE-P data table [Kleb and Scott, 2003a, 2003b]. Using the error limits given in this data table as 2σ error limits (in the case of NO they are ±16/20% for P-3B/DC-8), few (about 5%) of the points should be expected to fall outside of this set of lines. For NO this number is zero, suggesting that one or both of the error limits may be somewhat overestimated. Similar upper and lower limits of expected scatter are also shown for all of the correlation plots that follow. To better understand measurement differences, Figure 12 shows a plot of the two NO measurements as a function of time during the three intercomparison periods. While some of the larger discrepancies are associated with rapid changes in NO, similar discrepancies occur during periods of slow NO changes. Measurement differences are also not consistent: for example, in the first flight the DC-8 measurements sometimes are high and other times low. In the second flight, the P-3B values are consistently higher, and in the third flight, consistently lower. 
This suggests some type of a shift in calibration or sensitivity between flights, which cannot easily be compensated for when comparing model results from the two aircraft. While an average slope of 0.9 suggests fairly good agreement and potential 10% effects on modeling, trying to compare relative NO concentration between the second and third set of flights can lead to discrepancies of a factor of 1.5, with far more significant effects on model interpretation.
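The expected-scatter bounds described for the NO comparison can be sketched in a few lines. The wording of the definition leaves some room for interpretation, so the code below assumes that the fractional errors of the two instruments are combined in quadrature and scaled by the measured value, and that the two detection limits are then added linearly. The function names, the optional `slope` argument, and the sample points are hypothetical; only the ±16% and ±20% error limits come from the text.

```python
import math


def envelope(x, frac_err_a, frac_err_b, dl_a=0.0, dl_b=0.0, slope=1.0):
    """Lower and upper expected-scatter bounds about the line y = slope * x.

    Assumed form: x * sqrt(err_a^2 + err_b^2) + dl_a + dl_b on either side.
    """
    half_width = x * math.sqrt(frac_err_a ** 2 + frac_err_b ** 2) + dl_a + dl_b
    return slope * x - half_width, slope * x + half_width


def fraction_outside(xs, ys, **kwargs):
    """Fraction of (x, y) pairs falling outside the envelope."""
    outside = 0
    for x, y in zip(xs, ys):
        lower, upper = envelope(x, **kwargs)
        if y < lower or y > upper:
            outside += 1
    return outside / len(xs)


# Stated 2-sigma error limits for NO: 16% (P-3B) and 20% (DC-8).
lower, upper = envelope(100.0, frac_err_a=0.16, frac_err_b=0.20)
print(f"bounds at 100 (arbitrary units): {lower:.1f} to {upper:.1f}")

# Hypothetical paired measurements, with one deliberately discrepant point.
xs = [10.0, 20.0, 30.0, 40.0]
ys = [10.0, 20.0, 45.0, 40.0]
print("fraction outside:", fraction_outside(xs, ys, frac_err_a=0.16, frac_err_b=0.20))
```

If the stated errors are true 2σ limits, roughly 5% of points would be expected outside these bounds; a much smaller fraction hints that the stated errors are conservative, and a much larger fraction hints that they are underestimates or that the slope differs from 1. Passing a different `slope` recenters the bounds on a fitted line instead of the 1:1 line.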
Figure 13 shows a comparison of PAN data measured on the P-3B by gas chromatograph/electron capture detection by Flocke et al.  versus those from the DC-8 also measured by gas chromatograph/electron capture detection by Singh . The constrained slope is 1.13, which suggests average agreement within the stated error (−10 + 5/20% for P-3B/DC-8). The two dotted lines show that individual measurements are outside of the expected error limits about 40% of the time, but only by a small amount. This is largely because the slope is not quite equal to 1, the value around which the error lines are centered. The small amount of scatter around the average slope is actually quite impressive, since the P-3B measurements have a sample integration time of 1–2 s compared to 120 s on the DC-8. Figure 14 shows P-3B and DC-8 PAN measurements as a function of time for the second and third comparison flights (PAN was at the detection limit during the first intercomparison flight). Here there appear to be no surprises; the slope of about 1.13 describes well the average agreement with no large deviations. The P-3B values are nearly always either above or only slightly below the DC-8 values, and if the dotted error limit bars (similar to those discussed for NO) were centered around the slope of 1.13, essentially all data would fall within the area they bracket. Whatever the cause (calibration, interferences, or sampling losses), the average differences between measurements appear to be consistent in time and relatively small, and should be much easier to deal with in model comparisons between aircraft.
Figure 15 shows HNO3 data measured on the P-3B by selected ion chemical ionization mass spectrometry by Zondlo et al.  versus those measured on the DC-8 using a mist chamber/ion chromatograph by Talbot et al. . Note that while agreement is fairly good and the average slope is close to 1, the scatter and stated errors (±20–30/15–30% for P-3B/DC-8) are quite large. Again, about 30% of the points fall outside of the dotted error limits (the error lines in this case are not quite straight because the error limits change with concentration), suggesting that one or both of the stated errors are underestimates. Figure 16 shows both the P-3B and DC-8 measurements plotted as a function of time. Note that while measurements appear to track each other, there appears to be somewhat of a bias for higher values being measured by the DC-8 instrument during the second intercomparison period (3/24), while during the third intercomparison (4/9) the higher values were measured far more frequently on the P-3B (no P-3B data for the first intercomparison). The large difference between the unconstrained bivariant fit and the fit forced through the origin in Figure 15 largely arises from this relative difference between flights, combined with essentially all of the low concentration measurements being made in the last intercomparison period. As in the case of NO, calibrations, interferences, or losses appear to vary during the mission, resulting in greater difficulty in comparing model results obtained independently for each aircraft. A more detailed discussion of the nitric acid intercomparison is given by Zondlo et al. .
 Measurements of H2O on the two aircraft fall into a somewhat unique category because H2O is measured by the project using frosted mirror/dew point instruments on both aircraft (measurements are a combination of data from General Eastern 1011B hygrometers and cryo-cooled hygrometers), but on the DC-8 it is also measured by IR diode laser absorption (Diode Laser Hygrometer) by the Sachse group [Diskin et al., 2002; Podolske et al., 2003]. The latter reference also discusses a separate water intercomparison using several additional measurement techniques. Figure 17 shows a log/log plot of frosted mirror (FM) devices on both aircraft versus the diode laser hygrometer (DLH) instrument. While the diode laser instrument is not by definition correct, it does represent a much newer and faster measurement technology, has an accuracy of ±10%, and provides a good means for comparison, with its limit of detection well below 1 ppmv. The dashed lines are 10% above and below the 1:1 line, showing that for H2O concentrations below a few hundred ppmv there are significant discrepancies which get as large as a factor of 2 at the lowest water concentrations. Even at high concentrations, there are some disturbing discrepancies. A time profile is not shown for H2O but can be described in a fairly straightforward manner: higher altitudes typically mean lower water concentration and a larger percentage discrepancy. Much progress is being made in the H2O measurement area, but for the present measurements, caution should be used when combining model results from the two aircraft, particularly at the lowest H2O concentrations.
 The next group of measurements, which consists only of OH and HO2/RO2, shows significant discrepancies, but both also have relatively large error bars. Figure 18 shows the OH concentration measured on the P-3B by chemical conversion/selected ion chemical ionization mass spectrometry by Mauldin et al. versus that measured on the DC-8 by laser-induced fluorescence by Tan et al. The relatively large scatter is consistent with the larger stated uncertainties. The constrained bivariant fit has a slope of 1.50, which is also quite large but well within the combined (±60/40% for P-3B/DC-8) error limits of the two measurements. About 10% of the measured values fall outside of the dotted error lines. These are all on one side, as shown in Figure 18, with no points even close to the other error line. This number is larger than, but still somewhat consistent with, the expected scatter, except that it is all biased to one side of the error range because the average slope is 50% above 1. The dark dot-dashed lines in Figure 18 show error limits that are centered around a slope of 1.50, but the lines represent error limits that are only 60% as large as the dotted lines centered around a slope of 1. Note that only about 5% of the data points fall outside of these lines. This suggests that the scatter of the data from both instruments is probably better than that suggested by the stated error limits, which is consistent with the precision being better than the absolute accuracy. It also suggests that there is a calibration problem associated with these measurements. This is not at all surprising, since OH measurements are inherently difficult to calibrate, due to the lack of stable standards and the rapid reactions of OH on surfaces. In fact, the accuracy of OH measurements is largely determined by the uncertainties associated with absolute instrument calibration. 
It is also interesting to note that despite common concerns about the sensitivity of OH measurements, Figure 18 shows that the data around 1 × 10⁶ molecules cm⁻³ fall between either set of error lines at least as well as do data at higher concentrations. Figure 19 shows P-3B and DC-8 OH concentration as a function of time for the three intercomparison periods. Also shown is a solid line which is proportional to the product of jO3, O3, and H2O (a measure of OH production) and the average NO concentrations on both aircraft (a measure of the rate of HO2 to OH conversion). Both of the OH instruments appear to track some of the larger changes in production and NO (such as the largest NO peak in the first leg and production increase in the last leg), but both also show some inconsistencies. The overall discrepancies are reasonably consistent in time, with the P-3B measurements either higher than or equal to those of the DC-8, except for a brief time at the beginning of the second intercomparison period. Many of the largest discrepancies appear to have occurred during flights in the boundary layer, such as in the first half of the first comparison period and the last few points in the last comparison period. These were also periods in which the largest differences and changes occurred in jO3 values, though the jO3 changes alone were far too small to explain these differences. Mechanistically, no explanation can be provided for why OH should vary significantly with small actinic flux changes, and what is observed may be purely measurement scatter; however, it may be desirable to carry out future OH intercomparisons in relatively cloud-free areas if possible, at least until such discrepancies can be better understood.
 The relatively large discrepancies between the two measurements require that caution be exercised when comparing model results from the two aircraft. The percent discrepancies shown in Figure 19 have no clear altitude, time, or concentration dependence, and thus Figure 18 provides a reasonably complete review of expected differences.
 Measurements of HO2/RO2 cannot be compared directly because HO2 plus RO2 was measured on the P-3B while HO2 was being measured on the DC-8. While modeled HO2/RO2 ratios could be used to interpret this comparison, it was felt that, because this article is intended in part to provide insight into experimental results for use in models, it should not itself be biased by model expectations. Thus only a direct comparison of these two different but closely related quantities is shown. It should be pointed out, however, that models suggest that the [HO2]/[HO2 + RO2] ratio can vary by up to about a factor of 2.6 centered around a ratio of about 0.56 during these comparison flights and that a comparison of these peroxy radical data using modeled ratios is given by Cantrell et al. Figure 20 is a plot of the HO2 + RO2 measured on the P-3B using selective chemical conversion followed by selected ion chemical ionization mass spectrometry by Cantrell et al. (this issue) versus HO2 measured on the DC-8 by laser-induced fluorescence by Tan et al. A slope of 2.46 is seen to fit the average data, but it should be remembered that the value of this slope is a relative number that depends on the ratio of RO2 to HO2, which presumably is not even a constant throughout the intercomparison period. Only the most optimistic error lines, which are centered symmetrically around the slope of 2.46, are shown. This results in only a few percent of the points falling outside these lines, which is consistent with the stated uncertainties (±35/40% for P-3B/DC-8). Figure 21 shows both the P-3B HO2 plus RO2 measurements and the DC-8 HO2 measurements plotted as a function of time during the first and last intercomparison periods (no HO2 plus RO2 data were taken during the second intercomparison) along with altitude, which is shown by the solid line. 
The P-3B HO2 + RO2 is significantly higher than the DC-8 HO2 throughout the first comparison and the first half of the third comparison and then became approximately equal to the DC-8 HO2 for the last half of the comparison. Since both the beginning of the first comparison and the end of the last comparison were in the boundary layer, there appears to be no simple altitude trend. The relatively large stated uncertainties associated with these data and the associated scatter shown in Figure 20 combined with the fact that the amount of HO2 in the HO2 + RO2 measurement is unknown make this comparison particularly difficult. Model comparisons can be made directly to either HO2 or HO2 + RO2; thus there is no inherent problem associated with measurement/model comparison. Caution should again be exercised in comparing results from the two different aircraft.
 The final group consists of three P-3B and DC-8 measurements that were compared, including NO2, SO2, and PPN. These all had slopes that differed from 1 by a factor of 2.5 to 3.5, with most of the data for SO2 and PPN outside of the expected error range. Figure 22 shows the NO2 concentration measured on the P-3B by UV photolysis/chemiluminescence by Kondo et al. and K. Nakamura et al. (Measurement of NO2 by photolysis conversion technique during TRACE-P, submitted to Journal of Geophysical Research, 2003, hereinafter referred to as Nakamura et al., submitted manuscript, 2003) versus the DC-8 value measured by laser photolysis/two-photon laser-induced fluorescence by Tan et al. (submitted manuscript, 2003), Bradshaw et al., and Crawford et al. While most of the NO2 points do fall within the large error limits (±36–72/40% for P-3B/DC-8) shown in Figure 22, the slope of the unconstrained bivariant fit is approximately zero which, combined with an R² ∼ 0, suggests no correlation. It should also be noted that most of the data in Figure 22 are below the stated detection limit for the P-3B instrument (13 pptv) and thus should not be intercompared, except that DC-8 data from the same time period suggest that the NO2 was 2–5 times the P-3B detection limit. Also, the comparison was dramatically worse during the second intercomparison period than in the first (no comparison data for the third), suggesting a high degree of inconsistency in either measurement or calibration on the part of one or both instruments. Comparing NO2 results of the first comparison period to those of NO shows reasonable agreement for both NO2 measurements, while in the second the Kondo et al. NO2 measurements appear to better track the gradual NO decline with time. Clearly, more work is needed to resolve large differences in the measurement of this important compound. 
Some type of adjustment needs to be made to NO2 model values when comparing results from the two aircraft, and much could be learned about detection limits versus calibration problems if future comparisons were to contain NO2 concentrations well above the NO2 detection limit.
Figure 23 shows the SO2 concentrations measured on the P-3B by atmospheric pressure ionization mass spectrometry by Thornton et al. versus those measured on the DC-8 using a mist chamber/ion chromatograph by Talbot et al. The P-3B values are higher than the DC-8 values by up to an order of magnitude, except for one brief period in the middle of the last intercomparison period. These discrepancies are far beyond the error limits (±2–3/20% for P-3B/DC-8) or detection limits of either instrument and need to be investigated further.
Figure 24 shows PPN measurements made on the P-3B by gas chromatography/electron capture detection by Flocke et al.  versus the DC-8 measurements made by gas chromatography/electron capture detection by Singh . Again, agreement is poor and nearly all points are well beyond the error lines (−10+5/30% for P-3B/DC-8). It should also be noted, however, that nearly all measurements are within a factor of 3 of the detection limits (5/1 pptv for P-3B/DC-8) for both instruments. Since PPN is measured by the same instruments used to measure PAN, but is observed to be at so much lower concentration, the influence of this discrepancy on overall model predictions is probably small. While efforts should be made to better understand and remove these discrepancies, at least an equal amount of effort needs to go into measuring and intercomparing measurements for other PAN-like compounds, for which there are even less data.
4. DC-8 Intercomparison Results
 The intercomparisons discussed in this section extended throughout the TRACE-P mission and included measurements of quite clean and also highly polluted air masses. Thus the number of comparison points and also the range of these measurements tend to be much larger than those in the previous section. Figure 25a shows formaldehyde concentrations measured using an enzyme derivatization/fluorometer by Heikes et al. versus those measured using a tunable diode laser absorption spectrometer by Fried et al. Figure 25a shows the whole data set, while Figure 25b shows just formaldehyde values below 600 pptv so that the majority of the data can be seen more clearly. From Figures 25a and 25b, it is clear that many of the data points fall outside of the ±21% error bars centered around a slope of 1 (uncertainties are ±15/12–15% for Heikes/Fried). If similar error bars are centered around the average slope of 1.49 (dotted/dashed lines), which is strongly driven by the highest observed concentrations, more of the very highest concentrations fall within the bracketed region, but little improvement is observed for data in the 2–3 ppbv range and below. An additional concern is also seen in Figure 25b. There are a large number of points that are at the stated detection limit for the Heikes instrument (plotted at 50 pptv) and the Fried instrument below 58–80 pptv (Fried data are best estimates plotted both above and below zero in this range), which have a companion measurement by the other instrument which is far above the detection limit. This can be seen in Figure 26, which allows discrepancies to be more clearly seen near the detection limit. Figure 26 plots all the CH2O comparison data acquired by the two instruments on the DC-8 as differences (Heikes data minus Fried data) versus the average of the two. 
The total combined uncertainty limits (2σ) are shown by the solid black lines, and these were calculated from the quadrature addition of the total uncertainties from the two instruments (Heikes data, 15% of concentration + 50 pptv; Fried data, [(LOD)² + (systematic term)²]^(1/2), where LOD is limit of detection). Figure 26 shows three different regions. The first region indicates that 61% of the comparison points yield differences within the combined uncertainty limits. The upper region contains 26% of the measurements and the lower region has 13%. These results suggest several areas that need to be addressed. There appears to be an overall inlet/instrument calibration problem at high concentrations, but at low concentrations correlated data scatter is larger so that nearly half of the data points shown in Figures 25a and 25b fall outside of the expected error limits. This suggests a measurement problem well beyond the stated error limits and/or detection limits for one or both instruments. In an attempt to provide some additional insight into how the concentrations of formaldehyde and other compounds in this section changed in various types of air masses, CO will be plotted as a function of time along with the other measurements made in this section.
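The quadrature combination and the three regions of the difference plot can be sketched as follows. This is only an illustration under stated assumptions: the Heikes term is applied to the Heikes measurement itself, the limit of detection and systematic term for the Fried instrument are placeholder values, and the data pairs are invented. The first line of the example also shows that two 15% uncertainties combine to about 21%, which appears consistent with the ±21% error bars quoted for Figures 25a and 25b.

```python
import math


def heikes_uncertainty(conc_pptv):
    """Stated Heikes total uncertainty: 15% of concentration plus 50 pptv."""
    return 0.15 * conc_pptv + 50.0


def fried_uncertainty(lod_pptv, systematic_pptv):
    """Stated Fried form: quadrature sum of the LOD and a systematic term."""
    return math.hypot(lod_pptv, systematic_pptv)


def classify_difference(heikes, fried, lod_pptv, systematic_pptv):
    """Place (Heikes - Fried) relative to the combined uncertainty limits."""
    limit = math.hypot(heikes_uncertainty(heikes),
                       fried_uncertainty(lod_pptv, systematic_pptv))
    diff = heikes - fried
    if diff > limit:
        return "upper"
    if diff < -limit:
        return "lower"
    return "within"


# Two 15% uncertainties combined in quadrature give about 21%.
combined_percent = math.hypot(15.0, 15.0)

# Hypothetical (Heikes, Fried) pairs in pptv; the LOD of 70 pptv and the
# 12% systematic term are placeholders, not values from either instrument.
pairs = [(400.0, 380.0), (900.0, 300.0), (50.0, 450.0)]
labels = [classify_difference(h, f, lod_pptv=70.0, systematic_pptv=0.12 * f)
          for h, f in pairs]
```

Counting the labels over a full data set would give the fractions of points in the within, upper, and lower regions.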
 The compounds being compared in this section (formaldehyde, acetone, methylethylketone, acetaldehyde, and methanol) are all products of some type of hydrocarbon oxidation. In some cases they may also have direct emission sources, but even most of these would be expected to be associated with urban/industrial or biomass burning plumes, which typically also contain elevated CO (though a weak oceanic source may also exist for some of these compounds). CO has a relatively long atmospheric lifetime with much of its decline in concentration with time in plumes due to dilution rather than destruction. The atmospheric lifetimes of the compounds being compared in this section are varied but are generally much shorter than that of CO, with chemical production and loss mechanisms also being quite different. Thus a very high degree of correlation with CO is not expected. On the other hand, as a tracer for plumes containing a large amount of reactive carbon, some significant degree of correlation with CO would be expected and a total lack of correlation would seem difficult to explain. The following correlation plots are not being presented as a quantitative test of instrument performance but rather as a fairly general qualitative means of assessing measurement discrepancies for cases where correlation plots for the same compounds differ significantly. When looking at the correlation plots for the next five compounds, it should be noted that where agreement between similar measurements improves this seems to be reflected in a better correlation with CO for these measurements. This appears to be the case both for comparing one compound to another or, in some cases, the agreement and correlation with CO when comparing low and high concentrations of a single compound. For example, higher formaldehyde appears to correlate better both between instruments and with CO for both instruments. 
Also, acetone and methylethylketone appear to show better instrument to instrument agreement, and the Apel/Riemer data [Apel et al., 2003] appear to show better correlation with CO for these two compounds than for the other two compounds. The Singh data show reasonable CO correlation for all compounds.
Figures 27a and 27b show two correlation plots of formaldehyde with CO. Figure 27a shows the data from Heikes et al. , and Figure 27b shows the data from Fried et al. . The correlation appears to get better at higher concentrations for both data sets and is probably somewhat less scattered for the Fried data, particularly when CO is below 200–300 ppbv. The latter becomes more obvious if the correlation data are expanded so that all points are observable in the areas that are at present saturated with points in Figures 27a and 27b. Correlations are still not good, however, and as suggested by direct correlation between similar measurements, low concentrations again pose the greatest problem. This is a major concern because much of the mission data are either below or within a factor of 2 or 3 of the limits of detection of these instruments. Therefore additional sensitivity is badly needed for measurements in relatively clean air masses. There are also some discrepancies that need to be addressed throughout the concentration range, but many of these may be difficult because they are sporadic in nature. There are no multiaircraft model comparison issues that need to be addressed for any of the measurements in this section. From a practical modeling standpoint, it should be noted that the Fried formaldehyde data coverage is about 53% (34% above LOD), while the Heikes data are available for about 26% (16% above LOD) of the time.
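The qualitative statement that correlation with CO improves at higher concentrations could be checked by computing a correlation coefficient separately below and above a CO threshold. The sketch below is one way to do this, assuming a Pearson coefficient and an arbitrary split value; the function names, the split at 250 ppbv, and the data are all hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_by_co_range(co, species, co_split):
    """Return (r below co_split, r at or above co_split)."""
    low = [(c, s) for c, s in zip(co, species) if c < co_split]
    high = [(c, s) for c, s in zip(co, species) if c >= co_split]
    return pearson_r(*zip(*low)), pearson_r(*zip(*high))


# Hypothetical CO (ppbv) and formaldehyde (pptv) values.
co = [90.0, 110.0, 130.0, 150.0, 170.0, 300.0, 400.0, 500.0, 600.0, 700.0]
ch2o = [200.0, 120.0, 260.0, 150.0, 230.0, 900.0, 1400.0, 1800.0, 2500.0, 2900.0]

r_low, r_high = r_by_co_range(co, ch2o, co_split=250.0)
```

In this invented example the coefficient is small in clean air and close to 1 in the high-CO range, mirroring the behavior described for Figures 27a and 27b.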
 The four compounds in the final set were all measured independently by gas chromatograph/mass spectrometry by Apel et al. and by gas chromatograph/photoionization detector in series with a reduction gas detector by Singh et al. [1995, 2000, 2001, submitted manuscript, 2003]. These data are particularly difficult to compare because the Apel/Riemer data are typically measured with 10- to 60-s integration times for low and high altitudes, respectively, while the Singh measurements are 180-s integrations. This is discussed in more detail by Apel et al. Figures 28–31 show the data of Apel versus Singh for the compounds acetone, methylethylketone, acetaldehyde, and methanol, respectively. The slope of 1.57 shown for acetone is somewhat beyond that expected from the combined uncertainties (±15/20% for Apel/Singh) but is not surprising considering the integration time differences. A contributor to this difference is the disparity observed in the calibration standards analyzed after the study. The Apel/Riemer standards yielded values that would result in 11.8%, 19.4%, and 28.1% higher values for acetone, acetaldehyde, and methanol, respectively, than if the Singh standard values were used (no methylethylketone standard comparison could be made).
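The size of the standards contribution for acetone can be estimated with a short calculation. The sketch assumes that the difference in standards acts as a pure multiplicative factor on the Apel/Riemer values and that the two stated uncertainties combine in quadrature; both are simplifying assumptions, and the function name is hypothetical.

```python
import math


def slope_on_common_standard(slope, percent_higher):
    """Rescale a y-versus-x slope when the y calibration reads percent_higher % high."""
    return slope / (1.0 + percent_higher / 100.0)


acetone_slope = 1.57      # Apel versus Singh slope quoted in the text
standard_offset = 11.8    # percent, Apel/Riemer relative to Singh standards

# Quadrature sum of the stated 15% and 20% uncertainties (an assumption).
combined_percent = math.hypot(15.0, 20.0)

adjusted = slope_on_common_standard(acetone_slope, standard_offset)
still_outside = adjusted > 1.0 + combined_percent / 100.0
```

Under these assumptions the slope drops from 1.57 to about 1.40, which would still lie beyond a 25% combined uncertainty, so the standards difference would account for only part of the discrepancy.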
 For methylethylketone, acetaldehyde, and methanol the agreement cannot be said to be good, in that the slopes of the bivariant fits that are forced through the origin range from about 1.56 to 2.05, and the slopes of the unconstrained fits are even larger, though a fraction of this is presumably due to differences in standards. These slopes are still, however, much larger than those expected from the combined uncertainties of both instruments (±15/20% for methylethylketone, ±15/25% for acetaldehyde, and ±20/25% for methanol for Apel/Singh). Also, most of the data fall outside of the dotted error lines, and even if these lines are centered around the average slopes, no dramatic improvement is observed. In some cases, however, such as for acetone and methylethylketone, there does appear to be some reasonable correlation. Unlike formaldehyde, though, agreement does not seem to improve very much with concentration. Thus lack of sensitivity does not appear to be a significant contributor to the observed discrepancies. Also, a constant calibration error does not, in general, appear to be the major problem. Rather, discrepancies are highly variable over the entire measured range, with all four of the Apel/Riemer measurements higher than those of Singh most of the time. This could result either from an interference that could sporadically enhance signal, or a variable sampling loss, which could either reduce the concentration of the compound being measured or possibly delay the instrument response time. A postmission instrument evaluation [Apel et al., 2003] revealed interferences in the acetone and acetaldehyde measurements of Apel/Riemer, but these have already been corrected for the comparison data shown. Any of the former problems could dramatically degrade the data correlation between these two instruments. The same problems, however, would be expected to degrade correlations with other related compounds as well. 
Figures 27c–27j show the concentration of each of the above compounds plotted against that of CO. While some compounds show a better correlation with CO than others, it is reasonable to assume that all should have some degree of positive correlation even if CO is only assumed to be a tracer for Asian plumes. In several cases, there appear to be significant differences in the degree of correlation with CO. Some lack of correlation may be unavoidable in the Apel/Riemer data because of the 10–60 s time base used. Their sampling time base is dependent on altitude. Most plumes are observed at relatively low altitude; this is where the time base for Apel/Riemer is the shortest, often less than 20 s. Relatively high variability is observed for CO over the 1-min time periods perhaps precluding excessively tight correlations even in plumes. Better correlation with CO is expected for shorter-lived compounds that have no significant noncombustion sources. For longer-lived compounds such as acetone and methanol that have significant additional sources, one might expect a poorer correlation, particularly outside of plumes. It should be noted that for all four of these compounds and also for formaldehyde, the plots that appear to show the best correlation come from instruments that on average measured lower concentrations. If there were compounds that were highly correlated with plumes and CO which caused interferences in these instruments, they could enhance the observed correlation with CO, but they would also presumably lead to higher not lower measured concentrations. Thus it seems unlikely that the better correlations observed are a result of measurement interferences.
 A major finding of this study is that two-aircraft intercomparisons can provide much useful insight into instrument operation and measurement credibility. Measurements of long-lived compounds are probably the easiest to compare, and good agreement between several such species helps to define times when air masses are most similar. These are times when measurements of other long-lived species would also be expected to agree with each other. Much can also be learned about the operation of instruments measuring short-lived compounds, but additional information is needed. The availability on multiple aircraft of rapid, high precision measurements of compounds with intermediate lifetimes such as CO and O3, which can vary over a relatively large dynamic range, provides additional insight into bulk air mass similarities. Good agreement of these measurements by itself does not assure that identical air masses are being sampled by both aircraft. More recent localized perturbations of photon flux or chemical transport/injection can alter short-lived compounds, as discussed for complementary variations in NO and OH which differed between aircraft. The latter rare incident, however, occurred in the boundary layer during the first intercomparison leg, for which the average aircraft separation was still a few kilometers and local injection of NO was not surprising. In later flights, when the aircraft separation was reduced to well below 1 km, the NO agreement was always much better than the factor of 2 observed briefly during the first comparison on 4 March 2001 at around 0124 LT. To remove the ambiguity associated with comparisons of short-lived species, additional rapid measurements of many such species are highly desirable. When several fast measurements all suggest that a highly localized plume has been crossed, then agreement between aircraft would not be expected. Several candidate compounds which can already be measured rapidly enough are NO, H2O, and SO2. 
Discrepancies in NO at the 20% level in the present data set were observed when flying through both structured air masses and what appeared to be relatively uniform air masses. If the precision and cross calibration of these NO instruments could be improved so that agreement between aircraft (not necessarily absolute accuracy or measurements near the detection limit) was consistently about 5%, then the shorter lifetime, much larger dynamic range, and central role of NO in photochemistry would provide a major improvement in assessing air mass similarity. Fast water measurements on both aircraft would also provide additional but complementary insight into air mass similarity. At present, without very good agreement in NO or water measurements (there was no fast measurement on the P-3B, such as the Diode Laser Hygrometer on the DC-8, with which to compare), it is not clear that an event of relatively short duration (small distance) would be detectable if it only involved differences on the order of 20% in relatively short atmospheric lifetime species. Sulfur dioxide measurements could also provide an additional intercomparison tool. Though this compound compared poorly between TRACE-P aircraft, it compared well using identical techniques on the NASA P-3B and the NCAR C-130 during TRACE-P–ACE-Asia comparison flights [Thornton et al., 2002], and it can be measured on a subsecond timescale.
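The idea of using several fast tracers together to judge air mass similarity can be illustrated with a simple screen. This is a hypothetical sketch, not a procedure used in this study: the tolerance values are invented (the 5% NO figure merely echoes the target agreement mentioned above), and the tracer values are illustrative.

```python
def relative_difference(a, b):
    """Absolute difference divided by the mean of the two values."""
    return abs(a - b) / ((a + b) / 2.0)


def similar_air_mass(p3b, dc8, tolerances):
    """True when every listed tracer agrees within its fractional tolerance."""
    return all(relative_difference(p3b[name], dc8[name]) <= tol
               for name, tol in tolerances.items())


# Hypothetical fractional tolerances for each fast tracer.
tolerances = {"CO": 0.02, "O3": 0.03, "NO": 0.05}

# Hypothetical simultaneous values (CO and O3 in ppbv, NO in pptv).
uniform = similar_air_mass({"CO": 120.0, "O3": 45.0, "NO": 20.0},
                           {"CO": 121.0, "O3": 45.5, "NO": 20.5}, tolerances)
plume = similar_air_mass({"CO": 120.0, "O3": 45.0, "NO": 40.0},
                         {"CO": 121.0, "O3": 45.5, "NO": 20.5}, tolerances)
```

In the second case CO and O3 agree but NO does not, which is the kind of localized difference that CO and O3 agreement alone would not reveal.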
 Averaged over a significant portion of an intercomparison flight (for example, 100–200 km), persistent differences in chemical concentrations seem unlikely, particularly with an aircraft spacing of only a few tenths of a kilometer; however, at the end of the third intercomparison, O3 and CO measurements showed such differences. While the TRACE-P comparison provides much new general insight into measurement differences and future needs, there still remain concerns about just how similar the air masses sampled by the two aircraft were. In the future, it is hoped that the ever expanding development of more rapid measurement capabilities (including water), combined with experience from previous comparisons and more long range planning of intercomparisons (allowing consistent aircraft separations of only a few tenths of a kilometer or less) as an integral part of field campaigns, will make future intercomparisons even more informative. If such comparisons are included as part of a field campaign, they can be accomplished with little additional effort and can directly provide comparison information on the exact instrument configuration used on the mission and its response to conditions encountered during that mission.
 The results of this comparison were quite varied. The first group of measurements agreed so well that additional improvements would advance the mission science objectives little except where even faster measurements are needed, such as for flux studies. The second and third groups of measurements, including NO, PAN, HNO3, OH, and HO2, showed good promise, particularly PAN, but improvements in all of these would significantly advance scientific goals. OH and HO2 discrepancies make it particularly difficult to intercompare mission results. In the case of OH, absolute instrument calibration would appear to be an area for improvement. Similar concerns exist for HO2 and RO2, but the situation is more clouded by the inability to directly compare results. Hopefully, in the future these two instruments can be compared while both are measuring either HO2 or HO2 + RO2 or both. The final group of two-aircraft comparisons suggests that at least one of the instruments measuring NO2, SO2, and PPN was either too close to its detection limit or in error. These large discrepancies need to be resolved if these instruments are going to contribute to future joint aircraft measurement and modeling efforts. Additional insight into ongoing concerns about NO2 discrepancies and model comparison is provided by Olson et al. and Nakamura et al. (submitted manuscript, 2003). While the comparisons of instruments that were solely on the DC-8 during TRACE-P generally appeared to show somewhat poorer results, they were also subjected to a far greater diversity of air masses. Two-aircraft comparisons in plumes would be highly desirable in the future but will require far more planning and some luck. One of the major areas of improvement needed for the instruments that were only on the DC-8 is a more sensitive measurement of formaldehyde. Data coverage is significantly limited by measurements at or below the limit of detection, and far more measurements are within a factor of 2 or 3 of this limit. 
Additionally, the potential for interferences, inlet effects, and possible calibration problems in the oxygenated hydrocarbon measurements needs to be more fully explored.
 As stated previously, the results of this intercomparison should be viewed as a starting point for achieving a better understanding of instrument operation and aircraft measurement problems. This text is specifically not intended to provide a critical review of individual instrument operation, but rather to summarize where additional effort is needed and as a brief guide to modelers who are using data from both aircraft or from the DC-8 where multiple measurements are available. This is the first time that most of these instruments have been compared on an aircraft, and for several instruments only the first or second time that they have flown. There was somewhat of a tendency for the largest discrepancies to be associated with measurements involving at least one new instrument or measurement technique. This is not meant to suggest that the newer techniques are in error but rather that very different measurement techniques are more likely to disagree than are similar techniques being used by two different investigators. It is also more likely that agreement will be observed for two instruments that have been compared before than for one or more new instruments which have never been compared. It is when good agreement is achieved between two or more dramatically different measurement techniques using independent calibration methods, however, that the most credible measurement validation is provided. Thus the development and intercomparison of unique new measurement techniques needs to be encouraged.
 The authors wish to thank the National Aeronautics and Space Administration (NASA) for its support of the many measurements discussed in this paper through the Global Tropospheric Experiment (GTE) program. We would also particularly like to thank the pilots of the NASA P-3B and DC-8 for making this study successful.