Seven Decades of Neutron Monitors (1951–2019): Overview and Evaluation of Data Sources

The worldwide network of neutron monitors (NMs) is the primary instrument to study cosmic‐ray variability on time scales of up to 70 yr. Since the 1950s, 147 NMs with publicly available data have been in operation, and their records are archived in and distributed through different repositories and data sources. A comprehensive analysis of all available NM data sets (300 data sets from 147 NMs) is performed here to check the quality and consistency of the data. The data sources include the World Data Center for Cosmic Rays, the Neutron Monitor Database, the Pushkov Institute of Terrestrial Magnetism, Ionosphere, and Radiowave Propagation (IZMIRAN), and individual station/institution databases. It was found that the data from the same NM can be nonidentical and of different quality in different sources. We give and tabulate here a recommendation for the optimal data source of each NM. We also present a list of 29 “prime” stations with the longest and most reliable data. Verified data sets for these prime stations are provided as supplementary information.

In this study, we analyze the history and the current global status of publicly available NM data. Using an automated data collection and analysis system, we obtain, study and compare data sets from different NMs and sources to produce an up-to-date assessment of the NM data sets and reliable recommendations for their usage, with the aim to assist NM data users to produce more reliable and reproducible results.
This study is organized as follows. In Section 2, we present a brief history of the NM network and NM data practices. Section 3 gives an overview of the NM data repositories, common practices, problems, and limitations. Selection of the prime stations and their assessment are presented in Section 4. Section 5 gives our recommendations for future improvements of the NM data archiving. Conclusions are summarized in Section 6.

Brief History of Neutron Monitors as Space-Physics Instruments
NMs were invented by Simpson (1948) as a detector to register and study the secondary neutrons generated by cosmic rays. The Climax NM started operating in 1951, whereas many other NM stations were launched during the International Geophysical Year (IGY) in 1957. These early NMs are therefore referred to as "IGY" type. Based on the collected experience, the design was improved, and a new type of detector, called NM64 or "super-monitor," was introduced during the International Quiet Sun Year of 1964. This design proved so successful (Hatton & Carmichael, 1964), with stable operation and robust data production, that it has remained the standard design ever since, and the number of NM64s operating around the globe has reached many dozens. It should be noted that the standard NM64 design (Hatton, 1971) was initially based on the BF3-filled proportional counters BP-28 produced by the Chalk River laboratory in Canada and their Soviet analog SNM-15 used in the USSR and Eastern Europe. The latter are about 15% less efficient than the standard NM64 counters (Abunin et al., 2011; Gil et al., 2015) because of the less pure filling gas. Later, there was a tendency to use 3He-filled counters but, because of the high pressure and the tendency of helium to leak, they proved unstable in the long run. At present, BF3-filled proportional counters of slightly improved design (higher gas pressure) are used again (Strauss et al., 2020).
The data obtained from individual NMs are traditionally collected at 1 h resolution by the World Data Center for Cosmic Rays (WDCCR) which was established during the International Geophysical Year of 1957 at IZMIRAN (Pushkov Institute of Terrestrial Magnetism, Ionosphere, and Radiowave Propagation), USSR and RIKEN, Japan. Through WDCCR, data were exchanged between the Soviet Union, the USA, and Japan. WDCCR is currently maintained by Nagoya University, Japan, and is mirrored at IZMIRAN. It offers historical data sets, provided as a set of ASCII data files in several formats, through an online FTP service that is updated on a monthly basis. WDCCR stores data from many old and short-lived stations that cannot be found anywhere else. IZMIRAN not only maintains a mirror of the WDCCR data set but also continuously develops its own database by collecting data and implementing apparent corrections to the raw data.
The first online real-time data service was provided by the Moscow NM station in 1997. In 2000, Oulu NM launched an online database, the first in Western countries. Since then, several NM stations started their own data service, each in its own style. A decade later, in 2008, the Neutron Monitor Database (NMDB) project started under the EU FP7 program, providing an accessible database of archival and real-time verified data from about 50 monitors. It started as a European project but currently includes NMs from around the globe.
Many active NM stations also offer data through their own web services or other systems. These also include stations and research institutes that manage and distribute data from multiple stations, as will be discussed below.

Data and Methods
In this study, we collected all available NM count-rate data from all the repositories, databases and individual NM homepages. We have identified 147 NM stations whose data are available in any of the main sources of data listed in Table 1. The station list is provided in Table 2 and in the supplementary information.
We developed an automated system for fetching online NM 1 h resolution data from all the sources of Table 1 up to the end of the year 2019. Each data set was then parsed and transformed into the Matlab data format. Thus, a data set of hourly NM count rates was created for further analysis. All data were downloaded during June 20-23, 2020.
A brief description of the data repositories is provided in the following subsections.

WDCCR
The WDCCR started its operation in 1957 (Lincoln & Shea, 1973). It collects pressure-corrected data from NM stations and makes them available online as ASCII files of 1 h time resolution, through an FTP service. There are 140 sub-folders for NM data in the FTP folder, but two of them (Bergen & Cape_H) are empty. Metadata is provided in each file, and changes, e.g., the number of counters, can be traced in the metadata. The data in WDCCR are typically from the time of their recording, while revisions/corrections/updates of the already written data are not foreseen.
The data for this study were collected from the WDCCR repository at http://cidas.isee.nagoya-u.ac.jp/WDCCR/.

Data Format
WDCCR offers data in three formats: LONGFORMAT, SHORTFORMAT, and CARDFORMAT, described in the WDCCR homepage under "Data Formats". All the formats contain the same data in yearly ASCII files, which differ only in presentation. The long format displays monthly values in 12 lines, with relevant metadata at the start of each line. The short format displays the same data as the long format, but the monthly metadata is described in the datafile headers rather than in a readme-file, and the count rates are displayed with 12 h values on each line. The card format is similar to the short format in the way it displays data but does not contain metadata beyond the basic station descriptors (NM name, type, pressure corrections, etc.) at the start of each line. For this study, we use data in the LONGFORMAT.

Scaling Factors
Count rates in WDCCR are provided as unscaled values (DATA), with a Scaling Factor (SF) and a Constant (CONST) provided in the metadata. The real count rates are obtained from DATA using SF and CONST. However, these SFs do not always correct such apparent problems as jumps related to a changing number of counters, their malfunctioning, a change of type, etc. Neither the scaling factors nor their source or methodology are documented. Such apparent jumps need to be analyzed and corrected separately.

NMDB
The NMDB was established in 2008 as a part of a European Union funded project (FP7 Program) to create a modern database of NM data, including real-time updates (Mavromichalaki et al., 2011). Originally, it was built on mostly European NMs, but data from several non-European stations have been added later. In total, NMDB has data folders for 66 NMs, 8 of which contain no data, leaving 58 stations with data available. Except for Leadville and Polarstern, all NMs listed there have data available from other sources as well.
Revisions and updates of data in NMDB are the responsibility of the individual station teams, which means that the data come straight from the source without any additional correction by NMDB.

Data Format
NMDB provides data for uncorrected (raw) counts, pressure- and efficiency-corrected count rates, and barometric pressure. Here we always use the "corrected" data.
The NMDB contains three data table options for each station: "ori", "revori", and "1 h", which contain the originally loaded data, the revised data in the best time resolution (usually 1 min), and the 1 h validated data, respectively. Short descriptions are available at http://www.nmdb.eu/nest/help.php#helptable and http://www.nmdb.eu/nest/statements.html.
Status of the currently available data and their version date for different tables can be found at http://www.nmdb.eu/status/status.php.
The NMDB-ori data set cannot be changed after the first load, while all later corrections/modifications are reflected in the NMDB-revori (and NMDB-1 h) data sets. Accordingly, the NMDB-revori table supersedes the ori table (i.e., the NMDB-ori table is just the first version of the NMDB-revori table). In this analysis, we do not discuss the -ori and -revori tables separately, and only analyze the -revori and -1 h tables.
For the NMDB data retrieval, we employed an automated web query method, which downloads and parses the data at 1 h resolution for each station from both the revori and 1 h tables in 1 yr increments. The queries were split into 1 yr increments since the NMDB's NEST data retrieval system automatically decreases the resolution (e.g., from 1 h to 1 month time resolution) for queries that span too long a period. Users should also be aware that the data table can change when this happens (e.g., from -1 h to -revori) and should check the headers to verify which data were retrieved. Finally, the data subsets were compiled into a single matrix for the subsequent analysis.
VÄISÄNEN ET AL.
The web query method fetches the data through a URL with the following parameters: StationAcronym is the acronym associated with the specific station, NMDBtable is the selected data table, StartYear is the year for which to collect data, and EndYear is StartYear+1.
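The yearly splitting described above can be sketched as follows. This is only an illustration, not the code used in the study: the names `yearly_windows`, `collect`, and `fetch_year` are hypothetical, and the actual NEST query URL is not reproduced here, so the fetching step is passed in as a callable.

```python
from typing import Callable, Iterator, List, Tuple

Row = Tuple[str, float]  # (timestamp, count rate); an assumed row layout


def yearly_windows(first_year: int, last_year: int) -> Iterator[Tuple[int, int]]:
    """Yield (StartYear, EndYear) pairs with EndYear = StartYear + 1."""
    for start in range(first_year, last_year + 1):
        yield start, start + 1


def collect(station: str, table: str, first_year: int, last_year: int,
            fetch_year: Callable[[str, str, int, int], List[Row]]) -> List[Row]:
    """Query one year at a time and concatenate the parsed rows.

    fetch_year is a hypothetical callable that performs the web query for
    one (StartYear, EndYear) window and returns the parsed rows.
    """
    rows: List[Row] = []
    for start, end in yearly_windows(first_year, last_year):
        rows.extend(fetch_year(station, table, start, end))
    return rows
```

Keeping each request to a single year is what prevents the automatic drop in time resolution; a real `fetch_year` would also need to inspect the returned header to confirm which table was actually served.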

IZMIRAN
The Pushkov Institute of Terrestrial Magnetism, Ionosphere, and Radiowave Propagation (IZMIRAN) of the Russian Academy of Sciences was established in 1939 by Nikolay Pushkov. The IZMIRAN database offers data for most Russian (former Soviet) NM stations, but it also offers data from other NM stations.
Altogether, IZMIRAN has data folders for 82 NMs (Belov et al., 1998). Only one of these does not contain any data (Putre), leaving 81 stations with available data.
The database does not simply copy data from original sources, but apparently applies an automated procedure of validation and correction of the raw data. However, the procedure is neither documented nor traceable and may distort the data. We have found, e.g., that outliers of unknown origin occasionally appear in otherwise good data.
Because of the unofficial status of the redistributed data and lack of proper documentation, IZMIRAN data should be taken with caution, even in cases where our analysis recommends it as the primary or secondary source.

Data Format
The IZMIRAN database is located at http://cr0.IZMIRAN.ru/common/links.htm. The IZMIRAN data is available through the "iDB"-button next to each station. There are options for pressure-corrected data, barometric pressure data and nonpressure-corrected data. The queried data only includes timestamps and the data values. Empty values are denoted by 0.
The pressure-corrected data for the full analysis period were downloaded on June 22, 2020 using a web query parameterized by StationAcronym, the acronym associated with the specific station.
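Because empty values are denoted by 0, they should be converted to missing values before any averaging or normalization. A minimal sketch follows; the whitespace-separated "timestamp value" line layout and the function name `parse_izmiran_rows` are assumptions made for illustration, and the real query output may be laid out differently.

```python
import math
from typing import List, Tuple


def parse_izmiran_rows(lines: List[str]) -> List[Tuple[str, float]]:
    """Parse 'timestamp value' lines, mapping the 0 placeholder to NaN.

    Assumption: the last whitespace-separated field is the data value and
    everything before it is the timestamp.
    """
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        stamp = " ".join(parts[:-1])
        value = float(parts[-1])
        # A value of 0 marks an empty record, not a real count rate.
        rows.append((stamp, math.nan if value == 0 else value))
    return rows
```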

Polar Geophysical Institute
The Polar Geophysical Institute (Murmansk region, Russia) distributes data of the Apatity NM upon an access request to the data owner. There is also an option for Barentsburg NM data, but data retrieval for it did not work for the present analysis.

Bartol
The Bartol Research Institute of the University of Delaware (Newark, USA) operates eight NM stations: McMurdo, Swarthmore/Newark, South Pole, Thule, Fort Smith, Peawanuck, Nain, and Inuvik. Data sets have not been updated after 2017.

Jungfraujoch
The Physikalisches Institut of the University of Bern (Switzerland) operates two NMs (one of NM64 and one of IGY type), both located at the Jungfraujoch high-mountain station.

Lomnický Štit
The Institute of Experimental Physics of the Slovak Academy of Sciences in Košice (Slovakia) operates the Lomnický Štit NM station.

Mexico
Universidad Nacional Autónoma de México operates the Mexico City Cosmic Ray Observatory.

Oulu
The Oulu NM started operation in 1964 in the Kontinkangas district and was moved to the Linnanmaa campus in 1974, where it is still located. The University of Oulu also operates two mini-NMs (a standard DOMC and a bare (lead-free) DOMB) at the Concordia station on the Central Antarctic plateau (Poluianov et al., 2015; Usoskin, Mursula, & Kangas, 2001).

South African Stations
The Center for Space Research in the North-West University (NWU) in Potchefstroom (South Africa) operates NMs at five locations: Hermanus, Potchefstroom, Sanae64 (NM64), Sanae80 (lead-free) and Tsumeb.

Yakutsk/Tixie Bay
Yu.G. Shafer Institute for Cosmophysical Research and Aeronomy of Russian Academy of Sciences (Yakutsk, Russia) operates two NMs, viz. Yakutsk and Tixie Bay stations.

Other Sources
We also list here a few other possible data sources which we did not use because of some problems reported below.
The data for the Australian NMs at Mawson and Kingston are available through their web page at http://www.sws.bom.gov.au/World_Data_Centre/1/7 and FTP at ftp://ftp-out.sws.bom.gov.au/wdc/wdc_cosray/. However, the website offers only daily files. Moreover, because of a very slow and unstable connection, we were unable to download the entire data set. Since data from these NMs are available from other sources even at the 1 h resolution used here, we did not analyze this data set.
The Tibet/Yang Ba Jing NM has a data distribution web page at http://ybjnm.ihep.ac.cn/nm/, which, however, was not working during the preparation of this study.

Metadata
Data for each NM station are usually accompanied by metadata, either in a station information page or in the header of a data file, which typically include the following parameters:
Name, typically denoting the geographical name of the location. Historically, because of the limited filename length in old data formats, each NM station also has a 4-letter or 6-letter acronym, which is usually the same for the same station across databases but can also differ (e.g., McMurdo station is called MCMU and MCMD in the NMDB and IZMIRAN databases, respectively).
The naming of some stations can also cause confusion for data users who are not aware of the histories of specific stations. One example is the Swarthmore/Newark station, which moved from one location to another nearby one in 1978 and can be referred to as "Newark," "Swarthmore," and "Swarthmore/Newark" in different data sources. The "Newark" data set can either have data for the whole Swarthmore + Newark period (Station, IZMIRAN, Station) or only for the non-Swarthmore period (WDCCR). Separate data sets containing only Swarthmore data are available in WDCCR and in IZMIRAN, called Swarthmore and Swarthmore/Newark, respectively. This is confusing since the Bartol institute uses Swarthmore/Newark as the name for the full data set, whereas the IZMIRAN data set of that name contains only Swarthmore data. Also, the Aragats and Nor-Amberd stations (as named in NMDB) are called "Yerevan3000" and "Yerevan2000" in IZMIRAN and "Erevan3" and "Erevan" in WDCCR, respectively. The acronyms of the stations may also differ accordingly in the data sources.
We have performed a careful check to make sure that we always refer to the same station even if the names/acronyms are not identical across the databases.
Location includes the geographical latitude, longitude, and altitude above sea level.
Geomagnetic cutoff rigidity provides an estimate of the sensitivity of an NM to the energy/rigidity of cosmic rays. It is roughly interpreted so that the primary cosmic-ray particles must possess rigidity higher than the cutoff (Cooke et al., 1991). The cutoff rigidity may slowly change for a fixed geographical location because of the migration and current weakening of the geomagnetic dipole, but this is not always taken into account in the NM metadata. Sometimes metadata (e.g., the IZMIRAN "see info" page) mention the rigidity computation year for a single value but do not provide the exact model. This information can be used as a rough estimate, but for a detailed long-term analysis, it is recommended that the cutoff rigidity be calculated for each location and each given time, rather than being blindly copied from the metadata.
The metadata, including years of operation, are available from the following locations:
NMDB: Station list at http://www.nmdb.eu/station/.
IZMIRAN: Station info is available under the "see info" button under the specific station "idB" page, or under http://cr0.IZMIRAN.ru/*station*/baseinfo.htm, where station is the short acronym of the station.
Station homepages usually also provide/display some metadata in many different ways which we do not explicitly cover here.
The metadata do not always reflect possible temporal changes (e.g., changes in rigidity cutoff, number of counters, or monitor type), even in cases with a time series of metadata (headers of WDCCR data files). The metadata from different sources are not always identical and, e.g., some differences in the reported cutoff rigidity (probably due to the computation year, the model used, or other reasons) are common.
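The naming differences discussed in this section can be handled with a small lookup table that maps source-specific names to one canonical station. The sketch below only encodes the Aragats and Nor-Amberd names quoted above; the choice of the NMDB names as canonical and the helper name `canonical_station` are assumptions for illustration.

```python
from typing import Dict, Optional

# Canonical station -> name used in each source, as quoted in the text.
# Using the NMDB names as canonical is an arbitrary choice for this sketch.
STATION_NAMES: Dict[str, Dict[str, str]] = {
    "Aragats": {"NMDB": "Aragats", "IZMIRAN": "Yerevan3000", "WDCCR": "Erevan3"},
    "Nor-Amberd": {"NMDB": "Nor-Amberd", "IZMIRAN": "Yerevan2000", "WDCCR": "Erevan"},
}


def canonical_station(source: str, name: str) -> Optional[str]:
    """Return the canonical station for a source-specific name, or None."""
    if not name:
        return None
    for canonical, per_source in STATION_NAMES.items():
        # Case-insensitive match against the name used by this source.
        if per_source.get(source, "").lower() == name.lower():
            return canonical
    return None
```

A lookup keyed on both source and name is needed because the same string can mean different things in different sources, as the Swarthmore/Newark case shows.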

Prime Stations
With so many stations, it is difficult to check the stability of any individual NM. In order to have a reliable baseline for data comparison and validation, we have constructed an aggregate based on data from stable long-lived NMs that we here call "prime" (or "reference") stations. The selection of the prime stations was based solely on the quality of data, not involving any a-priori or subjective knowledge or preferences, using the following criteria:
1. Times of ground-level enhancements (GLEs) were removed from each data set of hourly pressure- and efficiency-corrected count rates using the list of the International GLE Database (https://gle.oulu.fi).
2. The data were normalized by the median over the 2-year interval 1975-1976 (or 1995-1996 if the data for 1975-1976 were not available).
3. Outliers were excluded using a 5-point moving median filter, which removes points that are more than three median absolute deviations from the 5-point median.
4. After the previous steps, stations with fewer than 20 years of total data coverage were excluded.
5. All data sets were visually checked for apparent steps, drifts, or other obvious errors in the data. Some of the errors could be corrected using metadata (e.g., a change of the number of counters or an incorrect scaling factor) or using information from other data sources.
6. Data sets that could not be corrected in the previous step were excluded. To automatically exclude data sets with too large steps or unphysical variation, the following method was applied. Using the knowledge that the natural variability of hourly cosmic-ray data does not exceed ±30% even for polar NMs and is much smaller for lower latitude stations, we excluded data sets with large steps or drifts by requiring that the max-to-min hourly value ratio does not exceed two (i.e., the variations from the mean in the data set do not exceed ±33%).
7. In cases with several data sources available for a prime station candidate, the source with the longest data coverage was used.
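The automatic parts of the criteria above (normalization, outlier removal, and the max-to-min check) can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the function names are hypothetical, missing values are represented by NaN, and the median absolute deviation is computed locally within each 5-point window, which the text does not specify.

```python
import math
from statistics import median
from typing import List


def normalize(values: List[float], in_reference: List[bool]) -> List[float]:
    """Divide by the median over the reference interval (e.g., 1975-1976)."""
    ref = median(v for v, m in zip(values, in_reference)
                 if m and not math.isnan(v))
    return [v / ref for v in values]


def drop_outliers(values: List[float], n_mad: float = 3.0) -> List[float]:
    """Set to NaN points more than n_mad MADs from the 5-point median.

    Assumption: the MAD is taken over the same 5-point window.
    """
    out = list(values)
    for i, v in enumerate(values):
        window = [w for w in values[max(0, i - 2): i + 3] if not math.isnan(w)]
        if math.isnan(v) or len(window) < 3:
            continue  # too little data to judge this point
        med = median(window)
        mad = median(abs(w - med) for w in window)
        if abs(v - med) > n_mad * mad:
            out[i] = math.nan
    return out


def passes_range_check(values: List[float], max_ratio: float = 2.0) -> bool:
    """Require max/min <= 2, i.e., at most about +/-33% around the midpoint."""
    good = [v for v in values if not math.isnan(v)]
    return max(good) / min(good) <= max_ratio
```

The ±33% figure follows from the ratio limit: if max = 2·min, the midpoint is 1.5·min and the half-range is 0.5·min, which is one third of the midpoint.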
Using this procedure, we selected 29 prime stations, listed in bold in Table 2. For further analysis we divided them into three groups according to their nominal geomagnetic cutoff rigidity R_c: low- (R_c ≤ 1.75 GV, 12 NMs), mid- (1.75 < R_c ≤ 2.75 GV, 5 NMs), and high-rigidity (R_c > 2.75 GV, 12 NMs) stations. The temporal variability of these prime stations is shown in the Supplementary Information Figure S4. For the low-rigidity prime NMs, we computed a reference data set NM_low as the mean of the normalized prime stations with R_c ≤ 1.75 GV, shown by the black curve in the upper panel of Figure S4. The reference data set for the medium-rigidity stations, NM_med, was composed in a similar way (Figure S4, middle panel). For the high-rigidity group of NMs, averaging was not done because of the too wide range of the R_c values, from 2.9 to 11 GV, so that the modulation effects would make the averages solar-cycle dependent. This would cause variation around the mean when comparing station data to prime data.
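The grouping and the construction of a reference data set can be illustrated with a short sketch. The rigidity thresholds are those given above; the function names are hypothetical, and averaging only over the stations that have data in a given hour is an assumption, since the treatment of gaps is not described.

```python
import math
from typing import List


def rigidity_group(rc_gv: float) -> str:
    """Classify a station by its nominal cutoff rigidity in GV."""
    if rc_gv <= 1.75:
        return "low"
    if rc_gv <= 2.75:
        return "mid"
    return "high"


def reference_series(stations: List[List[float]]) -> List[float]:
    """Hour-by-hour mean of normalized series of equal length.

    Assumption: NaN gaps are skipped, so each hour is averaged over the
    stations that have data; hours with no data at all yield NaN.
    """
    reference = []
    for hourly in zip(*stations):
        valid = [v for v in hourly if not math.isnan(v)]
        reference.append(sum(valid) / len(valid) if valid else math.nan)
    return reference
```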
The prime data sets were used to check the data quality of all stations and their different sources. For low- and medium-rigidity NMs we compared the data of each individual NM with the corresponding reference data sets NM_low and NM_med. For the high-rigidity range we compared the individual NM data with the prime station with the nearest rigidity cutoff or, in case of no time overlap, with the second or the third nearest ones.
For the comparison, we computed the ratio of the normalized count rates of the analyzed NM to the prime reference data set.
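A minimal sketch of this comparison is given below. The ratio itself follows the text; the helper `good_fraction` and its reading of "good" data as a ratio within 10% of unity are assumptions based on the 10% limit mentioned in the discussion, and both function names are hypothetical.

```python
import math
from typing import List


def ratio_to_reference(station: List[float], reference: List[float]) -> List[float]:
    """Hourly ratio of normalized station counts to the reference series."""
    ratios = []
    for s, r in zip(station, reference):
        if math.isnan(s) or math.isnan(r) or r == 0:
            ratios.append(math.nan)  # no comparison possible for this hour
        else:
            ratios.append(s / r)
    return ratios


def good_fraction(ratios: List[float], tolerance: float = 0.10) -> float:
    """Fraction of valid hours whose ratio deviates from 1 by < tolerance."""
    valid = [x for x in ratios if not math.isnan(x)]
    if not valid:
        return math.nan
    return sum(abs(x - 1.0) < tolerance for x in valid) / len(valid)
```

A ratio that stays flat near unity indicates consistency with the reference, while steps or drifts in the ratio point to problems in the analyzed data set.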
As an example, we provide a detailed analysis of the mid-rigidity Newark (before 1978 known also as Swarthmore) NM in the Supplementary Information S1. Newark/Swarthmore has data represented in all the analyzed sources for a long time period and also nicely depicts common characteristics related to the different sources. Similar analyses were made for all stations and all data sources. Based on the fraction of the good data (and manual inspection of the comparisons), we constructed a list of recommended data sources as described below.

Recommendations
Information on all available NM data sets is given and described in Table S5 (an Excel file). This table contains a large amount of information that can be useful for NM data users. The acronyms are helpful when accessing data, since the data retrieval methods usually employ the acronym specific to the database. The number of all hourly data points from each source gives a rough estimate of data coverage. The overall usability of the whole length of data depends on the data quality and the potential corrections that can be applied to the data. Latitude, longitude, altitude, and geomagnetic cutoff of the stations were collected from the metadata sources, as described in Section 3.6. These values might not be correct in cases where the station has been moved during its operation.
Based on the analysis described in Section 4, we have summarized our recommendations on the data sources for each station in Table 2. More detailed information on the recommended data sources is collected in Table S1, which includes station name, recommended source, secondary source(s), and notes about the data. The 'secondary' (or alternative) sources are nearly equivalent to the primary ones and may contain additional data. Summary statistics of the primary and secondary data source recommendations are presented in Table 1.
10.1029/2020JA028941
The following caveats should be noted. First, the "data quality" used here as a means for data source selection is only examined relative to the individual station: even if a specific source is recommended for the station, this does not necessarily mean that the data quality is good in general. It only indicates which of the sources is the best according to our criteria. Moreover, the data quality was assessed in late June 2020 and may change later.

Discussion and Conclusions
We have performed a survey of all available NM records in a number of publicly available data sets and assessed their quality. We present a comprehensive table containing detailed information about the available data sets and also a list of recommended data sources for each station. This information reflects the state of affairs at the time of writing; the data sets are subject to change, and users of this information need to keep this in mind. Nevertheless, these results form the most extensive and up-to-date analysis of the NM data sets and provide useful basic information for users and developers of the related services.
It appears that data sets for the same NMs are not identical between different sources, making it difficult to control the reliability and reproducibility of studies based on NM data. While the WDCCR provides a simple repository for the data without corrections and updates of the data, other data sources try to resolve this problem. However, even for the NMDB project, there are discrepancies between different data tables, in particular the 1 h and revori ones.
Somewhat surprisingly, station homepages are not the recommended sources for multiple stations. It seems that through the advent of NMDB, many NM stations have switched to preferring to use NMDB to distribute their station data. This often leads to a situation where NMDB has more up-to-date and reliable (corrected) data available. Nevertheless, nearly all station homepages are at least a secondary recommended source, so using station homepages is mostly reliable.
IZMIRAN implements corrections in many data sets that are not available elsewhere. One such example is the Rome station, where IZMIRAN has corrected a large number of steps. This is useful, but a proper description of corrections is not readily available.
The results seem to indicate that a rule of thumb for selecting which data source to use is as follows:
1. Station homepages are often a good choice, but might not always have the most up-to-date data.
2. NMDB is usually a good choice for long-lived European NMs but also houses reliable data from many NMs from around the world.
3. IZMIRAN is a good choice for most Russian and East European NMs but also has good and/or corrected data from other areas. IZMIRAN often has a corrected version of WDCCR data. However, users should be aware that, even though the corrections mostly look good and plausible (often correcting obvious errors present in other data sources), the changes are undocumented and unofficial. Therefore, caution should be taken when using IZMIRAN data.
4. WDCCR has data from many (short-lived) stations that are not available elsewhere, but usually other sources have more reliable data.
A summary geographic map of these recommendations is shown in Figure 1. Because of the large number of stations, names are not shown. For more detailed information and station names, the reader should refer to the supplementary table.
As discussed in Section 3, the metadata of the stations are sometimes not identical across different data sources. This means that users should double-check relevant and important metadata either from multiple sources or by directly asking the station team (when possible).
All of these inconsistencies make the use of data difficult for a nonexpert, who is not familiar with data sets and the history of ground-based observations. Here, we made an effort to systematize the available and partly controversial information and to provide a user with a verified set of ground-based cosmic-ray measurements.
It should be noted that this survey presents only a momentary snapshot (as of June 2020) of the situation with data sources. The analysis has only been conducted for the 1 h data resolution, and results for other resolutions may differ. Due to the nature of online data services, the presented results may change when data are changed, corrected, removed, or combined in the analyzed data sources. The selection of data source recommendations includes a visual inspection of the data to account for the incompleteness of the prime station validation, which can introduce a subjective bias in the results. This analysis also does not take into account possible corrections that could easily make a given source reliable and comparable to the other sources. When selecting the data source to use, one should refer to the data coverage (number of data points) in the information table to check whether a "nonrecommended" source could have more data coverage after corrections. The prime-station method utilized here only roughly validates the data quality in relation to other stations and may not be accurate, especially for high-rigidity stations, because of their low statistics. For example, the <10% limit for good data did not catch the clear 4% step in many Newark data sets (see Supplementary Information S1). A more sophisticated method, based on theoretical modeling of cosmic-ray modulation derived from the entire NM network, would provide a more robust assessment and is planned for subsequent work.