Big data – model integration and AI for vector-borne disease prediction

integration


INTRODUCTION
Vector-borne diseases that spread over regional to continental scales pose threats to public health, food security, and national and international trade (Tomley and Shirley 2009). Predicting the factors whose variation explains patterns of disease as part of early warning strategies (EWS) for arthropod-borne diseases (arboviruses) is challenging because different hosts (livestock, humans, or wildlife), vectors, and dispersal mechanisms can be involved as a disease system encounters a spatially heterogeneous environment through time (Racloz et al. 2012, Altizer et al. 2013, Stewart-Ibarra and Lowe 2013, Parham et al. 2015. Arboviruses with endemic and epidemic life cycles, such as West Nile, Rift Valley fever, and Venezuelan equine encephalitis, are directly and indirectly influenced by environmental variability at multiple spatial and temporal scales that often result in different processes governing dynamics as spatial extent changes (Woolhouse et al. 2013). Focusing on one or a few factors or processes at the exclusion of others can lead to devastating and surprising consequences for human and livestock health (Jacquot et al. 2017, Mayer et al. 2017, Sule et al. 2018. For many diseases, there is often insufficient information about the environmental factors and processes that interact to govern dynamics leading to multi-scale patterns in spread (Escobar and Craft 2016). It is thus necessary to first develop data-intensive, process-based approaches using a disease system where datasets are readily available for the full suite of potential environmental factors across its spatial domain (Michael et al. 2017). Developing scale-dependent, analyticsbased approaches will facilitate the implementation of more cost-effective early warning mitigation strategies for geographically extensive diseases (Han and Drake 2016).
Traditional approaches to identifying processes governing spread of disease have had limited success in examining multiple potential environmental factors as the spatiotemporal conditions change. Local-scale disease processes examined in response to environmental conditions are traditionally studied in the laboratory; however, these results cannot be extrapolated to other spatial extents and field conditions unless stability and stationarity in fine-scale patternprocess relationships are assumed (Althouse et al. 2012, 2016, Michael et al. 2017. Likewise, studies of multi-scale correlations cannot easily infer the importance of different processes (Cohen et al. 2016), and case studies at regional scales often fail to provide a mechanistic understanding of disease dynamics at finer scales (Ginsberg et al. 2009, Messina et al. 2014, Walsh and Haseeb 2015, Faria et al. 2017. Climate-driven models can examine the role of multi-scale climatic variation on disease dynamics (Morin and Comrie 2013), but not the relative effects of the full suite of environmental factors that vary across geographic space. Few studies have quantified the multi-scale relationships among more than one environmental factor and the habitat of a vector-borne disease in attempts to elucidate processes (Lo Iacono et al. 2018), or to develop a generalized strategy that can be applied to other diseases globally (National Academies of Sciences Engineering and Medicine 2016). Recent advances in data science and online availability of geo-referenced, multi-scale climate and environmental datasets combined with the development of trans-disciplinary approaches provide opportunities to develop EWS that are both context-dependent at a local scale and generalizable across the geographic extent of a disease (Han and Drake 2016).
Recently, Peters et al. (2018) developed a big data-model integration (BDMI) approach guided by expert knowledge to identify and evaluate the relative importance of a large and diverse suite of all known environmental factors and life-history variables to patterns in vector-borne pathogen incursion and expansion. This framework uses a trans-disciplinary team to coherently integrate: (1) fine-scale, process-based data and understanding of vector and host responses to a ❖ www.esajournals.org pathogen and to the local environment, (2) georeferenced disease incidence and virus phylogenetic relationships, and (3) fine-scale patterns in climate, land surface properties, and host density for a multi-decadal temporal period across the continental extent of a disease. This approach differentiates drivers (e.g., climate, topography) from specific variables within each driver that are chosen for their ecological meaning in the system under study (e.g., daily minimum temperature, annual precipitation, elevation). The approach considers potential effects of many environmental variables on disease processes, both individually and their interactions, within and across spatiotemporal scales when little is known about the ecology of a particular system. The approach focuses on pattern-process relationships in the neighborhood of an individual premise leading up to the time of disease. Both key local (potential vertical transmission, horizontal transmission, possible insect overwintering, contact spread between animals, insect transmission to animals) and spatial processes (dispersal of hosts and vectors) are used to identify major drivers that typically influence disease systems. The relative influence of one or more variables within each driver is assessed in a multi-model analysis based on expected relationships from literature and team expertise with responses in one or more processes, ultimately leading to patterns in disease occurrence. This approach allows comprehensive examination of a large number of potential variables across a range of scales, but then restricts analysis to those variables that are biologically meaningful. A trans-disciplinary team is used to develop the conceptual model of the system, identify the variables, and then harmonize, analyze, and interpret the data. Here, the utility of this approach is illustrated using incursions into North America by vesicular stomatitis New Jersey virus (VSNJV), a model system for other vector-borne diseases caused by RNA viruses.

The VS disease system
VSNJV is a vector-borne, zoonotic RNA virus in the family Rhabdoviridae that causes readily observed vesicular lesions on wildlife and domestic livestock. In some species (ruminants, pigs), these lesions are clinically indistinguishable from foot-and-mouth disease (FMD), one of the most devastating exotic diseases in livestock that was eradicated from the United States in 1929. VS is the most reported vesicular disease affecting livestock (domestic horses, cattle, pigs) throughout the Americas (Rodr ıguez 2002), and although VS is less severe than FMD, there is no cure and no vaccine. Economic costs of VS through loss of milk and meat production and through regulatory repercussions (e.g., quarantines, limited movement, and sale of animals and animal products) can be significant (e.g., losses totaled> $14 M in the United States in 1995).
Despite decades of documented patterns of occurrence (Rodr ıguez et al. 2000) and multiple epidemiological studies, there is limited understanding about factors governing patterns in the spread of VS through time. A VS event originating in Mexico has occurred every decade in the Western United States since 1906 with a combined spatial extent over> 1.1 M km 2 from 2004 to 2016 (Fig. 1a, b). Major disease cycles occur at ca. 10-yr intervals that consist of the following: (1) failed incursion years (defined as initial disease spread into the United States) where disease is limited to southern states bordering Mexico and stops after one year (e.g., 2012), (2) successful incursion years with initially limited spatial distribution (e.g., 2004; Fig. 1c) followed by widespread expansion in subsequent years (defined as proliferation and spread) (e.g., 2005; Fig. 1d), and (3) extinction years where few or no cases occur in a limited geographic area following expansion (e.g., 2006). Recent analyses show that each cycle of events has been caused by single distinct viral genetic lineages that originated in southern Mexico (Rodr ıguez et al. 2000, Rainwater-Lovett et al. 2007, Velazquez-Salinas et al. 2014. Landscape-scale patterns of disease have been related to one or a few factors, such as elevation, monthly precipitation, or stream flow (Rodr ıguez et al. 1996, McCluskey et al. 2003, Elias et al. 2019. Host density and environmental conditions affecting the life cycle and dispersal of relevant VSV biological vectors (e.g., black flies, biting midges, sand flies) are believed to play a major role in VS occurrence based on field observations and laboratory experiments (Walton et al. 1987, Kramer et al. 1990, Cupp et al. 1992. The Western United States is particularly rich in climate, stream flow, vegetation, and other biophysical data relevant to different components of the disease transmission system (Table 1). However, a multi-scale analysis of the potential suite of factors that could affect insect population dynamics and disease transmission across the spatial extent and temporal domain of this disease has not been conducted, and these factors have not been examined under natural conditions. Thus, the range of conditions for each environmental variable where the VS disease occurs has not been quantified, and the mechanisms mediating disease emergence and expansion remain elusive.
Furthermore, VS can be a model system for other vector-borne diseases because of the following: (1) the complexity and importance of the reportable disease, (2) the epidemiological knowledge of VS, yet lack of ecological understanding about the natural cycle and spread of this disease, (3) the accessibility of co-located disease occurrence, viral phylogenetic, and environmental data through time, and (4) the availability of scientific and technological expertise. In addition, occurrences of the disease are known to be associated with the presence of the virus because our case definition is clinical signs of disease confirmed by laboratory detection of the virus or immunological evidence of recent infection.
We had two objectives: (1) to identify the factors governing spatial variability in VS occurrence at the landscape-to-regional scale and (2) to assess how local-scale environmental conditions differ between years when animals become infected compared to years without infection, considering incursion years when neighboring premises are not infected separately from expansion years when neighboring premises contain infected animals. Addressing these objectives will inform the development of scale-dependent early-warning strategies for VS in the Western United States given that some of the conditions for disease spread at a larger scale can be modified at the local scale.
We used our multi-scale framework to test hypotheses at two spatial scales. (1) At the landscape-to-regional scale, we hypothesized that spatial patterns are determined by either (a) viral genetic determinants alone, such that year with a different phylogeny (incursion-expansion hypothesis). To test these alternative hypotheses, we compared variables related to occurrence patterns in two separate incursion-expansion events separated by a decade (2004)(2005)(2014)(2015) in which viral phylodynamics have been characterized.
(2) At the local scale of individual premises, we focused on vector-environment interactions because information on host immunity was unavailable. We expected a sequence of conditions would precede infection that are related to processes that increase the abundance of competent and infected insect vectors; different conditions in incursion and expansion years should indicate that different vectors are involved in each year. We tested this hypothesis using a detailed temporal analysis (1-4 weeks and 1-12 months prior to infection) to compare the importance of factors to VS occurrence at individual premises.

Approach to variable selection and harmonization
The trans-disciplinary team consists of experts in this disease, ecologists, and ecoinformatics experts, including people with experience in big data analytics and machine learning (Fig. 2). Our approach focuses on pattern-process relationships in the neighborhood of an individual premise leading up to the time of disease. The spatiotemporal resolution of a neighborhood is assumed to depend on the relationship between a variable and the vector or host process with the potential to influence infection. Both key local (potential vertical transmission, horizontal transmission, possible insect overwintering, contact spread between animals, insect transmission to animals) and spatial processes (dispersal of hosts and vectors) were used to identify six environmental drivers a priori (pedology, hydrology, topography, climate, drought, land surface properties) that typically influence disease ecology systems based on the literature (Table 1). Within each driver, we used expert knowledge in the VS system to select one or more variables expected to influence patterns or responses in one or more disease processes (Table 1), ultimately leading to patterns in VS occurrence. This approach allows comprehensive examination of a large number of potential variables, but then restricts analysis to those variables that are biologically meaningful (Tables 1, 2). We then identified a data source for each variable to cover the spatial and temporal extent of our study location, the contiguous watersheds in the Western United States where VS occurred between 2004 and 2015 (www.a phis.usda.gov) (Fig. 1b).
After variables and corresponding datasets were identified (Table 1), harmonization was used to facilitate synthesis and integration (Fig. 2). The VS disease occurrence data were converted from a geographic coordinate system to an equal area projection system (e.g., Albers Equal Area Conic) to ensure cell sizes remained the same (1 km 2 ) throughout the large spatial extent (~1.1 M km 2 ) of the study area. All other variables were harmonized to the projection, geographic origin, and cell size of this VS occurrence base map. The type of the native structure of the source data (raster, vector, polygon) determined the harmonization procedure and format of the analysis. For raster data (e.g., gridded PPT at a 4 km 9 4 km resolution), harmonization consisted of resampling maps to 1 km 9 1 km. Vector data (e.g., points, lines) were converted to raster, harmonized to the base layer, and then translated into distance maps. Polygons were rasterized by calculating average properties and then harmonized to the base layer. All spatial data were manipulated with ArcGIS v.10.3 to assist in the harmonization procedure. The variable selection procedure resulted in 472 raster layers, which were collated into a harmonized data cube to enable calculations and predictions to be carried out in geographical space.
For landscape-to-regional-scale analysis, the R package MaxentVariableSelection (Jueterbock 2015) was used to control model complexity, avoid collinearity among predictor variables, and optimize parameters for analysis. Variables were removed from the analysis whose contribution to the model was <5% and whose correlation with another variable was >0.7. The resulting model typically had <20 uncorrelated variables, and <10 variables with contribution >5%. For the localscale analysis, a tabular dataset was generated from the raster maps such that fine-scale relative temporal relationships (weeks or months prior to a VS occurrence) were preserved between occurrence and environmental variables, and arbitrary classifications (e.g., month, season) at broad spatial scales were masked. Cell size (4 km 2 ) was selected to characterize the environment surrounding a VS occurrence. All spatial data exported to tabular data were extracted from raster maps (118 variables) by calculating the mean of values extracted from a 100-point grid centered on a VS occurrence location. The number of candidate variables for multivariate analysis was reduced to 20 using an iterative, human learning procedure to identify those variables with the strongest univariate relationship to VS occurrence to avoid collinearity among predictors >70%. All tabular data processing and analyses Fig. 2. Workflow showing the major steps in our approach to predictive disease ecology that includes an integration of big data with models, expert knowledge, and machine learning. Expert knowledge from disease experts and ecologists are shown in blue boxes, ecoinformatics expertise is shown in pink boxes, and humanguided machine learning is shown in yellow boxes.
were conducted in SAS v.9.4: SAS Institute, Cary, NC USA. The USDA-APHIS policy of mandatory reporting of VS occurrence by doctors of veterinary medicine resulted in accurate identification of VSNJV-infected animal, onset date, and premise location. Occurrence data represent the clinical onset of VS lesions. Since only occurrences were reported, premises where VSV did not occur (i.e., absence data) were unavailable for analysis. We focused on two VS events that represented 96% Virus phylogenetic data.-Viral genetic variability evaluated with a phylogenetic analysis using partial genomic sequences (P gene hypervariable region) have been previously described for the 2004-2006outbreaks (Rainwater-Lovett et al. 2007, Velazquez-Salinas et al. 2014. Phylogenetic analyses were based on near fulllength genomic sequences of representative viral strains from the two outbreaks. Alignments were conducted using the ClustalW algorithm implemented in MEGA v7.0.18 (Kumar et al. 2016).

Identification of variables and data sources
Models of nucleotide substitution were evaluated using the Model Testing tool in CLC Genomics Workbench where various statistical analyses were assessed (hLRT, BIC, AIC, and AICc) and the GTR + G model was implemented in a maximum likelihood (ML) phylogenetic reconstruction (www.qiagenbioinformatics.com).
Livestock density.-To capture host population densities, two livestock variables were acquired: number of horses and number of horse premises from tabular county census data in available years (2002,2007,2012) (Table 1). Broad-scale and long-term patterns were evaluated using the mean of yearly census data while fine-scale patterns were evaluated by linearly interpolating annual values. Lastly, county livestock density (number of horses or premises with horses per km 2 ; Table 2) was calculated as the average or yearly number of animals or premises divided by county area. Only the horse data for VS had sufficient numbers for analysis in all four years; data for other animals were too sparse for multiyear analyses. These averages were rasterized and included in the raster dataset and yearly values were incorporated into the tabular dataset.
Since proximity to VS occurrences and horse density are likely to facilitate transmission, we calculated the number of neighbors with VSV using the following procedure for the local analysis. For each VS occurrence, the number of horses Table 2. Derived temporal variables for landscape-and regional-scale (noted by a superscript "1") and local-scale (noted by a superscript "2") analyses. † Variables with temporal resolution (from Table 1) Weekly 2 (n = 4) Monthly 1,2 mean (n = 12) Seasonal 1 mean (n = 4) DSeasonal 1 mean (n = 4) Annual 1,2 mean (n = 1 or 2) X ¶ Properties with horses (density) § X ¶ Notes: "Monthly" refers to January to December. "Seasonal" refers to winter, spring, summer, and fall. † Each of 4 yr analyzed separately (2004, 2005, 2014, and 2015). ‡ Environmental variables. § Biotic factors. ¶ Linear extrapolation used for non-sample years; temporal analysis included prior and current year.
with VSV in the neighborhood was counted during the occurrence year, defined as 1 March through February, and during the prior year. A neighborhood was defined by centering a 36 km 9 36 km grid (9 9 9 cells) on an occurrence location.
Pedology.-Two variables were used to capture soil properties that influence small pools of standing water and favorable conditions for biting midges. Available water content (AWC) and percentage clay content in the top 20 cm were extracted from STATSGO (Table 1) and summarized by calculating weighted means by area from all soil components in each map unit. The polygon data were then rasterized and incorporated into the raster and tabular datasets.
Hydrology.-Three variables were used to capture hydrological conditions expected to lead to favorable environments for black flies. Daily stream flow data (streamflow: Table 1) were used to estimate the presence of water in ephemeral watercourses and to quantify the rate of flow (in cubic feet per second [cfs]). Distance to nonzero monthly, annual, and mean annual flow data, measured as Euclidean distance, were incorporated into the raster and tabular dataset. Euclidean distance to major rivers and lakes (USGS 2017) was included in the raster dataset only. Lastly, mean monthly stream flow calculated from the closest gauges (mean = 18.7 km) for the 11 months prior to a VS incident was incorporated into the tabular dataset.
Surface runoff and soil moisture were included (Table 1) to identify favorable conditions for black fly reproduction. Daily runoff (in kg/m 2 ) and soil moisture (in kg/m 2 ) values averaged for each month, season, and year were included in the raster dataset. Monthly averages for the prior 11 months were calculated and incorporated into the tabular dataset.
Topography.-Elevation (in m) was used to capture ecosystem variability resulting from topoclimatic processes. Elevation derived from digital elevation models (Table 1) was incorporated into the raster and tabular datasets.
Climatology.-Daily and monthly total precipitation (mm) and average maximum and minimum temperature (°C) data were used to calculate seasonal, annual, and 30-yr means to explore macro-scale climatic relationships with disease occurrence. In addition, weekly averages, calculated from daily data, and monthly averages relative to VS incident date (i.e., prior weeks or months) were extracted for each VS occurrence and incorporated into the tabular dataset. Severe departure from normal climatic conditions resulting in drought was captured by the evaporative demand drought index (EDDI :  Table 1). Seasonal, annual, and long-term values were calculated with the longest available dataset following the same procedure used for the PRISM data and incorporated into the raster and tabular datasets.
Vegetation.-Variation in vegetation growth may be important to both host and vectors. To capture variability in green vegetation biomass, we included normalized difference vegetation index from MODIS (NDVI: Table 1).

Normalized environmental variables
In addition to raw data values, normalized values were calculated to limit sensitivity to differences in magnitude. Deviations from seasonal long-term means were calculated using 30-yr means for climate from the PRISM database for temperature and precipitation. For other variables, such as livestock density, streamflow, EDDI, NDVI, soil moisture, and runoff, the longterm averages were calculated over the duration of the time period of VSNJV incidence data (2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015).

Hypothesis testing
Objective 1: Landscape-to-regional-scale analysis.-We used Maxent (Phillips et al. 2006, Phillips andDud ık 2008) to model annual distributions of VS and to calculate relative occurrence rates (RORs) across the study area. Maxent has been used extensively to create and evaluate species distributions in covariate space, often represented geographically, across a broad range of biological applications using presence-only data (Elith et al. 2011). This machine-learning approach, based on the principle of maximum entropy, enables fitting of complex non-linear relationships and performs similarly or better than traditional general linear modeling approaches, particularly when only occurrence data are available (Elith et al. 2006).
For each Maxent analysis, 10 replicates were conducted using a 90% random subsample of the occurrence data for model training which were ❖ www.esajournals.org then averaged together to create the final model. Because Maxent settings were optimized using the R package MaxentVariableSelection (Jueterbock et al. 2016), we were able to compare multiple models of annual VS occurrence using AIC. Model performance was assessed using the corrected Akaike information criterion (AICc) which provides a relative measure of model quality considering fit and complexity (Akaike 1998). Model selection criteria, such as AIC, are designed to minimize overfitting by penalizing excess complexity (Boxand Draper 1987, Burnham and Anderson 2004) which can result in poor model transferability (Chatfield 1995, Sarle 1995. Underfitting can also reduce transferability, especially when indirect predictor variables are incorporated in a model (Wenger and Olden 2012). Multi-model inference was applied in the development of annual models and only the best performing model was projected into the validation years. To ensure that models were sufficiently parsimonious but not underfit, we assessed the best performing model's transferability by non-randomly partitioning our data into temporally distinct subsets (years), which were then used for training (2004 and 2005) and validation (2014 and 2015). Our results provide insight into how the model will perform (transferability) under subsequent environmental scenarios (Vaughan and Ormerod 2005). We followed previous recommendations by interpreting the Maxent output as relative occurrence rate (ROR) (Merow et al. 2013). We evaluated variable importance using jackknife plots, variable response curves, and frequency distribution plots to test hypotheses about VSNJV occurrence.
To evaluate support for our virus or incursionexpansion hypotheses, variability in annual VSNJV occurrence patterns was explored by developing separate models for VS occurrences in 2004 (incursion) and 2005 (expansion). The 2004 and 2005 models were then cross-evaluated by projecting each onto 2014 and 2015 environmental conditions. The ability of each model to predict subsequent events and event specificity (e.g., whether or not an incursion model [2004] can predict a future phylogenetically unrelated incursion event [2014] or a phylogenetically related expansion event [2005]) was then evaluated using the mean predicted ROR values at occurrence and randomly selected background locations and by visually inspecting the geographic representation of ROR. Degree of niche similarity between models was determined using Schoener's D and Warren's I.
Objective 2: Local-scale analysis.-A tabular dataset was developed from the raster maps (118 variables) such that fine-scale relative temporal relationships (weeks or months prior to a VS occurrence) were preserved between occurrence and environmental variables. Cell size (4 km 2 ) was selected to characterize the environment surrounding a VS occurrence and included only the weekly and monthly data associated with the 1393 unique combinations of sampling locations and "occurrence" month-week. To capture the degree of deviation from normal conditions, we analyzed the standardized values: where V represents a variable at time interval t and x represents each value of the dataset. Standardization enabled comparison among years at the same location, reduced problems associated with input scale, reduced issues with multicollinearity, and allowed interpretation in the normal manner since regression coefficients are identical. All spatial data were exported to tabular data by calculating the mean of values from a 100-point grid centered on a VS occurrence location.
This process was performed for two purposes. First, to compare individual contributions of explanatory variables in order to explain temporal variation of VSNJV occurrence. We conducted logistic regression of VSNJV occurrence (1 or 0) against one explanatory variable at a time to select 20 explanatory variables with highest maximum rescaled R 2 . Correlated explanatory variables (Pearson's r ≥ 0.7) were prevented from cooccurring in the multivariate model. Second, daily occurrence data were converted to Julian month, and least-squares regression was used to evaluate the relationship between county occurrence and latitude, elevation, and long-term precipitation. Predictor variables were selected based on the highest R 2 . The resulting model was used to generate a map of estimated onset dates for the entire study area at 1-km 2 resolution using the rasterized environmental data. All local-scale statistical analyses were conducted in SAS v.9.4.

RESULTS AND DISCUSSION
Landscape-to-regional-scale analysis Spatial distribution models based on humanguided machine learning and constructed using the geo-referenced, harmonized maps showed that of the 472 possible variables (Table 1), only four environmental drivers and host density were needed to explain patterns in VS occurrence in 2004 and 2005 (Table 3). Remarkably, these drivers represent all but one of the possible classes of environmental drivers included in the analysis; only large-scale drought is missing when both years are combined (Fig. 3, Table 3). Hydrology, vegetation, and climate were important in both years, and elevation was important in 2004 while soil properties were important in 2005. These few drivers and the variables contained within them provide both technological and biological complexity that yield new insight into factors governing spatial patterns in VS occurrence.
First, there is high diversity in data sources and types in these variables that required a big data approach of creating derived data products using decisions about spatial or temporal aggregation and harmonization of online national datasets prior to analysis. Detailed climate data in time and space, digital elevation models, digital soil maps and accompanying properties, remotely sensed imagery for vegetation, and distance calculations from each premise to the nearest stream were handled differently before analysis and often required domain technical expertise for interpretation. Although most previous studies typically included one or a few of these variables, and most studies focused on climate or elevation (e.g., Rodr ıguez et al. 1996, McCluskey et al. 2003, our findings clearly show the importance of including all of these variables in a trans-disciplinary approach to vector-borne diseases across large spatial extents. Second, these few drivers are associated with different processes, multiple levels of biological organization (host, vectors), and different vectors that reflect different variables (Table 1). High horse density, proximity to streams with water, and high green vegetation during the summer were important to VS occurrence in both the incursion (2004) and expansion years (2005) (Tables 3, 4). These variables may define general habitat characteristics for VS based on horses as known hosts, and black flies, a known vector, that are biologically bound to flowing streams for oviposition, hatching, and larval development (Adler and McCreadie 1997). In the arid and semiarid Western United States, an increase in green vegetation during pre-monsoonal summer months is often found spatially distributed along streams and other water bodies.
Because 2004 and 2005 have similar viral phylogenies (Rainwater-Lovett et al., 2007), we were able to test whether spatial patterns are determined either by viral genetic determinants alone (viral hypothesis) or whether environmental variables are needed to explain different patterns in occurrence in incursion (e.g., 2004) vs. expansion years (e.g., 2005) (incursion-expansion hypothesis). Results show that different environmental variables were needed to explain patterns of VS occurrence in each year (Table 3), in support of the incursion-expansion hypothesis. In 2004, high horse density (0.82 animals/km 2 ), locations at moderately high elevations (mean = 1642 m) with low spring, prior winter, and prior fall precipitation, and close proximity to streams (17 km) were the most important variables to patterns in VS occurrence (Tables 3, 4). Elevation has been implicated as an explanatory variable for landscape-scale variation in VS occurrence in Mexico and may be a surrogate for a combination of vector-environment interactions (Rodr ıguez et al. 1996) or may contribute to stream flow dynamics that affect the black fly life cycle. Hydrological patterns are also related to patterns in VS occurrence in the Western United States (Elias et al. 2019). The combination of these factors and the limited spatial distribution of VS occurrences in 2004 are consistent with the hypothesis that black flies might be the principal vector dispersing VS northward along streams or flowing water canals in incursion years.
In 2005, summer conditions of low precipitation, cooler than average temperatures, and high green vegetation along with higher than average rainfall in the fall, close proximity to streams, and soils with moderately high available water holding capacity (AWC) were the most important variables associated with VS cases (Tables 3,  4) These conditions would promote an additional insect vector that requires small patches of shallow water or waste-enhanced mud along the edges of bodies of water for successful breeding habitat (e.g., biting midges; Culicoides spp.) (Mullens and Rodriguez 1988). Numbers of competent female midges peak in late summer and early fall (Mullens and Rodriguez 1988, Mullens (4) [dev], winter 6 (8) 1 7 Notes: The percentage contribution (with rank of importance in parentheses) and permutation importance of each variable from MaxEnt analysis in explaining spatial variation in each year are shown. AWC, available water holding capacity of the soil; dev, deviation from long-term mean.
† Beta multiplier = 3.0 optimized from the R package and run in MaxEnt; (rank). ‡ Beta multiplier = 2.5 optimized from the R package and run in MaxEnt; (rank). § The permutation contribution is the decrease upon removal of a variable from the model. A large decrease indicates that the model depends heavily on that variable. Values are normalized to sum to 100. 1989, Pfannenstiel and Ruder 2015), and have been associated with variations in local climate (Stallknecht et al. 2015). Thus, we hypothesize that these differences in important variables between years is a shift from an insect vector (black flies) in incursion years where moving water is needed for successful breeding and transport of eggs and larvae, and could explain patterns in disease concentrated along rivers, to multiple insect vectors (e.g., black flies and midges) in expansion years that would allow widespread expansion of disease throughout the geographic area.
We further tested our hypotheses by comparing distribution models constructed from 2004 (or 2005)  Our results show that the two incursion years (2004,2014) were most similar, and the two expansion years were most similar in environmental conditions (Fig. 4f). The model created using 2004 environmental data (i.e., incursion model; Fig. 4a) had the highest ROR (0.57; Fig. 4f) and the greatest degree of niche overlap (Schoener's D = 0.46, Warren's I = 0.74) with the 2014 (Fig. 4b) environmental data in predicting VS occurrences in that year more so than with data from the same viral lineage in an expansion year (2005; Fig. 4c) or with a different viral lineage (2015; Fig. 4d). Using the 2005 model (Fig. 4c), the highest ROR (0.33) and greatest niche overlap (Schoener's D = 0.74, Warren's I = 0.82) were found with 2015, the other expansion year, yet with a different viral lineage (Fig. 4d). Both incursion years (2004,2014) had lower winter precipitation, higher summer and fall precipitation, and cooler long-term summer maximum temperatures compared with both expansion years (2005,2015) (Fig. 5). These results indicate that changes in viral genetic lineage were less important to patterns in VS occurrence than factors affecting the host-vectorenvironment interaction in incursion (2004,2014) and expansion (2005,2015) years. Thus, vectorenvironment interactions had a stronger control on patterns in disease occurrence than viral phylogeny.

Local-scale conditions preceding VS incursion and spread
Similar to the large-scale analysis, there was some overlap in on-site environmental conditions that represent incursion and expansion years, but there were also distinct differences (Fig. 6). These results provide further support for two vectors: black flies predominating in the incursion phase when few infected animals were present, and biting midges being the dominant vector in the expansion phase when far more infected animals were present. This relationship with the host has been observed previously as    (2005, 2015; open bars) at VS occurrence locations: horse density, elevation, distance to nearest stream, vegetation greenness in summer, normalized vegetation greenness in summer, spring precipitation, summer precipitation, winter precipitation, fall precipitation, normalized fall precipitation, normalized maximum summer temperature, and (normalized maximum winter temperature. Letters indicate significant differences between years (P < 0.05). horses (and other livestock) within 0.4 km of running water (which presumably placed them at higher exposure to black flies) had a greater risk of contracting VS ). On-site green vegetation during the month of occurrence and higher rainfall four Fig. 6. The sequence of events that precedes the occurrence of VS at scale of an individual premise in incursion (lower) and expansion years (upper panel). Mean VS onset month calculated for any location in the region based on latitude, elevation, and long-term precipitation (R 2 = 0.27, P < 0.001, n = 1491). Given this date, a livestock or equine owner can monitor their local conditions to determine the likelihood that VS will occur in each month of that year. Univariate plots illustrate the relationship between predicted probability of occurrence (P) and standardized environmental conditions. months prior combined with either cool daytime (expansion) or nighttime (incursion) temperatures one month prior were common indicators of VS occurrence in both years (Fig. 6). The remainder of the variables were related to the putative vector: either black flies in incursion years (stream flow [2 months prior], surface runoff [3 months prior]) or midges in expansion years (soil moisture [3 months prior]) (Fig. 6). Our results are consistent with studies on black flies showing that numbers of viable eggs and larvae that survive the winter are dependent on dynamics of early season stream flow, water and air temperature, atmospheric humidity, and other factors (Mullens and Rodriguez 1988).

Developing early warning strategies for VS and other vector-borne diseases
Our multi-scale findings showed a temporal sequence of early warning indicators supported by a spatial analysis of important vectors that veterinarians and livestock owners can use as early warning strategies for VS occurrence throughout the Western United States. These indicators do not depend on knowledge of environmental variables that vary geographically, but rather depend on conditions relative to average conditions on a premise. Recently, an EWS was proposed using computer-based searches that relies on the public response triggered by the occurrence of the initial outbreaks (Wang et al. 2018). Our EWS consist of a statistical relationship to first estimate onset month of VS for a premise (Fig. 6), and then a set of indicators is selected based on disease phase (incursion, expansion) and vector identity (black fly, biting midges) (Fig. 6). Predictions that do not consider disease phase or vector identity are likely to be too general for use at finer spatial and temporal scales. Proactive measures to reduce exposure to vectors can then be implemented in advance, such as reducing on-site vegetation in years with cool summer temperatures, relocating susceptible animals away from streams and housing them in structures guarded from biting insects  or implementing aggressive vector habitat mitigation strategies in locations with a high probability for VS expansion.
Prior to this analysis, specific knowledge about vector-environment interactions in incursion vs. expansion phases, and the identity of these early warning indicators were not available for VS. The hypotheses generated by this analysis are being used to guide field studies to sample vectors for VSV across environmental gradients in the Western United States. Collection of new data about vector-environment relationships, in particular in different years of a VS event, will improve future model predictions and ultimately aid in the refinement of the conceptual model under development for the VS system (Peters et al. 2018).
This big data approach coupled with human and machine learning can be applied to other emerging diseases for improvement in understanding, prediction, and management of vectorborne diseases (National Academies of Sciences Engineering and Medicine 2016). Translation of this knowledge can be made to improving animal and human welfare, and aiding food security to assist in development of early warning strategies that are currently based primarily on climate (Muñoz et al. 2016) or environmental predictions based on only a few variables (Lo Iacono et al. 2018).