Big Data for weed control and crop protection

Farmers have access to many data-intensive technologies to help them monitor and control weeds and pests. Data collection, data modelling and analysis, and data sharing have become core challenges in weed control and crop protection. We review the challenges and opportunities of Big Data in agriculture: the nature of data collected, Big Data analytics and tools to present the analyses that allow improved crop management decisions for weed control and crop protection. Big Data storage and querying incurs significant challenges, due to the need to distribute data across several machines, as well as due to constantly growing and evolving data from different sources. Semantic technologies are helpful when data from several sources are combined, which involves the challenge of detecting interactions of potential agronomic importance and establishing relationships between data items in terms of meanings and units. Data ownership is analysed using the ethical matrix method to identify the concerns of farmers, agribusiness owners, consumers and the environment. Big Data analytics models are outlined, together with numerical algorithms for training them. Advances and tools to present processed Big Data in the form of actionable information to farmers are reviewed, and a success story from the Netherlands is highlighted. Finally, it is argued that the potential utility of Big Data for weed control is large, especially for invasive, parasitic and herbicide-resistant weeds. This potential can only be realised when agricultural scientists collaborate with data scientists and when organisational, ethical and legal arrangements of data sharing are established.


Introduction
Food production must increase by 70% in order to feed a world population that is expected to reach 9.6 billion by 2050 (Foley, 2011; Foley et al., 2011). This challenge is even greater when we take into account the scarcity of new arable land, the effects of climate change on agricultural production and the societal demand for decreasing the environmental impact of agriculture (Foley et al., 2011). Weed management will be of crucial importance, given that crop yield losses caused by weeds (about 32%) are higher than those caused by either pests (18%) or pathogens (15%) (Oerke & Dehne, 2004).
Farmers have access to three categories of data-intensive technologies to help address the above-mentioned challenges: (i) Farm Management Information Systems (FMIS), which refer to a planned system for collecting, processing, storing and disseminating data in the form needed to carry out a farm's operations and functions (Fountas et al., 2015); (ii) precision agriculture, which is the scientific domain that deals with management of spatial and temporal variability to improve economic returns and reduce environmental impact (Blackmore et al., 2003); and (iii) agricultural automation and robotics, which is the process of applying robotics, automatic control and artificial intelligence techniques at all levels of agricultural production (Zhang & Pierce, 2013).
However, it is not just farmers who will use Big Data solutions for weed control. In several European countries, the number of invasive plants (IAS, invasive alien species) has significantly increased during the last decades (De Almeida & Freitas, 2012; Pyšek et al., 2012). Big Data solutions have been developed to prevent further spread of IAS (Pěknicová & Berchová-Bímová, 2016). For example, areas vulnerable to invasive weeds were identified using species distribution data and data on local environmental conditions in conjunction with species distribution models, GIS software and statistical tools (Guisan & Zimmermann, 2000; Thuiller et al., 2009). These vulnerable areas can then be subjected to monitoring. Early et al. (2016) provided the first global, spatial forecast of weed invasions in the 21st century by analysing spatial data for the factors that determine introduction and establishment of IAS. Big Data analysis has also been used to predict the spread of IAS with particular preferences for soil, water and temperature, such as Oxalis pes-caprae L. (Bermuda buttercup), Solanum elaeagnifolium Cav. (silverleaf nightshade) and Taraxacum spp. (Travlos et al., 2008; Luo & Cardina, 2012; Travlos, 2013b). Finally, Big Data analysis will result in a better understanding of the biology and ecology of several parasitic weeds like Orobanche spp. and Phelipanche spp., which in turn will enable better management (Song et al., 2005; Prider et al., 2012).
Plant invasions on global and regional scales pose severe ecological, agricultural and health concerns resulting in considerable economic losses. Ambrosia artemisiifolia L. (common ragweed) is an important agricultural weed, especially in spring-sown crops, such as sunflower, maize, sugarbeet and soyabean. A main problem with this plant is its enormous production of highly allergenic pollen grains, generating huge medical costs and reduced quality of life among the allergic population (Fumanal et al., 2007). The highly allergenic pollen causes sensitisation of up to 60% of the allergic population, with annual medical costs of these allergies amounting to, for example, €110 million in Hungary and €88 million in Austria (Gerber et al., 2011). The European Aeroallergen Network (EAN) pollen database (https://ean.polleninfo.eu/Ean/) holds information from more than 600 pollen-monitoring stations from all over Europe. EAN data have been used to identify large local permanent or expanding populations of ragweed (Šikoparija et al., 2009; Thibaudon et al., 2010). Combined with other data sources, this can lead to early detection and eradication in new areas and the development of a sustainable management strategy of A. artemisiifolia in several invaded or potentially susceptible habitats.
Data-driven innovations have already revolutionised several sectors of the economy. The promise that a similar revolution in agriculture may provide benefits is contributing to a growing interest in the application of Information and Communication Technology (ICT) in agriculture. Data collection, data modelling and analysis, and data sharing have become core challenges, an opportunity for innovation and a growth area for commercial development. Vast amounts of data are collected with proximal, airborne or satellite-based sensors, in situ sensors (e.g. soil moisture sensors), on-farm weather stations and instrumented farm equipment. This qualifies as Big Data according to the definition of De Mauro et al. (2016), namely information assets that are characterised by high volume, high velocity and high variety and that require specific technology and analytical methods for their transformation into value.
In addition, there is a need to share data across the supply chain, both to increase the efficiency of the supply chain and to respond appropriately to agricultural standards, such as integrated crop and weed management. Consumer pressure for more information about agronomic practices creates technical and business model opportunities, if the right architectures, analytical tools and data presentations can be developed. The growth of open data and linked data provides opportunities to integrate data from multiple sources and thus to provide new insights and new services. The combination and proper analysis of Big Data from previous records in a wide area, together with specific measurements and data from field history, can result in the quick evaluation and management of herbicide-resistant weeds. This can be further accompanied by decision-support systems, to find the ideal tailor-made solutions for each case.
The tools provided by precision agriculture and other information technologies have not yet moved into mainstream agricultural management. In general, adoption of technological innovations depends on characteristics of the innovation (e.g. cost, complexity), on the innovator and his or her socio-economic background (e.g. preferences and educational level of the farmer), and on the perceived usefulness and ease of use (Rogers, 1995). This has been confirmed for agricultural innovations (Pedersen et al., 2004; Kutter et al., 2011; Lawson et al., 2011; Fountas et al., 2015). In agriculture in general, the adoption of innovations is also highly dependent on the knowledge support system in place (Straub, 2009).
The aim of this study was to provide an overview of technologies relevant to the application of Big Data for weed control and crop protection, to highlight noteworthy examples and to indicate the work that is still needed to increase the exploitation of Big Data. The remainder of this study is structured as follows. In the following three sections, we describe the building blocks for Big Data in weed control and crop protection, namely data (Big Data capture, storage and sharing), data analytics (Big Data analytics) and thirdly delivering information to farmers (delivery of actionable information to farmers). We then discuss existing decision-support systems for weed control and crop protection and describe opportunities for further development (current applications of Big Data for weed control and crop protection). Conclusions and recommendations are given in the final section.

Big Data capture, storage and sharing
Precision agriculture is an information-intensive, cyclic activity, which can be divided into data collection, data analysis, decision-making and evaluation of decisions (Fig. 1) (Fountas et al., 2006). It is useful to characterise decisions based on the planning horizon and to distinguish strategic, tactical and operational decisions. An example of a strategic decision is whether or not to use precision agriculture; an example of a tactical decision is which crops to include in the rotation; finally, operational decisions have to be made on a day-to-day basis regarding the timing of field operations and the amounts of inputs used.
Where does the data come from?
The data in precision agriculture originate from many sources. Crop and soil management data describe the operations that are carried out in the field: tilling, planting, fertilisation, crop protection, weed management and harvest, along with the details such as date, kind of seed or fertiliser or chemicals used, as well as the amounts and the manner in which they are applied. The volume of this information is very small, just a few hundred bytes ha⁻¹ year⁻¹ (Table 1, Fig. 2), and it is often recorded manually by the farmer in a Farm Management Information System (FMIS). Another kind of information concerns samples of soil and plants that are sent to a laboratory for analysis of texture, chemical composition and potential presence of pathogens and weeds. Yields are recorded at the end of the season and will certainly show up on the receipts sent by the cooperative or private buyer to whom the product is shipped.
By far the largest amount of data results from automatic recording with electronic sensors. These include automated weather stations on farms, soil moisture sensors and an increasing number of sensors attached to quads, tractors, harvesters and (semi-)autonomous ground and aerial vehicles (Table 1, Fig. 2).

Data storage
Once collected, data must be physically stored and organised in such a way that they can be queried. From the 1970s onwards, relational databases evolved into the standard way to model and store data, in part because relational databases can model alternative types of databases, such as hierarchical and network databases. The behaviour of relational data can be fully described using set theory (Codd, 1970). There is ample literature using the relational model to store agricultural data, including work started by the Decision Support System for Agrotechnology Transfer (DSSAT) and continued by the International Consortium for Agricultural Systems Applications (ICASA) (Hunt et al., 1994; White et al., 2013) and the Agricultural Model Intercomparison and Improvement Project (AgMIP) (Rosenzweig et al., 2013), but also by others (Van Evert et al., 1999a,b; Steiner et al., 2009).
In Big Data applications, specific requirements with respect to storing and searching tend to make the use of relational databases difficult. For example, the data usually have to be distributed across several machines due to their volume, and they may moreover be constantly growing and evolving. In such situations, it might be challenging to 'partition' a relational database management system (RDBMS) across multiple machines and maintain it as new data continue to pour in (Marz & Warren, 2015). Moreover, searching can be slow in very large relational databases. To cope with these challenges, several alternatives for RDBMS are employed with Big Data systems, including NoSQL ('not only' Structured Query Language) databases like key-value stores (e.g. Riak, http://basho.com/products/riak-kv/), document stores (e.g. MongoDB, http://www.mongodb.com), or distributed storage (e.g. Google's Bigtable; Chang et al., 2008).
Table 1 Volume of data produced by selected data sources. The area of the circles in Figure 2 is related to the volume of data.
Fig. 2 Overview of spatial and temporal characteristics of common measurements. Horizontal axis: frequency of the measurement (year⁻¹); vertical axis: spatial resolution of the measurement (log₁₀(m²)). See Table 1 for explanation of symbols.
Big Data storage and querying may be made more efficient through a lambda-architecture (Marz & Warren, 2015). With a lambda-architecture, arbitrary 'views' (queries) are pre-computed over the stored data; that is, they are made ready before an actual request for them is posed. This ensures that the needed information can be retrieved quickly when a request is placed for a specific view. Clearly, pre-computing these views takes a certain amount of time (say, some hours). Consequently, new data are available only a few hours after arrival in the system. This can be compensated through an additional system component (termed the speed layer), which is responsible for processing new ('incremental') data. While this system part still has to provide high-speed querying on the data, this is required only on the data increment, and not on the entire data set, hence allowing for an efficient system overall.
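The interplay between the pre-computed views and the speed layer can be illustrated with a minimal Python sketch. It is a toy, in-memory model under stated assumptions and not a description of any production system: the class name LambdaStore, the field identifiers and the herbicide volumes are all hypothetical, and the 'view' is simply the total volume applied per field.

```python
from collections import defaultdict


class LambdaStore:
    """Toy lambda-architecture: a pre-computed batch view plus a speed layer.

    All names and values are hypothetical and chosen for illustration only.
    """

    def __init__(self):
        self.master = []                      # append-only master data set
        self.pending = []                     # records not yet covered by the batch view
        self.batch_view = {}                  # view pre-computed over the master data set
        self.speed_view = defaultdict(float)  # incremental view over pending records only

    def ingest(self, field_id, litres):
        """A new record arrives: only the small speed view is updated."""
        self.pending.append((field_id, litres))
        self.speed_view[field_id] += litres

    def run_batch(self):
        """Slow job: recompute the view over all data, then empty the speed layer."""
        self.master.extend(self.pending)
        view = defaultdict(float)
        for field_id, litres in self.master:
            view[field_id] += litres
        self.batch_view = dict(view)
        self.pending.clear()
        self.speed_view.clear()

    def query(self, field_id):
        """Answer a request from the batch view plus the recent increment."""
        return self.batch_view.get(field_id, 0.0) + self.speed_view.get(field_id, 0.0)


store = LambdaStore()
store.ingest("field-A", 12.0)
store.ingest("field-B", 5.0)
store.run_batch()
store.ingest("field-A", 3.0)  # arrives after the batch run, served by the speed layer
total_a = store.query("field-A")
total_b = store.query("field-B")
```

In this sketch a query never scans the master data set; it touches only the pre-computed view and the (small) increment, which is the property that makes the overall system efficient.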

Linked data
Applications of Big Data typically involve the challenging task of establishing relationships between data items of different provenance. For example, the term 'wheat yield' may refer to 'yield-as-harvested' (e.g. 11.6 Mg ha⁻¹, moisture content not known), to 'dry matter yield' (e.g. 10 Mg ha⁻¹), or to 'yield adjusted to market-standard moisture content' (e.g. 11.35 Mg ha⁻¹). The meaning of the term 'yield' is slightly different in each case, and it would be an error to use them interchangeably. Similarly, yield may be expressed using units of t ha⁻¹, but also g m⁻² or dt ha⁻¹ (in common use in Germany). Again, errors will occur if units are not taken into account.
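The arithmetic behind these distinctions is simple, which is precisely why it is easily overlooked. The following Python sketch shows the unit conversions (1 t ha⁻¹ = 10 dt ha⁻¹ = 100 g m⁻²) and the relation between dry matter yield and yield at a given moisture content on a wet basis. The function names are hypothetical, and the 15% moisture content used in the example is an assumption chosen for illustration, not a value taken from the text.

```python
# Conversion factors to t/ha (numerically identical to Mg/ha).
TO_T_PER_HA = {
    "t/ha": 1.0,
    "Mg/ha": 1.0,
    "dt/ha": 0.1,   # 1 dt (decitonne) = 100 kg
    "g/m2": 0.01,   # 1 g/m2 = 10 kg/ha
}


def to_t_per_ha(value, unit):
    """Convert a yield value from the given unit to t/ha."""
    return value * TO_T_PER_HA[unit]


def dry_matter_yield(fresh_yield, moisture_fraction):
    """Dry matter contained in a yield with the given moisture content (wet basis)."""
    return fresh_yield * (1.0 - moisture_fraction)


def yield_at_moisture(dry_matter, moisture_fraction):
    """Yield that a given amount of dry matter represents at the given moisture content."""
    return dry_matter / (1.0 - moisture_fraction)


# The same quantity expressed in three different units.
same_yield = [
    to_t_per_ha(10.0, "t/ha"),
    to_t_per_ha(100.0, "dt/ha"),
    to_t_per_ha(1000.0, "g/m2"),
]

# Assumed moisture content of 15 % (illustrative only).
adjusted = yield_at_moisture(10.0, 0.15)
```

Treating the three entries of same_yield as different values, or treating adjusted as a dry matter yield, would introduce errors of the kind described above.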
When data are stored in table format (e.g. in a database, spreadsheet, or text file), the names of columns typically give an indication of the meaning and the units of the data, but this is rarely conclusive. The manual intervention that is almost always needed to bring data from two or more sources together constitutes a significant barrier to the application of Big Data in agriculture.
A system to address these shortcomings and to make automated matching of data possible was proposed by Berners-Lee et al. (2001). They proposed the name 'semantic web', but the name currently used is Linked Data. An introduction to recent developments is available (Allemang & Hendler, 2011). Linked Data is built on a number of principles. First, every 'thing' is given a name: a uniform resource identifier (URI). Second, this name preferably is a uniform resource locator (URL), which can be typed into a web browser and will then return information about the thing. In the case of the above example, 'dry matter yield' would have a different name (perhaps http://ld.example.org/dry-matter-yield) than 'yield adjusted to market-standard moisture content' (perhaps http://ld.example.org/yield-standard-moisture-content). Third, information about 'things' is given in the form of triples, basically simple sentences of the form <thing1> <thing2> <thing3>, where the meaning of each part of a sentence can be looked up. If we take 'ex:' as shorthand for http://ld.example.org/, a measurement can for example be described by triples whose parts are names such as ex:dry-matter-yield.
Measurements expressed using Linked Data technology can be combined without manual intervention, regardless of where they were collected or where they were stored, as long as they are described using the same concepts (or when a mapping exists between concepts). This highlights the importance of shared vocabularies or ontologies. A number of ontology development efforts are under way. Of particular interest to the domain of weed control and crop protection are the Global Agricultural Concept Scheme (GACS), which combines AGROVOC, the CAB Thesaurus and the NAL Thesaurus into one ontology (http://testeros-kktest.lib.helsinki.fi/gacsdemo/gacs/en/), the Plant Ontology (Jaiswal et al., 2005) and Crop Ontology (Shrestha et al., 2010). Unfortunately, anyone trying to use these ontologies will quickly find that many concepts are not yet included, which limits their immediate usefulness.
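How triples allow data from different sources to be combined can be sketched in a few lines of Python. The sketch represents triples as plain tuples and deliberately avoids any Linked Data library. Only the two concept names under http://ld.example.org/ come from the text; the observation identifiers, the predicates observedProperty, value and unit, the unit name and the value 9.2 are hypothetical.

```python
EX = "http://ld.example.org/"  # the prefix abbreviated as 'ex:' in the text

# Triples from a first (hypothetical) source: one dry matter yield observation.
source_a = [
    (EX + "obs1", EX + "observedProperty", EX + "dry-matter-yield"),
    (EX + "obs1", EX + "value", 10.0),
    (EX + "obs1", EX + "unit", EX + "Mg-per-ha"),
]

# Triples from a second (hypothetical) source, using the same concept names.
source_b = [
    (EX + "obs2", EX + "observedProperty", EX + "yield-standard-moisture-content"),
    (EX + "obs2", EX + "value", 11.35),
    (EX + "obs2", EX + "unit", EX + "Mg-per-ha"),
    (EX + "obs3", EX + "observedProperty", EX + "dry-matter-yield"),
    (EX + "obs3", EX + "value", 9.2),
    (EX + "obs3", EX + "unit", EX + "Mg-per-ha"),
]


def values_of(triples, concept):
    """Return the sorted values of all observations of the given concept."""
    subjects = {s for s, p, o in triples
                if p == EX + "observedProperty" and o == concept}
    return sorted(o for s, p, o in triples
                  if p == EX + "value" and s in subjects)


# Combining the sources is a plain concatenation: no column matching is needed.
merged = source_a + source_b
dry = values_of(merged, EX + "dry-matter-yield")
```

Because both sources name the concept identically, the dry matter yields are retrieved together while the moisture-adjusted yield is kept apart, without any manual intervention.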
The advent of Linked Data has led to the development of databases that are optimised to store triples. Examples are RDF4J (http://rdf4j.org) and Virtuoso (http://virtuoso.openlinksw.com). Tools such as D2RQ (http://www.d2rq.org) offer the capability to access relational databases as if they contained triples.

Ownership and sharing of data
Big Data applications typically involve several data owners. For research data, the issue of ownership, archiving and sharing has received ample attention (King, 2007; White & Van Evert, 2008). The consensus is that the scientific method calls for sharing data liberally, although care should be taken to respect concerns such as privacy of people, the need to protect rare species and habitats by withholding details about location, and the need to publish before sharing (Duke & Porter, 2013).
Sharing research data is, of course, not the same as sharing data from commercial farms. Tellingly, a survey of Danish and US farmers showed that many are even reluctant to use cloud-based storage (Fountas et al., 2015). However, a decrease in public funding for agricultural research in recent years has resulted in fewer scientific experiments in agricultural sciences. This comes at a time when there is an increasing need for long-term experiments (LTEs) to investigate issues such as climate change, where the effects can be expected to become visible over a long time horizon (White & Van Evert, 2008). The scarcity of new experiments also reveals the need for extensive exploitation of already available data to develop efficient integrated weed management that benefits farmers and the environment. Intensively monitored farms may be the LTEs of the future. When on-farm collected data become an important vehicle for scientific progress, some of the arguments that apply to sharing scientific data become applicable to sharing farm data. A discussion about sharing farm data that does not proceed beyond the obligation of scientists to make research data available does not do justice to the topic. A framework to discuss ownership and sharing is needed.
Ethics is the branch of philosophy that examines the rights and duties of people in a systematic way. It seeks to answer questions such as 'what is the right way to act'. Ethics has no ready-made answer for our specific question whether farmers should share production data and with whom. Here, we hypothesise that ethical reasoning can help to structure the argument and can thus contribute to finding a resolution that is acceptable to the parties involved. We will focus on the ethical matrix, which was proposed by Mepham (2005), following work by Beauchamp & Childress (2001).
The ethical matrix has two dimensions. The first (column) dimension consists of the three broad categories in which Mepham (2005) summarises the major ethical theories. These categories are well-being (related to utilitarianism: the greatest benefit to the largest number of people), autonomy (related to deontology: do as you would be done by) and fairness. The second (row) dimension describes the parties that are affected by the issue at hand. In our case, the parties with ethical standing are farmers, owners of agribusinesses, consumers and the living environment (biota).
The ethical matrix is used to record concerns that exist about a new situation that is envisaged. In our case, that situation is 'data collected on commercial farms is shared'. Each concern about this situation is entered in the cell of the ethical matrix at the intersection of the party affected and the category of the concern. A possible listing of concerns about sharing farm data is shown in Table 2.

Big Data analytics
Once the relevant data have been properly prepared and stored, knowledge valuable to users is extracted through data analytics. Conventionally, agricultural applications use standard statistical methods, such as regression, analysis of variance (ANOVA) and principal component analysis (PCA). Big Data applications require new methods. First, standard statistics may be inadequate to deal with the large number of variables typically found in Big Data applications, and these variables may be related in a complex, non-linear manner. Second, even the implementation of simple methods is not straightforward when extremely large data sets are involved. For example, devising and implementing a numerically efficient 'Big Data PCA' is a non-trivial task (Balcan et al., 2014). At least two steps must be considered: first, adopting an appropriate machine learning model (e.g. a neural network) and second, training the model using an appropriate algorithm (e.g. a gradient descent method). A third step consists of measures to ensure privacy, which is of high relevance in agriculture.

Machine learning models
The goal of a machine learning task is to learn the relation between input and output, given a set of training data. For example, given training data $(X_i, Y_i)$, $i = 1, \ldots, n$, where a pair $(X_i, Y_i)$ represents measured environmental parameters and the yield for a certain past season $i$ (Brdar et al., 2011), the goal is to learn the function $f$, $Y = f(X)$, which fits best (in a certain sense) the available training data. A possible approach is to find $f$ which minimises the average squared error loss:
$$\frac{1}{n} \sum_{i=1}^{n} \left( Y_i - f(X_i) \right)^2,$$
but many other forms of losses are also possible. For computational tractability, one needs to restrict $f$ to a certain class of functions (e.g. polynomials of order at most $m$), such that the above minimisation is feasible. Generally, the choice of the loss function and the admissible function class determine different machine learning approaches or models.
Three machine learning models are widely used and relevant in agricultural applications (Kastens & Featherstone, 1996; Baral et al., 2011; Brdar et al., 2011; Rahaman et al., 2015; Agrimetrics, 2016) and Big Data (Davies & Frigola, 2014; Hsieh et al., 2014; Najafabadi et al., 2015). These are, first, neural networks (NNs; see also the related concept of deep learning; Najafabadi et al., 2015), second, (non-linear) support vector machines (SVMs) with kernels and, third, graphical models (GMs). Other relevant models include (group)-sparsity and other structured models (Slavakis et al., 2014), models involving spatial data (Vatsavai et al., 2012) and linear and non-linear dimensionality reduction and clustering methods (Kashyap et al., 2015).
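As a concrete instance of minimising the average squared error loss over a restricted function class, the following Python sketch fits a polynomial of order one (a straight line) using the closed-form least-squares solution. The rainfall and yield figures are invented for illustration and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of f(x) = slope * x + intercept (closed-form solution)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(f, xs, ys):
    """Average squared error loss of model f over the training data."""
    return sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Invented training data: seasonal rainfall (mm) and yield (t/ha) for five seasons.
rain = [300.0, 350.0, 400.0, 450.0, 500.0]
yields = [6.1, 6.9, 8.2, 8.8, 10.0]

slope, intercept = fit_line(rain, yields)


def fitted(x):
    """The fitted model: the line that minimises the loss on the training data."""
    return slope * x + intercept


loss = mean_squared_error(fitted, rain, yields)
```

Within the class of straight lines, no other choice of slope and intercept gives a smaller loss on these data; a richer function class (e.g. higher-order polynomials) could reduce the training loss further, at the risk of fitting noise.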
Neural networks (NNs) have proved successful in speech recognition and image and natural language processing (Xie et al., 2014). Their name indicates a resemblance in structure to actual, biological neural networks. Namely, with NNs, function $f$ is modelled as a functional composition of basic computational elements, that is, neurons, where each neuron consists of a linear activation function, parameterised by a weight vector $w$, and a non-linear transfer function $\sigma$ (e.g. a sigmoid function (Bishop, 2006; Hinton et al., 2006)). The neurons are organised in layers (the number of layers is the depth of an NN), each of which has a certain width. Common loss functions are squared error and cross-entropy loss, and a popular numerical algorithm for training NNs is back propagation and its variants (Bishop, 2006; Hinton et al., 2006).
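The structure of a single neuron can be sketched in Python as follows. The sketch covers only the simplest case, a 'network' consisting of one neuron, for which back propagation reduces to ordinary gradient descent on the weights. The training task (the logical AND of two inputs), the learning rate and the number of epochs are assumptions chosen for illustration.

```python
import math


def sigmoid(a):
    """Non-linear transfer function."""
    return 1.0 / (1.0 + math.exp(-a))


def neuron(w, b, x):
    """Linear activation w.x + b followed by the sigmoid transfer function."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train(data, epochs=5000, lr=0.5):
    """Stochastic gradient descent on the cross-entropy loss of a single neuron."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            # For a sigmoid neuron with cross-entropy loss, the gradient of the
            # loss with respect to the linear activation is (output - target).
            err = neuron(w, b, x) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


# Illustrative task: learn the logical AND of two binary inputs.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train(data)
predictions = [round(neuron(w, b, x)) for x, _ in data]
```

A network with several layers composes many such neurons; back propagation then applies the chain rule to pass the error term from the output layer back through the preceding layers.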
Neural networks have been used in many agricultural use cases, including prediction of yield (Baral et al., 2011).
Notes to Table 2. Each row lists concerns that pertain to a stakeholder group; concerns are grouped by the three broad categories of Mepham (2005). For each concern identified, an attempt is made to determine how it will be affected by the envisaged new situation. For farmers, income is a direct measure of farmers' well-being. Sharing data with scientists will lead to new scientific insights that will in turn allow farmers to improve profitability and sustainability of their business. On the other hand, sharing data with businesses may increase the economic power of those businesses and compromise the ability of farmers to sell at attractive prices. A farmer may risk liability suits, for example when records show that equipment malfunctioned and (unintended) contamination of the environment occurred. The autonomy of a farmer could be compromised when he or she loses control over the flow of data. Also, the farmer's sense of identity may be compromised when critical farming decisions are made by consultants or decision-support software. On the other hand, new insights resulting from sharing data with scientists may provide the farmer with more options to manage the farm and to make better decisions. Fairness requires that a farmer be compensated for the value that others derive from the data he or she shares. Farmers are concerned that governments may use farm data to argue for stricter controls on, for example, emissions of nutrients and chemicals. Agribusinesses need access to farm data to generate income. For developing innovative data-based services (autonomy), agribusiness is dependent on the ability to access farm data. Businesses require an equitable regulatory framework (fairness) with respect to acquiring, storing, transferring and using farm data. Consumers will benefit from an increase in safety and quality of food made possible by new insights when farm data are shared with scientists. Tracking and tracing the origin of products and the agricultural practices used to produce them with special focus on the pesticides used (and especially residual herbicides) will allow consumers to make informed purchasing decisions (autonomy). New insights derived from sharing farm data with scientists will lead to an increase in the availability of food and increase fairness for consumers. Finally, wildlife and the living environment will benefit (fairness) when sharing data leads to new scientific knowledge and less pollution (well-being), preservation of biodiversity and native species, and when threatened species and rare breeds are preserved (autonomy).
Support vector machines (SVMs) with kernels. Given a training data set, SVMs seek a function $f$ which makes at each data point an error of at most $\varepsilon$, where $\varepsilon$ is a predefined small positive number, as explained by Smola and Schölkopf (2004). SVMs were initially proposed for linear models. A non-linear version can be made by first transforming input $X$ into a (higher dimensional) feature space through a non-linear mapping $\Phi(X)$ and then applying standard linear SVMs over features $\Phi(X)$. This can be done without ever explicitly calculating features $\Phi(X)$; thus, there is no need to work directly in the (usually very high dimensional) feature space. Namely, function $f$ can be expressed as a linear combination of inner products with the data points $X_i$. That is, it is only necessary to define (and subsequently compute) a kernel function $K(x_1, x_2)$, which defines the inner products $\langle \Phi(x_1), \Phi(x_2) \rangle$ in some feature space. Many easily computable functions (e.g. polynomial kernels, exponential kernels) turn out to be valid kernel functions (valid inner products in some feature space), which makes SVMs very efficient in practice. Support vector machines have been widely used in agricultural applications. Examples include prediction of agricultural yields based on relevant environmental parameters (Brdar et al., 2011) (tactical decisions) and detection and classification of plant diseases (Rumpf et al., 2010) (operational decisions).
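That a kernel function computes an inner product in a feature space without constructing that space can be checked numerically. The Python sketch below uses the homogeneous polynomial kernel of degree two on two-dimensional inputs, for which the explicit feature map is known in closed form; the input vectors are arbitrary and the function names are hypothetical.

```python
import math


def dot(u, v):
    """Standard inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map of the degree-2 polynomial kernel for 2-D input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly_kernel(x, z):
    """Kernel K(x, z) = (x . z)^2, computed in the original input space."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))  # inner product computed in the feature space
implicit = poly_kernel(x, z)    # same quantity, without ever forming phi
```

Both routes give the same value, but the kernel needs only one inner product in the two-dimensional input space. For kernels whose feature space is very high dimensional, this difference is what makes the approach practical.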
Graphical models (GMs). In a standard setting, GMs are not concerned with input-output modelling, but rather they model interdependencies among a set of (input) variables (Jordan, 2004; Wainwright & Jordan, 2008). Variants exist that model input-output relations also, such as conditional random fields (CRFs) (Lafferty et al., 2001; Wytock & Kolter, 2013). Differently from NNs and SVMs, CRFs adopt a probabilistic framework; that is, they model input-output relations through a (conditional) probability distribution $P(Y \mid X)$, rather than an explicit function of the form $Y = f(X)$. However, the underlying principle is similar: given a training data set $(X_i, Y_i)$, $i = 1, \ldots, n$, one again learns the probability distribution $P(Y \mid X)$ from a certain class of distributions (e.g. Gaussian), by minimising an appropriately defined loss function. Once the 'best' distribution $P(Y \mid X)$ is learned (the training has been completed), one can perform inference (e.g. predict a new output $Y_i$ based on a given new input $X_i$) by finding a maximum a posteriori estimate of $Y_i$, that is by finding the $Y$ which maximises $P(Y \mid X_i)$ viewed as a function of $Y$.
With CRFs and with GMs in general, the key object, the (joint) probability distribution of interest, is associated with a graph, whose nodes are the individual variables $X_i$ and $Y_j$. Then, the probability distribution is defined as a product of factors associated with graph cliques (all-to-all connected subsets of nodes). The graph structure allows for natural modelling of complex phenomena in applied fields, as noted in Jordan (2004), for example for modelling spatial or spatio-temporal processes (Vatsavai et al., 2012). Introducing the graph formalism and structure with GMs turns out to be quite useful in constructing efficient numerical algorithms to perform inference, the so-termed belief propagation-type algorithms. Graphical models have significant potential for modelling in Big Data agricultural applications. A variant of GMs, spatial random fields, can be used to model various spatial phenomena, such as the location prediction problem, for example prediction of spatial distribution of a disease within a field (Vatsavai et al., 2012). This may correspond with tactical or operational decisions. GMs are also used for modelling traits' interdependencies with large-scale phenotyping (Rahaman et al., 2015; Agrimetrics, 2016).
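The factorisation over cliques, and the way the graph structure enables efficient inference, can be illustrated on the smallest non-trivial example: a chain of three binary variables A - B - C, whose cliques are the two edges. The Python sketch below computes the marginal distribution of C twice, once by summing the full joint distribution and once by passing messages along the chain (the sum-product form of belief propagation). The factor values are arbitrary; one could, purely as an illustration, think of the variables as disease presence in three adjacent plots.

```python
from itertools import product

STATES = (0, 1)

# Pairwise factors on the cliques (edges) of the chain A - B - C.
# The values are arbitrary non-negative numbers chosen for illustration.
phi_ab = [[4.0, 1.0], [1.0, 3.0]]  # indexed as phi_ab[a][b]
phi_bc = [[2.0, 1.0], [1.0, 5.0]]  # indexed as phi_bc[b][c]


def marginal_c_brute_force():
    """Marginal of C obtained by summing the joint (product of factors) over A and B."""
    scores = [0.0, 0.0]
    for a, b, c in product(STATES, repeat=3):
        scores[c] += phi_ab[a][b] * phi_bc[b][c]
    total = sum(scores)
    return [s / total for s in scores]


def marginal_c_message_passing():
    """Marginal of C obtained by passing messages A -> B -> C along the chain."""
    msg_a_to_b = [sum(phi_ab[a][b] for a in STATES) for b in STATES]
    msg_b_to_c = [sum(msg_a_to_b[b] * phi_bc[b][c] for b in STATES) for c in STATES]
    total = sum(msg_b_to_c)
    return [m / total for m in msg_b_to_c]


brute = marginal_c_brute_force()
passed = marginal_c_message_passing()
```

Both functions return the same distribution. The brute-force sum visits every joint configuration, whose number grows exponentially with the number of variables, whereas the cost of message passing on a chain grows only linearly.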
Missing data and variable spatio-temporal resolutions are two specific challenges that arise with machine learning in agriculture. There are many situations where certain planned data entries are missing (e.g. a sensor malfunctioned, or cloud cover prevented acquisition of satellite imagery). It is thus necessary to have machine learning models that can cope with missing data; see for example Slavakis et al.  (Table 1). Models are needed which can effectively treat such data (Klein et al., 2015).

Numerical algorithms: parallel and distributed optimisation
Once the prepared data are ready for processing, and an appropriate machine learning model with parameter set w has been adopted, a numerical algorithm is used to produce the set of parameters w* which best explains the data. Usually, this task is performed by solving an optimisation problem, namely that of minimising an appropriately defined loss function (e.g. a squared loss) with respect to the model parameters w: minimise f(w; D), parameterised by the available data D = {(X_i, Y_i), i = 1, ..., n}. For example, with classification tasks, the loss function f can be the logistic or hinge loss (Bishop, 2006). When the function f is convex, this optimisation problem can in principle be efficiently solved by standard numerical optimisation methods (e.g. gradient descent or Newton's method). However, in Big Data applications, a major challenge is that the size of the data set D and possibly the dimension of the unknown parameter set w are so large that the problem cannot be solved in a reasonable time with standard numerical optimisation methods on a single standard computer. Therefore, there is a need to develop parallel and distributed optimisation methods that partition the problem of interest into multiple smaller problems, each of which is solved by a separate processor (Jakovetić et al., 2014; Slavakis et al., 2014). There are now parallel and distributed methods that can solve huge problems. For example, a (convex) logistic loss problem with on the order of 70 000 data points of size 20 000 (real numbers) was solved in less than 10 s using 40 parallel processes (Facchinei et al., 2015).
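The partitioning idea can be sketched as follows. The data set D is split into shards, each (simulated) worker computes the gradient of the logistic loss on its own shard, and the partial gradients are summed before a gradient descent step is taken. This is a minimal single-process illustration and not the method of any of the cited papers; the toy data, the number of workers, the step size and the function names are all assumptions.

```python
import math

# Hypothetical data: feature vector [bias, x] and label in {-1, +1}.
data = [
    ([1.0, 0.2], -1), ([1.0, 0.5], -1), ([1.0, 0.9], -1), ([1.0, 1.1], 1),
    ([1.0, 1.4], -1), ([1.0, 1.8], 1), ([1.0, 2.3], 1), ([1.0, 2.9], 1),
]


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def loss(w, samples):
    """Summed logistic loss f(w; D) = sum_i log(1 + exp(-Y_i * w.X_i))."""
    return sum(math.log(1.0 + math.exp(-y * dot(w, x))) for x, y in samples)


def logistic_loss_grad(w, shard):
    """Gradient of the summed logistic loss over one shard of the data."""
    grad = [0.0] * len(w)
    for x, y in shard:
        coeff = -y / (1.0 + math.exp(y * dot(w, x)))  # derivative w.r.t. the margin
        for j, xj in enumerate(x):
            grad[j] += coeff * xj
    return grad


def distributed_gradient_descent(samples, n_workers=4, step=0.1, n_iter=200):
    dim = len(samples[0][0])
    shards = [samples[k::n_workers] for k in range(n_workers)]  # one shard per worker
    w = [0.0] * dim
    for _ in range(n_iter):
        # In a real deployment each call below would run on a separate processor.
        partial = [logistic_loss_grad(w, shard) for shard in shards]
        total = [sum(g[j] for g in partial) for j in range(dim)]
        w = [wj - step * gj / len(samples) for wj, gj in zip(w, total)]
    return w


w_star = distributed_gradient_descent(data)
```

Because the loss is a sum over data points, the sum of the shard gradients equals the gradient on the full data set, so the partitioned computation follows the same iterates as a single-machine run; only the work per processor changes.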
Several challenges arise when designing Big Data algorithms. The first challenge, scalability, refers to how computational time reduces when the number of processors is increased. A naive consideration would imply that the time decreases linearly with the number of processors. However, the delays due to interprocessor communications cause more complicated (and less efficient) scaling (Hong et al., 2015). A second challenge is that in many agricultural applications, analytics should be able to respond in real time to changes in the sensed data (e.g. weed emergence, weather changes, plant stress). This is also true in the case of evaluation of herbicide-resistant weeds, where a rapid decision and response are required (Travlos, 2013a). That is, algorithms should be able to quickly adapt their solutions based on the changes in the incoming streaming data. This can be, in many cases, accomplished through online learning and stochastic optimisation algorithms (e.g. stochastic gradient descent) (Duchi et al., 2011). Essentially, such methods allow for computationally inexpensive solution updates (e.g. a gradient descent step) accounting for only newly acquired data samples, as opposed to revisiting all the data samples at each algorithm iteration. A third challenge, privacy preservation, is discussed in more detail in the next section on Data analytics under privacy constraints.
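As a minimal sketch of the online learning idea, the update below adjusts the parameters using only the most recently acquired sample, here for a squared loss. The simulated stream, the step size and the underlying relation Y = 2X are assumptions made for illustration only.

```python
def sgd_step(w, x, y, step=0.1):
    """One online update for the squared loss 0.5 * (w.x - y)**2 on a single new sample."""
    residual = sum(wj * xj for wj, xj in zip(w, x)) - y
    return [wj - step * residual * xj for wj, xj in zip(w, x)]


def sensor_stream(n_samples=300):
    """Hypothetical data stream in which the output is exactly twice the input."""
    for t in range(n_samples):
        x = [0.5 + 0.5 * (t % 3)]
        yield x, 2.0 * x[0]


w = [0.0]
for x, y in sensor_stream():
    w = sgd_step(w, x, y)  # cost per update does not depend on how much data came before
```

Each update touches one sample and costs a number of operations proportional to the dimension of w, which is what makes this family of methods attractive when decisions must follow the incoming data closely.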
Several commercial and open source Big Data software libraries and platforms exist. Apache Hadoop (Apache, 2016) and its ecosystem include several modules and related projects: (i) Hadoop Distributed File System (HDFS), a distributed file system for high-throughput access to application data; (ii) YARN, a framework for job scheduling and computer cluster resource management; (iii) MapReduce, a system for parallel processing of large data sets; (iv) Mahout, a machine learning tools library; and (v) Spark, a compute engine and programming model which supports a wide class of tasks, including machine learning, stream processing and graph computation. GraphLab (Low et al., 2010) is a framework for developing efficient and provably correct parallel machine learning algorithms, and is very expressive for asynchronous iterative algorithms with sparse computational dependencies. pbdR (Ostrouchov et al., 2012) is a software package for Big Data based on the R programming language. PAIRS is a platform specifically designed for handling geo-spatial data which has been used for agriculture-related Big Data (Klein et al., 2015).
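The MapReduce programming model mentioned above can be illustrated without any cluster software. The sketch below mimics the map, shuffle and reduce phases in plain Python to count weed observations per species; it does not use the Hadoop API, and the records, chunking and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical scouting records: (field identifier, weed species observed).
records = [
    ("field_a", "Rumex obtusifolius"),
    ("field_a", "Chenopodium album"),
    ("field_b", "Rumex obtusifolius"),
    ("field_c", "Chenopodium album"),
    ("field_c", "Rumex obtusifolius"),
]


def map_phase(chunk):
    """Emit one (key, value) pair per record; chunks can be processed independently."""
    return [(species, 1) for _field, species in chunk]


def shuffle(pairs):
    """Group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


chunks = [records[:3], records[3:]]  # stand-in for data blocks on different machines
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
species_counts = reduce_phase(shuffle(mapped))
```

In a real deployment the map calls run where the data blocks are stored and only the grouped intermediate pairs travel over the network.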

Data analytics under privacy constraints
A very important issue with data analytics in agricultural applications where multiple parties (e.g. farmers) are involved is that of data privacy, as farmers may not be willing to disclose or share their private data or practices. On the other hand, exploring hidden knowledge from all parties' data can clearly yield improved solutions with respect to the solutions based on parties' individual data sets.
We consider the following conceptual system model (Fig. 3). There is a group of N parties, for example farmers, each holding its own private data D_i. Parties outsource messages m_i(D_i) related to their private data to an analytics provider. (One can think of m_i(D_i) as a 'disguised' version of D_i.) Subsequently, the provider processes the messages from all parties and sends the obtained result to all parties. One can think of this result as a 'disguised' version of the optimal solution that a (hypothetical) provider would compute if it had all data D_i available, and from which each party can reconstruct this optimum. We assume that there is also an adversary who, based on the observed messages (and possibly the observed result), attempts to recover the parties' private data. The goal is to design the messages and the provider's analytics such that the method is computationally feasible and the parties can reconstruct the optimum from the result, while the adversary cannot (or is at least unlikely to) discover the private data D_i.
Existing works to solve the described problem can be broadly categorised into two classes (Weeraddana et al., 2014): (i) cryptography-based approaches and (ii) non-cryptography-based approaches.
Cryptography-based approaches. With this class, each party creates message m_i(D_i) as an actual, classical encryption (e.g. homomorphic encryption (Xie et al., 2014)) of D_i with a privately known key. Subsequently, analytics is performed over the encrypted data (for example, via secure multiparty computation (Xie et al., 2014)) and the encrypted result is sent back to all parties, which then decrypt the result. This approach allows for 'perfect privacy', in the sense that the adversary cannot reconstruct private data (without having the parties' private keys) in a feasible amount of time. The price of ensuring 'perfect privacy' is the large amount of computer time needed to generate solutions. The reason is that each non-encrypted bit of information corresponds to a long sequence of (perhaps a thousand) encrypted bits, and hence, any arithmetic operation over the encrypted symbols is much costlier than the equivalent operation in the standard, non-encrypted domain. Currently, for many machine learning models, cryptography-based approaches are computationally infeasible. As an illustration, solving a linear program (LP) with 282 unknowns and 180 constraints (a moderate size problem) with the cryptography-based solutions available in 2011 was estimated to take seven years (Dreier & Kerschbaum, 2011). However, for relatively simple models and tasks, cryptography-based approaches might be good solutions. For example, Xie et al. (2014) propose a cryptography-based system for neural networks and conjecture its practical feasibility for inference tasks (e.g. performing classification of a data point for an already trained neural network) and perhaps also learning tasks for simple models (e.g. training a moderate size neural network with a small number of layers).
Non-cryptography-based approaches achieve some degree of privacy through algebraic data transformations. That is, each message m_i(D_i) represents some (deterministic or random) algebraic transformation of D_i. Naturally, such transformations are sought to be 'non-invertible', in the sense that the adversary cannot (or is very unlikely to) recover private data by observing the messages. For example, Dreier and Kerschbaum (2011) and Fung and Mangasarian (2013) address solving LPs, where each party multiplies its own real-valued private data vector D_i by a privately generated random matrix. The main advantage of the non-cryptography-based solutions with respect to the cryptography-based ones is that data analytics is performed directly over the transformed, but non-encrypted data (real numbers, vectors and matrices). Hence, they do not introduce the huge computational overheads of performing algebraic operations over encrypted sequences. However, this in general comes at the cost of a certain information leakage; that is, no 'perfect privacy' is ensured. Currently, for models like LPs, there exist efficient methods to solve moderate-sized problems with very low information leakage (Dreier & Kerschbaum, 2011). As an illustration, an LP with 282 unknowns and 180 constraints can be solved within 25 min (compared with the 7 years' execution time of cryptography-based approaches), while ensuring that the adversary can guess the problem solution with a probability lower than 10^-1408 (Dreier & Kerschbaum, 2011). Further research is needed to devise methods which handle very general models and huge data scales.
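The flavour of such an algebraic transformation can be shown on a toy problem. In the sketch below, a party holds a private linear system A x = b (one can think of the equality constraints of an LP), multiplies both sides by a privately generated random invertible matrix B, and outsources only (BA, Bb); the solution set is unchanged. This is a deliberately simplified illustration and not the protocol of the cited works: the matrix values, the way B is generated and the helper names are assumptions, and the sketch makes no claim about the level of privacy achieved.

```python
import random


def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def random_mask(n, rng):
    """Random matrix made strictly diagonally dominant, hence invertible."""
    return [[rng.uniform(-1, 1) + (n + 1 if i == j else 0) for j in range(n)]
            for i in range(n)]


def matmul(B, A):
    return [[sum(B[i][k] * A[k][j] for k in range(len(A))) for j in range(len(A[0]))]
            for i in range(len(B))]


def matvec(B, v):
    return [sum(bij * vj for bij, vj in zip(row, v)) for row in B]


# Private data of the party (hypothetical values).
A = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]]
b = [3.0, 5.0, 6.0]

rng = random.Random(0)
B = random_mask(3, rng)                           # kept secret by the party
masked_A, masked_b = matmul(B, A), matvec(B, b)   # the message sent to the provider

x_provider = solve(masked_A, masked_b)            # computed without seeing A or b
x_private = solve(A, b)                           # what the party would compute itself
```

The provider works with ordinary real numbers throughout, which is why the overhead is small compared with operating on encrypted data.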
Besides the described 'algebraic transformation' approaches, there are non-cryptographic approaches such as methods based on the notion of ε-differential privacy (Sarwate & Chaudhuri, 2013; Duchi et al., 2014; Nozari et al., 2016). Further, models more elaborate than the one considered in Fig. 3 are certainly relevant and have been studied. For instance, parties can be arranged in a network (e.g. induced by their geographical proximity or by their business relations), where each party itself possesses its own data analytics (computing) resources; that is, not only the data but also the actual analytics algorithm is distributed over the N parties. The parties then collaboratively solve the common learning task through exchanging messages along links in the network. The adversary can observe messages from all (or a subset of) links (Yan et al., 2013; Nozari et al., 2016). In such setups, interestingly, some standard, 'general-purpose' iterative distributed methods, like the alternating direction method of multipliers, exhibit certain privacy-enabling properties (Weeraddana et al., 2014).
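One standard way to obtain ε-differential privacy for a numerical query is the Laplace mechanism, which adds noise scaled to the query's sensitivity divided by ε. The sketch below applies it to the mean of a bounded quantity. It is a generic textbook construction rather than the method of any of the cited papers, and the herbicide rates, bounds and ε value are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, drawn as the difference of two exponential variates."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_mean(values, lower, upper, epsilon, rng):
    """Release the mean of bounded values with epsilon-differential privacy."""
    clipped = [min(max(v, lower), upper) for v in values]
    # Replacing one record changes the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    true_mean = sum(clipped) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical herbicide rates (l/ha) reported by five farms, assumed to lie in [0, 10].
rates = [1.2, 2.5, 0.8, 3.1, 1.9]
rng = random.Random(42)
released = private_mean(rates, lower=0.0, upper=10.0, epsilon=1.0, rng=rng)
```

A smaller ε gives stronger privacy but noisier results, and with only a handful of parties the noise can dominate the signal, so the choice of ε is a trade-off between privacy and utility.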

Delivery of actionable information to farmers
The goal of Big Data analytics in weed control and crop protection is to provide actionable information ... for better management decisions by farmers and their agri-food partners. The information can be used in strategic, tactical and operational decisions, and at different spatial scales on the farm (fields, management zones or grids within field, individual plants, for example Christensen et al. (2009)), or at regional or agrifood chain level. But no matter what type of decision is made or what the scale of application is, digital information must be presented to farmers in a straightforward and comprehensible way. The simple computerised record-keeping solutions of early years have evolved into comprehensive Farm Management Information Systems (FMIS). Sørensen et al. (2010) defined an FMIS as a planned system for collecting, processing, storing and disseminating data in the form needed to carry out a farm's operations and functions. Essential FMIS components include specific farmer-oriented designs, dedicated user interfaces, automated data processing functions, expert knowledge and user preferences, standardised data communication and scalability. It has been stressed that the evolution of FMIS must take into account the social aspects of business processes (Fountas et al., 2015).
There is not always a smooth path to commercial availability, even for systems that have already shown their potential in a research setting. In the Netherlands alone, several commercial initiatives to develop geo-information system (GIS) platforms for use in agriculture have failed during the last 10-20 years. However, a system called 'Akkerweb' (in English: Farm Maps; see www.akkerweb.nl) is currently gaining traction. Akkerweb is the product of a public-private partnership between Agrifirm, the largest farmers' cooperative in the Netherlands, and Wageningen University & Research, the leading agricultural research organisation in the Netherlands. Akkerweb allows geo-data acquisition, management, visualisation and use at the farm level, in combination with a standard FMIS (Kempenaar et al., 2014b, 2016). The roots of Akkerweb can be traced to the development in 2012 of NemaDecide, a decision-support system for control of plant parasitic nematodes (Been et al., 2004, 2007). Akkerweb offers GIS functionality and a number of general free-to-use applications ('apps'), such as a cropping scheme app, a satellite data app and a sensor data app, to visualise and analyse soil and crop data and to generate task (prescription) maps. Akkerweb also contains several subscription-based apps for variable rate application of pesticides and fertilisers. The success of Akkerweb is due to the combination of its ICT infrastructure and its science-based content, the bottom-up development with users in the driver's seat, and the effective cooperation between a farmers' cooperative, a research institute and an IT company with the know-how to build and maintain the required software.
Akkerweb is an open platform, in the sense that third parties can also use the Akkerweb platform to develop and offer fee-based services. Today, data of ca. 30 000 parcel crop years are stored using Akkerweb.

Current applications of Big Data for weed control and crop protection
It is useful to consider strategic, tactical and operational decisions separately and outline the specificities for each of these.

Strategic decisions
NemaDecide is a system to support strategic decisions on the control of plant pathogenic nematodes (Been et al., 2004, 2007). The system is based on a model of the population dynamics of nematodes and takes into account the presence of host plants, specific crop rotations, soil analysis data and efficacy of control methods. GeoNema (Haverkort & Kempenaar, 2016) is the NemaDecide decision-support system in a GIS platform accessible via Akkerweb. Farmers can apply soil analysis data from laboratories in combination with the decision support, to decide on optimal crop rotations and control strategy, for example to make a task map for site-specific control.
In weed and disease control, the use of population model information to make strategic decisions in crop rotation management is less advanced. A DSS on weed control would be especially useful if it can contribute to effective weed control methods that minimise the development of herbicide-resistant weeds. Decision support might also be given in the form of predicting which mix of cultivars of certain crops is most likely to maximise yield and minimise risk (Marko et al., 2016). In this study, the effect of weeds, pests and diseases on yield was not explicitly considered, but this could be included.
The data that need to be collected in order to support strategic decisions on weed control include the occurrence of weeds (kind or species, density), combined with soil data, management, crop yield and, of course, weather. These data can be used to derive risk factors for weeds and to determine how effective weed control measures are. Several model-based decision-support systems for weed management in arable crops are already available, taking advantage of the available pool of data (Berti et al., 2003; Rydahl, 2004; Parsons et al., 2009).
The algorithms that will be useful to support strategic decisions in crop protection include Bayesian parameter estimation methods which can be used to (better) parameterise population dynamics models, possibly using inverse modelling. It may also be possible to use NNs to develop non-mechanistic models of cause and effect, especially in the case of many factors with weak influence, such as soil pH, soil organic matter content, CEC and soil texture.
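As a minimal sketch of Bayesian parameter estimation for a population dynamics model, the example below infers the annual growth rate r of a weed population from a short series of density observations, using a grid over r and a flat prior. The model N_{t+1} = r N_t with log-normal observation noise, the noise level, the grid and the density values are all assumptions made for illustration; real population models are considerably richer.

```python
import math

densities = [10.0, 14.0, 21.0, 30.0, 44.0]  # hypothetical plants per m2 in successive years
sigma = 0.2                                  # assumed noise level on the log scale


def log_likelihood(r, series, sigma):
    """Log-likelihood of growth rate r under N[t+1] = r * N[t] with log-normal noise."""
    total = 0.0
    for previous, current in zip(series, series[1:]):
        residual = math.log(current) - math.log(r * previous)
        total += -0.5 * (residual / sigma) ** 2
    return total


grid = [0.5 + 0.01 * i for i in range(251)]  # candidate growth rates from 0.5 to 3.0
log_post = [log_likelihood(r, densities, sigma) for r in grid]  # flat prior on the grid
peak = max(log_post)
weights = [math.exp(lp - peak) for lp in log_post]  # subtract the peak for stability
norm = sum(weights)
posterior = [wt / norm for wt in weights]

r_map = grid[posterior.index(max(posterior))]          # most probable growth rate
r_mean = sum(r * p for r, p in zip(grid, posterior))   # posterior mean
```

The posterior carries the uncertainty about r as well as a point estimate, which is useful when the estimate feeds a decision on crop rotation or control strategy.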

Tactical decisions
In several countries, decision support for weed control is available in the form of recommendations for herbicide selection, herbicide rate and time of application, for example Crop Protection Online (Murali et al., 1999; see also https://plantevaernonline.dlbr.dk/cp/documents/InfoFactSheet2.pdf) in Denmark and Gewis (http://www.agrovision.nl/sectoren/teelt/producten_voor_de_teler/crop/gewis/) in the Netherlands. These recommendations are based on data on weed species, crop sensitivity and climatic conditions. These kinds of systems can also be used to optimise the application of fungicides and insecticides, whether or not in combination with early warning systems for infection of crops by diseases. Attempts are now being made to put these decision-support systems in GIS platforms. This is the case for variable rate application of soil herbicides. The Akkerweb recommendation for variable rate application of soil herbicides uses data on spatial variation in soil organic matter, CEC and pH, in combination with data on soil moisture, sensitivity of the crop to the herbicide, climate conditions, FMIS data and weed maps, in order to make task maps for variable rate application taking into account the relevant spatial variation. This has resulted in a reduction in herbicide use of 10-20% compared with uniform treatment of the field when applied at a resolution of 10-30 m² grids (Kempenaar et al., 2014a). The power of this method can also be illustrated with an effort in which data on the application of fertiliser N and resulting maize yields are pooled across experiment sites and years. This has led to more specific and better fertiliser recommendations (Sawyer, 2010).
The data that are needed for Big Data supported decisions in weed control include spatial information on weed occurrence, landscape position, soil characteristics and weather. Data on weed occurrence are traditionally collected on an intermittent basis in experiments (Gerhards et al., 1996). Nowadays, this information can be obtained on a large scale by logging the occurrence of weeds as they are detected by weed-detection software fed by cameras on robots or on spray booms. Alternative methods include using a camera mounted on an unmanned aerial vehicle (Perez-Ortiz et al., 2016). Landscape position can be obtained from a Digital Elevation Model (DEM) on the basis of the logged position.
Useful algorithms include probabilistic reasoning to estimate parameters of population dynamics models. Also useful will be neural networks (NN) to link cause and effect where insufficient knowledge about underlying mechanisms is available.

Operational decisions
Autonomous robotic weed control depends on accurate information on the position and determination of weeds in crops. Although much scientific progress has been made in this area (e.g. Bakker et al., 2006, 2010), certainly in the field of algorithms for weed detection (Eddy et al., 2008), commercial use is still limited (Merfield, 2016). The aim of these kinds of efforts is illustrated by a prototype weed control robot that is truly autonomous (Van Evert et al., 2011). In this robot, weed detection is combined with an autonomous platform and a mechanical weed control device that uses the information to destroy the tap root of Rumex obtusifolius L. (broad-leaved dock) in grasslands with high accuracy.
The most useful data are also the hardest to obtain: labelled images. Labelling can be performed by outlining the weed or by simply noting whether a weed is present or not. Typically, labelling is performed by humans and is extremely time-consuming. For operational decisions, the algorithms that are most useful for classification are SVM and NN.
The cases above illustrate how data can be used to obtain actionable information for weed control and crop protection. We expect this will grow in the future when more data layers, models and data analytics become available. The model parts, either statistical models or agronomic models, will become better when farmers share data to better estimate the parameters of the models.

Conclusions and recommendations
In this paper it was argued that a new conceptual model for weed control and crop protection should be developed which consists of three elements: (i) capture and store data, (ii) analyse data and (iii) generate recommendations. We put forward the view that integrated solutions for weed control and crop protection are needed. Such integrated solutions require simultaneous advances in agricultural science, in ICT, in collaboration between supply chain partners (co-innovation), in respecting the interests of all parties involved, and in legal frameworks. In the area of science, new knowledge is needed which will allow us to use historical data to predict the occurrence (time, location, severity) of weeds, pests and diseases. Research is needed on the interaction between real-time data collection on weed occurrence, soil and climatic conditions during the growing seasons. These data should be the basis for building models for the physiology and behaviour of weeds at given climatic conditions, which should be organised in a systematic way and use the appropriate Big Data analytics to deliver the best decisions. Technical advances are needed to allow us to integrate data from various sources. Here, the most likely avenue to success is through semantic technologies and the most pressing need is for appropriate ontologies to be developed. Any integrated solution will require the collaboration of supply chain partners, even if they are many, and even if they are commercial competitors. In the Netherlands and some other countries, farmers' cooperatives play an important role in establishing effective working relationships between supply chain partners. This example may need to be emulated by farmers elsewhere, and indeed by the many enterprises, large and small, that are offering services. We have made the case that safeguarding the interests of all partners will be helpful in establishing successful collaboration.
In case of conflicting interests, ethical reasoning may help to reach understanding between parties. Data sharing protocols may need to be developed that can be used as templates in commonly occurring situations. Agreements between parties must be formalised in legally binding contracts and national and international law must be in place to support this. Creating protocols and reaching agreements ultimately is based on trust; this trust has to be earned by the parties that want to be involved.
In recent years, significant advances have been made in developing general-purpose tools and methods for Big Data capture, storage and analysis, as well as some emerging customised systems and applications in the agri-food domain. An interdisciplinary effort is required to overcome remaining challenges and fully realise Big Data opportunities in agriculture. In the case of weeds, many opportunities may arise, especially for invasive, parasitic or herbicide-resistant weeds. This effort requires the involvement of agricultural experts, of computer and data science experts, as well as advances in terms of organisational, ethical and legal arrangements.