# Complex decisions made simple: a primer on stochastic dynamic programming

## Authors

Corresponding author. E-mail: olivier.gimenez@cefe.cnrs.fr

## Summary

1. Under increasing environmental and financial constraints, ecologists are faced with making decisions about dynamic and uncertain biological systems. To do so, stochastic dynamic programming (SDP) is the most relevant tool for determining an optimal sequence of decisions over time.
2. Despite an increasing number of applications in ecology, SDP still suffers from a lack of widespread understanding. The required mathematical and programming knowledge as well as the absence of introductory material provide plausible explanations for this.
3. Here, we fill this gap by explaining the main concepts of SDP and providing useful guidelines to implement this technique, including R code.
4. We illustrate each step of SDP required to derive an optimal strategy using a wildlife management problem of the French wolf population.
5. Stochastic dynamic programming is a powerful technique for making decisions in the presence of uncertainty about stochastic biological systems that change through time. We hope this review will provide an entry point into the technical literature about SDP and will improve its application in ecology.

## Introduction

Numerous problems in ecology involve making decisions about the best option among a set of competing strategies. These so-called optimization problems can be solved using mathematical procedures such as linear programming (Nash & Sofer 1996), which allows the determination of maximum benefits or minimum costs given some objectives and under some constraints for deterministic systems assumed to be at equilibrium. If uncertainty in the dynamics of the system needs to be accounted for, a Markov decision process (MDP, Puterman 1994; Williams 2009) model is usually adopted. ‘MDPs are models for sequential decision making when outcomes are uncertain’ (Puterman 1994). MDPs are made of two components: a Markov chain that models the uncertain future states of the system given an initial state, and a decision model. First, an MDP is a Markov chain in which the system undergoes successive transitions from one state to another through time. For example, these state transitions can correspond to the change of a population size from 1 year to the next. In Markov chains, the transitions to future states only depend on the current state of the system. In other words, the state of the system at time step t provides sufficient information to predict the states of the system at time step t+1. Second, an MDP involves a decision-making process in which an action is implemented at each sequential state transition. In the conservation and wildlife management literature, the phrase stochastic dynamic programming (SDP) is often used to refer to both the mathematical model (MDP) and its solution techniques (SDP per se).

MDPs are usually modelled and solved by going through several successive steps: defining the different objectives and formalizing them as a mathematical function of costs and/or benefits (Williams, Nichols & Conroy 2002); defining possible states of the system, monitoring the system and making statistical inference on system behaviour (Nichols & Williams 2006); defining a set of alternative actions that influence the performance of the system; building a dynamic model to describe the system transitions from one state to another after implementing every possible decision; and finally determining the optimal strategy, that is, the set of decisions that is expected to best fulfil the objectives over time (Runge 2011). These objectives are formalized in a utility function that prioritizes some desired outcomes by evaluating the benefits of any decision for the system (Williams, Nichols & Conroy 2002). MDP models highlight the trade-off between obtaining current utility and altering the opportunities to obtain utility in the future. Such problems abound in ecology because decisions taken today often have important implications for the future behaviour of biological systems.

Stochastic dynamic programming is an optimization technique used to solve MDPs and is appropriate for the nonlinear and random processes involved in many biological systems. While the time dimension is often neglected in optimization procedures such as classical linear or nonlinear programming, SDP determines state-dependent optimal decisions that vary over time (Williams, Nichols & Conroy 2002). Finally, SDP is acknowledged to be one of the best tools for making recurrent decisions when coping with uncertainty inherent to biological systems (Possingham 1997, 2001; Wilson et al. 2006; Chadès et al. 2011).

The principle of SDP relies on the partitioning of a complex problem into simpler subproblems across several steps that, once solved, are combined to give an overall solution (Mangel & Clark 1988; Lubow 1995; Clark & Mangel 2000). SDP was first developed and used in applied mathematics, economics and engineering (Bellman 1957; Intriligator 1971) and has gained attention in ecology (Mangel & Clark 1988; Shea & Possingham 2000). A pioneering use of SDP was in behavioural ecology to determine the breeding and foraging strategies of individuals that maximize fitness (Houston et al. 1988; Mangel & Clark 1988; Ludwig & Rowe 1990). Early work in resource management included applications to pest control (Winkler 1975) and fisheries management (Walters 1975; Reed 1979). In conservation biology, SDP has been successfully used to produce evidence-based management recommendations (optimization of resource allocation: Westphal et al. 2003; Martin et al. 2007; Chadès et al. 2011; management of natural resources in the context of global change: Martin et al. 2011). In forestry, SDP has helped achieve a balance between the protection of biological diversity and sustainable timber production (Lembersky & Johnson 1975; Teeter, Somers & Sullivan 1993; Richards, Possingham & Tizard 1999). Stochastic dynamic programming has also been implemented in various studies aiming to control the spread of weeds, pests or diseases (Shea, Thrall & Burdon 2000; Baxter & Possingham 2011; Pichancourt et al. 2012), to determine the best water management policies (Martin et al. 2009) or to enhance the efficiency of a biocontrol agent (Shea & Possingham 2000). In wildlife management, SDP has often been used to find the optimal rates for harvesting populations (Johnson et al. 1997; Milner-Gulland 1997; Spencer 1997; Martin et al. 2010).

Despite the flexible nature of SDP and its ability to solve important decision-making problems in ecology, its transfer to ecologists is difficult. One reason for the slow uptake is the mathematical knowledge required for SDP to be implemented. Here, we provide a primer on SDP for ecologists. We introduce the main concepts of SDP, provide a step-by-step procedure to implement dynamic programming in a deterministic system and illustrate how to make decisions in the presence of uncertainty. We demonstrate the applicability of SDP by applying this approach to data from a wolf population controlled by culling. We provide R code to run the models as well as procedures in specialized toolboxes implementing SDP that can conveniently be amended for one's own purposes.

## The six steps of stochastic dynamic programming

The aim of SDP is to find the solution of an optimization problem based on the ‘principle of optimality’ which states that ‘an optimal policy has the property that, whatever the initial state and decision are, the remaining decisions must constitute an optimal policy with regards to the state resulting from the first decision’ (Bellman 1957). The principle of optimality allows us to consider a static problem for the current period by assuming that all future decisions will be made optimally. The effect of the current action thus contributes to both current utility and to future utility through its effect on the future state of the system. In this way, SDP finds a strategy that balances current rewards with future opportunities. Stochastic dynamic programming is the technique used to solve a Markov decision problem. One can conceive solving a Markov decision problem through six steps described below. Notations are gathered in Table 1, and a non-exhaustive list of studies that have used SDP is given in Table 2.

Table 1. Notation used in dynamic programming

| Variable | Notation | Nature |
| --- | --- | --- |
| State variable | Xt | Vector indexed by time |
| Control action | At | Vector indexed by time |
| Time | t | Index |
| Optimal action | π* | Vector of length the number of states at time t |
| Utility | U(Xt, At) | Function of the states and actions |
| Transition probability | P(Xt+1 \| Xt, At) | Matrix (number of states at t, number of states at t+1) |
| Value | V(Xt) | Vector of length the number of states at t |
| Discount factor | β | Real number between 0 and 1 |

Table 2. Non-exhaustive list of studies using stochastic dynamic programming

| Study | State variable | Objective | Actions | Utility function |
| --- | --- | --- | --- | --- |
| Shea & Possingham (2000) | Site level of colonization: empty, insecure, established | Biocontrol agent colonizing as many sites and as quickly as possible | Many agents released in small patches; few agents in several patches; mix of both strategies | Number of established sites |
| Venner et al. (2006) | Energy supply of the orb-weaving spider | Optimize fitness by maximizing the energy brought by breeding and foraging while minimizing predation and starvation risks | Web-building choice; possible web size | Balance between energy gained from eggs laid and prey caught on the web and energy cost from starvation and from predation risks |
| Runge & Johnson (2002) | Pre-breeding abundance of ducks | Find the optimal harvest rate given several recruitment and survival functions | Harvest rate | Total harvest accumulated through time |
| Martin et al. (2010) | Female raccoon abundance; oystercatcher productivity | Maintain oystercatcher productivity above a level necessary for population recovery while minimizing raccoon removal | Harvest rate in each age class | Total number of raccoons after harvest, with a penalty factor when oystercatcher productivity goes below a threshold |
| Milner-Gulland (1997) | Saiga antelope abundance; proportion of males and females | Maximize monetary yield while preserving the saiga population already threatened by drought | Harvest rate; proportion of males in the harvest | Annual monetary yield from game hunting, given the price of the meat, the horn and management costs |

| Study | Dynamic model | Optimization | Last value | Uncertainty |
| --- | --- | --- | --- | --- |
| Shea & Possingham (2000) | Colonization, extinction, establishment in insecure sites | Backward iteration, T = 10 | Unknown | Probability of establishment and of local extinction |
| Venner et al. (2006) | Discrete Markov model describing the transition of the energy state of a spider from t to t + 1 given the web-building choice of individuals | Value iteration over an infinite time horizon | Lifetime fitness given its energy state at the time horizon is expected to be 1 | Probability to catch a prey and predation risks |
| Runge & Johnson (2002) | Reproduction; harvest; natural mortality | Value iteration; infinite time horizon (convergence criterion was no change of state-dependent policy for more than 4 years); no discount rate | No values were assigned to the terminal state of the process, V(XT) = 0 | Structural uncertainty: recruitment functions (linear, exponential, hyperbolic); survival functions (constant, logistic, compensatory) |
| Martin et al. (2010) | Model structured in 3 age classes (raccoon population); log-linear relationship between oystercatcher productivity and total number of raccoons | Backward iteration, iterating at most 100 time steps until a stable policy was maintained for 15 time periods | The expected abundance range of raccoons | Environmental stochasticity; parameter uncertainty |
| Milner-Gulland (1997) | Model structured in age and sex classes with density-dependent effects on survival | Value iteration; infinite horizon | Expected future yield at time horizon is 1 | Environmental stochasticity and partial controllability |

The first step defines the optimization objective of the problem. An objective must be specific to the problem, acceptable to the actors involved, achievable, defined over a period of time also called the time horizon, and measurable with a function that is related to the system states and actions. This function, called utility, gives the reward for the outcome of any action applied to a certain state (Williams, Nichols & Conroy 2002). Several objectives can be defined depending on the type of ecological problem we are investigating, but an optimization objective must be defined as the maximization or minimization of a function over a time horizon (Puterman 1994; Converse et al. 2012). The time horizon can be defined as finite or infinite. For many resource problems, choosing the time horizon is quite challenging and depends on a number of factors. First, there may be mandated constraints on a problem. Conservation and management programmes are often planned on a limited time and budget, and are bounded by political decisions also taken at regular time intervals. For instance, the conservation status of species listed under Appendix S2 of the Habitat Directive is evaluated every 5 years by the European Commission (92/43/EEC). As a consequence, some governments evaluate every 5 years decisions related to management of wildlife and habitats present within their territory (Meedat – Map 2008). For private decisions, a finite horizon is often appropriate for situations in which firms hold time-limited rights to extract resources. Finite horizons should be used carefully in situations where they are arbitrarily specified. It is quite possible that the ‘optimal’ decision as the time horizon approaches will reflect only very short-run goals. For example, a conservation problem that penalizes failure to meet a target performance level at the time horizon may result in short-run decisions designed only to meet the target rather than to maximize the long-run conservation goals.

Objectives in management for harvested populations typically focus on maximizing the harvest opportunities, while ensuring sustainable populations over the time horizon (Caughlan & Oakley 2001; Hauser et al. 2007; Nichols et al. 2007). Alternatively, the monetary value of the economic yield from harvest might be used (Milner-Gulland 1997; Table 2). Objectives can include both conservation and exploitation of natural resources and can also include several, possibly conflicting, conservation goals. For instance, a conservation problem might deal with the protection of two species that interact negatively with one another over an infinite time horizon (Chadès, Curtis & Martin 2012). In metapopulation models, often used in invasion biology, epidemiology and landscape ecology, objectives can also be expressed as maximizing or minimizing the number of sites occupied by a species (Shea & Possingham 2000; Chadès et al. 2011; Table 2). When the economic costs of management and monitoring, as well as the cost of failure to maintain a viable protected species, are well known, the objective can be clearly formalized to determine the best way to allocate funding to protect a threatened species (Chadès et al. 2008; McDonald-Madden et al. 2011) or eradicate an invasive species (Regan et al. 2006; Baxter & Possingham 2011).

The second step is to define the set of states that represents the possible configurations of the system at each time step. Let Xt be the state variable of the system at time t. The state variable can be a population abundance (Milner-Gulland 1997; Runge & Johnson 2002) or predator abundance and prey productivity (Martin et al. 2010). Other studies have considered a qualitative state variable such as site occupancy of a colonizing species (Shea & Possingham 2000). We refer to Table 2 for additional examples.

In the third step, one needs to define the decision variable, At, that is, the component of the system dynamics that one can control to meet the objective. For example, it can be expressed as the way of releasing a biocontrol agent in crop sites: many individuals released in few sites or few individuals released in many sites. Another example of control actions is different harvest rates in each age class (Martin et al. 2010) or sex class of a species (Milner-Gulland 1997).

The fourth step is to build a transition model describing the system dynamics and its behaviour in terms of the effect of a decision on the state variables (Table 2). This transition model follows a Markov process in which the future state Xt+1 depends on the current state Xt and the action adopted At but not on the past state and action pairs of the system.

In the fifth step, one needs to define the utility function Ut at time t, also called the immediate reward. It might be expressed in terms of economic benefits, desired ecological status or social improvement (Table 2) and might be quantified in a more or less subjective way (Simon 1979; Isen, Nygren & Ashby 1988; Milner-Gulland 1997). This function, denoted as Ut(Xt, At), which pertains to the Markov chain formulation, represents the desirability of acting in a given state of the system and is defined in terms of the state variable Xt (step 2) and the decision At (step 3). The utility values can accrue over either a finite or an infinite time horizon depending on the objectives formalized in step 1. In the former case, a terminal reward or salvage value, R(XT+1) with T the horizon time, can also be specified that measures the utility that accrues if the system is left in a given state after the last decision is made. In population biology and behavioural ecology, R(XT+1) is often chosen to be the desired abundance of a population or the energy state of an individual (Mangel & Clark 1988; Martin et al. 2010).

Sixth, the final step consists in determining the optimal solution to the optimization problem. The optimal solution, also called the optimal strategy or policy, maximizes our chance of achieving our objective over a time horizon. An optimal solution is defined as a function πt:Xt → At that maps each state to the optimal action for that state. Hereafter, we examine the three most commonly used approaches to solve an MDP: backward iteration, value iteration and policy iteration.

## How to determine the optimal solution?

Several SDP algorithms are available to find the optimal solution of an MDP. The choice of the most appropriate algorithm depends mainly on the optimization objective (step one). Backward iteration is run over a finite horizon in time-reversed fashion. It leads to a time- and state-specific optimal solution. Value iteration and policy iteration are used to solve infinite time horizon problems. Both techniques provide an optimal action expressed as a time-independent function.

### Optimization procedure over a finite horizon

According to the principle of optimality (Bellman 1957), an efficient way to find an optimal decision is to reason backward in time. More precisely, one first finds the optimal decision at the horizon time T and then works backward to choose what to do at every earlier time step, assuming that all later decisions are optimal. Here, T denotes the time horizon defined in step 1. Let V(X) be the value function of states, which quantifies the reward or the penalty after each state transition following a decision (Lubow 1995). Let π* be a vector that gives the best decision for each state at a given time step; π* is the set of decisions (A) associated with the maximum of the value function over the set of states [V(X)]. Let β be the discount factor, representing the value of the reward gained in the next period relative to the reward obtained in the current period (Moore, Hauser & McCarthy 2008; Martin et al. 2011). It can also reflect a measure of confidence level in the predictions of the dynamic model. Predictions made for the near future are generally more certain than the ones made for the distant future.

The finite horizon problem can be written formally as

(eqn 1)

The expression includes two parts, the sum of the discounted utility values from time t to the horizon T and the discounted terminal reward (R(XT+1)), which is a function of the state that the system is left in, XT+1 after the last decision is taken.
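One way to write an objective with these two parts, using the notation of Table 1, is sketched below. The exact typesetting, and in particular the choice to measure the discount exponents from the current time t, is an assumption made here for illustration; τ is a hypothetical summation index.

```latex
\max_{A_t, \dots, A_T} \;
  \sum_{\tau = t}^{T} \beta^{\,\tau - t}\, U(X_\tau, A_\tau)
  \;+\; \beta^{\,T + 1 - t}\, R(X_{T+1})
```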

In the backward iteration algorithm, the starting point is to realize that there exists a recursive relationship that identifies, for each state, a value function for step t, denoted Vt(Xt), given that step Vt+1(Xt+1) has already been solved (Appendix S1):

(eqn 2)
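For a deterministic system, a recursion of this kind is usually written in the following form, given here as a sketch consistent with Table 1 rather than as the exact published expression. It assumes that the next state Xt+1 is fully determined by the current state Xt and the action At.

```latex
V_t(X_t) \;=\; \max_{A_t} \Big\{ U(X_t, A_t) + \beta \, V_{t+1}(X_{t+1}) \Big\},
\qquad V_{T+1}(X_{T+1}) = R(X_{T+1})
```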

As suggested by the principle of optimality, the Bellman equation expresses the optimization problem in terms of the current decision alone. The first part of this equation is made of the immediate reward represented by the utility function, while the second part is the value function for the next period, Vt+1(Xt+1). The procedure is initialized by setting VT+1(XT+1) = R(XT+1). Then, the previous value VT(XT) is computed, then VT-1(XT-1), and so on. The optimal action, that is the action associated with each initial state X0, is obtained by repeated backward recursions from the horizon time T to present time 0 (see Fig. 1b–d) and by taking the argument of the maximum initial values V0(X0) (Fig. 1d and Fig. 2).
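The backward recursion can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the wolf example: the toy harvest problem (abundance classes 0 to 5, a harvest of at most 2 units, survivors doubling up to a cap of 5, no terminal reward) and all function names are hypothetical.

```python
def backward_iteration(states, actions, transition, utility,
                       terminal_reward, horizon, beta=1.0):
    """Deterministic backward iteration over decision times horizon, ..., 0."""
    # Initialise with the terminal reward: V_{T+1}(x) = R(x).
    value = {x: terminal_reward(x) for x in states}
    policy = []  # policy[t][x] will hold the best action in state x at time t
    for _ in range(horizon + 1):
        new_value, decision = {}, {}
        for x in states:
            best_action, best_value = None, float("-inf")
            for a in actions(x):
                # Immediate reward plus discounted value of the resulting state.
                candidate = utility(x, a) + beta * value[transition(x, a)]
                if candidate > best_value:
                    best_action, best_value = a, candidate
            new_value[x], decision[x] = best_value, best_action
        value = new_value
        policy.insert(0, decision)  # earlier time steps go to the front
    return value, policy


# Hypothetical toy problem (all numbers are illustrative assumptions).
STATES = range(6)                     # abundance classes 0..5

def actions(x):
    return range(min(x, 2) + 1)       # harvest 0, 1 or 2 units, at most x

def transition(x, a):
    return min(5, 2 * (x - a))        # survivors double, capped at 5

def utility(x, a):
    return a                          # reward is the harvest taken

value0, policy = backward_iteration(STATES, actions, transition, utility,
                                    terminal_reward=lambda x: 0.0,
                                    horizon=1, beta=1.0)
```

With two decision times (t = 0 and t = 1), the sketch harvests nothing from a population of size 1 at t = 0 so that it can grow, and harvests everything available at the last step because no terminal reward is attached to the final state.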

An important issue, besides the choice of the horizon T and of the terminal value of the system, R(XT+1), is the choice of a discount factor β (Lubow 2001), which lies between 0 and 1 (Bellman 1957). Discounting is often specified in terms of a discount rate r, with the (annual) discount factor given by β = 1/(1 + r). Conservation biologists are more likely to use a β of 1, meaning the value of future system states is not discounted. In such situations, future utility contributes as much to the overall objective as current utility.
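The relation β = 1/(1 + r) can be made concrete with a short computation. The snippet below is only an illustration: the rate of 2% and the time lags of 10 and 50 periods are assumed values chosen to show how quickly the weight of future utility declines.

```python
def discount_factor(rate):
    """Convert a per-period discount rate r into a discount factor beta."""
    return 1.0 / (1.0 + rate)


beta = discount_factor(0.02)          # illustrative rate of 2% per period

# Weight given to a unit of utility received now, in 10 and in 50 periods.
weights = {lag: beta ** lag for lag in (0, 10, 50)}

# A rate of zero gives beta = 1, that is, no discounting of the future.
no_discount = discount_factor(0.0)
```

Under these assumed values, a unit of utility obtained 50 periods ahead counts for a little over a third of a unit obtained today.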

Even though not discounting future utility complies with the sustainability principle, most economists recommend using a discount factor <1. One reason is that many people place more importance on current than future rewards, especially when future rewards are risky (Norgaard & Howarth 1991). In addition, most problems in resource management involve utilities that have some social and economic costs and benefits associated with them. When the resource has a non-market value, it can be difficult to convert the ecological, social and economic costs and benefits into a common scale (Wam 2009). Such scale differences and issues of utility incommensurability impede the determination of an appropriate discount rate (whether financial, social or ecological). The method commonly used for selecting a discount rate is based on a market rate for a relatively risk-free asset such as a US Treasury bond. Recent recommendations for environmental projects suggest the use of r = 2% for long-term projects (http://www.whitehouse.gov/omb/circulars_a094/a94_appx-c; see also EU's ‘Guide to Cost-Benefit Analysis of Investment Projects’).

### Optimization procedure over an infinite horizon

With infinite horizon problems, both the value function and the optimal policy are independent of time. The problem to be solved can be written as

(eqn 3)

Starting with an arbitrary value function and iterating over an infinite horizon model with policy or value iteration causes the optimal action to converge towards a time-independent function also called a stationary strategy with the optimal solution only depending on the state of the system and not on time.

The first algorithm used to solve an MDP over an infinite horizon, called value iteration, follows the same procedure as described previously except that the Bellman equation is applied iteratively until a convergence criterion is met. A typical convergence criterion (Boutilier, Dearden & Goldszmidt 2001) is

(eqn 4)

where the norm ‖V(Xt+1) - V(Xt)‖ is the maximum absolute value of the difference between two successive decision values, for all possible states. The value of ɛ is usually chosen to be small, so that when the condition in eqn 4 is satisfied, the value function is within ɛ of its optimal value. In our example, we fixed ɛ at 10⁻³ as in Boutilier, Dearden & Goldszmidt (2001).
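Value iteration can be sketched as follows for a deterministic system. This is a minimal illustration under stated assumptions: the stopping rule is a plain threshold on the maximum absolute change between successive value functions, which may differ from the exact criterion of eqn 4, and the toy harvest problem and function names are hypothetical.

```python
def value_iteration(states, actions, transition, utility, beta,
                    eps=1e-3, max_iter=10_000):
    """Apply the Bellman update repeatedly until the values stop changing."""
    value = {x: 0.0 for x in states}          # arbitrary starting guess
    for _ in range(max_iter):
        new_value = {
            x: max(utility(x, a) + beta * value[transition(x, a)]
                   for a in actions(x))
            for x in states
        }
        # Maximum absolute change over all states (sup norm).
        gap = max(abs(new_value[x] - value[x]) for x in states)
        value = new_value
        if gap < eps:
            break
    # Stationary policy: the action that is greedy with respect to the values.
    policy = {
        x: max(actions(x),
               key=lambda a: utility(x, a) + beta * value[transition(x, a)])
        for x in states
    }
    return value, policy


# Hypothetical toy problem (all numbers are illustrative assumptions).
STATES = range(6)                     # abundance classes 0..5

def actions(x):
    return range(min(x, 2) + 1)       # harvest 0, 1 or 2 units, at most x

def transition(x, a):
    return min(5, 2 * (x - a))        # survivors double, capped at 5

def utility(x, a):
    return a                          # reward is the harvest taken

value, policy = value_iteration(STATES, actions, transition, utility, beta=0.9)
```

In this toy problem the resulting stationary policy leaves small populations unharvested until they have grown, and takes the maximum harvest of 2 units once abundance reaches 4 or 5.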

Another algorithm called policy iteration (Howard 1960) involves alternating between finding the best policy (or strategy) given the current guess of the value function and determining the value function associated with the current policy (Appendix S2). One advantage of the policy iteration algorithm is that it will generally run faster than the value iteration (Howard 1960). The policy iteration approach can be decomposed in two steps.

In the first step (evaluation), a value function is calculated from a guessed policy (Boutilier, Dearden & Goldszmidt 2001). Let At be any policy which describes the actions that are taken for any value of the state Xt, so that Xt+1 is a function of both the state and action variables that can be written as Xt+1 = g(Xt, At). The value function associated with this policy can be determined by solving a system of linear equations, one for each value of the state variable

(eqn 5)

In the second step (improvement), we find the policy A’ that satisfies, for each value of the state

(eqn 6)

The two steps are then repeated (back to the first step) until the policies A and A’ are identical, that is, until the improvement step no longer changes the policy.
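The alternation between evaluation and improvement can be sketched as below for a deterministic system. Two choices here are assumptions made to keep the example free of external libraries: the evaluation step iterates the policy's own value equation to convergence instead of solving the linear system directly, and the current action is kept on ties so that the loop terminates. The toy harvest problem and all names are hypothetical.

```python
def evaluate_policy(policy, states, transition, utility, beta, tol=1e-10):
    """Evaluation step: value of following `policy` forever (iterative solve)."""
    value = {x: 0.0 for x in states}
    while True:
        new_value = {
            x: utility(x, policy[x]) + beta * value[transition(x, policy[x])]
            for x in states
        }
        gap = max(abs(new_value[x] - value[x]) for x in states)
        value = new_value
        if gap < tol:
            return value


def policy_iteration(states, actions, transition, utility, beta):
    policy = {x: min(actions(x)) for x in states}   # initial guess: no harvest
    while True:
        value = evaluate_policy(policy, states, transition, utility, beta)
        improved = {}
        for x in states:
            # Improvement step: keep the current action unless another one
            # is strictly better by more than a small numerical tolerance.
            best_a = policy[x]
            best_v = utility(x, best_a) + beta * value[transition(x, best_a)]
            for a in actions(x):
                v = utility(x, a) + beta * value[transition(x, a)]
                if v > best_v + 1e-9:
                    best_a, best_v = a, v
            improved[x] = best_a
        if improved == policy:                      # policy is stable: stop
            return value, policy
        policy = improved


# Hypothetical toy problem (all numbers are illustrative assumptions).
STATES = range(6)                     # abundance classes 0..5

def actions(x):
    return range(min(x, 2) + 1)       # harvest 0, 1 or 2 units, at most x

def transition(x, a):
    return min(5, 2 * (x - a))        # survivors double, capped at 5

def utility(x, a):
    return a                          # reward is the harvest taken

value, policy = policy_iteration(STATES, actions, transition, utility, beta=0.9)
```

On this small example the policy stabilizes after a few rounds of evaluation and improvement, which illustrates why policy iteration often needs far fewer outer iterations than value iteration.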

In infinite time horizon problems, the standard approaches may need to be modified if a discount rate of 0 is used. With no discounting of the future, the value function will not be stationary unless there is a probability of 1 that the state variable reaches and stays in a non-valued state at some time, such as extinction in a population conservation problem. If there is a positive probability of obtaining a positive reward in any given future period, the expected value of future rewards will be infinite, and hence, V will not be well defined. In this situation, it is therefore more appropriate to use an average value approach that attempts to maximize the per period expected value function. Algorithmically, there are two main approaches to solve such problems. The vanishing discount approach uses a discount factor near 1 (such as 0.999999). The relative value approach solves for the average reward plus an adjustment to account for the relative value of being in alternative states. For further discussion, see Puterman (1994).

## Making decisions in presence of uncertainty

Thus far, we have focused on deterministic MDPs in which each state and action combination yields a unique, known result. Here, we discuss how to accommodate uncertainty in dynamic programming. In SDP, there are several possible next states, given the action taken and the current state, and each of them has a certain probability of being reached. Let P be a transition matrix giving, for a given action At, the conditional probabilities that the system in state Xt at time t (in rows) changes into state Xt+1 (in columns). The transition matrix is a stochastic matrix, which consists of non-negative elements with rows that sum to 1. The Bellman equation can be rewritten as the sum of the utility value at the current state (which holds in the deterministic version) and the sum of the expected future rewards, that is, the products of transition probabilities and values of all possible next states (Appendix S1). In the backward iteration procedure, for example, the stochastic version of the equation is

(eqn 7)
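Following the verbal description above and the notation of Table 1, a stochastic recursion of this kind is commonly written as shown below. This is offered as a sketch consistent with the surrounding definitions, not as the exact published expression.

```latex
V_t(X_t) \;=\; \max_{A_t} \Big\{ U(X_t, A_t)
  + \beta \sum_{X_{t+1}} P(X_{t+1} \mid X_t, A_t)\, V_{t+1}(X_{t+1}) \Big\}
```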

One may notice that the difference from eqn 2 is the addition of the transition probability matrix. Actually, the deterministic version of the Bellman equation can be rewritten as a special case of SDP, where P is a matrix of 0s with a single 1 in each row. In SDP, P consists of transition probabilities depending on stochastic events related to demographic and/or environmental stochasticity or to the action taken, the effect of which can be uncertain.
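The stochastic recursion differs from the deterministic one only in that the value of the next state is replaced by an expectation over possible next states. The sketch below illustrates this with a hypothetical toy problem in which, after harvest, the surviving population doubles in a good year and stays constant in a bad year, each with an assumed probability of 0.5. None of these numbers come from the wolf example.

```python
def stochastic_backward_iteration(states, actions, transition_probs, utility,
                                  terminal_reward, horizon, beta=1.0):
    """Backward iteration with an expectation over the next state."""
    value = {x: terminal_reward(x) for x in states}   # V_{T+1} = R
    policy = []
    for _ in range(horizon + 1):
        new_value, decision = {}, {}
        for x in states:
            best_action, best_value = None, float("-inf")
            for a in actions(x):
                # Expected future value: sum of probability times value.
                expected_future = sum(
                    p * value[nxt]
                    for nxt, p in transition_probs(x, a).items())
                candidate = utility(x, a) + beta * expected_future
                if candidate > best_value:
                    best_action, best_value = a, candidate
            new_value[x], decision[x] = best_value, best_action
        value = new_value
        policy.insert(0, decision)
    return value, policy


# Hypothetical toy problem (all numbers are illustrative assumptions).
STATES = range(6)                     # abundance classes 0..5

def actions(x):
    return range(min(x, 2) + 1)       # harvest 0, 1 or 2 units, at most x

def transition_probs(x, a):
    """One row of the transition matrix P for state x and action a."""
    remaining = x - a
    row = {}
    for nxt, p in ((min(5, 2 * remaining), 0.5),   # good year: doubling
                   (remaining, 0.5)):              # bad year: no growth
        row[nxt] = row.get(nxt, 0.0) + p           # merge identical outcomes
    return row

def utility(x, a):
    return a                          # reward is the harvest taken

value0, policy = stochastic_backward_iteration(
    STATES, actions, transition_probs, utility,
    terminal_reward=lambda x: 0.0, horizon=1, beta=1.0)
```

Setting the probability of one outcome to 1 recovers the deterministic case, in line with the remark that the deterministic Bellman equation is a special case in which each row of P contains a single 1.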

We distinguish several types of uncertainty that can be accounted for to solve a Markov decision problem. First, there is the natural uncertainty which is related to natural and inherent processes occurring in the system and its environment. It is difficult to measure and even more difficult, if not impossible, to reduce. Populations are subjected to environmental stochasticity that can strongly affect their vital rates through changes in weather conditions, habitat structure or other external biotic and abiotic factors (Regan, Colyvan & Burgman 2002; Martin et al. 2010). Demographic stochasticity is also a common source of natural uncertainty. It reflects the variability in survival and reproduction among individuals and is likely to occur in small-size populations (Lande 1993).

Second, management uncertainty, also called partial controllability, results from the inability to accurately apply the decision being made (Williams 2011). Sometimes, actions themselves are taken in an uncertain way. For instance, a planned harvest rate or a prescribed burn can sometimes not be achieved by wildlife or forest managers for many reasons even though it was assumed to be the best solution (Milner-Gulland 1997; Baxter & Possingham 2011; Richards, Possingham & Tizard 1999; see also Table 3).

Table 3. Main features of software packages implementing dynamic programming. MDPSolve (https://sites.google.com/site/mdpsolve/) and MDPtoolbox (http://www.inra.fr/mia/T/MDPtoolbox/) are considered. MDP is for Markov decision process, POMDP for partially observed Markov decision process and AM for adaptive management

| Feature | MDPSolve | MDPtoolbox |
| --- | --- | --- |
| Natural and management uncertainty | Yes (infinite/finite); value, policy, backward iteration | Yes (infinite/finite); value, policy, backward iteration |
| Comments | f2p and g2p functions compute the transition matrix | Need to build transition matrix |
| Observation uncertainty (POMDP) | Yes | No |
| Model uncertainty and parametric uncertainty (AM) | Yes | No |
| Comments | Passive and active | Passive AM to be included in future release |
| Unknown uncertainty (reinforcement learning) | No | Yes (on infinite horizon) |

The third type of uncertainty comes from partial knowledge of the value of the state variable. To cope with such uncertainty, one may use a partially observable Markov decision process (POMDP), a framework that can solve stochastic dynamic problems when we are unable to observe the state of the system perfectly (Chadès et al. 2008). In a population model, a POMDP might augment an MDP to include detection probability matrices. The detection history is not explicitly represented but rather is summarized by a belief state, a probability distribution over the state space representing where we think the state of the system is (Chadès et al. 2008; see also Table 3). Unfortunately, POMDPs are even more complex to solve than MDPs, and to date, it is possible to compute exact solutions only for small-size problems (Chadès et al. 2011).
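How a belief state is revised after an observation can be sketched with the standard Bayesian filter: the belief is first projected through the transition probabilities, then reweighted by the probability of the observation in each state. The example below is a hypothetical two-state case (a population that is extant or extinct, detected with probability 0.4 when extant); the states, probabilities and function name are assumptions for illustration only.

```python
def update_belief(belief, transition, detection, observation):
    """Revise a belief state after taking an action and observing the system.

    belief[x]          : current probability that the system is in state x
    transition[x][y]   : probability of moving from x to y under the action
    detection[y][obs]  : probability of observing `obs` when the state is y
    """
    # Prediction: project the belief one step forward.
    predicted = {
        y: sum(belief[x] * transition[x][y] for x in belief)
        for y in belief
    }
    # Correction: weight each state by the likelihood of the observation.
    unnormalised = {y: predicted[y] * detection[y][observation]
                    for y in predicted}
    total = sum(unnormalised.values())
    return {y: w / total for y, w in unnormalised.items()}


# Hypothetical two-state example (all numbers are illustrative assumptions).
transition = {
    "extant":  {"extant": 0.9, "extinct": 0.1},
    "extinct": {"extant": 0.0, "extinct": 1.0},
}
detection = {
    "extant":  {"seen": 0.4, "not_seen": 0.6},
    "extinct": {"seen": 0.0, "not_seen": 1.0},
}
prior = {"extant": 0.5, "extinct": 0.5}

belief_not_seen = update_belief(prior, transition, detection, "not_seen")
belief_seen = update_belief(prior, transition, detection, "seen")
```

A survey without detection lowers the belief that the population is extant without driving it to zero, whereas a single detection removes all doubt in this example because false positives are assumed impossible.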

Another form of uncertainty is model uncertainty, which refers to the lack of certainty about the structural frame shaping the behaviour of the system (Walters 1986; Punt & Hilborn 1997). Adaptive Management is a common approach adopted to reduce such uncertainty by testing multiple models through the ongoing process of management and monitoring occurring under the principle of ‘learning by doing’ (Runge 2011). In adaptive management, belief weights are attributed to each model depending on the comparison between model predictions of the outcome of an action and the observed response from monitoring. Such a comparison allows us to increase our belief in the model that is most likely to give rise to the observed response.
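The updating of belief weights can be sketched with Bayes' theorem: each model's weight is multiplied by the likelihood of the monitoring result under that model, and the weights are then rescaled to sum to 1. The two model names and the likelihood values below are hypothetical and serve only to illustrate the arithmetic.

```python
def update_model_weights(weights, likelihoods):
    """Bayes update of model weights given one monitoring observation.

    weights[m]     : prior belief weight of model m
    likelihoods[m] : probability of the observed response under model m
    """
    unnormalised = {m: weights[m] * likelihoods[m] for m in weights}
    total = sum(unnormalised.values())
    return {m: w / total for m, w in unnormalised.items()}


# Hypothetical example: two competing models with equal prior weights.
prior_weights = {"no_effect": 0.5, "strong_effect": 0.5}

# Assumed likelihood of the observed response under each model.
likelihoods = {"no_effect": 0.2, "strong_effect": 0.6}

posterior_weights = update_model_weights(prior_weights, likelihoods)
```

Because the observation is three times more likely under the second model, its weight rises from 0.5 to 0.75 after a single update.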

Two approaches, based on the role of learning, are then conceivable (Williams 2009). Passive adaptive management assumes learning is a by-product of decision-making in which the model weights are updated by applying Bayes' theorem but remain constant during the optimization process (Williams, Nichols & Conroy 2002). For instance, Martin et al. (2010) used passive adaptive management to determine an optimal harvest strategy to control raccoons to improve oystercatcher productivity. They considered two models, one assuming no effect of raccoons on oystercatchers' productivity and another one assuming a strong effect. In the second approach, referred to as active adaptive management, model weights appear in the optimization process. More precisely, the next updated weights are incorporated in the expected sum of future rewards of the Bellman equation. Such an approach is the most advanced form of adaptive management. In contrast to passive adaptive management, active adaptive management considers how current decisions will affect future learning and chooses an optimal balance between rewards based on current beliefs and future rewards based on updated beliefs (Runge 2011). For instance, McDonald-Madden et al. (2011) used active adaptive management to assess species relocation strategies in the context of climate change. They considered two models, one in which carrying capacity declined over time because of climate change and another one in which climate change had no impact on species carrying capacity.

The last form of uncertainty, referred to as parametric uncertainty, is related to our limited knowledge about the parameters that govern the system dynamics (Williams 2009). This optimization problem is also referred to as adaptive management under parameter uncertainty. One approach to this problem makes use of conjugate priors over the unknown parameters (Raiffa & Schlaifer 1961). For example, Walters & Hilborn (1976) used a normal prior over parameters in a population growth model. Hauser & Possingham (2008) and Rout, Hauser & Possingham (2009) used Beta priors to represent uncertainty over transition probabilities. Recently, approximate approaches using projection methods have been developed for situations that do not support the use of conjugate priors. Springborn & Sanchirico (2013) applied this approach to the management of development that impacts mangrove habitat.
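To make the idea of a conjugate prior concrete, the sketch below updates a Beta prior on an unknown transition probability after observing binomial outcomes. It is a generic textbook illustration, not the model of any cited study; the function names and the counts (7 successes, 3 failures) are hypothetical.

```python
def update_beta(a, b, successes, failures):
    """Conjugate update: a Beta(a, b) prior combined with binomial data
    gives a Beta(a + successes, b + failures) posterior."""
    return a + successes, b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Uniform prior Beta(1, 1) on an unknown transition probability, followed by
# hypothetical monitoring data: the transition occurred in 7 of 10 trials.
a0, b0 = 1.0, 1.0
a1, b1 = update_beta(a0, b0, successes=7, failures=3)
print(a1, b1, beta_mean(a1, b1))
```

Because the posterior stays in the Beta family, the belief about the parameter is fully summarized by two numbers, which is what allows it to be carried as part of the state in an optimization.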

When the form of uncertainty is unknown, reinforcement learning offers an alternative optimization approach to backward iteration, policy iteration or value iteration. This technique makes sequential decisions when transition probabilities or rewards are unknown and cannot be estimated by simulation (Chadès, Curtis & Martin 2012). With the Q-learning algorithm, the optimal value V0* and the corresponding action are estimated by learning from observed transitions and values, using function approximation (Chadès et al. 2007; Table 3). A potential issue with this method, originally developed in robotics, is that it requires a large number of observations to build the transition matrix.
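The learning process can be sketched with a tabular version of Q-learning. This is a simplification: it stores one value per state-action pair instead of using function approximation, and it is not the MDPtoolbox implementation. The two-state environment `toy_step`, its rewards, and all tuning parameters (learning rate, discount factor, exploration rate) are hypothetical.

```python
import random


def q_learning(step, n_states, n_actions, n_steps=20000,
               alpha=0.5, gamma=0.9, epsilon=0.2, seed=1):
    """Tabular Q-learning with epsilon-greedy exploration.

    step(state, action) must return (next_state, reward); the learner only
    sees these sampled outcomes, never the transition probabilities.
    """
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    state = 0
    for _ in range(n_steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)          # explore
        else:
            action = max(range(n_actions), key=lambda a: q[state][a])  # exploit
        next_state, reward = step(state, action)
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
    return q


def toy_step(state, action):
    """Hypothetical system: state 0 = low, 1 = high abundance;
    action 0 = wait, 1 = harvest. Transitions are deterministic."""
    table = {(0, 0): (1, 0.0), (0, 1): (0, 0.0),
             (1, 0): (1, 1.0), (1, 1): (0, 3.0)}
    return table[(state, action)]


q = q_learning(toy_step, n_states=2, n_actions=2)
policy = [max(range(2), key=lambda a: q[s][a]) for s in range(2)]
print(policy)
```

The learned greedy policy is read directly from the table of Q-values. The many simulated steps needed even for this two-state system illustrate why the method is data hungry.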

## Software packages performing dynamic programming

There are several software packages that allow the implementation of SDP. Adaptive Stochastic Dynamic Programming (ASDP; Lubow 1995) was the first application developed for biologists to solve optimization problems using dynamic programming. It is an MS-DOS executable that is no longer maintained by its author. Two other packages are available for MATLAB: MDPSolve (Fackler 2011, available at https://sites.google.com/site/mdpsolve/) and MDPtoolbox (version 4.0) available at http://www.inra.fr/mia/T/MDPtoolbox/. MDPtoolbox is also available for the open-source software for numerical computations Scilab, R (http://cran.r-project.org/web/packages/MDPtoolbox) and GNU Octave (GNU's not Unix). Both MDPSolve and MDPtoolbox implement the value iteration and the policy iteration algorithms, while ASDP uses only the former. ASDP does not use the convergence criterion discussed previously for the infinite time horizon but stops after the policy remains the same for a specified number of iterations. MDPSolve and MDPtoolbox deal with natural and management uncertainty in finite and infinite time horizons (Table 3). MDPtoolbox copes satisfactorily with unknown management uncertainty through the implementation of Q-learning in an infinite horizon, while MDPSolve does not. MDPSolve can solve POMDPs and address model uncertainty, while MDPtoolbox cannot. Transitions for continuous variables are often defined in terms of either a conditional density or a transition equation which specifies the next period state as a function of the current state and, possibly, a random shock reflecting environmental variation or other process noise. MDPSolve has procedures that define discrete transition matrices that approximate the transition for continuous variables.

In the following section, we provide an application of SDP and solve the associated decision problem using both MDPSolve and MDPtoolbox. Although we emphasize that this exercise does not represent a general introduction to these packages (we refer to the user's guides instead), we hope it will be a good starting point. In addition to the use of these packages, we demonstrate that MDP problems can be implemented in program R and provide code that can be amended for one's own purpose.

## Application to wolf culling

In this section, we illustrate each step of SDP required to derive an optimal management strategy to control a population of wolves in Europe. We consider several decision models of increasing complexity for wolf culling. First, we build a deterministic model to keep things easy and illustrate the notation. Then, we illustrate how to make decisions when uncertainty exists.

### Setting the scene

We go through the six steps of dynamic programming. First, the optimization objective is to maximize the population while ensuring that the population does not exceed 250 individuals (Nmax) and remains above 50 individuals (Nmin). These thresholds are somewhat arbitrary from a biological perspective, but were selected to obtain results in a reasonable amount of time while scanning a relatively broad range of abundance states. Second, the state variable Xt is the population size Nt at time t, which ranges from 0 to K where K is an arbitrary upper bound on the state space. Third, the control variable At is the harvest rate Ht, a discrete variable ranging from 0% to 100% with an increment of 1/(K + 1), therefore allowing as many possible actions as there are states. Fourth, regarding the transition model describing population dynamics and the consequences of actions (harvest Ht) on the state variable (abundance Nt), we adopted an exponential growth (Fig. 1), which is suitable to describe a population currently in a colonization phase. We assumed an additive effect of human offtake on total mortality (Creel & Rotella 2010; Murray et al. 2010). More precisely, we used:

$$N_{t+1} = \lambda N_t (1 - H_t) \qquad \text{(eqn 8)}$$

where λ is the population growth rate. The value of λ was extracted from the literature using the French population as an example (the estimate of λ is 1.25 with 95% confidence interval [1.14; 1.37]; Marescot et al. 2011). Fifth, utility is based on abundance and harvest rate, bearing in mind the objective to keep the population size between Nmin and Nmax. We chose a utility function that increases linearly with abundance when the future state is within the objective range. In mathematical terms, we write:

$$U_t(N_t, H_t) = \alpha_t N_t (1 - H_t) \qquad \text{(eqn 9)}$$

where αt takes the value 1 if Nmin ≤ Nt+1 ≤ Nmax and 0 otherwise. Given the current population size Nt and harvest rate Ht, if the future state is above the utility threshold Nmax or below Nmin, the penalty factor αt takes a null value and, therefore, the utility function does as well. If, however, the future population size Nt+1 is in the target abundance range, then the utility of harvest level Ht in state Nt is the population size after harvest but before annual growth occurs (Fig. 1b). Because we assumed exponential growth, and because Nmax is below the carrying capacity and the growth rate is greater than 1, the objective can be translated into attaining and then maintaining the population at Nmax. An alternative utility function could be defined only on the current abundance because no economic cost was considered here. Adopting the general formulation in which utility is defined as a function of current action would be useful to incorporate economic costs and pay-offs. Sixth, we need to solve the Bellman equation using the value iteration or the policy iteration algorithm.
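The transition model and the utility function can be written as two short functions. The Python sketch below is an illustration only and is not the R code of the appendices. It assumes the transition Nt+1 = λNt(1 − Ht), which is the form implied by the analytic solution, and a utility equal to αt times the post-harvest abundance Nt(1 − Ht). Rounding the next abundance to the nearest integer is a further assumption, made so that states stay discrete.

```python
LAMBDA = 1.25            # population growth rate (Marescot et al. 2011)
N_MIN, N_MAX = 50, 250   # lower and upper abundance objectives


def next_abundance(n, h, lam=LAMBDA):
    """Deterministic transition: harvest a fraction h, then grow by lam.
    Rounded half up to the nearest integer (assumption of this sketch)."""
    return int(lam * n * (1.0 - h) + 0.5)


def utility(n, h, lam=LAMBDA, n_min=N_MIN, n_max=N_MAX):
    """Post-harvest abundance if the next state is within [n_min, n_max],
    and 0 otherwise (alpha is the penalty factor)."""
    alpha = 1 if n_min <= next_abundance(n, h, lam) <= n_max else 0
    return alpha * n * (1.0 - h)


# A few state-action pairs: inside the range, above N_MAX, corrected by
# a 20% harvest, and below N_MIN.
for n, h in [(100, 0.0), (250, 0.0), (250, 0.2), (30, 0.0)]:
    print(n, h, next_abundance(n, h), utility(n, h))
```

At 250 individuals, not harvesting pushes the next state to 313 and the utility to zero, whereas a 20% harvest keeps the next state at 250 and yields a utility of about 200.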

### Deterministic case

We first ran a deterministic model over an infinite time horizon using both value iteration and policy iteration algorithms. There is also an analytic solution to this deterministic MDP, which enables us to validate the approach. With an objective of keeping the population between Nmin and Nmax, the optimal action for a state N is a harvest rate equal to the maximum of 0 and 1 − Nmax/(λN), which removes the exact surplus of individuals above Nmax, consistent with our linear utility function. The three methods provided the same optimal harvest rates. The strategy of no culling remained the best strategy until the population reached 200 individuals. At 200 individuals, the expected population size reached the utility threshold Nmax (200 × 1.25 = 250). From there, the optimal harvest rate increased from 0.8% to 20%. The highest harvest rate was reached at the utility abundance threshold of 250 individuals. We provide R code to implement the resolution of this MDP (Appendices S1 and S2). This example was also run in MDPSolve and MDPtoolbox (Appendix S3 for the scripts and S4 for the numeric values).
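The comparison between value iteration and the analytic solution can be sketched in a few lines of Python. This is not the R code of Appendices S1 and S2, and several choices are assumptions made for illustration: a discount factor of 0.9, an upper bound K = 250, a harvest grid of j/K for j = 0, ..., K, rounding of the next abundance to the nearest integer, and capping of the next state at K.

```python
LAMBDA, N_MIN, N_MAX, K = 1.25, 50, 250, 250
GAMMA = 0.9                               # assumed discount factor
ACTIONS = [j / K for j in range(K + 1)]   # assumed harvest-rate grid


def outcomes(n):
    """(reward, next state) for every harvest rate applied in state n."""
    result = []
    for h in ACTIONS:
        after_harvest = n * (1.0 - h)
        nxt = int(LAMBDA * after_harvest + 0.5)        # round half up
        reward = after_harvest if N_MIN <= nxt <= N_MAX else 0.0
        result.append((reward, min(nxt, K)))           # cap state at K
    return result


def value_iteration(tol=1e-4):
    """Repeat the Bellman backup until values change by less than tol."""
    table = [outcomes(n) for n in range(K + 1)]
    v = [0.0] * (K + 1)
    while True:
        new_v = [max(r + GAMMA * v[m] for r, m in row) for row in table]
        change = max(abs(a - b) for a, b in zip(new_v, v))
        v = new_v
        if change < tol:
            break
    policy = []
    for row in table:
        q = [r + GAMMA * v[m] for r, m in row]
        policy.append(ACTIONS[q.index(max(q))])  # ties go to lowest harvest
    return v, policy


def analytic_harvest(n):
    """Harvest rate that removes exactly the surplus above N_MAX."""
    return max(0.0, 1.0 - N_MAX / (LAMBDA * n)) if n > 0 else 0.0


values, policy = value_iteration()
for n in (100, 200, 225, 250):
    print(n, policy[n], round(analytic_harvest(n), 4))
```

Under these assumptions the numerical policy should agree with the analytic harvest rate up to the resolution of the action grid. Small differences arise because harvest rates and abundances are both discretized.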

The solution demonstrates the trade-off between current and future utility inherent in dynamic programming problems. Here, there is no reason to cull unless the population will exceed Nmax in the next period. If the population is high enough, however, it is optimal to forgo current utility by culling enough to ensure that utility is obtained in the next period.

### Coping with uncertainty

Besides the deterministic model, we consider models with demographic stochasticity that generates variability in population growth rates arising from random differences among individuals in survival and reproduction within a season or a year (Lande 1993). R code is provided to run this additional example (Appendix S5).

We assume that the state variable is distributed according to a Poisson distribution:

$$N_{t+1} \sim \text{Poisson}\big(\lambda N_t (1 - H_t)\big) \qquad \text{(eqn 10)}$$

with mean value equal to its deterministic counterpart (Appendix S1 and S2). The transition probabilities are now changing across the different states according to a Poisson distribution:

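$$P(N_{t+1} = n \mid N_t, H_t) = \frac{e^{-\mu_t}\,\mu_t^{\,n}}{n!}, \quad \mu_t = \lambda N_t (1 - H_t) \qquad \text{(eqn 11)}$$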

We found that harvesting was not recommended as long as the population was below 200 individuals. As in the deterministic model, above this abundance threshold, harvesting increased from 0.8% to 20% of population size (Fig. 3). When the population was already at the upper objective limit Nmax, 50 individuals were to be removed.
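One row of the stochastic transition matrix can be built directly from the Poisson probabilities. The Python sketch below is an illustration and not the code of Appendix S5. It assumes a Poisson mean of λNt(1 − Ht) and, as a further assumption, lumps the probability of exceeding the upper bound K into state K so that each row sums to one.

```python
import math


def poisson_row(mean, k_max):
    """P(next state = m) for m = 0..k_max under a Poisson(mean) transition.
    Probabilities are computed on the log scale to avoid overflow; any mass
    above k_max is added to the last state (assumption of this sketch)."""
    if mean == 0:
        return [1.0] + [0.0] * k_max
    row = [math.exp(-mean + m * math.log(mean) - math.lgamma(m + 1))
           for m in range(k_max + 1)]
    row[k_max] += max(0.0, 1.0 - sum(row))
    return row


LAMBDA, K = 1.25, 250

# Row for 100 individuals without harvest: the mean next abundance is 125.
row = poisson_row(LAMBDA * 100 * (1.0 - 0.0), K)
expected_next = sum(m * p for m, p in enumerate(row))
print(sum(row), expected_next)

# Row for 250 individuals without harvest: the mean is 312.5, so almost all
# of the probability ends up in the capped state K.
row_high = poisson_row(LAMBDA * 250 * (1.0 - 0.0), K)
print(row_high[K])
```

Stacking such rows for every state and harvest rate gives the transition matrices needed by value or policy iteration.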

## Discussion

Stochastic dynamic programming is a valuable tool for solving complex decision-making problems, which has numerous applications in conservation biology, behavioural ecology, forestry and fisheries sciences. It provides an optimal decision that is most likely to fulfil an objective despite the various sources of uncertainty impeding the study of natural biological systems. The formalization of objectives of any Markov decision problem is given by the utility function that allows prioritizing the preferences of the ones who make the decisions (decision-maker or manager). As opposed to the dynamic model, the representation of utility is subjective and hence can be difficult to define.

### Different ways of defining a utility function

The use of dynamic programming implies a particular formalization of the objective into a utility function. The utility is a function of one or more decision variables, themselves defined on the system states and actions. Utility is sometimes defined with constraints that can reflect different decision rules (Williams, Nichols & Conroy 2002).

Problems in resource management often deal with trade-offs not only between current and future objectives but also between multiple current objectives. For example, one trade-off objective is to control and protect a predator that is potentially threatened. Other objectives can be to restore natural habitat while minimizing action cost and allowing some recreational activities. When multiple objectives are involved, different decision variables can be considered. The objective can be to find a relevant utility optimum reflecting the trade-off between the different decision variables (for instance, the habitat quality and the intensity of recreational activity) which can respond differently to decisions (restoration). In such cases, some weighting scheme must be used to express the different decision variables in common units. For example, suppose that E is an environmental performance variable, B is the benefit from recreation activities and C is the financial cost of an action. Utility can be defined as a weighted sum of the decision variables, wE + B − C, where w is a weight that assigns a monetary value to the environmental variable.

An alternative to using weighted sums is to use a multiplicative functional form such as E^a B^b. The parameters a and b serve two functions. First, if a and b are both positive and if a > b, it implies that the environmental variable is more important than the recreation variable. The relative value of a to b changes the weight that is placed on E versus B. Second, if a or b is <1 in absolute value, it implies that the marginal contribution of an additional unit is smaller for larger values of the variable than for smaller ones. This representation is also appropriate when it is deemed more important to save an additional individual of a protected species such as the wolf in France when there are very few remaining than when the population is more abundant. Note that unlike the additive utility form, this multiplicative form is not affected by the scale of either variable.
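The effect of measurement scale on the two functional forms can be checked numerically. In the Python sketch below, the two management options, their E and B values, the weight w = 1 and the exponents a = b = 0.5 are all hypothetical, and the cost term C is left out to keep the comparison to the two variables of the multiplicative form.

```python
def additive_utility(e, b, w=1.0):
    """Weighted sum of the environmental and recreation variables."""
    return w * e + b


def multiplicative_utility(e, b, a=0.5, b_exp=0.5):
    """Multiplicative form E^a * B^b."""
    return (e ** a) * (b ** b_exp)


def preferred(utility, options):
    """Name of the option with the highest utility."""
    return max(options, key=lambda name: utility(*options[name]))


# Two hypothetical options described by (E, B).
options = {"X": (10.0, 1.0), "Y": (2.0, 10.0)}
# The same options with E expressed in a unit ten times smaller.
rescaled = {name: (10.0 * e, b) for name, (e, b) in options.items()}

print(preferred(additive_utility, options),
      preferred(additive_utility, rescaled))
print(preferred(multiplicative_utility, options),
      preferred(multiplicative_utility, rescaled))
```

With the weight w held fixed, changing the unit of E reverses the ranking under the additive form, while the multiplicative form ranks the options the same way in both units.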

Another approach is to convert one decision variable into a constraint or to use a penalty function for failure to meet the target. This approach simplifies the multiple objectives into a single constrained objective (Converse et al. 2012). For instance, one objective can be to improve habitat quality given a limited budget of \$50 000, while allowing a minimum of 100 h/year of recreational activities. For example, U = E if (B ≥ 100 h/year) and if (C < \$50 000); otherwise U = 0. Here, the decision variable is the intensity of recreation, and action cost has been converted into a constraint. This avoids the need to make comparisons between variables of different types, but it also has implications that an analyst should be aware of. First, if the system never reaches the threshold implied by the two constraints (100 h/year of recreation and a budget of \$50 000), it means that both B and C are irrelevant. Second, it implies that once one threshold is reached, further increases in C or further decreases in B are irrelevant. Finally, it should be noted that this utility is not the same as optimizing with respect to E subject to a long-run expectation that the thresholds are satisfied.

### Limits of dynamic programming: curse of dimensionality

Despite the flexibility of dynamic programming, one has to find a trade-off between biological realism and model complexity when tackling an optimization problem. Indeed, DP methods often face the issue known as ‘the curse of dimensionality’, which states that, when more state variables are added to the model, the size of the DP problem increases exponentially (Walters & Hilborn 1978; Schapaugh & Tyre 2012). To overcome this computational complexity, approximate optimization methods can be used, such as heuristic sampling algorithms that have proved efficient for models with several variables (Nicol & Chadès 2011). These methods approximate the optimal solution given the starting state by simulating the future states that are most likely to occur. Simulating only these future states lightens the computational load in comparison with the value or policy iteration procedure, in which values are computed for all possible states.
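The exponential growth of the problem can be illustrated by counting states and transition probabilities. The Python sketch below assumes, for illustration only, that every state variable is discretized into the same number of levels and that one full transition matrix is stored per action; the function name `dp_problem_size` is hypothetical.

```python
def dp_problem_size(levels_per_variable, n_state_variables, n_actions):
    """Number of joint states and of transition-matrix entries when each
    state variable takes the same number of discrete levels."""
    n_states = levels_per_variable ** n_state_variables
    n_transition_entries = n_actions * n_states ** 2
    return n_states, n_transition_entries


# One, two and three state variables with 251 levels each and 251 actions,
# matching the size of an abundance grid from 0 to 250.
for n_vars in (1, 2, 3):
    print(n_vars, dp_problem_size(251, n_vars, 251))
```

Adding a second state variable with the same resolution multiplies the number of states by 251 and the number of transition entries by 251 squared, which quickly exceeds what exact methods can handle.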

### Perspectives for wolf population management

The aim of this study was to demonstrate the usefulness and relative ease of SDP. We hope that this study can serve as an entry point into the extensive literature and potential applications of SDP in ecology. For the sake of clarity, we made assumptions to keep the illustration simple, but SDP can accommodate several useful extensions. First, we did not include socio-economic constraints in the modelling process. However, SDP allows the incorporation of such factors by maximizing several objectives simultaneously using complex trade-offs in the utility function (Walters & Hilborn 1978; Milner-Gulland 1997; Runge & Johnson 2002). In our example, economic constraints could be incorporated via a trade-off between monetary loss from livestock depredation, impact of wolves on game abundance and indirectly on hunting activity, the receipts from ecotourism and the cost of wolf culling (e.g. Milner-Gulland 1997). Second, the lower abundance limit could also be refined based on an ecological threshold that once reached is irreversible (Holling 1973; Bodin & Wiman 2007). Using such thresholds would be relevant for a protected species because it would ensure population viability without necessarily changing the optimal policy (Martin et al. 2009). Additionally, further work is needed to compare optimal strategies obtained with alternative population dynamic models. Indeed, exponential growth is an adequate model for a colonizing population, but when a population is established and the habitat saturated, this model becomes inappropriate. Instead of considering exponential growth, one could use a logistic growth with density-dependent effects such as an Allee effect, which has been shown in social species with few breeding units like African wild dogs (Lycaon pictus) (Stephens & Sutherland 1999).