
# Beyond stochastic dynamic programming: a heuristic sampling method for optimizing conservation decisions in very large state spaces

Article first published online: 21 OCT 2010

DOI: 10.1111/j.2041-210X.2010.00069.x

© 2010 The Authors. Methods in Ecology and Evolution © 2010 British Ecological Society

### Additional Information

#### How to Cite

Nicol, S. and Chadès, I. (2011), Beyond stochastic dynamic programming: a heuristic sampling method for optimizing conservation decisions in very large state spaces. Methods in Ecology and Evolution, 2: 221–228. doi: 10.1111/j.2041-210X.2010.00069.x

#### Publication History

- Issue published online: 1 APR 2011
- Article first published online: 21 OCT 2010
- Received 12 May 2010; accepted 2 September 2010 Handling Editor: Robert P. Freckleton

### Keywords:

- *Leipoa ocellata*
- Markov decision processes
- metapopulation
- on-line sparse sampling algorithm
- optimal management
- stochastic dynamic programming

### Summary


**1.** When managing endangered species, the consequences of a poor decision can include extinction. To make a good decision, we must account for the stochastic dynamics of the population over time. To this end, stochastic dynamic programming (SDP) has become the most widely used tool for calculating the optimal policy to manage a population over time and under uncertainty.

**2.** However, as a result of its prohibitive computational complexity, SDP has been limited to solving problems of small dimension; SDP models are therefore either oversimplified or approximated using greedy heuristics that consider only the immediate rewards of an action.

**3.** We present a heuristic sampling (HS) method that approximates the optimal policy for any starting state. The method is attractive for problems with large state spaces as the running time is independent of the size of the problem state space and improves with time.

**4.** We demonstrate that the HS method out-performs a commonly used greedy heuristic and can quickly solve a problem with 33 million states. This is roughly 3 orders of magnitude larger than the largest problems that can currently be solved with SDP methods.

**5.** We found that HS out-performs greedy heuristics and can give near-optimal policies in shorter timeframes than SDP. HS can solve problems with state spaces that are too large to optimize with SDP. Where the state space size precludes SDP, we argue that HS is the best technique.

### Introduction


The need to make correct management decisions in the face of an uncertain future motivates conservation biologists to optimize decision making using mathematical techniques. For models with small state spaces the best technique is stochastic dynamic programming (SDP), which can be used to derive the globally optimal control policy to achieve a specified management objective (Bellman 1957; Puterman 1994). SDP has been used widely in the conservation literature, with applications in optimal translocation (Lubow 1996; Rout, Hauser & Possingham 2007), release strategies for bio-control agents (Shea & Possingham 2000), reserve size and configuration (Tuck & Possingham 2000; Haight *et al.* 2004; Meir, Andelman & Possingham 2004; Strange, Thorsen & Bladt 2006; Wilson *et al.* 2006), landscape reconstruction (Westphal *et al.* 2003; McDonald-Madden *et al.* 2008), strategies for pest eradication (Regan *et al.* 2006; Bogich & Shea 2008), harvesting (Johnson *et al.* 1997; Hunter & Runge 2004; Spring & Kennedy 2005), and fire regimes (Richards, Possingham & Tizard 1999; McCarthy, Possingham & Gill 2001).

Although SDP is a powerful tool, it suffers from Bellman's curse of dimensionality (Bellman 1957), which means that adding a new state variable to a model results in an exponential increase in the size of the state space. The result is that for all but the simplest models, the SDP policy becomes too computationally expensive to compute. As many ecological systems are vastly complex, being restricted to two or three state variables may frustrate the ecological modeller [e.g. Meir, Andelman & Possingham (2004)]. In this paper, we describe and explore the use in conservation of a heuristic sampling (HS) approximation algorithm developed in the field of artificial intelligence by Péret & Garcia (2004), which is able to circumvent the curse of dimensionality and provide policies where the quality of the policy improves with increasing run time. The method can provide approximately optimal policies for systems with infinitely large state spaces.

#### Ecological case studies

We demonstrate the advantage of using the HS approximation in two case studies. In the first, we consider the management of a metapopulation of malleefowl *Leipoa ocellata* habitat in the Bakara region of South Australia. Malleefowl are robust ground-dwelling birds that inhabit semi-arid regions of Australia. The birds are susceptible to a number of threats and are classified as vulnerable under national legislation. A metapopulation model for the Bakara population has previously been developed (Day & Possingham 1995); however, management options were not explored. In this paper, we explore two management options: fox baiting and reintroduction of captive-bred malleefowl to empty patches.

We then demonstrate the applicability of the HS method to eradicate an invasive species from a ring-structured metapopulation with 33 million states.

### Materials and methods


The mathematical framework for many optimal management problems (including SDP and the HS method) is the Markov decision process (MDP) model. A MDP is a sequential decision model in which the available actions, rewards and transition probabilities for each state depend only on the current state and not on the states visited and actions taken in the past (Puterman 1994). Defining a MDP model requires a management objective, as well as five components: (i) the state space, (ii) a set of management actions, (iii) the length of the time horizon, (iv) a transition probability matrix and (v) the immediate costs (or rewards) associated with being in a state and taking an action (Sutton & Barto 1998). The solution to a MDP is an optimal policy which associates each state with an action such that if the policy is followed the objective will be achieved with maximum probability. A MDP model can be solved using a number of methods, including SDP and the HS algorithm. For discussion of how Markov decision theory can be applied to ecological problems, see Mangel & Clark (2000) or Possingham *et al.* (2001).

In the remainder of the ‘Materials and methods’ section, we introduce the SDP and HS algorithms, and then formulate two ecological case studies as MDP models. These case studies are solved using SDP and the HS algorithm to give the optimal management policies for each problem.

#### Stochastic dynamic programming

A standard SDP technique for solving a MDP numerically is the value iteration algorithm. We define the states **s** and the actions **a** to be elements of the state space *S* (**s** ∈ *S*) and the action space **A**(**s**) (**a** ∈ **A**(**s**)). We use the cost function to define an iterative value function *V*_{j}(**s**) that represents the global cost of being in a state **s** after *j* iterations (in ecological applications, an iteration is usually assumed to be an annual time step). The value function is defined using Bellman's equations (Puterman 1994):

$$Q_j(\mathbf{s},\mathbf{a}) = c_j(\mathbf{s},\mathbf{a}) + \gamma \sum_{\mathbf{s}'} P(\mathbf{s}' \mid \mathbf{s},\mathbf{a})\, V_{j-1}(\mathbf{s}'), \qquad V_j(\mathbf{s}) = \min_{\mathbf{a} \in \mathbf{A}(\mathbf{s})} Q_j(\mathbf{s},\mathbf{a}) \qquad (\text{eqn 1})$$

where *c*_{j}(**s**, **a**) is the problem-specific local cost received for being in state **s** and taking action **a** at iteration *j*, *γ* is a discount factor that weights the importance of future states because they are uncertain, and *P*(**s**^{′} | **s**, **a**) is the transition probability of going from state **s** to state **s**^{′} after taking action **a**. We refer to *Q*(**s**, **a**) as the *Q*-value for state-action pair (**s**, **a**). For discount factors less than unity, Eqn 1 converges over time (or number of iterations) to the optimal value function *V*^{*}(**s**). We use a discount factor of *γ* = 0·99 in all of the calculations in this study.

Once the value function has converged, the optimal policy *π*^{*}(**s**) for state **s** is the action that minimizes the value function:

$$\pi^{*}(\mathbf{s}) = \operatorname*{arg\,min}_{\mathbf{a} \in \mathbf{A}(\mathbf{s})} Q^{*}(\mathbf{s},\mathbf{a}) \qquad (\text{eqn 2})$$
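The value iteration scheme above can be sketched as follows. This is a minimal illustration on a toy two-state, two-action MDP; the transition probabilities and costs are hypothetical, not taken from either case study.

```python
import numpy as np

def value_iteration(P, c, gamma=0.99, tol=1e-6):
    """Value iteration for a cost-minimising MDP (Eqns 1-2).

    P : array of shape (A, S, S) with transition probabilities P(s' | s, a)
    c : array of shape (S, A) with local costs c(s, a)
    Returns the (approximately) optimal value function and policy.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = c(s, a) + gamma * sum_s' P(s'|s,a) V(s')   (eqn 1)
        Q = c + gamma * np.einsum('ast,t->sa', P, V)
        V_new = Q.min(axis=1)          # minimise over actions
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, Q.argmin(axis=1)         # pi*(s) = argmin_a Q(s, a)   (eqn 2)

# Toy problem with hypothetical numbers
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.5, 0.5]]])
c = np.array([[1.0, 2.0], [4.0, 3.0]])
V, policy = value_iteration(P, c)
```

Note the exhaustive sweep over every state in each iteration; this is the step that the curse of dimensionality makes prohibitive for large state spaces.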

#### The heuristic sampling algorithm of Péret and Garcia

Value iteration provides a globally optimal policy for managing a MDP. However, obtaining this policy is computationally expensive because the value iteration algorithm performs an exhaustive search of the whole state space for each iteration of the algorithm. Combined with Bellman's curse of dimensionality, this means that value iteration is not always an effective technique for solving MDPs with large state spaces (for discussion of the efficiency of value iteration, see section 4.7 of Sutton & Barto (1998)).

Péret & Garcia (2004) provide an attractive approximation technique to overcome this complexity. The HS algorithm computes the best action from the current (root) state by simulating the possible future states to obtain likely outcomes of each action. Future states are simulated using a model that is able to predict transition probabilities from the current state. Value iteration is performed only on the simulated future states rather than on all possible states. This makes the algorithm attractive for problems with large state spaces as the running time is independent of the size of the state space of the problem.

Because the algorithm only looks at states that are likely to be reached from the current state, the policy obtained will only approximate the optimal policy for the current state. We sacrifice the computationally expensive globally optimal policy for all states (the SDP solution) for a faster local approximation that applies only to the current state (Fig. 1).

The computational cost of a solution is determined by the acceptable error level (where the acceptable error level is determined by the particular application). The quality of the HS policy improves as the computing time/budget increases. This allows the user to assign a given computational budget to the problem, and be assured that the lowest cost policy that can be obtained with that budget (and exploration policy) will be achieved. If the budget is increased, then lower cost policies can be obtained as the algorithm can consider additional future possibilities.

The HS algorithm relies on two key measures: (i) the value function and (ii) the global error at the root state. Both measures are estimated by simulating *N* trajectories of length *H* from the root state (Fig. 1). In order to generate the trajectories, we choose which action to take from a state based on an exploration policy (we use a Boltzmann distribution) and our current estimate of the *Q*-value of each state (Sutton & Barto 1998). See Supporting Information Appendix S3 for details of the exploration policy.
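A Boltzmann exploration policy of the kind used above can be sketched as follows (a minimal illustration; the temperature value and Q-values are hypothetical, and the schedule actually used is detailed in Appendix S3). Because our problem minimises cost, low Q-values should be favoured, so the Q-values are negated inside the exponential:

```python
import math
import random

def boltzmann_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(-Q/T).

    Low-cost (low Q-value) actions are chosen most often, but every action
    retains some probability of being explored; a high temperature gives
    near-uniform exploration, a low one near-greedy exploitation.
    """
    weights = [math.exp(-q / temperature) for q in q_values]
    r = random.random() * sum(weights)
    cumulative = 0.0
    for action, w in enumerate(weights):
        cumulative += w
        if r < cumulative:
            return action
    return len(q_values) - 1   # guard against floating-point round-off

# Hypothetical Q-values for three actions: action 0 (lowest cost)
# dominates, but actions 1 and 2 are still sampled occasionally.
counts = [0, 0, 0]
for _ in range(10000):
    counts[boltzmann_action([1.0, 2.0, 3.0], temperature=0.5)] += 1
```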

Using our current estimate of the value of each state, it is possible to adapt the Bellman equation (Eqn 1) to account for finite samples and compute the value function at different depths *H*:

$$V_{C,h}(\mathbf{s}) = (1-\alpha)\,\min_{\mathbf{a} \in \mathbf{A}(\mathbf{s})}\left[c(\mathbf{s},\mathbf{a}) + \frac{\gamma}{C}\sum_{\mathbf{s}' \in S(\mathbf{s},\mathbf{a},C)} V_{C,h-1}(\mathbf{s}')\right] + \alpha\, V_{C,h-1}(\mathbf{s}) \qquad (\text{eqn 3})$$

where *h* ∈ {0,1,…,*H*} is the iteration index and *S*(**s**, **a**,*C*) is the set of *C* states that have been sampled from initial state **s** after carrying out action **a**. *V*_{0}(**s**) is the initial value function estimate. We used an initial value estimate of zero for all states. Eqn 3 is an iterative equation where each step is denoted using the index *h*. When the value function is updated, the new value calculated from Eqn 3 is weighted to provide a proportion (1 − *α*) of the updated value, and the rest of the value function (proportion *α*) is provided by the estimate of the value from the previous batch of samples *V*_{C,h−1}(**s**). We refer to *α* as the learning rate (where 0 < *α* < 1), as it determines how much emphasis we place on newly acquired knowledge (Sutton & Barto 1998). We use a fixed value of *α* = 0·2 in this study. This value of *α* is suitable for most applications [for more information on *α*, see section 6.3 of Sutton & Barto (1998)].

The goal of the HS algorithm is to learn the true *Q*-values (and hence the optimal policy) by simulation of what is likely to happen in the future. The more we simulate, the more possible future outcomes we consider, and the better our estimate of the true *Q*-values becomes. The *Q*-values for each state-action pair are updated and stored after each batch of simulations is completed. The *Q*-values are then improved using new samples collected in the next iteration of the simulation.
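The sampled backup described above can be sketched as a single update step (an illustration only; the numbers below are hypothetical, and the batch mean stands in for the exact expectation over transition probabilities):

```python
def update_q(q_old, cost, sampled_next_values, alpha=0.2, gamma=0.99):
    """One sampled backup of a Q-value estimate, in the spirit of Eqn 3.

    sampled_next_values : current value estimates V(s') of the C successor
    states sampled after taking the action from state s.  The mean over the
    batch replaces the exact expectation over P(s' | s, a), and the new
    estimate is blended with the old one using learning rate alpha.
    """
    C = len(sampled_next_values)
    backup = cost + gamma * sum(sampled_next_values) / C
    return (1 - alpha) * backup + alpha * q_old

# Hypothetical numbers: previous estimate 5.0, local cost 1.0, and a
# batch of C = 3 sampled successor values.
q_new = update_q(5.0, 1.0, [2.0, 4.0, 3.0])
```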

Because we are taking a finite sample from the total set of possible states that could be visited from each state in the trajectory, there will be a sampling error each time we take a batch of *N* simulations. If we increase the length of the trajectories without accounting for this sampling error, the error will compound and our estimates of the optimal management action will become worse as the trajectory length increases. To control this effect, Péret & Garcia (2004) provide an estimate of the global sampling error associated with the value of being in a given state:

$$\sigma_{C,h}(\mathbf{s}) = \min_{\mathbf{a} \in \mathbf{A}(\mathbf{s})}\left[e(\mathbf{s},\mathbf{a}) + \frac{\gamma}{C}\sum_{\mathbf{s}' \in S(\mathbf{s},\mathbf{a},C)} \sigma_{C,h-1}(\mathbf{s}')\right], \qquad \sigma_{C,0}(\mathbf{s}) = \sigma_{\text{init}} \qquad (\text{eqn 4})$$

where *σ*_{init} is some initial error estimate (in this paper, we assume *σ*_{init} = 10). Eqn 4 minimizes the error in a state by minimizing the expected sum of local errors *e*(**s**, **a**) in a similar way to the value estimate equation (Eqn 3). The local error *e*(**s**, **a**) is defined as the error due to finite sampling, and is estimated using a well-known result from statistics. For a state-action pair that has been sampled *C* times, the local error is given by *e*(**s**, **a**) = *t*^{C−1}_{β/2} *σ*/√*C*, where *σ* is the standard deviation of the *C* samples of *γV*_{h−1}(**s**^{′}) and *t*^{C−1}_{β/2} is the *t*-value from the Student's *t*-distribution with *C* − 1 degrees of freedom and confidence level *β*/2. In this study, we set *β* = 0·1.
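The local error estimate can be sketched as follows. This is a minimal, dependency-free illustration: the batch of discounted successor values and the *t*-value are hypothetical (in practice the *t*-quantile could be computed with, e.g., `scipy.stats.t.ppf`):

```python
from math import sqrt
from statistics import stdev

def local_error(discounted_next_values, t_value):
    """Local sampling error for a state-action pair sampled C times:

        e(s, a) = t * sigma / sqrt(C)

    where sigma is the sample standard deviation of the C values of
    gamma * V(s'), and t_value is the Student's-t quantile with C - 1
    degrees of freedom at the chosen confidence level (passed in here
    to keep the sketch free of external dependencies).
    """
    C = len(discounted_next_values)
    return t_value * stdev(discounted_next_values) / sqrt(C)

# Hypothetical batch of C = 5 discounted successor values; 2.776 is
# roughly the t-quantile for 4 degrees of freedom at the 95% level.
err = local_error([2.0, 2.5, 1.8, 2.2, 2.1], t_value=2.776)
```

The error shrinks as 1/√*C*, so state-action pairs that have been sampled more often contribute less uncertainty to the global error estimate.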

Starting from our root state, we generate trajectories and update the estimates of the value function and the global errors of all states visited. Each time a batch of simulations is generated, we decrease the computational budget by *N* units. This process is repeated until the error between successive iterations falls below a user-defined tolerance (we use a tolerance of 0·1). At this point, we know that we have sampled the likely daughter states of the root state sufficiently frequently that we are confident that the global error is small, and we increase the length of our sampled trajectories (*H* *H* + 1). We repeat this process until the computational budget is exhausted. Once the budget is exhausted, the best action is the action with the minimum *Q*-value at the root state, based on the most recently updated *Q*-values. This action can then be carried out and a new state obtained. The new state becomes the root state in the next time step and the algorithm is run again (Fig. 1). Pseudocode for the algorithm is given in Appendix S3.
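The overall control loop just described can be sketched as a structural skeleton (illustration only: the trajectory simulation is stubbed out, all names are hypothetical, and the full pseudocode is in Appendix S3):

```python
def heuristic_sampling(root, budget, N, tolerance=0.1, sigma_init=10.0):
    """Skeleton of the HS control loop: simulate batches of N trajectories
    of depth H from the root state, lengthen the horizon (H <- H + 1) once
    the change in the global error falls below the tolerance, and stop
    when the computational budget is exhausted."""
    H = 1
    q_values = {}            # Q-value estimates for visited (state, action) pairs
    prev_error = sigma_init
    while budget > 0:
        error = simulate_batch(root, H, N, q_values)   # updates q_values in place
        budget -= N
        if abs(prev_error - error) < tolerance:
            H += 1           # confident at this depth: look further ahead
        prev_error = error
    # the best action at the root is the one with the minimum Q-value
    return min(q_values, key=q_values.get)[1]

def simulate_batch(root, H, N, q_values):
    """Stub: a real implementation generates N trajectories of depth H with
    the exploration policy, updates q_values (Eqn 3) and returns the global
    error at the root (Eqn 4).  Here we fake a converged two-action root."""
    for a in range(2):
        q_values.setdefault((root, a), float(a))
    return 0.0

best = heuristic_sampling(root=0, budget=100, N=10)
```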

#### Case study 1: malleefowl conservation model

In the first case study, we adopt the discrete-time, presence-absence metapopulation model described in Day & Possingham (1995), which consists of *M* = 8 patches. The management objective is to minimize the probability of extinction of the malleefowl population over the long term (infinite horizon). The state space *S* for our model is based on whether each patch is occupied or not. The size of the state space is 2^{M} = 256 states. We describe each state as a vector **s** of binary states. The state of the *i*th patch is **s**(*i*) = 1 if the patch is occupied, and **s**(*i*) = 0 otherwise (*i* ∈ {1,2,…,*M*}).

We use previous studies of malleefowl populations (Priddel & Wheeler 1997; Benshemesh 2000; Thompson *et al.* 2000; Mawson 2004) and cost data (Armstrong 2004; Mawson 2004) to estimate the costs of fox baiting and re-introduction. The assumptions and methods used in the estimates are detailed in Appendix S1. In each of the 8 patches, we need to decide whether or not to lay fox baits, and also whether or not to do a re-introduction. By assuming a minimum annual expenditure and maximum budget allocation (see Appendix S1), we obtain the set of possible management actions that can be taken from a state **s**; this set is the action space **A**(**s**).

We define four processes used to create the transition probability matrix: reintroduction of malleefowl, fox baiting of patches, extinction, and colonization.

The effect of a re-introduction in the model is to cause an unoccupied patch to become occupied. We assume that each attempted reintroduction has a 50% probability of success. The effect of fox control on the malleefowl population was estimated using a rudimentary population viability analysis of the Bakara malleefowl population using data from Benshemesh (2000), Day & Possingham (1995) and Priddel & Wheeler (1996) (see Appendix S2). Extinction causes an occupied patch to become empty, and colonization occurs when an empty patch becomes occupied due to the arrival of migrants from another occupied patch in the metapopulation. The processes of extinction and colonization were modelled using the method of Day & Possingham (1995). The assumptions behind these processes and how they are used to generate a transition matrix are detailed in Appendix S1.

We minimize the probability of population extinction by defining the local cost of being in a state, and then minimize the expected sum of these costs over time. The cost of being in a state **s** is the number of unoccupied patches in that state:

$$c(\mathbf{s}) = \sum_{i=1}^{M} \bigl(1 - \mathbf{s}(i)\bigr) \qquad (\text{eqn 5})$$

where *M* is the number of patches.
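The binary state representation and the local cost of Eqn 5 can be sketched as follows (an illustration of the encoding only; the patch state shown is the one written 10011101 elsewhere in the text):

```python
from itertools import product

M = 8                                    # number of patches

def cost(state):
    """Local cost of a state (Eqn 5): the number of unoccupied patches."""
    return sum(1 - occupied for occupied in state)

# The state space is every length-M binary occupancy vector: 2**M = 256 states.
state_space = list(product([0, 1], repeat=M))

s = (1, 0, 0, 1, 1, 1, 0, 1)             # patches 2, 3 and 7 unoccupied
```

Minimising the expected discounted sum of this cost over time is equivalent to keeping as many patches occupied as possible, which drives down the probability of metapopulation extinction.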

#### Case study 2: pushing the boundaries – optimal management of invasive species

The HS method offers a considerable advantage over SDP because it generates policies for problems with any sized state space. To explore the utility of the algorithm compared to SDP we use a pest eradication problem in which we seek to optimally eradicate an invasive species from a metapopulation.

We envisage a metapopulation of *M* patches arranged in a ring structure (Fig. 2). Patches may either be infected or susceptible. The state space therefore has size 2^{M}. Susceptible patches may become infected via their nearest connected neighbours with probability *P*_{infect} × *I*, where *I* is the number of infected neighbours. In each time step only one management action may be performed on one of the patches. A management action consists of eliminating the pest in the managed patch – that is transforming an infected patch to a susceptible patch with probability *P*_{recover}. Managers may also choose to do nothing, which has no effect. The total number of possible actions at any state is *M* + 1. Each management action has a cost of 50 units. Doing nothing has no cost. Each susceptible patch is rewarded with 100 units. The local reward function can therefore be written as:

$$r(\mathbf{s}, a) = 100 \sum_{i=1}^{M} s(i) - \mu(a) \qquad (\text{eqn 6})$$

where *s*(*i*) = 1 if patch *i* is susceptible and zero otherwise. The cost of action *a* is given by *μ*(*a*).
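The ring-metapopulation dynamics and the local reward of Eqn 6 can be sketched as a one-step simulator (a simplified sketch under the assumptions above; the parameter values for *P*_{infect} and *P*_{recover} are illustrative, not those used in the study):

```python
import random

def reward(state, action_is_manage):
    """Local reward (Eqn 6): 100 units per susceptible patch, minus the
    50-unit cost of a management action (doing nothing is free)."""
    return 100 * sum(state) - (50 if action_is_manage else 0)

def step(state, managed_patch, p_infect=0.1, p_recover=0.8):
    """One stochastic transition of the ring metapopulation.
    state[i] = 1 if patch i is susceptible, 0 if infected;
    managed_patch = None means 'do nothing'."""
    M = len(state)
    new_state = list(state)
    for i in range(M):
        if state[i] == 1:                # susceptible: infected neighbours may infect it
            infected_neighbours = (1 - state[(i - 1) % M]) + (1 - state[(i + 1) % M])
            if random.random() < p_infect * infected_neighbours:
                new_state[i] = 0
    if managed_patch is not None and state[managed_patch] == 0:
        if random.random() < p_recover:  # eradication attempt may succeed
            new_state[managed_patch] = 1
    return tuple(new_state)

state = (0,) * 8                          # all patches initially infected
state = step(state, managed_patch=0)      # attempt eradication in patch 0
```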

The management objective is to eliminate the pest species by maximizing the number of susceptible patches for the least cost. We do this by maximizing the expected sum of local rewards (Eqn 6). To find an optimal management policy where the problem is to maximize reward rather than to minimize cost (as in the malleefowl example), we replace the ‘*min*’ in Eqns 1–3 with a ‘*max*’ expression, and replace the local cost terms with local reward terms. The global error estimate in the HS algorithm remains the same, as we seek to minimize the global error regardless of the objective of our value function.

We use this problem to test the HS algorithm for problems with very large state spaces. The ring problem is ideal because we are able to determine the optimal policy for up to 12 patches using SDP, and the optimal policy follows a simple rule of thumb. The optimal policy for managing a ring structured population with all patches initially infected is to start somewhere in the ring and then proceed to manage at either end of the susceptible chain that is created until all the infected patches become susceptible and the infection is removed. We test the HS algorithm for networks of up to 25 patches (equivalent to 33·6 million states).

### Results


#### Heuristic sampling and SDP: malleefowl example

The performances of the HS and SDP algorithms for the malleefowl case study are compared using simulation (see algorithm 2 of Appendix S3 for pseudocode) in Fig. 3. Lower cost solutions are better solutions. The cost of the HS policy rapidly converges to the cost of the optimal SDP policy, so that the optimal policy can be obtained for a computational budget of 2000 simulations per time step. Results for a maximum-gain greedy (myopic) heuristic are also shown. The greedy heuristic considers all management options in the current time step and chooses the action that gives the highest immediate returns. The greedy heuristic ranks many actions equally because they have the same cost in the next time step. Because the greedy heuristic cannot discern between these strategies, it may choose an action that is sub-optimal over a longer time horizon. This leads to a poor strategy compared to the optimal solution.

High computational costs meant that we could only evaluate relatively few (20) simulation runs of the HS algorithm on our personal computer. Having few simulation runs means that the HS solution in Fig. 3 is susceptible to stochastic fluctuations, which is why it appears to out-perform the globally optimal SDP solution for very large computational budgets (10 000 simulations/time step). Plots of the 95% confidence intervals for the SDP and HS policy performance show that the widths of the confidence intervals are substantially greater than the variation between the HS and SDP solutions (Fig. 3).

#### Managing the malleefowl

The HS algorithm was used to compute the best policy for initial state 10011101 over a period of 20 years, subject to a computational budget of 2000 simulations at each time step (Fig. 4). For this initial state and budget the HS policy was identical to the optimal SDP policy.

A rule of thumb that works for most initial states is: re-introduce into the largest unoccupied patches as frequently as the budget will allow, and spend the rest of the budget on baiting, starting with the smallest patches first. However, there are exceptions to the rule. For example, in Fig. 4, once all patches are occupied (time step 13), the optimal solution includes baiting the second-largest patch (patch 7) before baiting all of the smaller patches (e.g. patch 6). Consequently this rule of thumb would not hold for all states. In contrast, the HS algorithm with a sufficient computational budget is able to pick up cases where the rule of thumb gives a poor policy.

#### Effect of the computational budget: simulation time and relative error

As the state space gets large, we found that the SDP algorithm becomes cumbersome to evaluate, and the HS algorithm is a better choice. Figure 5 shows this trade-off between computational time and accuracy for the malleefowl problem with two different patch networks and various computational budgets. The 8-patch metapopulation is the Bakara conservation park example. For the 10-patch metapopulation, a patch network with random patch areas and inter-patch distances was generated. Patch areas and inter-patch distances were selected from a uniform distribution using the same range of values as the 8-patch network. Increasing the computational budget of the HS algorithm results in roughly exponential increases in the evaluation time. Increasing the number of patches in the metapopulation slows down both the HS and the SDP algorithms. The HS algorithm is faster than the SDP algorithm for all budgets tested except for the 8-patch metapopulation with a computational budget of 5000 simulations per time step.

Figure 5b shows the relative error associated with each metapopulation for each of the computational budgets. Positive errors mean that the HS policy gives a higher cost than the SDP policy. Negative errors mean that the HS policy gives a lower cost than the SDP policy (due to stochasticity in individual runs – on average the HS policy does not out-perform the SDP policy by more than 0·2% for any of the scenarios tested). The relative error decreases exponentially as the computational budget is increased. The error is very large and variable with low computational budgets because the HS algorithm makes poor choices if too few simulations are used to generate the optimal action. As the computational budget increases, the standard error of the HS algorithm decreases. The standard error is larger in the 8-patch metapopulation than the 10-patch metapopulation because the two problems were simulated separately with a different random seed. This is not likely to be a general trend and the variation in the performance will depend on the random seed that determines the particular trajectory that the system takes.

Although the HS algorithm is slightly faster than the SDP for the 8-patch metapopulation for all budgets less than 5000 simulations per time step, the relative error is high. This makes SDP the best algorithm for this metapopulation. However, according to Fig. 5 the SDP algorithm is considerably slower than the HS algorithm for the 10-patch metapopulation and the HS algorithm is able to achieve low-error results with moderate computational budgets (e.g. 2000 simulations/time step). For the 10-patch problem the HS algorithm could be successfully employed to achieve a high quality policy in a shorter time than the SDP algorithm. As the number of patches in the metapopulation increases, it is expected that the HS will become markedly faster than the SDP and the trends seen in Fig. 5 will become increasingly apparent.

#### Solving very large state space problems

To investigate the performance of the HS algorithm when the state space is very large, we turn to our ring-structured optimal eradication problem.

Simulations were run for networks of 4, 8, 12, 20 and 25 patches. The number of states, run times and computational budgets used in each case are given in table 3 in Appendix S3. In each case all patches were initially infected. The simulations were run for 20 time steps and the accumulated reward recorded. In each case the HS algorithm followed the rule of thumb exactly (for the 4-, 8- and 12-patch networks, the rule of thumb was identical to the optimal solution; the optimal solution cannot be computed for the 20- and 25-patch cases).

In general, computational requirements mean that SDP methods can only solve problems for managing spatially explicit metapopulations when there are fewer than about 15 patches. However, with HS we were able to solve a problem 3 orders of magnitude larger than this in a reasonable time (table 3 in Appendix S3). Although the 20- and 25-patch networks are highly simplified examples, they could not have been solved using SDP methods.

#### Effect of the initial state on HS efficiency and accuracy

The HS algorithm was run on the 8-patch eradication problem for starting states with 1–8 patches occupied (see table 4 in Appendix S3). In each case the HS algorithm found the optimal SDP action. For this problem the HS algorithm evaluates faster as the number of initially infected patches decreases. This effect is likely to be problem-dependent and the ideal computational budget will depend on the particular structure of the transition probabilities. The best method where the model is complex is to use the largest computational budget that is practical and compare the output to expert heuristics.

### Discussion and concluding remarks


Metapopulations create notoriously hard optimization problems because the size of the state space increases exponentially with the number of patches in the network. Few attempts seem to have been made to obtain optimal management plans for spatially explicit metapopulations with more than three patches, with the exceptions of Possingham (1996) and Westphal *et al.* (2003). We have demonstrated that the HS algorithm produces good policies for two metapopulation management problems with very large state spaces, and shown that the HS algorithm can produce results that are better than greedy heuristics, which are currently used as a default rule of thumb for solving problems with state spaces that are too large for SDP.

The research has implications for ecological disciplines outside of conservation biology. SDP has been effectively employed in behavioural ecology (Mangel & Clark 2000; Tenhumberg, Tyre & Roitberg 2000), bio-control in agriculture (Shea & Possingham 2000), fisheries and game harvesting (Walters & Hilborn 1975; Hilborn 1976; Johnson *et al.* 1997), and reserve selection (Possingham *et al.* 1993; Costello & Polasky 2004; Meir, Andelman & Possingham 2004; Wilson *et al.* 2006). The curse of dimensionality applies in all of these areas, and the HS algorithm could be easily used in these disciplines to improve the results obtained by simple heuristics.

Other attempts have been made to use heuristics to make conservation decisions when the curse of dimensionality precludes the use of SDP. For example Wilson *et al.* (2006) and Bode *et al.* (2008) recommend a one-step heuristic to prioritize global resource allocation for protected areas. McDonald-Madden *et al.* (2008) took a similar approach but also accounted for imperfect knowledge of the contents of potential sites in their heuristic. In a problem closer to the case studies in this paper, Nicol & Possingham (2010) attempted to use a greedy heuristic to schedule restoration of habitat within a metapopulation. The heuristics in these studies look only one time step ahead, but the HS method looks as far ahead as possible given the constraints of the computational budget and the acceptable error level. By searching further ahead, the HS policy approximates the optimal policy better than (or at least as well as) the one-step heuristics. Although the greedy heuristic may give good results in some cases, we showed that it did not perform well in the malleefowl problem (Fig. 3). The HS algorithm obtained the optimal policy where the greedy heuristic failed.

Attempts have been made to deal with the curse of dimensionality in agriculture and forestry (Garcia 1999; Garcia & Sabbadin 2001; Forsell 2009). A noteworthy strategy is the graphical MDP approach (Peyrard *et al.* 2007; Forsell *et al.* 2009) which solves problems with large state spaces and multiple state variables. It also captures the spatial structure of the landscape in the underlying graphical model. This approach has many parallels to metapopulation problems in conservation and may be an alternative to the HS algorithm for large state space metapopulation problems.

One caveat of the HS algorithm is that unlike the SDP algorithm, the HS algorithm does not guarantee optimality. However, the HS still has great potential to be used on complex conservation problems, because as we increase the computational budget, the errors decrease (or at least do not increase) (Péret & Garcia 2004). We get this performance because we control the global error and remove the possibility that poor decisions are made due to overly sparse sampling, resulting in decisions actually becoming worse as the amount of sampling increases (Péret & Garcia 2004). Although the HS method does not necessarily provide optimal policies, HS may be thought of as obtaining the best policy possible given the constraints of computational capacity.

There is no rule of thumb that predefines an adequate computational budget for a given problem. The budget is determined primarily by the acceptable error, and also by the structure of the transition probabilities of the specific problem. If only a few daughter states are likely to be visited, a low computational budget will find the optimal solution quickly; if many daughter states can be reached from a state, more simulations, and hence a higher budget, are required to obtain a representative sample of possible outcomes. In reality we rarely know the optimal solution, so HS need only out-perform heuristics to provide a useful solution; we therefore advocate judging the quality of an HS solution by comparing it against expert heuristics. As a general rule, the largest computational budget that is practical (in terms of both computing power and time elapsed) should be used.
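A rough intuition for why branching factor drives the budget comes from the classic coupon-collector expectation (our illustration, not a result from the paper): with *k* equally likely daughter states, the expected number of simulation draws needed to observe every daughter state at least once grows like *k* ln *k*, faster than *k* itself.

```python
# Expected number of draws to see all k equally likely daughter states
# at least once: k * (1 + 1/2 + ... + 1/k), the coupon-collector expectation.

def expected_draws(k):
    return k * sum(1.0 / i for i in range(1, k + 1))

for k in (2, 10, 50):
    print(k, round(expected_draws(k), 1))  # required budget grows faster than k
```

Real transition matrices are not uniform, but the superlinear trend explains why states with many reachable daughters demand disproportionately larger budgets.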

Although the HS algorithm can rapidly solve problems with very large state spaces, it runs comparatively slowly when the action space is also large (as in the malleefowl problem). We were able to solve the ring-structured metapopulation with 26 actions efficiently, but solving the malleefowl problem with 82 actions was slow. To our knowledge, few algorithms can efficiently approximate optimal policies for problems with very large action spaces, although some very recent work suggests that such algorithms are being developed (Forsell *et al.* 2009; Todorov 2009).

The HS algorithm provides one method to circumvent the curse of dimensionality that limits the use of SDP in conservation decision-making. We showed how the HS algorithm can be applied to two metapopulation management problems and demonstrated that it obtains near-optimal policies for a relatively modest computational budget. The HS algorithm found a policy with optimal cost in a shorter time than SDP on networks as small as 10 patches, and it can solve problems with state spaces orders of magnitude larger than those tractable with SDP. Using HS methods will advance managers’ ability to solve complex conservation decisions in systems with large state spaces.

### Acknowledgements

The authors wish to thank Hugh Possingham, Tara Martin and the three anonymous reviewers for their useful comments on early versions of the manuscript.

### References

- (2004) Baiting operations: *Western Shield* review – February 2003. Conservation Science Western Australia, 5, 31–50.
- Bellman, R. (1957) Dynamic Programming. Princeton University Press, Princeton, NJ.
- Benshemesh, J. (2000) National Recovery Plan for Malleefowl. Technical report, Environment Australia, Canberra, ACT, Australia.
- Bode, M., Wilson, K.A., Brooks, T.M., Turner, W.R., Mittermeier, R.A., McBride, M.F., Underwood, E.C. & Possingham, H.P. (2008) Cost-effective global conservation spending is robust to taxonomic group. Proceedings of the National Academy of Sciences of the United States of America, 105, 6498–6501.
- Bogich, T. & Shea, K. (2008) A state-dependent model for the optimal management of an invasive metapopulation. Ecological Applications, 18, 748–761.
- Clark, C.W. & Mangel, M. (2000) Dynamic State Variable Models in Ecology: Methods and Applications. Oxford Series in Ecology and Evolution. Oxford University Press, New York, NY.
- Costello, C. & Polasky, S. (2004) Dynamic reserve site selection. Resource and Energy Economics, 26, 157–174.
- Day, J.R. & Possingham, H.P. (1995) A stochastic metapopulation model with variability in patch size and position. Theoretical Population Biology, 48, 333–360.
- Forsell, N. (2009) Planning under risk and uncertainty: optimizing spatial forest management strategies. PhD thesis, Swedish University of Agricultural Sciences.
- Forsell, N. *et al.* (in press) Management of the risk of wind damage in forestry: a graph-based Markov decision process approach. Annals of Operations Research, doi: 10.1007/s10479-009-0522-7.
- Garcia, F. (1999) Use of reinforcement learning and simulation to optimize wheat crop technical management. Proceedings of the International Conference on Modelling and Simulation (MODSIM) (eds L. Oxley & F. Scrimgeour), pp. 801–806. The Modelling and Simulation Society of Australia and New Zealand Inc., Hamilton, New Zealand.
- Garcia, F. & Sabbadin, R. (2001) Solving large weakly coupled Markov decision processes: application to forest management. Proceedings of the International Conference on Modelling and Simulation (MODSIM) (eds F. Ghassemi, D. Post, M. Sivapalan & R. Vertessy), pp. 1707–1712. The Modelling and Simulation Society of Australia and New Zealand Inc., Canberra.
- Haight, R.G. *et al.* (2004) Optimizing reserve expansion for disjunct populations of San Joaquin kit fox. Biological Conservation, 117, 61–72.
- Hilborn, R. (1976) Optimal exploitation of multiple stocks by a common fishery: a new methodology. Journal of the Fisheries Research Board of Canada, 33, 1–5.
- Hunter, C.M. & Runge, M.C. (2004) The importance of environmental variability and management control error to optimal harvest policies. The Journal of Wildlife Management, 68, 585–594.
- Johnson, F.A. *et al.* (1997) Uncertainty and the management of mallard harvests. The Journal of Wildlife Management, 61, 202–216.
- Lubow, B.C. (1996) Optimal translocation strategies for enhancing stochastic metapopulation viability. Ecological Applications, 6, 1268–1280.
- Mawson, P. (2004) Captive breeding programs and their contribution to *Western Shield*: *Western Shield* review – February 2003. Conservation Science Western Australia, 5, 122–130.
- McCarthy, M.A., Possingham, H.P. & Gill, A.M. (2001) Using stochastic dynamic programming to determine optimal fire management for Banksia ornata. Journal of Applied Ecology, 38, 585–592.
- McDonald-Madden, E. *et al.* (2008) The need for speed: informed land acquisitions for conservation in a dynamic property market. Ecology Letters, 11, 1169–1177.
- Meir, E., Andelman, S. & Possingham, H.P. (2004) Does conservation planning matter in a dynamic and uncertain world? Ecology Letters, 7, 615–622.
- Nicol, S. & Possingham, H.P. (2010) Should metapopulation restoration strategies increase patch area or number of patches? Ecological Applications, 20, 566–581.
- Péret, L. & Garcia, F. (2004) On-line search for solving Markov decision processes via heuristic sampling. Proceedings of the 16th European Conference on Artificial Intelligence (eds R. de Mántaras & L. Saitta), pp. 530–535. IOS Press, Valencia, Spain.
- Peyrard, N. *et al.* (2007) A graph-based Markov decision process framework for optimising integrated management of diseases in agriculture. Proceedings of the International Conference on Modelling and Simulation (MODSIM) (eds L. Oxley & D. Kulasiri), pp. 2175–2181. The Modelling and Simulation Society of Australia and New Zealand Inc., Christchurch, New Zealand.
- Possingham, H.P. (1996) Decision theory and biodiversity management: how to manage a metapopulation. Frontiers of Population Ecology (eds R.B. Floyd, A.W. Sheppard & P.J.D. Barro), pp. 391–398. CSIRO, Melbourne.
- Possingham, H.P. *et al.* (1993) The mathematics of designing a network of protected areas for conservation. 12th Australian Operations Research Conference (eds D. Sutton, F. Cousins & C. Pierce), pp. 536–545. The Australian Society for Operations Research Inc., Adelaide.
- Possingham, H.P. *et al.* (2001) Making smart conservation decisions. Conservation Biology: Research Priorities for the Next Decade (eds M.E. Soulé & G.H. Orians), pp. 225–244. Island Press, Washington.
- Priddel, D. & Wheeler, R. (1996) Effect of age at release on the susceptibility of captive-reared malleefowl *Leipoa ocellata* to predation by the introduced fox *Vulpes vulpes*. EMU, 96, 32–41.
- Priddel, D. & Wheeler, R. (1997) Efficacy of fox control in reducing the mortality of released captive-reared malleefowl, *Leipoa ocellata*. CSIRO Wildlife Research, 24, 469–482.
- Puterman, M.L. (1994) Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, New York, NY.
- Regan, T.J. *et al.* (2006) Optimal eradication: when to stop looking for an invasive plant. Ecology Letters, 9, 759–766.
- Richards, S.A., Possingham, H.P. & Tizard, J. (1999) Optimal fire management for maintaining community diversity. Ecological Applications, 9, 880–892.
- Rout, T.M., Hauser, C.E. & Possingham, H.P. (2007) Minimise long-term loss or maximise short-term gain? Optimal translocation strategies for threatened species. Ecological Modelling, 201, 67–74.
- Shea, K. & Possingham, H.P. (2000) Optimal release strategies for biological control agents: an application of stochastic dynamic programming to population management. Journal of Applied Ecology, 37, 77–86.
- Spring, D.A. & Kennedy, J.O.S. (2005) Existence value and optimal timber-wildlife management in a flammable multistand forest. Ecological Economics, 55, 365–379.
- Strange, N., Thorsen, B.J. & Bladt, J. (2006) Optimal reserve selection in a dynamic world. Biological Conservation, 131, 33–41.
- Sutton, R.S. & Barto, A.G. (1998) Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts.
- Tenhumberg, B. *et al.* (2000) Stochastic variation in food availability influences weight and age at maturity. Journal of Theoretical Biology, 202, 257–272.
- Thomson, P.C. *et al.* (2000) The effectiveness of a large-scale baiting campaign and an evaluation of a buffer zone strategy for fox control. CSIRO Wildlife Research, 27, 465–472.
- Todorov, E. (2009) Efficient computation of optimal actions. Proceedings of the National Academy of Sciences of the United States of America, 106, 11478–11483.
- Tuck, G.N. & Possingham, H.P. (2000) Marine protected areas for spatially structured exploited stocks. Marine Ecology Progress Series, 192, 89–101.
- (1975) Optimal harvest strategies for salmon in relation to environmental variability and uncertain production parameters. Journal of the Fisheries Research Board of Canada, 32, 1777–1784.
- Westphal, M.I. *et al.* (2003) The use of stochastic dynamic programming in optimal landscape reconstruction for metapopulations. Ecological Applications, 13, 543–555.
- Wilson, K.A., McBride, M.F., Bode, M. & Possingham, H.P. (2006) Prioritizing global conservation efforts. Nature, 440, 337–340.

### Supporting Information

**Appendix S1.** Obtaining viable management options and deriving the transition matrix for the conservation of the Bakara malleefowl.

**Appendix S2.** A population viability analysis for the effects of predation on the malleefowl.

**Appendix S3.** Supplementary information on Péret and Garcia’s heuristic sampling algorithm.

As a service to our authors and readers, this journal provides supporting information supplied by the authors. Such materials may be re-organized for online delivery, but are not copy-edited or typeset. Technical support issues arising from supporting information (other than missing files) should be addressed to the authors.

| Filename | Format | Size | Description |
| --- | --- | --- | --- |
| MEE3_69_sm_suppmat.zip | | 117K | Supporting info item |

Please note: Wiley Blackwell is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.