The 2023 International Planning Competition

In this article, we present an overview of the 2023 International Planning Competition. It featured five distinct tracks designed to assess cutting-edge methods and explore the frontiers of planning within these settings: the classical (deterministic) track, the numeric track, the Hierarchical Task Networks (HTN) track, the learning track, and the probabilistic and reinforcement learning track. Each of these tracks evaluated planning methodologies through one or more sub-tracks, with the goal of pushing the boundaries of current planner performance. To achieve this objective, the competition introduced a combination of well-established challenges and entirely novel ones. Within this article, each track offers an exploration of its historical context, justifies its relevance within the planning landscape, discusses emerging domains and trends, elucidates the evaluation methodology, and ultimately presents the results.


INTRODUCTION
Automated planning (Ghallab, Nau, and Traverso 2004) is dedicated to the development of systems that can reason, strategize, and make decisions to accomplish specific goals. At its core, automated planning involves the formulation of actions, their prerequisites, and effects on the world, enabling machines to effectively address real-world challenges. Since 1998, the International Planning Competition (IPC) (Coles et al. 2012; Long and Fox 2003; McDermott 2000; Vallati et al. 2015) has been organized on a biennial or triennial basis with the objectives of promoting the advancement and evaluation of planning methodologies and coordinating the creation of new, challenging benchmarks.
This article provides an overview of IPC-2023, which featured a record five tracks, illustrating the rich diversity within the planning research community. This installment of the IPC reflects the ongoing evolution of the field of planning, encompassing long-established tracks like classical planning while also introducing new ones such as a numeric track and a probabilistic and reinforcement learning track. These new tracks show well-established tracks coexisting with two emerging trends: collaboration with the machine learning and reinforcement learning communities, and the need for efficient numeric reasoning in planners.
Portfolio submissions have been a hotly debated topic in the IPC. Given that many planning techniques work well in some domains but encounter difficulties with others, distributing the available time between different components or selecting components based on instance features typically achieves higher performance than running only the individual components. There are two main criticisms of portfolios: one concerns the attribution of success (running a portfolio of components A and B can surpass component A, even if A does most of the heavy lifting in the portfolio); the other pertains to scientific insight (analyzing the success of a portfolio is more challenging than that of its components). Some planners can clearly be seen as portfolios, and some can clearly be seen as nonportfolios. However, there is a large gray zone in between with techniques that use multiple components in a more integrated way. For example, LAMA (Richter, Westphal, and Helmert 2011) is typically not regarded as a portfolio but runs multiple different search algorithms and heuristics. Any specific definition of portfolios would provide an incentive to develop techniques that circumvent this definition, whether they are useful or not. For instance, if loose coupling of planning techniques were to define a portfolio, search methods that interleave multiple techniques in a sophisticated way would not be classified as portfolios but would still benefit from running all of their components. We decided not to draw this line and allowed portfolio submissions.
The five IPC-2023 tracks were as follows:
The Classical Track is the oldest track and centers on fully observable environments where actions are atomic and have deterministic effects. In 1998, the Planning Domain Definition Language (PDDL) (McDermott 2000) was established as the standard for this and other tracks, and it continues to be developed and used to this day.
The Numeric Track concentrates on deterministic planning problems with numeric state variables, emphasizing quantitative aspects of planning. Numeric planning problems can be expressed in version 2.1 of PDDL (Fox and Long 2002).
The HTN Track is focused on planning tasks that involve hierarchical structures. This is the second iteration of this track (Behnke, Höller, and Bercher 2021). As in the first iteration, this year's competition uses the Hierarchical Domain Definition Language HDDL (Höller et al. 2020).
The Learning Track evaluates systems that learn domain-specific knowledge and use it to assist a general planner in solving unseen tasks from the same planning domain. This track also uses PDDL as planner input language (Fern, Khardon, and Tadepalli 2011).
The Probabilistic and RL Track is an advancement from the earlier probabilistic tracks (Younes et al. 2005), encompassing probabilistic and reinforcement learning elements. In the 2011 competition, the Relational Dynamic Influence Diagram Language (RDDL) (Sanner 2010b) was introduced as the standard language for this track. RDDL was developed within the context of the competition, and an extended version is used in this year's competition (Taitler et al. 2022).
In the following sections, we survey each of these tracks in turn. Table 1 summarizes the metadata of the competition across the tracks.

History and motivation
The classical track has been part of the IPC since the inaugural competition in 1998, and this is the tenth IPC featuring it. The classical track focuses on the core of the planning problem without any of the extensions covered by the other tracks.
In the first three IPCs, all planners competed on the same playing field, but since 2004, there are subtracks for optimal planners (that guarantee plans with minimal costs) and satisficing planners (that have no such guarantee). With optimal planners, the goal is to find optimal plans for as many tasks as possible within some given resource limitations, while a good satisficing planner should discover short plans as quickly as possible. Historically, the satisficing track focused mostly on plan quality and ignored solution time. In 2014, an agile track was added that focused on solution time and ignored plan quality. In 2023, we again offered an optimal, a satisficing, and an agile subtrack.

TABLE 1 Metadata of the competition, including the tracks, how many editions each track previously held, when the track last took place, how many entries were submitted to each track in the 2023 event, the input language, and a link to each track's webpage for additional information.

Compared to previous iterations of the competition, we emphasized PDDL features that make modeling a planning task easier. That is, we tried to promote the idea that writing a PDDL specification of a domain should be made as easy as possible, and planning systems should be able to perform normalization and preprocessing of the input planning task automatically. Our domains make full use of conditional effects, negative preconditions, quantifiers, disjunctions, imply conditions, and negative goal conditions. While conditional effects, negative preconditions, and to some degree quantifiers occur in previous IPC domains, the use of disjunctions and negative goal conditions is new, and some planners did not yet support them. Our hope is that by introducing these features, planner authors will be pushed to add support for them. As an intermediate solution, we offered a tool to compile these features away. We also ran all planners on both the original domain and one with those features compiled away, counting the better result in each domain.

New domains introduced
The classical track used seven new domains. In their selection, we focused on getting practically interesting domains without having particular planning techniques in mind, that is, the encoding or structure of the tasks is not intentionally optimized such that a particular technique works well.
Three domains, Folding, Ricochet Robots, and Slitherlink, were based on the domains previously used in Answer Set Programming competitions (e.g., Gebser, Maratea, and Ricca 2016, 2020). The reformulations of these domains used here contained zero-cost actions, disjunctions over static predicates, and negative goal conditions. The variant of the Rubik's Cube domain (modeling a well-known puzzle) used this year was formulated with a large number of conditional effects that are impossible to compile away without a significant blow-up of the planning tasks. Another domain, Recharging Robots, models a cooperative task of multiple robots that need to exchange a battery charge with each other in order to achieve the common goal. This domain was formulated with disjunctive preconditions, universal quantifiers, imply conditions, and conditional effects, all of which are possible to compile away without a significant blow-up of the final representation.
The Labyrinth domain was based on a board game where the agent has to travel through a maze that changes between the agent's movements. This domain contains auxiliary zero-cost actions, and it can be challenging for the grounding process as the maze can be arbitrarily reconfigured and therefore the agent can eventually move between almost all pairs of locations in one step. Finally, the winner of the Outstanding Domain Submission Award, Quantum Circuit Layout Synthesis (Shaik and van de Pol 2023), models a real-world problem of mapping a logical quantum circuit expressing a quantum algorithm to a real-world quantum circuit hardware platform. The tasks from this domain seem to be relatively easy to solve suboptimally, but getting optimal (or near optimal) solutions remains challenging.
All domains, except one, come with a task generator providing an opportunity to generate a different set of tasks, possibly with a different scaling than the one used in the IPC. The only exception is the Quantum Circuit Layout Synthesis domain, which comes with PDDL formulations of the problems from the standard benchmark set used in the area of layout synthesis for quantum computing (Tan and Cong 2020). Moreover, each domain (again, except Quantum Circuit Layout Synthesis) is released with either a solver, or its generator provides at least a suboptimal solution alongside the output task.

Evaluation methodology
We used a submission framework very similar to the previous classical tracks at IPC 2018. Participants registered by email and we created repositories for them. They then pushed planner sources together with recipe scripts for the container solution Apptainer (Kurtzer, Sochat, and Bauer 2017). This way of handling submissions was tested in 2018 and has proven useful for several reasons. First, using repositories simplifies access management and code updates for possible bug fixes. Second, using a container solution simplifies the build process on our side, in particular, dealing with a diverse set of library dependencies of the different planners. It also unifies the build setup and makes reusing the planners in a reproducible environment much easier.
Compared to 2018, we changed our bug-fixing policy: where in 2018 much effort was spent on identifying and catching bugs in the planners, we tried to outsource this effort to the planner authors, who are more familiar with the behavior of their code. To that end, we had several rounds of test runs in which planners were evaluated on a subset of the competition instances. The results of these test runs and our parsed results were then published to the planner authors. They were then responsible for checking whether their planner performed as expected and whether our scripts interpreted the results correctly. If this was not the case, the authors could send pull requests with bug fixes both to their planners and to our scripts. We reviewed these pull requests to ensure they did not tune the behavior of the planner to the known instances. This was an experimental step towards a potential full automation of the IPC similar to the Grid-Based Path Planning Competition.
After compiling the planner images, each image was run on all instances of a track for 30 min (optimal and satisficing tracks) or 5 min (agile track) with 8 GiB of available memory. In the optimal track, the score of an instance was 1 if it was solved optimally, and 0 otherwise. In the satisficing track, the score of a solved task is the ratio $C^*/C$, where $C$ is the cost of the cheapest discovered plan and $C^*$ is the cost of a reference plan. The score on an unsolved task is 0. It is important that the reference plan is independent of the participants (Seipp 2019). We found optimal reference plans for most instances and plans that are at least as good as any plan found by a participant for most of the remaining cases. In very few cases, we had to fall back to using the best plan discovered by a participant as the reference solution. In the agile track, the score of a task is 1 if the task was solved within 1 s and 0 if the task was not solved within the resource limits. If the task was solved in $T$ seconds ($1 \leq T \leq 300$), then its score is $1 - \log(T)/\log(300)$. This scoring accounts for the exponential growth in task difficulty while ignoring small differences in the subsecond range. In all tracks, each task thus gets a score between 0 and 1. These scores are summed up to obtain the final score of a planner. All domains have the same number of tasks, so normalizing with the domain size is not necessary.
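The three scoring rules described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the organizers' evaluation scripts: the function names are hypothetical, and the handling of zero-cost plans and the cap at 1 in the satisficing score are assumptions that the text does not spell out.

```python
import math


def optimal_score(solved_optimally):
    """Optimal track: 1 for an optimally solved task, 0 otherwise."""
    return 1.0 if solved_optimally else 0.0


def satisficing_score(plan_cost, reference_cost):
    """Satisficing track: reference cost divided by the cheapest found cost.

    plan_cost is None for an unsolved task.
    """
    if plan_cost is None:
        return 0.0
    if plan_cost == 0:
        # Assumption: a zero-cost plan cannot be improved, so it scores 1.
        return 1.0
    # Assumption: cap at 1 so that every task score stays within [0, 1].
    return min(1.0, reference_cost / plan_cost)


def agile_score(runtime_seconds, time_limit=300.0):
    """Agile track: 1 up to one second, then logarithmic decay to 0."""
    if runtime_seconds is None or runtime_seconds > time_limit:
        return 0.0
    if runtime_seconds <= 1.0:
        return 1.0
    return 1.0 - math.log(runtime_seconds) / math.log(time_limit)


def total_score(task_scores):
    """A planner's final score is the plain sum of its per-task scores."""
    return sum(task_scores)
```

With the default limit of 300 s, a task solved in about 17.3 s (the square root of 300) scores 0.5, which shows how the logarithm compresses differences between long runtimes.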
After the competition, all domains, tasks, reference plans, and bounds were made available both through our homepage and through the general planning task collection http://planning.domains. All planners are available as repositories on GitHub, including their Apptainer recipes. They can also be installed through planutils (Muise et al. 2022). Planner abstracts for all planners are also available through the competition homepage. They include post-competition analyses where the authors analyze their performance retroactively.

Competition results and discussion
A total of 65 entries made by 23 teams participated across all classical tracks. Out of the 65 entries, 25 self-identify as portfolios, but as discussed earlier, the definition is not clear-cut, and other entries that use fall-back components for unsupported features could also be classified as portfolios. We accepted multiple submissions of a planner if the authors suspected a strong difference in performance between two variants of their planner. Additionally, most planners participated in multiple subtracks. Thus, these 65 entries are based on 24 code bases. As the line between a single planner being configured differently for two subtracks and two different planners that share the same code base is blurry, we asked planner authors to submit separate planner abstracts if they consider two of their entries to be different planners. According to this metric, 34 distinct planners participated in the classical tracks. The teams consisted of 47 distinct authors from 19 affiliations, where many authors contributed to more than one entry. We were happy to see that roughly half of the authors participated in an IPC for the first time.
As in the previous iterations, the classical planners submitted to the IPC 2023 utilized a wide range of techniques stretching to almost all corners of the classical planning area: from explicit state-space search to symbolic search to decoupled search, from partial order reduction to symmetry breaking to dominance pruning, from abstraction heuristics to delete-relaxation heuristics to heuristics based on linear programming. Competitors submitted lifted planners, ground planners, as well as portfolios of various techniques, equipped with several different grounders.
Overall, we think that IPCs have become a showcase of the wealth of classical planning techniques. This also means the final results should not be read and interpreted in a way where the winners represent the current state of the art and other entries do not (anymore). All techniques implemented in the submitted planners have their strengths and weaknesses and should be considered based on the structure of the task at hand. Moreover, the set of competition domains selected this year (or any other year) is by no means representative of all possible tasks, and so the overall scores of the competing planners are necessarily biased. As discussed before, we tried to steer this bias in the direction of practically interesting domains rather than problems that are easily solved by a particular technique. Nevertheless, the IPC should encourage competitors to invest in their implementations, it should nurture a healthy level of competitiveness as well as collaboration between researchers in the community, and above all it should be fun.
In the optimal track (see Table 2), 11 out of 22 competitors have poor (or completely lack) support for conditional effects, which incurred a serious penalty in the overall scores because one of the domains (Rubik's Cube) cannot be solved without it. The structural variety among the selected domains seemed to successfully probe various strengths and weaknesses of different planning techniques: even planners in the second half of the leaderboard were able to achieve the highest (or close to the highest) score in some of the individual domains. The performance in the domains Folding, Labyrinth, and Ricochet Robots seems to be what differentiates the top-ranking planners from the rest. The winner of the optimal track, Ragnarok (Drexler et al. 2023), is a portfolio planner of an explicit state-space search, decoupled search, symbolic search, and a lifted planner with various heuristics. It solved the highest number of tasks in four out of seven domains and only one below the highest score in the rest.
In the satisficing and agile tracks (see Tables 3 and 4), the planners performing best in individual domains were spread more evenly across the leaderboard than in the optimal track. Also, the support for conditional effects was more common than in the optimal track. The winners of the satisficing track, Scorpion Maidu (Corrêa et al. 2023b) and Levitron (Corrêa et al. 2023a), scored highest in only two domains. Both are portfolio planners that differ in that Levitron incorporates a lifted planner whereas Scorpion Maidu does not. The winner of the agile track, DecStar (Gnad, Álvaro Torralba, and Shleyfman 2023), surprisingly did not perform best in any individual domain but did well across all domains, especially in Rubik's Cube.
Our baseline planner, LAMA (winner of IPC 2011) (Richter, Westphal, and Helmert 2011), surprisingly achieved very high scores in the agile and satisficing tracks. No planner scored higher in the agile track and only three competing planners scored higher in the satisficing track. However, as it was considered a baseline, not a participant, it was not considered when determining the tracks' winners.
Overall, the biggest challenge across the board seemed to be the support for PDDL features such as quantifiers, disjunctions, imply conditions, and negative goal conditions. Since for every domain we provided an alternative (automatically generated) formulation without these features, planners were not penalized in their final scores for not supporting them. However, it was often the case that planners performed better on the alternative formulation than on the original one. The most striking case was the Slitherlink domain, which uses just negative preconditions over static predicates (which is usually not problematic for current planners) and negative goal conditions. Yet, very few planners were able to solve any task in the original formulation of this domain, but they did much better in the alternative formulation where the negative goal conditions were compiled away with our tool.
Numeric planning, in a formal sense, does not represent a substantial departure from classical planning. Both types of planning still rely on a deterministic transition system to model the world. However, there is a key distinction: while classical planning problems typically induce finite transition systems, numeric planning problems can result in infinite state spaces. This characteristic makes numeric planning problems undecidable in general, as discussed by Helmert (2002); decidability can be recovered only if variables are properly bounded (Gigante and Scala 2023). Numeric planning allows the definition of numeric conditions (e.g., $x + y \geq 10$) and effects (e.g., $x := 10 + y$). Although numeric planning can in its general form support nonlinear expressions both in the conditions and in the effects, to lower the participation barrier, we only consider two simpler fragments: simple numeric planning (SNP) (Scala et al. 2020) and linear numeric planning (LNP). SNP is a simpler subset of numeric planning, where numeric variables can only be increased or decreased by a constant and numeric conditions are limited to linear expressions. This subset is expressive enough to allow the formalization of interesting problems. LNP extends SNP by not only allowing numeric variables to be assigned directly, but also allowing increases, decreases, and assignments to linear combinations of variables.
Preconditions and goals in the evaluated problems were arbitrary propositional formulas. That is, they included quantifiers, negative preconditions, disjunctive preconditions, and any number of numeric predicates. Conditional effects were not considered. Terms were either literals as in classical planning, or numeric terms of the form $f(x) \; \{\geq, >, =\} \; 0$, where $f(x)$ is a linear expression.
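To make the difference between the two fragments concrete, the following Python sketch represents linear expressions, numeric effects, and numeric conditions, and classifies an effect as SNP or LNP according to the description above. All class and function names are hypothetical and chosen for illustration only; they do not correspond to any planner's actual data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearExpr:
    """constant + sum(coefficient * variable) over (variable, coefficient) pairs."""
    constant: float = 0.0
    coeffs: tuple = ()

    def evaluate(self, state):
        return self.constant + sum(c * state[v] for v, c in self.coeffs)

    def is_constant(self):
        return all(c == 0 for _, c in self.coeffs)


@dataclass(frozen=True)
class Effect:
    variable: str
    op: str  # one of "increase", "decrease", "assign"
    rhs: LinearExpr


def fragment(effect):
    """SNP effects only increase or decrease a variable by a constant."""
    if effect.op in ("increase", "decrease") and effect.rhs.is_constant():
        return "SNP"
    return "LNP"


def holds(expr, comparator, state):
    """Check a numeric condition of the form expr {>=, >, =} 0."""
    value = expr.evaluate(state)
    return {">=": value >= 0, ">": value > 0, "=": value == 0}[comparator]


def apply_effect(effect, state):
    """Return the successor state after applying a single numeric effect."""
    value = effect.rhs.evaluate(state)
    successor = dict(state)
    if effect.op == "increase":
        successor[effect.variable] = state[effect.variable] + value
    elif effect.op == "decrease":
        successor[effect.variable] = state[effect.variable] - value
    else:
        successor[effect.variable] = value
    return successor


# x := 10 + y is an LNP effect; increasing x by 1 is an SNP effect.
assign_x = Effect("x", "assign", LinearExpr(10.0, (("y", 1.0),)))
bump_x = Effect("x", "increase", LinearExpr(1.0))
# The condition x + y >= 10 is written as x + y - 10 >= 0.
cond = LinearExpr(-10.0, (("x", 1.0), ("y", 1.0)))
```

Note that the condition `cond` is linear in both fragments; only the shape of the effects separates SNP from LNP in this sketch.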
Similarly to the classical track, the numeric track also featured three different subtracks: optimal, where plans with minimal costs are guaranteed; agile, which only focused on solution time and ignored plan quality; and satisficing, which aimed to find the best possible plans in the given time limit.
In the optimal track, the optimization functions were always a minimization.Problems were either a minimization of action costs, or a metric function that was a weighted sum with positive coefficients where each involved variable could only be increased by the actions.

New domains introduced
Given that resources and numeric state variables are a natural extension from classical planning problems, unless the domains were chosen carefully, there was the risk of evaluating numeric planners on solving mainly Boolean structures. Therefore, there was an emphasis on selecting domains that exhibited a notable numeric structure.
The origins of the domains used for the competition were a mix of domains used in previous IPCs (such as Zenotravel or Settlers from the IPC-5), interesting domains introduced in various publications over the years (such as FO-Sailing or Counters), and some new domains (such as Drone or Ext-Plant-Watering). A total of 20 domains were selected to be used in the competition, of which seven were LNP domains and 13 SNP domains. In turn, each domain had 20 instances selected with various degrees of difficulty.

Evaluation methodology
We collected a grand total of six planners from two different teams. The first team submitted the NLM-Plan system, which exposed four different heuristic search planners. These four planners are described in detail in an extended abstract. The second team submitted OMTPlan, which exposed two different SMT-based planners. Those are described in detail in a recent publication (Leofante 2023). All participants were given a GitHub account that they used to upload the source code of the planners. We then compiled all the source code and ran all experiments in the Cirrus HPC Service at EPCC. Each run used an Intel Xeon E5-2695 (Broadwell) with a memory limit of 8 GiB per process. The competition used exactly the same scoring formulas as the classical planning track. All planners and benchmarks are available at https://github.com/ipc2023-numeric.

Competition results and discussion
As not all submitted planners supported the LNP fragment, we had different rankings for SNP, LNP, and for both together. The NLM-based planners won all subtracks. More precisely: NLM-CutPlan won the Optimal LNP, NLM-CutPlan SAT2 won the Satisficing SNP and the Agile SNP subtracks, NLM-CutPlan Orbit won the two other optimal tracks, that is, Optimal SNP and Optimal SNP + LNP, and NLM-CutPlan SAT won all the remaining tracks, as summarized in Table 5.
As baselines, we used a well-known greedy best-first search algorithm with state-of-the-art heuristics from the literature (Scala et al. 2020) and $A^*$ with the numeric extension of the ℎ heuristic (Scala et al. 2017) for the satisficing and the optimal track, respectively. These are implemented in the ENHSP planning system. The competing planners were not able to improve on the baseline used in the satisficing track, showing the strength of the current heuristic there. On the other hand, NLM-Plan notably improved the capacity of numeric planners to prove optimality. Finally, considering the competitors, no planner had total dominance: OMTPlan was the best planner in 17% of the instances. When considering the number of SNP + LNP instances solved per domain, OMTPlan tied with NLM-Plan in five domains and won in two domains.
We envision a number of challenges for the community. While there has been a substantial advancement in the optimal setting, in the satisficing setting the baseline proved to be even more competitive than the participating planners. On closer inspection of the raw data, we can, however, observe that this gap is only pronounced in SNP problems. For linear planning problems, and in particular over the linear version of the Sailing domain, the baseline solved zero instances, while NLM-CutPlan SAT solved up to the 14th instance. Therefore, it is possible that much more can be done for problems requiring more advanced forms of reasoning, for instance, those involving nonlinear dependencies. Another aspect worth investigating is portfolio solutions. Indeed, as observed before, no planner had total dominance. Properly characterizing numeric planning problems would allow the creation of portfolios, opening the possibility of combining the best approaches into one single planner; this approach has already proven useful in classical planning, and it is likely that the same applies to the numeric context, too.

History and motivation
Hierarchical Task Network (HTN) planning allows automated planning formalisms to be extended with complex hierarchical expert knowledge, imposing additional structural constraints for valid plans (Bercher, Alford, and Höller 2019). A planning task is represented not only by a set of actions and an initial state but also by a set of tasks to be achieved. Primitive tasks can be achieved by executing them as actions, whereas compound tasks are achieved by recursively replacing them with subtasks, following certain conditional decomposition methods. In Total Order (TO) HTN planning, all sets of subtasks as well as the set of initial tasks are totally ordered, that is, they are sequences. This restriction is popular and relevant since it renders the formalism decidable, as opposed to semi-decidable Partial Order (PO) HTN planning, where arbitrary ordering constraints can be imposed between tasks (Erol, Hendler, and Nau 1994). The high expressive power of HTN planning is exploited by many applications, for example, cooperative robotics (Bevacqua et al. 2015), AI in video games (Vellido, Fdez-Olivares, and Pérez 2020), or assistance for complex handicraft tasks (Behnke et al. 2019).
IPC-2023 featured the second iteration of an HTN planning track, following IPC-2020, which focused entirely on HTN planning. As in 2020, all benchmark problems were provided in the HDDL format (Höller et al. 2020), in its adapted version for IPC 2020. We selected a total of 22 TO planning domains and 10 PO domains for evaluating planners. While the majority of planning domains were already a part of IPC-2020 (namely 20 TO domains and eight PO domains), four new domains were introduced.
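The decomposition process for total-order HTN planning can be illustrated with a minimal depth-first progression search in Python. This is a sketch under simplifying assumptions, not how any of the competing planners works: tasks are plain strings without parameters, the toy delivery domain is invented for illustration, and the depth bound on method applications is an added safeguard against endless recursion.

```python
def solve(state, tasks, actions, methods, depth=50):
    """Return a list of action names that achieves the task sequence, or None.

    state   : frozenset of facts that currently hold
    tasks   : tuple of task names, processed strictly from left to right
    actions : {name: (preconditions, add_effects, delete_effects)}
    methods : {compound_task: [(preconditions, subtask_list), ...]}
    """
    if not tasks:
        return []
    head, rest = tasks[0], tasks[1:]
    if head in actions:
        # Primitive task: execute it as an action if it is applicable.
        pre, add, delete = actions[head]
        if not pre <= state:
            return None
        tail = solve((state - delete) | add, rest, actions, methods, depth)
        return None if tail is None else [head] + tail
    if depth == 0:
        return None
    # Compound task: replace it by the subtasks of an applicable method.
    for pre, subtasks in methods.get(head, ()):
        if pre <= state:
            plan = solve(state, tuple(subtasks) + rest, actions, methods, depth - 1)
            if plan is not None:
                return plan
    return None


fs = frozenset
# Hypothetical toy domain: carry a package from location a to location b.
actions = {
    "pickup": (fs({"at-a"}), fs({"holding"}), fs()),
    "move-a-b": (fs({"at-a"}), fs({"at-b"}), fs({"at-a"})),
    "move-b-a": (fs({"at-b"}), fs({"at-a"}), fs({"at-b"})),
    "drop": (fs({"holding", "at-b"}), fs({"delivered"}), fs({"holding"})),
}
methods = {
    "deliver": [
        (fs({"at-a"}), ["pickup", "move-a-b", "drop"]),
        (fs({"at-b"}), ["move-b-a", "deliver"]),
    ],
}
plan = solve(fs({"at-b"}), ("deliver",), actions, methods)
print(plan)  # ['move-b-a', 'pickup', 'move-a-b', 'drop']
```

Because the task list is a sequence, the search only ever has to consider the first remaining task, which is what keeps the total-order case simpler than the partial-order one.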
#SAT (a.k.a. SharpSAT, TO domain) models a simple algorithm (Birnbaum and Lozinskii 1999) for exact model counting, that is, counting the number of satisfying assignments for a given propositional formula. This task is #P-complete and thus a challenging task even for very small propositional formulas. Lamps (TO) models a variant of the game "Lights Out," which is about an $n \times n$ field with lamps that can either be on or off. Switching a lamp forces all horizontally and vertically connected lamps of the same status (on or off) to also toggle. This reachability-based procedure can be easily modeled with an HTN, but is hard to express using classical planning. Ultralight-Cockpit (PO) models emergency procedures in ultralight aircraft. Given the current situation of the aircraft and the nature of the emergency, the plan represents instructions to the pilot on how to handle this emergency and how to recover the aircraft's situation safely. Coloring (PO) encodes a version of the tiling problem (van Emde Boas 1997), which is frequently used for complexity reductions. Given a set of available tiles, each having a color at one of its edges, the task is to fill an $n \times n$ square with these tiles such that touching edges have the same color. The outer edge has no required color. This problem is NP-complete for unary encoded $n$. The encoding uses the idea of the proof encoding a double-exponentially time-bounded Turing machine (Alford, Bercher, and Aha 2015).
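The switching rule of the Lamps domain amounts to a flood fill over lamps of equal status. The following Python sketch shows one way to compute the effect of a single switch; it illustrates the rule as stated in the text and is not taken from the domain's HDDL model. The function name and the grid representation (a list of rows of Booleans) are assumptions.

```python
from collections import deque


def switch(grid, row, col):
    """Toggle the lamp at (row, col) and every lamp of the same status that is
    connected to it through horizontal and vertical neighbors.

    Returns a new grid and leaves the input untouched.
    """
    rows, cols = len(grid), len(grid[0])
    status = grid[row][col]
    new_grid = [list(r) for r in grid]
    queue = deque([(row, col)])
    seen = {(row, col)}
    while queue:
        r, c = queue.popleft()
        new_grid[r][c] = not status
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in seen and grid[nr][nc] == status:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return new_grid


lamps = [
    [True, True, False],
    [False, True, False],
    [False, False, True],
]
# Switching the top-left lamp turns off the three connected lit lamps;
# the lit lamp in the bottom-right corner is not connected and stays on.
print(switch(lamps, 0, 0))
```

The size of the affected region depends on the current state, which is the reachability computation that the text describes as awkward to express with classical planning actions.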

Evaluation methodology
We offered two times three subtracks: TO and PO HTN planning on the one hand, and agile, satisficing and optimal scoring as in the classical and numeric tracks on the other hand. We measure plan length in terms of the number of actions, that is, primitive tasks, that are executed in the hierarchical solution.
Each submission was allowed up to three configurations per track. Planning approaches submitted by the same group were considered different submissions if the approaches' inner workings are sufficiently different. Planners were allowed to use up to 8 GiB of RAM and 30 min of running time per instance.
As in the classical track, participants had to notify the organizers via email of their intent and were then provided with a git repository for their planner. The HTN track also used Apptainer to automate compiling and running the planners. After the feature-stop, we performed test runs of the planners on the IPC domains. The organizers then searched for anomalies in the runs and notified the participants, who were given the opportunity to propose a pull-request to fix these issues, which were checked and accepted by the organizers.

Competition results and discussion
We received a total of eleven distinct submissions. Only a single submission, SIADEX (Fernandez-Olivares, Vellido, and Castillo 2021), participated in IPC-2020 as well. IPC-2020 TO winner HyperTensioN (Magnaguagno, Meneguzzi, and de Silva 2021) and runner-up Lilotane (Schreiber 2021) did not participate in 2023 (although two 2023 submissions, Lifted-Linear and LTP, build upon Lilotane). Some HTN planning approaches did not compete because they are (co-)authored by some of the organizers (Alford et al. 2016).
The submissions are roughly balanced in terms of ground versus lifted planning: Linear-Simple, Linear-Complex, OptiPlan, PandaDealer, PANDApro, and PANDA use a grounding procedure to simplify and prune the problem before planning. Aries, SIADEX, LTP, and Lifted-Linear do not perform grounding but rather operate directly on the parametrized description.
Table 6 shows the top performing systems for each subtrack. Across all TO tracks, PandaDealer was the most successful planner. It was followed rather closely by PANDA (agile, satisficing) and PANDApro (optimal). PandaDealer builds upon PANDApro and features a look-ahead technique, which evaluates and prunes branches early by checking state-related conditions that are propagated from preconditions and effects of primitive tasks (Olz et al. 2021). In the PO tracks, Linear-Simple and Linear-Complex dominated the satisficing and agile subtracks. Their winning configurations use ground progression search via PANDA or PandaDealer after linearizing the problem into a TO problem, falling back to regular POHTN planning.
Since planners behave differently across domains, it is important to also consider domain-dependent results. For example, in the TO satisficing track, the best configuration of TOAD scored best in seven domains, two more than the winner (a PandaDealer configuration). However, it performed poorly in several domains (including three without any solved instance): TOAD had better peak performance but was less reliable than PandaDealer. Let us also consider a virtual planner which, for each TO domain, selects the planner that solved the most instances. Overall, this virtual planner solves 590 instances, 45 more than any single planner (PandaDealer), and features at least PandaDealer, SIADEX, LTP, and TOAD.
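The virtual planner described above can be sketched in a few lines. This is only an illustration: the planner names, domain names, and coverage numbers below are hypothetical placeholders, not competition data, and the tie-breaking rule is our own assumption.

```python
from typing import Dict

# Hypothetical coverage table: solved[planner][domain] = solved instances.
# All names and numbers are made up for illustration.
solved: Dict[str, Dict[str, int]] = {
    "planner_a": {"domain_1": 20, "domain_2": 5, "domain_3": 12},
    "planner_b": {"domain_1": 14, "domain_2": 18, "domain_3": 0},
    "planner_c": {"domain_1": 9, "domain_2": 9, "domain_3": 15},
}


def virtual_best(solved: Dict[str, Dict[str, int]]) -> Dict[str, str]:
    """For each domain, select the planner that solved the most instances."""
    domains = sorted({d for per_domain in solved.values() for d in per_domain})
    choice = {}
    for domain in domains:
        # Assumption: ties are broken alphabetically by planner name.
        choice[domain] = max(
            sorted(solved), key=lambda p: solved[p].get(domain, 0)
        )
    return choice


choice = virtual_best(solved)
# Coverage of the virtual planner: sum of the per-domain maxima.
virtual_total = sum(solved[p].get(d, 0) for d, p in choice.items())
# Coverage of the best single planner, for comparison.
best_single = max(sum(per_domain.values()) for per_domain in solved.values())
```

By construction, the virtual planner's coverage is never lower than that of the best single planner; the gap between the two indicates how complementary the individual planners are.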
IPC 2020 was dominated by lifted planners: both winners and the runner-up of the TO track were lifted (HyperTensioN, SIADEX, and Lilotane). In both tracks, there was a significant performance gap between the lifted and grounded planners. Out of this year's winners and runners-up, only Aries (runner-up PO optimal) is lifted. Lifted planning nevertheless proved worthwhile, especially in the TO agile track: four domains were solved best by SIADEX alone and three domains were solved best by LTP alone. Overall, however, the results emphasize the progress brought by the latest progression search approaches with efficient and effective grounding (Behnke et al. 2020), new pruning strategies (Höller and Behnke 2021), and informed search decisions (Höller et al. 2019). These approaches outperform lifted and translation-based approaches on the majority of domains but do not fully dominate them. In terms of POHTN planning, this year's results suggest that many of the considered PO problems can be solved effectively by TOHTN planners through careful transformation. It remains to be seen whether such compilations continue to gain traction in future research to accelerate POHTN planning.

History and motivation
The learning track was introduced 15 years ago, and 2023 saw its fourth appearance at the IPC. The track consists of two separate phases: a learning phase, in which some type of domain knowledge is learned from a set of training instances from a planning domain; and an evaluation phase, in which a solver uses the learned domain knowledge to solve unseen instances from the same domain.
In the previous editions (2008, 2011, and 2014), the domains, example instances, and instance generators for the training phase were made available to the participants, so they could learn the domain knowledge on their own machines. Afterwards, the participants submitted the learned domain knowledge and their solvers to the organizers, who ran them on the evaluation tasks. Although the aim was to learn domain knowledge that allows planners to scale up to larger and more complex instances, the instances used in the first and second phase were usually of similar difficulty.
In this year's edition, we made three fundamental changes to the learning track. First, we fully automated the learning phase and let participants submit a learner that computes domain knowledge from a set of approximately 100 easy instances per domain without user intervention. Second, we evaluated the second part of their submission, a planner that uses the domain knowledge, on 30 easy, 30 medium, and 30 hard instances from the same domain. Third, we extended the traditional STRIPS fragment of the Planning Domain Definition Language (PDDL) with typed objects and negative preconditions.
Our main motivation for these changes is to (1) increase the reproducibility of research on learning for planning, where it is currently very hard to set up fair comparisons to baseline learning approaches, (2) encourage research on learning scalable domain knowledge instead of simply learning how to tweak the configuration of a planning system for a given domain, and (3) motivate the community to support a richer task input language.
The IPC 2014 edition of the Learning Track featured a "Best Learner Award" given to the system that maximizes the difference in plan quality between system versions running without and with the learned domain knowledge. In contrast, we only consider the version with learned domain knowledge, because for many systems it is not clear what would constitute a nonlearning version and how to ensure it has reasonable performance.

Domains
The competition used ten classical planning domains from earlier IPCs: Blocksworld, Childsnack, Ferry, Floortile, Miconic, Rovers, Satellite, Sokoban, Spanner, and Transport. Their hardness ranges from being optimally solvable in polynomial time (e.g., Spanner) to PSPACE-complete domains like Sokoban (Culberson 1997). To obtain reference plans for the evaluation tasks, we developed domain-dependent solvers for all ten domains and validated all reference plans using the Unified Planning framework. All benchmarks, domain-dependent solvers, and reference plans are available via the competition website, as well as all code submitted by the participants.

Evaluation methodology
We originally envisioned both a single-core CPU-only environment and a multi-core environment with GPU access. Since only one planner, Muninn, opted for the latter, we canceled the GPU variant. Thankfully, the Muninn authors agreed to compete in the CPU environment even though this heavily disfavors their approach. We limited the time and memory for computing domain knowledge to 24 h and 32 GiB, respectively. For the evaluation phase, we allowed at most 30 min and 8 GiB of memory per task. To evaluate the submissions, we used the same quality and agile scores as in the classical tracks. Since there were 10 domains with 90 evaluation tasks each, the maximum score was 900 points. Both the learners and the solvers were allowed to generate improved domain knowledge and plans as they progressed. Since the two metrics yielded the same ranking, we only report the quality score in the following.
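The score definitions are not restated in this section. As a rough sketch, the code below assumes the convention used in recent IPC classical tracks: the quality score of a solved task is the reference plan cost divided by the cost of the plan found, and the agile score is 1 for runtimes of at most one second and decays logarithmically to 0 at the time limit. The function names, the cap at 1, and the use of the 30 min limit as the agile cutoff are assumptions made for illustration.

```python
import math
from typing import Optional


def quality_score(plan_cost: Optional[float], reference_cost: float) -> float:
    """Assumed quality score: reference cost over plan cost, 0 if unsolved."""
    if plan_cost is None:  # task not solved
        return 0.0
    # Assumption: positive plan costs; the score is capped at 1.
    return min(1.0, reference_cost / plan_cost)


def agile_score(runtime: Optional[float], time_limit: float = 1800.0) -> float:
    """Assumed agile score: 1 up to one second, then 1 - log(t) / log(limit)."""
    if runtime is None or runtime > time_limit:
        return 0.0
    if runtime <= 1.0:
        return 1.0
    return 1.0 - math.log(runtime) / math.log(time_limit)


# 10 domains with 90 evaluation tasks each, at most 1 point per task.
MAX_SCORE = 10 * 90
```

For example, under these assumptions a plan twice as expensive as the reference plan earns 0.5 quality points, and an unsolved task earns none.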

Competition results and discussion
A total of six teams, each with one submission, participated in the learning track. ASNets 2023 (Hao et al. 2023) ported Action Schema Networks (Toyer et al. 2020) to Tensorflow 2. This neural network architecture can encode a generalized reactive policy by learning a common set of weights for all tasks in a domain. GOFAI (Torralba and Gnad 2023) learns which action schema instantiations are likely part of a plan. Then it uses this information to partially ground the given task (Gnad et al. 2019) and solve it with a Fast Downward configuration optimized by SMAC (Hutter, Hoos, and Leyton-Brown 2011). HUZAR (Gzubicki, Lachowicz, and Torralba 2023) (Drexler, Seipp, and Geffner 2023).
Table 7 shows the quality scores of the submitted systems. We omit NPGP from the table because it failed to learn domain knowledge (DK) in the evaluation domains due to the inherent limitations of the PGP learner it builds upon: it can handle neither domains with hierarchical typing nor action schemas using constants. ASNets 2023 also experienced difficulties, failing to learn DK in three domains and producing weak DK in seven others, receiving a quality score of just 29.1 points. Even though Muninn is optimized for GPUs, it was able to learn DK in all domains and even surpassed all other competitors in the Spanner domain. However, in total, it obtained a score of only 226.3 points.
Vanir, which specifically targets polynomial domains, only produced domain knowledge files for five domains, but the quality was high, achieving the highest score among competitors in three domains (Ferry, Rovers, and Satellite). Overall, Vanir achieved a score of 342.6 points. HUZAR is the runner-up of the competition with an overall quality score of 467.0 points, even though it stops after finding the first plan.
The competition was won by GOFAI (508.5 points) with its anytime approach to iteratively obtaining cheaper plans. GOFAI managed to achieve scores as high as or higher than the Fast Downward SMAC baseline that participated in the IPC 2014 Learning Track in eight out of 10 domains, indicating significant progress since the last iteration of the competition 9 years ago.
Despite these strides, the competition highlighted key areas for improvement, especially since GOFAI scored lower than the nonlearning baselines LAMA 2023 (Richter and Westphal 2010) and FDSS 2023 (Büchner et al. 2023) in five and eight domains, respectively. Future challenges are to develop robust learning systems that can handle more PDDL features, new domains, and new task distributions; to create learning algorithms that scale to harder domains; and to build planning systems using DK that outperform domain-independent classical planners on very large instances.

History and motivation
The probabilistic and RL track of IPC-2023 represents a significant departure from the probabilistic tracks held under the IPC umbrella in previous years. Since 2011, the probabilistic track has presented problems described using the Relational Dynamic Influence Diagram Language (RDDL; Sanner 2010b). This year, the track continued this tradition but introduced a single track with problems described in an extended version of RDDL. While previous iterations of the track focused on discrete MDPs and POMDPs, this year the emphasis shifted to continuous and mixed discrete-continuous MDPs, with a single industry-contributed discrete problem among the eight problems featured in the track.
There were two main goals in this year's track. Goal 1 was to redirect the community's focus toward realistic problems that showcase state exogenous noise, the effects of stochastic actions, concurrency, and, critically, mixed discrete-continuous transition dynamics. Goal 2, motivated by the observation that both the reinforcement learning and planning communities are interested in the types of domains outlined in the first goal, was to unite the probabilistic planning and reinforcement learning communities under a common competition and software framework. This aim also resulted in the name change of the track.
To enable the seamless integration of both reinforcement learning (RL) and planning methods, the original Java-based RDDLSim (Sanner 2010a) simulation framework has been replaced with a Python-based framework called pyRDDLGym (Taitler et al. 2022) that includes a fast vectorized simulator. This framework serves as an auto-generation toolkit for OpenAI Gym environments (Brockman et al. 2016), directly generated from raw RDDL files. As a result, pyRDDLGym is fully compatible with the standard interaction model and framework of the RL community. Additionally, it accommodates model-based planning methods by providing the RDDL domain and instance files that describe the model, along with compilations both to Jax computation graphs (Bradbury et al. 2018) and to an extended Algebraic Decision Diagram (XADD) (Sanner, Delgado, and de Barros 2011) format to represent the dynamic Bayesian network and influence graphs for the factored MDP underlying a ground RDDL instance.
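The standard interaction model mentioned above is the reset/step loop familiar from Gym-style environments. The sketch below illustrates that loop with a self-contained toy environment. It does not use pyRDDLGym or its API: the class `ToyReservoirEnv`, its dynamics, and the hand-written policy are hypothetical stand-ins, and the four-value return of `step` follows the classic Gym convention.

```python
import random


class ToyReservoirEnv:
    """Hypothetical Gym-style environment (not generated by pyRDDLGym)."""

    def __init__(self, horizon=10, capacity=100.0, seed=0):
        self.horizon = horizon
        self.capacity = capacity
        self.rng = random.Random(seed)
        self.t = 0
        self.level = 50.0

    def reset(self):
        """Start a new episode and return the initial state."""
        self.t = 0
        self.level = 50.0
        return {"level": self.level}

    def step(self, action):
        """Apply an action; return (state, reward, done, info)."""
        release = max(0.0, min(action["release"], self.level))
        rain = self.rng.uniform(0.0, 10.0)  # exogenous noise
        self.level = min(self.capacity, self.level - release + rain)
        self.t += 1
        reward = -abs(self.level - 50.0)  # penalize deviation from a target
        done = self.t >= self.horizon  # finite horizon
        return {"level": self.level}, reward, done, {}


def run_episode(env, policy):
    """Roll out one episode and return the accumulated reward."""
    state = env.reset()
    total, done = 0.0, False
    while not done:
        state, reward, done, _ = env.step(policy(state))
        total += reward
    return total


def policy(state):
    """Hand-written policy: release whatever exceeds the target level."""
    return {"release": max(0.0, state["level"] - 50.0)}


total_reward = run_episode(ToyReservoirEnv(seed=0), policy)
```

An RL agent or an online planner would take the place of `policy`; the loop itself stays the same, which is what makes a shared environment interface attractive for both communities.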
JaxPlan (Taitler et al. 2022), a model-based back-propagation planner, is included out-of-the-box with pyRDDLGym. It is provided to assist competitors in getting started and is used as an evaluation baseline, as explained in the Evaluation Methodology section.
Participants in this year's probabilistic and RL track competed across eight distinct domains. The track achieved an all-time high in registration numbers, with 29 competitors showing interest and signing up during the initial phase. In the second stage, which took place a month and a half prior to the competition, all participants were tasked with submitting a working planner. This stage served as a dry-run to assess the infrastructure and procedures, resulting in only four competitors successfully advancing to the competition.

New domains introduced
As part of the comprehensive overhaul of this year's track, all eight domains offered were entirely new. The problems exhibited a diverse range of properties; some were goal-oriented, while others involved steady-state control. One notable challenge this year was an oversubscribed multi-agent version of the Mars-Rover problem, inspired by Yliniemi, Agogino, and Tumer (2014). An interesting addition was the RecSim domain, a Recommender Systems domain contributed by Google (Mladenov et al. 2020). Although it was the only discrete domain in the competition, it featured very large enumerable state and action spaces, reaching up to 40,000 actions per time step in the largest instance of the competition.
Furthermore, a UAV problem with multiple drones was introduced based on a simplified version of the dynamics described in Hull (2007). The UAV problem incorporated both controllable and uncontrollable dynamic-model UAVs, with uncontrollable UAVs not contributing to the reward. This aspect called for model reasoning to determine which parts of the space were worth ignoring.
The domains were also categorized by problem type, ranging from classical control problems like Mountain-Car and RaceCar to operations research (OR) problems such as Reservoir control and Power Generation, and even extending to e-commerce.

Evaluation methodology
The online phase of this track took place in early June 2023, spanning a week. At the beginning of the week, three instances of each domain, gradually increasing in difficulty, were released to competitors. These instances were generated using prereleased instance generators, and their generation parameters were also provided. Competitors had 1 week to fine-tune their methods, and at the week's end, they submitted their planners. These submitted planners were evaluated on the three instances that had already been released, as well as on two additional instances that had not been seen before. The difficulty level of these new instances was designed to be more challenging than the easiest instance but less difficult than the hardest. The objective was to maximize the reward within a finite time horizon.
The scores for each instance were normalized to the [0,1] range. Any reward that fell below the maximum of the rewards achieved by a random policy and a no-operation policy (default unperturbed behavior) was set to zero. Conversely, the method with the highest accumulated reward, including the organizers' baseline planner, set the upper boundary at 1, and all other scores were normalized accordingly. JaxPlan served as a baseline and was executed in two modes, functioning as two separate baselines: straight-line planning (SLP) mode and deep reactive policy (DRP) mode (Bueno et al. 2019). The average reward for each instance was computed as the mean of 50 independent trials. Each trial had a four-minute allocation, with an additional 60 min provided before the trials for automatic hyper-parameter tuning.
The overall winner of the competition was the method that achieved the highest accumulated score across all instances, that is, five instances in each of eight domains for a total of 40 instances. Consequently, the score ranged from 0 to 40.
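The normalization described above can be sketched as follows. The text fixes the two endpoints (0 at the better of the random and no-operation policies, 1 at the best method); the linear interpolation between them is our assumption, and the function names are hypothetical.

```python
from statistics import mean


def normalized_score(reward, random_reward, noop_reward, best_reward):
    """Map an average reward to [0, 1] for one instance."""
    # Lower bound: the better of the random and no-operation policies.
    floor = max(random_reward, noop_reward)
    if reward <= floor or best_reward <= floor:
        return 0.0
    # Assumption: linear scaling between the floor (0) and the best method (1).
    return min(1.0, (reward - floor) / (best_reward - floor))


def instance_score(trial_rewards, random_reward, noop_reward, best_reward):
    """Average the trials of one instance, then normalize the mean."""
    return normalized_score(
        mean(trial_rewards), random_reward, noop_reward, best_reward
    )


# Five instances in each of eight domains, at most 1 point per instance.
MAX_TOTAL = 5 * 8
```

For instance, with a floor of 2 and a best reward of 8, an average reward of 5 maps to 0.5 under this assumption, while any reward at or below 2 maps to 0.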

Competition results and discussion
Four teams successfully qualified for the online stage of the competition by meeting all the requirements, which included submitting an abstract and passing a dry-run evaluation. These approaches were categorized into one planning approach and three model-free RL approaches. One of the RL teams and the planning team represented academia, while the other two RL teams were from industry.
Only the planning approach managed to complete a successful final submission that could be evaluated. The winning team, DiSProD (P. Chatterjee, A. Chapagain, R. Khardon), built upon their previous DiSProD work (Chatterjee et al. 2023), a gradient-based search approach capable of propagating distributions between time steps. To accommodate the time constraints of the competition (4 min per trial), a more time-efficient variant was employed.
This year's competition provided a fascinating perspective on the field of probabilistic planning and learning, particularly in terms of the problem types offered. The initial registration numbers set an all-time record, and the contributions of domains and code from the community suggest a strong interest within both the research and industry communities in the types of challenges presented in this year's competition. However, the significant drop in the number of competitors at each stage of the competition, and the fact that only one team successfully crossed the finish line, with that team opting for a planning approach, adds complexity to the overall picture.
The most apparent challenges include the necessity for methods capable of solving both continuous and discrete problems and the requirement to tackle tasks encompassing control, operations research (OR), navigation, and more, without domain-specific adjustments. We would like to emphasize two notable challenges:
Instance size: One of the major factors affecting the difficulty level of the instances is the size of the state and action spaces.
Exploiting structure: This year's track placed considerable emphasis on problems with inherent sparse transition structure that can be leveraged for efficiency. For example, consider an HVAC problem with one heater and two rooms. In the first instance, the rooms can be isolated, while in the next, they can be adjacent and allow for heat transfer. Model-free RL methods have no knowledge of this underlying transition structure and are unable to exploit any knowledge of the independence that occurs in the first instance. In contrast, model-based planning methodologies are able to exploit this independence; for example, DiSProD's gradient-based optimization method would exploit the fact that one room's temperature does not depend on the other room, since such a dependence would not arise in the calculated derivatives (Table 8).
TABLE 8 Summary of results for the Probabilistic Track for the two baselines JaxPlan-SLP and JaxPlan-DRP and the winning planner DiSProD. Each entry counts the number of tasks won per domain.
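The independence argument in the HVAC example can be made concrete with a small numerical sketch. The two-room dynamics, parameter values, and function names below are hypothetical and are not taken from the competition domain, and central finite differences stand in for the analytic derivatives a gradient-based planner would compute. The point is only that the cross-room derivatives vanish when the rooms are isolated and become nonzero when they exchange heat.

```python
def hvac_step(temps, heat, coupling, loss=0.1, outside=10.0, gain=1.0):
    """One step of a hypothetical two-room model; the heater is in room 1."""
    t1, t2 = temps
    flow = coupling * (t2 - t1)  # heat flowing from room 2 into room 1
    new_t1 = t1 + gain * heat - loss * (t1 - outside) + flow
    new_t2 = t2 - loss * (t2 - outside) - flow
    return (new_t1, new_t2)


def jacobian(temps, heat, coupling, eps=1e-6):
    """Central finite-difference Jacobian of next temps w.r.t. current temps."""
    rows = []
    for i in range(2):
        row = []
        for j in range(2):
            hi, lo = list(temps), list(temps)
            hi[j] += eps
            lo[j] -= eps
            diff = (
                hvac_step(tuple(hi), heat, coupling)[i]
                - hvac_step(tuple(lo), heat, coupling)[i]
            )
            row.append(diff / (2 * eps))
        rows.append(row)
    return rows


# coupling = 0.0 models isolated rooms; coupling > 0 models adjacent rooms.
isolated = jacobian((20.0, 15.0), heat=1.0, coupling=0.0)
adjacent = jacobian((20.0, 15.0), heat=1.0, coupling=0.2)
```

In the isolated case the off-diagonal entries are zero, so a derivative-based method never propagates information between the rooms. A model-free learner would have to discover the same independence from samples.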

In summary, this year's track drew participants from academia and industry, showcasing interest in complex challenges. The competition highlighted the need for versatile methods to handle diverse tasks, ultimately favoring a differentiation-based planning approach. Integrating model-based planning and model-free RL approaches remains a challenge in this evolving field. Equally important, we are not aware of any submitted methods that attempted to leverage generalized planning and RL approaches (Sanner and Boutilier 2009; Sharma et al. 2023) that could exploit domain structure to generalize learning over instances. We see such approaches as a key to improving the efficiency of RL methods while providing broad generalization abilities to new instances with little (or no) need for per-instance adaptation.

CONCLUDING REMARKS
As in previous iterations, the International Planning Competition has been instrumental in advancing the field of automated planning, bringing together researchers from various domains to push the boundaries of planning methodologies. IPC-2023 was notable for its diverse set of tracks, reflecting the diversity within the planning research community.
The classical track continued to evolve, emphasizing features to facilitate modeling while maintaining optimal, satisficing, and agile subtracks. The numeric track explored the capabilities of state-of-the-art planners in dealing with numeric reasoning; effective numeric reasoning is a crucial building block for handling planning models that are closer to reality. The HTN track highlighted the significance of efficient grounding and domain-specific approaches in handling complex hierarchical tasks. The learning track addressed the fully automatic acquisition and application of domain-specific knowledge, showcasing advancements while revealing areas for further development. Lastly, the probabilistic and reinforcement learning track introduced new complexities and challenges to the field, paving the way for hybrid methods that combine tools from the planning and learning communities.
IPC-2023 demonstrates the advances within the planning community, which is embracing new trends alongside traditional methodologies. With recent advances in AI blurring the historical barriers between AI fields, future planning competitions might consider opportunities for the cross-pollination and integration of planning in novel application-focused tracks to further showcase the capabilities of planning methodologies in the presently burgeoning AI ecosystem.
spaces.While differentiable planning approaches such as DiSProD scale relatively well versus instance size (and are mostly limited by memory constraints), RL methods are significantly impacted by instance size since increases in the number of state and action variables of the underlying MDP typically lead to challenges with effective exploration and an explosion of the sample complexity (and therefore learning time) required to obtain a reasonably performing policy.
Coverage for the optimal classical track.Best results per domain highlighted in gray.
TABLE 2 Scores for the satisficing classical track. Best results per domain highlighted in gray. Hapori Greedy scored 0 due to a bug and is not shown.
TABLE 3 Scores for the agile classical track. Best results per domain highlighted in gray. Hapori Greedy scored 0 due to a bug and is not shown.
Winners of the Numeric Planning Track by subtrack.
TABLE 5 Winners and runner-ups of the HTN track. In case of multiple top rankings of a single system, only the 1-2 best configurations are shown.
TABLE 6 Aries in cases where the linearization proves to be unsolvable. PANDApro and PANDA were the best approaches, which performed direct POHTN planning.
to Tensorflow 2. This neural network architecture can encode a generalized reactive policy by learning a common set
TABLE 7 Plan quality scores for the Learning Track. For the missing entries, no domain knowledge was learned.