Optimized massively parallel solving of N‐Queens on GPGPUs

Continuous evolution and improvement of GPGPUs has significantly broadened their areas of application. The massively parallel platform they offer, paired with the high efficiency of performing certain operations, opens many questions on the development of suitable techniques and algorithms. In this work, we present a novel algorithm and create a massively parallel, GPGPU‐based solver for enumerating solutions of the N‐Queens problem. We discuss two implementations of our algorithm for GPGPUs and provide insights on the optimizations we applied. We also evaluate the performance of our approach and compare our work to existing literature, showing a clear reduction in computational time.


BACKGROUND
We first review the N-Queens problem in more depth, before discussing modern GPGPU architectures and features upon which we build.

2.1 The N-Queens problem

The N-Queens problem asks how many non-attacking configurations exist when placing N queens on an N × N chessboard. A non-attacking configuration is one in which no queen can attack any other queen on the chessboard. Two queens can attack one another if they both occupy the same row, column, or diagonal. The problem owes its roots to Max Bezzel, who in 1848 asked how many possible placements of eight queens on a conventional (8 × 8) chessboard exist. 2 Figure 1 illustrates an example of a non-attacking configuration, one of the 92 non-attacking configurations (solutions) for N = 8. This problem was later generalised 2 to the N-Queens problem as known today.
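For illustration, the attack rules above can be checked directly. The following sketch (the helper names are ours, not from any solver discussed here) tests whether two queens attack each other and whether a full configuration, given as one column index per row, is non-attacking:

```c
#include <stdbool.h>
#include <stdlib.h>

/* Two queens at (r1, c1) and (r2, c2) attack each other if they share
   a row, a column, or a diagonal (equal row and column distance). */
bool queens_attack(int r1, int c1, int r2, int c2) {
    return r1 == r2 || c1 == c2 || abs(r1 - r2) == abs(c1 - c2);
}

/* A configuration, given as one column index per row, is non-attacking
   iff no pair of queens attacks each other. */
bool is_non_attacking(const int *cols, int n) {
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (queens_attack(i, cols[i], j, cols[j]))
                return false;
    return true;
}
```

Since each row can hold at most one queen, one column index per row suffices to describe a candidate configuration.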
A distinction should be made regarding the alternative formulation that is sometimes used for the N-Queens problem. Some literature 3,4 quotes a variant, namely that of discovering a single non-attacking configuration of N queens for an N × N board. For this work, we consider the original variant of the generalized problem, namely that of enumerating all solutions for a given value of N (and in principle even outputting each solution).
The N-Queens problem has long served as a challenge for mathematicians, programmers and machine learning models alike. Discovering solutions for smaller values of N is relatively 'cheap' computationally with modern hardware, even using naïve solving approaches, due to the relatively small search space. For larger values of N, however, the number of possible solutions to be enumerated is vast and requires a combination of 'smart' algorithms and their efficient implementations. Real-world applications for the N-Queens problem are documented in literature, 5 such as very large-scale integration (VLSI) testing and deadlock prevention. Importantly, algorithms for constraint satisfaction problems such as N-Queens, and their implementations, can be applied to problems of a similar nature.
To date, the number of non-attacking configurations is known for all N ∈ [1, 27], with the latest addition being that of N = 27. [6][7][8][9] This problem lends itself to parallelization, due to the triviality of deciding whether a configuration is non-attacking, paired with the vast number of candidate solutions to be checked.

Computing on GPGPUs
NVIDIA's compute unified device architecture (CUDA) 10 brings support for general-purpose computation on supported NVIDIA Graphics Processing Units (GPUs) through a programming interface, drivers, and various tools. GPGPUs expose their superior mathematical capabilities and massively parallel environment for use in 'generic' tasks*.
F I G U R E 1 A non-attacking configuration of 8 queens on an 8 × 8 board shown with the per-queen attack indicators.
In the CUDA programming model, the host system is responsible for sharing data and coordinating computation on the device(s). The host system launches kernels of work on the devices associated with it, which are executed by a number of threads. The threads in each kernel launch are logically partitioned into blocks, each of which can be mono-, bi-, or tri-dimensional. In turn, blocks are logically grouped into a grid, which can also be either mono-, bi-, or tri-dimensional.
The dimensionality of the blocks and grid is important for tasks exhibiting spatial locality, however, for other tasks this geometry is of little significance.
In hardware, GPUs are made up of multiple Streaming Multiprocessors (SMs), each of which is allocated a number of blocks which reside † and execute on it.The number of blocks allocated to an SM depends on factors such as memory requirements and configuration, block size, and other hardware-specifics.
In the SM, a resident block is further partitioned into batches of (currently) 32 threads, called warps. All threads in a warp execute in lockstep and should ideally not diverge in their execution. Thread divergence should be avoided as it typically reduces the number of threads executing in parallel in the warp, increasing the overall time required for the computation to be performed. Section 2.2.1 discusses recent changes in hardware which enable greater flexibility with respect to diverging threads in warps.
Threads on their own are 'weak' compute units; instead, power is leveraged from the large number of threads executing concurrently at any given moment in a massively parallel environment such as this. Another special consideration of GPGPUs is the handling of memory. There are several types of memory with different access costs, scopes, and sizes. Global memory is the largest memory type on the GPU in terms of capacity and is visible to all threads across all blocks. This type of memory is the most costly to access (in terms of clock cycles), even when coalescing requirements are met. It is a means of communicating data between host and device, as it is memory both sides can manipulate. Shared memory is an on-chip memory region available per SM which has block scope and is relatively limited in size, albeit significantly faster than global memory when its access requirements are fulfilled. Additionally, a number of 32-bit registers are available to each thread with thread scope. Using warp-level primitives, it is possible for threads within the same warp to efficiently perform collective operations involving communication by sharing register contents.
Shared memory is divided across a number of banks. Whilst requests by threads to different banks are serviced simultaneously, accesses by threads of the same warp to different addresses within the same bank result in a bank conflict. When a bank conflict occurs, the requests are serialized, reducing the overall throughput of shared memory. However, read accesses by multiple threads within the same 32-bit word cause a single read operation, the result of which is subsequently broadcast to all threads involved. Bank conflicts should be avoided whenever possible as they degrade performance significantly.
Typically, communication between host and device happens through memory transfers over a shared bus. On multi-GPU systems, the host can individually transfer data to each device as necessary; however, it is also possible for data to be transferred directly between devices. The latter may benefit from superior transfer speeds if the devices are linked via a GPU-to-GPU bus such as NVLink. 11 Commonly, GPUs are either used to assist in the computational effort of a solver running on the host (hybrid solver 12 ), or act as solvers themselves coordinated by the host (complete solver 13 ). In a hybrid solver, the cost of memory exchanges between host and device, along with initial kernel costs, should be weighed against the speedup the GPUs offer to the overall computation. In the case of complete solvers, such costs are usually irrelevant considering the overall solving effort, but a bigger challenge arises: that of mapping conventional algorithms to an implementation suitable for, and optimized for, the GPU environment. Consequently, new techniques and adaptations to the algorithm(s) are likely necessary to achieve a good mapping.
In terms of the programming model, a number of programming languages are supported by the CUDA toolkit. Our work primarily uses C, with a number of lower-level optimizations detailed in Section 5.5. The translation from the high-level programming language to GPU machine instructions is a multi-step process. Initially, high-level C code is compiled into the PTX (Parallel Thread Execution) instruction set, which is an assembly-like language, abstracting away hardware details and using 'register variables' in place of registers. PTX instructions can then be compiled into SASS ‡ assembly through a process which maps the PTX code to device-specific SASS code and performs operations such as register allocation. The details of how this conversion is performed, along with the SASS Instruction Set Architecture (ISA), are mostly undocumented, and the produced instructions are not intended to be modified. It is therefore very difficult to gauge post-compilation specifics, such as register pressure, from the PTX level.

Comparison of GPGPU architectures
GPGPU technology is being continually improved in order to increase computational performance and to introduce expanded hardware support for a wider range of operations.
NVIDIA names different GPU generations after famous scientists such as Pascal, Turing, Ampere, and so forth. Devices of each generation are classified by their Compute Capability (CC) version number, which identifies the particular features that the device supports. For a generation, there may be several such versions implemented by different hardware.
This evolution of hardware results in a high degree of volatility, which often comes at a cost to optimization approaches. For instance, the Volta architecture supports, for each SM, up to 2048 resident threads with up to 32 registers per thread and 96 KB of shared memory. 14 Its successor, Turing, halves the maximum number of resident threads, keeping the same register file size, which leaves 64 registers per thread, and also reduces the size of available shared memory to 64 KB per SM. 15 This changes once again with the succeeding architecture, Ampere, where a total of 164 KB of shared memory is available to a maximum of 2048 threads per SM, each of which can access 32 registers. 16 In short, optimizations do not carry forward across architectures. This volatility often binds optimizations to particular generations relying on the characteristics of the architecture, which may render them ineffective or even a hindrance on others. 17 Register use per thread depends on a multitude of factors, one of which is the subset of instructions involved in the computations and the combinations thereof. The evolution of hardware sometimes introduces support for specialized operations via a single instruction which would otherwise be constructed using multiple instructions with intermediate results. For instance, the Volta architecture introduces support for 32-bit mask creation in the form of the bmsk instruction in the PTX ISA. Whilst the specifics of hardware performance and the translation of such instructions to SASS are not publicized, it is safe to assume their purpose is to optimize specific operations for the benefit of the overall computation-an assumption supported by literature.
18,19 Further to the above, NVIDIA introduced 'Independent Thread Scheduling' with the Volta architecture which, to some degree, mitigates the effects of warp divergence, albeit at the cost of a register. Prior to Volta, threads in a warp shared a single program counter, but since Volta, each thread maintains an independent program counter and its own stack space. More specifically, this architecture introduces schedule optimizers which group diverged threads of a warp into sub-units of mutually converged threads, to be run in parallel. Warp divergence should still be avoided in current architectures as it may still have a profound effect on performance.
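The 32-bit mask creation mentioned above, which bmsk performs in a single instruction, can be sketched in plain C. The clamping behaviour shown is our assumption of the semantics; the point is that, without such an instruction, the operation costs several instructions and an intermediate result:

```c
#include <stdint.h>

/* Sketch of single-instruction 32-bit mask creation: a run of `width`
   one-bits starting at bit `pos`.  The clamping at the word boundary
   is our assumption of the instruction's semantics. */
uint32_t make_mask32(uint32_t pos, uint32_t width) {
    if (pos >= 32) return 0;
    uint32_t run = (width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return run << pos;   /* pos < 32, so this shift is well defined */
}
```

Compiled naïvely, this body is a branch, a subtraction, and two shifts; hardware support collapses it to one operation.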
Of course, different generations of GPUs are better suited for different tasks; however, software optimizations aiming to approach optimal performance have to be done with a high degree of device-specificity. As Feinbube et al. highlight, 17 compilers and programming languages alone are not sufficient to achieve optimal performance. They observe some of their optimizations being a detriment to performance on some architectures and beneficial on others. We show that this continues to be the case with current architectures, and it is a view we echo through our work.

Scalability on distributed GPGPU systems
The model of computation described in Section 2.2 is host-centric: it works under the assumption that a host computer shares a bus with the devices (GPUs) attached to it, and coordinates the computation on one or more devices. Devices attached to the host machine may use the shared bus to communicate if a direct link between them is absent, at the cost of lower data transfer rates.
A host system is limited in the number of devices it can house (vertical scaling); therefore, increasing computational power beyond this point is achieved by increasing the number of such host systems (nodes) participating in the computation (horizontal scaling). Communication becomes an even greater concern for horizontal scaling, as communication between devices on these nodes is subject to even more restrictions. Commonly, the Message Passing Interface (MPI) 20 is utilized in such cases, for which CUDA-aware implementations exist, but the mediums involved may introduce additional overheads and slow down the communication process, which is the reason we chose not to use it in our implementation.
Work partitioning across devices is typically coordinated manually and influences the design of device-side code to mitigate any performance penalty. However, loosely coupled workloads with no cross-device communication requirements do not have to account for such communication penalties and form an ideal scenario. A problem may be partitioned into several sub-problems 21 which can be tackled individually and independently of one another to form such a workload, which is the approach we opted for, as detailed in Section 4.1.

APPROACHES TO SOLVING N-QUEENS
We review related work in the field focusing in particular on parallel approaches for solving the N-Queens problem, and present the DoubleSweep algorithm from which the backbone of our solver is derived.

Related work on parallel N-Queens solving
The N-Queens problem has been approached from a plethora of angles throughout its existence. When the aim is to identify a single non-attacking configuration for a given value of N, several techniques have been explored which do so in a 'fast' manner using search-based algorithms 4,[22][23][24] or by applying a pattern directly. 3 Estimating the number of non-attacking configurations (solutions) for any N was recently proven possible. 25 However, the effort of identifying the exact number of non-attacking configurations remains significant and requires brute-force-like algorithms with search-space-limiting heuristics.
Often such algorithms rely on backtracking and are based around what is known as Somers' Algorithm, which we will discuss at length in Section 3.2.
The theoretical fundamentals of divide-and-conquer approaches for the problem have also been explored in depth, 26 further evidencing that this style of algorithm is suitable for the N-Queens problem. Such algorithms treat the board as a 'ladder' upon which a search is performed, recursively attempting to place a queen in a valid position on each rung, and backtracking when dead-ends are reached.
With all such approaches, the search space needed for enumeration grows super-exponentially, as claimed in Reference 25; indeed, the number of solutions itself is a lower bound on the work required, and heuristics are needed to eliminate fruitless paths. The effectiveness of such heuristics contributes to the reduction of the search space, placing higher values of N within the realm of possibility. The performance of implementations of such algorithms and their respective heuristics is of paramount importance, which puts focus on the parallelization and optimization of such implementations.
Highly parallel approaches have been used in distributed environments to identify the number of solutions for instances of the N-Queens problem such as N = 24, which was first calculated by Kise et al. 7 using a 34-node cluster of CPUs, or the later work of Caromel et al. 8 who solved the N = 25 instance using a grid of 260 machines.Both approaches serve as good examples of highly parallel approaches and underline the difficulty of the task at hand.
Attempts at GPU-based N-Queens solvers have also been made. [28][29][30] These attempts are the product of heavy optimization of implementations and the adaptation of algorithms, such as the aforementioned Somers' algorithm, to account for the specialities of the GPU environment and achieve good performance. The need for such bespoke optimizations arises from the 'irregularity' of the computation at hand relative to the computation structure GPUs expect. 27 Another challenge faced in the use of GPUs with this type of problem is the rapid evolution of hardware, which often creates a mismatch between optimization techniques and evolving architectures. Likewise, tooling provided for this hardware may not have matured enough to make optimal use of the hardware. 17 Architecture-specific optimizations beyond what compilers offer are therefore quite common, and are also a prominent component in our work, discussed in Section 5.5.
Of course, GPU hardware is not the only hardware relevant to the N-Queens problem. Previous work has successfully utilized Field Programmable Gate Array (FPGA) hardware. Most recently, the work of Preußer et al. 6 used a naïve search algorithm with a search-space-limiting heuristic to discover, over the course of a year, the latest solution to the N-Queens problem, namely N = 27. This result remains unverified to date but is nevertheless a significant achievement 9 that follows the authors' earlier achievement of calculating N = 26, once again using FPGA hardware. These two latest results highlight the shift in focus away from conventional algorithms and the 'traditional' computational model towards parallel algorithms and less conventional, highly parallel computational hardware. General-purpose GPUs provide several advantages over FPGAs for such computations, such as their relative ease of programming, which renders them applicable to a multitude of tasks without special re-programming requirements, as well as their widespread availability as consumer hardware and on conventional cloud computing platforms, owed in part to reasonable prices driven by mass production.

Applications of GPGPUs to similar problems
N-Queens is often seen as a 'benchmark' problem, acting as a proxy for developing techniques, algorithms, and optimizations that apply to other problems similar in nature. The problem of Boolean satisfiability 31 (SAT) is a notable example which has received the attention of the GPGPU community. Successful attempts have been made in harnessing the capabilities of GPUs to accelerate the solving of SAT instances 12 as well as in developing GPGPU-based SAT solvers. 13,32 Recursive algorithms often do not map directly to the massively parallel environment of GPUs, meaning that significant adjustments have to be made to any implementation of such algorithms. Meyer et al. 32 present an implementation of a recursive divide-and-conquer algorithm for solving 3-SAT instances by decomposing the implementation into a pipeline of six kernels, each with a single function. This stepped design breaks the recursive algorithm into major components, each of which is performed by multiple threads at once with little chance of divergence between them, and with global synchronization enforced by the host.

The DoubleSweep algorithm
Perhaps a more common approach to enumerating solutions for the N-Queens problem is a backtracking search over all possible placements of queens on the board. Such algorithms typically begin by placing a queen in the first row of an N-Queens board and recursively explore subsequent rows, making a non-attacking queen placement on each, until either the board is completed or a row is found where no such placement can be made.
In the latter case, the search backtracks to a row where the queen can be moved to a different position, and resumes from that point. Notably, Jeff Somers 33 provided an efficient iterative implementation of such an algorithm, representing the state of the board in part using three 32-bit words: one tracking the columns blocked by queens (viewing them only as rooks), and a further two tracking the blocked diagonals and anti-diagonals in the current state of the board.
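The word-level representation above can be illustrated with a compact recursive sketch (Somers' actual implementation is iterative; the names here are ours). One word tracks blocked columns, and two words track blocked diagonals and anti-diagonals, shifted as the search descends one row:

```c
#include <stdint.h>

/* Recursive sketch of the bitmask board representation: `cols` holds
   columns blocked by queens, `diag`/`anti` hold blocked diagonals and
   anti-diagonals projected onto the current row, and `all` has the
   low N bits set.  Shifting diag/anti per row keeps the projection
   aligned with the row being filled. */
uint64_t solve(uint32_t cols, uint32_t diag, uint32_t anti, uint32_t all) {
    if (cols == all) return 1;                    /* every row holds a queen */
    uint64_t count = 0;
    uint32_t open = all & ~(cols | diag | anti);  /* free cells in this row  */
    while (open) {
        uint32_t bit = open & (~open + 1);        /* lowest open cell        */
        open ^= bit;
        count += solve(cols | bit, (diag | bit) << 1, (anti | bit) >> 1, all);
    }
    return count;
}

/* Counts all non-attacking configurations for an n x n board (n < 32). */
uint64_t count_solutions(int n) {
    return solve(0, 0, 0, (1u << n) - 1u);
}
```

For N = 8 this enumerates the 92 known solutions.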
The DoubleSweep algorithm combines basic word-level parallelism with basic ideas of look-ahead techniques 34,35 from the domain of Boolean satisfiability (SAT) solving. A key difference is that DoubleSweep propagates placements through the whole board, in a manner analogous to how "unit-clause propagation" excludes unsatisfiable branches in SAT solvers. This process identifies rows with only one open cell left, whereby a queen placement must be made, and repeats following every successful placement until a fixed point is reached, or a row or column with no possible placements is identified. Another key difference is that DoubleSweep begins placements in the central row of the board rather than the first (top-most) row.
This branching heuristic helps make the propagation step more efficient, as central placements are more influential to the remaining rows.
DoubleSweep uses N words to represent the full board with current propagations, on top of the three words used by Somers. In addition, the (anti-)diagonal words used are 64 bits wide so that, via a "sliding window", one can slide the bishop moves over the whole board (back and forth) using the (word-level) shift operations, as explained in Section 5.
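The sliding-window idea can be sketched as follows (a minimal illustration with our own helper names, not the solver's actual data structures). Indexing the diagonal through square (row, col) by row + col, a single 64-bit word covers all 2N − 1 diagonals of a board with N ≤ 27, and the columns blocked on any row are recovered with one shift:

```c
#include <stdint.h>

/* 64-bit sliding window over diagonals.  A queen at (row, col) blocks
   the diagonal with index row + col; the cells of that diagonal that
   fall on row r are found by shifting the whole word right by r. */
typedef struct { uint64_t diag; int n; } diag_word_t;

void block_diag(diag_word_t *d, int row, int col) {
    d->diag |= 1ULL << (row + col);
}

/* Columns blocked on row r by all recorded diagonals. */
uint64_t diag_row_projection(const diag_word_t *d, int r) {
    uint64_t rowmask = (1ULL << d->n) - 1;   /* low N bits */
    return (d->diag >> r) & rowmask;
}
```

Anti-diagonals work symmetrically with index row − col + N − 1, so both directions reduce to word-level shifts.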

DOUBLESWEEP-LIGHT-A GPU-CENTRIC APPROACH
The DoubleSweep algorithm presented in Section 3.2 contains a number of features which are powerful, but incur significant costs when implemented for GPUs. Perhaps most significant is the branching nature of the algorithm, which introduces further data-dependent branches. DoubleSweep-Light therefore does not propagate placements through the whole board, but instead propagates only once to the rows directly following the row in which the placement was made, which reduces the overall degree of divergence between threads in a warp. More specifically, given a partial configuration of queens on a chessboard (referred to as a 'state') and a backtracking limit (i.e., the index of a row beyond which backtracking is not permitted), the function advance_state presented in Listing 1 is applied. This function identifies a column (Lines 4-9) suitable for a queen placement in the current row (i.e., the row following that of the last placement), making the placement and advancing the state (Lines 10-14) if such a column is found, or backtracking (Line 16) and re-trying otherwise. In essence, backtracking is performed when no queen can be placed in the current row due to conflicts with previously placed queens. The reasons for limiting backtracking are detailed in Section 4.2.
It is worth noting that the form a state (s) takes in Listing 1 and subsequent listings is that of a structure, the contents of which include the current row (current_row) and a list of structures (row_at), each containing per-row information. The per-row structure holds information such as the index of the queen placed on that row (current_queen_index) and the index of this row in the state (row_index). The process described in this section is presented visually in Figure 2. First, the initial state (1) is the partly complete non-attacking configuration upon which DoubleSweep-Light is applied. The red region signifies rows (0 to 3 from the top, counting from 0) which should not be modified (i.e., backtracking will be limited so as not to modify any of these rows). Initially, advance_state is applied, which results in the placement of a queen in row 4 in one of the two possible positions seen in (1), to yield (2). From here, derive_queens is applied starting from the following row (row 5). This row is 'unit' (i.e., there is only one position available), hence we place the queen there, yielding (3). The placement of this queen results in the following row (row 6) becoming unit, which results in another placement, as seen in (4). Once again, the placement of this queen makes the following row (row 7) unit, and this cascade effect continues as seen in (5) and (6), resulting in a complete non-attacking configuration through derivations.
If we found we could not place a queen, then we would backtrack undoing the unit derivations and explore the only other possible position of the queen in row 4.
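The 'unit' test driving the derivation cascade above reduces to a single-bit check on a row's mask of open cells. A sketch (the helper names are ours):

```c
#include <stdint.h>
#include <stdbool.h>

/* A row is 'unit' when exactly one cell remains open, i.e. its mask
   of open positions has exactly one bit set. */
bool row_is_unit(uint32_t open) {
    return open != 0 && (open & (open - 1)) == 0;
}

/* Column index of the single open cell of a unit row. */
int unit_column(uint32_t open) {
    int c = 0;
    while (!(open & 1u)) { open >>= 1; c++; }
    return c;
}
```

On the GPU, the bit scan would typically map to a population-count or find-first-set intrinsic rather than a loop.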

Parallelizing DoubleSweep-Light
A crucial component in the design of this algorithm is its dependency-free parallelizability, whereby parallel search paths have no reason to converge or share information between them.
As DoubleSweep-Light is a backtracking-based algorithm, its exploration of the search space can be visualized as a tree of candidate configurations, as shown in Figure 3. Parallelizing this algorithm across disjoint searches can be achieved by splitting the search tree into a forest of disjoint sub-trees. The sub-trees produced can then be operated upon by a number of parallel DoubleSweep-Light searches, without risk of overlap or dependencies between the searches. More specifically, a certain depth (level) of the search tree may be chosen, the sub-trees of which can be used as starting points for non-converging parallel searches.
It must be noted that the first level of the tree in Figure 3 has been reduced to just four states, as the removed states are symmetries of the remaining states on the vertical axis.This part of the algorithm is further discussed in Section 4.2.
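The vertical-axis symmetry reduction mentioned above can be sketched as follows. This is our own minimal illustration, assuming the usual mirror-counting rule: only the first ⌈N/2⌉ first-row columns are searched, and each sub-count is doubled except for the central column of an odd board, which is its own mirror image:

```c
#include <stdint.h>

/* Completes a board whose first-row queen is already placed, counting
   full non-attacking configurations (word-level recursion as in the
   backtracking search; helper names are ours). */
uint64_t complete(uint32_t cols, uint32_t diag, uint32_t anti, uint32_t all) {
    if (cols == all) return 1;
    uint64_t cnt = 0;
    uint32_t open = all & ~(cols | diag | anti);
    while (open) {
        uint32_t bit = open & (~open + 1);
        open ^= bit;
        cnt += complete(cols | bit, (diag | bit) << 1, (anti | bit) >> 1, all);
    }
    return cnt;
}

/* Vertical-axis symmetry reduction: search only the first ceil(N/2)
   columns of the first row and double each count, except for the
   central column of an odd board. */
uint64_t count_via_symmetry(int n) {
    uint32_t all = (1u << n) - 1u;
    uint64_t total = 0;
    for (int c = 0; c < (n + 1) / 2; c++) {
        uint32_t bit = 1u << c;
        uint64_t sub = complete(bit, bit << 1, bit >> 1, all);
        total += (n % 2 == 1 && c == n / 2) ? sub : 2 * sub;
    }
    return total;
}
```

The halved first level is exactly what Figure 3 shows: mirrored first-row placements are never explored, only accounted for.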
As discussed in Section 4, DoubleSweep-Light attempts to complete the placement of queens on a partly complete non-attacking configuration. Therefore, to parallelize the search, it is enough to construct a pool of incomplete non-attacking configurations which act as starting points for parallel search paths. Section 4.2 presents the specifics of this process.
Besides the diversification of the search via unique search paths, DoubleSweep-Light is designed to require no awareness of parallel searches or previously explored paths.This is particularly important to maintain the loosely coupled nature of the solver and not limit its horizontal scalability.
F I G U R E 3 Visualization of search tree for DoubleSweep-Light.

Initial state pool generation
To generate a pool of initial states (i.e., a set of partial non-attacking configurations), a DoubleSweep-Light search is performed up to a certain depth. More specifically, a range of rows is chosen on an N-Queens board, which are to be populated with queens. The maximum number of possible (partial) states can easily be calculated for a given cut-off depth; however, such naïvely generated states often contain a large number of 'invalid' search starting points, such as ones which cannot be advanced further, therefore the number of valid, advanceable states generated is often far smaller.
The state generation process produces partial states which can be advanced at least once. These states have a certain number of 'locked' rows, meaning that when DoubleSweep-Light is performed on these states, these rows must not be altered. The advancement algorithm presented in Listing 1 takes into consideration the index of the last locked row, on or prior to which backtracking must not occur.
In practical terms, to generate a pool of approximately s many states for a given value of N, a ladder-climbing approach is employed, as presented in Listing 3. Initially, a row r = ⌊log_N(s)⌋ (Line 2) is chosen, under the assumption that all naïvely generated states obtained by populating the first r many rows are valid and advanceable. A pool of states S_1 is subsequently generated by applying a modified version of DoubleSweep-Light (seen as the function gen_states) which stops placing/deriving queens after a certain row (Lines 3-8).
If |S_i| < s, r is incremented and state generation is repeated, until some generated pool S_i satisfies |S_i| ≥ s. At that point, a choice is made to either keep S_i, provided |S_i| ≤ f · s for some constant factor f ≥ 1, or to discard S_i and keep S_{i−1}. Here, s is treated as a soft limit, and the constant factor f serves as a means of determining the hard upper limit. The flexibility in the upper limit is employed as the number of states generated by locking on subsequent rows may vary wildly; however, discarding a pool of states for slightly exceeding the desired pool size s is undesirable. Only the first ⌈N∕2⌉ cells of the first row are considered for queen placements. This operation reduces the search space by approximately half, as any states generated by queen placements on the remaining cells of the first row would be symmetries of those on the vertical axis. For the generated states, the number of solutions found under each would have to be doubled, unless N is an odd integer, in which case the solutions are doubled for all states except those with a queen in the central column of the first row.
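The ladder-climbing depth selection can be sketched as follows. This is a simplified illustration with our own names: it counts valid (non-attacking) prefixes at each depth, whereas the actual generator additionally filters for advanceability and respects the first-row symmetry reduction:

```c
#include <stdint.h>

/* Counts non-attacking partial placements filling the first `depth`
   rows, i.e. candidate initial states at that depth. */
uint64_t prefixes(uint32_t cols, uint32_t diag, uint32_t anti,
                  uint32_t all, int depth) {
    if (depth == 0) return 1;
    uint64_t cnt = 0;
    uint32_t open = all & ~(cols | diag | anti);
    while (open) {
        uint32_t bit = open & (~open + 1);
        open ^= bit;
        cnt += prefixes(cols | bit, (diag | bit) << 1, (anti | bit) >> 1,
                        all, depth - 1);
    }
    return cnt;
}

/* Ladder climbing: start at r = floor(log_N(s)) and deepen while the
   pool holds fewer than s states; stop short if the next depth would
   exceed the hard cap f * s, keeping the shallower pool instead. */
int choose_depth(int n, uint64_t s, double f) {
    uint32_t all = (1u << n) - 1u;
    int r = 1;
    for (uint64_t p = (uint64_t)n; p * (uint64_t)n <= s; p *= (uint64_t)n)
        r++;                               /* r = max(1, floor(log_N(s))) */
    uint64_t prev = prefixes(0, 0, 0, all, r);
    while (prev < s) {
        uint64_t next = prefixes(0, 0, 0, all, r + 1);
        if (next > (uint64_t)(f * (double)s)) break;  /* keep depth r */
        r++;
        prev = next;
    }
    return r;
}
```

For N = 8 there are 8 depth-1 prefixes and 42 depth-2 prefixes, so requesting s = 40 states with f = 2 settles on depth 2.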

In terms of implementation, the state generation process described above could be implemented on the device side; however, we saw little benefit in doing so, considering it is a relatively quick process which has stringent memory requirements that the GPU environment is not suited for. Instead, we implemented this process on the host side. Our host-side implementation is parallelised across ⌈N∕2⌉ many threads, in the same fashion explained in Section 4.1. More specifically, each thread t_i begins by placing a queen on the i-th column of the first row in its (initially empty) state, which remains untouched throughout the state generation process.
The pool of initial states generated needs to be sufficiently large to supply every participating solver. Depending on the number of nodes, devices, the maximum number of concurrent threads, and the over-subscription factor per device (discussed further in Section 5), a very large pool could be produced, with memory requirements exceeding the available memory of the system. State generation is therefore implemented with the option of using secondary storage as backing memory.
After generation, the pool of states is shuffled randomly using the Fisher-Yates shuffle algorithm. 36 This step is important to counter the effects of irregular work distribution caused by some states having more possible candidate solutions than others. By shuffling the pool randomly, concentrations of 'heavier' states are broken, meaning partitions given to devices are likely to contain a more uniformly balanced workload. Further discussion on this point can be found in Section 6.
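The shuffle step can be sketched as follows (plain ints stand in for the state structs, and the seeding choice is ours):

```c
#include <stddef.h>
#include <stdlib.h>

/* Fisher-Yates shuffle over the state pool: after the loop, each
   element sits at a uniformly random position, breaking any
   concentrations of 'heavier' states left by generation order. */
void shuffle_pool(int *pool, size_t n, unsigned seed) {
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);  /* j uniform in [0, i] */
        int tmp = pool[i];
        pool[i] = pool[j];
        pool[j] = tmp;
    }
}
```

The shuffle is O(n) and in place, so it adds negligible cost even for very large pools.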

IMPLEMENTATION ON GPUS
For our implementation of DoubleSweep-Light, we used CUDA-C as the high-level language, with targeted optimizations through inline assembly code where appropriate. The details of the optimizations we applied are presented in Section 5.5. The lightweight design of DoubleSweep-Light, paired with its algorithmic efficiency, allows it to be implemented in a constrained environment such as a GPU thread. Each GPU thread acts as an isolated DoubleSweep-Light solver, which operates on a different starting point to the rest.
The search begins with a pool of states being generated (as described in Section 4.2). A subset of the states in the pool is then transferred to the GPU, and the search kernel is launched with at least as many threads as states in the pool. The cumulative number of threads in the grid is typically significantly higher than the number of cores in the GPU. This degree of over-subscription reflects on the number of blocks in each kernel and allows for finer balancing of work across SMs. This workload does not benefit from spatial locality, therefore we have grouped threads into mono-dimensional blocks which in turn are part of a mono-dimensional grid.
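Sizing the mono-dimensional grid reduces to a ceiling division (the function name is ours, for illustration):

```c
/* Mono-dimensional grid sizing: enough blocks that the launch covers
   at least one thread per state (ceiling division). */
unsigned grid_blocks(unsigned num_states, unsigned threads_per_block) {
    return (num_states + threads_per_block - 1) / threads_per_block;
}
```

A launch would then use grid_blocks(num_states, threads_per_block) blocks of threads_per_block threads, with any surplus threads in the final block simply finding no state to work on.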
We have implemented two DoubleSweep-Light kernels: one relying on shared memory and one exploiting the superior performance of registers, described in Sections 5.1 and 5.2, respectively.

Shared memory-based kernel
Shared memory offers significantly cheaper access costs than global memory, as discussed in Section 2.2. The shared memory-based implementation of DoubleSweep-Light has each thread in each block reserve a portion of shared memory exclusively for its computation. In the beginning, the thread transfers its corresponding state to shared memory, which is represented as a struct comprised of the following information: 1. Per-row projections of conflicting diagonals caused by placed queens (two 64-bit bit vectors, see Section 5.3).
2. A bit vector tracking the occupied (blocked) columns (see Section 5.3).
3. The index of the current row (a single 8-bit integer).
4. The indexes of placed queens on the current state (array of N 8-bit integers).
Having the indexes of the queens on the board renders the other pieces of data, such as the occupied columns bit vector, redundant, as they can be derived. However, this data plays a crucial role in the implementation explained below, trading memory for reduced repeated computation.
The per-row indexes are necessary in this instance to facilitate backtracking. During backtracking, placed queens need to be removed from the state, meaning the tracking of diagonals, anti-diagonals and occupied columns needs to be updated, which can only be done knowing the position of the removed queen. Traditional recursive implementations would use the call stack for this purpose, but due to limitations in size and control of data, we chose to manually track this data in an iterative implementation instead.
The size of this struct varies depending on the value of N (known at compile time), as a result of the array member. The remaining components are 8-byte aligned and laid out as shown in Figure 4, with the 8-byte boundaries highlighted. This structure minimizes padding, in an effort to reduce the overall size of state pools and allow a greater number of states to be stored in the device's memory as well as shared memory. We note that this layout results in a well-packed struct on tested compiler versions; however, padding and struct member layout are ultimately determined by the compiler.
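A plausible C++ sketch of such a per-thread state record is shown below. The member names and ordering are our own illustrative assumptions; the paper's actual fields and layout are those of Figure 4:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative per-thread state for a compile-time board size N:
// two 64-bit diagonal projections, the blocked-columns bit vector,
// the current row index, and the per-row queen indexes.
template <int N>
struct ThreadState {
    std::uint64_t diag;       // diagonal projections (Section 5.3)
    std::uint64_t antidiag;   // anti-diagonal projections
    std::uint64_t blocked;    // occupied columns
    std::uint8_t  row;        // index of the current row
    std::uint8_t  queens[N];  // column index of the queen in each row
};
```

On a typical LP64 platform with default alignment, this packs the 8-bit members directly after the 64-bit ones, so the struct occupies 8 x ceil((25 + N) / 8) bytes; the exact figure remains compiler-determined, as noted above.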
During computation, each thread in a warp operates on the data in its shared memory, by performing the DoubleSweep-Light procedure as outlined in Section 4. Warp divergence cannot be eliminated completely in the implementation of DoubleSweep-Light, as the number of iterations made by each thread is data-dependent. Warp synchronization barriers are interleaved between state advancement and propagation in an effort to re-converge divergent threads where appropriate.
F I G U R E 4 Layout of per-thread data in shared memory.
Following every successful propagation and derivation, threads within a warp vote to determine if all threads involved have completed their task.
If at least one thread votes against stopping in the ballot, the warp continues, with completed threads left inactive. Performance may degrade if only a few threads in a warp remain active; however, generally, we found this not to be the case. Additionally, the choice to store queen indexes as 8-bit integers was driven primarily by the constrained size of shared memory. This, paired with the data-dependent access patterns exhibited by the threads, renders bank conflicts unavoidable. Due to the high compute load of the kernel, however, the overall impact of such conflicts on the kernel does not appear to be great.
Each thread uses an unsigned 64-bit integer to count the number of solutions it finds for its given state. After successfully advancing its state, each thread increments this counter by either 1 or 0, depending on whether the state is a complete non-attacking configuration or not. This is computationally 'cheap' to establish purely through the occupied columns bit vector, since having N set bits in this vector guarantees all N columns have been populated and the configuration is non-attacking. In practice, this is a simple comparison with a compile-time generated bit mask, which paves the way for further optimization discussed in Section 5.5.
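The branchless counter update can be sketched as follows; the mask and identifier names are illustrative, not the solver's actual symbols:

```cpp
#include <cassert>
#include <cstdint>

constexpr int N = 8;  // board size, fixed at compile time

// Mask with the low N bits set: a state is a complete solution exactly
// when its occupied-columns bit vector equals this mask.
constexpr std::uint64_t FULL_MASK = (1ULL << N) - 1;

// Branchless counter update: adds 1 when every column is occupied and
// 0 otherwise, avoiding a divergent branch inside the warp.
inline void count_if_complete(std::uint64_t occupied_cols,
                              std::uint64_t& counter) {
    counter += (occupied_cols == FULL_MASK);
}
```

Folding the comparison into the addition means every thread executes the same instruction stream regardless of whether it just found a solution.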
After all threads in a warp vote to vacate, the individual results of each are accumulated in a common global memory location through an atomic add operation. At the end of the computation, this location (known to the host) contains the number of solutions found for the set of states given to that device. It is the job of the host system(s) to collect results across devices and accumulate them. Formerly, we implemented result accumulation using a warp shuffling operation; however, such operations are currently not available for 64-bit types, and we deem them as offering insignificant performance gains with respect to the overall computation.
Reliance on shared memory does in some cases impact the number of threads per block. Our goal is to maximize the utilization of SMs and concurrent solver threads. We calculate the size of the block in a warp-centric manner, taking into account the size of each struct t in bytes, the maximum shared memory size s in bytes, the maximum number of warps in a full block w, and the maximum number of threads in a full block m, by first calculating the number of whole warps whose state fits in shared memory, capped at a full block, giving b = min(⌊s∕(32t)⌋, w) × 32 threads per block. The number of blocks in the grid is easy to calculate for a pool of p many states, as g = ⌈p∕b⌉. It must be noted that some architectures support multiple blocks residing in the same SM, provided sufficient resources are available for all of them to co-exist. It may be preferable, depending on the architecture's capabilities, for multiple smaller blocks to be launched versus maximally sized ones. Due to the high degree of architecture-specificity associated with this decision, we opted to maximize the block size as a means of achieving good performance irrespective of the architecture specifics, a decision biased in part by the hardware available to us. Additionally, the transfer of data from global to shared memory forms an insignificant portion of the computation, so global memory access coalescing is not considered in this instance.
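The warp-centric block-sizing computation can be sketched as below; the exact rounding policy is our reading of the text, and the function names are illustrative:

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t WARP_SIZE = 32;

// Warp-centric block sizing: fit as many whole warps of per-thread
// structs (t bytes each) into shared memory (s bytes) as possible,
// capped at a full block of w warps / m threads.
std::size_t threads_per_block(std::size_t t, std::size_t s,
                              std::size_t w, std::size_t m) {
    std::size_t warps = s / (WARP_SIZE * t);  // warps whose state fits
    if (warps > w) warps = w;                 // cap at a full block
    std::size_t b = warps * WARP_SIZE;
    return b < m ? b : m;
}

// Blocks needed so every one of p pool states gets its own thread.
std::size_t blocks_in_grid(std::size_t p, std::size_t b) {
    return (p + b - 1) / b;  // ceiling division
}
```

For example, with a 40-byte struct, 49,152 bytes of shared memory, and a 32-warp / 1024-thread block limit, shared memory fits 38 warps, so the block is capped at the full 1024 threads.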

Register exploitation for memory latency reduction
As discussed in Section 2.2, register space can be viewed as the fastest 'memory space' available. Whilst registers are not addressable memory, adjustments to the kernel described in Section 5.1 give the compiler the opportunity to place this data in registers. We have applied these changes to form a register-based kernel implementing DoubleSweep-Light.
Members of the state struct of each thread are loaded directly into thread-local variables. Surrounding code has been adjusted to implement DoubleSweep-Light using these variables instead of addressable memory, without significant change to the code flow. The only component of the struct which remains in shared memory is the array member, as it cannot be housed in registers.
Shared memory requirements are significantly reduced: for a given value of N, each thread requires only N bytes of shared memory. Given the number of threads in a warp w, the maximum number of threads in a full block m, and shared memory size s, we first calculate the number of threads whose arrays fit in shared memory, b = ⌊s∕N⌋, and round it down to the nearest multiple of w, to obtain the number of threads per block b′ = min(b − (b mod w), m). The benefits of this change are two-fold. First, the reduced shared memory requirements allow for full blocks to be allocated per SM. Typically, a warp is comprised of 32 threads, and there can be up to 1024 threads per block. Even the earlier architectures featuring 48,000 bytes of shared memory per block per SM would end up having at least one full block per SM for all N ≤ 46. Second, register space is thread-local, significantly faster, and has less stringent access requirements than shared memory.
It must be noted, however, that high-level code has no direct control over register allocation. Attempts to interfere with the compilation toolchain's register utilization are objectionable: beyond violating programming standards, they often hinder compiler optimizations and result in an overall performance loss. Likewise, exhaustion of register space has adverse effects on overall performance, for reasons outlined in Section 2.2.1.
In our experiments, this kernel generally compiled without excess register usage and resulted in significant performance gains, discussed in Section 6. We note, however, that registers are by no means a plentiful resource, rendering this kernel non-ideal for some past and potentially future architectures. For instance, devices with compute capability 6.2 support 2048 threads resident on each SM at a time, sharing 32,000 registers between them. To achieve maximum thread residency, each thread must use 15 or fewer registers, of which two are reserved for reasons presented in Section 2.2.1. During compilation, the compiler reports 25 registers being used by this kernel. Therefore, the decision on which kernel is better suited for the resources at hand should be made on a case-by-case basis.

Diagonal tracking
As described in Section 5.1, each thread uses two 64-bit bit vectors to track the projections of diagonals placed in its state. Although these 16 bytes may seem a hefty sacrifice to make considering space constraints, this is an integral component of the DoubleSweep-Light implementation that allows us to effortlessly determine, for a given row, which columns are non-conflicting with the diagonals of placed queens.
Initially, the state each thread is assigned contains the pair of bit vectors V_d and V_ad respectively, with bits set to match the queens currently tracked in the state. Subsequently, following every queen placement by the thread, this pair of values is updated to reflect the change. In essence, each bit in these bit vectors corresponds to a diagonal or anti-diagonal respectively; therefore, for an N × N board, there may be up to 2 × N − 1 diagonals and as many anti-diagonals to track. For instance, Figure 5 depicts the mapping of diagonals to the diagonal tracking bit vector for a given 8 × 8 board, where 0 or 1 represent the absence or presence of a queen in the corresponding diagonal respectively. This process applies similarly to anti-diagonals.
During solving, upon placement of a queen on a row r and column c, a pair of bit masks is calculated, one for the diagonal, m_d = (1 ≪ c) ≪ r, and one for the anti-diagonal, m_ad = (1 ≪ c) ≪ (64 − N − r), which are then used to compute the updated values V_d = V_d | m_d and V_ad = V_ad | m_ad respectively. When eventually this placement is undone (during backtracking), the bits set by m_d and m_ad are simply toggled off, as V_d = V_d & ¬m_d and V_ad = V_ad & ¬m_ad. To determine which columns are non-conflicting for a given row r, we utilise the above pair of bit vectors along with the bit vector B tracking the blocked columns that each thread maintains. First, we extract the projections of diagonals for the current row, p_d = V_d ≫ r, as well as the anti-diagonals, p_ad = V_ad ≫ (64 − N − r), and then derive a bit word of available columns for this row, a = ¬(B | p_d | p_ad) & X, where X is a bit mask with N set bits, computed at compile time. Following this, the positions of set bits in a correspond to the columns where a queen can be placed without conflicting with existing placements.
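A host-side C++ sketch of this scheme follows (N = 8 here; the function and variable names are ours, chosen to mirror the notation above):

```cpp
#include <cassert>
#include <cstdint>

constexpr int N = 8;                          // board size
constexpr std::uint64_t X = (1ULL << N) - 1;  // N set bits

struct Diagonals {
    std::uint64_t vd  = 0;   // diagonal projections
    std::uint64_t vad = 0;   // anti-diagonal projections
};

// Record the diagonals covered by a queen placed at (row r, column c).
void place(Diagonals& d, int r, int c) {
    d.vd  |= (1ULL << c) << r;
    d.vad |= (1ULL << c) << (64 - N - r);
}

// Undo a placement by toggling the same bits off.
void unplace(Diagonals& d, int r, int c) {
    d.vd  &= ~((1ULL << c) << r);
    d.vad &= ~((1ULL << c) << (64 - N - r));
}

// Bit word of columns available in row r, given blocked columns B:
// shift the projections down to the row, combine with B, and negate.
std::uint64_t available(const Diagonals& d, std::uint64_t B, int r) {
    std::uint64_t pd  = d.vd  >> r;
    std::uint64_t pad = d.vad >> (64 - N - r);
    return ~(B | pd | pad) & X;
}
```

For example, a queen at row 0, column 0 leaves every column of row 1 open except column 0 (occupied) and column 1 (its diagonal), so `available` returns the word 0xFC.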
F I G U R E 5 Mapping of queens on diagonals to the diagonal tracking word.
The computational cost of these bitwise operations is very small. This is ideal for a situation like the above, since the aforementioned computations, and especially the checking of available columns, are performed very frequently in our implementation. Newer device architectures (discussed further in Section 5.5) introduce hardware support for bit mask calculations.

Impact of datatype conversions
Currently, NVIDIA GPUs feature 32-bit wide registers. Unlike with shared memory, using types smaller than 32 bits in high-level code yields no benefit or reduction in register space requirements. Throughout our experiments, we observed that the use of such types in fact came at a cost to performance.
In many cases, their use forced the compiler to insert type conversion instructions (cvt) into the resulting PTX code to enforce the properties of the respective type, which carried through to the SASS code.
A simple change of some types in the high-level code eliminated the need for such instructions and resulted in approximately 20% higher overall performance for this kernel. Whilst not strictly an optimization, in compute-bound kernels such as this one, performance gains can be made through the removal of unnecessary instructions.
The CUDA Programming Guide 10 provides details on the throughput of arithmetic instructions, including type conversions. It is noteworthy that in the latest compute capability versions as of yet (8.0 ≤ cc ≤ 9.0), only 16 conversions from 32-bit types to smaller ones can be performed per clock cycle per SM, potentially acting as a bottleneck.

Reflections on architecture changes and optimization
Through the chain of evolution of NVIDIA architectures, several features have been introduced or removed, overall amounting to incremental performance gains and offering several advantages. Adversely, however, such fluctuations in design and feature availability complicate the identification of areas where optimizations are applicable and hinder portability across architectures.
We have developed our solver using features of the CUDA library which are compatible with a large range of device architectures, and opted to hand-tune the solver for specific architectures, namely Pascal, Turing, and Ampere. The process of identifying those optimizations is the product of reading the resulting PTX code to identify areas of improvement and performing isolated micro-benchmarks on alternative formulations of some operations. We note that our control from the higher-level CUDA-C code is limited to just introducing inline PTX assembly instructions, which may not be reflected in the later translation to device-specific SASS instructions. To ensure optimizations had an effect, we analyzed the resulting PTX code as well as SASS instructions after compilation for the targeted architectures, following the steps of Abdelkhalik et al. 18
Optimizations typically arise from identifying opportunities for shortcuts the compiler did not take. For instance, during the propagation sweep of DoubleSweep-Light, we go over unpopulated rows and compute the bit word a of available columns for that row, as described in Section 5.3. We then apply the standard library function __popc on a, to compute the number of set bits. If only one bit is set, then the row is unit and a placement should be made in the only available column, which is the index of the set bit. To identify that index, the standard __ffs function can be used, which finds the index of the first (least significant) set bit (one-indexed). The PTX code produced by the compiler for the __ffs function is shown in Listing 4, where a is stored in the register variable %r1. Since there is no single instruction to find the first set bit in PTX, the process is performed by first reversing the bits of a from least to most significant (brev.b32), then determining the number of left-shifts i needed to bring the most significant set bit to the most significant bit position of the type (bfind.shiftamt.u32), and finally adding 1 to i.
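The unit-row check can be mimicked on the host with compiler builtins: __builtin_popcount plays the role of __popc, and 31 − __builtin_clz yields the natural index of the (only) set bit, analogous to the bfind-based shortcut. This is our illustrative C++ analogue, not the device code itself:

```cpp
#include <cassert>
#include <cstdint>

// If exactly one column is available in the bit word `a`, return its
// natural (zero-based) index; otherwise return -1. With a single set
// bit there is no need to reverse the word first, mirroring how
// bfind.u32 replaces the generic brev/bfind/add sequence of __ffs.
int unit_column(std::uint32_t a) {
    if (__builtin_popcount(a) != 1) return -1;  // row is not unit
    return 31 - __builtin_clz(a);               // index of the set bit
}
```

The guard on the population count also keeps __builtin_clz away from a zero argument, for which it is undefined.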
In this instance, however, we know that there is exactly one set bit; therefore the reversal of the bits of a can be omitted, along with the offsetting of the resulting index by 1, as that is not useful in our application either. Such factors simplify the PTX instructions needed, as shown in Listing 5, where we use bfind.u32 to identify the natural index of the most significant (and only) set bit directly. This improvement is especially important given how frequently this operation is performed.

Execution on GPU clusters
GPU clusters are typically comprised of multiple nodes, each of which may be of different specifications to the rest, even at different physical locations. A job is typically submitted through the login node and scheduled automatically to run on a node when sufficient resources become available. Jobs hold various parameters for the work that will eventually be carried out and often limit the time available to carry out such work. The specifics of cluster topologies, available hardware, limits, job scheduling, and so forth, may vary greatly between clusters.
Overcoming time limits in such environments can be a challenge and is typically achieved through checkpointing. The application running as part of a job is thereby issued a signal by the job scheduler when the execution time limit is about to be reached, so that it can perform the necessary actions and save the current state of the computation before exiting gracefully. For CPU-side computations, the signal can be handled by the main process to terminate the computation. However, this is more difficult for GPU-based applications, since the host system has to signal the device to stop and return the partial progress made. This is further complicated in applications such as ours, where there is no communication between the host system and the device throughout the computation, since all threads of the device act as isolated solvers, relying on on-chip memory resources only.
We decided against implementing this functionality, as the associated changes to the kernel would incur significant performance costs. We instead chose to partition the state pool empirically for problem sizes likely to surpass the cluster's time limits, and to dispatch multiple jobs, each tackling one sub-pool.

PERFORMANCE EVALUATION
To evaluate the performance of our DoubleSweep-Light implementations described in Section 5, we tackled a range of problem sizes N ∈ [14, 25] on various systems with different GPU device architectures. In our earlier work 1 we presented solving times for N ∈ [14, 20], shown in Table 1, obtained using the shared memory-based implementation of DoubleSweep-Light described in Section 5.1. These results were collected from two systems housing two GTX 1080 Ti (Pascal architecture) and one RTX 3090 (Ampere architecture) GPUs respectively, by performing ten runs of our solver over the same input for each test case. We note that, at the time, the elimination of vertically symmetric states was not implemented.
In this work, we present results for both the register-based and shared memory-based implementations of DoubleSweep-Light, collected using Swansea University's GPU cluster. This cluster is comprised of six identical GPU nodes, each of which houses eight NVIDIA A100 GPUs (Ampere architecture). Unlike our earlier results, and for reasons outlined in Section 5.6, our control over the systems (nodes) involved in this cluster was limited to the scheduling of jobs and accumulation of results. Therefore, to obtain results with as little interference from other concurrent jobs as possible, we submitted jobs requiring 8 GPUs, meaning they would occupy a full node.
Figure 6 presents the solving time in seconds required for each N ∈ [19, 24] using our shared memory-based kernel (Kernel 1) and the register-based kernel (Kernel 2) implementations. For each value of N, a state pool was generated and subsequently shuffled as described in Section 4.2, and the two kernels were submitted as separate jobs to the cluster, each tasked with tackling that same input state pool. The sizes of the pools generated for each experiment are shown in Table 2, alongside the index of the last row that was 'locked' to produce this number of initial states.
For each experiment, we aimed to generate 80,000,000 initial states, which would equate to 10,000,000 per device over eight devices. This large factor of over-subscription was chosen to allow for finer control of workload per thread on the devices; however, we note that due to the nature of the problem, the time needed to tackle each state cannot be estimated accurately enough to employ a better work-balancing heuristic. In some instances, such as N ∈ {20, 23, 24}, the number of states generated was substantially smaller than the desired state pool size. In these instances, 'locking' and exploring a further row exceeded the limit by a significant amount, and the state generator reverted to the earlier pool as described in Section 4.2. However, even in such instances, there were sufficient states available for each device, and despite potentially worse balancing of work between devices, the overall solving time does not appear to deviate significantly. The two kernels perform well, and we observe that, as expected, Kernel 1 is consistently slower than Kernel 2, for reasons identified in Section 5.1.
Varying the number of devices used to solve an instance of the problem shows a quasi-linear improvement in overall solving time. For instance, Figure 7 shows how the time needed to solve N = 22 varies with the number of A100 GPUs involved in the computation. In theory, there should be a linear improvement in solving time as more devices are involved; however, this is not the case. We attribute the quasi-linear trend to factors such as imperfect work balancing between devices. As mentioned earlier, it is not possible to accurately determine the solving time required per state; therefore, it is often the case that some devices complete solving earlier than others and remain idle while the others continue to solve, as was the case in this instance.
Results presented so far were collected via a single job submission that carried out all work. The cluster used for our results, however, imposes a strict 48-h limit on jobs to enforce fair resource sharing, which was insufficient time to collect results for N = 25. Following the solving time progression we had observed up to that point, whereby the time needed to solve a problem instance n with some solver configuration is approximately 8.35 times that of n − 1 using the same configuration, we estimated the time necessary to carry out the full computation would be approximately 5.6 days, based on the time taken to solve N = 24 using eight A100 GPUs. With this very crude approximation, we generated a large initial state pool of 20,746,561,752 states by locking up to row 9, and partitioned it into 20 sub-pools. Despite the considerable memory requirements of the state generation process, which peaked at 1.75 TB of total memory usage (RAM and persistent storage, as explained in Section 4.2), the overall state generation process lasted approximately 40 min.
To tackle these sub-pools of states, we manually scheduled 20 jobs over the course of 2 weeks, each using eight A100 GPUs and solving one sub-pool of states, and completed the computation in a combined 670,747 s of run time, or approximately 1 week and 18 h. We launched individual jobs in succession, with intervals between them, to enable better sharing of resources with other users than what is implemented by the job scheduler.
It is worth noting that the combined solving time here is the result of accumulating the time each job required, and is a reflection of the total solving time necessary for N = 25 on eight A100 GPUs. The time necessary would, however, be reduced greatly in a scenario where all 20 jobs execute concurrently over 160 GPUs. In such a scenario, the overall time would be that of the longest-running job, which in this instance was 46,481 s, or approximately 13 h.
Figure 8 presents the solving time taken by each job submitted to solve the respective sub-pool of states. It is noteworthy that, during state generation, we elected not to shuffle the full pool of states but rather to shuffle each individual sub-pool, to save on time and persistent storage input/output operations, which can be demanding for such large data sets. We shuffled each sub-pool in an effort to better distribute the workload between the GPUs involved in tackling that pool, but we accepted that the time each job required to complete would likely not align with other jobs, which proved to be the case. There are large discrepancies between job completion times, for instance between Job 2 and Job 20, which we attribute to the absence of shuffling of the state pool before partitioning into sub-pools, as some initial states will have more possible solutions to explore, depending on where the first few queens are placed.
Results presented in this paper show a significant improvement in performance over our earlier results, shown in Table 1. The performance gains were made by optimizing the state generation process to eliminate vertical reflections of generated states, introducing the register-based kernel implementation, and applying several hand-tuning optimizations to maximize the performance of both kernels. Unlike our earlier work, our solver currently uses 64-bit counters for result accumulation by each solver thread, the negative performance impact of which we have countered with hand-tuning optimizations.
The N-Queens problem has attracted the attention of the optimization and parallel processing communities, with several contributions in the literature of GPU-based solvers such as ours. We highlight that data in the literature is collected using different methods and tooling over different hardware, making it difficult to produce an objective comparison. Ideally, such a comparison of approaches would be performed on similar hardware using similar library versions and tooling, and even in such a controlled environment, design choices influenced by the current state of the art at the time of each work would complicate a direct comparison.

3. The algorithm introduced in this paper is DoubleSweep-Light, which is a lighter version of DoubleSweep: unlike Somers' algorithm, branching starts at the top row and proceeds (only) to the next row, while the data structure of DoubleSweep is used only for one sweep down, without iteration, until the first row is found with at least two open cells or the end of the board is reached, as illustrated by Figure 2. In light of the high complexity of implementing such dynamic algorithms on GPUs in an efficient manner, this light version, which compromises on some of DoubleSweep's advantages, was chosen as a means of balancing algorithmic complexity with the need for an optimized implementation for our solver.
implementation. As such, we have made a number of adaptations and created the DoubleSweep-Light algorithm, to bridge the divide between algorithmic performance and feasible optimizations in implementation. Implementation details of this algorithm for the massively parallel GPU environment are detailed in Section 5. DoubleSweep-Light, like DoubleSweep, works by making a placement of a queen on the board, followed by a propagation step. The main difference between the two is the propagation step: DoubleSweep-Light does not perform full propagation through multiple sweeps over the board.
F I G U R E 2 Step-by-step application of DoubleSweep-Light.

F I G U R E 7 Impact of varying the number of devices (NVIDIA A100) involved in the computation of N = 22.
F I G U R E 8 Time taken per job, for each of 20 jobs used to solve N = 25.
advance_state(s, locked_idx):
    let cr ← s.current_row
    while locked_idx ≤ cr.row_index < N do:
        let idx ← cr.current_queen_index
        foreach i ∈ ]idx, N[ do:
            if ¬has_diagonal(s, i) ∧ ¬blocked_col(s, i) then:
                idx ← i
                break
        if idx ≠ cr.current_queen_index then:
            place_queen(s, cr, idx)
            let x ← min(cr.row_index + 1, N − 1)
            s.current_row ← s.row_at[x]
            return ⊤
        else:
            cr ← s.row_at[cr.row_index − 1]
    return ⊥
Listing 1: State advancement algorithm.

Following the advancement of a state by applying advance_state, the propagation step is performed on the state s by applying the function derive_queens presented in Listing 2. This function identifies which column (if any) in the current row of s is free (Lines 3-8) and, if such a column exists, places a queen, repeating the same operation in the following row (Lines 9-11).
Listing 2: The DoubleSweep-Light algorithm (derive_queens).
TA B L E 1 Solving times in milliseconds from our earlier work 1 for different values of N across a number of benchmark configurations.
F I G U R E 6 Solving time in seconds required to tackle each N ∈ [19, 24] using both the register-based and shared memory-based kernels.
TA B L E 2 Number of states in state pools generated for our experiments.