Analog Photonics Computing for Information Processing, Inference and Optimisation

This review presents an overview of the current state-of-the-art in photonics computing, which leverages photons, photons coupled with matter, and optics-related technologies for effective and efficient computational purposes. It covers the history and development of photonics computing and modern analogue computing platforms and architectures, focusing on optimization tasks and neural network implementations. The authors examine special-purpose optimizers, mathematical descriptions of photonics optimizers, and their various interconnections. Disparate applications are discussed, including direct encoding, logistics, finance, phase retrieval, machine learning, neural networks, probabilistic graphical models, and image processing, among many others. The main directions of technological advancement and associated challenges in photonics computing are explored, along with an assessment of its efficiency. Finally, the paper discusses prospects and the field of optical quantum computing, providing insights into the potential applications of this technology.


Introduction
In 1965, Intel co-founder Gordon Moore formulated an empirical observation that the number of transistors in a microprocessor will double nearly every two years, the statement which is known as Moore's law. [1,2] This prediction was followed by the forecast of reaching a saturation point by 2015. The progress of conventional computer architectures was very close to Moore's vision, and reaching the saturation point was just a matter of time. The miniaturization of silicon transistors recently managed to break the 7-nm barrier, which was believed to be the limit.

DOI: 10.1002/qute.202300055
Also, Moore's law usually comes with several essential indicators, such as the processor's thread performance and clock frequency, which reached the point of saturation much faster than the density of the transistors. All of these factors limit the scaling performance of modern computers. However, there are other reasons for the saturation of conventional computing power growth, which are the consequences of Moore's law. For example, increasing the number of transistors allows one to obtain more powerful systems. Still, the processing speed will inevitably decrease with the concomitant increase in heat production, while increased energy consumption is connected with the growth of the performance. Another critical issue is the so-called von Neumann bottleneck, [3] arising from the architecture design. It refers to the computer system throughput limitation due to the characteristic of bandwidth for incoming and outcoming data. [4,5] All these issues pose severe problems to the future of conventional computer development. As a result, the alternatives to von Neumann systems started to emerge. [6,7] One turns to alternative hardware architectures and purposebuilt devices to keep up with the scaling performance. As such, universal quantum computing promises to decrease the algorithmic complexity of solving challenging tasks by exploiting the entangled states. However, in contrast to this high-risk and highreward strategy (also discussed below in the optical setting), there is an option to replace electrons with photons but remain in the scope of classical or classical with a transient quantum coherence regime of optical computing. The motivation for such transition is clear since photons move at the speed of light, have low heat production, have high density and can be efficiently coupled to matter to exploit nonlinear behaviors. Moreover, optical technologies have matured and entered our everyday lives, such as fiber optic channels that carry the global traffic of information or optical readers of compact disks. However, the conversion of photons into electrons is required for compatibility with complementary metal-oxide-semiconductor (CMOS) architectures. Such conversion takes a significant portion of energy, slows down the overall process of information processing, and presents a severe technological bottleneck in this type of hybrid technology.
Despite these difficulties, optical hardware is exploited in computing devices. For example, different application-specific www.advancedsciencenews.com www.advquantumtech.com photonic hardware can operate on a reasonable scale in data centres for heavy machine learning (ML) applications and large-scale optimization. Moreover, neural network (NN) architectures are nearly ideally suited for optical hardware with the potential to achieve high efficiency, fast computing times, and low energy consumption due to the desired physical properties of the photonic systems. Nevertheless, at this point, optical computing can not be associated with mainstream technology. It is unlikely that optics will ever replace electronics as the universal platform in the foreseeable future. The additional reason is the technological inertia accumulated through the years by significant investments in CMOS technologies. Partially, the rapid development of what we call conventional computers in the early years led to an ever-increasing gap with computing using photonics, which will occupy its own place in the domain of application-specific hardware.
There are many excellent reviews on the topic of optical computing. The challenges of modern computing and new opportunities for optics are discussed in Ref. [8]. This work presents the latest research progress of analogue optical computing, focusing on three main directions: vector/matrix manipulation, reservoir computing (RC) and photonic Ising machine. Moreover, it covers the topic of computing efficiencies, such as the ratio of performance and power dissipation and the error/precision interplay of such hardware. Another excellent review considers analogue optical computing in the context of artificial intelligence (AI) applications. [9] This work provides an overview of the latest accomplishments of optical computing, considering the realization of different AI models and NN paradigms. One can find additional information in other reviews, [10][11][12][13][14][15][16] which appeared due to the recent interest in deep learning (DL) methods and their success in many domains.
What differentiates our review from those listed above is that we treat analogue optical computing using the concept of universality of the underlying dynamical systems description. The advantage of optical computing comes from ultrafast emulation of the dynamics. [17] We focus on physical optimizers that exploit bifurcation dynamics and threshold operation and aim at solving nonlinear problems, therefore, going beyond the speed-up of performing the linear operations that optics is so efficient at.
We organized our review as follows. Section 2 provides a short history of optical computing together with the modern analogue computing platforms focusing on NN implementation and other neuromorphic systems. Section 3 discusses the specialpurpose optimizers and several examples of such devices. This section connects the operational regimes of such machines with the complexity classes and addresses the scalability of this approach. Section 4 focuses on the physics of optical computing devices based on laser networks, optical parametric oscillators (OPOs) in fiber, photon or polariton networks, as well as their mathematical models. The second part of this review investigates the mathematical structures of different assignments and their emulation by the physical systems. The following Section 5 lists a wide range of possible applications across different applied domains. The final part consists of our subjective perspective on the future technological development of optical computing field in Section 6 and passing remarks about quantum optical devices in Section 7. Finally, Section 8 summarizes the results.

Analog Optical Computing
Modern technologies demand vast data flows, creating various challenges for the development of the semiconductor industry and pushing classic electrical circuits to their physical limits. The developments range from more mainstream such as optical components that can be integrated into traditional computers or play the role of specific hardware, dealing with computationally heavy tasks or supplementing such calculations, to ambitious ones, such as all-optical digital computer architecture.

Brief Prehistory of the Optical Computing
Although optical computing is an emerging technology that has gained more momentum over time (especially considering the popularity and efficiency of the latest data-driven approaches), many significant advances have been made in previous decades. Therefore, before describing the particular systems, their advantages and their applications, we briefly discuss the progress that enabled the future developments. More information and additional details can be found in Ref. [18].
The generic optical processor architecture comprizes three plane parts: the input, the processing and the output planes. Early on, the input plane was a fixed image slide with its later change to a spatial light modulator (SLM), introduced to perform the input signal conversion. The processing plane can be composed of lenses, nonlinear components, or holograms, while the final output part is composed of photodetectors or a camera.
The first promising applications for optical processors were pattern recognition tasks, which influenced the prototypes of optical correlators. The simple architecture called 4-f was based on the work on spatial filtering, see. [19] The Fourier transform property of a lens is the standard function of many optical computing schemes, taking advantage of the speed and parallelism of light. The second type of correlator architecture was presented in 1966 by Weaver and Goodman, [20] which is called the joint transform correlator (JTC), see Figure 1.
Before 1950, there were significant steps in development of optical technologies such as the theory of image formation in the microscope, [21] developed by Abbe, the development of phase contrast filter by Zernike [22] and the appearance of the information optics after Elias Snitzer in 1952. [23,24] Other major inventions of that period were holography (by Gabor,1948) [25] and the development of the laser in 1960. [26,27] The consequent introduction of the off-axis hologram allowed the separation of the different terms of the reconstruction solving 3D reconstruction tasks [28,29] in 1962 by Leith and Upatnieks, which basically led to practical holography and was further enhanced by Lohmann, creating the first computer-generated hologram [30,31] in 1966. Early SLMs were based on the Pockels effect with few prospective devices. [32][33][34] Liquid crystal technology is the most commonly used technology for SLMs today. Another significant step was the invention of the first optical transistor, [35] the hope for small integrated circuits.
The period from 1980 to 2004 was vibrant and productive. Active progress was going in the field of holography, particularly new encoding methods and the point-oriented methods were developed to achieve high quality and high diffraction efficiency optical reconstructions of the computer-generated holograms. [36] Figure 1. The optical setup of the joint transform correlator (JTC). Reproduced with permission. [18] Copyright 2010, de Gruyter Academic Publishing.
More than 50 types of SLM were introduced in the eighties and nineties. [37] Optical transistors presented another active area of research with the appearance of the micro-electromechanical systems (MEMS) technology. [38] Vertical-cavity surface-emitting lasers (VCSELs) and the self-electrooptic effect devices entered the markets in 90s. [39] In general, many aspects of modern optical interconnections and their components were introduced and studied during this period.
The optical technologies development provided the necessary experience in the capabilities of the optical devices and led to the maturation of the experimental element base. Optical computing received a second chance after the success of so-called deep NNs, which share many similarities with the previous neural-like optical architectures.

Modern Optical Computing
Today, numerous research topics benefit from the progress in optical computing; therefore, the field is no longer so well defined. For example, some of the algorithms initially developed for pattern recognition using optical processors are now used successfully in digital computers. Other fields, such as biophotonics, largely benefit from past optical processing research.
Transistor is the fundamental building block of modern electronic computers. Therefore, one must find an equivalent optical transistor to replace electronic components with optical ones. To assemble such transistors into the higher-level components to create an all-optical computer's central processing unit (CPU), one has to design the optical processor and optical storage and organize the optical data transfer. However, such an approach faces many challenges, while the potential of optics in large architecture consisting of higher-level components can be seen as some-what speculative. [40] Among persisting problems are the scalability of the optical logic devices due to the bad logic-level restoration, cascadability, fan-out and input-output isolation, energy consumption issues and non-miniature device footprint. Moreover, coupling these potential devices with the electrical components will require conversion of information carried from photons to electrons, which is relatively slow and energy-consuming.
However, the development of integrated photonics continued. [41] It led to attempts to create linear logic elements, such as all-optical logic gates, [42,43] improve the existing optical transistors and develop new ones in the context of the all-optical processing. [44,45] One can use SLM, micro-lens array and holographic elements in free space to realize optical linear interconnection. Such linear elements are essential components in various optical computing devices.
Nonlinearity is another essential component in optical schemes; however, its realization meets specific difficulties as light beams pass through each other unperturbed in a pure vacuum. To force these beams to interact, one has to set up a highenergy experiment, which is challenging to realize in practice. There are two other ways to realize the nonlinearity: introduce the digital readout mechanism, implemented by the chargedcoupled device, send it to a computer with further nonlinear processing before feeding it back to SLM, or develop fully optical nonlinear activation materials with high enough intensity of the beams (utilising absorption, refraction or scattering processes). Nonlinearities can be divided into local (as needed in neural architectures) and global systems (such as RC systems, see below). Combining the linear and nonlinear elements led to the developing of specialized isolated devices. As a result, optical computing research has seen a resurgence in activity, centring around new developments in photonic hardware accelerators and neuromorphic computing.

www.advancedsciencenews.com www.advquantumtech.com
Neuromorphic computing usually denotes the use of integrated systems to mimic neuro-biological architectures. Although it is very close to the domain of AI, with the stress on the word "artificial", which deals with the intelligent designed machines or agents, we will use neuromorphic computing in the general sense to describe any neural systems, be it brain or nature-inspired or artificially designed. Modern key focus areas are trying to emulate the neural structure and functionality of the human brain, including probabilistic computing, which incorporates uncertainties. Optics has required ingredients to emulate NNs. [13,46]

Optical Neural Networks
Optics has long been considered as a promising technology for implementation of matrix multiplication and interconnects. Artificial neural networks (ANN) have been widely exploited for industrial and fundamental applications, and this new technological demand created a renewed case for photonic NNs. Although most ANN hardware systems are electronic-based, their optical implementation is promising because of their built-in parallelism and low energy consumption. Disparate ANNs vary by types of constituent elements, mathematical operations and the architecture used. In photonic ANNs, the mathematical operations are mapped to the characteristics of optical propagation set by programmable linear optics and nonlinearity. A scalar synaptic weight describes pairwise connections between artificial neurons. At the same time, the layout of interconnections can be given as a matrix-vector operation, where the input to each neuron is the dot product of the output from connected neurons with assigned weights.
Photonic realizations of ANNs fall into three categories. First, free-space systems rely on diffraction, Fourier transforms, etc. [47,48] They have high scalability and can simultaneously process large numbers of neurons but suffer limited connectivity. One example is a reconfigurable, scalable two-layer NN for the classifying phases of a statistical Ising model. [49] Second, SLMs program linear operations and Fourier lenses implement the summation by collecting the light power encoded signal. However, in the case of free optics, the nonlinear optical activation functions are realized in a complicated manner, for example, with the laser-cooled atoms with electromagnetically induced transparency. [49] Finally, on-chip designs based on wavelength multiplexing [50] or beamsplitter meshes [51] can achieve programmable all-to-all coupling but need to scale better. One on-chip design was proposed in Ref. [52], where the optical platform takes advantage of encoding information in both phase and magnitude, thus making it possible to execute complex arithmetic by optical interference, which suits performing handwriting recognition tasks. Mach-Zehnder Interferometers (MZIs) can perform many functions, such as dividing and modulating the light signals, separating the reference and the main light beams, and implementing a complex-valued weight matrix.
Tunable waveguides can multiply optical signals, while wavelength-division multiplexing can add signals. Wavelengthdivision multiplexing can be achieved by the accumulation of carriers in semiconductors, [53,54] electronic currents, [55,56] or photoninduced changes of the material. [57] To achieve the full po-tential in on-chip architectures, one must require long-range connections between neurons, assisted by photonic waveguides that outperform metal wires connections of conventional electronics but fall behind free-optics solutions. In particular, silicon photonic platforms demonstrated efficient neuromorphic architectures. [50,51,55] Coherent input light can be transformed using an array of beam splitters and phase shifters. [58] This is done by assigning inputs to different waveguides and modulating their power. Another approach to weight configuration is modulating the effective refractive index of signal-carrying waveguides. Nonvolatile synapse implementations, also known as all-optical, do not require electrical inputs for tuning. Instead, they use optically induced changes in chalcogenide materials to control light propagation in waveguides. [59] This could lead to improved heat dissipation.
A scheme based on homodyne detection has several scaling advantages over on-chip approaches. It has linear chip-area scaling and constant circuit depth. [60] The input vector is encoded onto a pulse train and fanned out to an array of homodyne detectors. Each detector computes a product between the input and a row of a matrix encoded into optical pulse trains. The accumulated charge on the homodyne detector performs matrix-vector multiplication. The output is sent through an electrical nonlinearity and converted back to optical signal using a modulator. The advantage of the homodyne detection scheme is that the matrix elements (weights) are encoded optically and can be dynamically reconfigured. This procedure requires a reduced number of photonic components: the number of modulators, detectors, and beamsplitter grows linearly with the number of neurons. The homodyne detection architecture can be parallelized to implement general matrix-matrix multiplication by routing the light out of plane. [60] This is useful in practical NNs that reuse weights (either natively in convolutional layers or through batching).
Nonlinearity in ANN is required to implement the thresholding effect of the neuron. Some photonic devices exhibit nonlinear neuron-like (gate-like) transfer functions. However, the challenge is to achieve cascadability. Photonic neurons need to be able to respond to multiple optical inputs, apply a nonlinearity, and produce an optical output that can drive other photonic neurons. Integrated photonic solutions use either optical/electrical/optical (O/E/O) or all-optical design to achieve such cascadability. In the O/E/O approach, nonlinearities may be introduced during the E/O conversion stage by employing lasers, saturated modulators or photodetector-modulators [61] or in the electronic domain only (e.g., the nonlinear behavior of spiking photonic neurons can be achieved using a superconducting electronic signal pathway. [62] ) NN architectures can take different forms: feed-and backforward, layered and recurrent, spiking or continuous etc. Different neural models have unique signal representations, training methods, and network topologies. Weight configurations can differ depending on the training type: supervised training, unsupervised or programmatic "compilation." Topology describes the graph structure of neuron connectivity, and often it is advantageous to ANN operation to constrain the topology to guide weight configurations. Therefore, hardware implementation details may differ between different ANN, while the key technologies necessary for practical realization include active on-chip electronics and light sources. Many photonic architectures have already been demonstrated: recurrent ANN, continuous-time and Figure 2. a) One layer of an optical NN with k layers consists of matrix-vector multiplication (grey) and non-linearities (red). b) One-level implementation. Matrix multiplication is performed by combining the input and weight signals and performing balanced homodyne detection. The final signals are sent through a non-linear function (red), serialized, and sent to the following layer's input. Reproduced with permission. [60] Copyright 2019, American Physical Society.
programmed by compiler; [50] feed-forward, single-valued and externally trained ANN; [51] spiking, feed-forward ANN with both external and local training; [57] a feed-forward multilayer ANN created using semiconducting few-photon light-emitting diodes and superconducting-nanowire single-photon detectors; [55] diffractive networks with a nonlinearity. [49] The computational tasks solved by these platforms cover the main functions attributed to ML and AI: image and audio recognition and classification, simulation of dynamical systems, combinatorial optimization and many other applications, which we will discuss in Section 5. Some of the architectures and their different experimental realizations are shown in Figures 2-4.
The key merit of NN hardware is the level of energy consumption, which can be evaluated as petaMAC (multiply-accumulate operations) per second per mm 2 processing speeds [63] and attojoule per MAC energy efficiencies. [64] In general, current optoelectronic hardware offers great advantages for implementing ANN, but eliminating the electrical contribution will inevitably be beneficial. For practical applications of neuromorphic photonic systems, one needs to reduce heat dissipation during information transfer between electrons and photons. Improving optical sources, high-efficiency modulators, and photonic analogue-todigital interfaces can reduce the need for electronic processors. However, current photonic platforms lack functionality such as logic gates, high-level compilers and assemblers, analoguedigital-analogue conversion, and memory. While photonics has advantages in connectivity and linear operations over electronics, on-chip memory is challenging. In-memory computing, where processing is performed in situ with an array of memory devices called memristors, has been established; [65,66] however, reading and writing at high frequencies is still challenging. The recent trends in the development of the ANN show the increasing demand to lower the power consumption of the devices. At the same time, the requirements for parallelism and scalability remain the same through the years. [67] Thus, the optical domain offers a promising solution to future hardware requirements.

Reservoir and other Neuromorphic Computing Systems
RC is a recurrent NN-based framework for computation that maps input signals into a specific computational space of the fixed nonlinear system dynamics. This system is usually called a "reservoir", and its state is passed to a simple readout mechanism, specifically trained to get the final output. [68] The original concepts of RC can be traced to the liquid-state machines [69] and echo-state networks. [70] Many physical systems can reproduce this computational framework, and the optical/photonics domain is no exception. The extension of RC to deep hierarchical RC allows one to create more efficient models and simultaneously investigate the inherent role of layered composition in recurrent structures. Another promising research direction is to combine RC with quantum physical systems to access larger computational space.
The idea of RC is to exploit the rich nonlinear dynamics of controllable nonlinear systems and simultaneously overcome the disadvantages of recurrent architectures with their challenging and time-consuming training for both hardware and software systems. The RC training is performed only at the readout stage, as the reservoir dynamics are fixed. This readout framework enjoys the benefits of a particular photonic physical system, such as speed or energy consumption, reducing learning costs. In fully functioned 2-layer all optical NN the first layer comprizes a linear operation by the first SLM (SLM1) which encodes a certain pattern and a nonlinear activation function based on the electromagnetic induced transparency at magneto-optical trap (MOT). The second layer contains the second SLM (SLM2), converting four beams into two output beams at camera C3. The collimated coupling laser beam passing lens L1 is incident on the SLM1, which generates four beams at the focal plane of L3, which is monitored by a flip mirror (FM) and camera C1. Four beams are imaged on the MOT through a 4-f system comprising L4 and L5. A probe laser is going opposite the coupling beam, which is imaged on camera C2 through L5 and L6. L7 and L8 achieve further amplification. Four beams are incident on SLM2, generating two beams and then focusing on camera C3. Reproduced with permission. [49] Copyright 2019, Optica Publishing Group.
Another RC benefit is learning temporal (dynamic) dependencies compared to the feed-forward architectures used for static (nontemporal) data processing. The simplicity of the training procedure in RC is attractive. However, accessing complex dynamics without rigorous understanding can lead to many problems. Operating within the RC framework usually needs extensive experiments and experimental verification due to the need for a unified theory of RC. Another disadvantage is the performance instability due to the noise present, typical for nearly chaotic dynamical systems.
Nevertheless, many successful cases of RC are being applied to practical problems, such as temporal pattern classification, time series forecasting, pattern generation, adaptive filtering and control, and system approximations. Moreover, RC can be used conventionally for static data processing. The first all-optical implementation of RC was demonstrated within a simple optical delayed feedback loop combined with the nonlinearity of an optical amplifier. [72] Concerning the free-space optics principles, an image processing task was successfully solved using a predesigned configuration with a diffraction grating and Fourier imaging with randomly interconnected microring resonators. [73] The reservoir consisting of a diffractive optical element was described based on an 8 × 8 laser array (of VCSELs) and an SLM. It showed rich dynamics with the potential for scaling up. [74] Further modifications of this setup with a laser illumination field and digital micromirror device allowed one to realize the large-scale RC scheme with 2025 diffractively coupled photonic nodes applied to a time series prediction task. [75] The recurrent 24-node silicon photonic NN, in which microring weight banks configure connections, was programmed using a "neural compiler" to solve a differential system emulation task with a 294-fold acceleration against a conventional benchmark. [50] Some hybrid architectures, such as opto-electronic devices, similarly benefit from the RC concept. For example, excellent performance has been obtained for speech recognition, [76][77][78] chaotic time series prediction, [77,79,80] and radar signal forecasting, [81] with the operating speed in the megahertz range and the potential to increase it to gigahertz speed, at the same time preserving the state-of-the-art numerical accuracy. Additional cases of successful RC have been reported in literature. [68,76,82] We will consider the NN and RC architectures cases involving quantum effects in Section 7.1. Another example is illustrated by

Nonlinear Optimization with Optical Machines
Problems that can be solved by optical hardware cover a wide range of optimization problems: linear and nonlinear in binary, integer, real or complex variables, with or without constraints. Such a general framework can include many applications across social sciences, finance, telecommunications, aerospace, biological and chemical industries. [83] Nonlinear optimization problems are quadratic to second order around the vicinity of the optimal solution so the usual simplification involves Quadratic Programming (QP) that minimizes quadratic functions of variables subject to linear constraints. Such approximation can be successfully performed even outside the feasible solutions space. QPs can be met in the least squares regression or as a part of a bigger problem, such . Complex-valued coherent optical NN. a) Scheme with an input layer, multiple hidden layers, and an output layer. b) The schematic of the optical neural chip in implementing complex-valued networks. The single chip performs all stages, such as the input preparation, weight multiplication and coherent detection. The division and modulation of the light signals (i 1 − i 6 ) are realized by the MZIs (red). Green MZI separates the reference light. Blue MZIs are used to implement the 6 × 6 complex-valued weight matrix. Grey MZIs are used for on-chip coherent detection. c) The workflow of the ONC system. Reproduced with permission. [52] Copyright 2021, Springer Nature. as support vector machine (SVM) training. The apparent correspondence between the QP objective function and 2-local spin Hamiltonians of various physical systems allows one to map the problem into the physical setup. Here, the variables are associated with "spins" and the objective function with a "Hamiltonian" that encodes the interactions patterns and strengths between spins.
A system can find the optimal solution or ground state of a spin Hamiltonian using either quasi-or non-equilibrium regimes. In thermodynamic equilibrium, the system may use quantum annealing with a time-dependent Hamiltonian. This involves adiabatic evolution from an initial trivial Hamiltonian to a final Hamiltonian encoding the original objective function. However, this process can become inefficient for larger systems and sophisticated Hamiltonians due to a shrinking spectral gap. [84] Many open non-equilibrium gain-dissipative systems, such as lasers and photonic or polaritonic condensates are non-Hermitian systems and, therefore, do not have a ground state. Instead, they tend to minimize losses on the route to coherence. One can use geometric analogies to describe their operational principle as the approach of the surface of the optimization cost function (loss landscape) from below. There are two main processes that lead to loss minimization: bosonic stimulation below the threshold and the coherence of operations at the threshold responsible for the quality of the solution. After increasing the gain to the point where it overcomes the linear losses and is Figure 5. a) Schematics of proposed reservoir computing (RC) architecture. The electric input signal u(t) is coded on optical pulse, (t) is coded by optical modulator. The sinusoidal nonlinearity is achieved by the electro-optic conversion (F in ). RC system comprizes the L-array of optical cavities with N temporal nodes with N × L virtual nodes. Photodetectors (PDs) get the output signals from optical circuit and generate nonlinear conversion (F out ). The digital processing unit detects signals |x l | 2 and weights and sums them to obtain the final output y(t). b) Schematic of the waveguide layout for input mask circuit with the three types of installations for optical connection, and c) input mask reservoir circuit. The input weights h l and reservoir parameters can be tuned by the phase shifter, MZI and variable optical attenuator setup. Reproduced with permission. [71] Copyright 2021, Springer Nature. stabilized by the nonlinearity of the gain saturation, the emergent coherent state minimizes the losses (equivalent to maximization of the total number of particles). It hence achieves the loss minimization mapped into the objective spin Hamiltonian. The system elements' resulting evolution closely resembles the Hopfield Networks' dynamics, proposed to be used to solve quadratic optimization problems forty years ago. [85] Despite the successes and a lot of excitement generated back then, the optimizers based on Hopfield networks were almost forgotten primarily due to the high connectivity required between neurons and the concomitant evolution time of the networks used by classical architecture. Hopfield networks have recently regained interest because they can now be emulated with physical systems such as electronic circuits or photonic NNs. Photonic systems have an advantage over electronic systems because they operate on a much faster time scale and can carry many signals through a single optical waveguide. This means that a photonic implementation of Hopfield networks as optimizers can have large dimensionality, dense connectivity, and quickly converge to the optimal solution.

Spin Hamiltonians and Hard Optimization
Real-life decision or optimization problems can be challenging for conventional classical computers. These problems can often be reformulated into finding the ground state of a spin Hamiltonian, which can be emulated with a simulator quantum, classical or hybrid. The overhead in the number of additional variables needed during this mapping is at most polynomial. However, such spin model Hamiltonians are experimentally challenging to implement and control. Two classes of spin Hamiltonians are generally considered: Ising and XY. The Ising model attracts the most attention because many challenging discrete combinatorial optimization problems can be mapped into it. [86] The minimization of the Ising Hamiltonian with the coupling matrix J (the minimization of the quadratic unconstrained binary optimization problem (QUBO)) is formulated as Going beyond quadratic would lead to a k-local spin Hamiltonian: This problem known a higher-order binary optimization problems (HOBO) appears in many contexts including including k-SAT, [87] number factorization, [88] computing the partition function of a 4D pure lattice gauge theory, [89] molecular similarity measurement, [90] and molecular conformational sampling. [91] In the XY model "spins" are continuous vectors s j = (cos j , sin j ) and the quadratic continuous optimization problem (QCO) becomes and directly applicable for phase retrieval problems. [92][93][94][95] When phases j are restricted to take on discrete values 2 ∕n with an integer n > 2, Equation(3) becomes the n−state Clock model with applications in chemical materials and protein folding research. [96] The appearance of continuous spins is a common feature in many optical systems because short photonic impulses can be characterized through amplitude and phase variables. Some of the optical hardware for ML take advantage of this feature. For example, complex-valued NNs, [52] or more unusual concept of analogue transformations using a nonlinear set of functions were proposed. [97] The computational complexity of a problem is determined by the time or number of operations required to solve it as the problem's size increases. A problem belongs to the P class if a polynomial algorithm exists for solving it. If a polynomial algorithm exists for verifying a solution, the problem belongs to the nondeterministic polynomial-time (NP) class. Most difficult problems in NP are called NP-complete. They are equivalent to each other in that either all of them or none admit a polynomial-time algorithm. A problem is called NP-complete if the existence of an efficient algorithm for its solution implies the existence of such an algorithm for all NP-complete problems. If a decision problem with a yes or no answer is NP-complete, then its corresponding optimization problem is said to be NP-hard. This means that NPhard problems are not any easier to solve than the corresponding NP-complete decision problems. The computational complexity of the Ising model has been studied and shown to be NP-hard for certain cases. [98] The existence of universal spin Hamiltonians has been established, meaning that all classical spin models can be reproduced within such a model. [99] The mapping of various NP problems, including Karp's 21 NP-complete problems, to Ising models with polynomial overhead has been formulated. [86] The procedure for creating "hard" instances for spin Hamiltonians was developed based on statistical physics, here the hardness of problems is connected to a first-order phase transition in a system. [100][101][102][103] Indeed, even a medium size hard instance is difficult to solve on a classical computer due to the exponential growth of operations with size. Having a unified set of optimization problems with tunable hardness is an ongoing research direction. It will allow for an objective benchmark of classical and/or quantum simulators and algorithms. Otherwise, it would be hard to evaluate the performance of state-of-the-art platforms and methods.
Current research made a good starting point in developing a standardized procedure for performance evaluation. For example, the "optimization simplicity criterion" was recently proposed to identify computationally simple instances. [104] Optical machines with their mode selection operation often follow the dominant eigenvalue of the coupling matrix and find minimizers that correspond to the signs of the principal eigenvector components. If the minimizers of a given problem have this property, the solution will be found easily in polynomial (at most quadratic) time. One such popular example is the Ising model on the Möbius ladder graph. [104] By rewiring the Möbius ladder graph to random 3-regular graphs, one can probe an intermediate computational complexity between P and NP-hard classes with several numerical methods. Another way to construct instances for testing involves planted ensemble technique. [103,105]

Description of Physical Optical Platforms for Optimization
The concept of using simulators or analogue processing devices predated the modern computers. [106] In recent years, different physical platforms have been competing to solve classical optimization problems faster than conventional hardware. This competition has led to the emergence of a new field of coherent networks/Hamiltonian simulators at the intersection of laser and condensed matter physics, engineering, and complexity theories. Various physical systems have emerged as promising platforms for solving computational problems. We shall overview some of these systems following the structure we used in Ref. [15].

Laser Networks
A new generation of complex lasers, such as degenerate cavity lasers, multimode fiber amplifiers, large-aperture VCSELs, and random lasers, have many advantages over traditional laser resonators when used for computing. [107] They have many spatial www.advancedsciencenews.com www.advquantumtech.com degrees of freedom and controllable nonlinear interactions. The spatial coherence of emission can be tuned over a wide range and the output beams may have arbitrary profiles providing pairwise couplings. These properties allow complex lasers to be used for RC [108] or to represent the coupling matrix interactions. In laser networks, coupling can be engineered by mutual light injection from one laser to another. This introduces losses that depend on the relative phases between the lasers and drives the system to phase-locking that minimizes losses. [109] Degenerate cavity lasers are particularly promising as Hamiltonian simulators as all their transverse modes have nearly identical Q which leads to simultaneous syncronization of a large number of transverse modes. [107] The evolution of the N single transverse and longitudinal modes class-B lasers are described [110,111] by where for laser i, Equation (7) is widely used to describe the emergence of coherent behavior in complex systems. [112,113] It has been demonstrated that the probability of finding the global minimum of the XY Hamiltonian agrees between the experimental realization of the laser array and with numerical simulations. [111] However, simulating the Kuramoto model Equation (7) on the same matrix of coupling strength yields a much lower probability of finding the global minimum. This implies that amplitude dynamics provide a mechanism to reach lower energy states. Consequently, cavity lasers can be used as an efficient physical simulator for finding the global minimum of the XY Hamiltonian and solving phase retrieval problems. A particularly successful example was a digital degenerate cavity laser. [95] It is an all-optical system that uses a nonlinear lasing process to find a solution that best satisfies the constraint on the Fourier magnitudes of light scattered from an object.

Optical Parametric Oscillators
A network of coupled degenerate optical parametric oscillators (DOPOs) has been used to minimize the Ising Hamiltonian. [114][115][116][117][118][119][120] DOPO is a nonlinear oscillator with two possible phase states above the threshold that can be associated Figure 6. Schematics of the coherent Ising machine (CIM) with the feedback mechanism. The time-multiplexed pulse degenerate parametric oscillator is formed by a non-linear crystal (periodically polarized lithium niobate (PPLN)) in a fiber optic ring cavity containing 160 pulses. The feedback signal couples the independent pulses in the cavity and is computed from the measurements from different pulse fractions. Reproduced with permission. [114] Copyright 2016, American Association for the Advancement of Science.
with the Ising spins. These artificial Ising spins are encoded by the optical phase of short laser pulses generated by a nonlinear optical process, that is, optical parametric amplification. The DOPO-based simulator, coherent Ising machine (CIM), is a gain-dissipative system in which the ground state of the Ising Hamiltonian corresponds to the lowest loss configuration. The optimal solution is found by driving the system close to the nearthreshold regime, where other local energy minima are still unstable.
Currently, most successful implementations of CIMs use a fiber-based degenerate DOPOs and a measurement-based feedback coupling, in which a matrix-vector multiplication is performed on the field programmable gate array (FPGA) embedded in the feedback loop, see the scheme depicted in Figure 6. The computational performance of such a scalable optical processor, bounded by the electronic feedback, was demonstrated for various large-scale Ising problems. [114][115][116] The comparison of a possible CIM's speedup over classical algorithms is an ongoing study. [117,121] Furthermore, the ability to implement arbitrary coupling connections [114] between any two spins implies better scalability than the solid-state based annealer, that is, D-Wave machine. [115] In CIM, each Ising spin (a DOPO) is described by where Ψ i = A i exp[i i ] is the complex amplitude, i is the phase, p is pump intensity, and linear and nonlinear losses are normalized. A portion of the light is extracted from the cavity after each round trip, homodyned against a reference pulse to produce Ψ i that is fed to the FPGA, where the last term of Equation (8) is computed and converted back to light for the next round trip. CIM's essential elements are DOPOs with an unconventional operating mechanism called mode selection or gain-dissipative principle. Each DOPO is prepared in a linear superposition state of different excitations to implement a quantum parallel search.
The cost function is mapped to the effective loss, photon decay rate, of the given network by setting the coupling coefficient proportional to the J ij , which encodes the information about the given task. The ground state of the Ising Hamiltonian corresponds to an oscillation mode with the minimum network loss. The system reaches the ground state with a minimum loss at the threshold pump rate. It starts oscillating as a single stable mode, which triggers photons' stimulated emission and affects the saturation for all the other modes. Detecting this single oscillation mode will give us the solution to the desired problem.

Networks of Light-Matter Particles
Photons have both attractive and not properties concerning computational assignments. However, despite the commonly known optical platforms, such as free optical setups or systems of lasers, it is possible to bind the photons with the matter wave excitations. This gives rise to unique designs, combining the photons with matter, such as exciton-polaritons. Exciton-polaritons (or simply polaritons) are quasi-particles that result from coupling of light confined inside semiconductor microcavities and bound electron-hole pairs. They are bosons and can form a condensed state above a critical density. [122] The design and choice of material allow for control of the polariton mass and the realization of solid-state nonequilibrium condensates at room temperature in organic structures. In the limit of small gain, solid-state condensates resemble equilibrium Bose-Einstein condensates. They approach lasers in the regime of high gain.
Similarly to polaritons, gas of photons confined in a dye-filled optical microcavity can form a macroscopic coherent state. [123][124][125][126] Experiments have shown that polariton or photon condensate lattices can be constructed using various techniques. One such technique is optical engineering of polariton lattices by injecting polaritons into specific areas of the sample using an SLM. [127][128][129][130][131] Additionally, potential landscapes have been engineered to confine polaritons or photons. [132][133][134] The evolution of gain-dissipative condensates in a lattice is described by rate equations derived from the tight-binding approximation, [135,136] taking the form of Stuart-Landau equations: where Ψ i is the complex amplitude, U is the polariton-polariton interaction strength, i is the effective injection rate (gain minus loss). The coupling strength is complex  ij = J ij + iG ij where the dissipative part of coupling is J ij and the Josephson part is G ij . If |J ij | ≫ |G ij | and the injection feedback is introduced that modifies the injection rate to bring all amplitudes to the same value A th , for instance bẏi = (A 2 th − |Ψ i | 2 ) then Equation (9) reaches the fixed point which is the minimum of the XY Hamiltonian. [135] The total number of quasi-particles in the system becomes where i is the phase of the i−th condensate. It follows from Equation (10) that if the total injection, ∑ i to stimulate that number of quasi-particles is minimized, then the XY Hamiltonian is globally minimized (Figure 7). To minimize the Ising Hamiltonian or n-states Clock Hamiltonians a resonant pump can be added to the system. [137] The resonant pump can be tuned with the condensate frequency, or n times the condensate frequency for n > 1 (n : 1 resonance). In this case, the dynamics of the oscillators (e.g., photons or polaritons) obeẏ where h(t) is the strength of the resonant excitation that grows in time until it reaches its stationary value H. At the fixed point, the total number of quasi-particles in the system reaches At n > 1, the last term of Equation (12) forces the phases to take discrete values in 2 ∕n and, therefore, minimize the Ising Hamiltonian (for n = 2) or the n-state Clock Hamiltonian (for n > 2).

Hybrid Optical Machines
If physics provides the route to optimization we can ask if we can combine the best optimization algorithms with the optics functionality. The idea of using a hybrid approach to optimization, combining the best of both optical and electronic technologies has the potential to greatly improve the efficiency and effectiveness of optimization processes. One possibility has been recently realized by Microsoft by building a hybrid opto-electronic platform that performs matrix-vector multiplication in the optical domain while iteratively applying the gradient descent to the minimum of the objective function using electronics. [138] This codesigning of unconventional hardware and algorithms paves the way for scalable architectures with compute-in-memory operation and spatial-division multiplexed representation of variables.

Mathematical Description of Optical Optimizers
Many existing optical machines can be described as the evolution of a set of N classical degrees of freedom. The variety of optical platforms, such as atoms, polaritons, excitons, photons, etc., shares many similar features in their mathematical description. We present here the structured list of the main equations used in the context of nature-inspired physical systems and algorithms in Figure 8. There are a few main reasons to highlight the unified picture of these equations. The first one is to show that all of the presented equations represent the same phenomena of the minimization principle and bifurcation dynamics, unifying many equations from math, physics, theory of NNs, etc., see. [139] The second important feature is represented through the structure of the list. One can easily find the correct transformation between two chosen equations. This is the reason we non-rigorously placed them in the order so that the canonical Andronov-Hopf oscillators (AHO) model resides at the top of the list and the most straightforward gradient resides at the bottom. One can land at the required equations starting from the canonical model by using the proper transformation in the neighborhood of the bifurcation leading to the solution or omitting some of the terms or derivatives. Moreover, the difference between the presented equations appears to lie only in the chosen parametrization of the system. The optimization process can be done differently, even in the scope of classical dynamical systems depending on the chosen parameters. Such a unified framework allows one to merge many empirical results or to work in the same framework of a universal model for a better comparison of results ( Figure 8). The Principle of Least Action, the Principle of Minimum Power Dissipation (or Minimum Entropy Generation) or the Variational Principle are good demonstrations that "optimization is built into physics". [140] In Hamilton's formulation, the fundamental Principle of Least Action states that a true dynamical trajectory of a system between an initial and final configuration in a specified time is found by choosing the one among the set of possible imaginary trajectories that makes the action locally stationary (in other words have least action). Such a variational task is an excellent example of physics spawning complex problems. For even more complicated tasks, one can consider the formulations of the Principle of Least Action for classical and quantum field theories. We do not include the explicit Hamiltonian equa- ) in the second block in the Figure 8. However, they also connected with the presented equations and served as a perfect entry point for considering the whole list from the physicist's point of view. Within the scope of this review, we restrict ourselves to classical systems with a discrete number of degrees of freedom and focus on PDEs that can be mapped into Newtonian-like equations of motion. Additionally, we do not pay much attention to the changes in the original equations of motion, such as Lagrange multipliers, holonomic constraints or relativistic factors.
Also depicted in Figure 8 is the well-known classical gradientdescent dynamics with the target cost function defined by the gradients − ∑ j≠i Q ij (x j ), which is the most straightforward equation among the presented dynamical systems. One can connect it with gradient descent with momentum (see the centre of Figure 8) or the classical momentum (CM) method, [141] or it's improved version-Nesterov accelerated gradient-descent [142] ).
Kuramoto model is a well-known mathematical model used to describe synchronization phenomena occurring in a system of coupled oscillators. [143][144][145] One can obtain this model from the AHO equations using the transformation that involved the eigenvalues and eigenvectors of the coupling matrix at the neighborhood of the Hopf bifurcation, directly derive it from a nontrivial dissipative Hamiltonian or view this model as the gradient descent over the cost function corresponding to the classical XY Hamiltonian.
The bottom part of Figure 8 consists of Hopfield NN and coherent Ising machine description. Hopfield NN is a recurrent artificial NN and can be viewed as the gradient descent variant with the effective projection term with characteristic time and the gradient terms − ∑ j≠i Q ij (x j ), which are usually represented through − ∑ j≠i J ij (x j ), where (x j ) is the projection function and J ij are the coupling strengths. [85] CIM equations are very close to the Hopfield description. CIM is a network of OPOs, in which the "strongest" collective mode of oscillations corresponds to an Figure 8. The ordered list of the main models from different branches of science used in the context of the optimization. The most general equations are closer to the top, starting with the canonical AHO model, which encompasses all equations below through certain transformations. In contrast, the simpler ones, like gradient descent, are located at the bottom. We non-rigorously group the models according to their use of the secondorder derivative terms. The functions optimum solution while going above the threshold of a particular Ising problem. [114,115] The main difference between the classical description of CIM (which is debated to be essentially nonclassical [146,147] ) and Hopfield NN is the additional pumping term p and saturation mechanism −x 2 i . The middle part of the Figure 8 contains simulated bifurcation machine (SBM) equations, which are inspired by the adiabatic evolution of classical nonlinear Hamiltonian systems exhibiting bifurcation phenomena. [148][149][150] The higher derivative makes the connection with the physics more visible and improves the simulation algorithm's performance for specific parameters.
An alternative perspective on the connections between the physical Lagrangian/Hamiltonian systems and NN evolution is given by the Modern Hopfield networks, or dense associative memories. [151,152] Modern Hopfield networks operate with feature x i and memory (hidden) h μ neurons that evolve as continuous variables in continuous time. The characteristic times for each group are f and h . The symmetric coupling functions are chosen according to Q iμ (h μ ) = iμ f μ and G μi (x i ) = μi g i and connect only neurons from different groups, that is, a feature neuron i to the memory neuron μ and reverse. The outputs of the memory neurons and the feature neurons are denoted by nonlinear functions f ({h μ }) and g i = g({x i }) correspondingly. These functions can be represented as derivatives of the Lagrangian functions for the two groups of neurons f μ = Choosing the specific Lagrangian will define the network's dynamics (or updates rule), which minimizes the energy function. One can recover an effective theory of evolution by integrating out hidden neurons.
The upper part of Figure 8 contains the Andronov-Hopf oscillators model, [153,154] the canonical model describing the appearance of the bifurcations, which are among the essential phenomena observed in neuron dynamics, responsible for the periodic activity. The functions Q ij are accountable for the interaction between the i and j oscillators, while i , i , i , U i represent the effective gain, self-frequency, nonlinear dissipation and self-interactions respectively. Many lasers, [155] photonic, polaritonic, [136] and biological systems [156] exhibit the so-called Andronov-Hopf bifurcation at the threshold that can spawn the limit cycle behavior. AHO can be an attractive choice for the unifying framework for many models presented here, which demonstrate a variety of collective phenomena. [139] Another important property of the AHO model is its canonicity, which means that in the vicinity of bifurcation, one can get every equation in Figure 8 below AHO by a certain transformation, which is not true in the reverse case. One can also investigate the bifurcation phenomena and the time-dependent behavior of the coefficients near the bifurcation point since it is the crucial mechanism for the system to find a solution to the optimization task. AHO shares its canonicity with another model -weakly interacting NN. The network consists of N neural oscillators comprized of excitatory (x i (t)) and inhibitory (y i (t)), that evolve according to the presented dynamical equations. [156] In the local context, functions f, g are responsible for the internal behavior of the ith part of the system. At the same time, p, q represents the external interactions, the strength of which is parametrized by the parameter. The explicit transformations between the equations can be found in Refs. [157,158,159] We will omit the explicit description of the transformations that lead from the top equations to the bottom, while the detailed discussion and corresponding references can be found in Ref. [139]. Although the MEMs do not contain the optical elements, they are governed by similar optical second-order differential equations. [159] The transition from the AHO to the CIM, Hopfield of SBM equations can also be found in Ref. [139]. It is important to remember that introducing sophisticated time dynamics of the parameters can improve the minimization properties of each of the presented types of equations. For example, it is possible to introduce specific time schedules (e.g., the chaotic amplitude method that anneals the coupling terms depending on the discrepancy between the oscillator amplitude and its saturation point [160] ) or to introduce the high-order terms (e.g.,̇i k ≈ [161] ).
An additional note should highlight the Principle of Minimum Power Dissipation and its role in analogue optimization machines. It was shown that many physical systems act through this principle and perform Lagrange function optimization. [140] The Lagrange multipliers are given by the gain or loss coefficients or their time-varying parametrization; see, for example, the equations of the CIM. Depending on the characteristics of the machine, it can be helpful in many other applied domains.
The operation of optical machines consisting of N elements can be described in a unified fashion as an evolution of a set of N classical or quantum oscillators. The difference between classical and quantum comes from the system's initial state. It affects the speed and probability of finding the final state (usually a solution to a problem). If the occupation numbers of oscillators are large and somewhat uncertain and interactions are weak, then the system evolves as an ensemble of classical fields with corresponding classical-field action. [162] This analogy is valid for any bosonic oscillators, including optical: atoms, polaritons, excitons, photons, etc. For instance, the density matrix of a completely disordered, weakly interacting Bose gas with large and somewhat uncertain occupation numbers is almost diagonal in the coherent-state representation. The initial state can be viewed as a statistical ensemble of coherent states. To the leading order, each coherent state evolves along its classical trajectory. The evolution leads to an explosive increase of occupation numbers in the low-energy region of wave number space where the ordering process takes place. [162] Even if the occupation numbers are of order unity in the initial state, so that the classical matter field description is not yet applicable, the evolution, which can be described at this stage by the standard Boltzmann quantum kinetic equation, inevitably results in the appearance of large occupation numbers in the low-energy region of the particle distribution. Therefore, one can switch from the kinetic equation to the matter field description for the long-wavelength component of the field at a particular moment of the evolution when the occupation numbers become appropriately large. The optical system can be described using a classical matter field when this happens. However, the quantum dynamics before this moment plays a crucial role. This fully quantum dynamics with entanglement and superposition of states allows for complete scanning of the high dimensional space of the system until the coherent state is found. After that, the system behaves classically. while this coherent state settles to a fixed point that is a solution to a problem. During the passage to the coherent state, the quantum effects should enhance the search for the optimal state and potentially lead to the quantum speed-up.

Associative Memory Model
In this section, we present the associative memory model as one of the NN models, which exploits the links with spin Hamiltonians. This correspondence implies that many physical systems with nontrivial (non-zero) interaction potentials can be used as computational devices.
The standard model of associative memory [85] uses a system of N binary neurons, with values ±1. A configuration of all the neurons is denoted by a vector i , i = 1, .., N. The model stores K memories, denoted by μ i , μ = 1, .., K, which are also binary. The model is defined by an energy function (or, further Lyapunov function), which is given by and a dynamical update rule that decreases the energy at every update. The fundamental problem is that when presented with a new pattern, the network should respond with a stored memory that most closely resembles the input. Many physical systems we considered in Section 4 can follow the gradient of this Lyapunov function, which automatically converts them into the ANN. The theory of Hebbian learning addressed the associative memory [163,164] and describes how to prescribe the coupling coefficients between the neurons J ij (usually normalized by the number of patterns K). Usually, J ij is taken as the sum of the outer products of the stored patterns. One can find more ways to define coupling coefficients in the associative memory, for example, pseudoinverse rule, Storkey learning rule, or others. There has been a lot of work investigating this model's capacity, defined as the maximal number of memories that the network can store and reliably retrieve. It has been demonstrated that in the case of random memories, this maximal value is of the order of K max ≈ 0.14N. [85,165,166] If one attempts to store more patterns, several neighboring memories in the configuration space will merge, which produces a global minimum of the energy (13), thus preventing recovery of the stored memories. It is possible to improve the capacity close to K max = N by modifying the Hamiltonian (13) in a way that removes second-order correlations between the stored memories. [167] The simple associative memory model (13) has many benefits. First, it is quadratic in variables, which means that the energy gradient is linear to these variables. Therefore, one can easily calculate the corresponding updates of the neurons that lower the energy function (13). The following consequence of this mathematical structure is that one can reproduce the energy function (13) together with required dynamical behavior using various physical hardware systems. To build an associative memory machine, one needs to connect the elements representing the analogue variables via nontrivial interaction potential proportional to the strength J ij and project the final stable state into the discrete domain to obtain the binary states of neurons. Furthermore, the model's simplicity allows one to easily modify and incorporate other extensions. Finally, the model's universality means it is possible to solve different tasks via associative memory by mapping between tasks; for example, the classification task can be reduced to pattern recognition/restoration.
Another well-known name for the associative memory model is the Hopfield NN, a form of recurrent ANN with binary threshold nodes. Moreover, Hopfield NN shares many other similarities with the physical spin-glass model and several combinatorial optimization tasks. For example, the Hopfield model is isomorphic to the Ising model of magnetism (for zero temperature), [168] which has been extensively analyzed in physical contexts. In combinatorial optimization, finding the ground state of the Ising model is NP-hard and can be related to the QUBO. Moreover, computing the statistical sum of the spin-glass has the same NP-hard complexity class, which was a significant obstacle in calculating its various thermodynamic quantities. Other examples of tasks are the Boolean satisfiability problem or SAT [169] and weighted MAX-2-SAT.
To fully define the associative memory model, one has to specify the dynamical update rule of the neurons. For instance, the update rule can describe the discrete state of neurons in discrete time steps: with the same notation used in 13. The continuous version has the form: where x i denotes the mean state of the i-th neuron that can get continuous values in the initially defined range, h i is a direct input or bias coefficient in case the Lyapunov function (13) has nonzero field, g is a monotone function that bounds the continuous states and converts them into the discrete in the final state of convergence, for example, makes the correspondence between the variables as i = sign(x j ), and is the characteristic time of the convergence to an optimal or suboptimal solution. The analogue computation with the NN can be described as an evolution of the vector-state variables in the high-dimensional continuous space. One can precisely trace it using Equation (15). The vital aspect of such a differential equation structure is an existence of a Lyapunov function. This Lyapunov function E behind the Hopfield NN can lead to the understanding of possible final states, which appear to be attractors of the system's dynamical behavior. For both models, one can realize the dynamical state update using a particular hardware system described previously. However, one should differentiate between different regimes that can be realized on the hardware level: the task of finding the ground state (the global minimum) of the model and pattern restoration (descending on the surface of the Lyapunov function toward its nearest minimum).
The explicit formula for the Lyapunov function in the discrete variant of the model with the non-zero field is: In the case of continuous variables Equation (15), the same function has a slightly different forms: where the last term appears due to the correspondence between the discrete and continuous state i = g(x i ). For g(x), one usually picks the g(x) = tanh(x∕ ) function, where the time-dependent parameter tends to zero during the evolution of the Hopfield NN forcing the last term of the Equation (17) to disappear. [170] The essential property of the dynamical update rules is that the energy decreases through the system evolution, which leads to the final stable patterns in the phase space. The classical Hopfield NN has many modifications for the Lyapunov function, variables update rules and other features. One version is known as modern Hopfield NNs. [151] Modern Hopfield networks with continuous states can be integrated into DL architectures because they are continuous and differentiable with respect to their parameters. Moreover, they retrieve patterns with just one update, conforming to DL layers. For these reasons, modern Hopfield networks can serve as specialized layers in deep networks to equip them with memories. Possible applications of Hopfield layers in deep network architectures find their way in multiple instance learning, defence against adversarial attacks, [171] processing of and learning with point sets, sequence analysis and time series prediction, storing and retrieving reference data, for example, the training data, outliers, high error data points, prototypes and many other purposes. [151] Even more importantly, the functionality of the modern Hopfield networks can be compared with various methods from the ML domain, such as SVMs, random forest, boosting, decision trees, Bayesian methods and many others. [172,173] As we mentioned above, many optical systems can perform optimization tasks. Since there are intrinsic similarities between this task and the associative memory model, one can exploit this relation to realize Hopfield NN using optical systems. Such realizations include previously discussed laser networks, Ising machines, photon [174] and polariton systems, [131] and confocal cavity QED NN, [175] see Figure 9. The connection between the optical networks and the Hopfield model is important since it allows one to incorporate such layers into more complex optical architectures without complicated adjustments.

Higher-Order Systems
One significant extension of the Hopfield model consists in incorporating the tensor terms, which depend on the i variables polynomially in k. [177] The such extension allows one to increase the number of stored patterns to K max = k N k−1 , where k is a numerical constant. Moreover, it is possible to observe the so-called "feature to prototype transition" when increasing k in the NN training. The prototype theory provides an alternative approach to learning in which objects are recognized as a whole. Although tensor terms are assumed not to be biologically plausible, [152] they can be reproduced on some artificial physical setups. [161] From this perspective, artificial tensor platforms can significantly benefit from such technological opportunities. The higher order Hopfield NNs [178] can be written as Ω is the set of indices that excluded index l, where x l are real continuous variables, g(x) is the threshold function and is the Figure 9. a) Four nodes with the all-to-all coupling and sign-changing connectivity between spin ensembles. Blue and red show ferromagnetic versus antiferromagnetic J ij links. One can find the physical details in. [176] b) The realization of the Hopfield NN by the spin ensemble. Binary neurons s i of a single-layer network are recurrently fed back and subjected to a linear transform J with the consequent element-wise threshold operation. c) The Hopfield model exhibits an energy landscape with many metastable states. Energy-minimizing dynamics drive similar spin configurations to the stored local minimum, characterized by the basin of attraction. Too many memories make the basins of attraction vanish. d) Schematic of the associative memory problem-recalling multiple stored patterns by completing distorted input images. Reproduced with permission. [176] Copyright 2021, American Physical Society.
scaling parameter that can depend on time. Such systems can solve HOBO, see Equation (2), because of the k-local coupling. It was shown [161] that polariton systems above the threshold are described by Equation (20) describes the feedback mechanism that drives all i = |Ψ i | 2 to a priori set values th , characterizes how fast i adjusts to changes in i . Next, we proceed with the different ways of connecting the practical computational tasks with the actual physical behavior of the presented systems.

Mathematical Formulation of Applications
This section considers a range of generic applications that follow from the network's ability to solve optimization problems or/and act as Hopfield networks. We start with the simple problems from classical computer science with the corresponding mapping to the QUBO or mixed-integer problems. We then move to modern tasks that differ in information capacity and are considered to suffer from the so-called "curse of dimensionality", where it is more suitable to work with the probability distributions instead of the individual variables. However, in both cases, we do not pay attention to whether the presented mapping is efficient (like in the following subsection) or not (when one needs multiple sequential operations with a considerable amount of pre and postprocessing in between). Some of the inefficient embeddings can still possess mathematical challenges and can be improved either in the general formulation or with task-specific information. At the end of this chapter, we discuss the NN architectures and their capabilities.

Direct Encoding/Decoding
This subsection describes the connections/correspondences between different computational tasks. [169,179,180] The propositional satisfiability problem (SAT) lies at the heart of such correspondence. It is a fundamental problem determining whether a set of sentences in propositional logic is satisfactory. A clause is built as the disjunction, the logical OR (denoted by ∨) of some Boolean variables or their negations. A set of several clauses, which must be satisfied simultaneously, is the conjunction, logical AND (denoted by ∧) of the clauses. One can write a satisfiability problem in the general form: where the x i , y i are "literals", any of the original variables or their negations. The form (21) is called a conjunctive normal form (CNF), and one can easily see that any logical statement between Boolean variables can be written as a CNF. SAT is the first problem that was proven to be NPcomplete. [169,179] Currently, no known algorithm efficiently solves each SAT instance. The question of its existence is equivalent to the famous P versus NP problem. Nevertheless, many heuristics SAT algorithms can solve problem instances involving a significant number of variables, sufficient for many applications. Additionally, many versions of the SAT problems exist, like 3-SAT and the generalization k-SAT, HORN-SAT, and XOR-SAT, which can better suit particular unconventional tasks.
One specific SAT version -weighted MAX-2-SAT allows one to easily reformulate the task as QUBO. A simple 2-SAT has m clauses of 2 literals each. A MAX-2-SAT is the problem of assigning values that maximize the number of satisfied clauses. Weighted MAX-SAT gives each clause a positive weight so that the measure of violating the cost appears in the problem. To reformulate a weighted MAX-2-SAT problem as a QUBO, one has to use the fact that maximizing the weight of satisfied clauses is equivalent to minimizing the weight of unsatisfied clauses, and using the logic x i ∨ x j = x i ∧ x j . The final form looks then: which is the QUBO that has the same form as Equation (1). Thus, the connection between the SAT (that can be easily converted into weighted MAX-2-SAT by use of the Boolean logic) and QUBO is revealed. We know the Ising formulations for many NP problems. [86] For example, one can find number partitioning, graph partitioning, clique existence, binary integer linear programming, exact cover, set packing (or maximal independent set), vertex cover, satisfiability (with the emphasis on 3SAT to maximal independent set reduction), set cover, knapsack with integer weights, graph coloring, Hamiltonian cycles and paths, travelling salesman problem, Steiner trees, feedback vertex set, feedback edge set, graph isomorphisms among the covered problems, as well as some useful tricks for the near-term quantum adiabatic optimization devices. We mention some of them in a slightly different form below.

Logistics
Logistic and planning problems are usually related to the wellknown travelling salesman problem (TSP). TSP is a well-known optimization problem in which the goal is to find the shortest possible route that visits a given set of N cities and returns to the starting city. In order to solve this problem, one approach is to use a spin matrix to represent the route. Spin matrix has elements x v,i ∈ {0, 1} indexed by the vertex number v and its order in the path. The weighted edges w uv ≥ 0 from the edges set E describe the cost of travelling between two cities. The Ising Hamiltonian can be written as The first two terms in the Hamiltonian ensure that each city is in the route and appears only once. The third term in the Hamiltonian ensures that any adjacent cities in the route are connected, while the fourth term minimizes the sum of weights of all cities in the route. By choosing reasonable values for the constants A and B, it is possible to ensure that only valid routes are explored.

Financial Applications
Optimizing the portfolio selection means finding the most optimal combination of investments for an institution or individual. One of the modern portfolio optimization problem formulations has the following form Ref. [181] min where N is the number of different assets, and x i is the decision variable representing the proportion of capital invested in asset i. Here coupling coefficient J ij represents the covariance between returns of assets i and j, μ i is the mean return of asset i, and ∈ [0, 1] is the risk aversion parameter. When = 0, the model maximizes the portfolio's mean return, and the optimal solution will be formed only by the assets with the greatest mean return. When = 1, only the total risk associated with the portfolio is minimized.
There are different modifications to the portfolio optimization problem. For instance, one can introduce bounding and cardinality constraints that specify that there should be K different assets in the portfolio or/and the portion of some assets should be within certain bounds. This is achieved by The cardinality-constrained mean-variance model is a mixed quadratic and integer programming problem in the NP-hard class of problems. It can be minimized if we follow the Hopfield dynamics. [182,183] The discrete dynamics becomes where G i is a sigmoid with values in [ i , i ]. When solving any optimization problem, constraints usually appear in the energy function. However, in many cases of Hopfield networks, this is not necessary. Constraints on x i are satisfied using a sigmoid's activation function since its outputs already lie inside the desired interval. To fulfil the cardinality constraints, we begin with 3K∕2 neurons. After getting a minimum for the objective function, we remove the asset with the smallest output and repeat this process until the network has precisely K assets. These remaining assets solve the original portfolio selection problem. To satisfy the constraint ∑ x i = 1, one can use various adjustments, for instance, to evaluate the feasibility of every portfolio and change the proportions of capital x i to be invested in each selected asset. [182] A variant of the Hopfield networks for solving mixed-integer programming was constructed to solve the financial transaction settlement problem and implemented in an opto-electronic hardware. [138] The method is based on the discretized version of the MEMs equation shown in Figure 8 with annealing of the parameters (t) and (t) in where g(x) once again caps the value of x between −1 and 1 (e.g., g(x) = tanh(x)).

Mixed-Integer and Box-Constrained Programming
The financial applications discussed in the previous section are not the only ones that utilize mixed-integer programming or boxconstrained continuous optimization formulations. In fact, there is a strong correlation between discrete combinatorial problems and continuous non-convex programming, and it is often advantageous to represent discrete combinatorial problems in a continuous formulation that is conducive to many efficient methods. An early example is the Motzkin-Straus continuous formulation of the NP-hard MaxClique problem. [184] The objective of the Max-Clique problem is to identify the largest complete subgraph of a given graph such that all pairs of vertices within that subgraph are connected by an edge. If J represents the graph adjacency matrix, then the MaxClique Ising Hamiltonian to be minimized is as follows: where s i ∈ {0, 1} and takes the value of 1 if i−th vertex belongs to the maximum subgraph. The first term represents the objective to have the largest subgraph possible and the second penalizes the lack of the edge connecting the vertices in the subgraph. Parameters is a Lagrange multiplier to enforce the constraint. This problem, however, can be formulated as a continuous box-constrained quadratic optimization problem as [184] where x i is real. The optimal solution has nonzero entries with x i = 1∕d, where d is the size of the optimal subgraph. Moreover, the mixed-integer or box-constrained optimization is prevalent in numerous real-life applications. For instance, it can be used in scheduling and resource allocation problems, where decisions must be made regarding the allocation of limited resources to various tasks. It can also be applied in transportation and logistics, where it can help optimize routes and reduce costs. It can be utilized in manufacturing and production planning to optimize production schedules and minimize costs. All of the optical platforms we considered above operate with continuous amplitudes, so it seems natural that they can natively encode and process these problems in analog fashion. Recently, there have been several proposals of solving these continuous problems.
The mixed-integer programming can be addressed by a set of programmable bosonic quantum field modes since the eigenspectrum of the bosonic number operators consists of nonnegative integers so they can naturally represent integer variables. [185] Boxconstrained quadratic programming problems can be solved by using modifications to the dynamics of CIM based on OPOs [186] or opto-electronic iterative machine. [138]

Phase Retrieval
The minimization of the XY model, which is used to solve the QCO problem, is closely related to the challenging phase retrieval problem. The objective of the phase retrieval problem is to recover a signal or image from the magnitude of its Fourier transform. [92][93][94] This problem arises because signal detectors are typically only able to record the modulus of the diffraction pattern, resulting in a loss of information about the phase of the optical wave. The task is to determine a complex vector x from the measurement real vector b = |Ax|, where the matrix A is complex-valued so that [187] min Here u is a complex vector of components with amplitude 1 such that (where the the Moore-Penrose inverse of a matrix A is denoted as A † ) we rewrite Equation (30) as minimization of the XY Hamiltonian where s i = (cos( i ), sin( i )).

Machine Learning
The exponential growth of data has surpassed our capacity to process it using human and computational resources. This has led to the development of data-driven methods and a shift from classical computer science paradigms to modern ML approaches. In ML, the focus is on predicting outcomes from given data, with the richness of data significantly influencing performance. Three crucial components in ML are data, features, and algorithms. Data can be collected in various ways and can be of great value depending on the context. Features represent the properties of considered objects and are essential for the success of ML approaches. However, determining and selecting the right features can be time-consuming. The choice of algorithm depends on the context and influences accuracy, speed, and computational complexity. These components are presented in order of their significance in the ML pipeline. The components were presented according to their significance in the ML pipeline. Simply saying, one can only extract useful information from a noisy but meaningful dataset. The following subsection starts the discussion with the simple classical algorithms, which are the basis of many existing applications. Then, we outline the central ideas behind the main ML methods that will be the centre of attention for transferring into the special-purpose hardware. At the end of this chapter, we cover the wide range of capabilities of the NNs.

Regression
Regression analysis is a statistical method that has been used for many years to estimate the relationships between a dependent variable and one or more independent variables. This technique allows for the modeling of the relationship between these variables, and can be used to make predictions about the dependent variable based on the values of the independent variables. One of the most commonly used forms of regression analysis is linear regression, which assumes that the relationship between the dependent and independent variables is linear. In linear regression, a line is fit to the data in such a way as to minimize the sum of squared errors between the observed values of the dependent variable and the predicted values based on the line. This model assumes that the dependent variables denoted by y i have a linear relationship depending on the m-vector of points {x i1 , … , with an addition of the disturbance terms i in each case. This relationship can be written in the following form: www.advancedsciencenews.com www.advquantumtech.com To shorten notation we use the matrix form y = X + where: , with x i0 = 1. The linear regression task is the estimation of the values of the regression coefficients j given the data points x ij and observables y i , so that the error term = y − X is minimized. One can use different metrics for that purpose, such as the sum of squared errors of i or others. The most common parameter estimation technique is called the least-squares estimation. Here, the optimum parameter is defined through the minimization of the sum of the mean squared loss which can be connected with the conventional QP. The optimal solution can be obtained by differentiating Equation (33) and equating it to zero with respect to parameters j . In matrix notation, the solution can be written as There exist different modifications of the proposed procedure: generalized least squares, where one introduces a certain degree of correlation between the residuals i (33), or the weighted least squares, where the knowledge of the variance of observations is incorporated as the coefficients w k before each of the residual. Moreover, intrinsically different techniques can be based on maximum likelihood estimation, Bayesian methods, or regularization.
A natural extension of linear regression is in replacing linear dependence with a polynomial. In the case of one argument, it is possible to rewrite Equation (32) as Given the data points x j i , the task is the same as Equation (33) Multiple linear regression is a generalization of linear regression with more than one independent variable. The basic model for multiple linear regression can be written in a similar form: where instead of variables x ij one has a set of matrix elements X ij of size k × k. Depending on the chosen norm for the matrix, it is possible to formulate the task of finding the regression coefficients. Taking the square Frobenius norm of the matrix, the search for optimal coefficients i is equivalent to solving Equa-tion (33), except for the additional sum over the k 2 matrix elements: This can be extended further for multivariate linear regression or combined with the nonlinear basis with minor consequences concerning the parameters search and hardware operations, except for the much more complicated procedure for preprocessing the coefficients for any modification. Regression can be considered the simplest form of supervised learning.

Classification
Classification is one of the popular tasks for ML. The purpose of classification is to sort the objects among the initially defined classes. The earliest algorithms include naive Bayes and decision trees. Here, we only consider Markov random field (MRF) encoding, which is the general case for such models.
The k-nearest neighbors algorithm is a non-parametric classification method used in statistics. [188,189] It aims to classify the objects by considering their k nearest neighbors with the defined class. The consequent attaching objects to a particular group is repeated until the convergence. We omit the explicit corresponding formulas because of their similarity with the k-means, the unsupervised clusterization algorithm, presented below. Both methods are usually based on Euclidean distances and can easily be transferred to special-purpose optimization hardware.
Another classification method is called a support vector machine (SVM). SVM is a supervised learning model that analyses data for classification purposes. It aims to construct a hyperplane between the classes of training data points in a high-dimensional space, emphasising a good separation achieved by maximising its margin. SVM was introduced in Ref. [190] and standardized in Ref. [191].
Linear SVM deals with the n points x in the m-dimensional space, where each point has been assigned a binary class y i = ±1. The task is to construct a hyperplane that divides these two groups with the maximum distance between them. The so-called "hard margin" scenario assumes that the initial data is linearly separable. One can start by constructing two parallel hyperplanes, separating groups of different classes with the largest distance between these two surfaces. The target surface between these hyperplanes is called the maximum margin hyperplane. To mathematically describe these surfaces, one can write: where w j are the components of the normal vector for both of the hyperplanes, x i j are m-dimensional coordinates of the vector with the serial number i, b defines the surface shift concerning the zero coordinates and ±1 defines the class. Everything above y = 1 belongs to one class, and everything below y = −1 belongs to another. The offset of the hyperplane is determined by b∕‖w‖, while the marginal distance equals 2∕‖w‖. To maximize the marginal distance, one has to minimize the norm of ‖w‖ and hence its square ‖w‖ 2 . This task can be reformulated as the optimization problem, adding the constraints that prevent data points from being positioned into the margin min‖w‖ The natural extension of SVM is in considering a so-called "soft margin" case. It is assumed that the given data points are not linearly separable. In this case, one has to introduce a new kind of variable i = max(0, 1 − y i (w T x i − b)) for each point i, which is usually referred to as the hinge loss function, playing a regularizer role. Thus, it is possible to rewrite Equation (39) as where the constant C regulates the interplay between the pure hard margin classifier and the soft margin one. We can reformulate the problem using the Lagrangian duality: where the norm vector w is expressed through the new variables a i , so that w = ∑ n i=1 a i y i x i , and the initial task of determining the offset of the surface is expressed via b = w T x i − y i . Thus, it is possible to obtain the problem, which has an exact QP formulation. This problem can be solved with the standard quadratic algorithms, thus, can be solved using special-purpose hardware.
It is helpful to mention the nonlinear extension of the SVM, which solves nonlinear classification task and can exploit the different functional forms of kernels. One can modify the scalar dot product in the quadratic form Equation (41) by a different kernel function k(x i , x j ) depending on the properties of the analogue hardware.

Finding the Principal Eigenvector
Finding the principal (dominant) eigenvector of a given matrix J belongs to the ℙ-class of problems. However, finding such a dominant eigenvector on an ever-growing large matrix becomes a computationally intensive task incompatible with Moore's law. At the same time, a range of real-life problems would benefit from fast calculation of the principal eigenvector. The PageRank algorithm [192,193] is a well-known method for evaluating the relative importance of web pages based on the structure of the links between them. The web network is represented as a directed graph, where each page is a node and each hyperlink is an edge connecting one page to another. The PageRank algorithm computes a single score vector, known as the PageRank, for the entire database of web pages. The key assumption underlying this algorithm is that pages transfer importance to other pages via links, and thus the components of the PageRank vector determine the importance of each page. Mathematically, finding the PageRank vector is equivalent to calculating the principal eigenvector of the link-structure matrix, also known as the Google matrix. Calculating the principal eigenvector is also required in other fields such as social network analysis, bibliometrics, recommendation systems, DNA sequencing, bioinformatics, and distributed computing systems. [194][195][196] There are numerous applications of PageRank to chemistry and engineering sciences networks to investigate and analyse complex systems. As systems grow in size and complexity, the interactions between their networks and subnetworks can become increasingly complicated and difficult to track. In order to organize and study these complexities, network analysis methods such as PageRank can be used. These methods provide a way to analyze the structure of the network and identify important nodes or relationships within it. [195] For instance, MonitorRank diagnoses root causes of issues in a modern distributed system: error logs and tracing debugging information. [197] PageRank has been used for road and urban space networks, which help predict traffic flow and human movement. It was shown that PageRank is the best network measure in predicting traffic on individual roads. [198] Recent research has shown that optical systems can provide significant advantages for calculating the principal eigenvector. [196] By choosing appropriate control parameters for these optical systems, the steady state of optical networks can be used to solve an eigenvalue maximization problem. [199] This results in the identification of the energy state dictated by the signs of the eigenvector corresponding to the largest eigenvalue of the interaction matrix, that is, the principal eigenvector. In particular, estimates presented in Ref. [196] suggest that specialpurpose optical machines for PageRank calculations may offer dramatic improvements in power consumption compared to classical computing architectures.

Dimensionality Reduction
Dimensionality reduction involves the transformation of data from the space with many dimensions into a low-dimensional space, usually preserving meaningful and valuable properties from the original data. It isn't easy to handle high-dimensional data in practice due to the growth of the space volume. Dimensionality reduction is standard in data-intensive fields. It can be used in signal processing, neuroinformatics, and bioinformatics. [200,201] One can find its applications in recommender systems, [202] semantic search [203] or as a primary tool in many domains involving numerical analysis.
One of the well-known methods for dimensionality reduction is the principal component analysis (PCA). [204] The idea behind PCA is to approximate particular data with linear manifolds of lower dimensions. PCA can be alternatively interpreted as finding subspaces of lower dimension in the orthogonal projection on which the data variation is maximum.
The initial task behind the PCA is to find the best approximation of the data points using lines and surfaces. Given the set of www.advancedsciencenews.com www.advquantumtech.com vectors x 1 , x 2 , … , x m ∈ ℝ n , the aim is at finding the sequence of k k-dimensional affine spaces L k ⊂ ℝ n that find for each k, where d(x i , L k ) is the Euclidean distance from the point x i to the L k . Affine spaces L k are defined as the sets of linear combinations L k = {a 0 + 1 a 1 + ⋯ + k a k } with coefficients i ∈ ℝ, while the vectors {a 1 , a 2 , … , a k } ⊂ ℝ n form orthonormal basis in ℝ n . Equation (42) is an optimization problem. The initial vector a 0 is simply defined as the solution to The next component is found iteratively by subtracting the projection x i = x i − a 0 (a T 0 x i ) (with the scalar product a T 0 x i ) for the vectors corresponding to L j : The iterations continue until the number of the affine space k reaches the n − 1 of the initial problem space dimension. Using the identity ||x i − a j (a T j x i )|| 2 = ||x i || 2 − (a T j x i ) 2 , one can easily map this task into the QP in a i variables with the normalization constraints and the coupling matrix J ij = −x i x j . To shorten the presented notation, the iterative procedure can be written similarly to maximization taskŝ where k is the number of principal component, X is the data matrix of size n × m, w s = (w 1 , … , w m ) (s) are the weight coefficients. If the sequential operation is limited on the specific hardware system, one can still use the first iteration of the PCA method to obtain the largest eigenvalues of a matrix. One can find many alternative formulations of the PCA task, such as cancelling correlations between coordinates, that is, covariance matrix diagonalization or singular value decomposition. Singular value decomposition (SVD) is a special form of a rectangular matrix decomposition in the form where U is the unitary matrix (representing the rotation as the linear transformation of the space in the geometrical interpretation), Σ is the rectangular diagonal matrix with non-negative real numbers on the diagonal (which are called the singular values, the action of the matrix has the interpretation of the corresponding scaling by diagonal elements) and V ⊤ is another unitary matrix (with the same additional rotation interpretation). SVD is essentially vital in the standard techniques of the latent semantic analysis (LSA), [205,206] which purpose is to process documents and detect the relationship between libraries and terms.
There is a direct correspondence between PCA and SVD decomposition. To perform the PCA, one has to find the eigenvectors of the covariance matrix XX ⊤ (without the appropriate scaling factor 1 n−1 ). The covariance matrix is diagonalizable, and with the normalized eigenvectors, one can write Applying SVD to the same data matrix X gives which gives Using this correspondence, one can perform the SVD decomposition as PCA on the special-purpose hardware.

Clusterization
The most detailed description of clusterization is the separation of the objects on a specific basis. The goal can be defined as a classification without any prior information about the classes. The machine can set the number of clusters in advance or define them automatically. The algorithm determines objects' similarity by their marked features and puts the objects with many similar features in the same class. There are successful applications of clusterization in market analysis (consumer analytics), image compressing, data analytics, and anomaly detection. K-means clustering is a clustering method that aims to partition n observations into k clusters. Each of these observations is located in the cluster with the nearest mean, also called a centroid. [207][208][209] There are heuristic algorithms that deal with such an assignment; however, the problem is NP-hard.
Given a set of observations {x 1 , … , x n } in a d-dimensional space k-means algorithm aims to partition these observations into k sets {S 1 , S 2 , … , S k } to minimize the within-cluster sum of squares (or variance): where S i is the mean of points in the set S i . One usually uses an iterative technique consisting of two steps to perform such an optimization task. Given an initial set of k means m 1 1 , … , m 1 k , the first step is to assign each observation to the cluster with the nearest mean, according to the Euclidean distance. The next step is to recalculate the centroids: m t+1 Finally, the loop is run until the convergence. The algorithm uses the assigning of objects to the nearest cluster by Euclidean distance, and it is a Adv. Quantum Technol. 2023, 6, 2300055 www.advancedsciencenews.com www.advquantumtech.com suitable method for transferring its sequential operations to the specific hardware.
Mean shift is a high-dimensional-space analysis method for locating the maximum density function given a discrete number of data sampled from this arbitrary density function. It is helpful in complex hierarchical algorithms and is used in different computer vision or image processing domains.
Given data points x i in n-dimensional space, one can use the kernel function k(r), acting on the norm value r, to determine the mean shift's value. The kernel function has to be non-negative, non-increasing and continuous. One can use the flat kernel so that k(r) = 1 if r < r 0 and 0 outside. Each iteration consists of calculating the function where there are states. The maximum of F(x) is computed using the square norm.

Neural Networks
ANNs are often associated with ML. We considered the associative memory model, a simple recurrent shallow NN in the Subsection 4.6. This model can be extended to higher-order systems, simultaneously gaining many useful properties. However, optical systems are not tied only to this type of architecture. [10] Any NN can be defined as a set of neurons and connections between them. An artificial neuron's task is to take input numbers, process them in a certain way (executing a special function), and output the results. The standard mathematical transformation of one NN layer can be written as ( ∑ N i=0 w i x i + b), where w i denote the weights for the input data points x i (or independent variables), and the constant b is the shift called bias. Here, the is a nonlinear activation function. A single-layer NN that performs a similar transformation and produces a single output number is called a perceptron. The perceptrons, assembled into multilayered structures, are called multilayer perceptrons. The introduction to the NNs is presented in Refs. [210,211,212] with more modern work [213] and the latest results after the DL breakthrough. [214] The activation function plays an essential role in the NN design because the output signal would be simply a linear function in its absence. There are many functional activation functions, such as binary step function, sigmoid (or logistic function), hyperbolic tangent, etc. They allow a NN to map an input to the output appropriately. Thus, NN is considered a universal function approximator. [215] To choose the NN weights, one usually uses the backpropagation procedure, [216,217] although there are many alternatives. Backpropagation consists of tuning the NN weights according to the difference between the actual output value of the network and the predicted one, with the final goal of minimizing this discrepancy or the cost function. The tuning procedure involves computing the total discrepancy gradients on each layer, starting from the final one and updating the corresponding weight values. Through the extensive number of such iterations, there is a chance that the weights will be tuned in the desired way.
Many deep NN (NN with many layers) can be mapped into a shallow one with a significant overhead on the number of neurons in the standard layer. That means that any deep NN functionality can, in principle, be performed on a physical device suitable to a shallow architecture. With an appropriate mapping, both networks will have the same approximation qualities. [218][219][220][221] A well-trained NN can approximate many complicated algorithms, some of which are presented in this review. However, one has to provide enough input conditions and good output answers, especially when the problem is of high computational complexity. In addition, however, the resulting correlations need to be better understood. The valuable properties of the NNs go far beyond the optimization domain. We will consider some of them below.

Neural Networks and Dynamical Systems
Using ML models in the domain of physical sciences, that is, incorporating physical laws and domain knowledge into neural architectures, is called physics-informed machine learning (PIML). It provides a powerful approach to modeling different physical phenomena. This rapidly growing field can pursue many other goals. Among them are constructing better predictive models with high accuracy and reliable generalization abilities, increasing data processing rate, accelerating the dynamical processes through optimized architecture, and solving inverse problems with interpretable models. One should expect that emulating complex nonlinear dynamics should benefit from the PIML. This can be seen in the applications in weather forecasting, [222] modeling of turbulence, [223,224] nonlinear dynamics, [225,226] applications of the ML to the Koopman operator theory. [227] Optical hardware can be potentially used to speed up these applications.
The correspondence between NN architectures and dynamical systems is straightforward. Some of the NN can be viewed as discretizations of dynamical systems, which is true in reverse order-one can design NNs to have specific properties, such as invertibility. [228] This correspondence can broaden the applicability of the potential optical hardware. Their connection with dynamical systems and DL can be found in Ref. [229]. The generalization of the optimization algorithms inspired by different optical systems has canonical universality property. [230]

Probabilistic Graphical Models
Graphical models provide a natural tool for dealing with uncertainty and complexity in a wide range of tasks and establish a powerful framework for representing and analyzing complex data structures. The graph theoretic approach used in these models offers an interface for modeling data and designing efficient general-purpose algorithms. Many models used in disparate fields such as statistics, systems engineering, information theory, and pattern recognition can be considered special cases of the general graphical model formalism. Graphical models are particularly useful for representing joint probability distributions and performing inference based on observed data. [231][232][233] Probabilistic graphical models (PGMs) are graphs with the nodes represented by random variables, while edges connecting them represent conditional independence assumptions.
Probabilistic graphical models (PGMs) provide a compact representation of joint probability distributions. There are two main types of graphical models: undirected models, also known as MRFs, which are widely used in the physics and vision communities, and directed models, also known as Bayesian networks (BNs), belief networks, or causal models, which are more popular with the AI and ML communities. [231] The spin Hamiltonians are particularly useful for PGMs. We recall the Ising spin model of Equation (1). Each spin variable s i can be treated as a random binary variable so that their coupling strengths serve as the connections between random variables. Certain configurations of spin variables X = (x 1 , … , x N ) ∈ {−1, +1} N is called an assignment. The probability of an assignment in the PGM is given by where is the so-called partition function.
There are several quantities of interest in the PGMs. First, it is the inference task -the computation of the quantity Z given by Equation (54). The exact inference is the computation of Z with all possible assignments, which is a hard problem for an arbitrary graph. The running time of the exact algorithms of finding Z is exponential in the size of the largest cluster of corresponding graph nodes. There are rare cases of the Ising model graphs when it is possible to compute its partition function in polynomial time, but the problem of computing Z is generally hard. Hence, the approximate inference is widely used. Other quantities of interest can include finding the low-energy states (low energy sampling), worst margin violators, constituents of partition functions-assignment likelihood and marginal probabilities and certain moments concerning the partition function and the target value.
Some popular approximate inference methods include sampling (Monte Carlo), variational methods and message-passing algorithms. [231] Since many optical spin machines are not flexible in terms of programmability compared to conventional computers, one can hardly exploit sophisticated methods in hardware operations. That is why the sampling procedure is the most promising application from the hardware perspective, especially for inference tasks. Expanding the spin machines' functionality is a promising direction, given the speed and energy efficiency of the optical efficiency domain.
The physical system often realizes the symmetric coupling coefficients, making the model undirected. Using the systemspecific devices that redirect light, it is possible to introduce the asymmetry in the variable connections, which opens the path to the directional PGM. In addition to the universality concept, one can see many practical tasks encoded into the Ising model (such as portfolio optimization) as special cases of the PGMs. Moreover, the hardware's ability to realize the high-order interaction terms allows one to encode complicated conditional dependencies with little or no overhead on the number of variables.
However, the application scope of optical machines aimed at simulating PGMs is far beyond the scope of this problem. One can encode complicated large graphs with many factors, representing large-scale practical problems and efficiently use them as supporting decision-making networks. There are also applications in control theory and game theory. For example, PGM can compactly model joint probability distributions using sparse graphs to reflect conditional independence relationships in complex systems. It is possible to decompose similarly multi-attribute cost functions (or utility functions). For instance, let the general cost function be a sum of local cost functions. Each local term has parental nodes (random variables or factors), which it depends upon. Moreover, some of the utility nodes will also have action (control) nodes similar to parent nodes because they depend on the state of the environment and the performed actions. The resulting graph is called an influence diagram. Using such a diagram, one can perform sampling, similar to the inference task, and compute the optimal (sequence of) action(s) to maximize or minimize the cost function. [231,234] The application of the same strategy was used in multi-person game theory. [235] In such a way, one can exploit optical spin machines to investigate dynamical systems and decision policy on factor graphs. There are many more applications of such correspondence between spin system functionality, control theory, and decision-making. The advantages of optical systems will benefit large complex graphs with complex connections between units. [231] Exploring the functionality of optical machines with respect to different paradigms is a promising research direction.

Image Processing
Several problems in computer vision can be formulated as binary quadratic programs, a particular case of QUBO. One can also see the similarity with PGMs. The conventional approach to such problems is to use the semidefinite relaxation technique, which appeared to be quite efficient. [236] The problems discussed include image co-segmentation, image segmentation with different constraints, graph matching, image deconvolution, graph bisection, and others. The computational complexity of these problems is high, which makes it necessary to propose an improved version of the semidefinite programming approach, which is more efficient and scalable. Some of these formulations are listed with little corresponding details, and we refer the reader to the original work. [237] min x∈{−1,+1} N x ⊤ Ax is the image co-segmentation task with the matrix A, [237] s is the number of images, n i is the number of pixels for i-th image, and n = ∑ s i=1 n i .t i ∈ {0, 1} n is the indicator vector for the i-th image, ∈ (0, 1] is a parameter.

Regression min
Continuous variables, partial mapping, straightforward operation, inefficient concerning the variables mapping.
Continuous variables, partial mapping, straightforward operation, inefficient concerning the variabes mapping.

k-means min
Continuous variables, partial mapping, consequent operation, inefficient variabes mapping and operation setup.
Graph bisection min Discrete variables, partial mapping due to the constraints, straightforward operation, inefficient representation of the constraints.
Image co-segmentation Discrete variables, partial mapping due to the constraints, straightforward operation, overhead on auxiliary variables, inefficient concerning the constraints and overhead.
is the image deconvolution task, where K is the convolution matrix corresponding to the blurring kernel k, S denotes the smoothness cost, x and q represent the input image and the blurred image respectively. [236] min x∈{−1,1} n − x ⊤ Wx, is the graph bisection task with W ij = exp(−d 2 ij ∕ 2 ), if (i, j) ∈ ; and 0 otherwise, where d ij denotes the Euclidean distance between i and j. These tasks can potentially be mapped into the special-purpose hardware dealing with quadratic assignments or low-level programmable tasks.

Several Examples of Hardware Embeddings
Here we consider the hardware representation of several tasks we considered previously. We characterize each assignment stat-ing its possible embedding on the particular hardware type -spin machines and characterize several parameters of such embedding. We consider discrete and continuous variables (the latter can require additional overhead on the number of discrete operational units), direct mapping of the problem coefficients or partial concerning other factors, whether the hardware requires the consequent manner of operations or not, incorporating additional constraints into the coefficients of the problem and other details. Overall, these factors determine whether the possible embeddings are efficient or not (significant overhead, consequent operations, etc.). We present the examples in Table 1.
A significant part of the current review is devoted to the description of the reductionism approach to the optimization problem, where one has to transform the target assignment into the known formulation (for example Ising model) or combinations of conditions (see graphical models) and then to map it into the particular hardware to get a proper solution. Although many presented approaches are similar, we are not restricted to covering only them. To support this statement, we present a few highcomplexity realizations of specific computational problems.
Among them, we can highlight the recent demonstration of natural language processing on a photonic processor, [239] see Figure 11. The significant advance is achieving capacity exceeding 1.5 × 10 10 optical nodes, which enables large-scale applications. In another work, [240] authors realized an optical NN to simulate inference at an optical energy consumption of 2.7 aJ per MAC for computer vision model Resnet50 (Residual Network) and 1.6 aJ per MAC for natural language processing model BERT (Bidirectional Encoder Representations from Transformers) with little accuracy degradation. The third example [241] demonstrated a realization of a complicated generative network on the basis of a photonic computing core consisting of an array of programmable phase-change metasurface mode converters. The corresponding scheme can be found in Figure 10. Overall, one can see more and more sophisticated architectures being reproduced using optical platforms.

Spin Glass Simulators
The inherent relationship between spin glass models and optimization tasks has been a longstanding subject of research. Considerable attention has been devoted to examining the complexity of both paradigms. In this regard, simulating a realistic glass model can be approached as a computational problem. Hence, it is necessary to mention several works on these simulations using optical platforms. A concise overview of the measurements of various parameters of artificial disordered systems through optical setups, focusing on overlap distributions and replica symmetrybreaking realizations, can be found in Ref. [242]. Additionally, the chapter includes the construction of a statistical model of light modes dynamics in a random laser with a new equivalent for the overlap distribution. One of the first experimental measurements of overlap distributions in a random laser system was reported in Ref. [243]. The subsequent work by Basak et al. in 2016 [244] describes the observation of strong intensity fluctuations in standard ordered cavities, with subsequent analyses highlighting the replica symmetry breaking phenomena. Another example is the work by Moura et al. in Ref. [245], where the intensity fluctuation overlap distribution is measured in the spontaneous modelocking regime of a multimode Q-switched Nd:YAG laser. Research on spin glass simulators goes hand in hand with the optimization domain, motivating the search for unconventional hardware that is the primary focus of this review. This results in the realization of optical computing platforms that can address spin-glass problems on a large scale, such as the platform based on spatial light modulation and multiple light scattering. [246]

Main Directions of Technological Development in Optical Computing
The demand for computational resources is gaining momentum due to their use in many practical applications. This trend is supported by a growing industrial interest from prominent IT companies (Microsoft, Google, IBM, Amazon, etc.) and fast-growing start-ups. To get a better global picture, one must understand the current paradigm of conventional heavy calculations and what advantages the optical machines can offer. First, we present several key metrics of the standard approaches and then show the Figure 11. 3D PELM for language processing. a) The text database entry is a paragraph of variable length. Text pre-processing: a sparse representation of the input paragraph is mapped into a Hadamard matrix with phase values in 0, . b) The mask is encoded into the optical wavefront by a phase-only SLM. Free-space propagation of the optical field maps the input data into a 3D intensity distribution (speckle-like volume). c) Sampling the propagating laser beam in multiple far-field planes enables upscaling the feature space. d) The example shows a binary text classification problem for large-scale rating. Reproduced with permission. [239] Copyright 2022, Optica Publishing Group.
benefits of optical devices. After that, we describe the strategies pursued in photonic neuromorphic computing.

Performance of Information Processing Systems
In general, Moore's law concerns several metrics. All of them are reaching saturation but at a different paces. To maintain the same effectiveness of the hardware, new technology is required. However, the more significant demand for superior hardware is caused by the explosive growth of AI applications, which puts much more pressure on research and development performance. For example, the need for computational capabilities has increased by more than five orders of magnitude from 2012 to 2018 because of the AI developments shown in the OpenAI report. [8] Several key metrics characterize the performance of information processing systems. We will use MAC (multiply-accumulate operation containing one multiplication and one addition) and FLOP (floating-point operation). The relationship between them is that 1 MAC counts as 2 FLOP. One usually uses MACs and FLOPs to measure the speed performance of the device, which depends on the frequency or the characteristic intrinsic operation time on the hardware. Alternatively, one can use the operations per second (OPS), be it conventional mathematical operation or hardware state switching, but this notation is rarely used. Another important metric is energy consumption or efficiency, which can be measured in FLOPs W -1 (FLOPs per watt). One can consider alternative metrics, such as the total training energy in joules in the case of the training NN or J per spike in the operations performed on the spiking NN (SNN) architecture. Many combined metrics and their variations exist, such as speed per area (OPS per mm), that are used to describe some other energy characteristics. Other important parameters of the hardware setup may include the analogue level of noise, scalability properties, specific architecture parameters, etc.
Data centres that use thousands of CPUs and hundreds of graphics processing units (GPUs) consume megawatts of power. Despite the versatility of conventional computers, their characteristics are not enough to achieve high performance in the key metrics. Thus, application-specific hardware that differs in architecture and logic reduces this gap between the desired efficiency and computer capabilities. One can find several discussions of these devices in Refs. [196,247] with the corresponding comparison of the key metrics; see also Figure 12. In addition, we mention some of these electronic devices as reference points for comparison with optical devices.
Classical computing architectures can differ in the details within one type of device. However, it is common to characterize them using two key metrics-the processing power or the computing speed using FLOPs and the energy efficiency. One can use the ratio of the FLOPs to the power consumption in watts (W) to get the energy efficiency metrics. [196] The standard estimate of the modern CPU efficiency is 2 TFLOPs, while the power efficiency is about 10 GFLOPs W -1 . Therefore, we can use the Intel Xeon processor as one of the top devices in terms of efficiency for working with the doubleprecision format, which has 4.8 TFLOPs and 29 GFLOPs W -1 . [248] GPUs are the advanced specialized electronic architectures and workhorses of the current ML tasks in real applications because of the parallel computing options. Most of the GPUs operate at near 0.3 kW power consumption with the range of 0.5-7 TFLOPs and corresponding 1.6-23 GFLOPs W -1 energy efficiency for the work with the double-precision format.
Another type of classical hardware is powerful non-distributed computer systems that are not so energy efficient but have enormous computing power. The top 10 list starts with NVIDIA DGX Figure 12. Left panel: computing power and energy efficiency of different types of computing hardware. The schematic distribution of the processing power versus energy efficiency is shown for several CPUs, GPUs, FPGAs, supercomputers, and potential unconventional computing devices based on optical systems. Reproduced with permission. [196] Right panel: energy efficiency values versus computing speed per area for spike-event hardware compared with results described in the literature. [247] Reproduced with permission from corresponding preprints.
SuperPOD with 2356 TFLOPs and nearly 26.2 GFLOPs W -1 . The most powerful supercomputer in processing power is Fujitsu's Fugaku, with 442000 TFLOPs and 14.8 GFLOP J -1 . One can link several devices into the powerful distributed system to achieve much higher processing power with additional energy costs.
Another class of electronic devices can be named "dedicated hardware". Although the GPU is not usually attributed to this class, it performs a similar role. A good example is the FPGA, an integrated circuit that can be configured by a customer using a hardware description language. On average, FPGA can achieve 10 TFLOPs with near 50 GFLOPs W -1 energy consumption rate and several times more (x 5, x 7) by working with lower precision numbers. [249] Another example of dedicated hardware is Google's Tensor processing unit (TPU). TPUs are customdeveloped application-specific integrated circuits (ASICs) to accelerate ML workloads. The efficiency can be estimated as 90 TFLOPs, and 400 GFLOPs W -1 . [250] Further improvements in electronic special-purpose devices are expected to come from analogue architectures based on memristors, [251] non-volatile memories, compact low-voltage field-effect transistors and engineering of heterostructures of 2D materials taking into account the quantum effects. Another option is to explore the different architecture of the dedicated hardware. For example, IBM claims to achieve 176 000 times better energy efficiency with their bio-inspired neuromorphic chip TrueNorth chip than the conventional general-purpose Intel i7 system for specific applications. [252] Nevertheless, TrueNorth has a relatively slow frequency rate of 1 kHz and an approximate energy efficiency of 2.3 pJ per bit. Moreover, it requires additional connections for the incoming neural spikes. One can further explore Intel's Loihi [253] or NeuroGrid [254] devices, which are close to the modern GPU. [255] Despite impressive and innovative developments, more than the presented classical architectures are needed to satisfy the need. For example, some estimates on demand from future autonomous vehicles require the information processing at 100 TOPS rate with the energy consumption of less than 100 Watt with the additional low latency. [256]

Optical Energy Consumption
Optical devices can process information instantaneously. Additional advantages include negligible energy consumption and heat generation. State-of-the-art for CPUs and GPUs metrics can be converted into 20 pJ per MAC. [257] The dedicated hardware and application-specific circuits can achieve 1 pJ per MAC with reduced precision of the calculations. [258] The same so-called "ideal" benchmark is supported by Ref., [51] where authors used a programmable nanophotonic processor with a cascaded array of 56 programmable MZIs in a silicon photonic integrated circuit to perform the vowel recognition task. Modern AI chips can reach the 100 mW per GOPS operation power per second, but the future competitive requirements should be ≈ 10 −1 mW per GOPS. [8] We can consider several examples of photonic hardware and highlight their technical characteristics, such as speed and energy consumption. The photonic accelerator architecture based on coherent detection [60] enables a new class of ultra-low-energy processors operating at very low (sub-aJ) energies for MAC operation. These structures can be reprogrammed and trained on the fly and have good scalability of up to one million elements. Additionally, [60] discusses the "standard quantum limit" for optical NNs that can be bounded with 50 zJ per MAC values for irreversible digital computation.
Optical NNs can achieve accurate results with extremely low optical energies. [259] It was shown experimentally that optical NN with dot product calculated optically achieved high accuracy on the MNIST digits classification using few photons (of the order 10 −19 J of optical energy) per weight multiplication. The essential idea was to reduce the noise from accumulating scalar multiplications in dot-product sums.
Some optical machines can use pre-optimized mathematical structures for architectural benefits. A good example is energyefficient, high-throughput, and compact tensorised optical NN exploiting the tensor-train decomposition. [260] Such a NN can improve the energy efficiency by a factor of 1.4 × 104 compared with digital electronics ANN hardware and by a factor of 2.9 × 102 www.advancedsciencenews.com www.advquantumtech.com compared with silicon photonic technologies. Moreover, it was possible to achieve better energy efficiency with fewer elements for footprint-energy efficiency calculation. [260] In general, neuromorphic photonic systems potentially offer petaMAC per second per mm 2 processing speeds [63] and attojoule per MAC energy efficiencies. [64] Energy consumption is closely related to the physical properties of the neural architecture. For example, event-driven spiking neural networks (SNNs) outperform ANNs in energy efficiency. The dynamic of many models can be described using the universal Izhikevich model. [255] Event-driving neuromorphic computing overcomes traditional von-Neumann architectures' limitations but has several specific problems with the throughput, scalability, training methods, etc.
The successful implementation of an optoelectronic spiking neuron inspired by the Izhikevich model was reported in Ref. [247]. A nanoscale optoelectronic neuron with 200 aJ per spike input can trigger the output from on-chip nanolasers with 10 fJ per spike. This neuron can support a fanout of ≈ 80 or overcome 19 dB excess optical loss while running at 10 GSpikes s -1 in the NN. Such a scheme corresponds to 100 throughput and 1000 times energy-efficiency improvement compared to state-of-art electrical neuromorphic hardware such as Loihi and NeuroGrid. [247] The hybrid systems of quasiparticles can be another potential platform for spiking architectures. Exciton-polaritons can achieve 1 pJ per spike with 100 ps timescale. [261,262]

Evaluation of Speed
A universal optical computer was not a viable option to compete with classical computers. Instead, a specified optical computer or optical block as a part of a hybrid classical/nonclassical architecture has become a focus of recent research. One of the first realizations of simple mathematical operations, such as a free-space fan-in/out vector-matrix multiplication, was introduced by Goodman in 1978. [263] It is the essential linear algebra operation, where the input vector is loaded into an array of light sources, and the multiplication matrix is encoded into the SLM. The light propagation is analogous to broadcasting the initial vector into SLM, which performs element-wise multiplication, after which the lens gathers all the beams in the horizontal direction and summates the intensities. One can evaluate the performance of this device as N 2 MAC for one multiplication of the vector with N elements and a square matrix N 2 . However, the effective performance is limited by the system's frequency f , mainly of the SLM, resulting in fN 2 MACs; see also. [264] Nevertheless, using 256-length input vector and 125 MHz frequency rate, the device's performance can reach impressive ≈ 8 TMACs.
Other schemes based on different forms of the free-space matrix-vector multiplication can reach similar values. In 2020, Lightmatter presented an optoelectrical hybrid chip "Mars" with 0.4 − 4 TMACs depending on the frequency of weights. [265] A massively parallel convolution of 16 x 16 "tensor core" scheme based on crossbar architecture has been built on a chip with 13 GHz modulation speed for the inputs, and approximate 2 TMACs. [266] Another scheme based on the electro-optical Mach-Zehnder modulators represents a universal optical vector convolutional accelerator and achieves more than ten TOPS speed, with a consequent successful use as an optical convolutional NN in facial and handwritten digit images recognition. [267] Most of the photonic hardware with the feed-forward architecture can operate at high (GHz) speeds and usually have good scalability characteristics. [60,259,260] Another critical factor affecting the optical device's speed performance is the hardware's architecture. From this perspective, RC might improve many aspects of optical computing devices. One could expect several orders of magnitude speed-up compared to the typical ANN structure. RC optoelectronic/optical implementations are usually divided into spatially distributed and time-delayed. [82] The RC scheme on a silicon photonic chip with optical waveguides, splitters and optical combiners can achieve the data processing rate of 0.12 and up to 12.5 Gbit s -1 . [268] Moreover, more exotic physical systems, such as exciton-polaritons, can reach similar performance so that the SNN architectures can achieve the characteristic operation time of the order of 100 ps with the energy efficiency of 1 pJ per spike [261,262] Optic-based spin machines also enjoy competitive speed characteristics. CIM evolved from having just 4 spins and 12 connections in 2014 (Stanford) to 16K spins and 256M connections in 2021. The 2000-node version achieves semidefinite relaxation minimum of a cost function in 0.1 ms and further improves the solution. [269] The new generation of CIMs based on Thin-Film LiNbO3 (TFLM) photonic circuits will be released in 2022. It will feature an OPO network with ≈ μW pump power, ≈ fs pulse duration, 100 GHz-1 THz clock frequency and the synchronized operation of multiple CIMs on a chip.
Exciton-polaritons possess even better ultrafast timescales. For example, the polariton graph simulator [131] is easily scalable to 10K elements and shows ≈ 100 ps operational times respectively, while the degenerate lasers [95] system have ≈ μs characteristic timescale. However, all-to-all controllable couplings have yet to be experimentally implemented.

Other Important Properties
Other essential factors are undoubtedly affecting optical devices' attractiveness and performance, such as intrinsic noise and analogue accuracy of the hardware. For example, the recognition results using MNIST handwritten digits can show different accuracy on different devices, which can be a good measure of how well a particular NN is adjusted to a specific task. [247] The comprehensive analysis of the error sources and their classification for the electro-optical device can be found in Ref. [8].
The essential part of the hardware is its structure/architecture. It affects many other properties of the optical devices, be it the accuracy, scalability, the potential for future optimization, etc. The interplay between the hardware's electronic and photonic components depends on the architecture. It directly affects optical/electronic conversion, storing and reading the data, and logic operations cost in the case of a hybrid architecture.
Scalability is one of the key metrics and is the consequence of the architecture choice. It measures the ability of a system to keep its algorithmic performance with a growing number of variables.
The optical setups enjoy additional degrees of freedom compared to the conventional electronic hardware. For example, two independent variables in the complex plane can param-www.advancedsciencenews.com www.advquantumtech.com eterize short optical impulses. In addition, one can explore optics-specific degrees of freedom such as polarization and orbit angular moments of light.
Lastly, current optical hardware is used to employ classical algorithms and NN architectures that are conventional for standard electronic architecture. These algorithms are designed using Boolean logic, which is suitable for a digital computing system. However, they are not always optimal for optics implementation. Therefore, developing specialized algorithms optimized for optical computer platforms is necessary, further reducing the operational complexity and execution time.

Noise in Analog Optical Computing
In many analogue devices, noise plays a crucial role in their operation. The investigation of noise, its sources, and its properties in optical systems is a fundamental subject that has been addressed extensively. The classical theory of laser noise [270] encompasses fundamental concepts of fluctuations caused by atomic transitions between lower and upper levels and the independence of classically-prescribed optical fields, considering both moderate and high-power laser cases. These ideas were expanded to include sources of quantum noise (such as momentum fluctuations of electrons at optical frequencies and uncertainty-related fluctuations of the electromagnetic field), shot noise, the transition to classical noise in high-power lasers, the distinction between lasing and non-lasing modes, and intensity fluctuations at different frequencies and their corresponding distributions. [271] The transition between classical and quantum noise has been extensively studied, [272] with discussions on the origin of quantum noise emerging from the reversible or irreversible part of dynamics and comparisons with purely classical fluctuations and corresponding physical examples. [273] While noise in analog photonics is generally considered a harmful effect, it is possible to mitigate its impact or make the system robust toward specific types of perturbation using system-specific techniques. Analog DL platforms experience both deterministic and non-deterministic noise sources, with the amount of noise increasing with operational speed. To address this issue and efficiently deploy MAC operations, the authors of [274] introduced and experimentally demonstrated a noise-resilient DL scheme with a record-high 10GMAC per sec per axon compute rate. This approach uses a coherent silicon integrated circuit that combines a noise-tolerant linear neuron architectural scheme with noise-aware training methods.
There are many other alternative ways of dealing with "noisy" analog computations. For example, in Ref., [275] authors discuss the advantages of the pre-trained analog optical processors with bit precision operations. To surpass the performance of digital processors because of the confined photonic hardware size and the limited bit precision of high-speed electro-optical components, engineers usually used post-training techniques such as inference averaging, dynamic precision inference etc., to compensate for the "noisy" analog computations. Hence, the authors proposed and experimentally demonstrated a speed-optimized dynamic precision NN inference via tiled matrix multiplication on a silicon photonic processor with the aim of targeting highaccuracy and speed-optimized classification tasks. The advantages of optical computing over digital computing for accelerat-ing DL lie in operations executed at low precision. The key metric here is the effective number of bits of precision of analog processors, which is limited by noise. Dynamic precision analog computing for NN was proposed in Ref. [240]. It lies in repeating operations and averaging the result, decreasing the impact of noise. This method reduces energy consumption by up to 89% for a particular computer vision model and 24% for a natural language processing model, overcoming the weight noise, thermal and shot noises.
The computational errors from different sources tend to accumulate and severely impair the large-scale photonic NN performance. Usually, one can not expect something valuable from the noise realizations. Counterintuitively to the current belief, it was demonstrated that a photonic generative network can act as a part of a generative adversarial network for generating handwritten numbers. [241] The implementation was realized with a photonic core consisting of an array of programable phase-change memory cells, applied noise-aware training by injecting additional noise, and led to the demonstration of the network's resilience to hardware nonidealities. Several offline noise-aware training schemes for discriminative models were proposed, such as injecting noises to layer inputs, synaptic weights, and preactivation. [241] In summary, noise is considered a significant obstacle to efficient optical computation; however, it can be leveraged or exploited smartly.

Optical Minimizers of Spin Hamiltonians
Optical systems designed to minimize spin Hamiltonians have the potential to find the global minimum of hard optimization problems. These systems offer several advantages, including the ability to find better solutions to a wide variety of nonlinear optimization problems within a fixed time, to find solutions of a given precision more quickly, or to solve more complex problems at a fixed and limited cost. However, these machines also have their limitations and vary in terms of scalability, the ability to engineer the required couplings, the flexibility of tuning the interactions, the precision of read-out, and other factors that facilitate the approach to the global minimum rather than local minima. Despite these limitations, all of these machines have some aspects of their operation that promise increased performance over classical computations. To solve an optimization problem using optical minimizers of spin Hamiltonians, it is necessary to find an optimal mapping of the real-life problem onto a spin Hamiltonian. Some optimal mappings are already known, while for others finding an optimal mapping is a crucial step toward successfully solving the problem.
Combinatorial optimization is a field of study that focuses on finding the absolute minimum configuration of a given problem. However, in many applications it is desirable to not only find one absolute minimum, but also to obtain multiple or all degenerate absolute minima and, in some cases, to sample many low-energy excited states. [276] This sampling capability can be useful for applications that require distributional information about optimal solutions, such as the implementation of Boltzmann machines as generative models for ML. [277] In industrial settings, having access to a pool of candidate solutions to an optimization problem can make processes more efficient and flexible. For example, in drug discovery, [278] structure-based lead optimization could generate many candidate molecules for simultaneous testing.
One approach to solving large optimization problems is to decompose them into smaller subproblems that can be solved separately, for example to accommodate hardware limitations. By using multiple low-energy samples rather than just the optimum for each subproblem, it is possible to construct better solutions to the original problem. [279] However, a spin minimizer designed for combinatorial optimization may not be well-suited for sampling all ground and low-energy states. The nonlinear stochastic dynamics of such machines in the presence of quantum noise can be exploited to sample degenerate ground and low-energy spin configurations of spin models. When these optical machines operate in a quantum-noise-dominated regime with short photon lifetimes (i.e., low cavity finesse), homodyne monitoring of the system can efficiently produce samples of low-energy spin configurations that are better than their classical counterparts. [280] To properly access the properties of such systems, one can use computer simulations in several scenarios. Such emulations allow one to avoid extensive labor experiments to predict properties of such systems properly, tune and optimize the parameters for optimal performance and even inspire new classes of algorithms for conventional computers. The emulation algorithms can be found in Refs. [281,282]. Such techniques can apply to a broad type of NNs.

Efficiency of Artificial Neural Networks
ANN are powerful tools for processing large data sets and analyzing vast amounts of information quickly and without explicit instructions. As a result, a wide variety of NN architectures have been developed and implemented in various applications. The development of different NN is important because each architecture can represent different systems while maintaining a certain level of universality in approximating and representing complex systems. This expands the already significant scope of applicability of NNs.
Passive optics can perform many linear transformations without power consumption and minimal latency at rates over 50 Gb s -1 . Optical logic gates have been demonstrated to be feasible. [283][284][285][286][287] However, attempts to replicate classical boolean electronic logic circuits in photonics have not been successful. Analog photonic computing devices are suitable for NNs due to their fast and energy-efficient computations. Optical nonlinearities can be used to implement various nonlinear functions. [288] Recent developments suggest that optical implementations of NNs can surpass electronic solutions in terms of computational speed and energy efficiency. However, the challenge of developing truly deep NNs with photonics remains. Photonic multilayer perceptrons and photonic SNN have potential for realizing alloptical ANNs. Photonic accelerators for convolutional NNs are the most promising photonic solutions for enhancing inference speed and reducing power consumption in the near term.
However, there are still many opportunities to explore and improve the implementation of photonic NNs. For example, more research is needed to assess whether specific types of deep NNs can be implemented optically in an efficient manner that provides advantages over fully electronic implementations. Further-more, some deep NNs, such as long-short-term memory NNs, generative adversarial nets, geometric deep NNs, and deep belief networks, have not yet been implemented in photonics.
The ultimate goal is to realize large photonic NNs with thousands of nodes and interconnections across many hidden layers. To achieve this, it is essential to work on the cascadability and robustness of photonic NNs to fabrication imperfections and parameter drifts over time. [289] Resonant structures like microring resonators are susceptible to manufacturing deviations. [290] Linear optical processors based on MZIs appear more robust due to their reconfigurability. Some studies discuss achieving reliable photonic computations with imperfect components. [291] Further investigation is needed to implement nonlinear activation functions in an all-optical manner in photonic NNs. While software can emulate nonlinearities, integrating nonlinear elements into hardware remains a challenge. Several approaches have been reported to address this issue, including the use of MZIs, [292] graphene and quantum well electro-optic absorption modulators, and photonic crystals. [293] Technological breakthroughs would greatly benefit photonic NNs, particularly the implementation of an integrated, non-volatile, and energyefficient photonic memory element. Phase-change materials are a promising approach for achieving such photonic memories due to their potential for multi-level storage. [294] These materials' cells have been exploited in photonic NNs, particularly for SNNs. [57]

All-Optical Backpropagation
When training NNs, one usually considers the backpropagation algorithm by default. The essential idea behind the backpropagation is to compute the gradient of the loss function with respect to each weight by the chain rule and doing it consequently, one layer at a time, iterating backward from the last layer to avoid redundant calculations of intermediate terms in this sequence of steps. [214,216] Such a procedure allows one to fit the weights of a NN for a given task. Still, the complexity of the backpropagation is enormous. It grows linearly with the number of training examples or butches, the number of iterations, which is not known in advance and the basic complexity of feedforward input propagations, which can be estimated as a consequent series of matrixvector multiplications. These evaluations hold for many cases, assuming batch gradient descent algorithm and simple matrix multiplication for the input propagation. However, one can reduce the number of steps with some approximate schemes. At the moment, there are many different ways to train NNs, including variants of backpropagation or alternatives, such as learning without backpropagation. [295] Thus, the backpropagation algorithm remains one of the most expensive components to compute. The significant power and time consumption happens due to the sequential computation of gradients in the backpropagation procedure of NN training. Backpropagation through nonlinear neurons is another challenge to the field of optical NNs and a significant conceptual barrier to all-optical training schemes. Although there exist several practical, simple solutions, such as using approximation provided in a pump-probe scheme that requires only passive optical elements [296] or by measuring the forward and backward Figure 13. Schematics of physical neural networks (PNNs). Training sequence of PNN with b) back-propagation, and c) augmented biologically plausible training called direct feedback alignment (DFA). Augmented DFA enables parallel, scalable, and physically accelerable training of deep physical networks based on random projection with alternative nonlinearity g(a). Reproduced with permission. [300] Copyright 2022, Springer Nature.
propagated optical fields based on light reciprocity and phase conjunction principles, [297] the schemes still involve digital electronics or programming a high-speed SLM respectively. Therefore, having incomplete solutions, the work on the end-to-end optical training of NNs is in progress. Achieving the efficient all-optical backpropagation training method (besides the realization of depth and nonlinearities) will be a major achievement in the field. The question of such realization is just a matter of time since there are no fundamental restrictions on such a development. [298]

Alternative Learning Methods
The most computationally intensive part of the NN operations is the learning process, with backpropagation being the standard procedure behind many cases of training the weights. Exploiting the physical mechanism to reduce the energy requirements for such a training procedure (e.g., in the case of optical schemes) should be considered one of the significant achievements of optical NNs. A hybrid in situ-in silico universal algorithm called physics-aware training was introduced in Ref. [299] with a few examples (including an optical one) as demonstrations. Since the analogue computation of the backpropagation terms according to the direct chain rule is complicated, scientists and engineers have devised many alternatives for the learning procedures. For example, the training algorithm called direct feedback alignment was introduced in Ref. [300], see Figure 13. This is a universal method based on random projection with alternative nonlinear activation and requires no information about the nature of the physical system. Another approach lies in a simple mechanism that can transmit teaching signals across neuronal layers by multiplying them by random synaptic weights and performs similarly as backpropagation on many tasks. [301]

Statistical Sampling
Statistical sampling is another essential domain where using optical machines can be beneficial. PGMs can effectively represent the probability distributions of different factors in complex systems. Moreover, due to its universal structure, one can model complicated large graphs with many factors for various practical problems.
The correspondence between the Ising model and the probability measure of the pairwise PGM allows one to solve many tasks, such as inference based on the given observations or sampling. For example, the latter can obtain the most and the least probable states by exploiting the sign before the energy function. Furthermore, the additional specific mechanism presented in several types of hardware can enhance the sampling procedure www.advancedsciencenews.com www.advquantumtech.com to efficiently use them as a source of additional information for particular problems.
Unfortunately, the simulation of PGMs using optical machines needs to be better investigated. The obvious directions will be to increase the programmability of the optical spin models to access more options for manipulating the Ising/XY/Potts etc., states or decompose large and rich PGMs into their discrete approximations accessible by spin Hamiltonian simulators. Another option is to investigate additional hardware improvements in the context of PGMs. Finally, there are many more applications of such correspondence between spin system functionality, control theory, and decision-making.

Neural Architectures and Transfer Learning
Many NN architectures can differ in forms (deep and shallow, feed-forward, and recurrent), training methods, network topologies, and operational principles. Some photonic architectures were mentioned before; see Section 2.3. Moreover, some of these structures are best suited for one purpose than another. For example, recurrent NNs are good at tackling temporal dependencies, while convolutional NN is a standard architecture in image processing tasks. However, one can not easily realize all of the architectures on particular hardware due to its physical limitations or the ineffectiveness of the design.
To deal with the transfer of functionality between different architectures, one can pay attention to the domain of transfer learning. Originally, transfer learning was a research direction in ML that aimed at gaining knowledge from solving one type of problem and using it in a different but related domain; see recent reviews. [302,303] However, transfer learning is a way to transfer features of one architecture to another and make the problem more hardware-friendly.
Transfer of functionality will dramatically influence the ML domain and benefit the hardware computing field. It is of essential importance for optical devices, which have certain engineering limitations on the realizations of some architectures. Many more related research directions, like neural architecture search, can be adjusted to optimize the hardware systems.

Optical Quantum Computing
ANN in photonic integrated circuits and optical minimizers of spin Hamiltonians are the main paradigms for optical platforms that have already established an engineering base and clear development directions. Compared with emerging quantum technologies, a high-risk endeavor, classical optical devices offer advantages in speed, parallelism, energy consumption, or operational policy in short to medium term. Therefore, we can say that optical technologies are repeating their electronic special purpose hardware analogues development, with the technological progress making "another loop in its spiral development". Quantum computers have emerged from exciting developments in physics and the theory of computation. There are several hardware platforms for developing quantum computing, and it is still being determined which technology or combination of technologies will be most successful.
This section assesses the current status and future potential of quantum computing based on photons. The current view of the academic community is to exert caution when discussing future practical applications of quantum computing technology because it is so different from the information technology we use now. Many believe that quantum technology will substantially impact society in the decades ahead. [304] Still, not many are confident about the commercial potential of quantum technology in the near term (five to ten years). [305] Others are sceptical that quantum computers will ever become useful. [306] At the core of critics' argument against the feasibility of quantum computers lies the notion of complexity. So far, a very low-level complexity class of probability distributions has been identified and described by noisy intermediate-scale quantum computers. Such computers would allow neither good-quality quantum error correction nor a demonstration of "quantum supremacy"-the ability of quantum computers to make computations that are impossible or extremely hard for classical computers. [306]

Quantum Optical Devices
The operation of quantum computers relies on three principles: quantum entanglement, quantum complexity and quantum error correction. Therefore, quantum computers exploit the characteristic correlations among the parts of a quantum system that make them robust and scalable to large devices solving hard problems.
By 2022, many advances in quantum computing were announced (but some were also refuted). The leading technology is based on superconducting qubits (Google, IBM, Rigetty) and trapped ions (IonQ, Honeywell). Google team has announced quantum supremacy using 53 qubits in 2019; IBM entangled 65 qubits while revealing a road map to more than 1000 by 2023. The advantages of superconducting qubit systems are that they are based on well-developed semiconductor technology; however, they have to be kept cold (10mK) and have a short decoherence time (< 10μs). In contrast, trapped ions are very stable with much longer decoherence times (minutes), longer range interactions (beyond nearest neighbors) and report the best quantum volume among any quantum computer systems. However, many lasers are needed to be controlled simultaneously, the operation could be faster, and it would be hard to put many ions on a chip. So far, IonQ has achieved 32 trapped ions in a chain, promised to achieve quantum supremacy by 2025 and solve interesting real-life problems by 2028. There are other proposals and small-scale realizations using silicon quantum dots, [307] diamond vacancies, [308] neutral atoms, [309,310] etc. One of the biggest disappointments was experienced by Microsoft in 2021 invested into topological qubits. A topological qubit created from a pair of Majorana zero modes could theoretically benefit from topological protection. This protection could lead to stability and a lack of decoherence, potentially allowing topological quantum computers to scale up in power more easily than other approaches. The theoretical existence of Majorana zero modes was claimed to have been realized experimentally in 2018, [311] but the paper was later retracted due to the discovery of erroneous data.
Quantum computers based on photons had been considered impractical in the early ages of quantum computer developments because of difficulties in generating and controlling the required quantum states. However, such computers are being developed by photonic companies such as Xanadu (Toronto) and PsiQuantum (Palo Alto, CA) in addition to intensive academic research. The advantages of photon-based quantum computers are room temperature operation, much longer decoherence times (from ms to h), and the systems being cheaper and easier to build. However, they become large quickly (although PsiQuantum claims that one million qubits would still be possible).
For a photon-based quantum computer, boson sampling was proposed as a counterpart to a random quantum circuit of superconducting qubit systems. A sampling task is one where the computer generates samples from a specific probability distribution. Quantum algorithms allow sampling from probability distributions well beyond the capabilities of classical computers. The most famous example is Shor's factorization algorithm which exploits the ability to sample a probability distribution efficiently based on the Fourier coefficients of a function on a quantum computer.
Squeezed states of light have an unequal distribution of quantum uncertainty between their amplitude and phase. The more a state is squeezed, the more photons it contains. Multi-photon squeezed light is found in many quantum-optics experiments and has been studied for over two decades in quantum computing models. [312] It has been proposed that a relatively simple optical circuit consisting of beam splitters and photon counters that exploits the properties of squeezed light could carry out a sampling algorithm faster than classical computers. [313,314] Such an algorithm has many practical applications, including finding matching configurations between molecules [315] or different states of a molecule. [316] More rigorously, the boson sampler is a quantum optical device in which a linear optical network mixes many non-classical photon sources. As a result, the photons are indistinguishable and, when originating from different sources, lead to complex photon counting statistics of the output detectors. When the number of the input/output channels of the boson sampler is large, the emulation of such a device with a classical computer is believed to be ♯ ℙ-hard. [313,317] In the original formulation, the boson sampler was introduced as a device consisting of singlephoton sources, a linear interferometer and photon-counting detectors at the output channels. Several experiments implemented variations of this set-up: 5 input photons in 21-mode optical circuit, [318] and 20 input photons in 60-mode interferometer. [319] Using single-photon sources creates various technological complications that reduce the scalability necessary to overcome the classical computer calculations (that roughly scale as 2 k in the number of operations where k is the number of input photons). The lack of scalability in single-photon-based experiments on integrated platforms is due to non-deterministic state preparation and gate implementation. Using deterministically prepared squeezed states and linear optics with non-Gaussian operations provided by photon-counting detectors allows for significant scaling up in the number of input/output channels. Therefore, the Gaussian boson sampling (GBS) was proposed, where the single-photon sources are replaced by the single-mode squeezed light generated by parametric down-conversion sources. [313] The achievement of "quantum computational advantage" while implementing GBS using 50 input channels and a 100-mode inter-ferometer was recently reported. [320] The authors state that their device provides 200 s samples requiring classical computers billions of years. Specifically, the paper reports a GBS experiment representing a quantum state in 10 30 -dimensional Hilbert space and a sampling rate that is 10 14 faster than that of using digital supercomputers. This paper was described as the first independent verification of Google's quantum advantage claims and claimed to surpass Google's supremacy by several orders of magnitude.
This huge computational advantage reported [320] is based on specific statistical tests measuring the proximity of the measured samples to the outcomes of noiseless simulations of the quantum experiment that were performed on a classical digital supercomputer. It was previously shown that a classically sampled distribution might pass the same statistical tests by only reproducing small-scale correlations of the actual theoretical distribution. [321] Moreover, a polynomial-time algorithm based on taking a truncated Fourier-Hermite expansion on the boson sampling distribution [321] may achieve similar or better sampling quality for the statistical methods of Ref. [320]. Another method for attaining similar sampling quality based on an algorithm of Clifford and Clifford [322] was also proposed. [323] Finally, very recently, a series of approximations were introduced to generate the probability distributions of any specific measurement outcome in a polynomial complexity. [324] The accuracy of the experiment was achieved at the fourth-order approximation using a laptop computer. The algorithm was tuned toward the actual experiment and applies only to the GBS (not Fock-state boson sampling), [313] only for threshold detectors (not photon-counting detectors), and only for a small number of modes (not quadratic in the number of photons as in the original proposal [313] ). Subsequent experiments reported nontrivial genuine high-order correlations in GBS samples, providing evidence of robustness against possible classical simulation schemes. [325] So far, experimental implementations of GBS lack programmability (reconfigurability of the circuitry) or have prohibitive loss rates that limit the scalability. There is a need for rigorous theoretical evidence of the classical hardness of GBS, althought some progress was recently made. [326] In 2021, Xanadu and NIST attempted to remedy this by implementing a programmable and potentially highly scalable circuit. [327] The system uses eight modes of strongly squeezed vacuum initialized as two-mode squeezed states in single temporal modes. These pass through a fully programmable four-mode interferometer and are read out using photon number-resolving detectors on all outputs. This was achieved using strong squeezing and high sampling rates. The interferometer implemented a user-programmable gate sequence based on a network of beam splitters and phase shifters. The resulting eight-mode Gaussian state was measured on the Fock basis using eight independent photon-number-resolving detectors. The total device was composed of a 10 mm × 4 mm photonic chip coupled with a high-level application programming interface running on a classical computer.
There are many problems to overcome before the quantum sampling implemented on a quantum computer becomes useful for real-world applications. Photon losses need to be controlled and significantly decreased to enable photon travel through the circuitry to improve scalability. The improvement in the sampling fidelity and the quality of the squeezed states must be increased. The most exciting application would require individual control of the degree of squeezing and the amount of optical power in each squeezed state. However, the number of possible real-life applications that can be addressed using the current architecture is limited. The Xanadu group implemented two potentially practical algorithms by encoding problems into the beam splitter network. They used the generated samples to determine energy spectra for transitions between molecular states and to find the similarity between graph representations of different molecules. Specifically, a graph can be encoded in a photonic circuit by mapping its adjacency matrix into the structure of a linear optical interferometer with squeezed light. [328] The photon-counting statistics can be used to specify so-called "feature vectors", which represent the graphs in Euclidean space [329] so that the distance between them can be used to quantify the similarity of the corresponding graphs. Such similarity measure between the graphs derived from a GBS device is important, for instance, for classification in ML.
Recently, other optical computing platforms based on squeezed states have been theoretically proposed on the route to a useful optical quantum computer; [330,331] see Figure 14.

Boson Sampling and Graph Isomorphism
Using the light interference network for quantum analogue calculation has many practical advantages. We mentioned that operating with the boson sampling setup allows one to calculate the permanent of a specific matrix, which is extremely hard from the computational perspective. However, it is more complex to make this helpful computation and was an open question for some time with a few remaining debates. Recently the connection between a Gaussian boson sampler and the graph isomorphism problem was established. [333] The graphs are encoded into quantum states of light, and then their properties are probed with photon-number-resolving detectors. Using a complete set of graph invariants, the authors prove that the probabilities in the setup can be combined, and the isomorphism between the two graphs can be established only in the case of equal detection probabilities on the output.
It is still an open question whether graph isomorphism has a specific complexity type. It is believed to belong to the class of ℕℙ-intermediate computational problems. The existence of a polynomial-time algorithm that can determine whether two graphs are isomorphic is under question; however, there are quasi-polynomial types of algorithms. One can find the recent advances in photonic boson sampling with the description of both the technological improvements and future challenges. [334] The proposed connection between the graph isomorphism and boson sampling can be further extended to other practical tasks, such as constructing graph kernels for the ML applications operating with the graph-structured data. [329]

Quantum ML
Programmable waveguide meshes possess the capability to execute arbitrary linear transformations between sets of input and output waveguides, a fundamental operation in photonic quantum computing. In this paradigm, quantum information is encoded in the quantum states of light propagating through photonic integrated circuits. [335] A prevalent scheme encodes a qubit as a single photon in a superposition of two rail waveguides. [336] Noisy intermediate-scale quantum (NISQ) devices have demonstrated potential in the field of quantum ML, offering the prospect of processing vast data sets at a significantly faster rate than classical computers. [337] Quantum ML draws parallels with classical photonic deep NN accelerators, consisting of stages of linear waveguide meshes interconnected by activation layers exhibiting strong reversible nonlinearities. [338] In a quantum optical neural network (QONN), programming a NISQ computer entails training the phases in the waveguide mesh via supervised learning on input and output quantum states. QONNs can be trained to perform an array of quantum information processing tasks, including quantum optical state compression and reinforcement learning. Recently, a QONN successfully programmed a one-way quantum repeater. [338,339] Nonetheless, these concepts and numerous other ideas in neural quantum architectures remain far from practical implementation with current experimental capabilities.

Comparison with other Quantum Approaches to Optimization
As previously discussed, CIM has shown several orders of magnitude time-to-solution advantages compared to D-Wave2000Q quantum annealer on similar dense matrix instances. Recent comparisons of the quantum approximate optimization algorithm (QAOA) with competing methods such as quantum annealing and simulated annealing, [340] the D-Wave2000Q quantum annealer with the IBM Q Experience system that implements QAOA, [341] and benchmarking of QAOA on Google's "Sycamore" [342] enable us to compare the performance of optical spin machines with QAOA to some extent. Figure 15. The ratio of the found solution to the best solution when using QAOA for various problem sizes N. Each solution is averaged across 10 random instances (standard deviation is given as error bars). The experimental solutions of the SK and Max-Cut models approach random as N increases. Reproduced with permission. [342] Copyright 2021, Springer Nature.
In the QAOA, the variational wavefunction resembles a trotterised version of the quantum annealing procedure: where the starting state is |+ > is the product state of eigenstates of x with eigenvalue 1, |+ >= ∏ i (|0 > i +|1 > i )∕ √ 2 which is simultaneously the superposition of all computational basis states. In contrast to a trotterised version of quantum annealing, the parameters i and i are adjusted in a classical learning loop to minimize the objective function. Such adjustments are considered as ℕℙ-hard problems themselves. As P → ∞, QAOA approaches smooth QA. According to the results of Ref., [340] the quantum approximate optimization algorithm (QAOA) can deterministically find the solution to specially constructed optimization problems where both quantum annealing and simulated annealing fail due to wide and tall energy barriers of the function being minimized.However, there exists an efficient classical algorithm for these instances.
In Ref., [341] small size (up to N = 18) weighted Max-Cut problems and 2-SAT problems were tested using D-Wave2000Q quantum annealer with IBM Q Experience. The actual machine IBM Q on 16 qubits gave such poor solution quality that the real physical experiment on D-Wave200Q was compared to the simulation of QAOA. Even in this case, physical QA has shown much better success probabilities than QAOA (99.92 vs 8.84(p = 1) and 42.39(p = 3), respectively, on as small matrices as N = 8!) The conclusion drawn was that, for the set of problem instances considered and using success probability as a measure, "the QAOA cannot compete with quantum annealing". The corresponding plots can be found in Figure 15.
In Ref., [342] the authors used the Google Sycamore superconducting qubit quantum processor to run the quantum approximate optimization algorithm (QAOA) for combinatorial optimization problems on a planar graph matching the hardware connectivity. They also applied QAOA to the Sherrington-Kirkpatrick model and Max-Cut, both high-dimensional graph problems that require significant compilation for QAOA. The problems were solved up to N = 23 numerically (without noise) and experimentally. For QAOA the theoretically optimal , and p ∈ 1, 2, 3, 4, 5 were used in experiments. The success probabilities on average of the problem on graph matching hardware reached a plateau for N > 8 at about 80 % (numerically) and about 45 % experimentally. For SK and Max-Cut problems, performance deteriorated quickly (for any p) to the probability of finding a solution by random guessing (for N > 15). The authors deduced that although current quantum processors are unable to surpass classical optimization heuristics, utilizing prevalent techniques such as the Quantum Approximate Optimization Algorithm (QAOA) on representative problems can serve as a standard for evaluating different hardware platforms. In order for quantum optimization to rival classical approaches for practical problems, it is imperative to transcend beyond artificial problems with low circuit depth.

Quantum Effects and Optical Machines
As we can see, various optical hardware uses different mechanisms for its operation. It can have the primary mechanism's pure classical, quantum, or hybrid nature. Even in the case of operating near the classical limit, the quantum effects can be essential and greatly influence the actual operation regime.
For example, it was shown that the nonlinear stochastic dynamics of the CIM in the presence of quantum noise could be efficiently exploited to sample degenerate ground and low-energy spin configurations of the Ising model on the example of Max-Cut problems. [343] Both quantum noise and optical nonlinearities play an essential role in system dynamics. Removing these essential elements will result in the degradation of sampling performance. The supplementary numerical results beyond the classical simulation complement the description of the quantum mechanism's role in the CIM operation. Another work [146] studies the performance scaling of three quantum algorithms for combinatorial optimization, such as CIM performance, discrete adiabatic quantum computation, and the Dürr-Høyer algorithm for quantum minimum finding that is based on Grover's search. Authors claim that the CIM performance is dramatically better for solving Max-Cut problems. Moreover, the CIM is competitive against various heuristic solvers implemented on CPUs, GPUs, and FPGAs.
Many optical devices fall under the category of open quantum systems. Such a formalism is necessary to account for many complicated effects. For this purpose, a Markovian open quantum systems framework has been developed. [344,345] Such effective dynamics for the reduced density matrix of the system give rise to the Lindblad-form master equation, which allows one to trace such effects as equilibration with the pump and decay processes, thermalization of the system and different aspects of interaction with the environment. Although the numerical methods for such processes are quite complicated, one usually develops approximated schemes that account for the omitted effects. There are many more systems where this approach could be beneficial for describing subtle but essential features-another example, except for the CIM, is the exciton-polariton system frequently mentioned before. Furthermore, one should pay attention to other microscopic processes in the EP system since such consideration gives more degrees of freedom to inspect compared to the simple mean-field theory. [346][347][348]

Benchmarking Optical Machines
So far, the research and development of optical hardware are experiencing significant growth. The main problem is to compare the capabilities of optical machines as they are often tested on different problems of variable sizes and difficulty. Thus it is hard to figure out the scaling properties of the particular mechanism from either experiment or numerical emulation procedure. Another problem is lying in the biased results, which can be cherrypicked for better demonstrative purposes. In general, it is hard to find extensive, complete, up-to-date and unbiased results comparing different types of optical hardware.
Although the majority of the NN optical architectures can be compared using standard metrics concerning the accuracy of the particular datasets and the required workload, we outline what is known and how some of the optical machines can be compared. For example, in Ref. [349] comparisons between memristors, GPUs, D-Wave and the CIMs were made using the same set of dense 60-node Max-Cut graphs. The time to solution (with 99 % probability to reach optimal solution) was 600μ s for the CIM and 1000 s for the D-Wave. It was shown that the Ising machine, which is based on optoelectronic feedback systems, resolved Max-Cut optimization problems on both regular and frustrated graphs with 100 spins, exhibiting comparable or superior performance to CIMs based on DOPOs. [118] Since OEO-based CIMs can be realized as integrated photonic circuits, the flexible spin coupling can be achieved optically using programmable silicon photonic circuits. This will fully exploit the high bandwidth of the optical system and result in a significant acceleration over existing CIM concepts.
Establishing universal benchmarks will attract more research in the field since understanding the hardware's successes and failures on particular problems allows one to maximize utility. Moreover, such a research direction shares similar issues with the ongoing studies on the NN architectures and phase transitions in the statistical approaches to the computational problems, [102] which is cross-beneficial for all of the domains.

The Most Promising Applications for Optical Computing
Our subjective perspective is that modern optical computing has the potential to give a significant computational advantage in three major applied areas: neural networks, nonlinear optimization, and statistical sampling.
Optical hardware is a promising platform to get accceleration for these applications, with many computational advantages coming from the hybrid-quantum/classical mode of operation. The optics naturally supports these tasks but also benefit from www.advancedsciencenews.com www.advquantumtech.com many more factors, such as specific architectures and their interplay with the natural properties of light systems.
For example, a mode selection mechanism is one of the beneficial regimes of operation for a quantum system spanning a highdimensional space of possible solutions and finding an optimal one while settling to the first possible coherent state with a large occupation. Another component is classical dynamical system behavior that can mimic the NN dynamics, following the classical gradient dynamics on a changing energy landscape while tunnelling through barriers to the nearest energy minimum. The task is achieved if this minimum corresponds to the optimum solution to the problem. Finally, a similar mechanism is responsible for sampling the landscape's low-energy subspace.

Future Perspectives
The 'no free lunch' (NFL) theorem in optimization states that any two optimization algorithms have the same performance averaged across all possible problem instances. This theorem applies to the hardware instead of the algorithms with many more implications. One of the consequences of the NFL theorem is the correspondence between the solver/hardware structure and the hardness of the problem with the best-case and worst-case scenarios.
To use optical spin machines to speed up the solutions to specific industry real-life problems, one needs to think hard about the range of application that may go far from the QUBO. These application has to closely match the optical machine's operational principle to take advantage of all the potential advantages. Many questions need to be addressed before the optical spin machines become useful for the real-life applications. Which platforms should we use for comparison between different machines? What is the importance of optical quantum versus classical, classical versus classical, hybrid versus classical, optical versus other physics-based hardware advantages? How does hardware performance compare to the best algorithms run on traditional systems? To answer these questions, we need to introduce a standard for fair comparisons between the machines and approaches. Which section of the workflow is more advantageous to optimize? Should sections that are closer to hardware or closer to a user be more important? How to properly optimize pre-processing and post-processing? How do we evaluate results and which metric should be used? The proximity between the found solutions of QUBO can be evaluated using, for instance, the Hamming distance, the distance in the energy space, the ratio of the energies, the accuracy of the neural architecture, or some other the generalization of the error metrics can be used. How do we evaluate optimization performance in several important dimensions, for example, the computation time, the solution quality, the energy efficiency, the input scope, etc. all can be used for evaluation? How do we account for overheads concerning the given architecture and the task-specific constraints? How to optimize the implementation, translating, embedding, tuning, post-processing? How do we find inputs that are hard and relevant? How do we avoid using trivial ones? The possible the possible phase diagrams should be built for the parametrized tasks. How do we maximize the generality of the conclusions, and to what extent if we can only test some combinations of inputs and hardware?
The advantage will come when we (i) develop purpose-built solutions tuned to specific applications, (ii) develop hybrid algorithms and approaches (e.g., including ML as a part of the hybrid solutions) and (iii) leverage programmable accelerators for core tasks. More research is needed to bring the potential of optical (or any other unconventional) computing systems to real-life applications. Answering the critical questions will bring us closer to a better understanding of the underlying principles of unconventional optical machines, improve their performance and hence achieve a significant practical impact.