Adaptive Photochemical Nonlinearities for Optical Neural Networks

Optical neural networks (ONNs) hold great potential for faster and more energy-efficient information processing in coherent photonic circuits. To realize ONNs, linear combinations and nonlinear activation functions have to be implemented in an optical fashion. Optical nonlinearities are, however, still difficult to achieve, and existing designs are usually too inflexible to offer the different activation functions used in artificial neural networks. Herein, the nonlinear properties of the large and highly adaptive class of photoswitchable chemical compounds are made accessible as activation functions in ONNs by employing photo-induced isomerization in azobenzenes to steer activation behavior through nonlinear modulation of an information-carrying optical signal. The strength of the nonlinearity can be controlled by the chemical concentration, while a physically motivated model describes the experimental data for systematically varied photoswitching parameters, resulting in a tunable yet interpretable activation function. Employing such an activation function in a neural network then allows its strength to be gauged and established classification tasks to be performed. This work combines recent advances in photoswitchable chemical compounds and optical neural networks to enable control over the design of nonlinear activation functions, thus opening exciting perspectives for explaining the emergence of intelligent behavior in neural networks.


Introduction
Artificial deep neural networks (DNNs) are pervasive in science and engineering, [1] yielding new possibilities for a broad range of applications including computer vision, [2] language processing, [3] and biomedicine. [4,7,8] Despite the availability of more computational resources [9] and improved transistor technology, [10] physical constraints inevitably limit the scalability of electronic technology advancements. [11] Moreover, training very deep neural networks on semiconductor-based hardware requires a tremendous amount of energy [12] and causes increasingly high carbon emissions. [13] As a consequence, a steeply rising demand for alternative hardware technologies has emerged to continue the scaling of these systems and to enable more sustainable processing. [14,15] Optical signal processing has the potential to overcome these limitations due to its ability to operate at higher rates while having a substantially smaller energy footprint. [16,17] This is particularly true for DNNs, which require millions of matrix multiplications and can therefore further benefit from the intrinsic parallelization capabilities of wavelength- and spatial-multiplexing in optical processing, [18-20] resulting in an increasing interest in so-called optical neural networks (ONNs). [21,24-27] Importantly, it has been demonstrated that the linear part of these algorithms can strongly benefit from the high parallelization capability of integrated photonics. [28] In contrast, nonlinear signal processing has remained challenging for optical systems [29] and is instead done in electronic hardware in the majority of systems, [19,20,22,23,25,28,30-32] thus requiring frequent and inefficient analog-to-digital optoelectronic conversions.
Several examples of optical components with nonlinear effects resembling the behavior of popular activation functions are known, such as saturable absorption, e.g., in 2D materials like graphene [33-35] or in metallic nanoflakes, [36] electromagnetically induced transparency, e.g., in laser-cooled atoms, [37] and free-carrier dispersion or thermo-optic effects in resonator and interferometer configurations. [38] Also, quantum dots, [39] exciton-polariton nodes, [40] phase change materials, [41,42] and differentially biased semiconductor optical amplifiers [43] have been considered for use as nonlinear elements in neural networks. However, such nonlinearities typically lack the flexibility to adapt overall system performance to specific use cases. While it is required to tune the nonlinear characteristics of conventional neural networks to achieve optimal learning behavior, [44] it remains unknown how specifically the strength, functional shape, noise, and potential drifts of nonlinear activation affect the problem-solving capabilities of ONNs. Strong and tunable nonlinear optical materials are hence highly sought-after to overcome current bottlenecks for advancing hardware implementations of ONNs. [45]

Contribution

Photoswitchable molecular compounds constitute a highly flexible class of materials with optical nonlinear behavior, which has so far remained unexplored for ONN implementations. The nonlinear properties of photoswitches arise from conformational isomerism, i.e., the molecules can be found in two different geometrical configurations, namely, the trans and cis isomers (Figure 1A). Both isomer states have different optical properties, and switching between them can be achieved through thermal and optical stimuli. The wide variety of chemically well-studied photoswitchable molecules thus allows for adjusting the optical response in a variety of ways, including the concentration of photoswitches in a solution, molecular functionalization, temperature, and wavelength-selective optical controls for switching from cis to trans and vice versa.
Here, we focus on exploiting the photoswitching properties of azobenzenes, one of the best-studied photoswitchable molecular compounds, to implement a photochemical activation function (PCAF). We demonstrate how the resulting activation behavior can be continuously steered between different regimes of nonlinear signal transformations that facilitate information processing for optical neural networks. Using a physically motivated mathematical model, we formalize the behavior of our nonlinearity and enable the identification of optimal activation function parameterizations for different network architectures and tasks. These parameterizations feed back into our photochemical system and can be directly translated into physical and chemical properties, yielding a recipe for task-specific and easy-to-implement building blocks. We evaluated our system and its behavior on four well-known machine learning classification tasks and studied different parameterizations and the impact of noise on network performance. By fitting the model to different concentrations of azobenzene, we confirm that the physical implementation of our activation function can be tuned toward an optimal regime, with interesting implications for both the design of photonic nonlinearities and a deeper theoretical understanding of activation functions for optical neural networks.

Design and Implementation of the PCAF
We realize reversible switching between isomer states in azobenzenes via illumination with light, [46] where blue wavelengths (468 nm) induce switching predominantly into the trans state, while green visible (VIS) wavelengths (532 nm) switch predominantly into the cis state. Simultaneous illumination with light of both wavelengths thus allows us to tune the cis and trans isomer fractions in a solution. For azobenzene, however, the two isomer states exhibit dissimilar absorbance (Figure 1B), resulting in an intensity dependence of the attenuation coefficient in addition to the dependence on the chemical equilibrium between both isomer fractions. The optical input-output relation of light traversing the solution then becomes nonlinear, because the incident VIS signal alters the absorbance properties as it drives the azobenzene molecules into the cis state (of lower attenuation coefficient), while the blue signal provides a constant drive back into the trans isomer state. In consequence, the system shows characteristics similar to a saturable absorber. [47]

Here, we use a water-soluble tetra-ortho-methoxy-substituted azobenzene (Figure 1A) that features short switching times and good thermal stability of the cis isomer, with half-life times of multiple days. [48,49] Additionally, it provides high resistance to photobleaching and a remarkably high cyclability, with comparable molecules showing no visible decline in functionality after repeated switching. [50] Furthermore, the compound is highly stable in solution and shows no sign of degradation after more than 12 months of storage (see Figure S1, Supporting Information), making it a great candidate for long-term use.

We characterize the intensity-dependent absorbance of azobenzene solutions with adjustable concentrations, using a 532 nm wavelength laser as the signal carrier, as shown in Figure 1C, and control the input intensity with a variable neutral density filter. A power sensor then records the transmittance through a d = 1 cm wide cuvette holding the respective azobenzene solution, which is simultaneously illuminated by a 468 nm wavelength light-emitting diode (blue LED) to continuously switch azobenzene molecules back to the trans configuration.

Figure 1. A) Synthesized azobenzene in the trans and cis configurations. Conversion from trans to cis isomers takes place under irradiation at 532 nm (photoswitching), while reverse switching occurs under irradiation at 468 nm or due to thermal transitions. B) Wavelength-dependent absorption spectra of the cis and trans isomers. The different absorption properties are visible at the wavelength of 532 nm, which is subsequently used as the signal carrier. C) Experimental setup. The absorbance of a λ = 532 nm laser by a cuvette filled with azobenzene solution is measured. The photoswitching of the azobenzene is controlled by continuously driving the isomerism equilibrium back to the trans state with a blue LED (λ = 468 nm). A fiber splitter and a power sensor are used to determine the intensity of the incident light. Transmitted light is measured with a second power sensor. The laser is attenuated with a variable neutral density filter to record the nonlinear absorbance for different input intensities.

PCAF Modeling and Fitting
We deduce an analytical and differentiable fitting function for the (nonlinear) transmission of light through the azobenzene solution, based on the Beer-Lambert law. To model the behavior of the physical system that is relevant for simulating a corresponding neural network, we assume a homogeneous chemical equilibrium in the entire solution volume and a linear dependence of the trans-to-cis conversion rate on the input intensity, in addition to the constant attenuation rate implicit in the Beer-Lambert law. While such a model of the nonlinear dynamics adequately describes a wide range of photochromic processes, [51] it does not account for aggregation, [52,53] photobleaching, convection originating from insufficient mixing, or electrostatic interactions between the chromophores. [54,55] For the equilibrium mechanics of conformational isomerism, we thus find

I = μ I_0 exp(−α / (1 + β I_0))    (1)

where I_0 is the input intensity, I is the output intensity after transmission through the azobenzene solution, and α, β, and μ are fitting parameters. The strength of the optical nonlinearity is described by the parameter

α = (ε_t − ε_c) c d    (2)

where ε_c and ε_t are the attenuation coefficients of the cis and trans state, respectively, c is the concentration of azobenzene in the solution, and d is the cuvette width (see Supporting Information for a more detailed derivation). We measure the relation between input and output power of the VIS light (and thus the absorbance of the azobenzene solution) for five different concentrations c, which allows for controlling the strength of the nonlinearity (see Equation (2)). Fitting the experimental data for input intensities from 0 to 10 mW yields quantifiable linear and nonlinear behavior, as shown in Figure 2A. All systems show linear behavior at high input intensities, but below approximately 400 μW a nonlinear regime is observable (inset in Figure 2A). Here, the nonlinearity is more pronounced for higher concentrations, as may be expected. Figure 2B shows that the fitted values for the strength of the nonlinearity α increase linearly with concentration, in accordance with Equation (2).
From Equation (1), we now find that the absorbance undergoes three regimes (schematically shown in Figure 2C): 1) for low VIS input intensities (I_0 → 0), nearly all molecules are in the trans isomer state, resulting in a linear input-output relation with slope μe^−α; 2) at very high input intensities (I_0 → ∞), nearly all molecules are in the cis isomer state, also resulting in a linear input-output relation, here with slope μ; 3) in between both limiting linear cases, the desired nonlinear regime can be found. The center of the nonlinear regime lies at the point of inflection on a double logarithmic scale (I_0 = 1/β). At this point, the chemical equilibrium is balanced between both isomers, such that the cis and trans concentrations are equal. While the parameter β determines the position (relative to the input intensity) of the nonlinear regime, α affects its input and output intensity ranges. Small α values are representative of limited intensity ranges over which nonlinear behavior is observable (y-axis intercepts on a double logarithmic scale), while pronounced nonlinear behavior results in larger α values.
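As a numerical sketch, the three regimes can be checked directly. The function below assumes the saturable-absorber form I = μ I_0 exp(−α / (1 + β I_0)), which reproduces the limiting slopes μe^−α and μ and the inflection point at I_0 = 1/β described above; the parameter values are illustrative (α = 1.75 is the fitted value reported later for c = 3 mM).

```python
import numpy as np

def pcaf(I0, alpha, beta, mu):
    """Assumed PCAF transmission model: Beer-Lambert attenuation weighted
    by the intensity-dependent trans fraction 1 / (1 + beta * I0)."""
    I0 = np.asarray(I0, dtype=float)
    return mu * I0 * np.exp(-alpha / (1.0 + beta * I0))

alpha, beta, mu = 1.75, 1.0, 1.0

# Regime 1: low input intensity -> linear with slope mu * exp(-alpha)
print(pcaf(1e-6, alpha, beta, mu) / 1e-6)   # ~ exp(-1.75) ≈ 0.174

# Regime 2: high input intensity -> linear with slope mu
print(pcaf(1e6, alpha, beta, mu) / 1e6)     # ~ 1.0

# Regime 3: center of the nonlinear regime at I0 = 1/beta, where the
# exponent is -alpha/2 (cis and trans fractions are equal there)
print(pcaf(1.0 / beta, alpha, beta, mu))    # = exp(-0.875) ≈ 0.417
```

The displacement between the two linear asymptotes on a double logarithmic scale is exactly α, matching the schematic in Figure 2C.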
For simulating realistic neural network performance with the observed nonlinear photoswitching behavior, we further incorporate Gaussian noise into our model, representing the experimentally occurring variations of the PCAF. The standard deviation of the noise distribution is modeled to depend linearly on the input intensity.
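A minimal sketch of this noise model, assuming the same transmission form as above and a hypothetical proportionality constant `noise_slope` for the intensity-dependent standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)

def pcaf_noisy(I0, alpha, beta, mu, noise_slope):
    """PCAF output with additive Gaussian noise whose standard deviation
    grows linearly with the input intensity (assumed noise model)."""
    I0 = np.asarray(I0, dtype=float)
    mean = mu * I0 * np.exp(-alpha / (1.0 + beta * I0))
    sigma = noise_slope * I0
    return rng.normal(mean, sigma)

# Repeated evaluation at the same input scatters around the noiseless curve:
samples = pcaf_noisy(np.full(1000, 2.0), 1.75, 1.0, 1.0, 0.05)
print(samples.mean(), samples.std())  # close to 1.116 and 0.1
```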

Training Neural Networks Using the PCAF
We transform the measured data to dimensionless units for simulating neural network performance by multiplying the optical input and output intensity values with constant factors of inverse physical units (see Figure 3A). To avoid introducing additional hyperparameters, we chose to use the inverse values of the fitted parameters μ and β (see Equation (1)) for this transformation. This scales the input and output intensity axes such that, in simulation units, μ = β = 1. Thus, the fitted values of μ and β do not occur in the final activation function, and the only experimental value left is the strength of the nonlinearity α, which is not affected by the unit transformations. The resulting PCAF, as used for training the neural network, is shown in Figure 3B.
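The rescaling to simulation units can be sketched as follows. It assumes the transmission form I = μ I_0 exp(−α / (1 + β I_0)); the axis scalings x = β I_0 and y = (β/μ) I then leave x·exp(−α / (1 + x)) as the activation, with α as the only surviving parameter.

```python
import numpy as np

def to_simulation_units(I0, I, mu, beta):
    """Rescale measured intensities with the inverse fitted parameters so
    that mu = beta = 1 in simulation units (assumed axis scaling)."""
    return beta * np.asarray(I0), (beta / mu) * np.asarray(I)

def pcaf_sim(x, alpha):
    """Resulting activation function; the strength of the nonlinearity
    alpha is the only experimental value left after the transformation."""
    x = np.asarray(x, dtype=float)
    return x * np.exp(-alpha / (1.0 + x))
```

Applying `to_simulation_units` to data generated by the physical model reproduces `pcaf_sim` exactly, independent of the fitted μ and β.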
We simulate fully connected neural networks (FCNs) and train them on various classification tasks using the nonlinear PCAF. The FCN comprises an input layer, several fully connected layers, each built from a linear part (affine transformation) followed by the activation function, and finally an output layer where the softmax function is used to normalize the outputs and arrive at class label predictions (Figure 3A). We chose to first leave α as a free parameter and analyze its impact on network performance. We then test the validity of the PCAF in a simple classification task on the MNIST dataset [56] by analyzing the distribution of activations (before being passed to the PCAF) of a fully trained network (Figure 3C). First, we observe that all activations occurring during the simulation lie between measured data points; hence, we only interpolate between measurements and do not extrapolate to unknown regimes. Second, we find that the center of the distribution of the activations of the first layer lies close to 1/β, i.e., in the regime of the most pronounced nonlinearity. Because the network is trained to maximize its classification ability, this shows the high importance of the nonlinearity for achieving good performance. The same effect does not occur for the second layer. Here, the activations are distributed over a broader range, reaching far into the regime where the activation approaches linear behavior again. This could be a sign that the network has completed the necessary nonlinear transformation of the input data after the first layer and afterward classifies the derived features nearly linearly. Additionally, the up-scaling to high activations performed in this way by the second layer might be beneficial to reach a more calibrated output in the last layer (cf. [57]). These effects provide exciting opportunities for future work, where the impact of more complex datasets and additional regulating building blocks (e.g., batch normalization [58] to control the scaling effects) could be investigated.
To evaluate the PCAF, and, in particular, the influence of the strength of the nonlinearity α on learning behavior more generally, we run simulations of FCNs on four well-known classification datasets (Figure 4). The two tasks XOR and circle consist of two-dimensional continuous input variables and two possible output classes. The ground-truth decision boundaries as well as the network outputs are shown in the left column of Figure 4. The outputs are not deterministic due to the noise incorporated in our model, so we calculate the output 100 times and depict the results on a continuous scale. The stochasticity of the outputs is most pronounced in the vicinity of the classification boundaries, because small changes in the output can easily alter the classification decision there. MNIST [56] is a handwritten-digit classification task and FMNIST [59] comprises images of clothing; both use 28 × 28 pixel grayscale images of 10 classes as inputs. For each classification task, we show the classification accuracy on unseen test data (test set) in dependence on the nonlinearity parameter α.
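The two synthetic tasks can be generated in a few lines. The exact decision boundaries below are hypothetical choices, constrained only by the properties stated in the Experimental Section (inputs between 0 and 1, equal class abundance): for circle, the radius sqrt(0.5/π) makes the inside and outside classes equally likely.

```python
import numpy as np

def make_xor(n, rng):
    """XOR task: label 1 iff exactly one coordinate exceeds 0.5."""
    X = rng.uniform(0.0, 1.0, size=(n, 2))
    y = (X[:, 0] > 0.5) ^ (X[:, 1] > 0.5)
    return X, y.astype(int)

def make_circle(n, rng):
    """Circle task: label 1 inside a circle around (0.5, 0.5); the radius
    sqrt(0.5 / pi) gives both classes equal abundance."""
    X = rng.uniform(0.0, 1.0, size=(n, 2))
    r2 = 0.5 / np.pi
    y = np.sum((X - 0.5) ** 2, axis=1) < r2
    return X, y.astype(int)
```

Both boundaries are linearly non-separable, which is what makes these tasks sensitive probes for the strength of the nonlinearity.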
In the case of α = 0, the network reduces to a single linear operation, and correspondingly we find performance similar to that of linear classifiers. For the linearly non-separable tasks XOR and circle, accuracies of only approximately 50% are attainable, which equals random classification. However, when increasing α, a sharp increase in accuracy occurs and accuracies close to 100% are reached. For MNIST and FMNIST, the initial linear classification performance is already clearly above the random-guessing accuracy of 10%, but nevertheless a marked boost in test accuracy is observed when increasing α. Apparently, for all four datasets, a value of α exists from which the test accuracy reaches a plateau, indicating a sufficient strength of the nonlinearity for the respective task. Compared to our experimentally measured values of α for different concentrations, we clearly reach this regime, as seen in the case of c = 3 mM where α = 1.75, which highlights the benefits of tunable optical nonlinearities.
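That the α = 0 network reduces to a single linear operation can be verified numerically. Assuming the simulation-unit form of the PCAF, x·exp(−α/(1 + x)), the activation becomes the identity for α = 0, so stacked layers (biases omitted in this sketch) collapse into one matrix product:

```python
import numpy as np

def pcaf_sim(x, alpha):
    # PCAF in simulation units (assumed form); identity for alpha = 0
    return x * np.exp(-alpha / (1.0 + x))

rng = np.random.default_rng(1)
W1 = rng.uniform(0.0, 1.0, size=(5, 2))   # nonnegative weights, layer 1
W2 = rng.uniform(0.0, 1.0, size=(2, 5))   # nonnegative weights, layer 2
x = rng.uniform(0.0, 1.0, size=2)

# With alpha = 0 the activation is the identity, so both layers collapse
# into the single linear map W2 @ W1, i.e., a linear classifier.
deep = W2 @ pcaf_sim(W1 @ x, alpha=0.0)
shallow = (W2 @ W1) @ x
print(np.allclose(deep, shallow))  # True
```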

Conclusion
ONNs are among the most promising candidates for going beyond electronic hardware, because optics enables ultrafast linear operations while being energy-efficient and compatible with complementary metal-oxide-semiconductor (CMOS) hardware. This shift in hardware, however, poses new challenges arising from noise in the experimental system, the need for different training paradigms, and cumbersome access to weights. The difficulty of realizing tunable optical nonlinearities with characteristics suitable for ONN processing is considered one of the central bottlenecks today.
Here, we introduced PCAFs as a novel paradigm for implementing adaptive nonlinear building blocks for ONNs. We demonstrated how changes in the concentration of photoswitchable azobenzenes enable continuously steerable characteristics of this function, thus providing different regimes of linear and nonlinear signal processing. A PCAF allows the strength of the nonlinearity to be set gradually through a single steerable parameter, which also encompasses the purely linear case (α = 0).
From a computer science perspective, any closed-form differentiable function can be used to train artificial neural networks, so research on functions that remain comparatively close to linear has been limited. From a more theoretical perspective, and when considering the physical constraints of optical components, well-defined requirements on nonlinear characteristics in the context of neural networks are, however, of high interest. By analyzing the effect of the strength of the nonlinearity α in our PCAF, we addressed this demand and revealed interesting insights, namely, the existence of a critical strength of the nonlinearity: once this strength exceeds a certain threshold, the performance increases consistently in all tested classification tasks. Further increasing α, however, did not improve the accuracy of the networks, implying a suitable regime of our nonlinearity. The existence of a critical strength of the nonlinearity, which clearly separates linear and nonlinear behavior, might not only be of interest for experimental realizations of ONNs but also opens up interesting questions from a computer science perspective. The same holds for the seemingly intrinsic affinity of the activations to the regime of maximum nonlinearity, as shown in Figure 3C, which demands further investigation in future work.
Another interesting future direction is the use of tunable nonlinearities in other types of architectures, such as recurrent neural networks, and in concepts such as reservoir computing. For example, it has been shown that boundedness (which, e.g., the rectified linear unit (ReLU) does not provide) and continuity are important properties for the stability of reservoir computing, [60] opening up interesting opportunities for photochemical systems beyond feed-forward networks.
From an experimental point of view, we introduced the wide variety of photoswitchable molecular compounds as a conceptually novel class of activation functions for optical neural networks. For azobenzenes, we observed that higher concentrations yield increasing deviations from our model (presumably caused by the limited validity of the Beer-Lambert law, as discussed above), but since our theoretical analysis revealed a stable regime of nonlinearity, there is no need for higher concentrations from a practical point of view once a sufficient α-value is reached. While we focused on demonstrating the feasibility of adaptive photochemical activation functions for ONNs and the role of the strength of the nonlinearity for well-known classification tasks, it will be highly interesting to investigate the temporal dynamics of the switching process and to compare azobenzenes with the myriad of other photoswitchable chemical compounds that offer new possibilities for adapting ONN performance and for gaining new insights into information processing in intelligent matter. [15] Furthermore, our proof of concept shows that PCAF material systems hold promise for implementation in nanophotonic circuitry to replace the electro-optic nonlinearities currently used in multilayer neural networks. [22,61] An efficient interface to photonic waveguides is needed, which can be realized either by embedding photoswitches into compatible host materials (such as polymers) or by realizing propagation of light through the photoswitching solution at the microscale, thereby miniaturizing the experimental scheme presented in this work. By this, switching times and energy consumption can be reduced from the order of seconds and joules, respectively, to picoseconds and picojoules, surpassing state-of-the-art electronic systems by orders of magnitude.

Experimental Section
NN Training: For all datasets, we use three-layer FCNs, the SGD optimizer with Nesterov momentum, [62] and a base learning rate of 10^−2. For XOR and circle, we generate input data points between 0 and 1 and choose decision boundaries that yield correct labels of equal abundance for both classes. We use hidden dimensions of 10 and 10 and train for 100 epochs with 10 000 datapoints per epoch, decaying the learning rate exponentially to 10^−5.
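The text fixes only the endpoints of the exponential decay (10^−2 down to 10^−5 over 100 epochs); one plausible per-epoch implementation of that schedule is:

```python
# Exponential learning-rate decay: lr_t = lr0 * gamma**t, with gamma chosen
# so that the rate reaches lr_end exactly after the final epoch.
lr0, lr_end, epochs = 1e-2, 1e-5, 100
gamma = (lr_end / lr0) ** (1.0 / epochs)   # ≈ 0.933 per epoch
lrs = [lr0 * gamma ** t for t in range(epochs + 1)]
print(lrs[0], lrs[-1])  # 0.01 and ~1e-05 (up to floating-point rounding)
```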
For MNIST and FMNIST, we use hidden dimensions of 300 and 100 and train for 100 epochs with a cosine learning-rate scheduler. [63] The train loss as well as the test loss decrease steadily, and no strong divergence between test and train losses occurs, so there is no evidence for severe overfitting (cf. Figure S3, Supporting Information).
All available samples were used, and we retained the standard training/test split for both datasets. Because we use intensities as signal carriers, all inputs, outputs, activations, and weights are constrained to be positive. To accomplish learning under this constraint, we normalize the input data to the interval [0, 1], initialize the weights uniformly between 0 and an upper bound (two times the bound of the Kaiming initialization [64] with a gain of 1) to analogously keep the variance of the resulting activations limited, and clip the weights after each optimizer update step to [0, ∞).
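The nonnegative initialization and clipping can be sketched as follows; the Kaiming uniform bound sqrt(3/fan_in) for gain 1 is assumed here. Note that Uniform[0, 2b] has the same variance as the symmetric Kaiming choice Uniform[−b, b] (both (2b)²/12 = b²/3), which is why doubling the bound keeps the activation variance comparable.

```python
import numpy as np

def init_nonneg(fan_out, fan_in, rng):
    """Uniform init in [0, 2*b] with b = sqrt(3 / fan_in), the assumed
    Kaiming uniform bound for gain 1; variance matches Uniform[-b, b]."""
    bound = 2.0 * np.sqrt(3.0 / fan_in)
    return rng.uniform(0.0, bound, size=(fan_out, fan_in))

def clip_nonneg(W):
    """Projection applied after every optimizer step: weights stay in [0, inf)."""
    return np.maximum(W, 0.0)

rng = np.random.default_rng(0)
W = init_nonneg(300, 784, rng)   # shape of the first MNIST hidden layer
print(W.min() >= 0.0, W.max() <= 2.0 * np.sqrt(3.0 / 784))  # True True
```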

Figure 2. A) Measured power-dependent absorbance for different azobenzene concentrations. Dots represent measurements, lines the fitted models, and cones around the lines the 1-σ interval. The nonlinear regime is shown in the main plot, while the inset shows the full range of the measurements. To depict the nonlinear regimes more clearly, the output values are normalized to agree at high input intensities. B) Dependence of the strength of the nonlinearity α on the azobenzene concentration c. Experimental values and linear fit (see Equation (2)). C) Schematic illustration of the regimes of the nonlinearity model (see Equation (1)) on a double logarithmic scale. Dashed lines represent linear regimes (slope 1 on a double logarithmic scale). The colored areas denote different regimes of the cis-trans equilibrium and thus of the resulting absorbance. While the system acts linearly for low and high intensities (blue and green areas with corresponding slopes m), a nonlinear transition between these regimes occurs when the input intensities are on the order of 1/β (point of inflection on a double logarithmic scale). The displacement between the linear regimes directly depends on α.

Figure 3. A) Scheme of the FCN with PCAF. In each node, all inputs are summed, passed through the model of the PCAF, and forwarded to the next layer. B) Activation function as used for training the network (PCAF), i.e., after rescaling to artificial units and including noise (depicted by the 1-σ cone, enlarged in the inset). Strength of the nonlinearity α = 1.75, corresponding to c = 3 mM. C) Distribution of neural activations as inputted to the activation function (i.e., after summation, before the PCAF) for the fully trained model on the MNIST test dataset for both hidden layers, depicted next to the fitted model and the measured data points. The center point of the nonlinearity is marked by a vertical line at 1/β. All occurring activations lie within the measured range, while the first-layer activations are distributed around the regime of highest nonlinearity.