Neuromorphic Computing via Fission‐based Broadband Frequency Generation

Abstract

The performance limitations of traditional computer architectures have led to the rise of brain-inspired hardware, with optical solutions gaining popularity due to the energy efficiency, high speed, and scalability of linear operations. However, the use of optics to emulate the synaptic activity of neurons has remained a challenge since the integration of nonlinear nodes is power-hungry and, thus, hard to scale. Neuromorphic wave computing offers a new paradigm for energy-efficient information processing, building upon transient and passively nonlinear interactions between optical modes in a waveguide. Here, an implementation of this concept is presented using broadband frequency conversion by coherent higher-order soliton fission in a single-mode fiber. It is shown that phase encoding on femtosecond pulses at the input, alongside frequency selection and weighting at the system output, makes transient spectro-temporal system states interpretable and allows for the energy-efficient emulation of various digital neural networks. The experiments in a compact, fully fiber-integrated setup substantiate an anticipated enhancement in computational performance with increasing system nonlinearity. The findings suggest that broadband frequency generation, accessible on-chip and in-fiber with off-the-shelf components, may challenge the traditional approach to node-based brain-inspired hardware design, ultimately leading to energy-efficient, scalable, and dependable computing with minimal optical hardware requirements.


Supplementary Text
A. Mathematical analogy between neural networks and nonlinear pulse propagation

In general, nonlinear pulse propagation in optical waveguides, including soliton fission, can be modeled by means of the generalized nonlinear Schrödinger equation (GNLSE) in the time domain [1]:

$$\frac{\partial A}{\partial z} = \sum_{k \ge 2} \frac{i^{k+1}}{k!}\,\beta_k\,\frac{\partial^k A}{\partial T^k} + i\gamma\left(1 + i\tau_{\mathrm{shock}}\frac{\partial}{\partial T}\right)\left(A\,\big(R * |A|^2\big)\right) \quad (S1)$$

Here, $A(z, T)$ is the pulse envelope in a comoving frame $T = t - \beta_1 z$ moving at the group velocity $1/\beta_1$. The parameters $\beta_k$ and $\gamma$ denote the higher-order dispersion terms and the nonlinear parameter of the waveguide, respectively. $\tau_{\mathrm{shock}}$ is the self-steepening parameter and $R(T)$ accounts for the nonlinear response of the material (including the instantaneous electronic and the non-instantaneous Raman contribution) [2]. The $*$ operator denotes a convolution of the form

$$R * |A|^2 = \int_{-\infty}^{\infty} R(T')\,\big|A(z, T - T')\big|^2\, dT'.$$
The GNLSE is a partial differential equation without a general analytical solution and requires numerical approximation methods, such as the split-step Fourier algorithm. The iterative split-step Fourier method can be applied to differential equations of the form:

$$\frac{\partial A(z, T)}{\partial z} = \big(\hat{D} + \hat{N}\big)\,A, \quad (S2)$$

with the propagation coordinate $z$, the optical field envelope $A$, the linear dispersion operator $\hat{D}$, and the nonlinear operator $\hat{N}$. In the numerical split-step approach [2], the linear and nonlinear terms of the equation are commonly solved in the frequency and time domains, respectively, in an alternating sequence mediated by the Fourier transform (FT). Simulations in this work have been performed by solving Eq. (S1) numerically in this way, where the nonlinear step was performed using a 4th-order Runge-Kutta integrator to resolve the derivative in $z$.
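For illustration, the following Python sketch implements a symmetric split-step Fourier propagation for a simplified NLSE (second- and third-order dispersion plus Kerr nonlinearity, no Raman or shock terms). All fiber and pulse parameters are placeholders rather than the experimental values, and the nonlinear step uses its simple analytic exponential solution instead of the 4th-order Runge-Kutta integrator used for the full GNLSE simulations.

```python
import numpy as np

def split_step_nlse(A0, dt, L, nz, beta2, beta3, gamma):
    """Symmetric split-step Fourier propagation of a simplified NLSE (cf. Eq. S3)."""
    n = A0.size
    dz = L / nz
    omega = 2 * np.pi * np.fft.fftfreq(n, d=dt)           # angular frequency grid
    # Linear half-step operator applied in the frequency domain (cf. Eq. S4)
    D_half = np.exp(1j * (beta2 / 2 * omega**2 + beta3 / 6 * omega**3) * dz / 2)
    A = A0.astype(complex)
    for _ in range(nz):
        A = np.fft.ifft(D_half * np.fft.fft(A))           # half linear step
        A = A * np.exp(1j * gamma * np.abs(A)**2 * dz)    # full nonlinear (Kerr) step
        A = np.fft.ifft(D_half * np.fft.fft(A))           # half linear step
    return A

# Placeholder example: femtosecond sech pulse in 1 m of nonlinear fiber
t = np.linspace(-5e-12, 5e-12, 2**12)                      # 10 ps time window
A_in = np.sqrt(1000.0) / np.cosh(t / 100e-15)              # ~1 kW peak, 100 fs sech pulse
A_out = split_step_nlse(A_in, t[1] - t[0], L=1.0, nz=2000,
                        beta2=-2e-26, beta3=1e-40, gamma=0.01)
spectrum = np.abs(np.fft.fftshift(np.fft.fft(A_out)))**2   # broadened output spectrum
```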
Under certain assumptions, it can be shown that a discretized nonlinear Schrödinger equation has the same mathematical form as a feed-forward artificial neural network. Without loss of generality, we may neglect the weak molecular Raman response of the waveguide (i.e., $R(T) \approx \delta(T) \Rightarrow R * |A|^2 \approx |A|^2$, where $\delta(T)$ is the Dirac delta function) and assume a non-dispersive nonlinear material response (i.e., $\tau_{\mathrm{shock}} = 0$) in first approximation. These assumptions reduce Eq. (S1) to the nonlinear Schrödinger equation of the form

$$\frac{\partial A}{\partial z} = \sum_{k \ge 2} \frac{i^{k+1}}{k!}\,\beta_k\,\frac{\partial^k A}{\partial T^k} + i\gamma\,|A|^2 A. \quad (S3)$$

In the framework of the split-step Fourier method, we solve the linear part of Eq. (S3) separately from the nonlinear part. The linear equation $\partial_z A = \hat{D} A$ has an analytic solution in Fourier space, which can be expressed as:

$$\tilde{A}(z + h/2, \Omega) = \tilde{A}(z, \Omega)\, e^{i\beta(\Omega)\, h/2}, \quad (S4)$$

where $\tilde{A} = \mathcal{F}\{A\}$ is the optical field amplitude in the frequency domain, $\beta(\Omega) = \sum_{k \ge 2} \frac{\beta_k}{k!}\,\Omega^k$ is the propagation constant, $\Omega$ is the angular frequency difference with respect to the central frequency, and $h$ is the step size. We find that the solution in Eq. (S4) is a multiplication in the frequency domain, which can formally be written as a convolution in the time domain

$$A(z + h/2, T) = \int A(z, T')\, d(T - T')\, dT', \quad (S5)$$

where $d(T) = \mathcal{F}^{-1}\{e^{i\beta(\Omega) h/2}\}(T)$ is the temporal impulse response of the waveguide over half a propagation step $h/2$. Half steps are the convention used in the split-step algorithm (see Eq. (S7) below). Discretizing time to a grid $T_m = m\,\Delta T$ and the propagation axis to a grid $z_n = n\,\Delta z$ allows us to approximate Eq. (S5) per half propagation step with

$$A_m^{n+1/2} = \sum_j d_{mj}\, A_j^{n}, \quad (S6)$$

where $A_m^{n} = A(n\,\Delta z, m\,\Delta T)$ and $d_{mj} = d\big((m - j)\,\Delta T\big)\,\Delta T$.
We note that Eq. (S6) has a formal resemblance to a sum-and-weight operation, i.e., the linking function of a neural network layer. Here, the dispersive coefficients $d_{mj}$ correspond to the temporal impulse response of the waveguide shifted in time relative to the temporal field components $A_j^{n}$. The coefficients act as static, waveguide-specific weights on the complex-valued field amplitudes [3]. On the other hand, the nonlinear term of Eq. (S3), i.e., $\partial_z A = i\gamma|A|^2 A = \hat{N}(A)$, can be solved straightforwardly by discretizing the derivative in $z$. For simplicity, we demonstrate this using the Euler method $\partial_z A \approx \frac{1}{\Delta z}\big(A_m^{n+1} - A_m^{n}\big)$. Hence, the $n$-th nonlinear step along $z$ can be approximated in the time domain by means of the simplified form:

$$A_m^{n+1} = A_m^{n} + i\gamma\,\Delta z\,\big|A_m^{n}\big|^2 A_m^{n}. \quad (S7)$$

Eq. (S7) resembles a cubic activation function of the temporal field amplitudes $A_m^{n}$, mediated by the nonlinear gain parameter $\gamma$. The activation potential $\big|A_m^{n}\big|^2$ may stem from the pump field itself (i.e., self-activation) or from other generated fields that temporally overlap with the main field (i.e., cross-activation). Inserting Eq. (S6) into Eq. (S7) yields an equation that formally matches the mathematical model of artificial neural nodes in a network (cf. Eq. (2) in the main text). We stress that, beyond a common stability criterion, the step size $\Delta z$ can be chosen arbitrarily small. Thus, it has no practical meaning to estimate the computational advantage of neuromorphic wave computing by counting the mathematical operations needed to obtain a stable solution of Eq. (S7) over the entire fiber length.
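To make the analogy concrete, the following sketch (with purely illustrative grid sizes and parameters) builds the discrete half-step propagation of Eq. (S6) as an explicit weight matrix acting on the sampled field and applies the cubic activation of Eq. (S7); cascading such layers corresponds to stepping along the fiber.

```python
import numpy as np

# Illustrative discretization of Eqs. (S6) and (S7): the linear half step becomes a
# static weight matrix d_mj, the nonlinear step a cubic "activation" on each sample.
n_t, dt, dz = 256, 5e-15, 1e-3            # time samples, time step [s], z step [m]
beta2, gamma = -2e-26, 0.01               # GVD [s^2/m], nonlinearity [1/(W m)] (placeholders)

omega = 2 * np.pi * np.fft.fftfreq(n_t, d=dt)
impulse = np.fft.ifft(np.exp(1j * beta2 / 2 * omega**2 * dz / 2))   # d(T) over half a step

# Weight matrix d_mj = d((m - j) dT): a circulant matrix built from the impulse response
idx = (np.arange(n_t)[:, None] - np.arange(n_t)[None, :]) % n_t
D = impulse[idx]

def layer(A):
    """One 'layer': sum-and-weight (dispersion, Eq. S6), then cubic activation (Kerr, Eq. S7)."""
    A_lin = D @ A                                               # weighted sum over all inputs
    return A_lin + 1j * gamma * dz * np.abs(A_lin)**2 * A_lin   # cubic activation

t = (np.arange(n_t) - n_t // 2) * dt
A = np.sqrt(100.0) / np.cosh(t / 50e-15) + 0j    # sample input field (100 W peak, 50 fs sech)
for _ in range(10):                              # cascading layers ~ stepping along z
    A = layer(A)
```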

B. Phase sensitivity of soliton-fission-based neural emulation
To investigate the phase dependency and sensitivity of the system, we performed a straightforward single-bit-shift experiment. We scanned a single encoding bit (width = 1 nm) through the available bandwidth of the programmable filter (1528 nm to 1600 nm) with a step size of 2.0 nm. For each bit position, we collected the output spectrum of the HNLF for four different phase factors. This simple yet informative experiment reveals the high sensitivity of the soliton fission dynamics to phase information at the input. Fig. S3 shows the upper and lower intensity bounds across the readout bandwidth, which can be understood as a measure of the system's maximum readout dynamics for a chosen phase factor.
Even for small phase values (0.1π), intensity variations of up to 5 dB are visible in the output spectra. With increasing phase factors (0.4π, 0.7π, 1.0π), the variations grow, indicating a higher susceptibility to changes in the bit position. While one might generally expect better system performance for larger output variations, in our experiments we found that phase factors of around π/8 performed best in all tested benchmarks, independently of the input feature size (i.e., the phase information per encoding wavelength).

C. Training results
Sinc regression - In our implementation, we repeatedly measured a random bit string with 1000 random phase values. Notably, we avoided regularization (i.e., ridge regression) in the learning process, which is a common, yet more complex, approach to achieving higher fitting accuracies. Fig. S4 shows the results for the sinc regression. We obtained a test (training) RMSE of 0.0175 (3.05×10⁻⁴). A similar approach based on spatial modes [4] requires a pulse energy more than three orders of magnitude higher (320 nJ) than our frequency-based approach (89 pJ), yet exhibits a higher error (RMSE of 0.067) for this task. The spatial-mode implementation also reported performance limitations due to modal self-cleaning effects at higher powers. In comparison, we have not observed such power restrictions in our experiments, and we do not expect such limiting mechanisms in soliton-fission-driven nonlinear dynamics.
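A minimal sketch of the unregularized linear readout described above is given below: the binned output spectra are fitted to the sinc targets by ordinary least squares. The array shapes and the randomly generated data stand in for the recorded spectra and targets and are purely hypothetical.

```python
import numpy as np

# Hypothetical placeholders: X holds the binned output spectra (one row per measurement),
# y holds the corresponding sinc targets for the 1000 random phase inputs.
rng = np.random.default_rng(0)
X = rng.random((1000, 64))             # stand-in for measured readout-bin intensities
y = np.sinc(rng.uniform(-4, 4, 1000))  # stand-in for the sinc regression targets

# Split into training and test sets
X_tr, X_te, y_tr, y_te = X[:800], X[800:], y[:800], y[800:]

# Ordinary least squares (no ridge regularization), with a bias column appended
X_tr_b = np.hstack([X_tr, np.ones((X_tr.shape[0], 1))])
w, *_ = np.linalg.lstsq(X_tr_b, y_tr, rcond=None)

# Evaluate the root-mean-square error on the held-out set
X_te_b = np.hstack([X_te, np.ones((X_te.shape[0], 1))])
rmse = np.sqrt(np.mean((X_te_b @ w - y_te) ** 2))
```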
n-bit parity - Fig. S5 depicts the training results corresponding to Fig. 2(G) in the main text. We define the system's operation fidelity as the accuracy of a class-prediction problem in machine learning. It is a measure of how well a model correctly identifies or excludes a condition. In this context, accuracy is defined as the proportion of true results (both true positives (TP) and true negatives (TN)) in the total number of samples examined:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where FP and FN denote false positives and false negatives, respectively. Like the test results, the training results show an improving operation fidelity with increasing system nonlinearity (soliton number). We further find that the operation performance can be scaled with the number of chosen readout bins, offering an additional control parameter for our approach. For bit lengths from n = 4 to n = 7, we studied the operation fidelity for an increasing number of readout bins, from the bit length itself up to 125 bins. Notably, the improvement in performance with increasing system nonlinearity is also observed for a fixed number of readout bins (e.g., 7 readout bins for bit lengths n = 4 to 7). The remaining training parameters are specified in Table S1. Fig. S6 presents the results for two different input power configurations (i.e., soliton numbers N = 2 and N = 5). The results show that increasing the number of readouts beyond the input feature size (i.e., the bit length n) can drastically increase system performance, essentially serving as a free system hyperparameter.
On the other hand, with a higher number of spectral bins, we observed effects of overfitting (i.e., a decrease or oscillation of the test fidelities once the training accuracy approaches 100%), imposing a practical limit on the system's confidence bounds. Notably, the bin sweep confirms the overall trend of the results shown in Fig. 2. In particular, with a higher soliton number, the operation fidelity can be improved for a constant number of readout bins.
Task-dependent benchmarks - The confusion matrices of the training set for all task-dependent benchmarks are shown in Fig. S7 and complement the test results in Fig. 3.
MNIST data set - The recognition of handwritten digits was the most challenging task for our implementation. We used the MNIST set to study the dependence of the system performance on (i) the maximum phase factor of the encoding, (ii) the spectral resolution of the readout, and (iii) the number of readout bins. The results in Fig. S8(A) show a slight decrease in prediction accuracy with increasing phase factor, independently of the chosen number of readout bins. Subsequently, we investigated the performance dependence on the number of readout bins, similarly to the n-bit parity study (Fig. S8(C), for a phase factor of π/8 and 0.5 nm resolution). In agreement with previous findings, we observed that increasing the number of readout bins improves system performance up to a certain limit. Beyond ~100 readout bins, the consequences of overfitting became apparent (i.e., a further decrease in the training error but stagnation of the test error).
Finally, we observed that system performance also dropped with decreasing spectral resolution (Fig. S8(B)), and the best results were achieved with the highest resolution available (0.125 nm).
It is important to note that while an increase in resolution improves system performance (more available spectral combinations), it also drastically increases the measurement time per inference in the current platform implementation due to the characteristics of the spectrometer used. The measurement time per scan increases from seconds (at 2 nm resolution) to minutes (below 0.5 nm resolution). At the highest resolution (0.125 nm), we achieved the best system performance; however, this setting led to impractical measurement times per sample (~2 min). Thus, we settled on a compromise between system performance and total acquisition time by choosing a 0.5 nm spectral resolution for the final acquisition of the MNIST data set (approx. 1 min acquisition time per sample).
COVID-19 data set - To ensure a fair comparison between our implementation and the digital support vector machine (SVM) approach, the same training and test sets were used for both. We used 5-fold cross-validation during training of the software SVM implementation to determine the hyperparameters. Specifically, for the SVM we used a penalty parameter of 0.05 and an ℓ2 penalty as the loss function [5].
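The following scikit-learn sketch reflects one plausible realization of this SVM baseline (a linear SVM with an ℓ2 penalty, the penalty parameter selected by 5-fold cross-validation around the reported value of 0.05). The specific solver, the loss setting, and the placeholder data are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

# Hypothetical placeholders for the 30 PCA features and the binary COVID-19 labels
rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 30)), rng.integers(0, 2, 200)

# Linear SVM with an l2 penalty; the penalty parameter C is selected via 5-fold
# cross-validation (the reported value was 0.05).
grid = GridSearchCV(
    LinearSVC(penalty="l2", loss="squared_hinge", max_iter=10000),
    param_grid={"C": [0.01, 0.05, 0.1, 1.0]},
    cv=5,
)
grid.fit(X_train, y_train)
best_svm = grid.best_estimator_
```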
The benchmark feature set used in the challenge contains over 6000 features extracted with the widely used openSMILE toolbox. To provide helpful input information to our system (i.e., a feature size ≪ 400), we performed a two-step pre-processing of the raw audio data to obtain 30 principal features, which were eventually given as inputs to both the SVM software classifier and our experimental system. The method for feature reduction is illustrated in Fig. S9.
Our technique builds on creating a modulation spectrogram as the basis for a two-fold feature extraction scheme. Such an approach is quite new and has outperformed conventional spectrogram-based features for COVID-19 detection using SVMs [7]. A modulation spectrogram describes the rate of change of the individual kilohertz-rate frequency components and is particularly useful for analyzing highly dynamic, time-variant data [8]. We first transformed the initial audio signal $x(t)$ into a two-dimensional time-frequency space $S(f, t)$ using a short-time Fourier transform (STFT, 256-point fast Fourier transform, FFT). Then, we applied a second FFT to the data of each frequency channel along the time axis [8], resulting in the modulation spectrogram $S(f_{\mathrm{mod}}, f)$. In our process, we divided the resulting spectrogram into twenty 1 Hz bins on the modulation axis, $f_{\mathrm{mod}}$, and twenty 400 Hz bins on the conventional frequency axis, $f$, resulting in a 20 × 20 frequency-frequency map $E_{\mathrm{bin}}(f_{\mathrm{mod}}, f)$. Next, we performed an initial feature extraction across each dimension ($f$ and $f_{\mathrm{mod}}$), where eight spectral descriptors, such as entropy, centroid, and kurtosis [7], were computed, resulting in an additional 320 low-level descriptors (LLDs, i.e., 2 parameters × 20 bins × 8 descriptors). This first feature extraction step leads to a total of 720 features (400 energies from the binned modulation spectrogram plus the 320 LLDs) that have been successfully applied to COVID-19 detection in recent demonstrations [7]. To further decrease the dimensionality of the input data, we applied a second feature-extraction step in the form of a principal component analysis (PCA) to reduce the feature size to 30.
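A condensed Python sketch of this two-step feature-reduction pipeline is given below (STFT, modulation spectrogram, 20 × 20 binning, spectral descriptors, PCA to 30 components). The sampling rate, the placeholder audio clips, and the reduced descriptor set are simplifying assumptions and not the exact parameters used in the study.

```python
import numpy as np
from scipy.signal import stft
from scipy.stats import kurtosis, entropy
from sklearn.decomposition import PCA

def modulation_features(x, fs=16000):
    """Reduced feature set from a modulation spectrogram (simplified sketch)."""
    # Step 1: short-time Fourier transform (256-point FFT) -> |S(f, t)|
    f, t, S = stft(x, fs=fs, nperseg=256)
    S = np.abs(S)
    # Step 2: second FFT per frequency channel along the time axis -> modulation spectrogram
    mod = np.abs(np.fft.rfft(S, axis=1))
    f_mod = np.fft.rfftfreq(S.shape[1], d=t[1] - t[0])
    # Step 3: bin into a 20 x 20 frequency-frequency map (20 x 400 Hz, 20 x 1 Hz)
    f_bins = np.minimum((f // 400).astype(int), 19)
    m_bins = np.minimum(np.floor(f_mod).astype(int), 19)
    E = np.zeros((20, 20))
    for i in range(20):
        for j in range(20):
            E[i, j] = mod[f_bins == i][:, m_bins == j].sum()
    # Step 4: low-level descriptors along each axis (reduced descriptor set for brevity)
    descriptors = []
    for axis in (0, 1):
        profile = E.sum(axis=axis) + 1e-12
        descriptors += [profile.mean(), profile.std(), kurtosis(profile),
                        entropy(profile / profile.sum())]
    return np.concatenate([E.ravel(), descriptors])   # 400 binned energies + descriptors

# Step 5: PCA across all recordings reduces the feature vectors to 30 principal components
recordings = [np.random.randn(16000) for _ in range(50)]   # placeholder audio clips
features = np.array([modulation_features(x) for x in recordings])
X_reduced = PCA(n_components=30).fit_transform(features)
```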
The achieved results for the COVID-19 classification task from the main text are shown in Fig. S10.

D. Optical Neural Network Performance Comparison
To contextualize our system performance data, we compared different optical neuromorphic processors for the same tasks as shown in Table S2.

E. Parameters of Identified Artificial Neural Network Primitives
Table S3 summarizes the hyperparameters and achieved performance of the neural network primitives used for the comparison in Fig. 3. We trained each model using the stochastic gradient descent (SGD) optimizer with a momentum of 0.9. We employed a training batch size of 16, consistent across all tasks.

F. Artificial Neural Network Performance on the n-bit Parity Operation
Numerous studies in computer science have demonstrated that the network complexity (i.e., the number of nodes or layers) required to perform the n-bit parity operation grows with the bit length n. Since those models are often based on numerous assumptions about the network structure (e.g., direct input-output connections), we aimed to support the claim of increasing network complexity versus bit length with our own approach to identifying network primitives. We studied the operation fidelity for the same bit lengths used in the experiment (i.e., n = 2 to n = 7) for an increasing number of nodes (2 to 128) in a single-hidden-layer network. Our results in Fig. S11 show that operation fidelity improves significantly with larger networks for a given bit length. This observation holds for all tested bit lengths. In addition, starting from 5-bit strings and above, we observed overfitting effects for larger numbers of hidden-layer nodes, ultimately requiring more complex and optimized neural network topologies for robust operation as the bit length increases.
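The node-sweep study can be reproduced in spirit with the short sketch below, which trains single-hidden-layer networks of increasing width on the n-bit parity problem. The scikit-learn implementation, activation function, and learning rate are simplifications; only the SGD solver, the 0.9 momentum, and the batch size of 16 follow the settings reported above.

```python
import itertools
import numpy as np
from sklearn.neural_network import MLPClassifier

def parity_dataset(n_bits):
    """All 2^n bit strings with their parity label (1 if the number of ones is odd)."""
    X = np.array(list(itertools.product([0, 1], repeat=n_bits)))
    y = X.sum(axis=1) % 2
    return X, y

# Sweep the hidden-layer width for each bit length (cf. Fig. S11, simplified setup)
for n_bits in range(2, 8):
    X, y = parity_dataset(n_bits)
    for nodes in (2, 8, 32, 128):
        clf = MLPClassifier(hidden_layer_sizes=(nodes,), activation="tanh",
                            solver="sgd", momentum=0.9,
                            batch_size=min(16, 2 ** n_bits),
                            learning_rate_init=0.1, max_iter=2000, random_state=0)
        clf.fit(X, y)
        print(f"n = {n_bits}, nodes = {nodes}, accuracy = {clf.score(X, y):.2f}")
```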

G. Performance and Energy Considerations
Different performance metrics are commonly used to assess the computational performance of hardware. Among the most popular for neural network applications are multiply-and-accumulate (MAC) or floating-point (FLOP) operations, which directly refer to the vector-matrix multiplications in neural networks. However, a direct and reliable calculation of such metrics is often not feasible for analog and neuromorphic hardware implementations. A recent suggestion for estimating the computational performance of analog systems builds on the sum of mathematical operations required to simulate the system [9]. This method yields almost arbitrary performance figures for wave-based computing systems, depending on the chosen simulation grid size, which can be adapted freely and does not relate to the actual neuromorphic task the system has to solve.
Therefore, to compare the computational cost of our approach per given task, we introduce the use of neural network primitives with equivalent task performance, as depicted in Fig. 3. Then, to compare the system responses, we calculated the FLOPs carried out by the network primitives using the following method. For a single inference step, the input matrix is of size [1, N_in], where N_in is the dimension of the input vector; it is multiplied by a hidden-layer matrix of size [N_in, N_h], where N_h is the number of nodes in the hidden layer. The output is then multiplied with another matrix of size [N_h, N_out] to obtain the final output, where N_out represents the dimension of the final prediction.
For the first matrix operation, the required FLOPs can be calculated as 1 × N_in × N_h + N_h. For the second matrix operation, the required FLOPs can be calculated as 1 × N_h × N_out + N_out. The results for all tasks are summarized in Table S4. For example, for the IRIS task, the required FLOPs follow directly from these two expressions with the layer sizes given in Table S4. For the SVM classifier we used in the COVID-19 task, the FLOPs of a single inference step can simply be calculated as 2 × N_in (i.e., N_in = 30 in the present study). Therefore, our SVM requires 60 FLOPs to perform the COVID-19 classification. This simplicity arises because a linear SVM has a prediction complexity of O(N), where N is the input dimension, given that only a single inner product is required [5].
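These counting rules are summarized in the small helper below; the layer sizes in the example call are hypothetical placeholders, since the task-specific values are listed in Table S4.

```python
def dense_layer_flops(n_in, n_out):
    """FLOPs of one dense layer for a single inference: multiply-adds plus biases."""
    return 1 * n_in * n_out + n_out

def mlp_flops(n_in, n_hidden, n_out):
    """Input->hidden plus hidden->output FLOPs of a single-hidden-layer network."""
    return dense_layer_flops(n_in, n_hidden) + dense_layer_flops(n_hidden, n_out)

def linear_svm_flops(n_features):
    """A linear SVM prediction is a single inner product: 2 * n_features FLOPs."""
    return 2 * n_features

# Hypothetical example values (the actual task-specific sizes are given in Table S4)
print(mlp_flops(n_in=4, n_hidden=8, n_out=3))   # e.g., an IRIS-like 4-8-3 primitive
print(linear_svm_flops(30))                      # 60 FLOPs for the 30-feature COVID-19 SVM
```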
To allow for a fair comparison of our experiments with all software classifiers, we exclusively consider the FLOPs required from the input to the hidden layers (i.e., not to the output layer). This is because the readout of our current fiber-based implementation still requires measuring the average over thousands of optical pulses. In a future system, however, an individual inference could be measured on a single-shot basis (i.e., at a 200 MHz rate), e.g., using a recently reported high-speed Waveshaper alternative based on acousto-optic modulators [10] for encoding, combined with a dispersive Fourier transform and ultrafast detection [11]. Yet, such a single-shot implementation would depend heavily on the chosen components. Hence, we cannot provide a trustworthy estimate of the energy consumption of the readout layer at the current stage.
Table S4 shows the energy comparison between our experimental wave-based approach (neglecting encoding and decoding electronics) and its digital counterparts for one inference step. For comparison, we assume an NVIDIA Kepler K20m GPU with ∼50 pJ/FLOP [12].
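As a simple worked example under this assumption, the 60 FLOPs of the SVM-based COVID-19 inference discussed above correspond to a digital energy cost of roughly 60 FLOPs × 50 pJ/FLOP = 3 nJ per inference.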
Interestingly, if we consider a highly efficient computing architecture (1 pJ/FLOP [13]), the implemented digital approaches become more energy efficient for inference than the fiber-based system, but only for certain tasks, e.g., IRIS or COVID-19. Notably, such optimized architectures usually operate at reduced bit depths (among other simplifications), which may lead to an overall reduced accuracy in universal task applications.
Finally, to better understand our implemented system, we compare the inference energy requirements with those of similar wave-based processors in Table S5.

Fig. S2. Spectro-temporal characterization of the information carrier pulse. (A) Autocorrelation trace before phase optimization, measured in front of the nonlinear fiber. (B) Autocorrelation trace after phase optimization, measured in front of the nonlinear fiber. In both cases, an autocorrelation correction factor of 0.707 was applied to retrieve the full width at half maximum (i.e., the pulse width labels in the figures). (C) Spectrum of the frequency comb pump laser (cyan), and spectrum after amplification and filtering through the programmable filter before entering the HNLF (purple). The C- and L-bands covered by the programmable filter are highlighted in yellow and light blue, respectively. Here we show an example of a phase encoding mask (specifically for the case of an IRIS data set, in magenta), including four feature levels between 1528 nm and 1568 nm. (D) Autocorrelation traces after pulse optimization with no phase mask (cyan curve), a random phase mask of maximal magnitude π/8 (purple curve), and a random phase mask of maximal magnitude π (magenta curve).

Fig. S3. System sensitivity to the amplitude and spectral location of a phase change in the pump spectrum. The panels show the measured output spectra of the HNLF for a continuous shift (from 1528 nm to 1600 nm) of a single bit (1 nm width) encoded in the spectral phase of the input pulse, with (A) 0.1π phase shift, (B) 0.4π phase shift, (C) 0.7π phase shift, and (D) 1.0π phase shift. The grey area in the plots indicates the individual spectra for all bit positions, and the colored spectra (blue and magenta) illustrate the minimum and maximum of each phase configuration (i.e., lower and upper bounds).

Table S3. Overview of the achieved performance of the neural network primitives and their hyperparameters. The activation entry refers to the activation function applied to the output of a hidden layer, and the rate column denotes the learning rate of the SGD optimizer. In the case of Abalone, the results are presented for the KL-divergence loss.

Table S5. Energy requirements of a single-pulse inference for different wave-based computing implementations.

Fig. S1. Schematic experimental setup of the implemented transient optical neural emulator based on high-order soliton fission. All incorporated fibers and fiber devices are polarization-maintaining. The inset depicts a photograph of the optical processing unit, showing the compact footprint of the system. PMF - polarization-maintaining fiber, PM-DCF - polarization-maintaining dispersion-compensating fiber.

Fig. S4. Results for the universal function regression of a sinc function. A remarkably low training error of 3.05e-4 (RMSE) and a test error of 0.0175 (RMSE) indicate good function-approximation capabilities of our experimental system.

Fig. S5. Training results (in %) for the n-bit parity task in the read-out configuration, where the number of bins is chosen to equal the bit length.

Fig. S8. Dependence of the MNIST prediction accuracy on the phase amplitude, the spectral resolution, and the number of readout bins. (A) Prediction accuracy over the encoding phase for various numbers of readout bins and 0.125 nm resolution. (B) Prediction accuracy vs. the resolution of the spectral recordings for various numbers of readout bins and a π/8 encoding phase. (C) Prediction accuracy vs. the number of readout bins used to train the linear classifier for a π/8 encoding phase and 0.5 nm resolution.

Fig. S9. Illustration of the two-step feature-reduction scheme applied to the COVID-19 audio data (see Supplementary Text C).

Fig. S11. Digital neural network performance. (A) Test accuracy for the n-bit parity problem for digital neural networks with an increasing number of nodes (the x-axis only shows the number of nodes in the first layer). The error bars indicate the standard deviation over 50 different runs; the data points correspond to the average accuracy. (B and C) Comparison between (B) the mean square error (MSE) and (C) the KL-divergence as a loss metric for the Abalone data set. Here we present the test samples of the best-performing neural networks after training: (B) a 2-layer network (1024 nodes in the first layer), which resulted in an RMSE of 0.118, and (C) a 2-layer network (128 nodes in the first layer), which resulted in an RMSE of 0.095.

Table S1. Training hyperparameters for the different benchmark tasks shown in the present work. CV - cross-validations.

Table S4. Overview of the artificial neural network (ANN) parameters and calculated FLOPs, as well as the energy requirements for one inference step (input to hidden layer) of both TONE and the competing digital ANN primitives, shown for the different tasks studied in the present work (see Fig. 3 in the main text).