Photonic Neural Networks with Kramers–Kronig Activation

Photonic neural networks (PNNs) are promising candidates to replace conventional deep learning hardware owing to their potentially higher energy efficiency and computational speed. In contrast to the rapid progress on optical linear transformations, optical nonlinear activators remain much less mature. Typically, optical activators employ a nonlinear mapping from input amplitude to output amplitude, which is limited by high threshold power and loss, or rely on additional bias voltages and the heterogeneous integration of external circuits. Herein, an activation induced by the Kramers–Kronig relationship, i.e., the connection between the amplitude and phase of the light field rather than the light amplitude alone, is proposed, namely, Kramers–Kronig activation (KKA). A PNN with KKA exhibits learning capability clearly better than the activation-free linear network and comparable to networks with popular activation functions such as ReLU and Softplus. Moreover, the PNN with KKA is highly programmable and cascadable, supporting ultra-deep networks. The essence of KKA is attributed to the learning of nonlinear features in the relatively low-dimensional ℝ^(N×N) space via linear features in the high-dimensional ℂ^(N×N) space. Considering that, besides the amplitude–phase coupling, several other parameters such as wavelength, frequency, and polarization can also be mutually linked, this approach may open new avenues for optical activations.

on energy as well as throughput bandwidth due to the usage of high-speed analog-to-digital converters (ADCs) and digital-to-analog converters (DACs).
To address the shortage of current photonic neural networks (PNNs) in implementing nonlinear activation, various approaches have been proposed, including those that utilize intensity modulators, the saturation effect of cameras, the quadratic nonlinearity of photodiodes, the saturation of semiconductor optical amplifiers, and saturable absorbers, to name a few. They employ specific device behaviors to produce a nonlinear mapping between the input amplitude (or intensity) and output amplitude (or intensity), which might be deemed straightforward optical analogs of the conventional electrical activations (e.g., ReLU, Softplus, etc.) performed on classical digital circuits (Figure 2a). However, they are often limited by high threshold power and high loss, or rely on an additional bias voltage and the heterogeneous integration of external circuits. Moreover, such optical activation is essentially in the real-valued domain and, accordingly, perpetuates the mathematical convenience of classical ANNs for both training and inference on electronic hardware, because real-valued operations (e.g., MAC) are often more computationally efficient for electronic circuits than their complex-valued counterparts. However, a PNN is intrinsically complex-valued rather than real-valued, and thus effectively using the complex-valued nature of the light field would help develop activators more suitable for optical-domain implementation.
Here, we propose nonlinearization induced by the connection between the amplitude and phase of the modulated light. Specifically, the intensity (or amplitude) of the input laser can be attenuated by controlling the absorption of the modulator, and simultaneously, the phase of the modulated light changes.
Such a linkage of phase and amplitude is a direct consequence of causality, universal in most optical media: the Kramers-Kronig relationship (KKR). In this work, it is used to produce nonlinear activation, namely, Kramers-Kronig activation (KKA), as shown in Figure 2b. It is found that the KKA is promising to provide performance close to that of state-of-the-art (SOTA) activators like ReLU and Softplus. Some practical considerations critical to the physical realization of deep photonic networks, including the insertion loss, the power consumption, and the energy efficiency, are analyzed. The essence of KKA, as well as its value for future advanced machine intelligence, is also discussed. As will be shown, our KKA approach has the potential to significantly reduce unnecessary DA/AD conversion, lower the footprint of on-chip ONNs, increase the light-intensity utilization efficiency, and provide better network cascadability. Beyond phase and amplitude, further linkages between different degrees of freedom of the light field might inspire novel activators for future PNNs.

The Principle of KKA
The typical architecture of the PNN with KKA is based on the pseudo-real-valued MZI mesh, [1] a scalable photonic neural chip design utilizing the real part of the MZI mesh to learn a real-valued matrix (i.e., the mesh itself is truly programmed as a complex-valued matrix, but works as a real-valued matrix multiplier whose operation depends only on its real part), as shown in Figure 2b. However, unlike the regular pseudo-real-valued mesh design whose inputs are purely amplitude-modulated, here, to produce nonlinear activation, the inputs are provided by modulators whose amplitude attenuation (absorption) is mathematically linked to the corresponding phase variation (chirp) by the KKR. Without loss of generality, we take the electro-absorption modulator (EAM), the most typical device with such an amplitude-phase KKR, to show the generation of KKA. [2] Specifically, in the EAM, the imaginary part of the refractive index of the light-guiding medium, Im(n), is electrically tunable to directly modulate the amplitude, but the real part of the refractive index, Re(n), varies simultaneously according to the KKR, resulting in a phase change. The exact modulated complex field is thus

E = |E_laser| exp[−2πL·Im(n)/λ] exp[j·2πL·Re(n)/λ]  (1)

where |E_laser| is the light amplitude through the EAM without voltage stress, λ is the wavelength, L is the device length, and n is the refractive index tuned by the voltage stress. It is well known that the KKR links the real and imaginary parts of the refractive index as [2]

ΔRe[n(ω,V)] = (2/π) P.V. ∫₀^∞ Ω·ΔIm[n(Ω,V)]/(Ω² − ω²) dΩ  (2)

where ω = 2πc/λ and P.V. ∫₀^∞ dΩ stands for the Cauchy principal value.
Considering that the amplitude depends only on ΔIm(n) (here λ and L are fixed, and without loss of generality, we may set |E_laser| = 1), the phase change due to modulation can be deemed a function of the amplitude absorption:

φ = 2πL·ΔRe[n(ω,V)]/λ = (2ωL/πc) P.V. ∫₀^∞ Ω·ΔIm[n(Ω,V)]/(Ω² − ω²) dΩ  (3)

In the following, we show how the phase-amplitude correlation based on the KKR produces nonlinearity in the pseudo-real-valued MZI mesh with EAMs as the input components, as shown in Figure 2b. [1] Taking the 4 × 4 mesh as an instance, the input modes E_in can be written as a four-dimensional vector

E_in = (|E_in,1| e^(jφ_1), |E_in,2| e^(jφ_2), |E_in,3| e^(jφ_3), |E_in,4| e^(jφ_4))^T  (4)

where |E_in,i| is the light amplitude at input port i (i = 1,2,3,4), j is the imaginary unit, and the phase φ_i depends on |E_in,i| (i = 1,2,3,4) due to the influence of the chirp, i.e., the KKR given by Equation (3). In contrast, it is well known that, by using the SVD algorithm, an arbitrary complex-valued matrix M can be factorized as M = UΣV and implemented by three cascaded MZI meshes (U, Σ, V), as shown in Figure 2: [3]

M = UΣV = [m_ik exp(jθ_ik)]_{4×4}  (5)

where m_ik and θ_ik (i,k = 1,2,3,4) are the amplitude and argument of the complex-valued element at row i, column k of the matrix M expressed by the MZI mesh. Therefore, the outputs are

E_out = M·E_in  (6)

Figure 2. a) PNN with a regular optical activator, which directly provides a nonlinear amplitude or intensity mapping between input and output (here the Mach-Zehnder modulator is exemplified as the chirp-free modulator allowing pure amplitude modulation); b) PNN with KKA, which employs the amplitude-phase link to construct nonlinear activation.
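As a minimal numerical sketch of the amplitude-phase link set by the KKR, the following snippet evaluates the principal-value integral for an assumed Lorentzian absorption change; the Lorentzian profile, frequency grid, probe frequency, and device length are illustrative placeholders, not the EAM data used later in this work.

```python
import numpy as np

# Hypothetical Lorentzian absorption change centered at Omega0 (arbitrary units).
Omega = np.linspace(0.01, 10.0, 4000)        # integration grid (avoid Omega = 0)
Omega0, gamma = 2.0, 0.2
d_im_n = gamma**2 / ((Omega - Omega0)**2 + gamma**2)   # ΔIm[n(Ω)]

def kk_delta_re_n(omega, Omega, d_im_n):
    """Principal-value Kramers-Kronig transform:
       ΔRe[n(ω)] = (2/π) P.V. ∫ Ω ΔIm[n(Ω)] / (Ω² - ω²) dΩ."""
    integrand = Omega * d_im_n / (Omega**2 - omega**2)
    # crude P.V.: drop the grid points closest to the pole at Ω = ω
    mask = np.abs(Omega - omega) > 2 * (Omega[1] - Omega[0])
    Om, Ig = Omega[mask], integrand[mask]
    return (2.0 / np.pi) * np.sum(0.5 * (Ig[1:] + Ig[:-1]) * np.diff(Om))

omega = 1.5                       # probe frequency below the resonance
d_re_n = kk_delta_re_n(omega, Omega, d_im_n)
L_over_lam = 100.0                # device length in units of wavelength (assumed)
phi = 2 * np.pi * L_over_lam * d_re_n   # chirp phase accompanying the absorption
```

Probing below the resonance, the induced real-index change (and hence the chirp phase) is positive, reflecting normal dispersion on the low-frequency side of an absorption feature.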
Accordingly, the real part of the outputs detected by the balanced photodiodes is

Re(E_out) = Re(M·E_in) = Re(M)·Re(E_in) − Im(M)·Im(E_in)  (7)

Note that, in the previous classical pseudo-real-valued mesh architecture, [1] Re(M)|E_in| equals the weighting-sum operation given by Equation (8)

Re(E_out) = Re(M)·|E_in|  (8)

where E_in corresponds to the real-valued (pure-amplitude-modulation) input vector, M is the complex-valued matrix represented by the MZI mesh, and its real part Re(M) serves as the weight matrix of the PNN. [1] Equation (7) is thus apparently different from the linear operation of the pseudo-real-valued mesh. [1] It can instead be deemed as the inputs |E_in,i| (i = 1,2,3,4) being first nonlinearly activated by

f_NL(|E_in,i|) = |E_in,i| cos[φ_i(|E_in,i|) + θ_(n,i)]/cos(θ_(n,i)),  n = 1, 2, 3, 4  (9)

and then subjected to the weighting-sum operation similar to Equation (8). In other words, introducing chirp into the modulation produces the activation induced by the KKR, namely, KKA. The homodyne-detected real part of the outputs, after the inputs passively transmit through the MZI mesh, corresponds to the result of the fused operations of nonlinear activation (KKA) and linear weighting-sum. However, unlike conventional deep neural networks, whose activations are applied at the output side of the hidden layers, as shown in Figure 2a, KKA takes effect at the input side (e.g., the modulators). The generation of KKA is therefore apparently different from conventional activation, and there are five noteworthy features: 1) While in the hidden layer of a conventional ANN we do the weighting-sum first and then the nonlinear activation (bottom of Figure 2a), for the hidden layer in the PNN with KKA, it becomes activation-first-then-weighting-sum (bottom of Figure 2b).
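The fused "activate-then-weight" reading above can be checked numerically: the homodyne-detected real part of the complex matrix-vector product equals real weights applied to nonlinear functions of the input amplitudes. In this sketch, the matrix and the amplitude-phase link φ(|E|) are arbitrary stand-ins, not device data.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4
M = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))  # MZI-mesh matrix (toy)

amp = rng.uniform(0.1, 1.0, size=N)   # modulated amplitudes |E_in,i|
phi = 0.8 * amp**2                    # toy amplitude-phase link φ(|E|) (assumed)
E_in = amp * np.exp(1j * phi)

# Direct homodyne output: real part of the complex matrix-vector product.
out_direct = np.real(M @ E_in)

# Equivalent "activate-then-weight" view:
#   Re(M E) = Re(M) · (|E| cos φ) - Im(M) · (|E| sin φ),
# i.e., linear real weights acting on nonlinear functions of |E|.
out_split = M.real @ (amp * np.cos(phi)) - M.imag @ (amp * np.sin(phi))
```

The two routes agree exactly, which is the algebraic content of the fused KKA/weighting-sum operation.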
However, considering that the outputs of hidden layer N, E_out,N, are usually the inputs of layer N+1, E_in,N+1, activating either E_out,N or E_in,N+1 would not, in principle, negatively affect the overall learning capability of the neural network; [4] 2) In the existing optical activators based on saturated absorption, SOAs, MZI-MRR combinations, Ge/Si systems, or optoelectronic hybrid architectures, [5-11] the underlying physics typically employs intensity-dependent nonlinear effects to implement a nonlinear mapping from the input (intensity) of the activator to its output (intensity). This thought is quite straightforward, but the phase information within the light signal is not well utilized (S1, Supporting Information). In contrast, the nonlinearity of KKA is facilitated by the link between the amplitude and the phase of the light signal, noting that Re(E_in) = |E_in|cos(φ) = |E_in|cos(f(|E_in|)) and, similarly, Im(E_in) = |E_in|sin(f(|E_in|)). In other words, the role of the phase is almost absent in conventional activators but essential to produce the KKA. This is the most critical difference between the two types of activation approach. Accordingly, engineering the KKR, rather than purely intensity-related effects, opens an extra phase dimension for designing optical activation functions; 3) KKA is a physics-intrinsic activation mechanism that neither involves contributions from the nonideal linearity of the modulator and its driver, nor requires deliberately introduced nonlinearity in the electrical domain of the form I = g_NL(V), where I is the encoded (modulated) intensity of the input data and g_NL(V) is a function reflecting the nonlinear response under voltage stimulation V.
This is because the voltage V is actually an eliminated dummy variable that helps us extract the amplitude-phase link by combining the measurements of both the amplitude-voltage and phase-voltage responses (S2, Supporting Information). In the EAM, the variations of the amplitude and the phase depend, respectively, on the changes of the imaginary and real parts of the refractive index, and the KKR physically guarantees that the same change of the imaginary part of the refractive index always produces the same corresponding change of the real part, regardless of the specific details of the stimulation source and/or excitation method. In other words, no matter what the exact form of g_NL(V) is, we always obtain the same amplitude-phase relationship, physically determined by the KKR. Accordingly, the electronically driven nonlinearity g_NL(V) does not contribute to the generation of the KKA; 4) Even for a given nonlinear device with an as-designed (fixed) KKR, its nonlinearity can still be reconfigured by programming Im(M), the imaginary part of the complex-valued matrix expressed by the MZI mesh. This provides further flexibility to tune the activation behavior; 5) Besides the EAM, KKA is also feasible in other modulation devices such as the MRR, [12] MZM, [13] or even the Michelson interferometric modulator, [14,15] because in these devices, altering the light's amplitude simultaneously produces a phase change as well; light modulation is, in essence, the manipulation of the refractive index, whose real and imaginary parts are linked by the KKR (discussed later in Section 6).
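The elimination of the dummy variable V can be illustrated with a toy model: two very different drive nonlinearities g_NL(V) trace out the same φ(|E|) curve once the amplitude-voltage and phase-voltage responses are combined. The exponential-amplitude/linear-phase pairing below is assumed purely for illustration of this elimination, not taken from device data.

```python
import numpy as np

# Assumed device responses vs. an internal variable u ~ ΔIm(n):
# amplitude |E|(u) = exp(-a·u), phase φ(u) = b·u (a KKR-fixed pairing, toy values).
a, b = 2.0, 1.5
amp_of = lambda u: np.exp(-a * u)
phi_of = lambda u: b * u

def phase_vs_amplitude(g_nl, V):
    """Measure (|E|, φ) pairs under a given electrical drive u = g_nl(V)."""
    u = g_nl(V)
    return amp_of(u), phi_of(u)

V = np.linspace(0.0, 1.0, 200)
amp1, phi1 = phase_vs_amplitude(lambda v: v**2, V)             # quadratic driver
amp2, phi2 = phase_vs_amplitude(lambda v: np.tanh(3 * v) / 3, V)  # saturating driver

# Interpolate φ(|E|) on a common amplitude grid: the two curves coincide,
# so the drive nonlinearity g_NL(V) drops out of the activation.
grid = np.linspace(0.55, 0.95, 50)
f1 = np.interp(grid, amp1[::-1], phi1[::-1])   # amplitude decreases with u
f2 = np.interp(grid, amp2[::-1], phi2[::-1])
```

Both drivers yield the same φ(|E|) relation (here φ = −(b/a)·ln|E|), matching the statement that only the KKR-fixed amplitude-phase pairing enters the KKA.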
However, among all these devices, the EAM is the simplest system in which the phase-amplitude relationship maps explicitly onto the link between the real and imaginary parts of the refractive index (S1, Supporting Information): the amplitude depends only on the imaginary part of the refractive index, while the phase depends only on the real part, as shown in Equation (1). This is why we choose the EAM to exemplify the KKA, as mentioned above.

Nonlinear Behaviors of the KKA
Here, we take the EAM modulator from Broadcom as the model system to study the KKA behavior generated in the pseudo-real-valued MZI mesh. [16] Its |E|-φ relation linked by the KKR is shown in Figure 3a, which is extracted from the test data (S2, Supporting Information). Combining such an |E|-φ relation with a programmable Im(M), it is feasible to produce highly tunable KKA.
The simplest case might be Im(M) = 0, for which the activation function is f_NL(E) = E·cos(φ(E)), as shown in Figure 3b (black line). However, in most cases with nonzero Im(M), it is challenging to deduce a general analytical form of the activation function. Even so, we can still present rich instances with analytically deducible activation functions, and thus demonstrate the feasibility of programming the KKA.
As proposed above, for the given PNN based on the pseudo-real-valued MZI mesh, [1] Re(M) is used as the weight matrix and Im(M) induces the activation. Starting from Equation (6), we can factorize them in SVD form, respectively, as

Re(M) = UΣV,  Im(M) = U′Σ′V′  (11)

where the absolute value of every singular value σ_ii (i = 1, 2, …, n) in the diagonal Σ and σ′_ii (i = 1, 2, …, n) in the diagonal Σ′ should be less than 1 because of the attenuating nature of the MZI mesh. Hence, we define σ_ii = cos θ_i and σ′_ii = sin θ′_i. In turn, the i-th (i = 1, 2, …, n) component of the input vector can be written as |E_in,i| e^(jφ_i). For the case U′ = U and V′ = V (with V = [v_ik]), the two terms of Equation (7) become

Re(M)·Re(E_in) = U [v_ik cos θ_i cos φ_k] (|E_in,1|, …, |E_in,n|)^T  (12)

Im(M)·Im(E_in) = U [v_ik sin θ′_i sin φ_k] (|E_in,1|, …, |E_in,n|)^T  (13)

and, combining them via a·cos φ − b·sin φ = A·cos(φ + θ*), their difference can be arranged as

Re(M)·Re(E_in) − Im(M)·Im(E_in) = U [v_ik cos θ_i · A_i cos(θ*_i + φ_k)/cos θ_i] (|E_in,1|, …, |E_in,n|)^T  (14)

where A_i = (cos²θ_i + sin²θ′_i)^(1/2) and tan θ*_i = sin θ′_i/cos θ_i. Noting that the weight matrix expressed by the real part of the unitary MZI mesh is

Re(M) = U [v_ik cos θ_i]  (15)

the real part of the j-th component (j-th neuron) of the output can always be deemed as the inputs (i.e., |E_in,i|, i = 1,2,3,…,n) first being activated with the nonlinear functions

f_NL(|E_in,i|) = |E_in,i| A_j cos[θ*_j + φ_i(|E_in,i|)]/cos θ_j,  i = 1,2,3,…,n  (16)

and then subjected to the weighted-sum operation (Figure 2b). Such that, cascading the pseudo-real-valued MZI meshes with EAMs constructs a multilayer (deep) neural network with alternately executed linear (weighting) and nonlinear (activation) operations.
The θ_j- and θ*_j-dependence of KKA makes it trainable. However, because θ_j and θ*_j (also) control the matrix M expressed by the MZI mesh, whose real part Re(M) is the weight matrix of the PNN, the training of the weights is coupled with the optimization of the nonlinear activation. When setting U′ = U and V′ = V, some instances of KKA are shown in Figure 3b (certainly, Im(M) = 0 also belongs to the cases of U′ = U and V′ = V) to visualize the specified activation behaviors. When sin θ′_i = cos θ_i (i = 1,2,3,…,n), i.e., Im(M) = Re(M) (red line in Figure 3b), the activator becomes

f_NL(E) = √2·E·cos[φ(E) + π/4]  (17)

More generally, let Im(M) = kRe(M) (k is a parameter programmable by the MZI mesh); the activator becomes

f_NL(E) = √(1 + k²)·E·cos[φ(E) + φ_0]  (18)

where φ_0 = arccos[1/√(1 + k²)]. Tuning k produces abundant and variable nonlinear behaviors, as shown in Figure 3b, which seem promising to play the role of several SOTA activators like ReLU and Softplus in enhancing the learning capability of neural networks, as demonstrated in the study of Passalis et al. [17] Moreover, unlike classical neural networks, where the activation functions usually have a fixed form, the PNN with KKA interestingly allows different neurons to have different activation behaviors, e.g., when sin θ′_j varies from port to port. This implies that the activation of the stimulus |E_in,i| at input port i can be not only highly programmable (note θ_j is variable) but also port-to-port different at the output side (since a different output port j often corresponds to a different θ_j), i.e., KKA can differ both neuron-by-neuron and synapse-by-synapse. Such a feature is closer to the biological reality, since it is well known that real neural systems consist of neurons and synapses of many different subtypes. [18] Specifically, as shown in Figure 4a,b, the cases corresponding to θ = 1.0π and 0.8π exhibit a response comparable to the ReLU activation function: the output is low for small input values and high for large input values.
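The k-programmable activator of Equation (18) can be cross-checked against the direct complex-valued computation, since Im(M) = kRe(M) implies M = (1 + jk)Re(M). The |E|-φ link below is a toy placeholder for the measured EAM curve of Figure 3a.

```python
import numpy as np

# Toy amplitude-phase link φ(|E|) (assumed; the real one comes from EAM test data).
phi_of = lambda E: 0.6 * np.pi * (1.0 - E)

def kka(E, k):
    """KKA activator for Im(M) = k·Re(M), Eq. (18):
       f(E) = sqrt(1 + k²) · E · cos(φ(E) + arctan k)."""
    return np.sqrt(1 + k**2) * E * np.cos(phi_of(E) + np.arctan(k))

E = np.linspace(0.0, 1.0, 101)
# Cross-check against the direct complex-valued route Re[(1 + jk)·E·e^{jφ(E)}],
# which is what the mesh physically computes when Im(M) = k·Re(M).
k = 0.7
direct = np.real((1 + 1j * k) * E * np.exp(1j * phi_of(E)))
```

Note arccos[1/√(1+k²)] = arctan(k) for k ≥ 0, so the two forms of φ_0 coincide; sweeping k reshapes the activation continuously, which is the programmability exploited above.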
For the case of θ = 0.8π, the output at low input values is slightly increased with respect to the response for θ = 1.0π. However, these ReLU-like activations are not entirely monotonic but decrease slightly before increasing. In contrast, the responses shown in Figure 4c,d, corresponding to θ = 0.6π and 0.4π, are quite different. These configurations demonstrate a saturating response in which the output increases (θ = 0.6π) or decreases (θ = 0.4π) nearly linearly for lower input values, but is suppressed at the top (bottom) for higher input values. Moreover, when sin θ′_i = sin θ_i, the M expressed by the MZI mesh is simply a unitary matrix. Consequently, instead of the two unitary matrices and one diagonal matrix needed in the regular SVD-based mesh design, one unitary matrix is enough to implement any real-valued weight matrix of a hidden layer as well as the corresponding activators defined by Equation (18), as shown in Figure 4e. This is quite valuable for minimizing optical losses, reducing fabrication resources, saving programming power, and lowering programming error. [1] Aside from the cases with U′ = U and V′ = V, the more common situation is U ≠ U′ and V ≠ V′. However, if the L₂ norm of Re(M) is remarkably larger than that of Im(M), we can still approximate M with the following process.
Here, we define Σ* = diag(U^H U′Σ′V′V^H), where diag(A) corresponds to the matrix constructed from the diagonal elements of matrix A, such that M ≈ U(Σ + jΣ*)V. Noting that the L₂ norms of U^H U′Σ′V′V^H and Σ* are much smaller than that of Σ, the activation behaviors can be quasi-analytically tracked by the methods proposed above. When the L₂ norms of Im(M) and Re(M) are comparable, it seems difficult to describe the corresponding KKA properties analytically. Even so, it is still feasible to numerically train the activator, which will be further discussed in Section 4.

The Computational Evaluations of PNN with KKA
In this section, we first numerically characterize the performance of the PNN with KKA on the benchmark machine-learning task of classifying images from the MNIST dataset, which consists of 60 000 images of handwritten digits ranging from 0 to 9. The PNN setup, shown schematically in Figure 5a-e, consists of a sequence of hidden layers with different activation settings. The last (output) layer is a linear layer that reduces the vector to a length of 10 elements, suitable for one-hot detection across the 10 digit classes. The SoftMax activation function is used for the output layer, and cross-entropy serves as the loss function. Four different activation configurations in the hidden layers are considered: 1) no activation (Figure 5b); 2) the ReLU function (Figure 5c); 3) the Softplus function (Figure 5d); and 4) KKA (Figure 5e). Such a comparison allows us, on the one hand, to check whether KKA is indeed able to act as an activator and, on the other hand, to rationally evaluate its performance against popular SOTA activation functions like ReLU and Softplus. For the unitary MZI mesh, the KKA can be described analytically with the explicit form given by Equation (18). Hence, for these cases, the PNN with KKA can be trained in the real-valued domain utilizing a flow similar to that of regular networks. In the training, we divide the 60 000 images of the MNIST dataset into training and testing sets with 50 000 and 10 000 samples, respectively. Utilizing the PyTorch framework, the training-set samples are fed into the network and employed to calculate the gradient by the standard back-propagation method. The batch size is 500, which is found to be a good choice with stable learning convergence (S3, Supporting Information), though at the cost of larger computation-memory consumption and relatively slow network updates.
However, the optimal learning rate (LR) used for network training with the four different activation configurations (linear, ReLU, Softplus, KKA) depends on the network depth (S3, Supporting Information). Under the optimized conditions, the networks converge after training for 50 epochs. During training, once an epoch is complete, we record the performance of the network on the test set with weights updated according to the gradients extracted from the loss. After all 50 epochs, the best test-set accuracy among these 50 values is selected as the final learning result of the trained network. Before entering the PNN, the images undergo a preprocessing stage consisting of a Fourier-transform step and a cropping step, as shown in Figure 5a. These operations reduce the total size of the input data from 28 × 28 = 784 real-space pixels to 25 complex Fourier coefficients. This preprocessing of the input data has been proven an efficient route to better simulation and training efficiency, while maintaining reasonably high classification performance, in previous explorations of novel optical activators. [19] Hence, we also introduce this flow in the evaluation of the PNN with KKA. Since it is merely a preprocessing of the input data and does not affect the structure of the employed network, it should not affect the power consumption, time delay, or other factors of the trained neural network.
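The Fourier-transform-and-crop preprocessing can be sketched as follows; the centered 5 × 5 low-frequency window is an assumption (any 25-coefficient low-pass crop has the same dimensionality), since the exact window is not specified here.

```python
import numpy as np

def fourier_preprocess(img28, keep=5):
    """Fourier-transform a 28x28 image and keep only the keep x keep
       lowest-frequency complex coefficients (5x5 = 25), mirroring the
       low-pass cropping step described above (window choice assumed)."""
    F = np.fft.fftshift(np.fft.fft2(img28))   # move DC to the center
    c = img28.shape[0] // 2
    half = keep // 2
    crop = F[c - half : c + half + 1, c - half : c + half + 1]
    return crop.ravel()                        # 25 complex Fourier coefficients

x = fourier_preprocess(np.random.default_rng(1).random((28, 28)))
```

The 784 real pixels are thus compressed to a 25-dimensional complex vector before entering the mesh, shrinking the simulated network without touching its structure.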
It is found that the linear PNNs with one, two, and three (hidden) layers show relatively stable accuracy around 88.9% (Figure 5f, black dash). This is not surprising, because a sequence of linear transformations is itself a linear transformation, and accordingly the linear PNN does not benefit from an increase in network depth. In other words, without intermediate nonlinearities, additional linear layers cannot meaningfully increase the learning capacity of the PNN. In contrast, the regular ReLU-activated PNNs with one, two, and three layers reach classification accuracies of 94.1%, 94.8%, and 94.4%, respectively (Figure 5f, blue dash). The performance of Softplus is slightly weaker than ReLU in the one-layer network (93.7%) and becomes very close or even better in the deeper two- and three-layer networks (Figure 5f, green dash). These results indicate that: 1) the nonlinear activation does improve the learning capability of the PNN; and 2) the two-layer network should be the optimal depth, because its classification accuracy is better than that of the one-layer network, while the learning capability of the deeper three-layer network seems saturated. The performance of the PNN with KKA lies between the non-activated (linear) and activated (ReLU and/or Softplus) networks, with a classification accuracy of 90.2% even when employing only one FC hidden layer, rising above 91.4% when using two or more layers (Figure 5f, red line). Thus, the PNN with KKA apparently outperforms the linear PNN. Moreover, similar to the ReLU- (or Softplus-) activated PNN, its accuracy improves with depth, indicating that the KKA does endow the PNN with nonlinearity rather than simply adding extra weight parameters to a linear network. However, in all these MNIST classification cases, the accuracy of the PNN with KKA remains lower than that of the PNN with SOTA activators like ReLU and Softplus.
This weakness might be attributed to the analytically deduced KKA form of Equation (18), which restricts the expressivity of the MZI mesh to (only) unitary matrices.
To further verify the feasibility of KKA, we employ it in a larger network built from the SVD MZI mesh (Figure 2b) with stronger learning capability (universal expressivity for arbitrary complex-valued matrices [20]) to execute the relatively complex Fashion-MNIST task (Figure 6a). Specifically, fully connected feed-forward networks with two hidden layers of 256 complex-valued neurons each were implemented with the GridNet mesh (Figure 1c), [6] which achieves ≈99% MNIST classification accuracy, as proposed in the study of Fang et al. [20] Without the "low-pass" cropping of the original image to shrink the input size, as done in the MNIST classification experiments, the 28² = 784-dimensional real-valued input was directly converted into 392(= 784/2)-dimensional complex-valued vectors by taking the top and bottom halves of the image as the real and imaginary parts. In this way, the data distribute evenly throughout the complex plane rather than just along the real number line. [20] The linear layers of GridNet can express arbitrary complex-valued matrices, so unlike in the MNIST tasks, we do not need to introduce any prior setting to correlate Re(M) and Im(M). Accordingly, however, the KKA does not have a (quasi-)analytical formula and must be numerically trained (S4, Supporting Information) in the complex-valued domain. Likewise, the Softplus and ReLU functions should be replaced with their complex-valued versions. [20,21] Specifically, the Softplus nonlinearity is applied to the modulus of the complex numbers; a modulus-squared nonlinearity modeling an intensity measurement is then applied, while the phase of the outputs is maintained. [20] Similarly, the complex-valued ReLU affects only the modulus, with the optimal form explored in the pioneering work of Arjovsky et al. as (z/|z|)ReLU(|z| + b), [21] where the bias b is a tunable parameter and |z| is the norm of the complex output light field z. The final layer is still SoftMax, allowing the output to be interpreted as a probability distribution.
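The complex-valued ReLU of Arjovsky et al. used here can be sketched directly from its formula; the sample inputs and bias are illustrative only.

```python
import numpy as np

def mod_relu(z, b):
    """Complex-valued ReLU: (z/|z|) · ReLU(|z| + b).
       Acts on the modulus only; the phase of z is preserved."""
    mag = np.abs(z)
    scale = np.maximum(mag + b, 0.0) / np.maximum(mag, 1e-12)  # avoid /0 at z = 0
    return z * scale

z = np.array([2.0 * np.exp(1j * 0.3), 0.1 * np.exp(-1j * 1.2)])
out = mod_relu(z, b=-0.5)
```

With b = −0.5, the first element's modulus shrinks from 2 to 1.5 while its phase 0.3 survives, and the second (modulus 0.1 < 0.5) is clipped to zero, showing the thresholding on the modulus alone.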
[22] The cross-entropy loss function is used to evaluate the output distribution against the ground truth. In line with the MNIST classification networks shown earlier, we select a batch size of 500 to achieve robust training (S4, Supporting Information). Moreover, considering the larger size of the network with many more weight parameters, to ensure reliable convergence to the loss minimum, we first coarse-train the network for 50 epochs at LRs of 0.002, 0.002, 0.01, and 0.0006 for KKA, ReLU, Softplus, and linear activation, respectively, and then fine-tune for another 50 epochs with a lower LR (1/5 of the one used in the first 50 epochs). This two-stage strategy works well on the larger networks, with stable loss descent and improved accuracy (S4, Supporting Information). Again, we pick the best test-set accuracy among these 100 values as the learning result of the trained large network. Different from the relatively weak activation of KKA in the aforementioned MNIST classification cases, here the PNN with KKA is comparable to or even better than conventional SOTA activators in the Fashion-MNIST case. The networks without activation (Figure 6b), with ReLU activation (Figure 6c), with Softplus activation (Figure 6d), and with KKA (Figure 6e) respectively produce 88.71%, 99.67%, 99.58%, and ≈99.97% classification accuracy on the Fashion-MNIST dataset (Figure 6f). This is most probably thanks to the better expressivity of the universal SVD MZI mesh (GridNet), which unlocks the limitation on the available parameter space of the activation functions. Therefore, we propose that, unlike the conventional ReLU- and Softplus-activated PNNs, whose weighting-and-sum operations and nonlinear activations are mutually independent, the weighting and activation in the PNN with KKA are coupled and co-dependent on the parameters of the complex-valued matrix configured by the MZI mesh.
Hence, the expressivity of the mesh can strongly impact the validity of KKA, and it will be of great significance to explore the correlation between mesh expressivity and the performance of the PNN with KKA in the future.

Scalability of PNN with KKA
The learning capability of an ANN is roughly proportional to its scale; hence, the scalability of the PNN with KKA is of great significance to fully realize its potential in various applications. In the following, we analyze the feasibility of a large-scale PNN with KKA in two aspects: 1) the multilayer cascadability, and 2) the power consumption as well as energy efficiency.

Multilayer Cascadability
It is well known that deep networks are highly desired for machine intelligence; hence, good cascadability of the activator is critical. For a PNN, the cascadability is mainly limited by the optical transmission of the network. Very recently, fully integrated coherent photonic circuits with a programmable optical nonlinear unit (NU) were demonstrated and achieved practical transmission in a deep network with three fully connected layers. Figure 7a shows the architecture of these networks with NUs. [23] The laser source is split into input light and reference light; the input light is further split to the N input ports of the network and encoded by the modulator array with the input vector. Afterward, the input vector is fed into the MZI mesh of the first hidden layer to perform the matrix multiplication through passive optical interference, and then the NU applies the activation function to yield the input to the next layer. After the inputs transmit through L cascaded layers, the output signal is extracted by homodyne detection. Such that, the on-chip insertion loss IL_NU of the PNN with NU consisting of an L-layer fully connected forward network (N ports in and N ports out) would be (S5, Supporting Information)

IL_NU = IL_PS·log₂(2N) + IL_Mod + L·(N·IL_MZI + IL_Mod) + 10·lg2  (20)

which is contributed by the following parts: 1) utilizing the power-splitting tree based on, e.g., log₂(2N) layers of Y-splitters or 1 × 2 multimode interference couplers with insertion loss IL_PS to split the laser source to 2N ports (N input ports and N reference sources); accordingly, the loss of this part is IL_PS·log₂(2N); 2) the insertion loss of the input modulators, IL_Mod; 3) the insertion loss of one hidden layer constructed by the MZI mesh (e.g., using a unitary Clements mesh with N stages of MZIs, where the loss of one MZI is IL_MZI) and the NU, composed of the tunable power splitter and the modulator.
Since the PN modulator is apparently lossy compared with the tunable power splitter (e.g., one based on a thermally tuned MZI), the loss of the NU can be roughly approximated by IL_Mod; the loss of the L hidden layers is thus L·(N·IL_MZI + IL_Mod); and 4) the 10·lg2 term (factor 1/2) in Equation (20) originates from the fact that, in the homodyne probing of the output, only half of the light "sees" the MZI mesh. [1] In contrast, for the PNN consisting of an L-layer fully connected forward network with KKA (Figure 7b), the laser power is first split evenly to the L layers (e.g., using a power-splitter tree), and each layer obtains 1/L of the light intensity. This 1/L light intensity is then split into input and reference parts and fed into the corresponding hidden layer for KKA and the weighting-sum operation. Such that, the effective on-chip loss IL_KKA would be (S5, Supporting Information)

IL_KKA = 10·lgL + IL_PS·log₂(L) + IL_PS·log₂(2N) + (IL_Mod + N·IL_MZI) + 10·lg2  (21)

which is contributed by the following parts: 1) utilizing the power-splitting tree to equally split the laser source to the L layers. Unlike the PNN with NU, where the light power from the output ports of one hidden layer is transmitted to the next layer, the 1/L light power obtained by each layer is used mutually independently without direct interlayer light transmission; the loss of this part is 10·lgL + IL_PS·log₂(L); 2) the light power for a hidden layer is further split to 2N ports (N input ports and N reference sources), with corresponding loss IL_PS·log₂(2N); 3) one hidden layer constructed by the MZI mesh (e.g., using a unitary Clements mesh with N stages of MZIs, similar to the case of the PNN with NU [23]) and the KKA unit, whose loss originates from the modulator (although the EAM is employed in our aforementioned discussions, other types of modulation devices are also feasible; see Section 6).
Note that the activators of the first hidden layer are actually the input modulators; hence, the loss of any hidden layer can generally be given by IL_Mod + N·IL_MZI; and 4) likewise, the factor 1/2 in Equation (21) originates from the fact that in the homodyne probing, only half of the light "sees" the MZI mesh. [1] Considering that the loss of one hidden layer in the PNN with either NU or KKA is contributed by the modulator and the MZI mesh, the estimated loss values (the device-loss parameters used for the estimation are given in S5, Supporting Information) show that, for all depths, the PNN with KKA is much less lossy and could thus achieve much stronger output signals compared with the multilayer architecture proposed by Bandyopadhyay et al. [23] (Figure 7c). In other words, for the same feasible on-chip laser power, the PNN with KKA can support a deeper network at the same hidden-layer size. Moreover, for even deeper networks, the PNN with KKA allows extra laser sources to be introduced, as shown in Figure 7d. In contrast, the maximum feasible laser power fed into a silicon-photonic chip with the conventional architecture (Figure 7a) might be limited by two-photon absorption. [24] Since the laser sources for each layer are mutually independent, the architecture of the PNN with KKA could, in principle, support an arbitrary number of propagation layers, far beyond conventional designs.
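As a rough sanity check, the two loss budgets above can be evaluated numerically. The device-loss values below (IL_PS, IL_Mod, IL_MZI, in dB) are illustrative placeholders, not the calibrated parameters of S5, Supporting Information:

```python
import math

def il_nu_db(N, L, il_ps=0.2, il_mod=3.0, il_mzi=0.1):
    """On-chip insertion loss (dB) of an L-layer PNN with nonlinear units,
    following Eq. (20): splitter tree + input modulators + L hidden layers,
    plus the ~3 dB homodyne factor (10*lg 2). Losses are assumed values."""
    homodyne = 10 * math.log10(2)
    return (il_ps * math.log2(2 * N) + il_mod
            + L * (N * il_mzi + il_mod) + homodyne)

def il_kka_db(N, L, il_ps=0.2, il_mod=3.0, il_mzi=0.1):
    """Eq. (21): the laser is pre-split over L mutually independent layers,
    so only ONE hidden layer's loss appears in the budget, at the cost of
    the 10*lg L + IL_PS*log2(L) splitting overhead."""
    homodyne = 10 * math.log10(2)
    return (10 * math.log10(L) + il_ps * math.log2(L)
            + il_ps * math.log2(2 * N) + il_mod
            + N * il_mzi + homodyne)
```

For L = 1 the two budgets differ by exactly one modulator loss, while for deep networks the KKA budget grows only logarithmically with L instead of linearly.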

Power Consumption and Energy Efficiency
Besides the insertion loss, the power consumption is another important performance factor that directly determines the energy efficiency of the PNN chip. In general, for an optical neural network based on a coherent N × N MZI mesh with regular commercial volatile silicon phase shifters (thermal, carrier-injection, or carrier-depletion), the power from the electrical-optical conversion of the modulators (as well as the optical-electrical conversion of the detectors) scales with N, since it only takes place at the input (output) edge, while the power for programming the weights (optical phases) of the universal MZI mesh scales with N². Hence, the total power consumption (neglecting the small power of the detectors) is typically at the level of N·P_Mod + N²·P_π, where P_Mod is the modulator power and P_π corresponds to the power of a phase element with a π shift (note that in practical applications, the exact phase value of each element depends strongly on the weight matrix to be configured; however, since the phase has a 2π period, it is rational to use π as the statistical expectation of the phase value to estimate the power level). Hence, for a high mesh radix N, corresponding to a large number of modulators, the total power is mainly contributed by the programming rather than the modulation. Accordingly, noting that the computation capability of the mesh is 2fN² (here, f is the clock frequency; see Ref. [3]), the energy efficiency is controlled by ≈f/P_π. This model can be extended to analyze the power consumption and energy efficiency of the deep networks with L cascaded fully connected layers discussed above. Noting that both the NU and the KKA unit, which activate the output of a hidden layer and feed it into the next layer, consist of one detector and one modulator, we find that the PNN with KKA and the benchmarked PNN with NU at the same scale require very similar sets of devices.
Accordingly, they would exhibit similar power consumption and energy efficiency.
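The scaling argument above can be sketched numerically. The modulator and phase-shifter powers below are illustrative assumptions, not measured device values:

```python
def pnn_power_mw(N, p_mod_mw=10.0, p_pi_mw=20.0):
    """Total power (mW) of an N x N coherent MZI mesh with volatile phase
    shifters: N modulators at the input edge plus ~N^2 phase elements,
    each taken at the statistical expectation of a pi shift."""
    return N * p_mod_mw + N**2 * p_pi_mw

def energy_efficiency_tops_per_w(N, f_ghz=10.0, p_mod_mw=10.0, p_pi_mw=20.0):
    """Compute capability is 2*f*N^2 MACs/s; efficiency = ops / power.
    For large N with volatile shifters this saturates at 2f/P_pi, i.e.,
    on the order of f/P_pi as stated in the text; with nonvolatile
    shifters (p_pi_mw = 0) it instead scales with N*f/P_Mod."""
    ops_per_s = 2 * f_ghz * 1e9 * N**2
    watts = pnn_power_mw(N, p_mod_mw, p_pi_mw) * 1e-3
    return ops_per_s / watts / 1e12  # TOPS/W
```

Setting p_pi_mw to zero reproduces the nonvolatile (phase-change) limit discussed below, where efficiency grows with the mesh radix rather than saturating.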
The aforementioned analyses of both the insertion loss and the power consumption promisingly indicate that the PNN with KKA is much less lossy than the conventional PNN with NU, without paying any cost in power consumption. Moreover, thanks to the emerging exploration of more energy-efficient phase-shift elements, the power performance of the PNN can be further improved. State-of-the-art optical computing solutions (like the Mars core from Lightmatter [25] ) tend to substitute the phase shifters available in commercial silicon photonics foundries with lower-power (but relatively less mature) MOEMS phase elements to enhance the energy efficiency. Furthermore, if the commercial (volatile) phase elements are substituted by nonvolatile ones based on, e.g., phase-change materials, [26,27] the programming power (especially during inference) would be close to zero. Accordingly, the power would be dominated by the modulation, with the scaling law dramatically improved from O(N²) to O(N). Consequently, the energy efficiency would scale with Nf/P_Mod, i.e., a high mesh radix is beneficial for achieving excellent energy efficiency.

Discussion
In this work, we demonstrated that PNNs with KKA can achieve training and testing accuracies comparable to those reported for conventional PNNs needing electrical activation. [28-30] Our PNNs with KKA also provide abundant and configurable nonlinearity, and even allow port-to-port-different activation. In addition to having the advantages of direct optical activation (thus avoiding the ADCs needed by conventional electrical activation), the architecture of the PNN with KKA can, in principle, support an arbitrary number of propagation layers.
As for the physical implementation of the PNN with KKA, the fabrication of EAMs based on GeSi quantum wells in either the O or C band has been verified to be CMOS compatible, [31,32] and they can thus be integrated together with other silicon photonic components, such as waveguides, MZI meshes, and Ge photodiodes, on advanced silicon photonic platforms, e.g., iSiPP200 from IMEC. [33] Thanks to the small footprint (tens of μm), low power consumption (<1 pJ Baud⁻¹), and high speed (tens of GHz) of GeSi EAMs, the PNN with KKA might provide more efficient inter-hidden-layer cascading and overcome the optical-transmission bottleneck of conventional PNNs caused by the prohibitively high loss of electrical activation, whose performance is further challenged by realizing ADCs with high speed, high accuracy, and low power at the same time. Moreover, besides the EAM, KKA is also feasible for other modulation devices like the MZM and MRR, which are widely provided in most silicon photonic platforms. [8,33] Although their phase and amplitude are not explicitly mathematically linked by the KKR, they are likewise controlled by the refractive-index modulation, and can thus be written as functions g₁ and g₂ of the real and imaginary parts of the refractive index n:

ϕ = g₁(ΔRe(n), ΔIm(n)),  |E| = g₂(ΔRe(n), ΔIm(n))

Since ΔRe(n) and ΔIm(n) are truly physically correlated by the KKR, it still allows us to use the KKR to (indirectly) link ϕ and |E| of the MZM or MRR, and accordingly achieve the activation proposed above. Likewise, in some other emerging silicon photonic modulators, [33] e.g., the node-matched-diode modulator or the photonic crystal modulator, [24,34] the ϕ-|E| link can also be constructed via the KKR. Hence, for all these cases, we may refer to them as generalized KKAs (G-KKA). In addition, developments in heterogeneous integration enable activations based on components with optical gain, such as the semiconductor optical amplifier (SOA).
However, unlike the conventional power-in-power-out strategy, which tends to produce a saturating nonlinear response, [35] KKA based on an SOA could still produce ReLU- or Softplus-like activations, as shown in the EAM case, thanks to the utilization of the phase parameter. [11] Although KKA employs a new mechanism to optically produce the nonlinear activation, it does not bring extra technical complexity to the physical implementation. Compared with a regular linear optical transformer, which usually modulates the amplitude (or intensity) to encode the input data in the optical domain, uses an MZI or MRR mesh to execute the linear transformation (weighting and summing), and obtains the calculation results with photodiodes, KKA does not require any additional photonic devices. It is still produced by the modulator, although the exact working mode differs to some extent from common amplitude (or intensity) modulation. The extraction of the results via on-chip homodyne detection has also been proven technically feasible in several previous works. [1,36] Therefore, our KKA approach provides the capability of monolithically integrating both the linear and nonlinear functions, e.g., on one silicon photonic neural chip, without requiring any off-photonic-chip digital processing as done in classical optical-linear-plus-electrical-activation architectures. [37] As a result, such a monolithic-integration approach, free of off-chip electrical activation, can be conveniently implemented in most commercial pure silicon photonic platforms (e.g., IMEC, AMF, CompoundTek [38] ). Even so, it should be pointed out that in some cases with a less-expressive photonic circuit mesh, the performance of KKA is somewhat weaker than that of SOTA activators like ReLU or Softplus.
Hence, from the viewpoint of ensuring the learning capability of the PNN, at least so far, KKA cannot fully displace the electrical activations, whose performance is robust and independent of the structural details of the network. Rational design balancing the mesh expressivity, activation performance, and network scale is beneficial for constructing the most appropriate PNN with KKA. Some advanced training strategies may also be valuable for enhancing the performance of KKA. [20] Moreover, noting the emerging exploration of integrating analog electrical circuits on (pure) silicon photonic platforms, as well as the advanced capability of a few platforms like GlobalFoundries that allow the monolithic integration of silicon photonic and electrical circuits, [39] an alternative hardware design of the PNN chip might be a relatively less-expressive mesh with a small fraction of the detectors linked to analog electrical nonlinear units, so as to achieve a good trade-off among learning capability, technical complexity, power, chip area, and cost.
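The ϕ-|E| link that underlies KKA and G-KKA can be illustrated with a generic minimum-phase calculation, where the KKR reduces to a Hilbert transform between the log-amplitude and the phase. This is a mathematical sketch of the relationship itself, not a model of the EAM (or MZM/MRR) devices discussed above:

```python
import numpy as np

def hilbert_transform(x):
    """Discrete Hilbert transform via FFT: multiply positive frequencies
    by -i and negative frequencies by +i (the DC term drops out since
    sign(0) = 0)."""
    s = np.sign(np.fft.fftfreq(len(x)))
    return np.real(np.fft.ifft(np.fft.fft(x) * (-1j) * s))

def minimum_phase_from_log_amplitude(log_amp):
    """Phase implied by the Kramers-Kronig (minimum-phase) relation
    phi = -H[ln|E|] on a periodic frequency grid: knowing the amplitude
    spectrum fixes the phase, which is the coupling KKA exploits."""
    return -hilbert_transform(log_amp)
```

Feeding in any log-amplitude profile returns the phase response that causality forces on it; e.g., a cosine log-amplitude yields a (negated) sine phase, the textbook Hilbert pair.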
The aforementioned discussions show that KKA is promising for physical implementation on a silicon photonic neural chip using a series of modulators. The mathematical deductions and numerical simulations of KKA show its high trainability and abundant nonlinearity. Furthermore, it is interesting to examine the exact essence of the nonlinear features learned by KKA. Here, we suggest an explanation of KKA (as well as G-KKA) based on dimension embedding, an important viewpoint inspiring the design of ANN architectures for machine learning. [40,41] Indeed, it is well known that nonlinear features in a low-dimensional space can be learned by (mapped to) linear features in a high-dimensional space, or vice versa. [42,43] Here, expressing the matrix by the pseudo-real-valued mesh mathematically corresponds to a mapping from the high-dimensional C^(N×N) space to the relatively low-dimensional R^(N×N) space. Hence, the linear transformation of a C^N vector in C^(N×N) space should, in principle, be able to learn the nonlinear transformation of an R^N vector in R^(N×N) space. In particular, the KKA realized by picking the real part of the MVP operation ME maps the linear transformation of the C^N vector in C^(N×N) space (here, ME) to the real-valued nonlinear (activation) and combined linear (weight + sum) transformation of the R^N vector in R^(N×N) space. Moreover, considering that the KKR is a direct consequence of causality, [44] a universal property of nature telling us that the output of a system cannot temporally precede the input, neural networks with KKA might be prone to capturing causal factors, which are not only beneficial for improving the interpretability of existing classical ANNs but also critical for advanced machine intelligence involving causal inference. [45] Ascribing KKA to dimension embedding might be a feasible viewpoint for understanding nonlinear activation in the framework of machine learning.
More exploration of the potential of KKA for causal inference would also be of interest. However, there is still much to do to explore KKA deeply. We would like to emphasize that the strategy inducing the KKA is indeed highly different from that activating classical ANNs implemented on electronic hardware. Such a difference originates from the fact that the PNN is essentially a complex-valued network, while ANNs running on electronic hardware are actually designed in the real-valued domain. Therefore, rigidly constructing apple-to-apple photonic counterparts of the electronic components, e.g., photonic MACs, photonic matrix multipliers, photonic tensor cores, and photonic nonlinear activations, is, at least sometimes, not the most rational route to fully exploit the potential of photons for accelerating ANN tasks. It is of great significance to effectively utilize the dimensional advantages of the photonic network brought by its intrinsically complex-valued nature. KKA itself just provides a good starting point, from which one can employ the phase-amplitude coupling at the input of each hidden layer to endow the PNN with both linear and nonlinear connections; such coupling is promising to further extend to other photonic meshes for linear transformation, e.g., substituting the MZI cells with parity-time (PT)-symmetric couplers, [46] whose gain variation of the waveguide would control the mesh transformation and induce extra nonlinearity. Moreover, it is noteworthy that the photon also possesses several other degrees of freedom, like polarization, frequency, and angular momentum; [47] hence, besides the aforementioned phase and amplitude, correlations among these degrees of freedom may open future opportunities for new optical nonlinear activations.
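The dimension-embedding picture can be made concrete with a toy sketch: the pseudo-real block form turns a complex MVP into a real one, while reading out the real part of ME after an amplitude-phase-coupled encoding yields a map that is nonlinear on R^N. The encoding law used here (phase set equal to amplitude) is a hypothetical stand-in for the KKR-derived coupling, purely for illustration:

```python
import numpy as np

def complex_to_real_mesh(M):
    """Embed a complex N x N matrix into the pseudo-real 2N x 2N block form
    [[Re(M), -Im(M)], [Im(M), Re(M)]], so that a complex matrix-vector
    product becomes a real one on stacked [Re(v); Im(v)] vectors."""
    A, B = M.real, M.imag
    return np.block([[A, -B], [B, A]])

def kka_layer(M, x):
    """Toy KKA-style layer: encode a real input x into a complex field whose
    phase is coupled to its amplitude (here phase = |x|, a stand-in for the
    KKR link), apply the linear complex mesh M, and read out the real part
    as in homodyne detection. Linear in C^N, nonlinear as a map on R^N."""
    E = np.abs(x) * np.exp(1j * np.abs(x))
    return np.real(M @ E)
```

The block embedding reproduces the complex product exactly, while the layer visibly violates homogeneity (doubling the input does not double the output), which is the nonlinearity the mesh can exploit.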

Supporting Information
Supporting Information is available from the Wiley Online Library or from the author.