Learned Integrated Sensing Pipeline: Reconfigurable Metasurface Transceivers as Trainable Physical Layer in an Artificial Neural Network

Abstract The rapid proliferation of intelligent systems (e.g., fully autonomous vehicles) in today's society relies on sensors with low latency and computational effort. Yet current sensing systems ignore most available a priori knowledge, notably in the design of the hardware level, such that they fail to extract as much task‐relevant information per measurement as possible. Here, a “learned integrated sensing pipeline” (LISP), including in an end‐to‐end fashion both physical and processing layers, is shown to enable joint learning of optimal measurement strategies and a matching processing algorithm, making use of a priori knowledge on task, scene, and measurement constraints. Numerical results demonstrate accuracy improvements around 15% for object recognition tasks with limited numbers of measurements, using dynamic metasurface apertures capable of transceiving programmable microwave patterns. Moreover, it is concluded that the optimal learned microwave patterns are nonintuitive, underlining the importance of the LISP paradigm in current sensorization trends.


I. Analytical Forward Model of the Physical Layer
The intrinsic sub-wavelength nature of the radiating metamaterial elements enables a compact description in terms of coupled dipoles [1][2][3][4] . Ultimately, we will directly link a given metasurface configuration (specifying which elements are radiating) to the radiated field.
Note that the possibility to analytically do so is a key advantage of this metasurface hardware over alternative devices with reconfigurable radiation pattern such as leaky reconfigurable wave-chaotic three-dimensional cavities [5][6][7] . For the latter, one could learn forward models based on near-field scans of radiated fields [8,9] ; yet the number of required equivalent source dipoles [10] (even with a coarse half-wavelength sampling of the aperture) is much higher than in our case where it is simply N. The ability of the coupled-dipole model that we outline in the following to predicted the wave patterns radiated by metasurface transceivers has been verified through full-wave simulations [11] and more recently in a simplified setup (slots instead of cELC metamaterial elements) also experimentally [12] .
Dipole Moments. -Generally speaking, a point-like scatterer's magnetic dipole moment m is related to the incident magnetic field via the scatterer's polarizability tensor ̿: where r denotes the scatterer's location. In our case, using surface equivalent principles, an effective polarizability tensor of the metamaterial element embedded in the waveguide structure can be extracted [13] . For the metamaterial element that we consider, only the component is significant. In the following, we use ( ) which is a typical polarizability value for a tunable cELC element [13] . The tuning state of a given metamaterial element can be encoded in its polarizability: if the element is "off", its polarizability (at the working frequency) is zero and hence it does not radiate any energy. We thus use as effective polarizability, where { } is the tuning state of the jth metamaterial element in the ith metasurface configuration.
The local magnetic field that excites the metamaterial element is a superposition of the feed wave and the fields scattered off the other metamaterial elements. in Equation S1 is thus itself a function of the metamaterial elements' magnetic dipole moments, yielding the following system of coupled equations [11] :  Figure 2: the diagonal of G is zero since the self-interaction terms are incorporated into the effective polarizabilities in .
Due to the metamaterial element interactions via the ( ) term, the mapping from tuning state to dipole moments is not linear. This is visualized with an example in Figure S2.
Thus, ultimately the mapping from tuning state to radiated field cannot be linear, which is a substantial complication for most beam synthesis approaches; in our LISP this does not pose any additional complication.
Propagation to Scene. -We now identify the wavefront impacting the scene for a given metasurface configuration. The sensing setup we consider is depicted in Figure 1b. A transmit (TX) metasurface illuminates the scene. The scene, in our case, consists of a planar metallic digit of size in free space that is to be recognized at a distance of 1 m.
To compute the ith incident TX wavefront (corresponding to the ith metasurface configuration) at a location in the scene, we superimpose the fields radiated by each of the N metamaterial element dipoles [10,14] : This is the third and final step of our analytical forward model, shown as Step 1C in Figure 2.
Note that the scene ( ) is ultimately sampled by ( ) ( ) ( ); when loosely referring to "scene illumination" in this work, we mean this product of ( ) and ( ). In practice, to compute the integral, we discretize the 2828 reflectivity map of the scene (the separation of adjacent pixel centers is half a wavelength, there is no gap between the edges of neighboring pixels) with one point per pixel, as seen in Figure 1b. Each point's reflectivity value ( ) is a gray-scale real value determined by the corresponding handwritten digit's reflectivity map.
The analytical expression for ( ) reads [11] ( ) where is the imaginary unit, is the amplitude of the electric line source generating the feed wave (taken to be 1 A in our work), is the propagation constant of the fundamental mode inside the waveguide, ( ) is a first order Hankel function of the second kind, is the position of the source, and is the circumferential angle around the source measured from the -axis. The space inside the waveguide we consider is empty (not filled with a substrate).
The inter-element coupling ( ) consist of two componentsinteractions via the waveguide (WG) and interactions via free space (FS): The waveguide interaction component reads [11] ( where and is the angle of the vector from to , and is the height of the waveguide. We use in our simulations. The free space interaction is given by [14,15] where and ( ) . The factor 2 in the expression for ( ) originates from self-images of the metamaterial elements due to the waveguide's upper metallic plate.

II. Further Details of Artificial Neural Network Algorithm and Parameters
In this section, for the sake of reproducibility, we detail the implementation of our method using a low-level API of the open source machine learning library TensorFlow [16] . The ANN architecture has been described in the main text. The measurements are fed into a first fully connected layer consisting of 256 neurons with ReLu activation. This is followed by a second fully connected layer made up of 10 neurons with SoftMax activation, yielding a normalized probability distribution as output. The highest value therein corresponds to the classification result (one digit between "0" and "9"). This architecture was chosen without much optimization and still performs quite well; its performance was observed to not significantly depend on the chosen hyperparameters, such as the number of neurons. We use the float32 and complex64 data types for real and complex valued variables, respectively.
Since the metamaterial elements are separated by at least a wavelength, the arguments of the Hankel functions in the expression for and are always above unity ( ).
We thus evaluate these Hankel functions with the following large-argument approximation ( ): We initialize the bias variables ( ( ) and ( ) in Figure 2 of the main text) as zero.
Weight variables are initialized with a truncated normal distribution 1 ; mean and standard deviation of the latter are zero and 0.12, respectively, for the digital weight variables ( ( ) and ( ) in Figure 2 of the main text) and zero and 0.2, respectively, for the physical weight variables .
To avoid issues related to division by zero, we use { } instead of { } as th diagonal entry of A, with being four orders of magnitude smaller than . A simple sanity check to confirm this procedure is presented in Figure S3. The magnetic dipole moments corresponding to a binary (random) metasurface configuration are computed and the result is compared to that obtained if elements in the "off" state are simply omitted. Ideally, the same magnetic dipole moments are computed for the "on" elements in both scenarios, which is indeed the case, as shown in Figure S3.
Training Algorithm and Parameters. -We use the MNIST dataset of handwritten digits [17] which consists of 60,000 samples of 28×28 gray-scale images of handwritten digits from 0 to 9 and 10,000 further such samples to test the learned network. For training, we split the training dataset and use 15% for validation purposes. Using a batch size of and the Adam method for stochastic optimization [18] with a step size of , we train the weight variables of the network based on the training dataset, using the cross entropy between the known and predicted labels as error metric to be minimized. Every 50 iterations we compute the accuracy achieved on the validation dataset. We define a patience parameter to avoid overlearning on the training dataset: if the validation accuracy has not improved for seven consecutive times, we stop training. Finally, we evaluate the achieved accuracy on the completely unseen test dataset. This is the accuracy we report in the main text.
Since our measurements are complex-valued, we stack real and imaginary parts of the measurements in a real-valued vector M; then we feed M into the first fully connected layer. A subtle but crucial detail is the need to normalize M. If its standard deviation is several orders of magnitude above or below unity, training the weights of the fully connected layers may take much longer or be unfeasible, in particular in combination with the temperature parameter (see below). Hence, we first identify a normalization constant as the average of the standard deviation of M obtained for ten different random TX and RX masks (obeying the chosen on/off constraint) using the training dataset. We then always divide M by that fixed normalization constant in the subsequent training, validation and testing. In other words, the normalization constant is fixed once and for all based on the training dataset.
The employed activation functions are defined for real-valued inputs as follows. A "rectified linear unit" (ReLu) activation function returns 0 if its argument is negative, and otherwise the argument is returned unaltered: A SoftMax activation normalizes a -element vector into a probability distribution consisting of probabilities whose sum yields unity: Imposing a Discrete Distribution of Weights. -In this section, we provide the technical details of how we restrict the physical layer weights to be chosen from a list S of predefined discrete values. For instance, the reconfiguration mechanism of the metamaterial elements may leverage PIN-diodes such that a given element is either resonant or not at the working frequency , which corresponds to a binary on/off amplitude modulation: S ={0,1}. Our procedure can also be directly applied to gray-scale tuning; for instance, in the case of a phased-array with 2-bit phase modulation we would use S ={1,i,-1,-i}. (Preliminary tests showed little gains in the sensing performance with gray-scale tuning compared to amplitudebinary tuning.) The ANN weight variables are, however, trained by back-propagating errors [19] which relies on computing gradients. This approach is thus, at first sight, incompatible with variables that can only take a few discrete values. An elegant solution to this apparent problem consists in using a temperature parameter [20] . The general idea is to start with a continuouslydistributed weight variable and to gradually force its distribution to become more and more discrete over the course of the iterations. By the end of the training, the weights are then effectively discrete and chosen from S.
We define a scale factor that increases according to a quadratic schedule with the number of iterations : where we use . For every variable to be trained, we introduce an -element vector . We then multiply the absolute value of with the scale factor and apply a SoftMax operation: ( ). We go on to define as a sum of its The training convergence for a sample realization with N=64 and M=5 is shown in Figure S4. We display the accuracies evaluated in intervals of 50 iterations on the training batch to be used next (blue) and the validation dataset (red). Recall that the accuracy reported in the main text is not plotted in this figure, since it is evaluated on a test dataset that was in no way used during training. The training is stopped after 69 epochs in the example in Figure   S4. While the validation accuracy saturates after roughly 20 epochs, we continue the training for much longer since at this stage the physical weights are not binary yet. Over the course of the remaining iterations, the physical weights become more and more binary, and consequently physical and digital weights are continuously re-adjusted to ensure the high validation accuracy is maintained. A striking detail is the apparent periodicity of a dip in the blue training accuracy curve. This is easily understood: the training batches (100 samples per batch) cycle through the training data (51,000 samples), such that one batch with data that is somewhat harder to classify is re-seen after 510 iterations. Since we probe the accuracy only in intervals of 50 iterations, it will take 50/(510 mod 50) = 5 epochs until the same 'hard' batch is featured again on our plot. coupling. Traditional approaches to tackle such NP-hard combinatorial optimization problems (even without the coupling constraint) include Gerchberg-Saxton algorithms [21] and semidefinite programming tools.
Here, we choose a much simpler approach resembling "adjoint"-based inverse-design methods [22,23] . We take the analog part of our pipeline in Figure  In order to identify M illuminations with minimal mutual overlap, we first compute the overlap matrix whose ( )th entry is ( ( ) ( )), where is the overlap function as defined in Equation S1 of the main text. Next, we compute the average mutual overlap as the average of the off-diagonal entries of and use that as cost function to be minimized: 〈 〉 .
In order to identify an illumination pattern that matches a given PCA mode P as closely as possible, we compute the overlap of the scene illumination with P and use its inverse as cost function to be minimized: ( ,P). Minimizing this cost function will maximize the resemblance of to P.
Using the temperature parameter as before ensures that during training the weights are carefully driven towards a binary distribution. This simple approach to deal with constraints and even coupling effects, only requiring the formulation of an analytical forward model, may also prove useful in other constrained physical inverse design problems, from nano-photonic inverse design [24] and optical wavefront shaping with digital micromirror devices [25] via beam synthesis with phased-arrays in the microwave domain [26,27] to infrared metamaterial phase holograms [28] . Note that we have an analytical forward model and borrow efficient error backpropagation tools (developed for neural network training) to solve a constrained inverse problemin contrast to other recent efforts to solve continuous inverse problems by first training an ANN to approximate a forward model [8,9] .
In passing, we thus introduce a simple constrained inverse configuration-design paradigm for dynamic metasurface apertures and show how it enables beam synthesis or the radiation of orthogonal patterns, as opposed to the (thus far) conventionally use of random patterns [29,30] .
While the hardware is the same, the identification of appropriate metasurface configurations is an additional one-off effort. With a modified cost-function, one can also identify settings for scene illuminations with custom-tailored speckle statistics, which may drastically improve the efficiency of computational microwave ghost imaging [31][32][33] .

III. Further Analysis of the LISP
A natural question that arises (albeit irrelevant for practical applications) is whether the other benchmark illumination schemes (random, orthogonal, PCA-based) will eventually, using more measurements, be able to perform as well as our LISP. For the N=16 case, we thus evaluated the average classification accuracy also for M=N and M=2N. Only the LISP curve had saturated; the other benchmarks accuracies were still slightly improving between M=N and M=2N and were still somewhat below the LISP performance. Since the "scene illumination" ( ) ( ) ( ) depends on the configuration of two metasurfaces with N tunable elements each, we expect that all schemes perform equally well only once .
We observe no significant difference in the performance across different metasurface realizations (i.e. different random locations of the metamaterial elements). The LISP scene illuminations overlap around 65% across different realizations with the same metasurface, indicating that the optimization space contains numerous almost equivalent local minima.
Remarkably, we never seem to get stuck in inferior local minima.
Robustness outside nominal conditions. -We investigate how robust the sensing performance is outside the nominal conditions, i.e. over a given set of parameter variations.
Here, we consider variations of the metamaterial elements' polarizability; due to fabrication tolerances of electronic components such as the PIN diodes, the experimental polarizability is expected to differ across metamaterial elements by a few percent from the value extracted via full-wave simulation based on the element's design [13] . With M=5 and N=64, we first train our LISP as before. Then, for each metamaterial element, we replace the true polarizability value by a different ( ), where is white noise; real and imaginary parts of are identically distributed with zero mean and standard deviation , so the standard deviation of (the size of the cloud in the complex plane) is √ .
In Figure S5 we Effect of element failure. -We also consider the case in which one or several metamaterial elements fail, that is for the failing elements to mimic the loss of tunability. Without any measures to counteract the failure, the performance drops on average as shown in Figure S6. The standard deviation, as evidenced by the shaded area, is relatively large and in some cases the classification accuracy may be almost unaltered. This is expected when the polarizability of the failing elements was close to zero anyway or when they were predominantly configured into the "off" state.
It is important to note that Figure S6 considers the worst-case scenario of element failure without any compensation. A first compensation could consist in characterizing the new perturbed scene illuminations to then retrain the digital weights accordingly. A more advanced compensation would require the identification of the failing elements (which is relatively easy in practice) to then retrain the entire LISP and update both analog and digital weights. In the latter case, almost no loss in performance is expected: essentially, the situation is then equivalent to using a metasurface with fewer metamaterial elements and Figure 4 in the main text has shown that reducing the number of elements from 64 to 16 affects the achievable performance only slightly.
Finally, note that this discussion is equally valid for any of the other benchmark schemes since failing elements would result in scene illuminations different from the assumed ones in these schemes, too.
Effect of measurement noise. -Finally, we explore the performance of our LISP in the presence of measurement noise. To that end, we add white noise to the measurement vector M before feeding it into the digital layers. This, again, constitutes a worst-case scenario in which noise was not anticipated during LISP training. Otherwise, in line with the discussion in Section V, one can train the LISP to be robust to a certain level of noise by adding noise to the training dataset. Figure S7 shows that a LISP trained for noise-free operation yields satisfactory results for SNR levels above roughly 20 dB. All illumination benchmarks are expected to experience a degradation in performance in the presence of noise.

IV. Alternative Reconfigurable Hardware
Dynamic metasurface apertures similar to the one we consider in Figure 1a of the main text have previously been used to generate random scene illuminations for computational imaging [29,30] . In our present work, we take full advantage of the ability to individually control each radiating element to purposefully shape the scene illuminations. The individual addressability of each metamaterial element distinguishes these dynamic metasurfaces from other designs in the literature that simultaneously reconfigure all elements to redirect a beam [34][35][36] .
Our LISP could of course also be implemented with other wavefront shaping setups such as arrays of reconfigurable antennas [37] , reflect-arrays illuminated by a separate feed (e.g. a horn antenna) [38][39][40] or leaky-wave antennas with individually controllable radiating elements [41,42] . Indeed, if our concept was transposed to the acoustic or optical regime, one would probably use an array of acoustic transceivers or a spatial light modulator [43] , respectively. In the microwave domain, however, antenna arrays are costly and the use of reflect-arrays yields bulky setups. In contrast, dynamic metasurface hardware as the one in Figure 1a of the main text only requires a single feed since it benefits from its inherent analog multiplexing and is moreover compact, planar and can be fabricated using standard PCB processes.

V. Outlook
In spite of preliminary results suggesting that our scheme's performance is robust to parameter variations (see Section III in the Supporting Information), it is worth considering how one would deal with significant fabrication inaccuracies in experimental metasurfaces.
Should one or several parameters of the metasurface design, such as polarizabilities or locations of the metamaterial elements, turn out to significantly vary due to fabrication issues, an additional calibration step could easily learn the actual parameter values of a given fabricated metasurface. The spirit of this procedure is similar to the way in which we achieved beam synthesis (see Section II in the Supporting Information) using only the analog part of our pipeline. The parameters to be determined are declared as weight variables to be learned during training and initialized with their expected values. Then, experimentally the radiated fields for a few different metasurface configurations are measured and a cost function is defined to minimize during training the difference between the measured and expected radiated fields. By the end of the optimization, a set of parameter values will have been learned that optimally matches the experiment.
Furthermore, for practical applications it may be of interest to achieve a certain robustness to fluctuations in the calibration parameters such as geometrical details of the experimental setup [44,45] . By including random realizations of major calibration parameters within the expected range of fluctuation during training, the network can learn to be invariant to these variations [44] . Ultimately, this enables the transfer of knowledge learned on synthesized data to real-life setups without additional training measurements. For a specific task such as hand gesture recognition, synthetic scenes can easily be generated with appropriate 3D modelling tools.
A conceptually exciting question for future research is how we can conceive a LISP that is capable of making on-the-fly decisions about the next optimal scene illumination, taking into account the available knowledge from previous measurements [46] .
Interestingly, our LISP can also be interpreted in light of recent efforts to meet the exploding computational demands of ANNs with wave-based analog computing, which seeks to perform desired operations with waves as they interact with carefully tailored systems [47][48][49][50][51][52][53] .
Our wave-based sensing scheme is essentially a hybrid analog-digital ANN in which the interaction of learned optimal wavefronts with the scene acts as a first processing layer. As we have shown, data acquisition fulfills two processing tasks in our LISP, being (i) trainable and (ii) highly compressive. In the future, the digital electronic structures used to implement the fully connected processing layers may be replaced by wave-based analog devices, as proposed in Refs. [48,49,52] . Figure S1. Taxonomy of illumination strategies in wave-based sensing in terms of the a priori knowledge they make use of. Schemes using generic (random or orthogonal) scene illuminations ignore any available a priori knowledge. Illuminations based on a principle component analysis (PCA) of the scene include knowledge about the scene but not about the hardware constraints or task to be performed. Our proposed Learned Integrated Sensing Pipeline (LISP) makes use of all available knowledge by integrating measurement and processing into a unique ML pipeline.      The LISP is trained for noise-free operation with N=64 and M=5. Then, its performance is evaluated after adding different amounts of white noise to the measurements obtained with the unseen test dataset. The signal-to-noise ratio (SNR) of the measurements is indicated on the horizontal axis. The purple curve shows the average over 100 realizations of the measurement noise, the standard deviation is negligible and hence not visible.