Artificial Intelligence Accelerators based on Graphene Optoelectronic Devices

Optical and optoelectronic approaches to performing matrix-vector multiplication (MVM) operations have shown great promise for accelerating machine learning (ML) algorithms with unprecedented performance. Incorporating nanomaterials into such systems can further improve performance thanks to their extraordinary properties, but the non-uniformity and variation of nanostructures at the macroscopic scale pose severe limitations for large-scale hardware deployment. Here, we report a new optoelectronic architecture consisting of spatial light modulators and photodetector arrays made from graphene to perform MVM. The ultrahigh carrier mobility of graphene, nearly-zero-power-consumption electro-optic control, and extreme parallelism suggest ultrahigh data throughput and ultralow power consumption. Moreover, we develop a methodology for performing accurate calculations with imperfect components, laying the foundation for scalable systems. Finally, we run a few representative ML algorithms, including singular value decomposition, support vector machine, and deep neural networks, to show the versatility and generality of our platform.


Introduction
The past half-decade has seen unprecedented growth in machine learning (ML) algorithms and their applications. For example, deep neural networks (DNNs) represent state-of-the-art performance in a variety of contexts, such as large-scale computer vision, natural language processing, and data mining. DNNs have also impacted practical technologies such as web search, autonomous vehicles, and financial analysis. 1 However, most ML algorithms have substantial computational and memory requirements, which greatly limit their training and deployment in resource-constrained environments. To address these challenges, there has been a significant trend toward building high-performance application-specific hardware platforms, such as field-programmable gate arrays 2,3 and application-specific integrated circuits. 4 Recent efforts to leverage emerging techniques for efficient ML hardware focus on accelerating the key tensor-level multiply-accumulate (MAC) operations in ML algorithms, which are the most computation-intensive operations. For example, analog DNN hardware focuses on accelerating matrix multiplication, such as matrix-vector multiplying modules, 7 mixed-mode MAC units, 8,9 and memristor-based MACs. 10-12 On the other hand, all-optical and hybrid optoelectronic implementations in early works have offered promising alternative routes to microelectronic implementations, 13-19 because of the advantages of executing MAC operations at the speed of light, high throughput, and very low or even nearly zero power consumption. Recently, an integrated nanophotonic processor based on reconfigurable Mach-Zehnder interferometers at telecommunication wavelengths demonstrated the advantages of optical DNN acceleration, 20,21 where the MVM operation was decomposed into a series of multiplications following singular value decomposition.
Moreover, multiple 3D-printed diffractive optical layers in the terahertz range 22 have shown the capability of performing linear classification, although they are not reconfigurable for new models since weights are physically hardcoded in passive diffractive layers.
In this article, we report a new high-performance optoelectronic architecture for performing general MVM operations by exploiting the extraordinary properties of graphene.
Specifically, the architecture consists of a two-dimensional (2D) array of spatial light modulators (SLMs) and a 2D array of photodetectors with electrically controllable photoresponse, both constructed from the combination of large-scale graphene monolayers and optical metamaterials. Since graphene is gapless, these optoelectronic devices can be tailored to operate over an ultrabroad frequency range. Considering the inevitable non-uniformity of material properties and the associated device variation, especially for large-scale polycrystalline graphene, we evaluate the influence of the various contributing factors and conceive a methodology for performing accurate calculations even with imperfect devices and systems. Finally, we demonstrate a few representative ML algorithms showing the versatility and generality of the hardware platform.
Results and discussion

Figure 1a shows an illustration and the operating principle of the designed architecture for performing a general MVM operation o = Wv, consisting of a 2D array of SLMs for encoding vector information and a photodetector array with tunable photoresponsivity for encoding matrix-element information. The input light is incoherent, such as narrow-band illumination from a halogen lamp with wavelength-selection components, so that no coherent interference effects are involved. An N-dimensional vector v = (v_1, v_2, ..., v_N) is mapped onto one row of SLMs, and the vector information is replicated on the other rows. This replication has two advantages: (1) it removes the need for beam-splitting components that restrict chip integration and complicate optical alignment; (2) it relaxes the requirement for high-quality devices with large-scale uniformity and gives a large tolerance for device variation.
Each electro-optic unit of the SLM array has an electrically controllable optical transmission function T_i(V_gv) encoding the information of v_i, and the input power P_0 is modulated to P_0 T_i(V_gv) after passage. Afterwards, the modulated light is detected by an array of photodetectors, where the photoresponsivity of each element can be electrically controlled. Each element w_ji of the matrix W is encoded as the photoresponsivity R_ji(V_gw) of the corresponding photodetector in the array. As a result, the obtained photocurrent is I_ji = P_0 T_i(V_gv) R_ji(V_gw), and the generated photocurrents are summed across the columns of each row, giving a row output I_j = P_0 Σ_i T_i(V_gv) R_ji(V_gw) ∝ Σ_i w_ji v_i, which is then converted to a voltage for the nonlinear activation function implemented with electronic circuits. Physically, both the optical intensity and the readout photocurrent are always positive quantities. To perform calculations involving both negative and positive real numbers, each element v_i of v and w_ji of W can be represented as a difference of two positive numbers, v_i = v_i^+ - v_i^- and w_ji = w_ji^+ - w_ji^-. Thus, the MVM can be done through four MVM operations whose operands are all positive. In addition, we lay out a design flowchart including electromagnetics (EM) simulation, system abstraction and integration, and performance benchmarking and evaluation: the EM simulation links material properties with device-level response, and the system modeling incorporates the individual device input-output relations to construct the high-level computing architecture.

The detailed implementation and characterization of the arrays of SLMs and photodetectors are summarized in Fig. 2. Figure 2a illustrates the design of the graphene-based SLMs, which consist of a monolayer graphene sheet and an extraordinary optical transmission (EOT) metamaterial on top. 25,26 The EOT metamaterial unit has a 340 nm outer radius (r) and a 50 nm gap (s), and the periodicity (p) of the array is 1 µm. The resulting transmission resonance is positioned around 4.5 µm.
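The decomposition of a signed MVM into all-positive operations can be sketched numerically; the following is a minimal NumPy illustration (the function names are ours, not part of the reported system):

```python
import numpy as np

def split_pos_neg(x):
    """Split a real array into two non-negative parts, x = x_plus - x_minus."""
    return np.maximum(x, 0.0), np.maximum(-x, 0.0)

def signed_mvm(W, v):
    """Compute W @ v through four MVMs whose operands are all non-negative,
    mimicking a system where intensities and photocurrents are strictly positive."""
    Wp, Wm = split_pos_neg(W)
    vp, vm = split_pos_neg(v)
    # Each term below corresponds to one all-positive optoelectronic MVM pass.
    return (Wp @ vp - Wp @ vm) - (Wm @ vp - Wm @ vm)

rng = np.random.default_rng(0)
W = rng.uniform(-1, 1, (8, 8))
v = rng.uniform(-1, 1, 8)
assert np.allclose(signed_mvm(W, v), W @ v)
```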
The graphene layer sits on top of a thin dielectric layer that acts as the insulating layer for electrostatic doping, which controls the graphene Fermi level (E_F) and thereby modifies its optical properties. The scattering rate of graphene is assumed to be 2 meV, corresponding to a carrier mobility of ≈ 6500 cm^2/(V·s) at E_F = 0.5 eV; see Methods for the detailed conversion. Large-scale graphene films of such quality can nowadays be readily obtained using chemical vapor deposition (CVD). 27 The EOT array serves dual purposes: it enhances the light-matter interaction in graphene so that the modulation efficiency is significant, and it serves as the top electrode for electrostatic control.
Underneath each pixel is a transparent electrode, such as a nickel ultrathin film. 28

In contrast to electronic implementations, this graphene optoelectronic architecture has ultrahigh parallelism: all elements of both the vector and the matrix are computed simultaneously. In addition, the ultrahigh carrier mobility of graphene promises fast switching speeds for both the SLMs and the photodetectors, which can readily exceed 1 GHz and even reach tens of GHz. 30 These two factors suggest ultrahigh data throughput for the system. Furthermore, the electrostatic control and tuning of both the SLMs and the photodetectors are especially power efficient, with nearly zero power consumption in the static state and in inference mode. The architecture also has the potential to be integrated into a single chip, thanks to the large-scale CVD growth of graphene and its compatibility with modern micro/nanofabrication processes.
One important issue for emerging architectures built on photonic, and more generally analog, computing is scalability, which is especially notorious when unconventional materials such as graphene are involved. In the current example, there are inevitable device variations and non-uniformity when the array scale is large, due to the polycrystalline nature of graphene and micro/nanofabrication variation. Performing accurate calculations with imperfect components is thus crucial for practical deployment, and a procedure for correcting such imperfections is necessary (Fig. 3a). In our correction procedure, for each row we sweep the applied gate voltages on each unit of the SLM array and the corresponding photodetector pair by pair, and for each pair we sweep the gate voltage of the SLM unit and that of the photodetector unit separately. From the readout, we obtain tuning curves for each unit of the SLM and photodetector arrays. Due to device non-uniformity, the tuning range of each pair can vary, as illustrated in Fig. 3a. Our strategy is to define the minimum tuning range in that row as the physical unit, so that any other reading from the row readout can be converted to an algebraic value by dividing by this unit.
Defining the unit as the minimum tuning range also guarantees that every pair can achieve this range. This methodology highlights the advantage of replicating the vector encoding across the rows of SLMs, through which the correction of each row is independent of the others. In contrast, calibration in structures involving beam-splitting elements is cross-linked between rows and is significantly more complicated. The detailed mathematical analysis and a proof that the correction procedure generates accurate output results are provided in Supporting Information Section 1.
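A minimal sketch of this per-row normalization, with hypothetical off-state and full-on photocurrents standing in for the measured tuning curves:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8  # SLM/photodetector pairs in one row

# Hypothetical tuning-curve endpoints: device variation makes every
# pair's tuning range (i_on - i_off) slightly different.
i_off = rng.uniform(0.00, 0.02, N)  # off-state photocurrent per pair
i_on = rng.uniform(0.90, 1.10, N)   # full-on photocurrent per pair

# The row's physical unit is the smallest tuning range in the row,
# which every pair in the row is guaranteed to reach.
unit = np.min(i_on - i_off)

def to_algebraic(readout, pair):
    """Convert a raw photocurrent readout into units of the row minimum."""
    return (readout - i_off[pair]) / unit

# A pair swept over exactly one unit of its range reads out as 1.0.
assert np.isclose(to_algebraic(i_off[3] + unit, 3), 1.0)
```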
The accuracy of the graphene multiplier is evaluated by comparing its calculation results with those obtained from a standard linear-algebra multiplication function. Figure 3b shows a representative error distribution over 10,000 multiplications of a random 8 × 8 matrix and a random 8 × 1 vector with all elements between −1 and 1. The histogram is fit with a normal distribution, and the standard deviation serves as the figure of merit. Figure 3c displays the standard deviation of the error for degrees of device variation ranging from 0 to 20%, applied to both the SLMs and the photodetectors. With the correction procedure described above, the error is nearly constant, proving the effectiveness of the procedure. Note that the residual error for perfect devices originates from the finite precision of the applied gate voltage, which is assumed to be 8 bit. In addition to the finite precision of the applied gate voltages, the detector readout also has finite precision; for example, commercially available digital CCD cameras in the visible range generally have 10 bit precision. We therefore also investigated the influence of detector bit precision on accuracy, and as shown in Fig. 3d, the error increases drastically at small bit precision (e.g., 5 bit). Finally, we investigated the influence of noise in the system, modeled as Gaussian noise added at the readout end; the noise effect is reflected in the dependence of the error on input power. Note that the detector responsivity has been modeled ideally and in practice can be quite different, so the input power on the x-axis is given in arbitrary units. As expected, as the input power, and thus the signal-to-noise ratio, decreases, the error increases. More error histograms for these contributing variations and noise sources are provided in Supporting Information Section 2.
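The impact of finite bit precision can be emulated in software. The sketch below is our own simplified model (it quantizes the encoded values directly rather than the device tuning curves), but it reproduces the qualitative trend that the error grows sharply at low bit precision:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, bits):
    """Quantize values in [-1, 1] to the given bit precision."""
    levels = 2 ** bits - 1
    return np.round((x + 1) / 2 * levels) / levels * 2 - 1

def mvm_error_std(bits, trials=1000, n=8):
    """Std. dev. of the element-wise MVM error when both operands
    are quantized, emulating finite encoding/readout precision."""
    errs = []
    for _ in range(trials):
        W = rng.uniform(-1, 1, (n, n))
        v = rng.uniform(-1, 1, n)
        errs.append(quantize(W, bits) @ quantize(v, bits) - W @ v)
    return np.std(errs)

# Error increases drastically as precision drops from 8 bit to 5 bit.
assert mvm_error_std(5, trials=200) > mvm_error_std(8, trials=200)
```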
Finally, we use our graphene multiplier to run multiple ML algorithms. We emulated and corrected an 8 × 8 multiplier and established general matrix-matrix multiplication (GEMM) by segmenting the matrix into multiple blocks that fit the dimensions of our multiplier emulator; see Methods for more details. We compare the quality of the results of selected ML algorithms obtained with our GEMM multiplier against the results from a general-purpose processor (GPP), in this work an Intel Xeon Gold 6230. First, we evaluated the graphene GEMM for image reconstruction, in which the image was compressed using singular value decomposition (SVD). The original image, shown in Fig. 4a, was decomposed as image = U · Σ · V^T, where the dimensionalities of image, U, Σ, and V^T are R^(m×n×3), R^(m×p), R^(p×p), and R^(p×n), respectively. Specifically, our experiments were conducted on image ∈ R^(768×512×3) (m = 768, n = 512). Since the top singular vectors capture most of the variation, instead of using all singular vectors in the full SVD product, we reconstructed the image using the top-K singular vectors. The reconstructed image (K = 50) from the GPP (Figure 4b) has the same quality as the image reconstructed using the graphene multiplier (Figure 4c). The second ML algorithm we evaluated with the graphene GEMM is unsupervised learning using the support vector machine (SVM) algorithm on the Blobs dataset. As shown in Figs. 4d and e, the clustering results generated with our GEMM multiplier match the results obtained on the GPP, with a loss difference of < 0.2%. The third task is image classification using multilayer perceptrons (MLPs), where the graphene multiplier achieved 88.7% accuracy on MNIST10 and 76.8% on Fashion-MNIST10. In comparison, the GPP achieved slightly better prediction performance with the same MLP architecture, namely 92.3% and 78.7% accuracy on MNIST10 and Fashion-MNIST10, respectively.
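The top-K reconstruction used in the image benchmark can be sketched as follows (a random matrix stands in for one color channel; the K values are illustrative):

```python
import numpy as np

def svd_topk(channel, k):
    """Reconstruct a matrix from its top-k singular vectors."""
    U, s, Vt = np.linalg.svd(channel, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
img = rng.uniform(0.0, 1.0, (768, 512))  # stand-in for one color channel

def rel_err(k):
    """Relative Frobenius-norm reconstruction error with top-k vectors."""
    return np.linalg.norm(svd_topk(img, k) - img) / np.linalg.norm(img)

# Keeping all singular vectors reproduces the matrix exactly,
# and the error shrinks as K grows.
assert np.isclose(rel_err(512), 0.0, atol=1e-8)
assert rel_err(50) < rel_err(10)
```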
While we demonstrated that the graphene GEMM multiplier achieves results similar to GPPs for the first two ML algorithms, there are noticeable accuracy degradations for the image-classification tasks using MLPs. We find that these degradations are mainly caused by the initialization and training algorithms, which leave the learned MLP parameters very small, with a mean close to zero. Given such a parameter distribution, the inevitable errors of the graphene multiplier associated with noise and finite precision become more noticeable than in the other applications. We expect, however, that the impact of the errors introduced by the graphene multiplier will be much smaller for more robust and larger neural network architectures.
In summary, we report a new high-performance optoelectronic architecture for performing general MVM and GEMM operations by exploiting the extraordinary properties of graphene.
Specifically, this architecture consists of 2D arrays of SLMs and photodetectors with electrically controllable transmission and photoresponse, both constructed from the combination of large-scale graphene monolayers and optical EOT metamaterials. The system promises ultrahigh data throughput and ultralow power consumption because of the extreme parallelism of the architecture, the ultrahigh carrier mobility of graphene, and electrostatic control. From the perspective of practically deploying a large-scale system, we design a methodology for performing accurate calculations with imperfect devices and systems and evaluate the influence of imperfections, considering the inevitable non-uniformity of material properties and the associated device variation. Finally, we demonstrate a few ML algorithms showing the versatility and generality of the hardware.

Graphene model in Lumerical FDTD
The graphene monolayer is modeled as a 2D conducting sheet from the Lumerical material library, including both interband and intraband contributions. The Fermi level and scattering rate are the two parameters used to calculate the dielectric constants for Lumerical. The scattering rate used in Lumerical can be converted to a mobility as follows. The damping constant γ = qħv_F^2/(µE_F) is twice the scattering rate set in the Lumerical library, where q = 1.6 × 10^−19 C is the electron charge, ħ = 1.05 × 10^−34 J·s is the reduced Planck constant, v_F = 1 × 10^6 m/s is the Fermi velocity, µ is the carrier mobility, and E_F is the Fermi level. Thus, in this study, a 2 meV damping constant, corresponding to a 1 meV scattering rate in the Lumerical setting, is used, which is equivalent to a carrier mobility of ≈ 6500 cm^2/(V·s) at E_F = 0.5 eV.
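Rearranging the damping-constant relation for µ gives a quick numerical check (a small Python computation, not part of the Lumerical workflow, using the constants quoted above):

```python
# Carrier mobility from the damping constant, mu = q*hbar*v_F**2 / (gamma*E_F).
q = 1.6e-19        # electron charge, C
hbar = 1.05e-34    # reduced Planck constant, J*s
v_F = 1e6          # Fermi velocity, m/s
E_F = 0.5 * q      # Fermi level of 0.5 eV, in J
gamma = 2e-3 * q   # 2 meV damping constant, in J

mu = q * hbar * v_F**2 / (gamma * E_F)  # carrier mobility, m^2/(V*s)
mu_cm2 = mu * 1e4                       # ~6.5e3 cm^2/(V*s), as quoted above
```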

Device response fitting and variation modeling
The simulated transmission of the SLMs and absorption of the photodetectors, obtained from EM simulations as a function of the Fermi level, are fit with 2nd-order polynomials; concretely, T(V_g) = a_2 V_g^2 + a_1 V_g + a_0, and analogously for the responsivity R(V_g). Device variation is modeled by perturbing each fitting parameter a as ã = −pXa + (1.0 + p/2)a, where X is a uniformly distributed random number between 0 and 1 generated for each unit and p denotes the strength of the variation; ã then lies between (1.0 − p/2)a and (1.0 + p/2)a. A 20% variation means p = 0.2, and X is generated independently for each unit of the array.
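The variation model above can be sketched directly (a minimal NumPy illustration; the function name is ours):

```python
import numpy as np

def perturb(a, p, rng):
    """Apply the variation model a_tilde = -p*X*a + (1 + p/2)*a with
    X uniform on [0, 1], so a_tilde lies in [(1 - p/2)*a, (1 + p/2)*a]."""
    X = rng.uniform(0.0, 1.0, size=np.shape(a))
    return -p * X * a + (1.0 + p / 2) * a

rng = np.random.default_rng(0)
a = np.ones(10000)
a_t = perturb(a, 0.2, rng)  # 20% variation, i.e. p = 0.2

# All perturbed values stay within the stated +/- p/2 band.
assert np.all(a_t >= 0.9) and np.all(a_t <= 1.1)
```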

Implementation of general matrix multiply (GEMM)
GEMM is a common algorithm in linear algebra, machine learning, statistics, and many other domains. Implementations typically use blocking, inner products, outer products, and systolic-array techniques, which break the GEMM computation down to better utilize vector multiplication or MVM. Specifically, for this work, we develop an optoelectronic GEMM based on the proposed optoelectronic MVM, where we decompose the target matrices into block matrices (also known as block partitioning). GEMM is then implemented recursively using a divide-and-conquer algorithm, which is used to execute the ML algorithms discussed in Fig. 4.
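A simple (non-recursive) sketch of block-partitioned GEMM built from an 8 × 8 MVM primitive, with a plain matrix product standing in for the optoelectronic unit:

```python
import numpy as np

BLOCK = 8  # dimension of the emulated optoelectronic MVM unit

def mvm8(W, v):
    """Stand-in for the 8x8 optoelectronic MVM primitive."""
    return W @ v

def block_gemm(A, B):
    """GEMM assembled from 8x8 MVM calls via block partitioning.
    For brevity, dimensions are assumed to be multiples of BLOCK."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2 and m % BLOCK == 0 and k % BLOCK == 0 and n % BLOCK == 0
    C = np.zeros((m, n))
    for i in range(0, m, BLOCK):          # block-rows of A
        for j in range(n):                # one output column at a time
            for l in range(0, k, BLOCK):  # accumulate over block products
                C[i:i + BLOCK, j] += mvm8(A[i:i + BLOCK, l:l + BLOCK],
                                          B[l:l + BLOCK, j])
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((16, 24))
B = rng.standard_normal((24, 32))
assert np.allclose(block_gemm(A, B), A @ B)
```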

Training of ML algorithms
Autograd is a reverse-mode automatic differentiation system, which records a graph representation of all the operations that encode the input-output mappings of ML models. The result is a directed acyclic graph whose leaves are the input tensors and whose roots are the output tensors. By tracing this graph from roots to leaves, gradients can be computed automatically via the chain rule for gradient-based backpropagation algorithms. Since the evaluated ML algorithms are all implemented with GEMM operators, we can simply construct the autograd graphs using the PyTorch autograd mechanism and deploy the gradient-descent algorithm Adam to train the ML models against a given loss function. Specifically, we use a mean-square-error loss to train the SVM-based clustering application and a negative-log-likelihood loss to train the MNIST10 and Fashion-MNIST10 classification tasks. Our Adam settings are learning rate lr = 0.1, β_1 = 0.9, β_2 = 0.999, ε = 10^−8, and no L2 penalty. To evaluate the final prediction performance of these ML algorithms on the proposed optoelectronic GEMM architecture, we replace the PyTorch matrix-multiply functions with our GEMM algorithm.
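For reference, a single Adam update with the hyperparameters listed above can be written out explicitly (a NumPy sketch on a toy mean-square-error problem, not the PyTorch optimizer itself):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with the hyperparameters used in the text."""
    m = b1 * m + (1 - b1) * grad            # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - b1 ** t)               # bias correction
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy mean-square-error problem: fit theta to a target vector.
target = np.array([1.0, -2.0, 0.5])
theta = np.zeros(3)
m = v = np.zeros(3)
for t in range(1, 201):
    grad = 2.0 * (theta - target)  # gradient of ||theta - target||^2
    theta, m, v = adam_step(theta, grad, m, v, t)

assert np.allclose(theta, target, atol=0.2)
```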