HyperSense: Hyperdimensional Intelligent Sensing for Energy-Efficient Sparse Data Processing

We introduce HyperSense, a co-designed hardware and software system that efficiently controls the data generation rate of Analog-to-Digital Converter (ADC) modules based on predictions of object presence in sensor data. To address the challenges posed by growing sensor counts and data rates, HyperSense reduces redundant digital data using an energy-efficient low-precision ADC, lowering the cost of machine learning systems. Leveraging neurally-inspired HyperDimensional Computing (HDC), HyperSense analyzes raw low-precision sensor data in real time, offering advantages in noise tolerance, memory-centric computation, and real-time learning. The proposed HyperSense model combines high-performance software for object detection with real-time hardware prediction, introducing the novel concept of Intelligent Sensor Control. Comprehensive software and hardware evaluations show that our solution achieves the highest Area Under the Curve (AUC) and the sharpest Receiver Operating Characteristic (ROC) curve among lightweight models. On the hardware side, our FPGA-based domain-specific accelerator tailored for HyperSense achieves a 5.6x speedup compared to YOLOv4 on NVIDIA Jetson Orin while showing up to 92.1% energy saving compared to the conventional system. These results underscore HyperSense's effectiveness and efficiency, positioning it as a promising solution for intelligent sensing and real-time data processing across diverse applications.


I. INTRODUCTION
Ubiquitous sensors, witnessing exponential growth in numbers and data generation rates, pose formidable challenges for existing processing methods due to algorithmic and architectural limitations. In the Internet of Things (IoT), the use of machine learning algorithms for sensor data analysis often leads to compatibility issues, given the escalating data volume requirements. These challenges stem from sensor computing demands, redundant information transmission, and computationally intensive analyses, resulting in time and energy inefficiencies.
In contrast to today's sensors' dense data generation, biological sensors operate on a significantly smaller scale, generating five orders of magnitude less data through intelligent sensing [25], [26]. Figure 1 illustrates the existing sensing systems, where high-precision Analog-to-Digital Converters (ADC) produce highly dense data. Despite proposed intelligent sensing approaches to mitigate massive data generation costs [9], [22], [24], none have explored controlling high-precision ADC computation activity using low-precision ADC data, known for its energy efficiency [29]. Our proposed Intelligent Sensor Control introduces a novel concept, leveraging real-time feedback from a machine learning algorithm to ensure sensors generate only necessary data for final analysis, focusing on or producing relevant data for learning purposes.
Challenges in deploying existing deep learning models on or near sensors, including memory consumption and energy-intensive computations, necessitate a paradigm shift. Even lightweight models like YOLOv4 tiny [4] struggle with radar data [36]. Our objective is to address these challenges through an intelligent, robust, and efficient framework that represents and analyzes raw sensor data. Moreover, existing learning models struggle to handle raw, noisy low-precision sensor data [5], [39], translating into a non-straightforward integration with sensors. By redesigning machine learning algorithms using neurally-inspired HyperDimensional Computing (HDC) [14], we aim to achieve real-time performance with noisy data. HDC, mimicking brain functionalities, offers advantages in efficiency and noise-tolerant computation [15], [38].
In this paper, we propose fundamental changes to make sensing systems intelligent for various applications, aiming for four orders of magnitude data reduction through bio-inspired approaches. Our approach draws inspiration from the human visual system, leveraging HDC as an alternative computing method that processes cognitive tasks robustly and efficiently. Figure 2 highlights characteristics of HDC that enhance intelligent sensing. Our goal is to render AI more accessible to a wide array of sensing devices by addressing the digital data deluge issue, making sensing systems efficient through HDC's bio-inspired computational model.
Our work is fundamentally novel and provides the following contributions:
• To the best of our knowledge, we propose a completely novel concept in intelligent sensing: Intelligent Sensor Control. Unlike previous works on intelligent sensing that focus on compressing data, our solution selectively generates data, leading to a substantial cost reduction in scenarios where the activity of interest is infrequent.
• We propose the HyperSense model, which performs HDC-based object detection to enable Intelligent Sensor Control based on visual sensing data. We also study the characteristics of the HyperSense model by thoroughly exploring multiple hyperparameters to identify the parameters best suited to the sensing scenario.
• We design an FPGA-based domain-specific accelerator targeting HyperSense.To improve the sensing throughput, we customized the computing data path to reuse computations.Our evaluation results show that the FPGA implementation of HyperSense achieves on average 5.6× speedup compared to YOLOv4 running on NVIDIA Jetson Orin, while also delivering improved sensing accuracy with up to 92.1% energy saving compared to the conventional system.
II. RELATED WORKS
Intelligent Sensing. With the improvements in sensor technology over the past decades, computational methods have evolved to help us gather meaningful information from raw sensor data [2]. Multiple designs from different perspectives have been proposed to improve data sensing efficiency, including sensor materials [2], sensor circuits [20], in-sensor accelerators [1], [10], [23], and near-sensor accelerators [32], [37]. The key idea of in-sensor and near-sensor acceleration is to integrate machine learning computing circuits into the sensing circuit to improve data processing efficiency. The supported machine learning kernels include histogram of oriented gradients (HOG) [23], matrix-matrix multiplication [10], and convolution [1]. Although these in-sensor and near-sensor acceleration techniques have significantly improved sensing efficiency, they fail to consider system-level integration. Recently, in the computer systems and architecture communities, multiple system-level intelligent sensing frameworks have been proposed [9], [22], [24]. Specifically, the work in [9] focuses on multi-model computing (M$^2$C) system integration. CAMJ [24] proposes an architecture-level modeling framework for CMOS Image Sensors (CIS). LeCA [22] proposes an in-sensor image compression accelerator to balance the backend computer vision (CV) model and sensor-capturing image quality. Although all previous circuit-level and system-level intelligent sensing works mention the importance of the ADC in the whole sensing system, none of them provide machine learning solutions to optimize the ADC computation activity.
Hyperdimensional Computing. Brain-inspired hyperdimensional computing (HDC) is based on the understanding that brains compute with patterns of neural activity that are not readily associated with numbers. Due to the huge size of the brain's circuits, neural patterns can be modeled with hypervectors [15]. HDC builds upon a well-defined set of operations with random hypervectors, is extremely robust in the presence of failures, and offers a complete computational paradigm that is easily applied to multiple learning problems, such as speech recognition [12], genome sequence alignment [18], graph learning [16], [28], and computer vision [7], [8]. Although HDC-based machine learning models have shown high memorization capability, strong robustness against noise, and natural model interpretability, none of the previous HDC works have tried to solve object detection problems in autonomous driving systems. In this work, we use the HDC model to detect objects in radar imaging datasets.

III. HyperSense MODEL DESIGN
A. HDC Basics
The fundamental representational unit of HDC is called a hyperdimensional vector, or hypervector. A hypervector $\vec{H}$ is a vector in $\mathbb{R}^D$ with high dimensionality $D$. Hypervectors are compared to each other by a similarity function $\delta$. Utilizing the similarity measure, HDC can facilitate cognitive tasks such as memorization, classification, clustering, and more. HDC frameworks designed to support these tasks rely on three fundamental HDC operations that directly correspond to brain functionalities: bundling, binding, and permutation. Details on each operation are as follows:
1) Bundling: this operation, denoted as $+$, is typically implemented as element-wise addition. [...]
2) Binding: [...] From a cognitive perspective, it can be interpreted as association.
3) Permutation: this operator, denoted as $\rho$, is typically implemented as a rotation of vector elements. Generally, $\delta(\rho(\vec{H}), \vec{H}) \simeq 0$. The permutation is usually used to encode order in sequences.
Using the three basic HDC operations enables a hyperdimensional learning framework for many different tasks. [...] Based on these class hypervectors, we can start retraining. For each data point to retrain $\vec{x}_i$, each class hypervector is updated as follows: [...], and $\eta$ is the learning rate.
3) Inference: After the class hypervectors $\vec{C}_i$ are updated by the initial training and retraining, a given query $\vec{q} \in D$ can be classified in a straightforward way. Class $i$ is considered to be the predicted class when $\delta(\vec{C}_i, \vec{\phi}(\vec{q})) > \delta(\vec{C}_j, \vec{\phi}(\vec{q}))$ is satisfied for all $j \neq i$.
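The inference rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes $\delta$ is cosine similarity (the accelerator section mentions a cosine similarity check), and the function names `cosine_similarity` and `predict_class`, as well as the toy four-dimensional hypervectors, are hypothetical. The query is assumed to be already encoded.

```python
import math

def cosine_similarity(a, b):
    """Similarity function delta, assumed here to be cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def predict_class(class_hypervectors, encoded_query):
    """Return the index of the most similar class hypervector and all scores."""
    scores = [cosine_similarity(c, encoded_query) for c in class_hypervectors]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores

# Toy example: two class hypervectors and one already-encoded query.
class_hvs = [[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, 1.0, 1.0]]
query_hv = [0.9, 1.1, 1.0, 0.8]
predicted, scores = predict_class(class_hvs, query_hv)
print(predicted, scores)
```

In practice the hypervectors would have thousands of dimensions; the small vectors here only make the comparison easy to follow by hand.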

B. HDC Intelligent Sensor Control Framework
Figure 3 presents an overview of our framework, which leverages HDC to enable intelligent sensing by tightly integrating it with the sensing circuit. The ADC module, responsible for converting analog data into the digital domain, is the most power-consuming and latency-inducing component of every sensor. To achieve efficient and intelligent sensing, our HDC algorithms operate over low-precision digital data generated by an energy-efficient low-precision ADC and provide real-time feedback for selective sampling, reducing the data generation rate of the sensor.
Our proposed framework comprises two major components: (1) A neural encoding module that receives raw low-precision data and transforms it into holographic vectors in a high-dimensional space. These hypervectors store information in their patterns and are learnable by our HDC algorithms. (2) A learning algorithm that makes decisions about the sampling rate of the ADC block. The HDC learning aims to lower the ADC sampling rate for data points that do not carry useful information. For instance, our sensing circuit, which typically generates 60 samples/second, would only generate data at a minimum frequency (e.g., 1 frame/second) unless HDC detects that the incoming data points carry relevant information.
Our framework primarily focuses on visual sensing data, particularly radar data, where useful information exhibits locality.As such, the HDC learning and classification algorithm in our framework performs object detection tasks to determine the presence of any object in a given frame of sensing data.Figure 4 illustrates how our framework operates in controlling the ADC.In the figure, an object of interest appears in the second frame and disappears in the last frame.Since the first and last frames do not contain any objects of interest, our framework, shown at the top of the figure, disables the ADC from generating digital frames.However, at the bottom of the figure, without our HyperSense model, digital frames are generated even without any objects of interest, resulting in an abundance of useless data that increases the overall system's cost.
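The feedback loop described above can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the detector output is given as one boolean per one-second interval, the full and idle rates reuse the 60 and 1 per second figures from the example in the text, and the names `select_adc_rate` and `frames_generated` are hypothetical.

```python
def select_adc_rate(object_detected, full_rate_hz=60.0, idle_rate_hz=1.0):
    """Pick the high-precision ADC rate from the detector's latest decision."""
    return full_rate_hz if object_detected else idle_rate_hz

def frames_generated(detections, step_seconds=1.0,
                     full_rate_hz=60.0, idle_rate_hz=1.0):
    """Count digital frames produced over a sequence of detection decisions."""
    total = 0.0
    for detected in detections:
        rate = select_adc_rate(detected, full_rate_hz, idle_rate_hz)
        total += rate * step_seconds
    return total

# Object appears in the second interval and leaves before the last one.
detections = [False, True, True, False]
controlled = frames_generated(detections)
always_on = 60.0 * len(detections)
print(controlled, always_on)
```

For this toy trace the controlled sensor produces 122 frames instead of 240; the saving grows as the activity of interest becomes rarer.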

C. HDC Object Detection
As mentioned above, our framework controls the frequency of the ADC by detecting objects in a given sensing [...] that corresponds to the hyper-dimensional data point of the fragment $x_i$. A dimensionality of 5K or 10K is usually selected; however, the best value varies highly with the input data, so it needs to be tuned to balance performance and efficiency. By performing the normalization and HDC encoding process for all negative and positive fragments, we obtain negative hypervectors and positive hypervectors, respectively. Initial training of HDC classification then produces the class hypervectors by bundling each negative [...] To obtain a better-performing HDC classification model, retraining using the negative and positive hypervectors is conducted iteratively (4). The best-performing HDC model is selected by running inference over test fragments to compute performance metrics such as accuracy and F1 score (5). For each test fragment $q$, inference is conducted by computing the similarity between each class hypervector and the normalized and HDC-encoded hypervector of $q$, which can be formalized as $\delta(\vec{C}_i, \phi(\vec{q}\,'))$. The class with the higher similarity value is the predicted class. With the best-performing HDC model, we now have the trained Fragment model.
Using the trained Fragment model, we can build the HyperSense model as shown in Figure 5.(b). To form a HyperSense model, we need not only the trained Fragment model but also three additional hyperparameters: $T_{score}$, $T_{detection}$, and stride. Once these four components are available, the HyperSense model requires no additional training steps. We discuss each of the hyperparameters below. Given a frame of sensor data, inference with the HyperSense model starts by cropping fragments in a sliding-window manner (6). This step uses the hyperparameter stride: the sliding window moves over the frame by stride at a time in both the horizontal and vertical directions. Then, we give the resulting $m$ fragments $f_i$ to the trained Fragment model to obtain a prediction score $s_i$ for each fragment $f_i$ (7). This yields $m$ scores, each corresponding to a fragment in the given frame. To determine which fragments contain detected objects, the hyperparameter $T_{score}$ is used (8). Each score $s_i$ is compared with the score threshold $T_{score}$: if $s_i$ is larger than $T_{score}$, the fragment $f_i$ is considered to contain an object and a prediction value of 1 is stored; otherwise, 0 is stored. Finally, we compare the sum of all prediction values with the hyperparameter $T_{detection}$, the threshold on the number of detections (9). If the sum is larger than $T_{detection}$, the final prediction of the HyperSense model is positive, indicating that there are objects in the given frame; otherwise, the prediction is negative, indicating that there are no objects in the given frame.
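Steps (6) through (9) can be sketched as follows. This is a minimal illustration rather than the authors' code: the trained Fragment model is replaced by a hypothetical stand-in scoring function (`mean_intensity`), frames are plain nested lists, and the tiny 4 × 4 frame and threshold values are chosen only for readability.

```python
def crop_fragments(frame, fragment_size, stride):
    """Step (6): crop square fragments with a sliding window."""
    height, width = len(frame), len(frame[0])
    fragments = []
    for top in range(0, height - fragment_size + 1, stride):
        for left in range(0, width - fragment_size + 1, stride):
            fragments.append([row[left:left + fragment_size]
                              for row in frame[top:top + fragment_size]])
    return fragments

def hypersense_predict(frame, fragment_score, fragment_size, stride,
                       t_score, t_detection):
    """Steps (7)-(9): score fragments, threshold them, count detections."""
    fragments = crop_fragments(frame, fragment_size, stride)
    scores = [fragment_score(f) for f in fragments]           # step (7)
    predictions = [1 if s > t_score else 0 for s in scores]   # step (8)
    return sum(predictions) > t_detection, scores             # step (9)

def mean_intensity(fragment):
    """Hypothetical stand-in for the trained Fragment model's score."""
    values = [v for row in fragment for v in row]
    return sum(values) / len(values)

empty_frame = [[0.0] * 4 for _ in range(4)]
object_frame = [row[:] for row in empty_frame]
for r, c in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    object_frame[r][c] = 1.0

detected_object, object_scores = hypersense_predict(
    object_frame, mean_intensity, fragment_size=2, stride=1,
    t_score=0.4, t_detection=2)
detected_empty, _ = hypersense_predict(
    empty_frame, mean_intensity, fragment_size=2, stride=1,
    t_score=0.4, t_detection=2)
print(detected_object, detected_empty)
```

Because the frame-level decision only thresholds and counts fragment scores, changing $T_{score}$, $T_{detection}$, or stride requires no retraining of the Fragment model.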
Figure 6 illustrates an actual demonstration conducted on the CRUW dataset, which will be further explained in a later section, showcasing how our HyperSense model interacts with the trained Fragment model. The demonstration consists of four different types of scenes, each with a corresponding area in the heatmap displayed at the bottom of the figure. The heatmap represents the prediction scores of the Fragment model, where each fragment corresponds to a position on the y-axis and each time frame corresponds to a position on the x-axis. The fragments are organized in a single column through row-major ordering, with the score for the top-left fragment located at the top of each column in the heatmap. In scenes involving horizontal movements, we observe high confidence values arranged horizontally with little vertical movement, while in scenes with vertical movements, they are arranged vertically. Moreover, in the static scene, we observe consistent scores across the time frames.

IV. ACCELERATOR ARCHITECTURE DESIGN
A. Computation Bottleneck
Though HDC-based models show promising potential in conducting object detection tasks, to perform real-time learning, it is still necessary to design domain-specific accelerators to avoid unnecessary computations.Specifically, we illustrate the computation overhead in Figure 7.As is discussed in subsection III-C, to detect an object, the HDC model needs to encode each fragment by mapping input image data from normal space into hyperspace.As is shown in Figure 7.(a), the sliding window will scan the input image in both the X direction and the Y direction.For simplicity, here we only zoom into the situation of X direction.To maximize the computation efficiency, the most straightforward optimization strategy is to pipeline the encoding operations of different fragments in one direction as shown in Figure 7.(b).Here we suppose the dimension of the sliding window is 2 × 3 and the step size of the scanning operation is 1.To encode the fragment, we need to first unroll the fragments from 2 × 3 into 6 × 1 and then multiply each element of the unrolled vector with base hypervector as we discussed in subsection III-A.
In Figure 7.(c), we present the encoding process of the first rows of fragments $S_1$, $S_2$, and $S_3$. As the figure shows, neighboring fragments share elements with each other. For example, fragments $S_1$ and $S_2$ share 2 elements in the first row. Therefore, to improve the computation efficiency, one optimization strategy is to reuse the computation of the common elements between different fragments.

B. HDC Encoding Optimization
To reuse the common elements between consecutive fragments, we exploit the permutation relationship between different base hypervectors. As discussed in subsection III-A, one way to keep different base hypervectors orthogonal is to first randomly generate the first base hypervector and then use the HDC permutation operation to generate the remaining base hypervectors. Suppose the size of the fragment is $h \times w$ and the base hypervector matrix is $B$. If the hypervector dimension is $D$, then the size of the base hypervector matrix $B$ is $h \times w \times D$. To balance efficiency and accuracy, we randomly generate the hypervectors along the first axis of the matrix $B$ and use the HDC permutation operation to generate the rest along the second axis. Mathematically, this process can be represented as: [...] An example of the permutation-based base hypervector generation process is shown in Figure 7.(d). Here we suppose the base hypervector of element $S_{12}$ is a left shift of the base hypervector of element $S_{11}$. The same principle also applies to the base hypervector of element $S_{13}$, which is a left shift of the base hypervector of element $S_{12}$.
Here we use a simple example to illustrate how to reuse the computation in Figure 7.(d). It is easy to notice that the second element of $S_1$ is the same as the first element of $S_2$. The HDC encoding computation of the first row of fragment $S_1$ can be represented as: [...] The encoding computation of the first row of fragment $S_2$ can be represented as: [...] In this case, we can rewrite the HDC encoding computation as: [...] (5) As shown in Figure 7.(d), for the overlapping elements of each fragment, the results of $D-1$ multiplications can be reused to save computation. Therefore, the overall encoding overhead is significantly reduced. Note that the computation reuse relies on the fact that HDC base hypervectors keep their holographic property when permutation operations are applied to generate base hypervectors. This attribute is unique to the HDC model and cannot be exploited by traditional neural network-based models.
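The property that makes this reuse possible can be checked with a short Python sketch. It assumes, for illustration only, that the permutation $\rho$ is a cyclic left rotation by one position and that the base hypervector of a neighboring position is exactly $\rho$ applied to the previous one; the helper names and the dimension `D = 8` are hypothetical. Under those assumptions, the product of an input element with a permuted base hypervector equals the permuted product with the original base hypervector, so an already computed product can be rotated instead of recomputed.

```python
import random

def rotate_left(vector, shift=1):
    """Permutation rho, assumed here to be a cyclic left rotation."""
    shift %= len(vector)
    return vector[shift:] + vector[:shift]

def scale(vector, scalar):
    """Multiply every component of a hypervector by an input element."""
    return [scalar * v for v in vector]

rng = random.Random(0)
D = 8  # tiny dimensionality for illustration only
base = [rng.gauss(0.0, 1.0) for _ in range(D)]
element = 0.75  # an input element shared by two neighboring fragments

# Fresh computation for the neighboring position: element * rho(base).
direct = scale(rotate_left(base), element)
# Reuse: rotate the product that was already computed for the first position.
reused = rotate_left(scale(base, element))

print(direct == reused)
```

The comparison holds exactly because both paths perform the same scalar multiplications and differ only in where each product is stored.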

C. Accelerator Architecture Design
In Figure 8, we present the architecture design of the HDC near-sensor processing accelerator. For each input image frame, we first partition it into several small pieces, as shown in Figure 8.(a). Each small piece is mapped onto a systolic array (SA) IP for HDC encoding operations, and the pieces are processed in parallel, as shown in Figure 8.(b). Here we suppose the dimension of the original input image is $H \times W$. There are $T_1 \times T_2$ SA IPs in total inside the near-sensor accelerator. Inside each SA IP, the sliding window moves in both the X and Y directions to generate multiple fragments. The fragments inside a single SA IP are generated in pipeline style, and different SA IPs run in parallel. In the end, an HDC classifier performs a cosine similarity check to determine whether an object is detected. For the FPGA on-chip kernel function and classifier implementation, we refer to previous HDC FPGA work [13].
In Figure 8.(c), we present the architecture design of each SA IP. Suppose the dimension of each fragment is $h \times w$; in this case, each SA IP consists of $h \times w$ processing element (PE) IPs. PE IPs in the same row execute in pipeline style, and PE IPs in different rows run in parallel. Suppose the hypervector dimension is $D$; then each PE IP performs the encoding operations of hypervector chunks whose dimension is $D/w$. In Figure 9, we present the microarchitecture of each PE IP. To realize computation reuse, each PE IP accepts the image element and the partial encoding result from the left PE IP and buffers them inside its FIFO IP, as shown in Figure 9 (1). The input image element is multiplied with base hypervector chunks ((2) and (3)). As we will see in subsection IV-D, only part of the base hypervector chunks participate in the multiply operations. After the multiplication, part of the encoding result is passed to the next PE IP (4). All the new encoding results of the current PE IP (5), together with the partial encoding results from the last PE IP (6), are used to update the encoding hypervector chunks in the registers IP (7).

D. Computation reuse example
In Figure 10, we present an example of HDC encoding computation reuse. Due to page restrictions, we limit the dimension of the fragment to 2 × 3. As discussed in subsection IV-C, we parallelize the computation across the rows of the fragment and reuse the computation of the elements within a single row. Therefore, in Figure 10, we show the computation activity of the first row of the SA IP, which includes three PE IPs. Suppose the hyperspace dimension is $D$; then inside each PE IP, the chunk vector dimension is $D/3$. We use $P_{ij}$ to represent the PE IP in the $i$-th row and $j$-th column inside a single SA IP. In Figure 10, as we only present the computing activity of the first row of the SA IP, the $i$ index of every PE IP is 1. Due to page limitations, we only show 5 stages of the pipeline process, but the principle remains the same for longer time steps. Here we use $\vec{B}_{ij}$ to represent the base encoding hypervector, where $i$ represents the position and $j$ represents the chunk index. Therefore, for the first row of the $k$-th fragment, the encoding computation of the $m$-th hypervector chunk is:
$$\vec{H}_{km} = I_k \times \vec{B}_{1m} + I_{k+1} \times \vec{B}_{2m} + I_{k+2} \times \vec{B}_{3m}$$
In Figure 10, we show the computation reuse of fragment 1, fragment 2, and fragment 3, whose computations of the $m$-th chunk are:
$$\vec{H}_{1m} = I_1 \times \vec{B}_{1m} + I_2 \times \vec{B}_{2m} + I_3 \times \vec{B}_{3m}$$
$$\vec{H}_{2m} = I_2 \times \vec{B}_{1m} + I_3 \times \vec{B}_{2m} + I_4 \times \vec{B}_{3m}$$
$$\vec{H}_{3m} = I_3 \times \vec{B}_{1m} + I_4 \times \vec{B}_{2m} + I_5 \times \vec{B}_{3m}$$
As discussed in subsection IV-B, to reuse the encoding computation of different hypervector chunks, we generate the base hypervectors based on the HDC permutation operation. Specifically, for the first base hypervector $\vec{B}_1$, all three chunks $\vec{B}_{11}$, $\vec{B}_{12}$, and $\vec{B}_{13}$ are generated based on a Gaussian distribution, as discussed in subsection III-A. For base hypervector $\vec{B}_2$, we have:
$$\vec{B}_{22} = \vec{B}_{11}, \quad \vec{B}_{23} = \vec{B}_{12}$$
This means the second base hypervector $\vec{B}_2$ is generated by applying the permutation operation to the first base hypervector $\vec{B}_1$, while its first chunk ($\vec{B}_{21}$) is still generated with a Gaussian distribution. For the base hypervector $\vec{B}_3$, we have:
$$\vec{B}_{32} = \vec{B}_{21}, \quad \vec{B}_{33} = \vec{B}_{22}$$
Next, we discuss the computation reuse step by step. As shown in Figure 10 step 1, the first element of fragment $S_1$ ($I_1$) is first multiplied with the first chunk of the first base hypervector ($\vec{B}_{11}$) inside $PE_{11}$. After the multiplication operation, the temporary encoding result is saved inside the PE IP's registers (Regs). Also, $I_1$ is buffered inside the buffer of $PE_{11}$ that is connected with $PE_{12}$. At the first stage of the pipeline, only $PE_{11}$ is active and the other two PE IPs are idle. At step 2, element $I_2$ is loaded into $PE_{11}$ and element $I_1$ is passed into $PE_{12}$. Since $I_2$ simultaneously corresponds to the first element of fragment $S_2$ and the second element of fragment $S_1$, inside $PE_{11}$, $I_2$ is multiplied with both $\vec{B}_{11}$ and $\vec{B}_{21}$. The result of $I_2 \times \vec{B}_{11}$ is saved inside the second row of Regs, and the result of $I_2 \times \vec{B}_{21}$ updates the first row of Regs. After the update, the temporary encoding result of fragment $S_1$ inside $PE_{11}$ is $I_1 \times \vec{B}_{11} + I_2 \times \vec{B}_{21}$. We also buffer the result of $I_2 \times \vec{B}_{11}$ so that it can be reused at stage 3. In parallel with the computing activity of $PE_{11}$, another computation takes place inside $PE_{12}$, where element $I_1$ is multiplied with base hypervector chunk $\vec{B}_{12}$. The result $I_1 \times \vec{B}_{12}$ is saved inside the first row of Regs as the partial encoding result of fragment $S_1$. During step 2, $PE_{13}$ is still idle.
At step 3, element $I_3$ is loaded into $PE_{11}$, element $I_2$ is forwarded into $PE_{12}$, and element $I_1$ is passed into $PE_{13}$. Starting from element $I_3$, every element is shared by three fragments, which means it needs to be multiplied with all 3 base hypervectors. So at step 3, inside $PE_{11}$, we compute $I_3 \times \vec{B}_{11}$, $I_3 \times \vec{B}_{21}$, and $I_3 \times \vec{B}_{31}$. As in step 2, $I_3 \times \vec{B}_{11}$ is a partial result of fragment $S_3$, which is saved inside the third row of Regs. $I_3 \times \vec{B}_{31}$ and $I_3 \times \vec{B}_{21}$ are used to update the partial results of fragment $S_1$ and fragment $S_2$, respectively. At this point, we have finished the first chunk encoding operation of fragment $S_1$, which is $\vec{H}_{11}$. We pop out $\vec{H}_{11}$, buffer it, and wait for the results of $\vec{H}_{12}$ and $\vec{H}_{13}$. The $PE_{12}$ IP simultaneously accepts element $I_2$ and the partial encoding result $I_2 \times \vec{B}_{11}$. Based on Equation 10, we have:
$$I_2 \times \vec{B}_{22} = I_2 \times \vec{B}_{11}$$
So we directly use $I_2 \times \vec{B}_{11}$ coming from $PE_{11}$ to update the partial encoding result of $S_1$ and only perform the calculation of $I_2 \times \vec{B}_{12}$, which is used as a partial chunk encoding result of fragment $S_2$. Since $\vec{B}_{12} = \vec{B}_{23}$, we also buffer the result of $I_2 \times \vec{B}_{12}$ and pass it into $PE_{13}$ so that it can be reused at stage 4.
After step 3, the computation activities of the later stages are the same. At the $PE_{11}$ IP, the input element needs to be multiplied with all base hypervector chunks. To reuse the computation result, the products of the input element with the first two base hypervector chunks ($\vec{B}_{11}$ and $\vec{B}_{21}$) are buffered and loaded into the later PE IPs at the next stages. For the other PE IPs ($PE_{12}$ and $PE_{13}$), the input element only needs to be multiplied with the chunk of the first base hypervector ($\vec{B}_{12}$ and $\vec{B}_{13}$, respectively). Meanwhile, the results of those multiplications are reused by the next PE IPs. As shown in Figure 10, because we apply the HDC permutation operation to the base hypervectors, the computation is significantly reduced.
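The effect of this reuse can be counted with a small Python simulation of one SA row. It is a sketch under explicit assumptions, not a model of the actual hardware: the chunk aliasing ($\vec{B}_{22} = \vec{B}_{11}$, $\vec{B}_{23} = \vec{B}_{12}$, $\vec{B}_{32} = \vec{B}_{21}$, $\vec{B}_{33} = \vec{B}_{22}$) is inferred from the walkthrough above, a dictionary cache stands in for the FIFO buffers between PE IPs, one "product" means one element-times-chunk multiplication, and all names and the chunk dimension of 4 are hypothetical.

```python
import random

# Assumed aliasing: (position j, chunk m) -> name of the stored chunk.
ALIAS = {
    (1, 1): "b11", (1, 2): "b12", (1, 3): "b13",
    (2, 1): "b21", (2, 2): "b11", (2, 3): "b12",
    (3, 1): "b31", (3, 2): "b21", (3, 3): "b11",
}

def make_chunks(chunk_dim, seed=0):
    """Draw each distinct stored chunk from a Gaussian distribution."""
    rng = random.Random(seed)
    names = sorted(set(ALIAS.values()))
    return {n: [rng.gauss(0.0, 1.0) for _ in range(chunk_dim)] for n in names}

def encode_chunk(inputs, k, m, chunks, cache, counter):
    """Encode chunk m of the fragment starting at input index k (0-based)."""
    acc = [0.0] * len(next(iter(chunks.values())))
    for j in (1, 2, 3):
        idx = k + j - 1
        key = (idx, ALIAS[(j, m)])
        if cache is None or key not in cache:
            product = [inputs[idx] * b for b in chunks[ALIAS[(j, m)]]]
            counter[0] += 1  # one element-times-chunk product computed
            if cache is not None:
                cache[key] = product
        else:
            product = cache[key]  # reuse a buffered product
        acc = [a + p for a, p in zip(acc, product)]
    return acc

def encode_all(inputs, chunks, reuse):
    cache = {} if reuse else None
    counter = [0]
    result = [[encode_chunk(inputs, k, m, chunks, cache, counter)
               for m in (1, 2, 3)]
              for k in range(len(inputs) - 2)]
    return result, counter[0]

inputs = [0.2, 0.5, 0.1, 0.9, 0.4]  # I_1 .. I_5, giving three fragments
chunks = make_chunks(chunk_dim=4)
naive, naive_products = encode_all(inputs, chunks, reuse=False)
reused, reused_products = encode_all(inputs, chunks, reuse=True)
print(naive_products, reused_products, naive == reused)
```

For five input elements the sketch needs 19 chunk products with reuse instead of 27 without, and both paths yield identical encodings. In steady state each element needs five distinct products (three in $PE_{11}$, one each in $PE_{12}$ and $PE_{13}$) rather than nine.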

A. Experimental Setup
The proposed framework has been implemented as a software framework and a hardware accelerator. Our software framework is implemented using a combination of PyTorch and NumPy and supports HDC encoding and classification. We study the effectiveness of our technique on the CRUW dataset [34], which is a public camera-radar dataset for autonomous vehicle applications. The radar images are captured by a TI AWR1843, whose operating power is around 30W [21], [34]. For consistency and simplicity, we limited our experiments to square fragments. Thus, when we refer to a fragment size of $x$, it corresponds to $x \times x$ square fragments. For the hardware accelerator, we implemented our design using SystemVerilog and tested it on a Xilinx Zynq UltraScale+ MPSoC ZCU104 (Xilinx). Our accelerator architecture design is platform agnostic and can be implemented on both FPGA and ASIC. In this paper, we choose FPGA as the evaluation platform to quickly test our design's efficiency. We leave ASIC evaluation as future work.

B. Evaluation of HyperSense model
First, we focused on a scenario where we aim to allow a specific amount of false positives while maximizing the true positive rate (TPR).To analyze this scenario and determine the setting of our models that achieves the maximum TPR at the desired false positive rate (FPR), we employed the Receiver Operating Characteristic (ROC) curve evaluation.By plotting the ROC curves for different model configurations, we can identify the optimal operating points that strike the right balance between true positives and false positives, enabling us to achieve the highest TPR while adhering to the desired FPR.This analysis provides valuable insights into the performance and effectiveness of our models under different settings, ensuring their efficiency and suitability for intelligent sensing applications.
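The selection of an operating point described above, the highest TPR subject to a desired FPR, can be sketched in plain Python. The scores and labels below are made-up toy data, not results from the paper, and the function names are hypothetical. The sketch also treats a score equal to the threshold as positive, which is a simplifying choice for enumerating thresholds at the observed score values.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR, threshold) for every distinct score threshold."""
    thresholds = sorted(set(scores), reverse=True)
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives, t))
    return points

def best_operating_point(points, max_fpr):
    """Pick the point with the highest TPR whose FPR stays within budget."""
    feasible = [p for p in points if p[0] <= max_fpr]
    if not feasible:
        return None
    # Prefer higher TPR; break ties with lower FPR.
    return max(feasible, key=lambda p: (p[1], -p[0]))

# Toy scores and ground-truth labels (1 = object present).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
best = best_operating_point(points, max_fpr=0.34)
print(best)
```

Here the selected point reaches a TPR of 1.0 at an FPR of one third with a threshold of 0.6; tightening `max_fpr` below one third would force a lower TPR.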
[Figure caption: As fragment size gets larger, we get a higher true positive rate at a higher false positive rate.]
1) Fragment model performance: In our initial set of experiments, we compared the performance of an HDC model with a 10K dimensionality against other widely used lightweight models as baselines, namely a small multi-layer perceptron (MLP) model and YOLOv4 tiny, in the context of object detection for the Fragment model. The Fragment model here refers to a specific configuration with a fragment size of 128. Figure 11 depicts the ROC curves obtained from these models when applied as the Fragment model. Notably, on the left side of the ROC curves, the YOLOv4 tiny model exhibits the lowest ROC curve quality, even though it has more than 5 million parameters, far more than the other models, which makes it hard to implement efficiently near the sensor. We attribute this to the YOLOv4 tiny model's weak performance on radar data, consistent with previous work that reported the lowest F1 score for the YOLOv4 tiny model despite its minimal model size and latency [4]. Conversely, on the right side of the ROC curves, we observe the performance distinction between HDC and MLP in the TPR range of 0.8 to 1. Remarkably, HDC shows the most discerning ROC curve compared to the other models. The quantified comparison of the Area Under the Curve (AUC) for the right ROC curves in Figure 11, presented in Table I, further confirms the highest performance of HDC compared to the baselines.
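For reference, the AUC of an ROC curve can be computed from its (FPR, TPR) points with the trapezoidal rule. The sketch below uses made-up points rather than values from Table I, and `auc_trapezoid` is a hypothetical helper name.

```python
def auc_trapezoid(fpr, tpr):
    """Area under an ROC curve using the trapezoidal rule."""
    pairs = sorted(zip(fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy ROC curve: rises quickly, then saturates at TPR = 1.
fpr = [0.0, 0.0, 0.5, 1.0]
tpr = [0.0, 0.5, 1.0, 1.0]
auc = auc_trapezoid(fpr, tpr)
chance = auc_trapezoid([0.0, 1.0], [0.0, 1.0])
print(auc, chance)
```

A curve that hugs the top-left corner yields an AUC close to 1, while the chance diagonal yields 0.5.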

2) Hyperparameters of HyperSense model exploration:
The HyperSense model is composed of the Fragment model and three essential hyperparameters: $T_{score}$, $T_{detection}$, and stride, as explained in subsection III-C. For the Fragment model with a fixed fragment size of 128, which aligns with the sensing frames in the CRUW dataset, only one fragment is generated, resulting in a single ROC curve based on a single prediction score. This behavior is evident in Figure 11. However, when the Fragment model employs a different fragment size, the HyperSense model produces multiple prediction scores, necessitating the use of $T_{detection}$. Additionally, the HyperSense model now processes a given frame of sensing data using a moving window approach, introducing another hyperparameter, stride. Consequently, to comprehensively analyze the impact of each hyperparameter on the ROC curve, we explored different hyperparameter values. In Figure 12, we present the results of exploring the two thresholds, $T_{score}$ and $T_{detection}$, on two distinct Fragment models with different fragment sizes. As depicted in the left heatmaps, different values of $T_{detection}$ yield varying F1 scores across different $T_{score}$ values. This observation indicates that different selections of $T_{detection}$ give rise to distinct ROC curves. As a result, on the right side of the figure, we observe that the HyperSense model now exhibits a range of ROC curves rather than a single ROC curve. This highlights the importance of selecting the appropriate ROC curve by identifying the highest TPR at a specific FPR, which requires the right choice of $T_{detection}$ value.
The experimental results for different stride sizes reveal that a larger stride can leave a substantial skipped area (i.e., the part of the frame that no fragment covers because the next window position would not fit into the frame), as illustrated in Figure 13(a). If an object is located within this skipped area, the model may fail to detect it, resulting in mispredictions. This effect is evident in the line graphs in Figure 13: comparing the skipped-area line graph with the top-10 average F1 score graphs shows that models with smaller skipped areas perform better, while those with larger skipped areas perform worse. The ROC curves in Figure 13(b) also support this observation: smaller strides benefit the model. Nonetheless, a smaller stride generates more fragments, which increases the computational load. Hence, striking a balance between computation and performance becomes crucial, and the objective is to select the largest stride that provides performance comparable to that of the model using a stride of 1.
Interestingly, even though a stride size of 10 generates a larger skipped area than most stride sizes below 10, its performance remains similar to or even higher than theirs. Because larger stride sizes generate fewer fragments and thus can reduce computation, selecting a larger stride size that offers performance comparable to a stride size of 1 is an efficient choice in this trade-off.
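The relationship between stride, fragment count, and skipped area can be made concrete with a small sketch. It considers a single axis of the frame and assumes that windows start at position 0 and advance by the stride until the next window would no longer fit; the helper names are hypothetical, and the two-dimensional bookkeeping used for Figure 13 may differ.

```python
from typing import List


def fragment_starts(frame_size: int, fragment_size: int, stride: int) -> List[int]:
    """Start positions of all windows that fit entirely inside the frame."""
    return list(range(0, frame_size - fragment_size + 1, stride))


def skipped_extent(frame_size: int, fragment_size: int, stride: int) -> int:
    """Number of trailing positions along one axis that no window covers."""
    last_start = fragment_starts(frame_size, fragment_size, stride)[-1]
    return frame_size - (last_start + fragment_size)


# Frame size 128 (as in the CRUW frames), fragment size 96.
for stride in (1, 5, 10):
    n = len(fragment_starts(128, 96, stride))
    skipped = skipped_extent(128, 96, stride)
    print(f"stride={stride:2d}: {n:2d} windows per axis, {skipped} positions skipped")
```

The sketch shows why the skipped extent is not monotonic in the stride: it depends on the remainder of (frame size minus fragment size) divided by the stride, while the number of windows falls steadily as the stride grows.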
3) Fragment size effects on ROC curve: From the preceding experiments, we observed that selecting appropriate hyperparameters can result in a HyperSense model with a higher TPR. In this experiment, we focus on the effect of fragment size on performance, considering a ROC curve composed of the highest TPR achievable by a Fragment model with a given fragment size. The behavior of the ROC curve for different fragment sizes is illustrated in Figure 14.
In the lower FPR range A, the HyperSense model with the largest fragment size of 128 exhibits the highest TPR. However, as we move to higher FPR ranges, HyperSense models with smaller fragment sizes achieve higher TPR. Specifically, in the middle range B, a fragment size of 112 yields the highest TPR, while the smallest fragment size of 96 performs best in the highest FPR range C. From this observation, we can identify a trend wherein larger fragment sizes tend to perform better at lower FPRs, maximizing the TPR while minimizing quality loss due to reduced data granularity. Conversely, smaller fragment sizes are more effective at higher FPRs, where the risk of false positives is more manageable. The best fragment size therefore depends on the desired FPR.
Considering the results obtained with a dimensionality of 10K, we conducted a comprehensive exploration of different fragment sizes and dimensionalities to validate the observations from Figure 14. The heatmaps in Figure 15 show the true positive rates under different target false positive rates. When targeting the lowest FPR of 0.05, the largest fragment size consistently exhibits the highest TPR for all dimensionalities. However, as we increase the target FPR, the fragment size that achieves the maximum TPR decreases for all dimensionalities, in line with the trend observed in Figure 14. These results further support the notion that the choice of fragment size plays a crucial role in achieving optimal performance in the HyperSense model, depending on the desired trade-off between true positive and false positive rates.
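Selecting "the highest TPR at a target FPR" from many hyperparameter settings can be sketched as follows. The operating points are hypothetical placeholders, not measurements from this work, and the helper name `max_tpr_at_fpr` is an assumption; each point is taken to be an (FPR, TPR) pair measured for one setting of $T_{score}$, $T_{detection}$, and stride.

```python
from typing import Iterable, Optional, Tuple


def max_tpr_at_fpr(points: Iterable[Tuple[float, float]],
                   target_fpr: float) -> Optional[float]:
    """Highest TPR among operating points whose FPR does not exceed the target."""
    feasible = [tpr for fpr, tpr in points if fpr <= target_fpr]
    # None signals that no operating point satisfies the FPR budget.
    return max(feasible) if feasible else None


# Hypothetical (FPR, TPR) operating points for one fragment size.
operating_points = [(0.02, 0.60), (0.05, 0.75), (0.04, 0.70), (0.10, 0.90)]
best_at_005 = max_tpr_at_fpr(operating_points, 0.05)
best_at_010 = max_tpr_at_fpr(operating_points, 0.10)
```

Repeating this selection for each fragment size and dimensionality would fill one cell of a heatmap like those in Figure 15 per target FPR.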

C. FPGA Resource Utilization
In Table II, we present the FPGA resource utilization of the HDC accelerator on the Xilinx ZCU104. We assume a hypervector dimensionality of 5K and a fragment size of 96. The data precision of each hypervector element is 8 bits. The encoding is performed inside the SA IP. Our classifier IP follows a classic HDC classifier FPGA accelerator design [13]. Table II also includes the resource utilization of the other peripheral IPs, i.e., the encoder IP [13], the AXI Interconnect, and the DRAM controller. The accelerator operates at 100 MHz with a power consumption of 8.2 W, as estimated with the Xilinx Power Estimator [3]. For a single fragment, the HDC encoding and classification process takes 9397 clock cycles. Although processing a single fragment does not show an obvious advantage in our system, the throughput for sensing a whole radar image improves notably with our tailored computation reuse scheme and pipelined data flow. The details of the computation reuse and pipelined execution flow are elaborated in V-D.
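For orientation, the cycle count and clock frequency above translate into a per-fragment latency as in the short calculation below. The energy figure is only a rough upper bound: it assumes the 8.2 W estimate applies for the full duration of one fragment and ignores pipelining and computation reuse.

```python
CLOCK_HZ = 100e6             # accelerator clock from the text (100 MHz)
CYCLES_PER_FRAGMENT = 9397   # encoding + classification cycles from the text
POWER_W = 8.2                # estimated accelerator power from the text

# Latency of one fragment when processed in isolation.
latency_s = CYCLES_PER_FRAGMENT / CLOCK_HZ
# Rough upper bound: assumes full power draw for the whole latency.
energy_j = POWER_W * latency_s

print(f"latency per fragment: {latency_s * 1e6:.2f} us")
print(f"energy per fragment (upper bound): {energy_j * 1e6:.1f} uJ")
```

This gives about 94 µs per isolated fragment and roughly 0.77 mJ under the stated assumption.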

D. FPGA Accelerator Performance
Figure 16 presents the cross-platform and cross-model comparison. On the hardware side, we picked three platforms that operate at the 10 W power level: the Raspberry Pi 4B CPU (R Pi 4B), the NVIDIA Jetson Orin (Jetson Orin), and the Xilinx ZCU104. The operating power of R Pi 4B and Jetson Orin is 15W. We restricted every platform's power consumption to below 50% of the sensor's power. On the model side, we compared MLP, YOLOv4, and HDC models. Each model is implemented on all three platforms. In Figure 16, MLP i j denotes an MLP model with i layers and fragment size j, and HDC d j denotes an HDC model with hypervector dimension d and fragment size j. For MLP and YOLO model acceleration on the FPGA, we used the Xilinx DPU softcore IP [17]. For HDC model acceleration on the FPGA without computation reuse (HDC d wo), we based our implementation on the previous HDC FPGA framework [13]. As shown in Figure 16, after introducing HDC encoding computation reuse, HyperSense achieves on average a 2.4× speedup compared with MLP models running on Jetson Orin. When

E. End-to-end System Evaluation
To assess the efficacy of our framework in conserving overall system energy, we conducted a comprehensive end-to-end system evaluation. Our analysis focused on estimating the average energy consumption for a single radar frame in a scenario where radar data is captured using a TI AWR1843 sensor and transmitted through a 3G network communication channel to a cloud server equipped with a resource-intensive machine-learning model for complex tasks. The energy cost estimation for the cloud server follows prior methodologies [31].
In Figure 17, we present a breakdown of energy consumption in two scenarios: one with infrequent occurrences of the object of interest (1% probability) and another in which the object is ten times more frequent. The comparison involves three methods: 1) the conventional method, which lacks any additional mechanism for energy conservation, 2) a widely recognized approach, compressive sensing, which leverages data compression, in particular Bit Depth Compression (BDC), a recent real-time application [11], and 3) our framework with varying target FPRs. Notably, our approach conserves energy in both scenarios, reducing not only the total system energy but also the energy consumed at the edge, with savings of up to 92.1% and 64.7%, respectively.
It is worth noting that the most energy-efficient case, targeting an FPR of 0.05, leads to the highest quality loss. This loss indicates that a portion of the data containing objects of interest, which was supposed to be transmitted to the cloud, was retained at the edge due to mispredictions by our near-sensor model, as elucidated in Table III. Despite this trade-off, our approach keeps the quality loss at a reasonable level of approximately 1.95% while achieving more than a 71% reduction in total system energy consumption.
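The dependence of the savings on object frequency and target FPR can be illustrated with a simplified expected-value model. This is not the energy model used in the evaluation: it assumes that a frame is forwarded whenever the near-sensor model predicts an object, that the edge cost is paid for every frame, and that transmission plus cloud cost is paid only for forwarded frames. The function names, the TPR of 0.9, and the energy values are hypothetical placeholders.

```python
def transmit_fraction(p_object: float, tpr: float, fpr: float) -> float:
    """Expected fraction of frames forwarded to the cloud."""
    # True detections on object frames plus false alarms on empty frames.
    return p_object * tpr + (1.0 - p_object) * fpr


def expected_energy_per_frame(p_object: float, tpr: float, fpr: float,
                              e_edge: float, e_downstream: float) -> float:
    """Edge cost for every frame plus downstream cost for forwarded frames."""
    return e_edge + transmit_fraction(p_object, tpr, fpr) * e_downstream


# Hypothetical operating point: TPR 0.9 at a target FPR of 0.05.
rare = transmit_fraction(p_object=0.01, tpr=0.9, fpr=0.05)
frequent = transmit_fraction(p_object=0.10, tpr=0.9, fpr=0.05)
print(f"forwarded fraction at 1% objects:  {rare:.4f}")
print(f"forwarded fraction at 10% objects: {frequent:.4f}")
```

Under these assumptions, the forwarded fraction grows with both the object probability and the target FPR, which matches the qualitative trend that savings shrink when objects are more frequent or when a higher FPR is tolerated.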

A. Adapting HyperSense to various sensors
The adaptability of the HyperSense system to various sensors, such as radar, cameras, and microphones, is fundamental to its broad utility across diverse applications. The system employs HDC to uniformly process segmented data units such as frames, so as long as the input can be transformed into a two-dimensional data unit, our framework can be applied regardless of the sensor type. For example, in the case of microphones, a fast Fourier transform can convert audio into a two-dimensional spectrogram of fixed length. Our system's ability to interpret input data in standardized units (frames or fixed-length spectrograms) enables seamless integration with different sensor technologies. Additionally, the inherent robustness of the HDC architecture against noise and data quality variance, which are prevalent issues in sensors like microphones and LiDAR, further enhances the system's adaptability. This versatility not only expands the potential applications of our system but also bolsters its practicality for real-world deployments across various industries that require intelligent sensing solutions.
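The microphone example can be sketched as below. To stay dependency-free, the sketch uses a naive discrete Fourier transform in place of a fast Fourier transform and a rectangular window; the function name `spectrogram`, the frame length, the hop size, and the zero-padding policy are all assumptions chosen for illustration.

```python
import cmath
import math
from typing import List, Sequence


def spectrogram(signal: Sequence[float], frame_len: int, hop: int,
                num_frames: int) -> List[List[float]]:
    """Fixed-size magnitude spectrogram: num_frames rows, frame_len//2 + 1 columns."""
    needed = frame_len + hop * (num_frames - 1)
    # Truncate or zero-pad so every input yields the same 2-D shape.
    padded = list(signal[:needed]) + [0.0] * max(0, needed - len(signal))
    rows = []
    for t in range(num_frames):
        frame = padded[t * hop: t * hop + frame_len]
        row = []
        for k in range(frame_len // 2 + 1):
            # Naive DFT of one frame at frequency bin k.
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n, x in enumerate(frame))
            row.append(abs(acc))
        rows.append(row)
    return rows


# A tone that completes two cycles per 16-sample frame.
tone = [math.sin(2 * math.pi * 2 * n / 16) for n in range(64)]
spec = spectrogram(tone, frame_len=16, hop=8, num_frames=4)
```

The resulting two-dimensional array has the same shape for any input length, so it could be fragmented in the same way as a radar frame.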

B. Scalability and deployment challenges
When integrating HyperSense with other sensors, such as camera sensors, to perform object detection on different types of data (such as images), two challenges may arise. The first challenge involves using the HDC model for complex vision tasks. Previous studies have indicated the necessity of integrating a Deep Neural Network (DNN) model with the HDC model [8], [35]. This integration introduces significant computational overhead, as a DNN's convolution operations are generally computation-intensive. For near-sensor processing tasks that target real-time image frame processing speeds, one potential solution is to add a domain-specific DNN accelerator, which would speed up the DNN convolution operations [7], [19]. It would be integrated with the HDC accelerator module via an on-chip connection protocol such as AMBA AXI [6], [27], [30].
The second potential challenge arises when applying HyperSense to high-resolution data. To achieve real-time processing speeds with high-resolution input data, the most straightforward design choice is to increase the number of computing units in the accelerator to enhance computational parallelism. However, more computing units typically result in higher power consumption and a larger chip area, and the near-sensor processing environment often faces strict power and space restrictions. Therefore, increasing computational parallelism to handle high-resolution input data may prevent the whole system from fitting into the near-sensor environment. One potential solution is to integrate a variational autoencoder (VAE) to compress the high-resolution input data [33].

VII. CONCLUSIONS
We introduce HyperSense, an innovative co-designed hardware and software system that aims to bridge the gap between intelligent sensing and machine learning. Our system effectively manages data generation from the Analog-to-Digital Converter (ADC) modules by predicting object presence in sensor data. Our framework outperforms the other models, achieving the highest Area Under the Curve (AUC). At the same time, the FPGA-based hardware exhibits remarkable speedups, and the end-to-end system evaluation shows up to 92.1% energy saving, making HyperSense a strong candidate for real-time intelligent sensing and data processing.

Fig. 2. Characteristics that HDC possesses but DNNs lack. These characteristics of HDC make our intelligent sensing more powerful.

Fig. 3. Overview of our Intelligent Sensing pipeline.

classification, each step of the framework can be described below.
1) Encoding: The first step in the HDC framework is to map the input data $\vec{F} \in U$ into high-dimensional space by introducing an encoding function $\vec{\phi} : U \to H$, which is often referred to as encoding. Assume an input vector with $n$ features $\vec{F} = \{f_1, f_2, \ldots, f_n\}$ that represents features from voice, image, etc. $\vec{\phi}(\vec{F}) = \cos(\vec{F} \times \vec{B} + \vec{b}) \times \sin(\vec{F} \times \vec{B})$, where $\vec{B}$ is an $n \times D$ matrix in which every element is sampled from an i.i.d. Gaussian distribution ($\mu = 0$, $\sigma = 1$), and $\vec{b}$ is sampled from an i.i.d. uniform distribution over $[0, 2\pi]$. $\vec{\phi}$ preserves some notion of similarity in the input space. Thus, given some inputs $\vec{x}, \vec{y} \in U$, $\vec{\phi}(\vec{x})$ and $\vec{\phi}(\vec{y})$ are their corresponding hypervectors, and $\vec{\phi}(\vec{x})$ is similar to $\vec{\phi}(\vec{y})$ if and only if $\vec{x}$ is similar to $\vec{y}$.
2) Training and Retraining: Suppose we have a dataset $\mathcal{D} \subset U$ where each data point $\vec{x}_i \in \mathcal{D}$ has a corresponding label $1 \le y_i \le m$ out of $m$ classes. Initial training is done by generating $m$ class hypervectors using bundling: $\vec{C}_i = \sum_{j : y_j = i} \vec{\phi}(\vec{x}_j)$. Based on these class hypervectors, we can start retraining. For each data point to retrain $\vec{x}_i$, each class hypervector is updated as follows:
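The encoding and initial bundling steps described above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in this work: the function names, the fixed random seed, the small dimensionality, and the use of zero-based class labels (instead of labels from 1 to m) are assumptions, and the retraining update is omitted because its rule is not reproduced in this text.

```python
import math
import random


def make_encoder(n_features, dim, seed=0):
    """Build phi(F) = cos(F x B + b) * sin(F x B), taken element-wise over dim."""
    rng = random.Random(seed)
    # B: n_features x dim matrix with i.i.d. standard Gaussian entries.
    B = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_features)]
    # b: dim offsets drawn uniformly from [0, 2*pi].
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(dim)]

    def encode(features):
        hv = []
        for d in range(dim):
            proj = sum(f * B[k][d] for k, f in enumerate(features))
            hv.append(math.cos(proj + b[d]) * math.sin(proj))
        return hv

    return encode


def bundle_classes(encode, samples, labels, num_classes, dim):
    """Initial training: sum the encoded samples of each class (bundling)."""
    classes = [[0.0] * dim for _ in range(num_classes)]
    for x, y in zip(samples, labels):
        classes[y] = [c + h for c, h in zip(classes[y], encode(x))]
    return classes


def cosine(u, v):
    """Cosine similarity, used here to compare a query with a class hypervector."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


encode = make_encoder(n_features=3, dim=256)
samples = [[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]]
class_hvs = bundle_classes(encode, samples, labels=[0, 1], num_classes=2, dim=256)
```

A query would then be assigned to the class whose hypervector has the highest similarity to the encoded query; cosine similarity is used here as one common choice.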

Fig. 4. Illustration of how our proposed framework disables the ADC to prevent the excessive generation of digital frames. Generating digital frames with no useful information causes an unnecessary cost increase in a system without HyperSense.

Fig. 5. Overview of our object detection framework for Intelligent Sensing. The object detection framework consists of two models: (a) Fragment model and (b) HyperSense model. The trained Fragment model is applied to the HyperSense model.

Fig. 8. The top-level architecture design. (a) Input radar image partition. (b) Systolic array (SA) groups. (c) Processing elements (PE) interconnect.

Suppose the base hypervector $\vec{B}_2$ is left shifted by $\vec{B}_1$ and base hypervector $\vec{B}_3$ is left shifted by $\vec{B}_2$. In this case, we can rewrite HDC encoding computation as:

Fig. 10. HDC fragments encoding computation reuse example. Here we suppose the width of the sliding window is 3. Due to page size limitation, only 5 steps are presented in this figure.

Fig. 11. Comparison of the Fragment model with baseline models, which are widely used lightweight models for object detection, when the fragment size is 128.

Fig. 15. Exploration of the maximum true positive rate (TPR) when targeting a certain false positive rate (FPR) with different fragment sizes and dimensionalities.

Fig. 16. Cross-model and cross-platform comparison. The operating power of all three hardware platforms is between 5 W and 15 W. Here we use the MLP with a fragment size of 96 on R Pi 4B as the baseline.
If $H = H_1 + H_2$, then both $H_1$ and $H_2$ are similar to $H$. From a cognitive perspective, it can be interpreted as memorization. 2) Binding: this operation, denoted as $*$, is typically implemented as element-wise multiplication. If $H = H_1 * H_2$, then $H$ is dissimilar to both $H_1$ and $H_2$. Binding also has the important property of similarity preservation in the sense that for some hypervector

TABLE I
AREA UNDER THE CURVE (AUC) COMPARISON WITH DIFFERENT BASELINE MODELS INCLUDING OUR Fragment model WHEN CONSIDERING TRUE POSITIVE RATE (TPR) LARGER THAN 0.8.

TABLE II
FPGA RESOURCE UTILIZATION ON XILINX ZCU104. THE HYPERVECTOR DIMENSION IS 5K AND FRAGMENT SIZE IS 96.

Fig. 17. Energy consumption breakdown estimation in two different scenarios, where the object of interest is infrequent (left, 1% probability of object of interest) and 10 times more frequent (right, 10% probability of object of interest), for the conventional method, the compressive sensing method, and ours with different target FPRs. Figure annotations: 84.3% total energy saving, 58.2% edge energy saving.