Fast OMP algorithm and its FPGA implementation for compressed sensing-based sparse signal acquisition systems

Compressed sensing-based radio frequency signal acquisition systems call for high reconstruction speed and low dynamic power. In this study, a novel low power fast orthogonal matching pursuit (LPF-OMP) algorithm is proposed for faster reconstruction of sparse signals from their compressively sensed samples, with a reconstruction circuit that consumes very little dynamic power. The time to find the best column is reduced by reducing the number of columns to be searched in successive iterations. A novel architecture for the proposed LPF-OMP algorithm is also presented and implemented on a field programmable gate array to demonstrate the performance enhancement. The computation of the pseudoinverse in OMP is avoided, saving both time and the storage required for the pseudoinverse matrix. The proposed design incorporates a novel strategy to stop the algorithm without any extra circuitry. A case study on the reconstruction of RADAR test pulses is carried out. The design is implemented for K = 256, N = 1024 on a XILINX Virtex6 device and supports a maximum of K/4 iterations. The proposed design is faster, more hardware efficient and consumes less dynamic power than previous implementations of OMP. In addition, it proves efficient in reconstructing signals with low sparsity.


| INTRODUCTION
High-frequency radio frequency (RF) signals, such as RADAR pulses, are sparse in a transform domain. Exploiting this sparsity, modern signal measurement systems use compressed sensing (CS) [1,2] in place of other existing sampling techniques [3] to acquire RF signals. CS-based acquisition systems can work with low-speed analog-to-digital converters because they sample at a sub-Nyquist rate [4]. In the CS-based sampling paradigm, random measurements are taken from the signal and the original signal is then recovered from the measurement samples by a signal recovery algorithm. Orthogonal matching pursuit (OMP) [5,6] is a well-known recovery algorithm which, unlike other greedy pursuit algorithms, provides good performance with moderate computational complexity. OMP estimates a sparse signal by executing two steps in every iteration: atom searching (AS) and solving a least squares (LS) problem. In the AS step, OMP identifies an atom, i.e. a column of the sampling matrix, which gives maximum correlation with the current residual. The signal is subsequently estimated by solving an LS problem.
The timing complexity of the AS step is very high, as it is a linear function of the signal sparsity and the number of samples. Many techniques have been reported in the literature to reduce it. In [7], the authors applied clustering algorithms to group similar columns and reported a tree-based pursuit algorithm, but no implementation of that algorithm has been reported. Parallel selection of multiple columns was proposed to address the timing complexity problem in [8,9]; selecting multiple columns reduces the timing complexity, but with a greater chance of choosing wrong columns.
Many implementations based on either field programmable gate arrays (FPGAs) or application-specific integrated circuits have been reported, and the LS problem is solved in different ways in current research. The implementations of OMP reported in [9][10][11][12][13][14][15] used modified Cholesky factorization [16] to solve the LS problem, while [17] used lower-upper decomposition [16]. QR decomposition [16] is another powerful matrix factorization technique used in many implementations of OMP; the implementations in [18][19][20][21][22] preferred the modified Gram-Schmidt algorithm [16] to perform the QR factorization. Besides the matrix factorization-based solutions, the matrix inversion bypass (MIB) technique reduces the computational complexity of OMP; an MIB-based implementation of OMP is reported in [23]. The authors in [24] further reduced the complexity of OMP and achieved a high signal-to-noise ratio.
Real-time CS-based sampling of high-frequency periodic and RF pulses requires fast signal reconstruction. On the other hand, the high digital overhead and high dynamic power consumption of implementations of the recovery algorithms are further concerns when sampling practical sparse signals. In such applications, OMP implementations are required to have a high reconstruction speed, low hardware complexity and low power consumption. In addition, the hardware complexity must depend only weakly on the unknown signal sparsity. Previously, in [25,26], we presented hardware-efficient architectures of OMP based on QR decomposition and incremental Gaussian elimination, respectively, where the same hardware resources are shared to perform different operations.
The classic OMP algorithm is reformulated here to speed up the reconstruction process while simultaneously reducing the dynamic power consumption. The reformulated algorithm is termed the low power fast OMP (LPF-OMP) algorithm. A novel architecture for the LPF-OMP algorithm is also proposed and implemented on an FPGA platform to demonstrate its performance. Low-power devices are desirable in real-life portable applications where extending the battery life is a prime objective, and speed enhancement matters in time-critical applications. The algorithm can find use in portable RADAR systems, remote sensing systems such as drones, unmanned underwater vehicles and many portable signal acquisition devices. This manuscript makes the following major contributions:
1. The correlation step of the classic OMP algorithm is modified to search through different numbers of columns in successive iterations, reducing the reconstruction time.
2. The algorithm uses a partial evaluation of incremental QR decomposition by the modified Gram-Schmidt algorithm.
3. The computation of the pseudoinverse is avoided, which saves both time and the storage required for the pseudoinverse matrix.
4. A novel method to stop the algorithm is proposed, which makes the design invariant to the signal sparsity.
5. The OMP algorithm reformulated in this manner achieves low dynamic power consumption.
This manuscript is organized as follows. Section 2 describes the proposed LPF-OMP algorithm: the classic OMP algorithm is first briefed and the modifications to it are then illustrated, along with the novel strategy to stop the algorithm and the estimation strategy. In Section 3, the proposed architecture for hardware implementation is discussed in detail. Section 4 presents the performance evaluation of the LPF-OMP algorithm and its implementation, and Section 5 concludes the study.

| PROPOSED LOW POWER FAST OMP-BASED SPARSE SIGNAL RECONSTRUCTION
Real-world signals can be sparse either in time or in some other transform domain. An m-sparse input signal x ∈ R^(N×1) can also be written as x = Ds, where D ∈ R^(N×N) is the dictionary and s is the sparse representation of x. CS-based acquisition of the signal x can be expressed as

y = ψx + η = ϕs + η,

where y is the measurement vector and η is the noise vector. The sampling matrix ϕ ∈ R^(K×N) can be expressed as ϕ = ψD, where ψ is the sensing matrix used to take linear measurements from x.

| Proposed LPF-OMP algorithm
The proposed LPF-OMP algorithm is shown in Algorithm 1. It differs from the original OMP algorithm in the way suitable columns are chosen from the sampling matrix at each iteration. Initially two empty sets Λ and Ω are defined: Λ stores the index of the most correlated column selected at each iteration and Ω stores the best set of columns. The constant parameters w and ϵ are also inputs to the algorithm. The signal estimate ŝ and the residual vector r are the outputs of OMP. The initial value of the residual vector r_0 is taken as y. I denotes the total number of iterations and i is the iteration counter. The constant I_max stands for the maximum iteration count and the parameter ϵ is the threshold of the stopping criterion.

Algorithm 1 LPF-OMP algorithm for signal recovery
In the classic OMP, the AS step finds the index of the column for which the magnitude of the inner product 〈ϕ_j, r_i〉 is maximum, where ϕ_j is the j-th column of ϕ. This index and the corresponding column of ϕ are augmented to Λ and ϕ̂, respectively. In the estimation step, an LS equation is solved to estimate ŝ; its solution is

ŝ = ϕ̂† y,

where ϕ̂† is the Moore-Penrose pseudoinverse of ϕ̂. The residual is calculated in the final step. These steps are repeated until a halting criterion is satisfied.
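As a reference point, the classic OMP loop described above can be sketched in a few lines of NumPy. This is an illustrative software model, not the paper's hardware version; it deliberately uses the pseudoinverse-style least squares that the proposed design avoids, and all names are illustrative:

```python
import numpy as np

def omp(phi, y, max_iter, eps=1e-12):
    """Recover a sparse s with y ~= phi @ s by classic OMP."""
    N = phi.shape[1]
    r = y.copy()                  # residual r_0 = y
    support = []                  # Lambda: indices of chosen atoms
    coef = np.zeros(0)
    for _ in range(max_iter):
        # AS step: column with maximum |<phi_j, r_i>|
        j = int(np.argmax(np.abs(phi.T @ r)))
        if j not in support:
            support.append(j)
        # Estimation step: least squares over the selected columns
        sub = phi[:, support]
        coef, *_ = np.linalg.lstsq(sub, y, rcond=None)
        r = y - sub @ coef        # residual update
        if r @ r < eps:           # halting criterion
            break
    s_hat = np.zeros(N)
    s_hat[support] = coef
    return s_hat, r
```

For a 4-sparse signal with a Gaussian sampling matrix, this model recovers the support and coefficients to machine precision.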
The AS step of the classic OMP algorithm is modified in this study to achieve a high reconstruction speed by reducing the timing complexity of the AS step. In the iterations where the condition mod(i, w) = 0 is satisfied, a set Ω is formed which contains the indices of the best W columns; Ω_W denotes the W-th index of Ω. All other steps are evaluated in the same way as in the original OMP. In the next (w − 1) iterations, only the columns indexed by Ω are searched to find the best suitable column. In this way the timing complexity of the AS step, in the iterations which do not satisfy the above condition, is reduced from N to W. The constant parameter w can take values greater than one and can be tuned to produce better results. The size W of Ω depends on the value of w; it is observed that W can be set to a minimum of 2w to produce an acceptable signal estimate.
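The modified atom search can be sketched as follows: a software model of the schedule just described, with a full search every w-th iteration that caches the W best indices in Ω, and a restricted search over Ω otherwise (function and variable names are illustrative):

```python
import numpy as np

def lpf_atom_search(phi, r, i, w, W, omega):
    """One AS step: full N-column search when i % w == 0, otherwise
    a restricted search over the W cached indices in omega."""
    if i % w == 0:
        corr = np.abs(phi.T @ r)                     # search all N columns
        omega = np.argsort(corr)[::-1][:W].tolist()  # cache the W best indices
        best = omega[0]                              # most correlated column
    else:
        corr = np.abs(phi[:, omega].T @ r)           # search only W columns
        best = omega[int(np.argmax(corr))]
    return best, omega
```

The restricted branch touches only W columns of phi, which mirrors how the hardware shortens the access time of the sampling-matrix memory in those iterations.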
In general, power and speed trade off against each other in the design of integrated circuits (ICs): improving one parameter usually degrades the other. Speed improvements are typically obtained through parallelism and additional resources, which increase power consumption, while power-reduction techniques at the circuit and architecture level reduce speed. Power can, however, also be reduced significantly by modifying the algorithm. Here, power is not improved at the cost of speed; rather, the algorithm is modified to achieve power reduction, which also improves the speed because fewer computation steps are involved.

| Proposed stopping criteria
Conventionally there are three ways to stop the OMP algorithm.
1. Stopping after a fixed number m of iterations.
2. Waiting until ‖r_i‖_2 drops below a threshold level.
3. Stopping the algorithm when max(|〈ϕ_j, r_i〉|) drops below another threshold value.
Stopping after a fixed number of iterations becomes impractical with no prior knowledge of the signal sparsity. The other methods are based on the progressive decay of the residual. The second option demands extra circuitry for the Euclidean norm computation; an alternative is to decide based on the value of ‖r_i‖_2². The problem with the second and third options is the choice of threshold. They converge easily in the noiseless case, but for compressed signals in the presence of noise the choice of threshold is very difficult. The length of the measurement vector is also a key parameter in deciding the threshold.
The stopping criteria based on the decaying residual work well for signals having fixed amplitudes with a Gaussian sensing matrix, but do not guarantee reconstruction of practical signals which are only approximately sparse. Also, the threshold value changes drastically with the type of signal. The stopping criterion used in this work is

〈r_i, r_i〉 / 〈y, y〉 < ϵ:

whenever the ratio on the left-hand side drops below the predetermined threshold, the algorithm stops.
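A minimal software model of this halting test, assuming (as in the implementation section) that the ratio is 〈r_i, r_i〉/〈y, y〉 with 〈y, y〉 computed once up front:

```python
import numpy as np

def should_stop(r, yy, eps=2.0 ** -10):
    """True when <r_i, r_i> / <y, y> has dropped below eps.

    yy = <y, y> is computed once and, in hardware, held in a register."""
    return float(np.dot(r, r)) / yy < eps
```

Because the test is a ratio, the same threshold works across signals of different amplitudes, which is the robustness property argued above.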

| Proposed signal estimation strategy
A QR-based matrix factorization technique is used to solve the LS problem. The pseudoinverse can be written as

ϕ̂† = R⁻¹Qᵀ,

where Q ∈ R^(K×I) is an orthogonal matrix and R ∈ R^(I×I) is an upper triangular matrix. The signal estimate is then computed as

ŝ = R⁻¹Y,

where Y = Qᵀy, the vector of inner products between the columns of Q and y. The estimate ŝ can be computed in two ways. One possible way is to compute the pseudoinverse explicitly and then multiply it with y.
On the other hand, Y can be evaluated first and the result then multiplied by the inverse of R. The latter method saves the timing complexity of the pseudoinverse computation and the storage needed for the pseudoinverse matrix, and is therefore the method followed here.
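The second route can be sketched with a standard orthonormal QR. Note that the paper keeps the columns of Q merely orthogonal and unnormalized; this sketch uses NumPy's normalized QR for simplicity, so it is an assumption-laden software model rather than the hardware procedure:

```python
import numpy as np

def estimate_via_qr(Q, R, y):
    """Solve R s = Q^T y by back-substitution; the pseudoinverse
    R^{-1} Q^T is never formed or stored."""
    Y = Q.T @ y                       # inner products with y first
    n = R.shape[0]
    s = np.zeros(n)
    for k in range(n - 1, -1, -1):    # back-substitution on upper-triangular R
        s[k] = (Y[k] - R[k, k + 1:] @ s[k + 1:]) / R[k, k]
    return s
```

Solving triangularly against Y gives the same LS solution as multiplying by the explicit pseudoinverse, without ever materializing an I × K matrix.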
| PROPOSED ARCHITECTURE

The vector multiplication unit (VMU) consists of two major blocks, viz., a multiply-accumulate (MAC) unit and an adder tree, as shown in Figure 1. A DSP block-based MAC unit is used here. Each MAC is controlled by a control signal fn that configures it to perform different operations. The dispatcher unit provides inputs to the VMU. The bank of MUXes (M1) selects between y and ϕ. A dedicated divider performs the division in the QR stage and computes reciprocals in the matrix inversion stage. The memory block ϕ_mem stores the matrix ϕ and y_mem receives the measurement samples from the external world; serial data from y_mem are captured by a register bank (RB). The memory block q_mem stores Q and also y to decrease the logic complexity. The block r_mem temporarily stores e_i and r_i, where e_i represents the orthogonal columns of matrix Q.

| Sorting
The proposed scheme for evaluation of the AS step is shown in Figure 2. In this step, the inner product 〈ϕ_j, r_i〉 is evaluated to search for the most correlated column. Initially y is written to the r_mem memory through M1 and the MAC unit; during this step, the MAC performs the operation z = c − z when the control signal fn is "00". The measurement vector y is also written into q_mem to reduce the hardware complexity of the dispatcher unit.
The dispatcher unit first selects r_i through both ports to compute 〈r_i, r_i〉 and then selects r_i and ϕ_j to find the column which gives the maximum correlation. The memory block ϕ_mem is accessed in such a way that the columns of ϕ can be read in parallel. These two inner products are evaluated in pipeline. During the evaluation of the inner products, the control input fn is "01" and the MAC unit is configured to perform the operation z = a·b. The inner product 〈r_i, r_i〉 is saved in a controlled register to evaluate the stopping criterion.
In the first iteration, N columns are searched to find the column most correlated with r_0, and the best W columns are also identified so that only W columns need to be searched in the next (w − 1) iterations. When the iteration count again reaches a multiple of w, all N columns are searched once more and the same process is repeated for the next (w − 1) iterations. In this way the timing complexity of the AS step is reduced.
The sort block (SB) gives the index of the most correlated column and also the indices of the best W columns. It also generates an indxc signal, which is used to start the QR factorization and other functions. The basic network (BN) block, shown in Figure 3, is designed to find the maximum of a serial input data stream and is the basic sub-block of the SB. The proposed architecture of the SB is shown in Figure 4: W BN blocks are connected in series to sort W elements. The absolute value of the inner product (ip_k) is fed to the SB. Indices of the columns are identified either from a counter which counts from 0 to N or from the set Ω, depending on the iteration count. The best W indices are stored in the RAM3 memory, and the value of indx1, the index of the most correlated column, is stored in the RAM2 memory in each iteration.
The comparison is done in parallel with the multiplication, and the best W columns are identified in (W − 1) clock cycles. In the iterations where the condition is satisfied, this sorting process is carried out in parallel with the QR factorization, as shown in Table 1.

| Partial incremental QRD
The modified Gram-Schmidt algorithm is used here to perform the QR factorization. In the prior QR-based implementations of OMP [20][21][22], the LS step is solved by computing the pseudoinverse matrix, and the columns of Q are normalized by an extra normalizing step which involves computing a square-root reciprocal. Previously in [25] we avoided the normalization step, but the computation of the pseudoinverse still degrades the performance of OMP. In this study, the evaluation of the QR factorization is simplified by avoiding both the computation of the pseudoinverse and the normalization step.

Algorithm 2 Sparse signal estimation (input: augmented matrix ϕ̂, iteration count i)

Steps 2-14 of Algorithm 2 describe the QR decomposition. The actual QR factorization is not evaluated: in a true QR decomposition, Q has orthonormal columns and these columns multiply ϕ̂ to generate R, whereas here the columns of Q are kept merely orthogonal. Normalization of the columns of Q is required neither in the computation of the pseudoinverse nor in the estimation of ŝ, but for convenience the names Q and R are retained.

| Implementation
The above-mentioned QR decomposition is performed in six steps, evaluated sequentially in each iteration:
1. Compute the inner products of ϕ̂_i with the previously saved e_j.
2. Divide each inner product by the previously saved d_1j.
3. Multiply with e_j and accumulate; subtract the accumulated result from ϕ̂_i.
4. Compute 〈e_i, ϕ̂_i〉, which is R_ii.
5. Compute 〈e_i, e_i〉 and save it.
6. Compute 〈e_i, y〉 and save it.
Step-by-step evaluation of the QR factorization is shown in Figure 5. In iteration 1, e_1 = ϕ̂_1, so the evaluation of the first three steps is skipped. The RB r_mem captures ϕ̂_i through the MAC unit in iteration 1 to evaluate the other functions.
Step 1 is an inner product operation between ϕ̂_i and the previously saved e_j. This step produces the i-th column of matrix R, which is saved in R_mem. In step 2, the divider divides the resulting inner product by the previously saved d_1j. A temporary storage RAM2 is used to store the outputs of the divider. Depending on whether the iteration count is less than or greater than the overall VMU-divider path latency, the divider outputs (op) are read from RAM2 instantly or after some clock periods, respectively. In step 3, first a scalar-vector multiplication is performed and then the result is subtracted from ϕ̂_i. The MAC unit performs the operation z = (a·b) + z when the input fn is "10". The previously saved e_k are read again from q_mem and multiplied with the resulting value (op). The resulting vector e_i is saved in q_mem. The RB r_mem also holds the vector e_i to evaluate steps 4, 5 and 6.
Steps 4, 5 and 6 compute inner products and are evaluated in pipeline, one after another. 〈e_i, ϕ̂_i〉 is equal to R_ii, which is saved in R_mem for matrix inversion. The result of step 5, d_1i, is saved in ip_mem for the next iterations. The evaluation of Y_i = [〈e_1, y〉, 〈e_2, y〉, …, 〈e_i, y〉]ᵀ, which is step 6, is carried out along with steps 4 and 5 of the QR decomposition. The measurement vector y is read from q_mem (u) and multiplied with e_i stored in r_mem. The resulting value is stored in Y_reg, which is a controlled register.
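A software sketch of one unnormalized modified Gram-Schmidt update, corresponding to steps 1-3 and 5 above (here E holds the saved e_j and d the saved d_1j = 〈e_j, e_j〉; the names and the exact loop structure are illustrative assumptions, not the hardware schedule):

```python
import numpy as np

def gs_step(E, d, phi_i):
    """Orthogonalize phi_i against the saved e_j WITHOUT normalizing:
    e_i = phi_i - sum_j (<e_j, e> / <e_j, e_j>) e_j  (modified GS order)."""
    e = phi_i.astype(float)
    for e_j, d_j in zip(E, d):
        e -= (np.dot(e_j, e) / d_j) * e_j   # steps 1-3: project and subtract
    E.append(e)
    d.append(float(np.dot(e, e)))           # step 5: save <e_i, e_i>
    return e
```

Because each e_j is only orthogonal, the division by d_j replaces the square-root-reciprocal normalization that the design avoids.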
The total timing complexity T_QR to evaluate the QR factorization for K = 256, N = 1024 contains a last term that applies only to iterations greater than 27, which is the total latency of the VMU-divider path.

| Matrix inversion, computation of b s and residual
Steps 16-20 of Algorithm 2 perform the matrix inversion. The matrix B is evaluated in the same way as depicted in [25]. In each iteration, R_mem stores an evaluated column of R and B_mem stores the elements of matrix B. Rows of matrix B are accessed in parallel and R_mem is an RB, so a column of R from R_mem and rows of B from B_mem are fed to the VMU. The output of the VMU block is taken from the ip_I1 signal. The dedicated divider computes the reciprocal of R_ii in parallel; extra registers are inserted to match the latencies of the two operations. The total time complexity for computation of the inverse is I(I − 1)/2, where I denotes the total number of iterations.

The signal estimate ŝ is computed by steps 21-24 of Algorithm 2. In this estimation procedure, the previously computed values ŝ_(i−1) are used to compute the current estimate ŝ_i. It is not required to store all the components of the vector Y: only 〈e_i, y〉, evaluated at each iteration, is saved in the single controlled register Y_reg. The estimate ŝ is stored in the ŝ_mem memory block. The proposed scheme for computation of the signal estimate is shown in Figure 6; its time complexity is I(I + 1)/2.

Design | Timing complexity
Classic OMP [14] | Nm (N columns are searched at each iteration)
Improved OMP [9] | N(m/2 + 1) (parallel selection of two columns)
To compute the residual, ϕ̂ is first multiplied with ŝ. Columns of ϕ̂ are accessed from ϕ_mem, and the memory RAM1 provides ϕ_mem with the indices of those columns. ŝ is accessed serially from ŝ_mem. This scalar-vector multiplication is performed by the same MAC unit. After the multiplication and accumulation, y is selected through the MUX M1 for subtraction, and the product ϕ̂ŝ is subtracted from y by the MAC unit. The total timing complexity of the residual computation is I(I + 1)/2 for multiplication and accumulation, plus I extra clock cycles for the subtraction.
The timing complexity T_oth for matrix inversion, estimation of ŝ and computation of the residual for K = 256, N = 1024 contains a last term that applies only to iterations greater than 16, which is the overall latency of the path from the divider to the signal estimation block.

| Implementation of stopping criteria
Evaluation of a stopping criterion is avoided in the existing implementations. Here it is achieved easily, making the design invariant to knowledge of the sparsity, and it needs no extra hardware except a comparator. In each iteration, 〈r_i, r_i〉 is computed in pipeline with the function 〈ϕ_j, r_i〉. The inner product 〈y, y〉 is saved in a register, and the divider is shared to compute the ratio in parallel with the computation of 〈ϕ_j, r_i〉.
The threshold ϵ can be tuned close to zero for best results; a simple choice is of the order of 2⁻¹⁰. The stopping criterion for the (i−1)-th iteration is checked in parallel with the evaluation of the inner product 〈ϕ_j, r_i〉 for the i-th iteration. The total time to determine the stopping criterion is l_ip + l_dv + 1, where l_ip and l_dv are the latencies of the inner product and the division, respectively. Overall, only I clock cycles are spent checking the stopping criterion.

| Reconstruction efficiency
Software analysis of the proposed LPF-OMP algorithm is carried out for random input signals taking values from the set {−1, 1} for different sparsity values, with a random Gaussian sampling matrix. A comparison of the performance of different versions of OMP is shown in Figure 7. Performance depends on the number of iterations allowed for a given signal sparsity: if the algorithm runs for a fixed m iterations, performance degrades for signals with low sparsity, while extra iterations improve the performance at the cost of extra hardware. The proposed LPF-OMP incorporates a stopping criterion to improve the performance, allowing a maximum of K/4 iterations. The probability of success of the proposed LPF-OMP is better than that of OMP with fixed iterations and also better than the improved OMP [9]. Simulation results are recorded for different sparsity values, and for each signal sparsity 100 simulations are carried out to calculate the probability of success. Success is measured in terms of the relative (normalized) root mean square error,

RMSE = ‖x − x̂‖_2 / ‖x‖_2,

and a reconstruction is considered a success if the RMSE is of the order of 10⁻¹⁵ or less.
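The success test can be modelled as below, assuming the standard normalized-RMSE definition ‖x − x̂‖_2/‖x‖_2 (the exact formula and tolerance here are assumptions for illustration):

```python
import numpy as np

def relative_rmse(x, x_hat):
    """Normalized reconstruction error ||x - x_hat||_2 / ||x||_2."""
    return float(np.linalg.norm(x - x_hat) / np.linalg.norm(x))

def is_success(x, x_hat, tol=1e-14):
    """Count a trial as a success when the relative RMSE is ~1e-15 or less."""
    return relative_rmse(x, x_hat) <= tol
```

The probability of success for a sparsity level is then simply the fraction of the 100 trials for which is_success returns True.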
The OMP algorithm requires at least O(m log N) measurement samples [6] to recover an m-sparse signal, so for a given K the proposed LPF-OMP is capable of reconstructing signals with sparsity close to the theoretical maximum. Table 2 shows a comparison of the performance of different versions of OMP for a random signal with parameters K = 256, N = 1024 and m = 36. As the value of w is increased, the reconstruction time reduces significantly. When w = 32, however, the reconstruction time is higher, because the minimum searching time in each iteration is then 64 clock cycles, which dominates. In the SB, a higher value of w increases the number of required comparators. The value of w can therefore be chosen based on two considerations: the required reconstruction speed and the permissible number of comparators.
The choice of w and W is crucial for optimum performance of the LPF-OMP algorithm. Several simulations have been carried out to find the optimum value of W, as shown in Table 3. Each test set consists of 100 random input signals, generated as for Figure 7, and the performance of classic OMP and LPF-OMP is compared for each test set, measured by the probability of success as in Figure 7. Parameters are chosen as N = 1024, K = 256 and the simulations are stopped after m = 36 iterations. Table 3 shows that the reconstruction performance is very poor when W < 2w; it improves when W ≥ 2w and extra iterations are added, as shown in Figure 7.

| Recovery signal-to-noise ratio
The design metric recovery signal-to-noise ratio (RSNR) [25] is used here to estimate the design performance:

RSNR = 20 log₁₀(‖x‖_2 / ‖x − x̂‖_2) dB.

Various input signals are applied to the proposed design for parameters K = 256, N = 1024 and m = 36. The algorithm is halted after I = 36 iterations to compare the design performance with other existing designs. The fixed-point data width of the design is varied in the XILINX platform and the RSNR is measured for each width. To consume minimum hardware resources while meeting an acceptable RSNR requirement, an 18-bit data width is used; the RSNR performance is similar to that reported in [25]. With the fractional part represented by 10 bits or 12 bits, RSNRs of 16.498 dB and 18.336 dB are achieved, respectively. The proposed design is targeted to the XILINX Virtex6 FPGA device for K = 256, N = 1024, w = 8 and supports a maximum of I_max = K/4 iterations.
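A small helper for this metric, assuming the usual recovery-SNR definition RSNR = 20 log₁₀(‖x‖_2/‖x − x̂‖_2):

```python
import numpy as np

def rsnr_db(x, x_hat):
    """Recovery SNR in dB: 20 * log10(||x||_2 / ||x - x_hat||_2)."""
    return 20.0 * np.log10(np.linalg.norm(x) / np.linalg.norm(x - x_hat))
```

For example, an estimate whose error norm is one tenth of the signal norm scores 20 dB, which is in the same range as the fixed-point results quoted above.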

| Resource utilization
Hardware resources are shared and reused to reduce the digital overhead. The proposed architecture uses inbuilt DSP blocks to realize different arithmetic functions, and inbuilt block RAMs (BRAMs) are configured as RAM or ROM. The memory utilization of the proposed design is given in Table 4. A parallel CORDIC-based [27] divider with 10 pipeline stages is used and is shared among different steps. The resources used in this design are independent of the sparsity level but depend on the values of I_max and K. The choice of w is also crucial, since the number of comparators depends on its value.

The proposed design is compared with other existing designs in Table 5. In comparison to the Cholesky-based architecture in [14], the proposed architecture uses fewer DSP and RAM blocks when targeted to the same FPGA device. The architecture in [9] implements improved OMP with a complex sampling matrix; the proposed architecture considers a real sampling matrix and is much more hardware efficient than the architecture in [9]. Resource utilization of the Cholesky-based designs reported in [9,14] depends heavily on the signal sparsity, whereas the proposed design, although it supports a maximum sparsity level of K/4, has resource utilization that varies little with the sparsity level. The MIB-based design reported in [23] uses very few DSP blocks and memory elements, but its slice occupancy is quite high and, for N = 1024, its reconstruction time is very high. The proposed design has resource utilization similar to our previous study [25], but with lower memory utilization since it avoids the pseudoinverse computation; the matrix multiplication unit is also simplified to reduce power consumption and achieve better reconstruction speed. If w = 2, the proposed variation is similar to the improved OMP reported in [9]; for w = 1, the proposed algorithm reduces to classic OMP.

| Reconstruction speed
The classic OMP algorithm is reformulated to achieve a high reconstruction speed, reducing the total time complexity of OMP in two ways. The first reduction takes place in the correlation step, because of the change in the number of columns to be searched in successive iterations. Secondly, the timing complexity of computing the pseudoinverse is saved in estimating ŝ. Two extra clock cycles are required for the first iteration: one to store y at the initialization step and another to store e_1 temporarily in r_mem in the QR stage. Also, a total of I clock cycles are spent evaluating the stopping criterion. The interdependency of the steps is reduced by inserting buffers where necessary, so that the evaluation of a step can start before the full evaluation of the previous one. An estimate of the total reconstruction time of the proposed design for K = 256, N = 1024, w = 8 is T_ov = T_AS + T_QR + T_oth + (I + 2); hence a total of 10051 clock cycles is required for I = 36 iterations. The proposed architecture is faster than the other designs, as shown in Table 5, and can therefore be adopted for real-time measurement of high-frequency RF signals such as RADAR pulses. As shown in Table 2, the proposed architecture achieves better reconstruction speed even for classic OMP. In comparison to our previous architectures [25,26], the hardware cost is the same but the reconstruction speed is higher.

| Power consumption
The power consumption of a circuit can be reduced at every level of abstraction, from algorithm to device. Modification at the algorithmic level can achieve far greater power reduction than the lower levels of abstraction in IC design. Here, dynamic power consumption is reduced by means of both algorithmic and architectural improvements, and the power consumption of the proposed design is considerably lower than that of other reported designs [9,14,25]. The XILINX XPower analyzer is used to estimate the power consumption of the FPGA implementation.
A considerable amount of dynamic power is saved by the algorithmic reformulation of OMP. In classic OMP, the ϕ_mem block is active for N clock cycles in each iteration to perform the AS step, so a major fraction of the total dynamic power is consumed in accessing ϕ_mem. In the proposed LPF-OMP, the access time of ϕ_mem in the AS step is not the same in all iterations: in some iterations ϕ_mem is accessed for a shorter period of time, and thus the dynamic power is reduced.
The first architectural measure to reduce power is resource sharing: for example, a single VMU performs several operations, and the proposed architecture uses fewer hardware resources than the Cholesky-based implementations reported in [9,14]. By avoiding the pseudoinverse computation, a simpler VMU reduces power consumption and improves timing in comparison to [25]. Secondly, all the BRAMs, including ϕ_mem, are accessed selectively, only when necessary during the evaluation of a step. In Table 5, the power consumption of the proposed design for m = 64 and m = 36 is almost the same, because both configurations have similar resource utilization.

| A practical case for implementation
The reconstruction of high-frequency RADAR pulses is studied here. RADAR signals are sparse when represented in a Gabor time-frequency dictionary. A general framework for CS-based capture of unknown RADAR pulses is given in [25], where different types of Gaussian pulses are taken as test pulses. The OMP algorithm must incorporate a stopping criterion because no information about the sparsity is available. The signal strength and the variety of RADAR pulses may also vary in such cases, so the stopping criterion should be robust and able to recover an unknown pulse in a minimum number of iterations. The reconstruction of two types of RADAR test pulses, viz., a trapezoidal pulse and a Gaussian fifth-derivative pulse, is compared using the same value of the threshold parameter ϵ in both cases. Figure 8 compares the reconstructed output of the FPGA implementation with the original signals for the two pulses with the same halting threshold.

| CONCLUSION
A novel LPF-OMP algorithm is proposed to achieve low power consumption and high reconstruction speed for the measurement of high-frequency RF pulses such as RADAR pulses. Both are improved by reducing the number of columns to be searched in the AS step of successive iterations, and the pseudoinverse computation is avoided for a further reduction in reconstruction time. A novel architecture is proposed to implement the LPF-OMP algorithm. Its dynamic power consumption is considerably low due to the algorithmic and architectural improvements, and it proves faster than the other existing designs. Area is also reduced by reusing the same VMU for every operation, and the design consumes fewer memory and DSP blocks for the same design parameters. A novel stopping criterion is proposed which is robust and consumes no extra hardware. The proposed design is efficient for signals with a low sparsity level, as its resource utilization varies only marginally with the sparsity level.

ACKNOWLEDGMENTS
No funding agency was involved in this manuscript. It is the result of research work carried out at the Department of Electronics and Communication Engineering, NIT Rourkela.