### Abstract

- Abstract
- Introduction
- Related Work
- Low Rank and Sparse Decomposition
- Low Latency Reconstruction
- Theoretical Justification
- Experimental Result
- Conclusions
- References
- Biographical Information

We propose a method for analyzing surveillance video using low rank and sparse decomposition (LRSD) with low latency, combined with compressive sensing, to segment the background and extract moving objects. The video is acquired through compressive measurements, and the measurements are analyzed by a low rank and sparse decomposition of a matrix. The low rank component represents the background, while the sparse component, obtained in a tight wavelet frame domain, identifies the moving objects. An important feature of the proposed low latency method is that the decomposition can be performed with a small number of video frames, which reduces reconstruction latency and makes real-time processing of surveillance video possible. The low latency method is both justified theoretically and validated experimentally. © 2014 Alcatel-Lucent.

### Introduction

In a surveillance network, cameras transmit surveillance videos to a processing center where the video streams are processed and analyzed. The ability to detect moving objects in a scene quickly and automatically is of particular interest in surveillance video processing. Detection of moving objects is traditionally achieved by background subtraction methods [1, 2], which segment the background from moving objects in a sequence of surveillance video frames. The technique described in [1] stores, for each pixel, a set of values observed in the past at the same location or in its neighborhood; it compares this set to the current pixel value to determine whether the pixel belongs to the background, and adapts the model by randomly choosing which values of the background model to substitute. The mixture of Gaussians technique [22] assumes that each pixel value follows a distribution that is a sum of Gaussians, with the background and foreground modeled by Gaussians of different sizes. In low rank and sparse decomposition (LRSD) [6], the background is modeled by a low rank matrix, and the moving objects are identified by a sparse component.
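The LRSD model can be illustrated with a short sketch. The code below is not the compressive-measurement formulation of this paper: it decomposes a plain pixel-domain frame matrix (one vectorized frame per column) into a low rank background plus a sparse foreground, using a generic inexact augmented Lagrangian loop for robust PCA in the spirit of [6]; the weight λ = 1/√max(m, n) and the penalty schedule are common textbook choices, assumed here for illustration.

```python
import numpy as np

def lrsd(D, lam=None, tol=1e-7, max_iter=500):
    """Split D into a low rank part L (background) and a sparse part S
    (moving objects) by alternating singular-value thresholding and
    entrywise soft thresholding (an inexact augmented Lagrangian loop)."""
    m, n = D.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))   # common robust-PCA weight
    norm_D = np.linalg.norm(D)
    Y = np.zeros_like(D)                 # Lagrange multiplier
    S = np.zeros_like(D)
    mu = 1.25 / np.linalg.norm(D, 2)     # penalty parameter
    for _ in range(max_iter):
        # low rank update: shrink the singular values
        U, sig, Vt = np.linalg.svd(D - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # sparse update: soft-threshold the residual entrywise
        T = D - L + Y / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0.0)
        Z = D - L - S                    # constraint violation
        Y += mu * Z
        mu = min(mu * 1.5, 1e7)
        if np.linalg.norm(Z) / norm_D < tol:
            break
    return L, S

# synthetic "video": a static rank-1 background plus a few moving pixels
rng = np.random.default_rng(0)
bg = rng.random(60)                      # one background frame, 60 pixels
L0 = np.outer(bg, np.ones(40))           # 40 identical background frames
S0 = np.zeros_like(L0)
S0[rng.integers(0, 60, 20), rng.integers(0, 40, 20)] = 1.0  # foreground
L, S = lrsd(L0 + S0)                     # recover background and foreground
```

On this toy data the low rank component recovers the static background and the support of the sparse component marks the moving pixels, which is the behavior the background subtraction methods above aim for.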

These traditional background subtraction techniques operate on video in the pixel domain, and require the pixels of a surveillance video to be captured, transmitted, and analyzed. The ever-growing number of surveillance cameras generates an enormous amount of data that must be transported over the network, and there is a high risk that network congestion will prevent timely detection of moving objects. Addressing the congestion problem with conventional video coding methods makes the video transmission highly sensitive to varying channel conditions and significantly increases the complexity of both the cameras and the processing center. It is therefore highly desirable to have a network of cameras in which each camera transmits a small amount of data carrying enough information for reliable detection and tracking of moving objects. Compressive sensing [7] allows us to achieve this goal. Compressive sensing has previously been used for both video processing [6, 7] and background subtraction [8, 15].
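The compressive sensing idea can be sketched as follows: each camera transmits m random projections of a frame with m well below the number n of pixels, and a sparse scene can still be recovered at the processing center. The Gaussian sensing matrix and the orthogonal matching pursuit recovery below are generic textbook choices, not the specific operators or algorithm of this paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 256, 100, 5          # pixels, measurements (m << n), scene sparsity

Phi = rng.standard_normal((m, n)) / np.sqrt(m)   # random sensing matrix
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = 1.0         # k-sparse scene
y = Phi @ x                                      # what the camera transmits

def omp(Phi, y, k):
    """Orthogonal matching pursuit: greedily select the column most
    correlated with the residual, then re-fit the coefficients."""
    idx, r = [], y.copy()
    for _ in range(k):
        idx.append(int(np.argmax(np.abs(Phi.T @ r))))
        coef, *_ = np.linalg.lstsq(Phi[:, idx], y, rcond=None)
        r = y - Phi[:, idx] @ coef
    x_hat = np.zeros(Phi.shape[1])
    x_hat[idx] = coef
    return x_hat

x_hat = omp(Phi, y, k)   # the sparse scene, recovered from m < n numbers
```

The point of the sketch is the data reduction: only m numbers per frame cross the network, yet the sparse content is recoverable.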

In [15], an LRSD of a matrix is used to process compressive measurements to segment the background and extract moving objects. The method described in [15], which was motivated by the work in [6], assumes that the surveillance video is composed of a low rank component (the background) and a sparse component (the moving objects), possibly represented in a tight wavelet frame domain. The background subtraction therefore becomes part of the reconstruction. Furthermore, the reconstruction in [15] takes advantage of the knowledge that the video has a background, which helps reduce the number of measurements required.

Since the method used in [15] reconstructs the background without a training process, compressive measurements from a large number of frames are needed to recover the background properly. Typically, the number of frames used in the reconstruction is on the order of a hundred, representing a few seconds of video in real time. This causes an inherent latency of a few seconds, which is independent of, and added to, the computation time. Such latency may not be appropriate for real-time applications.

In this paper, we propose a low latency LRSD method that extends the framework of [15] to reduce the latency of the reconstruction process. As in [15], segmentation of the background is performed by an LRSD of the matrix. In this paper, however, the low rank matrix is augmented with known background frames, which may be learned through a training process, for example by the methods in [15] and [23]. With the augmented low rank matrix, the reconstruction by LRSD can be carried out with compressive measurements from only a few video frames, as few as one. In other words, as soon as the measurements from one video frame are available, we can begin processing them to reconstruct the background and compute the silhouettes of the moving objects in that frame. The method presented in this paper therefore paves the way for real-time processing of compressively sensed surveillance video by compressive LRSD.
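The intuition behind the augmented matrix can be conveyed by a pixel-domain toy, under the simplifying assumption that the background subspace is known exactly (in the paper, the known background frames instead augment the low rank matrix inside the LRSD reconstruction from compressive measurements). With the background known, a single new frame suffices: the foreground appears as the residual after projecting the frame onto the span of the known background frames.

```python
import numpy as np

rng = np.random.default_rng(2)
n, J1 = 256, 8                      # pixels per frame, known background frames

# background spanned by two patterns (e.g., two lighting conditions)
P = rng.random((n, 2))
B = P @ rng.random((2, J1))         # the J1 known background frames

s = np.zeros(n)
s[rng.choice(n, 6, replace=False)] = 1.0     # moving-object pixels
f = B[:, -1] + s                    # one new frame: background + foreground

# orthonormal basis of the background span (rank-truncated SVD)
U, sv, _ = np.linalg.svd(B, full_matrices=False)
Q = U[:, sv > 1e-10 * sv[0]]

s_hat = f - Q @ (Q.T @ f)           # residual after removing background
mask = np.abs(s_hat) > 0.5          # silhouette of the moving objects
```

Latency here is a single frame: the silhouette is computed as soon as `f` arrives, which is the behavior the augmented LRSD formulation achieves in the compressive setting.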

Panel 1. Abbreviations, Acronyms, and Terms

| Term | Definition |
| --- | --- |
| 2D | Two-dimensional |
| ADM | Alternating direction method |
| JPEG | Joint Photographic Experts Group |
| LRSD | Low rank and sparse decomposition |
| MPEG | Moving Picture Experts Group |
| PCA | Principal component analysis |
| RGB | Red, green, blue |

The method we propose removes a fundamental barrier to real-time processing of surveillance video in methods that use LRSD, such as those in [6] and [15], by relaxing the requirement on the number of frames needed in the reconstruction process.

### Related Work

We make a few notes to compare this paper with other work in the literature. The novelty of this paper is the low latency it achieves for LRSD with compressive sensing. It has been demonstrated in [15] that LRSD with compressive sensing achieves what traditional methods cannot. For example, in the Daniel video sequence [11], a sudden illumination change in the background was falsely detected as a moving foreground object by the principal component analysis (PCA) method [11], whereas the LRSD method of [15] did not register a false detection. This demonstrates the advantage of LRSD. Furthermore, traditional background subtraction methods such as [1, 2, 22] do not use compressive sensing, so the video must be processed in the pixel domain rather than in the compressed domain. Although compressive sensing has been used in video processing, existing compressive video sensing methods reconstruct the video frames but do not segment the background from the foreground, and therefore require an additional pixel domain background subtraction step. Our method performs the background segmentation as part of the reconstruction.

This paper differs from [15] in that [15] must use a large number of frames to reliably detect moving objects, which results in a long latency and makes real-time processing difficult. This paper improves on [15] by achieving low latency through processing a small number of frames at a time, which makes real-time processing viable. Reference [23] is a companion paper to this work: this paper establishes a theoretical basis for low latency LRSD by assuming that the background is known, while [23] relies on this basis and develops an algorithm to adaptively train the background model. This work thus provides a theoretical justification for the development in [23], and [23] in turn provides additional experimental results that validate the theory presented here.

The paper is organized as follows. We first introduce notation and review the framework for reconstruction of compressively sensed video using LRSD. We then describe a low latency method for LRSD, followed by its theoretical justification. Finally, we report results from numerical experiments.

### Theoretical Justification

In this section, we provide a theoretical justification for the low latency method that is presented in the section on Low Latency Reconstruction. We will show that if a large number of frames are used to obtain the background frames as in [15], then subsequent LRSD may be performed with any number of frames.

We start by making some definitions. Let *J*, *J*_{1}, and *J*_{2} be positive integers such that

- (15)

Note that in the definitions above, the superscripts 1 and 2 denote the first *J*_{1} frames and the last *J*_{2} frames of a matrix, respectively. The subscripts 1 and 2 denote the background and foreground, respectively. This is illustrated in Figure 2.

With these definitions, we have the following result.

#### Theorem 1:

If *X** = *X*^{*}_{1} + *X*^{*}_{2} is a solution of the minimization problem

- (18)

- (19)

then the matrix

- (20)

is a solution to the minimization problem of equation 2 and equation 3, i.e., X̂ = X̂_{1} + X̂_{2}, where (X̂_{1}, X̂_{2}) is a solution to the minimization problem defined in equation 17.

If we assume that the LRSD minimization given in equation 2 and equation 3 has a unique solution, which is justified by the work of [5] and [6] under the condition that ϕ and *W*_{i}, i = 1, 2, are incoherent [5], then we have the following stronger result.

#### Corollary 1:

With the notations of theorem 1, if the minimization problem in equation 2 and equation 3 has a unique solution, then the minimization problem in equation 18 and equation 19 also has a unique solution; furthermore, the solution satisfies

- (21)

Corollary 1 can be interpreted as follows. Let the background and foreground frames be properly segmented by using a large number of frames to compute the solution X̄ of equation 2 and equation 3 with *J* ≫ 1. The computed solution X̄ has a large number of frames, from which we take the first *J*_{1} background frames and use them as known background frames. We then form an augmented matrix to process *J*_{2} frames of video by solving equation 18 and equation 19, which is a problem for a smaller number of video frames. Corollary 1 shows that the *J*_{2} video frames computed from equation 18 and equation 19 must be the same as the last *J*_{2} frames of X̄. Since *J*_{2} can be any positive integer in theorem 1 and corollary 1, we can choose a small *J*_{2} ≪ *J*. Hence, the problem defined by equation 18 and equation 19 has low latency because *J*_{2} can be small, as small as 1. Therefore, corollary 1 shows that if the background frames are known, the augmented matrix can be used to form a problem where the LRSD can be performed with low latency.
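This interpretation can be checked numerically in a simplified setting. The sketch below is an assumption-laden stand-in for the paper's formulation: it works in the pixel domain (no compressive measurements) with a generic inexact augmented Lagrangian solver rather than the paper's algorithm. The first *J*_{1} columns of the data matrix are the known background frames, whose sparse component is constrained to zero, and only *J*_{2} = 2 new frames are decomposed.

```python
import numpy as np

def lrsd_aug(D, J1, lam=None, tol=1e-7, max_iter=500):
    """Low rank + sparse decomposition of the augmented matrix D, whose
    first J1 columns are known background frames: their sparse component
    is constrained to zero (a generic inexact-ALM sketch)."""
    m, n = D.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))
    norm_D = np.linalg.norm(D)
    Y = np.zeros_like(D)
    S = np.zeros_like(D)
    mu = 1.25 / np.linalg.norm(D, 2)
    for _ in range(max_iter):
        # low rank update: singular-value shrinkage
        U, sig, Vt = np.linalg.svd(D - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # sparse update: soft thresholding
        T = D - L + Y / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0.0)
        S[:, :J1] = 0.0              # known-background columns stay clean
        Z = D - L - S
        Y += mu * Z
        mu = min(mu * 1.5, 1e7)
        if np.linalg.norm(Z) / norm_D < tol:
            break
    return L, S

rng = np.random.default_rng(3)
n, J1, J2 = 80, 10, 2                # pixels, background frames, new frames
bg = rng.random(n)
B = np.outer(bg, np.ones(J1))        # known background frames
S2 = np.zeros((n, J2))
S2[rng.integers(0, n, 8), rng.integers(0, J2, 8)] = 1.0  # true foreground
D = np.hstack([B, np.outer(bg, np.ones(J2)) + S2])       # augmented matrix
L, S = lrsd_aug(D, J1)
S2_hat = S[:, J1:]                   # foreground of the J2 new frames
```

Despite decomposing only two new frames, the recovered foreground matches the ground truth, illustrating why a small *J*_{2}, even *J*_{2} = 1, suffices once the background frames are known.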

#### Proof of theorem 1:

We want to show that X̂ of equation 20 is a solution to equation 2 and equation 3. To this end, we will first show that X̂ satisfies the constraint of equation 2, and then we will show that X̂ minimizes the cost function in equation 2.

First, from the definition of operator “°” given in equation 1, we have

- (22)

Then because *X** is the solution of the problem given by equation 18 and equation 19, we also have

- (23)

In the equation above, the first equality is from the constraint in equation 18, and the second equality is from the definition of *y*^{(2)} in the last equation of equation 16.

Now, because X̄ is the solution to equation 2 and equation 3, by using equation 22 and equation 23, we can derive

- (24)

which shows that X̂ satisfies the constraints of equation 2.

Next, we show that X̂ has an expression in the form of equation 3 which minimizes the cost function in equation 2. Let

- (25)

Then from the definition of X̂, we have

- (26)

which shows that X̂ can be expressed in the form of equation 3; it remains to show that X̂_{1} and X̂_{2} also minimize the cost function of equation 2. To do so, we need the following property, which can be derived from the definitions of ‖ ‖_{1} and *W*_{i} given in equation 5 and equation 7, respectively:

- (27)

Since X̄ = X̄_{1} + X̄_{2} minimizes the cost function in equation 2, equation 30 implies that X̂ = X̂_{1} + X̂_{2} also minimizes it. Hence X̂ = X̂_{1} + X̂_{2} is a solution to equation 2 and equation 3, which concludes the proof.

#### Proof of corollary 1:

From theorem 1, X̂ is a solution to equation 2 and equation 3. Since equation 2 and equation 3 have a unique solution, which is X̄, we conclude that X̂ = X̄; in particular, *X** = X̄^{(2)}, which proves equation 21. This holds for any solution of equation 18 and equation 19, and therefore the solution to equation 18 and equation 19 is unique, which concludes the proof.

### Biographical Information

*HONG JIANG is a researcher with Alcatel-Lucent Bell Labs in Murray Hill, New Jersey. He received his B.S. from Southwestern Jiaotong University, Chengdu, China, M. Math from the University of Waterloo, and Ph.D. from the University of Alberta in Canada. Dr. Jiang is currently conducting research on digital communications, image and video processing. He has authored more than 50 technical papers in scientific and engineering journals, and has more than 40 U.S. patents in digital communications.*

*SONGQING ZHAO received B.E. and M.E. degrees in control science and engineering from Huazhong University of Science and Technology, Wuhan, China, and received his Ph.D. degree in electrical and computer engineering from the University of Illinois at Chicago. He was formerly a member of technical staff at Alcatel-Lucent in Murray Hill, New Jersey, where his research focus was on fourth generation (4G) Long Term Evolution (LTE) and video quality. Prior to joining Bell Labs, he worked as a research intern at Mitsubishi Electric Research Labs in Boston, Massachusetts, and for the Institute for Telecommunications Research, Adelaide, Australia. His research interests include wireless multimedia communications, information theory, video quality, quality of experience, video transmission, video processing, and pattern recognition.*

*ZUOWEI SHEN is the Tan Chin Tuan Centennial Professor at the National University of Singapore where he has been on the faculty at the Department of Mathematics since 1993. His primary research interests include wavelet frames, Gabor frames, and applications. More recently his research has focused on imaging science using wavelet and Gabor frames.*

*WEI DENG is a Ph.D. student in the Department of Computational and Applied Mathematics at Rice University, Houston, Texas. He received a B.S. degree in mathematics from Nanjing University, Nanjing, China, and an M.A. degree in computational and applied mathematics from Rice University, Houston, Texas. He is currently conducting research on developing and analyzing numerical optimization algorithms for various applications including compressive sensing, image and video processing, and machine learning.*

*PAUL A. WILFORD is the director of Multimedia Research at Alcatel-Lucent Bell Labs in Murray Hill, New Jersey. He received his B.S. and M.S. in electrical engineering from Cornell University, Ithaca, New York. His research focus was communication theory and predictive coding. Mr. Wilford is a Bell Labs fellow. He has made extensive contributions in the development of digital video processing and multimedia transport technology. He was a key leader in the development of Lucent Technologies' first high-definition television (HDTV) broadcast encoder and decoder. Under his leadership, Bell Labs then developed the world's first Moving Picture Experts Group 2 (MPEG2) encoder. He has made fundamental contributions in the high speed optical transmission area. Currently he is leading a department working on next-generation video transport systems, hybrid satellite-terrestrial networks, and high-speed mobility networks.*

*RAZIEL HAIMI-COHEN is a researcher with Alcatel-Lucent Bell Labs in Murray Hill, New Jersey and a member of the Alcatel-Lucent Technical Academy. His current research is in compressive sensing of video. Previously, he worked in the areas of video delivery and processing, audio and speech compression, speech recognition, signal processing, and cellular communication. Dr. Haimi-Cohen holds a B.Sc. in mathematics from Tel-Aviv University in Israel, an M.Sc. in applied mathematics from Cornell University, Ithaca, New York, and a Ph.D. in electrical engineering from Ben-Gurion University in Israel.*