Reduced-Gate Convolutional LSTM Using Predictive Coding for Spatiotemporal Prediction

Abstract—Spatiotemporal sequence prediction is an important problem in deep learning. We study next-frame(s) video prediction using a deep-learning-based predictive coding framework that uses convolutional, long short-term memory (convLSTM) modules. We introduce a novel reduced-gate convolutional LSTM (rgcLSTM) architecture that requires a significantly lower parameter budget than a comparable convLSTM. Our reduced-gate model achieves equal or better next-frame(s) prediction accuracy than the original convolutional LSTM while using a smaller parameter budget, thereby reducing training time. We tested our reduced-gate modules within a predictive coding architecture on the moving MNIST and KITTI datasets. We found that our reduced-gate model has a significant reduction of approximately 40 percent of the total number of training parameters and a 25 percent reduction in elapsed training time in comparison with the standard convolutional LSTM model. This makes our model more attractive for hardware implementation, especially on small devices.


I. INTRODUCTION
The brain in part acquires representations by using learning mechanisms that are triggered by prediction errors when processing sensory input [2], [8]. An early implementation of this approach is by Rao et al. [24], who modeled non-classical receptive field properties in the neocortex. The hypothesized brain mechanisms underlying this "predictive coding" are based on the concept of bi-directional interactions between the higher- and lower-level areas of the visual cortex. The higher-level areas send predictions about the incoming sensory input to the lower-level areas. The lower-level areas compare the predictions with ground-truth sensory input and calculate the prediction errors. These are in turn forwarded to the higher-level areas to update their predictive representations in light of the new input information.
Lotter et al. [22] introduced the predictive coding architecture, PredNet, for next-frame video prediction. The architecture was based on a deep neural-network framework that used a hierarchy of LSTMs. Since it was a deep network, it could readily be implemented and studied using off-the-shelf deep learning frameworks (e.g. Keras [4], PyTorch [23], TensorFlow [1]). This was in contrast to earlier models with a purely mathematical formulation [8].
The LSTM modules in PredNet were based on standard convolutional LSTMs (cLSTMs) that did not have peephole connections. The original convolutional LSTMs (convLSTMs) of Shi et al. [25] did have peephole connections. Convolutional LSTMs differ from regular LSTMs because they process images instead of feature vectors. Because of this, Shi et al. [25] replaced the matrix multiplication operation (affine transformation) with the convolution operation.
In the present paper, our work uses the PredNet architecture [22] but is novel in the LSTM-like modules used to form the architecture. Unlike the modules in PredNet, we use peephole connections. More important, we make two novel changes to the Shi et al. [25] convLSTM design. First, Shi et al. [25] implemented peephole connections by using the elementwise (Hadamard) multiply [17]. Although this technique is used in conventional LSTMs, convolutional LSTMs process images. In a convLSTM, the Hadamard operation will multiply corresponding pixels between two images to create a new image. This is suitable for feature vectors, but for images it creates an excessive number of trainable parameters (N² for an N × N image), and we find that performance can be improved by replacing the Hadamard operation with convolution. Our second modification is to use a novel, reduced-gate convolutional LSTM (rgcLSTM). Specifically, we use one trainable gate (the forget gate) and couple it by equality to the traditional input and output gates. This gives fewer trainable parameters but with equal or better performance accuracy, according to our experiments, than the original PredNet. Our rgcLSTM design is shown in Figure 1. The present paper motivates the design of our rgcLSTM, specifically, the use of convolution-based peephole connections and the single forget gate coupled to the input and output gates. We present performance results on the moving MNIST and KITTI datasets with and without these modifications. We find that our architecture gives better next-frame(s) prediction accuracy on these datasets while using a smaller parameter budget in comparison to the original Shi et al. [25] convLSTM, which uses an elementwise multiply for peephole connections, and also to the standard convLSTM (cLSTM), which lacks peephole connections.
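To make the parameter argument concrete, the two peephole styles can be counted directly. This is a toy sketch: the function names are ours, and biases and multiple channels are ignored.

```python
def hadamard_peephole_params(N: int) -> int:
    # An elementwise (Hadamard) peephole learns one weight per pixel
    # of an N x N image, so the count grows with image area.
    return N * N

def conv_peephole_params(m: int) -> int:
    # A convolutional peephole learns a single shared m x m kernel,
    # independent of the image size.
    return m * m

# e.g. a 64 x 64 frame with a 3 x 3 kernel:
# Hadamard needs 64 * 64 = 4096 weights; convolution needs only 9.
```

This is why the paper replaces the Hadamard peephole of Shi et al. [25] with a convolution: the cost no longer scales with the frame resolution.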

II. RELATED WORK
Recurrent neural networks (RNNs) are used to process sequential data. Spatiotemporal datasets such as video are sequential datasets where both temporal and spatial information is related. Spatiotemporal prediction (e.g. video prediction) is a challenge that has received intense interest in deep learning over the last few years. Spatiotemporal prediction typically uses unsupervised learning. There are several mechanisms to predict future video frames using unsupervised learning, such as [7], [22], [25], [27]. However, these models all use complex architectures and a large parameter budget.
The long short-term memory (LSTM) network, introduced by Hochreiter and Schmidhuber [16], was the first gated RNN approach to mitigate the vanishing and/or exploding gradient problem that prevented RNNs from learning long-term dependencies. Hochreiter and Schmidhuber [16] introduced a recurrent block that contained a memory cell which was updated by a constant-error-carousel (CEC) mechanism. The CEC was the key modification to mitigate the vanishing/exploding gradient problem. The main drawback of this first-generation LSTM was that if a continuous input stream was presented to the model long enough, it could cause saturation of the output activation function due to the unlimited growth of the memory cell state values [10].
To address this problem, Gers et al. [10] added a forget gate to the LSTM block, which has since become known as the standard LSTM. The forget gate allowed the LSTM to update and/or reset its memory cell. This improved performance for learning long-term dependencies. However, this model also had weaknesses. First, the model had no direct connection from the memory state to the three gates. Thus, there was no control from the memory to the gates to assist in preventing the gradient from vanishing or exploding. Second, the CEC had no influence over the forget and input gates when the output gate was closed. This could harm the model due to the lack of primary information flow within the model [11]. These problems were handled by adding peephole connections from the memory cell to each of the LSTM gates [11]. Although the peephole generalization of the standard LSTM became a powerful model, there was a significant increase in the number of trainable parameters, the training time, and the memory requirements.
There were attempts to design gated models that reduce the gate count while preserving learning power. Cho et al. [3] proposed a gated recurrent unit (GRU) model. Instead of three gates, it used two: an update gate and a reset gate. The update gate combined the input gate and forget gate [13]. The reset gate functioned as the output gate of the LSTM block [13]. The GRU model eliminated the output activation function, memory unit, and CEC [13]. The GRU yielded a reduction in trainable parameters compared with the standard LSTM. Zhou et al. [31] used a single-gate recurrent model called a minimal gated unit (MGU). Both models reduced the number of trainable parameters and gave results comparable to the LSTM [5]. However, neither model preserved the CEC. This may lead to exploding and/or vanishing gradients. There was a study based on the LSTM model in [10] to examine the role and significance of each gate in the LSTM [15], [19]. This study showed that the most significant gate in the LSTM was the forget gate. However, it also showed that the forget gate needs other support to enhance its performance.
While the above gated RNNs are designed to process sequential data, they must be augmented somehow to process spatiotemporal data, which contains images. The convLSTM operates on sequences of images. Shi et al. [25] proposed a convolutional LSTM to enhance performance for spatiotemporal prediction. This model replaced the matrix multiplications (affine transformations) with convolution operations for the input and recurrent input of each gate. The model achieved higher accuracy in comparison to the classical LSTM model. However, the number of parameters remained high because the peephole connections still used elementwise multiplication. In recent research, Lotter et al. [22] showed how to build a predictive coding model using a convolutional LSTM architecture. The LSTM had three gates but no peephole connections. The model achieved significant improvement as a predictive model because it did not need an encoder and decoder. However, the number of parameters was large and grew linearly with the number of layers in the model.
There were other attempts to design smaller recurrent gated models based on the standard LSTM. They were based on either removing one of the gates or the activation functions from the standard LSTM unit. Nonetheless, empirical analysis [13] compared these models and the standard LSTM. The conclusion in Greff et al. [13] was that these models had no significant improvement in either performance or training time. Moreover, these models had no significant reduction in trainable parameters. This empirical analysis [13] also stated that the critical components of the LSTM model were the forget gate and the output activation function.
Our new gated model is named the reduced-gate convolutional LSTM (rgcLSTM). Based on the empirical results of Greff et al. [13], our model preserves the critical components of the LSTM while removing parameter redundancy within the LSTM block. We use our model within the predictive coding framework introduced by Lotter et al. [22], a state-of-the-art approach for spatiotemporal prediction. The Lotter et al. [22] model uses standard convLSTM modules (no peephole connections) within its predictive coding architecture. We replace those modules with the rgcLSTM. Our model showed an efficiency comparable to the Lotter et al. [22] design. However, our rgcLSTM reduces the number of trainable parameters and memory requirements by about 40% and the training time by 25%, which makes it more attractive for hardware implementation and for training on low-power devices.

III. REDUCED-GATE CONVLSTM ARCHITECTURE
The proposed rgcLSTM architecture block appears in Fig. 2, where arithmetic, weights, and biases are made explicit. The architecture has one trainable gated unit, which we will call the forget gate or network gate. Because of its inputs, it is a forget gate, but since it is the only gate in the network it also makes sense to call it the network gate. The model preserves the cell memory state and keeps the CEC to avoid vanishing and/or exploding gradients. There is a peephole connection from the cell state to the network gate, but we have converted its operator from elementwise multiplication to convolution. This reduces the learning capacity of the rgcLSTM in comparison to a full LSTM with elementwise peephole connections to all three gates. However, it preserves the information needed to allow the memory state to exert control over the network gate. Our model is a convolutional model, which is preferable for image- and video-related tasks. Also, our model retains the output activation function. Thus, our model preserves the critical components of the LSTM as stated by Greff et al. [13] while removing much of the parameter redundancy in the LSTM unit. This gives a significant reduction in the number of required trainable parameters, training time, memory, and hardware requirements compared to the standard convolutional LSTM and other recurrent gated architectures. Furthermore, our rgcLSTM preserves these models' prediction accuracy.
During the forward pass within a block at time step t, the net input image a^(t) to the single network gate f^(t) is calculated by

a^(t) = W_f ∗ I_f^(t) + b_f,    (1)

where a^(t) ∈ R^(η×υ×n), and η, υ, and n are the width, the height, and the number of in-channels of a^(t). x^(t) is the input video frame at time t, x^(t) ∈ R^(η×υ×γ), where η and υ are the width and height of the image and γ is the number of input image channels. h^(t−1) is the output image stack of the block at time t − 1, and c^(t−1) is the image stack representing the block's internal state at time t − 1.
Both h^(t−1), c^(t−1) ∈ R^(η×υ×n), where η, υ, and n are the width, height, and number of core channels, respectively. W_fx, U_fh, and W_fc are convolution weight sets operating on their respective images or stacks. W_fx ∈ R^(m×m×γ) and both U_fh and W_fc ∈ R^(m×m×κ). m is the convolution kernel width and height. For a given block, all kernels are the same size (m × m). All three weight sets W_fx, U_fh, and W_fc and the biases b_f are trainable. The square brackets indicate stacking. We let

I_f^(t) = [x^(t), h^(t−1), c^(t−1)],    W_f = [W_fx, U_fh, W_fc] ∈ R^(m×m×(γ+2κ)×n),

where n is the number of output channels. Also, we let b_f ∈ R^(n×1), which is added to each channel in W_f ∗ I_f^(t) by broadcasting to the appropriate image channel. Note that the convolution operation between W_fc and c^(t−1) represents a departure from Shi et al. [25], where an elementwise multiply was used (in contrast to convolution). The total number of trainable parameters for the network gate is

f#_gate = (m²(γ + 2κ) + 1)n.    (2)

The network gate image value, f^(t)_gate ∈ R^(η×υ×n), is obtained by applying a pixelwise activation function G to the net input image:

f^(t) = G(a^(t)).    (3)

Depending on the application, G can be either the logistic function (σ) or the hard sigmoid (hardSig) [14]. The pixel values of f^(t) will fall in the range (0, 1) or [0, 1], depending on which function is used. Using σ, the gate value f^(t) is calculated by

f^(t) = σ(W_f ∗ I_f^(t) + b_f).    (4)

Stacking makes the learning process more powerful than non-stacked weights due to the influence of x^(t), h^(t−1), and c^(t−1) across the shared convolutional weight set. Lotter et al. [22] and Heck et al. [15] show empirically that stacking the input for recurrent units achieves better results than non-stacked input.
Fig. 3. This is the bottom layer of the PredNet architecture with our rgcLSTM module substituted for the cLSTM in the original PredNet [22]. The layer subscript l signifies that higher layers obey the same design, with the caveat that the input to higher layers is the prediction error from the next lower layer.
The input update (memory activation) uses a similar equation as for the network gate:

g^(t) = tanh(W_g ∗ I_g^(t) + b_g),    (5)

where I_g^(t) = [x^(t), h^(t−1)] ∈ R^(η×υ×(γ+κ)) and W_g = [W_gx, U_gh] ∈ R^(m×m×(γ+κ)×n). The number of core channels in W_g matches the number of core channels of W_f, which makes them computationally compatible. This approach is taken so that the dimension of g^(t)_gate ∈ R^(η×υ×n) matches the dimension of f^(t)_gate. Similarly, the dimension of b_g ∈ R^(n×1) matches that of b_f ∈ R^(n×1). Finally, the number of trainable parameters for the input update is

g#_gate = (m²(γ + κ) + 1)n.    (6)

Eqns. 2 and 6 count the total number of trainable parameters for the rgcLSTM module/block, so the final count is given by

rgcLSTM# = f#_gate + g#_gate.    (7)

The final equations to complete the specification of the rgcLSTM are given below:

c^(t) = f^(t) ⊙ c^(t−1) + f^(t) ⊙ g^(t),    (8)
h^(t) = f^(t) ⊙ tanh(c^(t)).    (9)

The symbol ⊙ denotes elementwise multiplication. κ is constrained to equal n so that the dimensions match for the elementwise multiplication operations.
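As a concrete sketch, the complete update can be written in a few lines of NumPy. The helper names (`conv_same`, `multi_conv`, `rgclstm_step`) and the tiny dimensions are ours for illustration; we assume the logistic gate activation (σ) rather than hardSig, and a real implementation would use a deep-learning framework's convolution rather than Python loops.

```python
import numpy as np

def conv_same(x, k):
    # "Same"-padded 2-D cross-correlation (deep-learning "convolution") of a
    # multi-channel image x: (H, W, Cin) with one kernel k: (m, m, Cin).
    # Returns a single (H, W) output channel.
    m = k.shape[0]
    p = m // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    H, W = x.shape[:2]
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + m, j:j + m, :] * k)
    return out

def multi_conv(x, K):
    # K: (m, m, Cin, n) -> stack n output channels into (H, W, n).
    return np.stack([conv_same(x, K[..., c]) for c in range(K.shape[-1])], axis=-1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rgclstm_step(x, h, c, W_f, b_f, W_g, b_g):
    # One forward step of the single-gate rgcLSTM.
    # Network (forget) gate: the stacked input includes the convolutional
    # peephole term c, a departure from the Hadamard peephole of Shi et al.
    f = sigmoid(multi_conv(np.concatenate([x, h, c], axis=-1), W_f) + b_f)
    # Input update (memory activation): stacked input is [x, h] only.
    g = np.tanh(multi_conv(np.concatenate([x, h], axis=-1), W_g) + b_g)
    c_new = f * c + f * g        # CEC update; the one gate plays both roles
    h_new = f * np.tanh(c_new)   # the same gate also gates the output
    return h_new, c_new
```

Note that because the single gate is coupled to all three roles, driving f toward zero performs the "extreme reset" discussed in Section IV-A: both the new state and the output collapse to zero.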

IV. OVERALL ARCHITECTURE
We use our rgcLSTM module within the PredNet predictive coding framework of Lotter et al. [22], so we have named our variant Pred-rgcLSTM. The bottom layer of the architecture appears in Fig. 3. Our contribution lies in the design of the convLSTM block, where our rgcLSTM replaces the standard cLSTM. In this design, whether PredNet or Pred-rgcLSTM, learning occurs only within the gated recurrent blocks and is triggered by a prediction error signal.
Although Fig. 3 shows the bottom layer (because the input is a video frame), we still include layer index subscripts, l, because the higher layers have similar structure (see Fig. 5). The prediction error module drives the gated-recurrent learning. It calculates the error between the input frame A_l and the predicted output frame Â_l. The trans_l convolution module converts the output of the rgcLSTM, h_l, to the predicted frame Â_l, whose dimensions are compatible with A_l for pixelwise subtraction. The prediction error module stacks the pixelwise differences between the predicted frame Â_l and the actual frame A_l by subtracting the two frames in both directions and applying the ReLU:

e_l^+ = ReLU(A_l − Â_l),    (10)
e_l^− = ReLU(Â_l − A_l),    (11)
e_l = [e_l^+, e_l^−].    (12)

e_l is a stack of the two error images from Eqns. 10 and 11. The E-layer feedback is sent to the rgcLSTM input arranger unit R_l and, after convolutional downsampling to make the dimensions compatible with the next layer's parameters, to the next higher layer R_l+1. Inputs to the rgcLSTM module are its internal recurrent input and its error feedback. In the multi-layer case, the input contains an additional term, the feedback r_upper from the directly higher rgcLSTM block. The input arranger unit stacks these inputs to fit into the rgcLSTM block.
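The error module of Eqns. 10 and 11, with the stacking of Eqn. 12, reduces to a few lines of NumPy (the function name is ours):

```python
import numpy as np

def error_module(A, A_hat):
    # Stack the rectified positive and negative pixelwise differences
    # between the actual frame A and the predicted frame A_hat.
    e_pos = np.maximum(A - A_hat, 0.0)   # ReLU(A - A_hat)
    e_neg = np.maximum(A_hat - A, 0.0)   # ReLU(A_hat - A)
    return np.concatenate([e_pos, e_neg], axis=-1)
```

Splitting the signed difference into two rectified channels preserves the sign of the error while keeping all values nonnegative; a perfect prediction yields an all-zero error stack.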
The rgcLSTM block learns the spatiotemporal changes in the training input data to predict future frames. The update process as a function of the sequential input within one layer of the proposed model is shown in Fig. 4. For illustration, we show three unrolling (unfolding in time) steps and start tracking the model from the leftmost (earliest) part of Fig. 4. In the beginning, the error module evaluates the prediction error as the difference between the assumed default prediction and the first input, which is the first video frame. This prediction error then goes to the rgcLSTM module as input. The rgcLSTM module processes this input, its initial state, and its initial output through its internal gates to produce the next predicted video frame as its output. Next, the current output of the rgcLSTM module passes through the error module to evaluate the next prediction error, which is forwarded to the next unfolded rgcLSTM module. At this point the prediction error acts as a correction factor that guides the rgcLSTM module to adjust its weights and state according to the current frame and the previous prediction of that frame. The process repeats until the final predicted frame.
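The unrolled loop above can be sketched abstractly. The callable placeholders (`error_fn`, `cell_step`, `to_frame`) and the function name are ours; they stand in for the error module, the rgcLSTM forward pass, and the trans_l convolution, respectively.

```python
def unrolled_predict(frames, error_fn, cell_step, to_frame, h, c, pred):
    # frames: the input frame sequence; pred starts as the default prediction.
    predictions = []
    for frame in frames:
        e = error_fn(frame, pred)     # correction signal from the current frame
        h, c = cell_step(e, h, c)     # update recurrent state from the error
        pred = to_frame(h)            # produce the next predicted frame
        predictions.append(pred)
    return predictions
```

The key design point, inherited from PredNet, is that the cell never sees the raw frame directly: its input at every step is the prediction error, so learning is driven entirely by what the model failed to predict.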
Since our model targets spatiotemporal prediction, we used hardSig as the recurrent activation function and tanh as the output activation function. hardSig is calculated as hardSig(x) = max(min(0.25x + 0.5, 1), 0).
We chose hardSig because it showed better empirical results than the sigmoid in our earlier experiments with LSTM spatiotemporal prediction models [6]. The hard saturation may help it escape from local minima [14].
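The hardSig formula above is one line of NumPy (the function name is ours):

```python
import numpy as np

def hard_sig(x):
    # hardSig(x) = max(min(0.25x + 0.5, 1), 0):
    # linear on [-2, 2], hard-saturating at 0 and 1 outside that range.
    return np.maximum(np.minimum(0.25 * x + 0.5, 1.0), 0.0)
```

Unlike the logistic sigmoid, hardSig reaches exactly 0 and 1, so the gate can fully close or fully open.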
We now explain the full multi-layer model shown in Fig. 5. Table I defines the terms used in the model. The input dimensions of each layer of the four-layered Pred-rgcLSTM and PredNet models are shown in Table II. h_l is the recurrent output of the rgcLSTM or the cLSTM (equivalent to h^(t−1)), e_l is the error feedback (equivalent to x^(t)), c_l is the memory cell of the gated unit (equivalent to c^(t−1)), and r_upper(l) is the higher-level feedback from layer l + 1. R_l is the total dimension of the forget-gate f^(t) input stack and Ra_l is the total dimension of the input stack to the activation unit g^(t). Table III and Fig. 5 show the dimensions of the weights and biases in each layer of the Pred-rgcLSTM and PredNet for each layer index l ∈ {0, 1, 2, 3}. f_l, i_l, and o_l are the forget, input, and output gates, respectively. g_l is the activation unit. downsample_l is the transition kernel from the lower layer to the higher layer. Â_l is the convolution kernel applied to the rgcLSTM output or the cLSTM output. Biases are the total number of biases in each layer (i.e. including the gate, activation, downsampling, and the transition between the output of the rgcLSTM and the error module). In Table II, Table III, and Fig. 5 we assume that the model input is a gray-scale image of size 64 × 64 pixels with one input channel.
For the purpose of counting trainable parameters, the values of m, γ, κ, and n from Section III are given in Table IV for all four layers in the model.

A. Keeping the Gate Output within a Functional Operating Range
From Fig. 1 we see that if the network gate value drops to zero, the rgcLSTM cell does an extreme reset: the memory state is forgotten, the current input is ignored, and the output is zeroed.

V. METHODS
Our experiments compare our rgcLSTM used in Pred-rgcLSTM with the standard cLSTM used in PredNet. We show that our rgcLSTM achieves the same or better accuracy than the standard cLSTM using a smaller parameter budget, less training time, and smaller memory requirements. To build Pred-rgcLSTM, we modified the code for PredNet, which is available at https://github.com/coxlab/prednet. To make the timing results comparable for both models, the only modification to the code was to replace the cLSTM with an rgcLSTM of compatible size, ensuring the comparisons were fair. Our rgcLSTM code is available at https://github.com/NellyElsayed/rgcLSTM.
We conducted two experiments on next-frame(s) prediction of a spatiotemporal (video) sequence: one for a gray-scale dataset (moving MNIST) and the other for an RGB dataset (KITTI traffic). The moving MNIST dataset [27] is a good example of how the model can determine the movement of two specific objects within a frame and how the model can handle new object shapes and new movement directions. The KITTI dataset [9] requires the model to recognize several different moving and non-moving objects within the frame; its roof-mounted camera records vehicle traffic, which has static objects (scene) and dynamic objects (moving vehicles). In addition, it shows how the model can deal with RGB videos. For both experiments, we trained our model using four layers like that shown in Fig. 3. The number of parameters and inputs are the same except for the first layer in each experiment, due to the different input sizes for moving MNIST versus KITTI. Both experiments used the Adam optimizer [21] with an initial learning rate α = 0.001 and a decay factor of 10 after half of the training process, with β1 = 0.9 and β2 = 0.999. Frame size was downsampled by a factor of 2 moving upward through the layers and upsampled by a factor of 2 moving down the layers.
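The learning-rate schedule described above can be sketched in plain Python (the function name is ours; in a Keras setup this would typically be wired into a `LearningRateScheduler` callback):

```python
def learning_rate(epoch, total_epochs, base_lr=1e-3, decay=10.0):
    # Initial rate 0.001, divided by the decay factor after
    # half of the training process has elapsed.
    return base_lr / decay if epoch >= total_epochs // 2 else base_lr
```

For example, with 150 training epochs the rate stays at 0.001 for epochs 0-74 and drops to 0.0001 from epoch 75 onward.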
For training elapsed-time comparisons in the first experiment, the model was trained on an Intel(R) Core i7-7820HK CPU with 32 GB memory and an NVIDIA GeForce GTX 1080 graphics card. For the second experiment, both Pred-rgcLSTM and PredNet were trained on an Intel(R) Core i7-6700 @ 4.00 GHz × 8 processor with 64 GB memory and an NVIDIA GeForce GTX 980 Ti/PCIe/SSE2 graphics card.
A. Experiment 1: Moving MNIST Dataset

1) Method: The moving MNIST dataset consists of 10,000 video sequences of two randomly selected moving digits.
2) Results: We compared the training time in minutes for the Pred-rgcLSTM and PredNet models for one batch of 6000 trials. The results appear in Table VI. We compared Pred-rgcLSTM with several other unsupervised recurrent gate-based models (i.e. either LSTM- or GRU-based models) on the moving MNIST dataset. Performance measures included mean squared error (MSE) [12], mean absolute error (MAE) [12], and the structural similarity index (SSIM) [30], which measures the structural similarity of image quality between predicted and actual images. We also compared our model to PredNet, on which our model is based but which uses the standard convLSTM. For the remaining models, which we were unable to test due to hardware limits, we obtained the results from their published work. All the models are listed in Table VII. Our Pred-rgcLSTM shows a reduction of both the MSE and MAE compared to all of the other models tested. This includes FCLSTM [27], CDNA [7], DFN [18], VPN [20], ConvLSTM [25], ConvGRU, TrajGRU [26], PredRNN [29], PredRNN++ [28], and PredNet [22]. The main comparison is between our rgcLSTM block and the cLSTM block using the same model architecture. Our model also has the highest (best) structural similarity index measurement (SSIM) among the tested models.
Based on Eqn. 7, the number of trainable parameters for one rgcLSTM block is calculated as follows:

rgcLSTM# = f#_gate + g#_gate = (m²(γ + 2κ) + 1)n + (m²(γ + κ) + 1)n.

For the standard convLSTM, the number of parameters is calculated by:

cLSTM# = g#_gate + f#_gate + i#_gate + o#_gate = 4(m²(γ + κ) + 1)n,

where i#_gate and o#_gate are the numbers of trainable parameters for the input and output gates of the standard convLSTM block. The multiplication by four is because the input update activation, forget gate, input gate, and output gate each have the same number of trainable parameters.
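The two counts, and the resulting reduction, can be checked directly. The function names and the channel values below are illustrative assumptions, not the paper's exact layer dimensions.

```python
def rgclstm_params(m, gamma, kappa, n):
    # Eqn. 2: network gate over the [x, h, c] stack, plus bias.
    f_gate = (m * m * (gamma + 2 * kappa) + 1) * n
    # Eqn. 6: input update over the [x, h] stack, plus bias.
    g_unit = (m * m * (gamma + kappa) + 1) * n
    return f_gate + g_unit  # Eqn. 7

def clstm_params(m, gamma, kappa, n):
    # Four units (input update, forget, input, output gates),
    # each like the input update, with no peephole connections.
    return 4 * (m * m * (gamma + kappa) + 1) * n
```

With γ = κ the ratio approaches 5/8, i.e. roughly the 40% reduction reported: the rgcLSTM keeps two weight sets instead of four, and the extra peephole channels in the gate stack claw back part of the naive halving.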
Hence, the number of trainable parameters is reduced by approximately 40% in the rgcLSTM compared to the cLSTM. The reduction falls short of a full halving because the peephole connection increases the size of the input stack to the rgcLSTM gate.
We also compared the memory (MB) required to save the trained Pred-rgcLSTM and PredNet models. The training parameters of each model were saved using the Keras API and the size of each file was examined. The results are shown in Table VIII.
Using the Keras API model.count_params() [4], we empirically compared the number of training parameters used in Pred-rgcLSTM and PredNet for the implemented models in Table IX. For the remaining models in Table IX, we obtained the counts from their published work [25]-[27], which were also obtained using the same Keras API method [4]. The empirical parameter counts for the implemented models are consistent with the dimensions given in Table III.
For experiment one, visual predictions for the moving MNIST dataset are shown in Fig. 6. In contrast to most of the models, neither PredNet nor our Pred-rgcLSTM requires an initial input trajectory for predicting the upcoming frames. They use only the current frame and the current state of the trained LSTM to predict the next frame. Our rgcLSTM demonstrates results comparable with the other models and finds better visual contours than the cLSTM used in PredNet. To see the improved prediction, Fig. 8 magnifies the predicted images for prediction time step 6 for each model.

B. Experiment 2: KITTI Dataset

1) Method: Our second experiment used the KITTI dataset [9]. This dataset is a collection of real-life traffic videos captured by a roof-mounted camera on a car driving in an urban environment. The data is divided into three categories: city, residential, and road. Each category contains 61 different recorded sessions (i.e. different video sequences), which were divided into 57 sessions for training and 4 for validation. For training both the rgcLSTM and cLSTM we used the same initialization, training, validation, and data down-sampling as Lotter et al. [22]. Each frame was 128 × 160 pixels. The total length of the training dataset (the 57 sessions) was approximately 41K frames.
The numbers of kernels in each layer, indicated by the value of n in Table XI, were 3, 48, 96, and 192, respectively. The resulting dimension changes for the weights in layer R_0 are shown in Table X; the other layers' weights remain unchanged from Table III. The input dimensions change in layer R_0 in the width, height, and number of in-channels of the original video frame. The remaining layers change only in the width and height of their input sizes.
2) Results: For the KITTI dataset, the visual testing results comparing our rgcLSTM and the cLSTM appear in Fig. 7. This figure shows that the rgcLSTM (with smaller training time and parameter count) matches the accuracy of the cLSTM. To compare the accuracy more clearly between the rgcLSTM and cLSTM, we magnified the images of prediction step 6 for each model, shown in Fig. 9. The predictions of both models, in comparison to ground truth, are essentially perfect.
In addition, the memory (MB) required to save the trained Pred-rgcLSTM model is smaller than for PredNet. Fig. 10 shows the training and validation loss through 150 epochs of KITTI dataset training for both the proposed Pred-rgcLSTM and PredNet. The graphs show that the training and validation losses of the two models are nearly identical, suggesting that both models may be approaching the limit in Bayesian prediction error.

VI. CONCLUSION
The novelty of our rgcLSTM architecture lies in the following features. First, there is one network gate that serves the function of the forget, input, and output gates. This reduces the number of gate parameters to one third that of a cLSTM. Second, in this reduced model there is still a peephole connection from the cell memory state to the network gate, in contrast to the cLSTM. Finally, the rgcLSTM uses a convolutional architecture, and we have replaced the elementwise multiply operation originally used in Shi et al. [25] with the convolution operation. This reduces the number of trainable parameters in addition to being a more appropriate operator for image processing.
Despite this parameter reduction, the proposed rgcLSTM model either outperforms or matches the performance of the cLSTM and other gate-based recurrent architectures. This was achieved by maintaining the critical components of the standard LSTM model, enhancing them with peephole connections, and removing redundant training parameters. These results make our rgcLSTM model attractive for future hardware implementation on small and mobile devices.

Fig. 1 .
Fig. 1. An unrolled architecture of the rgcLSTM. The single network gate (output indicated by σ) sends information to three locations, which correspond to the outputs of the forget, input, and output gates of the standard LSTM.

The operation level of the rgcLSTM is shown in Figure 2.

Fig. 2 .
Fig. 2. An unrolled operation level of the rgcLSTM architecture block design, where arithmetic, weights, and biases are made explicit.

Fig. 6 .
Fig. 6. Visual results of Moving MNIST predictions after training, based on our rgcLSTM and other models.

Fig. 7 .
Fig. 7. Next-frame prediction on the KITTI dataset. The predicted image is used as input for predicting the next frame.

Fig. 8 .
Fig. 8. Magnified visual result of Moving MNIST predictions for prediction time step 6 after training based on our rgcLSTM and other models.

Fig. 9 .
Fig. 9. The magnified visual results for the KITTI dataset prediction at the 6th prediction step for Pred-rgcLSTM and PredNet.

TABLE I DEFINITION OF TERMS

e_l: Eqn. 12; e_0 = x.
c_l: Eqn. 8.
r_upper(l): feedback from layer l + 1.
R_l: dimension of gate input stack.
Ra_l: dimension of activation input stack.
f_l: Eqn. 4.
g_l: Eqn. 5.
i_l: PredNet input gate for layer l.
o_l: PredNet output gate for layer l.
A_l: next frame input for layer l.
Â_l: predicted next frame for layer l.
downsample_l: convolutional downsampling from layer l to l + 1.
biases_l: total number of biases in layer l.

TABLE II DIMENSIONS
OF INPUT COMPONENTS TO PRED-RGCLSTM AND PREDNET FOR MOVING MNIST

TABLE IV DIMENSION
PARAMETERS FOR THE GATED MODULE (EITHER RGCLSTM OR CLSTM) FOR EACH OF THE FOUR LAYERS FOR MOVING MNIST.
After applying the ReLU, the e_l values fall in the range [0, 1]. These form the inputs x = e_l to the rgcLSTM. Within the rgcLSTM, the values of h and c (Fig. 1) are constrained to fall in the range (−1, 1) because they are outputs of tanh or modulated tanh. Finally, the network gate uses a hard sigmoid configured to be linear in the range [−2, 2]. The inputs to the hard sigmoid stay well within this range, thereby keeping the gate functional.

TABLE V EXPERIMENTAL
CONDITIONS IN MOVING MNIST EXPERIMENT FOR RECURRENT GATE-BASED BLOCKS.N/A MEANS THERE CANNOT BE PEEPHOLE CONNECTIONS BECAUSE THERE IS NO MEMORY CELL STATE.

TABLE VI MOVING
MNIST AVERAGE ELAPSED TRAINING TIME AND STANDARD ERROR (N=3).
Using the rgcLSTM block instead of the cLSTM reduced elapsed training time by about 26% for training moving MNIST for one epoch. Table VI also shows the SE of model training time for both the cLSTM and our rgcLSTM model. The sample size was n = 3 for each model.

TABLE VIII MOVING
MNIST MEMORY REQUIREMENT TO SAVE PRED-RGCLSTM AND PREDNET TRAINED MODELS.

TABLE IX MOVING
MNIST NUMBER OF TRAINABLE PARAMETERS IN EACH MODEL.

TABLE X KITTI
DATASET EXPERIMENT DIMENSIONS OF KERNEL (WEIGHT) COMPONENTS OF THE FIRST LAYER OF PRED-RGCLSTM AND PREDNET

The quantitative results of our rgcLSTM and cLSTM are shown in Table XII. The MSE, MAE, SE, and SSIM have a sample size of n = 3. Table XII also shows the memory (MB) required to save the trainable model for both Pred-rgcLSTM and PredNet. Both models have approximately the same MSE, MAE, and SSIM values. However, the rgcLSTM uses fewer trainable parameters and less training time.

TABLE XII PERFORMANCE
COMPARISON ON THE KITTI TRAFFIC VIDEO DATASET.