Interactive Control over Temporal Consistency while Stylizing Video Streams

Image stylization has seen significant advancement and widespread interest over the years, leading to the development of a multitude of techniques. Extending these stylization techniques, such as Neural Style Transfer (NST), to videos is often achieved by applying them on a per-frame basis. However, per-frame stylization usually lacks temporal consistency, manifesting as undesirable flickering artifacts. Most of the existing approaches for enforcing temporal consistency suffer from one or more of the following drawbacks: They (1) are only suitable for a limited range of techniques, (2) do not support online processing as they require the complete video as input, (3) cannot provide consistency for the task of stylization, or (4) do not provide interactive consistency control. Domain-agnostic techniques for temporal consistency aim to eradicate flickering completely but typically disregard aesthetic aspects. For stylization tasks, however, consistency control is an essential requirement, as a certain amount of flickering adds to the artistic look and feel. Moreover, making this control interactive is paramount from a usability perspective. To achieve the above requirements, we propose an approach that stylizes video streams in real-time at full HD resolutions while providing interactive consistency control. We develop a lite optical-flow network that operates at 80 FPS on desktop systems while maintaining sufficient accuracy. Further, we employ an adaptive combination of local and global consistency features and enable interactive selection between them. Objective and subjective evaluations demonstrate that our method is superior to state-of-the-art video consistency approaches.


Introduction
For thousands of years, paintings have served as a tool for visual communication and expression. However, it was not until the late 20th century that computers were used to simulate paintings [Hae90]. Over the following decades, the field of artistic stylization [KCWI13] has developed significantly and been extended by learning-based methods, such as Neural Style Transfer (NST) [SID17, JYF * 20]. Even though a large number of image stylization techniques exist, extending these to video remains challenging. A major obstacle in this regard is the enforcement of temporal coherence between stylized video frames. With the proliferation of video streaming applications, stylizing video streams has also become popular; however, the requirement of low-latency processing adds further challenges. Most of the existing methods that address the above can be classified into one of the following four categories:

Style Specific. A common approach is to develop a specific method for a particular artistic style and exploit its characteristics for temporal coherency [BNTS07, NSC * 11]. Such methods work effectively for the specific target style but do not generalize well. Many of these specialized approaches are discussed by Bénard et al. [BTC13].

Coherent Noise. Another class of techniques adapts and transforms a generic, temporally-coherent noise function to yield a visually plausible stylized output [BLV * 10, KP11]. Compared to target-based coherence enforcement [BNTS07], these apply to a wider range of techniques but are limited to scenarios with rapid temporal changes.

Stylization by Example. More recently, authors have adopted a stylization-by-example approach to support a wide range of stylization techniques [BCK * 13, JST * 19, TFK * 20, FKL * 21]. However, this approach requires the complete video as input together with marked keyframes. Thus, by design, it does not apply to video streams.

Consistent Video Filtering.
One can also enable the stylization of video streams using consistent video filtering techniques. Existing approaches are either not well-suited for Image-based Artistic Rendering (IB-AR) [BTS * 15, YCC17] (Fig. 1) or do not provide interactive consistency control [LHW * 18, TDKP21], which is an essential requirement for artistic rendering [FLJ * 14]. Currently, the only method that provides interactive consistency control is limited to offline processing and requires preprocessing [SST * 19].

Table 1: Comparing existing consistent video filtering methods with ours with regard to consistency control. Here, green denotes an aspect favorable to interactive consistency control while red denotes otherwise ("N/A" denotes Not Applicable).
We aim to develop a temporal consistency enforcement approach for artistic stylization techniques that provides (1) interactive consistency control and (2) online processing to facilitate the application to video streams.
A determining factor in the slow performance of the existing online and interactive consistent video filtering technique [BTS * 15] is the costly step of optical-flow computation. Previous works using learning-based methods achieve considerable accuracy for optical-flow estimation [TD20, JCL * 21]. However, we argue that such high accuracy is not strictly necessary to enforce temporal consistency for artistic stylization tasks. To validate this conjecture, we conduct a user study in which participants prefer the final consistent video output generated using our flow network over that obtained using state-of-the-art (SOTA) approaches.
In contrast to accuracy, less attention has been paid to improving the run-time performance of optical-flow estimation, which is essential for online interactive editing. To this end, we develop a lite optical-flow neural network that runs at high speed (approx. 80 FPS on mid-tier desktop GPUs) while maintaining sufficient accuracy. The compact network is also deployable on mobile devices (iPhones and iPads), where it runs at interactive frame rates (24 FPS on an iPad Pro 2020). We use the optical-flow output of this network to warp neighboring processed frames (for local consistency) and the previous consistent output (for global consistency), which allows for interactive global and local temporal consistency control. Our approach is able to stabilize incoming video streams in real-time with one frame of latency on a consumer desktop GPU at HD resolution, and, using a fast preset, also in full HD.
To summarize, we present the following contributions:
1. A novel approach for making per-frame stylized videos temporally consistent via an adaptive combination of local and global consistency features, which allows for interactive consistency control.
2. A lite optical-flow network, to achieve interactive performance, that runs at 80 FPS on a mid-tier desktop PC and at 24 FPS on a mobile device while achieving reasonable accuracy.
Note that we define artistic stylization as the adaptation of colors, textures, and strokes. While our approach is effective for most image-based stylization techniques (e.g., NSTs, algorithmic filtering), it cannot handle significant shape or content inconsistencies between frames introduced by semantically-driven image synthesis (e.g., image-to-image diffusion-based models [RBL * 22]). Flow-based warping is insufficient to enforce consistency in such cases.

Figure 2: Schematic overview of our approach: (1) We start by calculating the warping weights w_p and w_n by applying Eqn. 3 on the input image sequence I_{t−1}, I_t, I_{t+1}. (2) The computed weights are used to linearly combine the per-frame processed sequence P_{t−1}, P_t, and P_{t+1} to obtain the locally consistent image L_t, see Eqn. 2. (3) To obtain the globally consistent version G_t we warp the output at the previous time instance O_{t−1} as depicted in Eqn. 4. (4) The locally and globally consistent images, L_t and G_t, are linearly combined to obtain a temporally smooth version A_t, see Eqn. 5. (5) To include high-frequency details from the per-frame processed result, A_t and P_t are adaptively combined via the optimization in Eqn. 1 using the weights w_c (Eqn. 7) to obtain the final result O_t.

Background & Related Work
optimization-based problems via iterative filtering along the motion path. Dong et al. [DBZY15] address the problem of temporal inconsistency for enhancement algorithms by dividing individual video frames into multiple regions and performing a region-based spatio-temporal optimization. Bonneel et al. [BTS * 15] were the first to present a generalized approach for consistent video filtering that is agnostic to the type of filtering applied to individual video frames. The method combines gradient-based characteristics of the per-frame processed result with the warped version of the previous-frame output using a gradient-domain optimization scheme. Yao et al. [YCC17] propose a similar approach but consider multiple keyframes for warping-based consistency to avoid problems due to occlusion. Both approaches assume that the gradient of the processed video is similar to that of the input video and thus cannot handle artistic rendering tasks where new gradients resembling brush strokes are generated as part of the stylization process. Moreover, due to slow optical-flow computation, they are non-interactive in nature. Shekhar et al. [SST * 19] employ a formulation similar to that of Bonneel et al., with the difference of using a temporally denoised version of the current frame for consistency guidance. However, the temporal denoising requires the complete video as input, making the method offline in nature. Lai et al. [LHW * 18] propose the first learning-based technique in this context. The authors employ a perceptual loss to enforce similarity with the processed frames and, for consistency, make use of short-term and long-term temporal losses. Thimonier et al. [TDKP21] employ a ping-pong loss and a corresponding training procedure for temporal consistency. Both learning-based techniques are faster than their optimization-based counterparts since they do not perform optical-flow computation at inference time.
However, these learning-based techniques do not allow control over the degree of consistency in the final output, which is vital for the task of stylization. Thus, the above-discussed methods are either non-interactive/offline or do not provide any consistency control at inference time. Our approach addresses these limitations (Tab. 1).

Lai et al. [LHW * 18] employ FlowNet 2.0 [IMS * 17] for flow-based warping to design their short-term and long-term temporal consistency losses. FlowNet 2.0 is on par with the quality of state-of-the-art classical methods; however, due to a large number of parameters and operations, it achieves only interactive frame rates even on high-end desktop Graphics Processing Units (GPUs). An improved compact optical-flow Convolutional Neural Network (CNN) is proposed by Sun et al. [SYLK18]: PWC-Net. It combines coarse-to-fine estimation with pyramidal image features, correlation, warping, and CNN-based estimation. Furthermore, a refinement CNN is stacked at the end to improve the final flow estimate. PWC-Net is orders of magnitude smaller than FlowNet 2.0 and runs at real-time frame rates on desktop GPUs. Liu et al. [LZH * 20] train a similar architecture in an unsupervised setting and achieve reasonable accuracy with ARFlow. LiteFlowNet and its successor LiteFlowNet2, both proposed by Hui et al. [HTL18, HTL20], have similarly compact architectures. Further improvements in accuracy are achieved by models using iterative refinement, such as RAFT [TD20], and transformer modules, such as GMA [JCL * 21]; however, they heavily trade run-time for accuracy. Based on a run-time/accuracy comparison (see Sec. 3.2), we select PWC-Net as the base network to develop a "Lite" flow network with improved performance for interactive consistent filtering.

Li et al. [LLKY19] propose a method for arbitrary style transfer and show its applicability to real-time video style transfer by applying style features to consecutive frames using a shallow autoencoder.
However, we show that our approach, applied to their per-frame processed videos, is able to significantly reduce flickering and is more consistent than their stabilized version (see supplementary). Puy and Pérez [PP19] develop a flexible deep CNN for controllable artistic style transfer that allows for the addition of a temporal regularizer at testing time to remove flickering artifacts. This method comes closest in terms of providing some consistency control at test time for NST-based methods. However, it cannot handle classical stylization techniques. Keyframe-based Stylization (KBS) [BCK * 13, JST * 19, TFK * 20, FKL * 21] caters to both classical and neural paradigms via priors involving keyframe-based warping. Nonetheless, it is usually applied as an offline process involving pre-training on the input video. Moreover, we show that our approach is able to interactively stabilize online KBS approaches such as [TFK * 20]. We aim to propose a generic solution that is agnostic to the type of stylization and provides online performance and interactive consistency control.

Temporal Consistency Enforcement
Given an input video stream ..., I_{t−1}, I_t, I_{t+1}, ... and its per-frame processed version ..., P_{t−1}, P_t, P_{t+1}, ..., we seek to find a temporally consistent output ..., O_{t−1}, O_t, O_{t+1}, .... Our method is agnostic to the stylization technique f applied to each frame, where P_t = f(I_t). However, f must not introduce significant shape or content inconsistencies between consecutive frames, as the changes in the stylized frames should correspond to the optical flow (calculated based on the content). We initialize the consistent output for the first frame as its per-frame processed result, i.e., O_1 = P_1. To obtain the output for subsequent frames (O_t at any given instance t) we require only a snippet of the input (I_{t−1}, I_t, I_{t+1}) and processed streams (P_{t−1}, P_t, P_{t+1}), and the consistent output at the previous instance O_{t−1}. For enforcing consistency, we solve the following gradient-domain optimization scheme:

E(O_t) = ∫_Ω ( ‖∇O_t − ∇P_t‖² + w_c ‖O_t − A_t‖² ) dx,   (1)

where Ω represents the image domain. The data term in this optimization enforces similarity with the per-frame processed result P_t in the gradient domain. The gradient-based data term ensures that we borrow only the necessary details from the per-frame processed result (in the form of edges) while avoiding inconsistencies. Thus, high-frequency details are taken from P_t, and the smoothness term enforces temporal consistency, where low-frequency content is taken from the image A_t. The optimization formulation in Eqn. 1 is similar to that of prior work [BTS * 15]. However, our novelty is the way in which we construct our smoothness term, which, unlike previous approaches, considers both global and local consistency aspects. Our novel smoothness term better preserves the colors and textures in the stylized output while providing both short-term and long-term temporal consistency.
Local Consistency. For enforcing temporal consistency at a local level, we use optical flow to warp neighboring per-frame processed results to the current time instance t. This is performed by computing an adaptive combination of (1) the warped previous per-frame processed image Γ(P_{t−1}), (2) the warped next per-frame processed image Γ(P_{t+1}), and (3) the current per-frame processed image P_t, where Γ is the warping function. By including both backward and forward warping in our formulation, we are able to significantly reduce artifacts due to occlusion and flow inaccuracies. The linear combination of (1), (2), and (3) gives us a locally consistent version L_t:

L_t = w_p Γ(P_{t−1}) + w_n Γ(P_{t+1}) + (1 − w_p − w_n) P_t.   (2)

The weights w_p and w_n capture the inaccuracies in the warping of the previous and next frames, respectively, and are defined as follows:

w_p = exp(−α ‖I_t − Γ(I_{t−1})‖²),   w_n = exp(−α ‖I_t − Γ(I_{t+1})‖²).   (3)

In order to also incorporate a contribution from P_t, we clamp the weights w_p and w_n as follows: w_p ∈ [0, k_1] and w_n ∈ [0, k_2], where k_1 and k_2 are two constants whose sum is less than one, i.e., 0 < (k_1 + k_2) < 1. The locally consistent image sequence given by L_t has improved temporal consistency over the per-frame processed output; however, it still exhibits visible flickering artifacts. Thus, the reduction in flickering due to the warping of only one temporal neighbor in each direction is not sufficient. To further improve consistency, one can warp more neighboring frames around the current time instance t. Increasing the temporal window size for such an adaptive combination has a denoising effect, leading to a further reduction in flickering. The temporal denoising performed by Shekhar et al. [SST * 19] for enforcing consistency can be considered a specific example of the above scenario. However, for interactive stylization, warping more frames to the current instance is not feasible due to time constraints. Moreover, in the case of video streams, we do not have frames to warp from the forward temporal direction.
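The local combination can be sketched in NumPy as follows. This is a minimal sketch, assuming the flow-warped frames are already available; the function name, the exponential weighting form, and the default parameter values (taken from the parameter-settings section) are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def local_consistency(P_prev_w, P_next_w, P_t, I_prev_w, I_next_w, I_t,
                      alpha=6.5e3, k1=0.3, k2=0.5):
    """Adaptive local combination (sketch of Eqns. 2-3).

    *_w arguments are already flow-warped to time t; all images are
    float arrays in [0, 1] of shape (H, W, 3).
    """
    # Per-pixel warping confidence from the input-frame discrepancy (Eqn. 3)
    err_p = np.sum((I_t - I_prev_w) ** 2, axis=-1, keepdims=True)
    err_n = np.sum((I_t - I_next_w) ** 2, axis=-1, keepdims=True)
    w_p = np.clip(np.exp(-alpha * err_p), 0.0, k1)  # clamp to [0, k1]
    w_n = np.clip(np.exp(-alpha * err_n), 0.0, k2)  # clamp to [0, k2]
    # Linear combination with the current processed frame (Eqn. 2)
    L_t = w_p * P_prev_w + w_n * P_next_w + (1.0 - w_p - w_n) * P_t
    return L_t, w_p, w_n
```

Since k1 + k2 < 1, the current processed frame P_t always retains a nonzero contribution even where warping is perfectly accurate.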
Global Consistency. To overcome this limitation, existing techniques [BTS * 15, LHW * 18] adopt a global approach. For global consistency, one can consider the previous stabilized output O_{t−1} and enforce similarity with its warped version G_t:

G_t = Γ(O_{t−1}).   (4)

To enforce only global temporal smoothness, we replace A_t with G_t in Eqn. 1. Further, to compensate for optical-flow inaccuracies, the smoothness term is weighted using w_p (i.e., w_c = w_p) in Eqn. 1.
However, considering only global consistency for flicker reduction leads to a loss of stylization (in terms of colors and textures) and of local temporal variations in the final output. Moreover, in this case, any warping error (due to flow inaccuracies) or noise (introduced as part of the stylization process) keeps getting propagated to future frames. Due to the above factors, such an approach only gives plausible results where the gradients of the original video are similar to the gradients of the processed video. This does not hold for the task of stylization, where stylistic elements such as brush strokes, textures, or stroke textons [ZGWX05] can, in general, vary greatly between frames even for small changes in the input gradient.
Combining Local and Global Consistency. To preserve local temporal variations (in terms of look and feel) while significantly reducing flickering artifacts, we linearly combine the globally and locally consistent images G_t and L_t, respectively:

A_t = w_p G_t + (1 − w_p) L_t.   (5)

We use the adaptively combined image A_t as our reference for consistency while enforcing temporal smoothness in Eqn. 1. The upper limit of the weight w_p (i.e., k_1) can be increased to increase the influence of global temporal smoothness, and vice versa. Further, the influence of the smoothness term is controlled by the per-pixel consistency weights w_c. We would like to invoke the smoothness term only when the warping accuracy is sufficiently high. To this end, we construct a warped version of the input image, analogous to L_t:

A^I_t = w_p Γ(I_{t−1}) + w_n Γ(I_{t+1}) + (1 − w_p − w_n) I_t.   (6)

Only when the input image I_t is similar to A^I_t is the smoothness term invoked. To measure this similarity, we use the weight w_c:

w_c = λ exp(−α ‖I_t − A^I_t‖²).   (7)

The parameter λ is used to scale the weight w_c up or down.
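The adaptive global/local blend and the consistency weight can be sketched together. As before, this is a minimal NumPy sketch under assumed function names and the exponential weighting form; inputs are the intermediate images defined in the text:

```python
import numpy as np

def adaptive_combination(L_t, G_t, I_t, A_I_t, w_p, alpha=6.5e3, lam=2.0):
    """Global/local blend and consistency weight (sketch of Eqns. 5-7).

    L_t / G_t: locally / globally consistent images, I_t: input frame,
    A_I_t: warped input combination (Eqn. 6), w_p: clamped per-pixel
    weight from Eqn. 3; lam corresponds to the paper's λ parameter.
    """
    A_t = w_p * G_t + (1.0 - w_p) * L_t                 # Eqn. 5
    err = np.sum((I_t - A_I_t) ** 2, axis=-1, keepdims=True)
    w_c = lam * np.exp(-alpha * err)                    # Eqn. 7
    return A_t, w_c
```

Note how the two control modes surface directly: the clamp k_1 bounds w_p (and thus the share of G_t in A_t), while lam rescales w_c and thus the pull of the smoothness term in the optimization.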
Consistency Control Modes. The above adaptive combination of local and global consistency provides two different means of controlling consistency in the final output. By increasing the upper limit of w_p, i.e., k_1, we can increase the proportion of global consistency in the adaptively combined image A_t, and vice versa. On the other hand, the optimization parameter λ dictates how close the output O_t will be to the adaptively combined image A_t. Thus, the level of consistency in the final output can be controlled in two ways: (1) by setting the upper limit k_1 of the weight w_p, or (2) by scaling the weight parameter λ. For low values of k_1 (Fig. 3b), the consistency enforced is negligible and the final result resembles the per-frame processed output (Fig. 3f). However, for higher values, we start observing noisy ghosting artifacts (Fig. 3e): high values of k_1 translate to using only global consistency, which results in the accumulation of flow inaccuracies, visible as ghosting artifacts. Similarly, for lower values of λ (Fig. 3g), the final result is visually similar to the per-frame processed output (Fig. 3f). However, for higher values, the optimization becomes unstable, resulting in noisy optimization artifacts (Fig. 3j).
Optimization Solver. The energy terms in Eqn. 1 are smooth and convex, which allows a straightforward energy minimization with respect to O_t. To this end, we employ an iterative approach, thus avoiding (i) storing a large matrix in memory and (ii) estimating its inverse. Moreover, an iterative approach allows us to stop the solver once we have achieved visually plausible results. An iterative update O_t^{j+1} is obtained by employing Stochastic Gradient Descent (SGD) with momentum [Qia99]:

O_t^{j+1} = O_t^j − η ∇E(O_t^j) + κ (O_t^j − O_t^{j−1}),

where η and κ are the step-size parameters, ∇E is the energy gradient with respect to O_t, and j is the iteration count. For most of our experiments, η = 0.15 and κ = 0.2 yield plausible results. We consider the trade-off between performance and accuracy as the stopping criterion and do not compute the energy residual for this purpose. To obtain a consistent output at interactive performance, we empirically determine 150 iterations to be sufficient. The optimization is stable for the given parameter settings, and early stopping is only employed for computational gain.
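A minimal NumPy sketch of the solver follows. It assumes the energy gradient of Eqn. 1 takes the form −2Δ(O − P_t) + 2 w_c (O − A_t) under a 5-point Laplacian discretization; this discretization, the border handling, and the function names are assumptions for illustration, not the paper's CUDA implementation:

```python
import numpy as np

def laplacian(img):
    """5-point Laplacian with replicated borders, img of shape (H, W, C)."""
    p = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode="edge")
    return (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
            - 4.0 * img)

def solve(P_t, A_t, w_c, eta=0.15, kappa=0.2, iters=150):
    """SGD-with-momentum minimization of Eqn. 1 (sketch).

    Minimizes ||grad O - grad P_t||^2 + w_c ||O - A_t||^2, starting
    from the per-frame processed result P_t.
    """
    O = P_t.copy()
    velocity = np.zeros_like(O)
    for _ in range(iters):
        grad_E = -2.0 * laplacian(O - P_t) + 2.0 * w_c * (O - A_t)
        velocity = kappa * velocity - eta * grad_E   # momentum update
        O = O + velocity
    return O
```

Starting from P_t means that for w_c → 0 the solver simply returns the per-frame processed result, matching the behavior described for low λ.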
An integral aspect common to both our local and global consistency is the warping function Γ. Apart from the number of solver iterations, interactive performance also requires fast warping, which in turn necessitates fast optical-flow estimation.

Lite Optical-Flow Network
We aim to obtain a flow network capable of running at high-speed on consumer hardware with reasonable accuracy. To this end, we start by selecting an existing CNN-based optical flow estimation technique, based on accuracy vs. run-time analysis. After the selection of a base network, we perform further optimization steps to increase the performance as outlined in Fig. 4.
Base Network Selection for Compression. In Fig. 5, we compare several well-known optical-flow methods to find a base-network candidate that best matches our run-time/accuracy requirements. We employ the following models for this: FlowNet 2.0 [IMS * 17], SpyNet [RB17], LiteFlowNet2 [HTL20], PWC-Net [SYLK18], ARFlow [LZH * 20], VCN [YR19a], RAFT [TD20], and finally GMA [JCL * 21] (state-of-the-art in terms of EPE-based accuracy). Our experiments are carried out on an Nvidia RTX 2070 GPU, which we deem to be a good representative of a current mid- to higher-end consumer GPU. Under the constraint of interactive performance on consumer hardware, LiteFlowNet2 [HTL20] and PWC-Net [SYLK18] offer the best trade-off between run-time performance and accuracy (Fig. 5).

Figure 5: Accuracy vs. run-time performance of optical-flow methods on Sintel [BWSB12]. The Endpoint Error (EPE) metric measures the Euclidean distance (in pixels) between ground-truth and predicted optical-flow vectors. Note how our method achieves a high FPS while being accurate enough for temporal consistency enforcement.

Figure 6: Accuracy vs. run-time performance of our CNN variants on desktop, measured on Sintel Final (Train) [BWSB12]. Optimization steps that lead to significant improvement in run-time are connected by a line. Our architectural modifications to PWC-Net [SYLK18] are detailed in the legend, e.g., our-4light-sepref denotes 4 light flow estimators and refinement using depthwise separable convolutions; "-cP" denotes using P% of channels (100% corresponds to standard convolutions). We achieve a high accuracy on Sintel training data; however, for testing data the accuracy is lower, see Fig. 5.
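For reference, the EPE metric used throughout this comparison can be stated compactly (a minimal sketch; the function name is ours):

```python
import numpy as np

def endpoint_error(flow_pred, flow_gt):
    """Average Endpoint Error (EPE) between predicted and ground-truth
    flow fields of shape (H, W, 2): the mean Euclidean distance, in
    pixels, between corresponding flow vectors."""
    return float(np.mean(np.linalg.norm(flow_pred - flow_gt, axis=-1)))
```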
Optimized Network Architecture. We start with the base architecture of PWC-Net. As the first compression step, we reduce the computationally expensive DenseNet [HLvdMW17] connections in the flow estimators, retaining connections only in the last two layers ("-light" in Fig. 6b). Similar to LiteFlowNet2 [HTL20], we remove the fifth flow estimator, which operates on the highest resolution, as it heavily trades run-time for only a marginal increase in accuracy (compare "4light" vs. "5light" in Fig. 6b). We replace the standard convolutions in the refinement module with depthwise separable convolutions [HZC * 17] ("-sepref" in Fig. 6b). Moreover, we also explore reducing the number of channels [HZC * 17], but find that this results in a worse trade-off compared to the other optimizations.
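The depthwise separable replacement used in the "-sepref" variant can be sketched in PyTorch as follows. This is an illustrative sketch of the general technique [HZC * 17], not the exact layer configuration of our network:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Drop-in replacement for a standard k x k convolution: a depthwise
    convolution (one spatial filter per input channel) followed by a 1x1
    pointwise convolution. The parameter count drops from roughly
    k*k*C_in*C_out to k*k*C_in + C_in*C_out."""

    def __init__(self, c_in, c_out, kernel_size=3, dilation=1):
        super().__init__()
        pad = dilation * (kernel_size // 2)
        self.depthwise = nn.Conv2d(c_in, c_in, kernel_size, padding=pad,
                                   dilation=dilation, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```

The spatial resolution is preserved, so such a layer can replace a standard convolution in the refinement network without changing tensor shapes.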
Training. For training, we follow the original PWC-Net [SYLK18] schedule. However, we find that weighting the multi-scale losses equally, instead of exponentially [SYLK18, HTL18, HTL20, YR19a], yields better results on Sintel [BWSB12]. In the supplementary material, we provide detailed training settings for each stage. We employ a multi-scale loss [SYLK18] applied to each flow estimator and optimize using the AdamW optimizer [LH19] with β1 = 0.09, β2 = 0.99, and ℓ2 weight regularization with trade-off γ = 0.0004. Furthermore, extensive dataset augmentation is applied to prevent model overfitting. We refer to the supplementary material for more details.
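An equally weighted multi-scale loss can be sketched as follows. This is a simplified illustration, assuming the ground truth is bilinearly downsampled and rescaled per estimator; function names and the exact rescaling convention are assumptions:

```python
import torch

def multiscale_epe_loss(flow_preds, flow_gt):
    """Equal-weight multi-scale endpoint-error loss (sketch).

    flow_preds: list of (B, 2, H_i, W_i) predictions, one per flow
    estimator; flow_gt: (B, 2, H, W) ground-truth flow. Unlike the
    exponentially weighted schedule of PWC-Net, every scale contributes
    with the same weight.
    """
    total = 0.0
    for pred in flow_preds:
        h, w = pred.shape[-2:]
        scale = w / flow_gt.shape[-1]  # rescale flow magnitudes to this level
        gt = torch.nn.functional.interpolate(
            flow_gt, size=(h, w), mode="bilinear", align_corners=False) * scale
        epe = torch.norm(pred - gt, p=2, dim=1)  # per-pixel endpoint error
        total = total + epe.mean()               # equal weight per scale
    return total / len(flow_preds)
```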
Our Final Model. We analyze various optimization options and choose "our-4light-sepref" as our final model for desktop systems, as it provides the best trade-off between accuracy and run-time. As depicted in Fig. 6a, our model improves the run-time performance of PWC-Net from 30 FPS to 85 FPS, a speed-up factor of 2.8. For Sintel training data, the accuracy drops by ≈ 0.5 px in EPE terms; for test data, however, the drop in accuracy is significant, with a final EPE of 7.43, see Fig. 5. Nevertheless, the accuracy is sufficient for enforcing warping-based consistency. To validate our design decisions, we conduct an extensive ablation study in which we vary the architectural and training choices; please see the supplementary for details. Furthermore, we tune our architecture for optical-flow calculation on mobile devices using channel pruning and quantization, which we also detail in the supplementary material. Here, we improve run-time performance from 2.8 FPS to 24 FPS (iPad Pro 2020) and from 1.5 FPS to 13 FPS (iPad Air), an improvement of factor 8. Next to showing the general applicability of optical-flow CNNs on mobile devices, this demonstrates that real-time on-device stabilization of videos using our presented approach will become feasible with a further moderate increase in mobile GPU computing power. Fast optical-flow-based warping enables our framework to interactively control the degree of consistency and generate visually plausible results.

Implementation Details
All our experiments were performed on a consumer PC with an AMD Ryzen 1920X 12-Core CPU, 48 GB of RAM, and Nvidia GTX 1080Ti and RTX 3090 graphics cards with 11 GB and 24 GB of VRAM, respectively. We implement a real-time video-consistency framework in C++, using ONNXRuntime for cross-platform acceleration of our lite optical-flow network, and implement the stabilization code using Nvidia CUDA (v11). In Tab. 3, we measure the run-time performance of our system. We find that an incoming stream of frames can be stabilized in real-time at VGA resolution even on low- and mid-tier GPUs, while higher-tier GPUs (such as an RTX 3090) can stabilize HD at common video frame rates (approx. 24 FPS) and full-HD resolutions at interactive frame rates (> 10 FPS) (Tab. 3). We also test a fast preset that uses fewer iterations and computes optical flow on half-sized inputs, and find that a full-HD video stream can then be processed in real-time at the cost of minor additional flickering; see the supplementary video for a comparison. We implement a graphical user interface that allows for real-time decoding and stabilization of stylized video streams, where the stabilization parameters can be interactively controlled; see the supplementary video for a demonstration.

Figure 7: Optical flow estimated on the synthetic Sintel dataset [BWSB12]: (a) Frame Overlay, (b) Ground-truth, (c) RAFT [TD20], (d) PWC-Net [SYLK18], (e) Ours.

Parameter Settings
Initially, we tuned the parameters of our consistency framework towards achieving a low warping error (Tab. 5). We refer to this setting as Ours-objective, with the following parameter values: k_1 = k_2 = 0.3, α = 10 × 10³, and λ = 0.7. However, we observed that even though the warping error indicated good temporal stability, subjective flickering and artifacts were noticeable. Unlike existing approaches, our framework allows for interactive parameter adjustment. Thus, a parameter set that subjectively produces well-stabilized results on a broad range of tasks and videos was obtained experimentally. As our final version, we use the values k_1 = 0.3, k_2 = 0.5, α = 6.5 × 10³, and λ = 2.0 to generate all the images in the paper and the videos provided in the supplementary. We further compare the Ours-objective setting with our final version as part of our user study to validate our parameter choices. The consistent outputs obtained using the above parameter settings are compared against state-of-the-art approaches, thereby showcasing their efficacy.

Optical Flow Results
We visualize optical flow on frames from the Sintel [BWSB12] dataset in Fig. 7 and compare it to state-of-the-art methods. All depicted methods have been fine-tuned on Sintel. We find that our optimized method has blurrier motion boundaries and misses certain details (e.g., the right hand, at which, however, PWC-Net also fails), but still captures the overall motion direction of objects correctly with a smooth flow field. Fig. 8 shows results for real-world videos from the DAVIS dataset [PTPC * 17] (no ground-truth flow available). We find that some real-world image phenomena, such as complex/ambiguous occlusions (e.g., the bus behind the tree), are not well-handled by state-of-the-art methods like RAFT [TD20] or PWC-Net [SYLK18]; such results are similarly degraded for our optimized method. Apart from the more strongly blurred motion boundaries, we find that our network generally performs well and is also robust on real-world videos.

Consistent Outputs
Of the three competing methods, Bonneel et al. is the least effective in preserving the underlying style in the final output (compare the second column with the fifth one in Fig. 9). Hyper-parameter tuning in their method (with only global consistency) can provide a certain degree of consistency control. However, by employing both global and local consistency, we achieve finer consistency control while staying close to the per-frame processed result. For the method of Lai et al., we observe some color bleeding or darkening in the output frames (compare the second column with the fourth one in Fig. 9). In comparison, we are able to preserve the style, colors, and textures while being consistent.

Quantitative Evaluation
Following Lai et al. [LHW * 18], we measure the similarity between the per-frame processed output and the stabilized results, as well as the temporal warping error between consecutive stabilized frames.
For the former, we report the similarity in the form of the SSIM metric in Tab. 4. We achieve significantly higher similarity scores than the methods of Bonneel et al. and Lai et al. Following [LHW * 18], we also measure the temporal warping error between a frame V_t and the warped consecutive frame V̂_{t+1}, defined as:

E_warp(V_t, V_{t+1}) = (1 / Σ_i M_t(i)) Σ_i M_t(i) ‖V_t(i) − V̂_{t+1}(i)‖²,

where M_t ∈ {0, 1} is a non-occlusion mask [LHW * 18, RDB18] indicating non-occluded regions. The warped frame V̂_{t+1} is obtained by calculating the optical flow (using GMA [JCL * 21]) between frames V_t and V_{t+1}, and applying backward warping to frame V_{t+1}. We compute E_warp for every frame of a video and then average to obtain the warping error of a video, E_warp(V). In Tab. 5 we report the average warping error per dataset (see the supplementary for a per-task breakdown). We find that the warping error is slightly higher than that of Bonneel et al.

Using other Optical Flow Computations. We also tested other optical-flow methods within our pipeline that are either faster [KTDVG16] or more accurate [TD20]. For the fast optical-flow method by Kroeger et al. [KTDVG16] (DIS), the final output is less consistent than ours in both objective and subjective metrics. Using DIS for our stabilization, the average warp error is higher (Tab. 5), and the perceptual similarity with the per-frame processed result is lower than ours (0.9 in SSIM over DAVIS and VIDEVO). Visually, DIS-stabilized results show significantly more flickering, validating our design choice for the optical flow. A much more accurate optical flow is given by the method of Teed et al. [TD20] (RAFT) at the cost of slow computation. The stabilized results obtained using RAFT look visually indistinguishable from those obtained using our flow; the average warp error is the same or marginally lower (Tab. 5), while the perceptual similarity is the same in terms of SSIM (Tab. 4).
Given the visually unnoticeable and only minor metric-wise differences for RAFT, we conjecture that even more accurate flow methods would not yield any significant improvement in output quality.
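The temporal warping error described above can be sketched in a few lines. The following is a minimal pure-Python version that assumes single-channel frames given as 2D lists; the actual evaluation operates on full-resolution color frames warped with GMA flow:

```python
def warping_error(frame_t, warped_next, mask):
    """Warping error between frame V_t and the warped consecutive
    frame, averaged over non-occluded pixels (mask == 1).

    frame_t, warped_next: 2D lists of pixel intensities.
    mask: 2D list with 1 for non-occluded pixels, 0 otherwise.
    """
    err, count = 0.0, 0
    for row_a, row_b, row_m in zip(frame_t, warped_next, mask):
        for a, b, m in zip(row_a, row_b, row_m):
            if m:  # only non-occluded regions contribute
                err += abs(a - b)  # l1 metric
                count += 1
    return err / count if count else 0.0

def video_warping_error(per_frame_errors):
    """E_warp(V): average of the per-frame warping errors."""
    return sum(per_frame_errors) / len(per_frame_errors)
```

The flow computation and backward warping themselves are omitted here; only the masked averaging is shown.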

Qualitative Evaluation
For qualitative evaluation, we perform a subjective user study where we ask participants to compare the temporally-consistent re-

Figure 12: Comparison to video style transfer [LLKY19]. We compare the implicit stabilization of their video style transfer technique to their per-frame NST stabilized with our approach.
supplementary video, we compare their stabilization approach to ours on KBS-stylized videos. For videos stylized using NST, visual similarities are apparent between both stabilized versions, though our method displays superior detail preservation. However, for complex-structured styles that introduce texture into homogeneous regions, such as pencil drawings, their Gaussian-mixture-based stabilization improves texture adherence. In contrast, our method may lead to over-smoothing due to inaccurate flow computation in such featureless regions (see supplementary video). This effect can be mitigated to a certain degree by employing a lower temporal consistency factor (λ). Compared to Gaussian-mixture-based stabilization, our approach runs at least an order of magnitude faster, making it better suited for interactive scenarios such as KBS-stylizing and stabilizing an incoming video stream. Futschik et al. improve on the temporal consistency of Texler et al. by considering additional frames from the video during training. However, this makes their method less applicable to out-of-domain videos (i.e., content not seen during training), which are common in video streams. Our method can effectively stabilize such videos, as shown in the supplementary video.
Video Style Transfer. In Fig. 12, we compare our method to the arbitrary style transfer for videos of Li et al. [LLKY19]. Despite their method having full control over the stylization process, their results exhibit more temporal flickering and blurring, particularly in smooth regions such as the sky. Their video style transfer method also tends to under-stylize image features compared to their per-frame style transfer. Please see the supplementary video for a video-based comparison.

Figure 13: Results of per-frame img2img stable diffusion (I2I-SD) [RBL * 22] applied with the prompt "a 1920s car in a roundabout": (a) input frame, (b) per-frame I2I-SD, (c) stabilized (ours). Latent codes are interpolated with the previous frames to improve consistency. We use a CfG scale of 7.5 and a denoising factor of 0.4.

Discussion
Our approach takes a video pair as input: (i) the original and (ii) its per-frame stylized version. We assume that the stylization is based on the input image gradients and appears as variations in the form of colors and/or textures. Thereby, we employ the original video as a guide for enforcing consistency. However, for text-guided generative art, such as recent diffusion model-based approaches [RDN * 22, RBL * 22], the stylized frames are often only weakly correlated with the original input, and we cannot handle such cases. In Fig. 13 we provide an example of per-frame stylization using stable diffusion [RBL * 22]: despite using a latent pre-initialization from previous frames, new details are hallucinated in every frame, which cannot be effectively removed by our method, resulting in a blurry output video.
For the evaluation, we mainly use CNN-based stylization techniques. However, our approach can also handle classical stylization algorithms [KCWI13]; we show a few such examples in the supplementary. Our local consistency component, comprising a convex combination of temporal neighbors, can be seen as a crude form of local temporal denoising. It has previously been shown that temporal denoising is effective in enforcing consistency [SST * 19]. We conjecture that efficient temporal denoising combined with flow-based warping can further improve temporal stabilization, not only for stylization but also for other tasks. We show examples of such non-stylization tasks, in particular image enhancement (DBL [GCB * 17]) and intrinsic decomposition, in the supplementary.
We start with the assumption that temporal flickering is not completely undesirable for the task of stylization and thus we provide interactive consistency control. However, during the subjective user study, we observed that participants had different tolerance levels for flickering in the foreground as compared to that in the background. As part of future work, one can use depth-based or saliency-based masks to vary the consistency control parameters spatially for a more visually pleasing result.
Limitation. Our approach tends to produce ghosting artifacts for fast-moving objects where the object motion between consecutive frames is large (Fig. 14). These artifacts can be reduced by lowering the upper limit of w_p (i.e., k_1); however, such a reduction also reduces consistency in the final output. We argue that, since we provide interactive control of the parameters, the above trade-off between artifacts and consistency will not significantly hinder usability.
Figure 14: The ghosting artifacts on the rear wheel of the scooter are significant in the final output for (a) k_1 = 0.5; however, they are significantly reduced for (b) k_1 = 0.1.
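The role of the clamp k_1 discussed above can be illustrated per pixel. The exact definition of the warping weight w_p is given in the main paper and not repeated in this section, so the sketch below is hypothetical: it only shows how clamping w_p by k_1 trades ghosting against consistency when blending the flow-warped previous result with the current per-frame stylized frame:

```python
def blend_pixel(warped_prev, current, w_p, k1=0.5):
    """Blend a flow-warped previous stabilized pixel with the current
    per-frame stylized pixel. The warping weight w_p is clamped to the
    upper limit k1: a smaller k1 reduces ghosting for fast motion at
    the cost of less temporal consistency (cf. Fig. 14)."""
    w = min(max(w_p, 0.0), k1)  # clamp to [0, k1]
    return w * warped_prev + (1.0 - w) * current
```

With a high confidence weight (w_p = 0.9), k_1 = 0.5 keeps half of the warped history, while k_1 = 0.1 keeps only a tenth of it, which is why ghosting is much weaker in Fig. 14 (b).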

Conclusions
We propose an approach that makes per-frame stylized videos temporally coherent irrespective of the underlying stylization applied to individual frames. To this end, we introduce a novel temporal consistency term that combines local and global consistency aspects. We maintain similarity with the per-frame processed result by minimizing the difference in the gradient domain. Unlike previous approaches, we provide interactive consistency control by computing optical flow on the incoming video stream at high speed and with sufficient accuracy for stabilization. The fast optical-flow inference is achieved by developing a lightweight flow network architecture based on PWC-Net. The entire optimization solver is GPU-based and runs at real-time frame rates for HD resolutions. A user study shows that our temporally consistent output is preferred over the output of competing methods. As part of future work, we would like to employ learning-based temporal denoising to further improve the quality of results. Moreover, we would like to explore the usage of depth-based and saliency-based masks to spatially vary consistency parameters according to perceptual principles. We hope that our design paradigm of interactive consistency control will make per-frame video stylization more user-friendly.

Supplementary Material
Some details had to be omitted from the main paper due to the page limit; we present those details here. In the following, we report on ablation experiments in Sec. 1.1, which were carried out to determine the best-performing fast optical-flow network, and also expand on the quantization and pruning employed for our mobile-optimized network. In Sec. 1.2 we expand on training and implementation details. In Sec. 2 we provide detailed numbers for the warping error. In Sec. 3 we compare our method subjectively against Shekhar et al.
[SST * 19] and Thimonier et al. [TDKP21] through a user study. Finally, in Sec. 4, we present further visual results of our optical-flow network.

Ablation Study
To analyze our optimization steps, we compare different variants of our CNN. All variants are trained on the full dataset schedule unless stated otherwise. We make use of the Sintel Final Train dataset [BWSB12] as a benchmark and measure accuracy (in terms of EPE), number of parameters, and run-time for the different variants. As the run-time performance can vary between desktop and mobile GPU hardware, we measure it separately per platform.

DenseNet Connection Replacement. As a first architectural improvement, we replace DenseNet [HLvdMW17] connections in the flow estimators with light connections [LZH * 20]. Replacing these results in a significant run-time improvement on both desktop systems and mobile devices, with a larger relative speed-up on mobile devices (Tab. 6). We conjecture that convolutions with a large number of channels (the dense architecture uses up to 565 channels) might perform worse on mobile GPUs due to smaller memory and cache sizes. Thus, reducing these high channel counts results in a larger speed-up on mobile devices. The light connections result in a loss in accuracy (Tab. 6), but due to the significant run-time improvements, we find it a reasonable trade-off and use light connections in the following experiments and in our proposed mobile architecture.
Channel Reduction. We reduce the number of channels throughout the CNN [HZC * 17]. In this case, the loss in accuracy and achieved trade-off is not beneficial (Tab. 7). We hypothesize that channel reduction is potentially better for high-level Computer Vision (CV) tasks, where high-dimensional convolution features are mapped to very low-dimensional results [HZC * 17]. Optical flow, however, requires pixel-precise predictions of continuous values (motion vectors) and thus requires a much higher spatial fidelity. Furthermore, we observe that the relative speed-up on mobile devices is again higher than on desktop systems which supports our previous belief that larger convolutions are more difficult for mobile GPUs with smaller memory and cache sizes.
Flow Estimators. We evaluate different configurations of separable convolutions for the five flow-estimator modules. Replacing all convolutions in the flow estimators with separable convolutions leads to a significant loss in accuracy (Tab. 8). The last two flow estimators operate on the highest pyramid resolutions and have the largest impact on run-time performance. Thus, loss of accuracy can be minimized by using separable convolutions only for the last two flow estimators. Moreover, we find that removing only the last flow estimator leads to a larger speed-up and overall better tradeoff [HTL20], both on desktop systems and mobile devices (Tab. 8).
The last flow estimator, operating on quarter input resolution, comprises only 11.3 % of the parameters, but removing it results in a more than 100 % speed-up on mobile devices.
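The run-time benefit of separable convolutions mentioned above comes from their much smaller parameter and multiply-accumulate count. A small sketch makes the saving concrete; the channel counts used below are illustrative and not the network's actual configuration:

```python
def conv_params(c_in, c_out, k=3):
    """Parameters of a standard k x k convolution (bias included)."""
    return c_in * c_out * k * k + c_out

def separable_conv_params(c_in, c_out, k=3):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise
    convolution: the separable replacement (biases included)."""
    depthwise = c_in * k * k + c_in
    pointwise = c_in * c_out + c_out
    return depthwise + pointwise
```

For example, a 3 x 3 convolution from 128 to 96 channels needs 110,688 parameters, while its separable counterpart needs only 13,664, roughly an 8x reduction, which is why replacing the convolutions in the high-resolution flow estimators pays off most.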

Refinement. For the four previously chosen flow estimators, we find that dense refinement can be replaced by separable convolutions, which even results in a slight increase in accuracy on both desktop and mobile devices (Tab. 9). For five flow estimators, we observe that dense refinement has a larger impact on accuracy.
Pruning: As an additional improvement for mobile deployment, we evaluate pruning as a post-training step. Convolution filter pruning is applied as proposed by Li et al. [LKD * 16], and an ℓ1 strategy combined with automatic consistency checks [Fan19] is used for selecting which filters to prune. We apply it to each convolution layer that has more than two output channels. To keep pruning simple, we prune the same percentage of filters from each layer and then perform a single re-training to account for the loss of accuracy. We find that pruning 40 % of the filters achieves a good trade-off for the final architecture, reducing accuracy by less than 10 % (< 0.5 px EPE) for a more than 40 % speed-up (Tab. 10).
We evaluate different options for pruning as a post-training optimization. As our initial training consists of multiple stages with different datasets, we first evaluate after which stage to prune. Next, we evaluate which trade-offs result from different amounts of pruned channels. For mobile devices and our final CNN, we find that pruning up to 40 % of the channels results in a significant run-time improvement with a plausible accuracy loss, while pruning 50 % of the channels results in a substantially higher accuracy loss (Tab. 10). For desktop, pruning, similar to reducing the number of channels (Tab. 7), results in a smaller speed-up than on mobile devices. Considering only a small improvement of the already high frame rate in exchange for a significant loss in quality, we do not recommend pruning for the desktop version.
Quantization and Mobile Deployment: For mobile deployment, we make use of CoreML [App] as the framework for executing our CNNs on Apple mobile devices. We apply 8-bit linear weight quantization and enable the accumulation of low-precision intermediate results (Tab. 11). This reduces the file size by 75 % (compared to 32-bit weights) and further improves run-time performance by 30 % (on mobile devices) with only negligible accuracy loss. Further analysis shows that our method does not profit from using the on-device Neural Processing Unit (NPU).

Implementation Details
Pruning. For convolution filter pruning we use a PyTorch [P * 19] implementation by Gongfan Fang [Fan19]. We use the ℓ1 strategy for selecting filters to prune; an ℓ2 strategy is available as well, but Li et al. [LKD * 16] show that both strategies perform comparably. We round the number of resulting channels to a multiple of 8, as other filter counts result in a run-time overhead on mobile devices. We re-train the pruned CNN with the same dataset schedule and settings as in the initial training, except for training on FlyingChairs [FDI * 15], where we start with the lower learning rate of 1 × 10^-5 and train for fewer iterations as training converges quickly, e.g., a maximum of 15 epochs (1.5 hours).
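The filter-selection step can be sketched as follows. This is an illustrative re-implementation of the ℓ1 strategy combined with the rounding to multiples of 8 described above, not the actual [Fan19] code, and it omits the consistency checks across dependent layers:

```python
def select_filters_to_prune(filters, prune_ratio):
    """Select convolution filters to remove using the l1 strategy:
    filters with the smallest l1 norms are pruned first. The number
    of kept filters is rounded up to a multiple of 8 to avoid the
    run-time overhead of odd channel counts on mobile devices.

    filters: list of flattened weight lists, one per output channel.
    Returns the sorted indices of filters to prune.
    """
    n = len(filters)
    keep = n - int(n * prune_ratio)
    keep = min(((keep + 7) // 8) * 8, n)  # round up to multiple of 8
    norms = [(sum(abs(w) for w in f), i) for i, f in enumerate(filters)]
    norms.sort()  # ascending l1 norm: weakest filters first
    return sorted(i for _, i in norms[: n - keep])
```

After pruning, a single re-training pass recovers most of the lost accuracy, as described in the text.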
Mobile execution. We use CoreML [App] as the framework for executing our CNNs on Apple mobile devices. We evaluate using an iPad Pro (11-inch, 2nd gen, 2020) and an iPad Air (3rd gen, 2019). CoreML efficiently implements standard CNN operations; however, the two operations specific to optical flow, i.e., correlation and warping, need to be implemented as custom layers using Metal GPU shaders for parallel and efficient computation on the mobile GPU. After converting the CNN to CoreML, we apply 8-bit linear weight quantization using coremltools and enable low-precision accumulation in all operations.
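What 8-bit linear weight quantization computes can be illustrated with a short sketch. This is not the coremltools API, only an illustration of the underlying mapping: each float weight is stored as an integer in 0..255 together with a per-tensor scale and offset, which is where the roughly 75 % size reduction over 32-bit floats comes from:

```python
def quantize_linear_8bit(weights):
    """8-bit linear quantization: map float weights onto the integer
    range 0..255 over [min, max], returning the integer codes plus the
    (scale, offset) needed to reconstruct approximate floats."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255.0
    if scale == 0.0:  # constant weights: avoid division by zero
        scale = 1.0
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo

def dequantize(q, scale, lo):
    """Reconstruct approximate float weights from the integer codes."""
    return [v * scale + lo for v in q]
```

The reconstruction error per weight is bounded by half the quantization step, which for well-conditioned weight tensors is why the accuracy loss reported in Tab. 11 is negligible.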

Training Settings
Similar to the original PWC-Net [SYLK18], we train our mobile architecture on the dataset schedule FlyingChairs [FDI * 15] → FlyingThings3D → Sintel [BWSB12]. Tab. 13 lists the training settings for the respective stages. We compute the multi-scale losses by taking the per-pixel difference between the output of each flow estimator and an accordingly downscaled ground-truth optical flow. The pixel-level loss values are summed up to a single final value, which is then used as the training objective by the optimizer. Like PWC-Net [SYLK18], we scale the ground-truth optical flow by a factor of 20 prior to calculating the loss and thus have to divide the flow estimate by 20 at test time. For training we use the AdamW optimizer [LH19] with β1 = 0.09, β2 = 0.99, and ℓ2 weight regularization with trade-off γ = 0.0004.
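The multi-scale objective and the scale-by-20 convention can be sketched as follows. This is a simplified pure-Python version, assuming each scale's prediction and (accordingly downscaled) ground truth are given as lists of per-pixel (u, v) tuples; the real training operates on dense GPU tensors:

```python
SCALE = 20.0  # ground-truth flow is scaled by 20 before the loss,
              # so the network output is divided by 20 at test time

def multiscale_loss(predictions, ground_truths):
    """Sum of per-pixel endpoint errors over all flow-estimator
    outputs, with equal weighting per scale. predictions and
    ground_truths are lists (one entry per pyramid scale) of
    per-pixel (u, v) motion-vector lists."""
    total = 0.0
    for pred, gt in zip(predictions, ground_truths):
        for (pu, pv), (gu, gv) in zip(pred, gt):
            du = pu - gu * SCALE
            dv = pv - gv * SCALE
            total += (du * du + dv * dv) ** 0.5  # endpoint error
    return total
```

Note that each scale contributes with equal weight to the sum; the weighting scheme is evaluated in the hyperparameter study below.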
Given a predicted flow field $uv_{\mathrm{Pred}}$ and a ground-truth flow field $uv_{\mathrm{GT}}$, the EPE loss (Eqn. 10) and a robust $\ell_1$ loss (Eqn. 11) are defined as follows:
\[
L_{\mathrm{EPE}} = \sum_{x} \bigl\lVert uv_{\mathrm{Pred}}(x) - uv_{\mathrm{GT}}(x) \bigr\rVert_2 , \tag{10}
\]
\[
L_{\mathrm{robust}} = \sum_{x} \bigl( \bigl\lVert uv_{\mathrm{Pred}}(x) - uv_{\mathrm{GT}}(x) \bigr\rVert_1 + \varepsilon \bigr)^{q} , \tag{11}
\]
with typical values of ε = 0.01 and q = 0.4. As q < 1 results in less penalty for large error values, it makes the loss more robust to outliers, which is necessary for fine-tuning on realistic, difficult datasets [SYLK18].
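The two losses can be transcribed directly. The sketch below is a pure-Python version assuming flow fields given as lists of per-pixel (u, v) tuples, with ε = 0.01 and q = 0.4 as stated above:

```python
EPS, Q = 0.01, 0.4  # typical values for fine-tuning

def epe_loss(uv_pred, uv_gt):
    """EPE loss (Eqn. 10): Euclidean distance between predicted and
    ground-truth motion vectors, summed over all pixels."""
    return sum(((pu - gu) ** 2 + (pv - gv) ** 2) ** 0.5
               for (pu, pv), (gu, gv) in zip(uv_pred, uv_gt))

def robust_l1_loss(uv_pred, uv_gt, eps=EPS, q=Q):
    """Robust loss (Eqn. 11): (|du| + |dv| + eps)^q per pixel; since
    q < 1, large errors are penalized less, making the loss robust
    to outliers."""
    return sum((abs(pu - gu) + abs(pv - gv) + eps) ** q
               for (pu, pv), (gu, gv) in zip(uv_pred, uv_gt))
```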

Hyperparameters Configuration
We evaluate different hyperparameters while training the original PWC-Net [SYLK18] to determine a baseline that achieves the best possible accuracy. We find that the AdamW optimizer [LH19] reaches a significantly better accuracy than Adam [KB15], both with and without ℓ2 weight regularization (Tab. 14). Furthermore, we find that equally weighted multi-scale losses, as opposed to the commonly used exponential weighting [SYLK18, HTL18, HTL20, YR19b], lead to better accuracy.

Data Augmentation
Dosovitskiy et al. [FDI * 15] found that augmentations are important for learning-based optical flow methods to prevent overfitting on synthetic training data and to ensure generalization to real-world data. Similarly, we apply geometric and color transformations to input frames and the corresponding flow fields, as listed in Tab. 15. Geometric transformations are applied equally to both frames of an input pair and must be reflected accordingly in the flow field: for example, a translation applied to the frames requires a translation applied to the flow field, and a rotation of the frames requires a rotation of the flow field and its motion vectors. Color transformations need to be applied only to the frames, not to the flow field. While it would be possible to apply different transformations (geometric, color) per frame of an input pair [TD20], e.g., to further increase robustness against illumination changes, we find that transformations applied equally per frame pair are sufficient to prevent overfitting while not making the training of small CNN variants too difficult [HZC * 17].
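The requirement that geometric transformations be reflected in the flow field can be made concrete with one example. The horizontal flip below is an illustrative instance (Tab. 15 lists the transformations actually used): flipping the frames means the flow grid must also be mirrored, and the horizontal motion component u must be negated so the vectors still point to the correct pixels in the flipped second frame:

```python
def hflip_frame(frame):
    """Horizontally flip a frame (2D grid of pixel values)."""
    return [row[::-1] for row in frame]

def hflip_flow(flow):
    """Horizontally flip a flow field: mirror each row AND negate the
    horizontal component u of every (u, v) motion vector."""
    return [[(-u, v) for (u, v) in row[::-1]] for row in flow]
```

Both frames of an input pair are flipped identically; a color transformation, by contrast, would touch only the frames and leave the flow field unchanged.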

Warping Error
In Tab. 16 we show the warping error (using the ℓ1 metric, as defined in the main paper) over the stylization tasks, following Lai et al. [LHW * 18].

Extended User Study
To also compare with the methods of Shekhar et al.
[SST * 19] and Thimonier et al. [TDKP21], we conducted another user study involving only these two methods. The setup is similar to the one described in the main paper, except that it was performed by a different group of participants to avoid bias. In total, 12 persons (3 female, 8 male, and 1 unspecified) between the ages of 25 and 40 participated in the study. Fig. 15 shows that our method surpasses the other methods by a large margin.

More Optical-Flow Results
In Fig. 16 and Fig. 17 we show further results for our lite optical-flow network (configured as presented in the main paper) compared to other methods on Sintel and DAVIS.