QoE Model for Video Delivered Over an LTE Network Using HTTP Adaptive Streaming

Authors


Abstract

The end user quality of experience (QoE) of content delivered over a radio network is mainly influenced by the radio parameters in the radio access network. This paper presents a QoE model for video delivered over a radio network (e.g., Long Term Evolution (LTE)) using HTTP (Hypertext Transfer Protocol) adaptive streaming (HAS). The model is based on experiments performed in the context of the Next Generation Mobile Networks (NGMN) project P-SERQU (Project Service Quality Definition and Measurement). In the first phase, a set of representative HAS profiles was selected based on a lab experiment where scenarios with typical radio impairments (fading, signal-to-interference-plus-noise ratio, round trip time and competing traffic) were investigated in a test network. Based on these HAS profiles, video files were prepared by concatenating chunks of the corresponding video quality. In the second phase, these video files were downloaded, viewed, and rated by a large number of volunteers. Based on these user scores, a mean opinion score (MOS) was determined for each of the video files, and hence for each of the HAS profiles. Several QoE models that predict the MOS from the HAS profile have been analyzed. Using the preferred QoE model, a range of MOS values can be obtained for each set of initial radio impairments. It is argued that a QoE model based on the radio parameters is necessarily less accurate than a QoE model based on HAS profiles, and an indication is given of how much lower the performance of the former is. © 2014 Alcatel-Lucent.

Introduction

More and more video traffic is being consumed over the open Internet via fixed and mobile access networks [1]. The “Cisco Visual Networking Index” report [3] states that video traffic accounted for more than 50 percent of all mobile data at the end of 2012. This content is delivered over best effort (BE) networks that can be congested and is consumed on various types of new devices (e.g., traditional televisions, personal computers, tablets, smartphones).

Hypertext Transfer Protocol (HTTP) adaptive streaming [2, 5, 8, 17] is a popular technique to deliver video streams over a BE network that may be congested to clients that may reside behind firewalls. HTTP adaptive streaming (HAS), in its various forms, is widely used by professional content distributors for streaming over wireline networks to personal computers (PCs) and over Wi-Fi* and cellular networks to smartphones and tablets. In such a technique the video is segmented into intervals between two and ten seconds long, and each video segment (i.e., a consecutive non-overlapping interval of video) is encoded in multiple quality versions, where a higher quality version yields a higher video bit rate. The bit strings associated with these encodings are referred to as chunks. There are as many chunks associated with a video interval as there are bit rate versions, and a higher quality chunk consists of a larger number of bytes. The HAS client uses HTTP to request chunks, the first ones usually at a lower video bit rate than the network throughput, thus building up a play-out buffer. If the play-out buffer is large enough and chunks are consistently delivered in a time shorter than the video segment length, the rate decision algorithm (RDA) in the client selects a higher bit rate for the next chunk (see Figure 1), such that the download time becomes about equal to the video segment length, keeping the play-out buffer more or less steady. In that way the RDA continually senses the available throughput and adapts the video rate accordingly. Any mismatch between video rate and throughput is absorbed by the play-out buffer. Assuming the RDA is working properly, and there is enough bandwidth to support the lowest quality level, no stalling events should occur.
Since the video is running over Transmission Control Protocol (TCP), packet loss does not directly hamper the video quality (as lost packets are retransmitted), but it indirectly results in reduced throughput and, subsequently, lower video quality.
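The rate adaptation loop described above can be illustrated with a toy rate decision rule. This is only a sketch under stated assumptions: the function name, the buffer threshold `min_buffer_s` and the `headroom` factor are hypothetical, and the rule is not the algorithm of any real HAS client. The bit rate ladder and the five-second segment length are the ones used in the experiments of this paper.

```python
# Toy rate decision rule (hypothetical; not the RDA of any real client).
BITRATES_KBPS = [128, 210, 350, 545, 876, 1440]  # video levels used in this paper


def next_level(level, download_s, buffer_s,
               segment_s=5.0, min_buffer_s=10.0, headroom=0.8):
    """Return the quality index (0-based) to request for the next chunk.

    level       -- index of the chunk that was just downloaded
    download_s  -- time it took to download that chunk, in seconds
    buffer_s    -- current play-out buffer occupancy, in seconds
    """
    top = len(BITRATES_KBPS) - 1
    if download_s > segment_s:
        # Throughput is below the video rate: the buffer drains, so step down.
        return max(level - 1, 0)
    if buffer_s >= min_buffer_s and download_s < headroom * segment_s:
        # Chunks arrive comfortably faster than real time: step up.
        return min(level + 1, top)
    # Download time is about equal to the segment length: hold the level.
    return level
```

With these assumed thresholds, a chunk downloaded in 2 s with 15 s buffered triggers a step up, while a chunk that takes 6 s triggers a step down.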

Figure 1.

HTTP adaptive streaming.

Panel 1. Abbreviations, Acronyms, and Terms

3GPP—3rd Generation Partnership Project
AVC—Advanced Video Coding
BE—Best effort
DASH—Dynamic adaptive streaming over HTTP
DMOS—Delta MOS
FTP—File Transfer Protocol
HAS—HTTP adaptive streaming
HLS—HTTP live streaming
HTTP—Hypertext Transfer Protocol
IP—Internet Protocol
LTE—Long Term Evolution
MOS—Mean opinion score
NGMN—Next Generation Mobile Networks
PCAP—Packet capture
P-SERQU—Project Service Quality Definition and Measurement
PSNR—Peak signal-to-noise ratio
QoE—Quality of experience
QoS—Quality of service
RAN—Radio access network
RDA—Rate decision algorithm
RMSE—Root mean square error
RNN—Random neural network
RTT—Round trip time
SINR—Signal-to-interference-plus-noise ratio
SSIM—Structural similarity
TCP—Transmission Control Protocol

In this paper we assess the influence of the radio parameters on the video quality as perceived by the end users. We will assess the correlation of the mean opinion score (MOS) with some radio parameters. The impact on the MOS of the individual radio parameters and the impact of the interdependency of these radio parameters are further analyzed. We will argue why a quality of experience (QoE) model based on radio parameters necessarily has a lower performance than a QoE model based on a HAS profile and provide an indication of how large this loss in performance is.

The rest of this paper is organized as follows. We briefly discuss related work, and then describe the lab test and the subjective test experiments we performed to acquire the data for the analysis. We then explain how the MOS for each radio scenario is derived, and provide further analysis of the MOS as a function of the radio parameters. Finally, we offer our conclusions.

Related Work

In this section we discuss some of the literature on QoE assessment of HAS video streaming.

In [12] the recently standardized QoE metrics and reporting framework in 3rd Generation Partnership Project (3GPP) are reviewed. The 3GPP dynamic adaptive streaming over HTTP (DASH) specification provides mechanisms for triggering QoE measurements at the client device as well as protocols and formats for delivery of QoE reports to the network servers. The paper also presents an end-to-end QoE evaluation study on HAS conducted over 3GPP Long Term Evolution (LTE) networks, showing that HAS significantly reduces rebuffering events compared to progressive download.

In [10], the correlation between network quality of service (QoS) in terms of delay, packet loss and throughput, application QoS in terms of stalling, and QoE was evaluated for HTTP video streaming in a lab test. The same authors propose in [11] a system to improve the user-perceived quality of video, by integrating available bandwidth measurements into the video data probes, which facilitates the selection of video quality levels. In addition they assess the QoE of the quality transitions by carrying out subjective experiments, showing that users prefer a gradual quality change between the best and worst quality levels, instead of an abrupt switching.

A dedicated QoE model for YouTube* that takes into account the key influence factors (such as stalling) shaping quality perception is presented in [7]. In addition, a generic subjective QoE assessment methodology for multimedia applications (like online video) that is based on crowd sourcing is proposed. Crowd sourcing is a highly cost-efficient, fast and flexible way of conducting user experiments not only for QoE assessment of online video, but also for a wide range of other current and future Internet applications.

The authors of [13] focus on a hybrid QoE assessment model that retains the advantages of both subjective and objective schemes while minimizing their drawbacks. The method is based on statistical learning using a random neural network (RNN). The RNN learns the mapping of configurations and scores as defined in the training database. Once the RNN has been trained, the RNN can be used to map any possible value of parameters into a MOS. This method is applied in [15] for QoE estimation of HAS video streaming.

In [14], two experiments to test the end user subjective response to varying video quality are described. First, three commercially available HAS products were tested in a viewing room. This allowed the authors to control the introduction of (fixed) network impairments and to record the MOS. In a second experiment, clips were generated with impairments typical of HAS. These were downloaded and commented on by a group of young people. This provided insight into the response of users to different types of visual impairments, and hence, what steps can be taken to improve the end user experience.

Experiments

The end user QoE of content delivered over a radio network is influenced by the radio conditions in the radio access network (RAN). The radio resources are shared between all active users within the cell. For any user, his or her share of the bandwidth depends on network factors such as fading, signal-to-interference-plus-noise ratio (SINR), round trip time (RTT) and the amount of other concurrent traffic.

The end user QoE is also influenced by media and device factors. More precisely, the QoE will depend on the device type used (e.g., smartphone or tablet), the HAS technology that was used (i.e., the behavior of the RDA), the complexity of the content (e.g., high motion content requires a higher bit rate for good QoE) and the video (and audio) codec (i.e., type of codec and the bit rates used).

To obtain an assessment of the subjective quality of content, the conventional approach is to recruit a representative group of volunteers. These volunteers are asked to rate a selection of clips during one or more viewing sessions, while ensuring that the viewing conditions are carefully controlled. As will be shown below, such an approach is impractical in our case as we wanted to explore the impact of several network impairments.

However, an important consequence of using HAS is that it is possible to play back a video session. This is because it is possible to identify the video level of each segment in that session. In this paper, we use the term HAS profile to describe the sequence of chunk identifiers (e.g., by nominal bit rate or by quality level), one for each video segment, that are downloaded and played on the video client. Since this HAS profile uniquely determines what video was played, the video can be recreated (and presented for assessment) at a later time on a device of choice. Hence we can use the “wisdom of the crowd” and recruit many more viewers without the need for them to attend a viewing session in person. As the viewing conditions cannot be as carefully controlled as in a lab experiment, there is likely to be a higher variability in our results than with the traditional method, but our set-up corresponds better to reality.

Next we describe our two-phase approach. First we describe the experiments performed with various radio parameters to identify a suitable set of HAS profiles. Second we describe how crowd sourcing was exploited to obtain a MOS for each of the HAS profiles.

Test Scenarios to Identify Typical HAS Profiles

It is clear from the previous subsection that a large number of network and media/device factors can influence video QoE, and performing tests to cover every combination of these factors would lead to an extremely large number of test scenarios. The number of test scenarios therefore had to be limited, both to limit the testing time per tester and to keep the number of MOS testers required for statistically significant MOS results reasonable.

The important parameters affecting QoE were identified as device type, content type, congestion, SINR, fading and RTT. Setting these parameters to values typical for wireless environments resulted in 84 scenarios:

  • Content (2). Low and high intensity (see below for details).

  • Competing traffic (3). 5, 10, or 20 competing File Transfer Protocol (FTP) users.

  • Combined fading and SINR profiles (7).

    – Fading of 0.3 km/h combined with four fixed SINR profiles of 0, 2, 5 and 10 dB, and

    – Fading of 30 km/h combined with three variable SINR profiles: 30 to 0 dB; 0 to 30 dB and 15 to 0 to 15 dB all changing in steps of 1 dB every two seconds.

  • RTT (2). 20 ms and 100 ms.
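The scenario count follows from the product of the factor levels listed above (2 × 3 × 7 × 2 = 84). The enumeration below is a small sketch of that arithmetic; the labels are illustrative, and the iteration order is an assumption that does not necessarily match the scenario numbering used in the lab.

```python
from itertools import product

clips = ["clip1", "clip2"]                 # low and high intensity content
ftp_users = [5, 10, 20]                    # competing FTP users
radio_profiles = [                         # combined fading and SINR profiles
    ("0.3 km/h", "fixed 0 dB"),
    ("0.3 km/h", "fixed 2 dB"),
    ("0.3 km/h", "fixed 5 dB"),
    ("0.3 km/h", "fixed 10 dB"),
    ("30 km/h", "30 to 0 dB"),
    ("30 km/h", "0 to 30 dB"),
    ("30 km/h", "15 to 0 to 15 dB"),
]
rtt_ms = [20, 100]

# Every combination of the four factors is one test scenario.
scenarios = list(product(clips, ftp_users, radio_profiles, rtt_ms))
print(len(scenarios))  # 84
```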

As the RDA does not take the device type into account in making its decisions, it does not have an impact on the HAS profile for a specific scenario setup, but it does have an impact on the perceived quality. The experiment used two devices: a smartphone (iPhone*) and tablet (iPad*).

The content is important. We chose two clips, each about two minutes long, from Sintel [16], a copyright free film. The first is from the opening sequence, panning over a mountainscape, and the second is a high motion sequence including a chase through a market. The video was encoded with Advanced Video Coding (AVC) at six levels (128 kb/s, 210 kb/s, 350 kb/s, 545 kb/s, 876 kb/s and 1.440 Mb/s), with the bit rates chosen so that there were equal differences in video quality between the versions at successive bit rates, and the stereo audio was encoded with Advanced Audio Coding (AAC) at 64 kb/s.

The Next Generation Mobile Networks (NGMN) project P-SERQU (Project Service Quality Definition and Measurement) created a lab with a Long Term Evolution (LTE) network in which we could simulate the 84 scenarios defined above. This is shown in Figure 2.

Figure 2.

Lab environment.

The set-up consists of a HAS server connected to HAS clients via an LTE network. Because of the popularity of professional streamed video over wireless networks, we chose the Apple HTTP live streaming (HLS) [2] variant of HAS. As LTE versions of the iPhone and iPad were not available, we simulated them using an HLS client running on an Apple Mac* mini connected to an LTE dongle. This dongle was in a shielded box along with the aerial. The Propsim* F8 allowed the controlled introduction of various radio related impairments and a NetHawk EAST* T500 was used to introduce background traffic to alter the congestion on the air interface.

Crowd Sourcing to Rate Typical HAS Profiles

We identified the HAS profiles for each of the 84 radio scenarios as follows. The Propsim F8 was set to introduce the relevant network impairments, the NetHawk EAST T500 was configured to inject the background traffic, and then the HLS client in the Mac mini was launched. The corresponding HAS profile was extracted from the “access log” file of the HLS server. Each of the 84 scenarios was run an average of eight times in order to select a typical HAS profile for that scenario.

After this initial analysis, it was clear that in the set of typical HAS profiles some were similar, so there was no need to test all typical HAS profiles associated with all scenarios. The pruning of the set of typical profiles allowed us to introduce some synthetic profiles, i.e., HAS profiles that did not show up while using the Apple HLS client in our test network, but could occur with other RDAs. In total this resulted in 90 HAS profiles to be tested.

Based on each HAS profile, a video file was created. This was done by concatenating the video chunks as indicated in the HAS profile. For example, if the HAS profile was “3 4 1 …”, then segment 1 in quality 3, segment 2 in quality 4, segment 3 in quality 1, and so on were concatenated.
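As an illustration of how a HAS profile selects the chunks to join, the sketch below maps a profile to an ordered list of chunk names. The file naming pattern is hypothetical (the actual chunk names on the server are not given here), and the joining step itself is left out because it depends on the container format and tooling.

```python
def chunks_for_profile(profile, name_pattern="seg{segment:03d}_q{quality}.ts"):
    """Return the chunk names to concatenate for a HAS profile.

    profile      -- list of quality levels, one per video segment
    name_pattern -- hypothetical naming scheme for the stored chunks
    """
    return [
        name_pattern.format(segment=k, quality=q)
        for k, q in enumerate(profile, start=1)
    ]


# The first three segments of a profile starting "3 4 1":
print(chunks_for_profile([3, 4, 1]))
```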

The resulting 90 files were grouped into 20 packages of five clips each (18 packages of unique clips and two packages with clips already in other packages). The NGMN partners recruited some 500 volunteers prepared to take part in a brief survey and rate the clips in one package. The volunteers all had to have access to either an iPhone or iPad. A package was allocated to each volunteer as they signed up. They were asked to download the package as a podcast and view the five clips using the built-in video app. This procedure yields a MOS value for each HAS profile.

The 90 videos created above did not have any stalls. Since HAS was especially designed to avoid such stalls we did not take them into account here. Moreover, from other work [6] it is clear that even one stall degrades the QoE to an unacceptable level. Therefore, in this paper we only consider the problem of predicting how an audience would rate a HAS video that is played out continuously and consists of segments that may have different qualities. However, in real systems it is also important to monitor stalls to detect events which can have a big impact on QoE.

Determining MOS Statistics per Scenario

The method described in the previous section gives a MOS value for each HAS profile. In this section we describe in more detail the procedure to associate a MOS value to each radio scenario (determined by radio parameters) as previously defined.

Figure 3 indicates the steps that were taken to derive the MOS statistics per scenario.

Figure 3.

Determining MOS statistics per scenario.

In the following subsections, we will:

  • Provide more details on how we determined the set (cloud) of HAS profiles corresponding to each scenario (i.e., each set of radio parameters).

  • Discuss more precisely how a typical HAS profile was determined for each of those scenarios, how the set of typical scenarios was pruned, and how some synthetic profiles were included.

  • Use the QoE model (introduced and trained on subjective data in [4]) that predicts the MOS as a function of the HAS profile.

  • Apply that QoE model to predict the range of MOS values that correspond with a scenario.

HAS Profile and Cloud of HAS Profiles per Scenario

As explained above, for each of the runs in the lab tests we captured the “access log” (which contains all the HTTP GET requests) to determine the HAS profile. In addition, we recorded a packet capture (PCAP) file, which contains the details of all the Internet Protocol (IP) packets that were transported between client and server. By analyzing these files, we observed some unexpected behavior in the Apple HLS RDA, e.g., when switching video bit rate, it requested chunk(s) that had already been successfully received (which is equivalent to going back in time). Therefore, as we had no full understanding of the HLS RDA, we had to make some assumptions about the HAS profile that was used by the client to play out the video to the user. We assumed that when a chunk was received multiple times, the highest quality level was played out. We also investigated some alternative assumptions (e.g., the last received quality level of a segment being played out or the first received quality level of a segment being played out) and concluded that the impact on the HAS profile used for play-out is rather limited, i.e., six percent of the chunks result in a different quality level (more than 50 percent of these, i.e., three percent of the chunks, with a difference of only one quality level and less than 15 percent, i.e., one percent of the chunks, with a difference of more than two levels).
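The three play-out assumptions discussed above (highest, last, or first received quality level) can be sketched as follows. The input format, a list of (segment index, quality level) pairs in request order, is an assumption made for illustration; it is not the actual layout of the access log.

```python
def played_profile(requests, rule="highest"):
    """Resolve repeated chunk requests into one quality level per segment.

    requests -- list of (segment_index, quality_level) in request order
                (hypothetical format, assumed to be parsed from the log)
    rule     -- "highest", "last" or "first"
    """
    if rule not in ("highest", "last", "first"):
        raise ValueError(f"unknown rule: {rule}")
    chosen = {}
    for segment, quality in requests:
        if segment not in chosen:
            chosen[segment] = quality
        elif rule == "highest":
            chosen[segment] = max(chosen[segment], quality)
        elif rule == "last":
            chosen[segment] = quality
        # rule == "first": keep the quality that was stored first
    return [chosen[segment] for segment in sorted(chosen)]


# Segments 2 and 3 are requested twice after a switch to a lower bit rate.
log = [(1, 2), (2, 3), (3, 4), (2, 1), (3, 2), (4, 2)]
print(played_profile(log, "highest"))
print(played_profile(log, "last"))
```

Comparing the outputs of the different rules over all runs is one way to quantify how many chunks change quality level under each assumption.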

We executed an average of eight runs for each of the 84 scenarios. Each run is an instance of the scenario and can result in a different HAS profile. Therefore each scenario corresponds to a set (i.e., a cloud in the vector space) of possible HAS profiles. Based on the PCAP files we also investigated the goodput (i.e., the rate at which the useful payloads arrive at the client) that the client perceived during the run to assess the correlation of the goodput trace and the HAS profile. The goodput mainly depends on the radio conditions the user is experiencing (e.g., SINR, fading, latency) as well as the competing traffic (e.g., competing TCP sources).

Some of these clouds of HAS profiles are very confined (in the vector space of HAS profiles); in some cases every run produced exactly the same HAS profile. For other scenarios, although the runs had a very similar TCP goodput trace, the resulting HAS profiles were quite different (Figure 4). In such cases the RDA decisions are very sensitive to very small changes in goodput and an extended cloud of HAS profiles results. In other scenarios, the goodput from the different runs differed markedly, but still resulted in a confined cloud of possible HAS profiles, as shown in Figure 5. We concluded that, depending on the scenario (i.e., the set of radio parameters), the cloud of HAS profiles can be either confined or extended, and that changes in goodput from run to run did not provide a clear indication of the potential size of the clouds.

Figure 4.

Scenario 82: clip2, 10 FTP users, fading: 30 km/h, SINR: 15 dB to 0 dB to 15 dB in steps of 1 dB per 2 s, RTT = 100 ms.

Figure 5.

Scenario 2: clip1, 10 FTP users, fading: 0.3 km/h, SINR: 0 dB, RTT = 100 ms.

Typical HAS Profile per Scenario

In order to limit the number of HAS profiles to be evaluated during the subjective test phase, a typical profile was derived for each scenario. To do so, we introduced a distance measure between HAS profiles where the distance between profiles is equal to the square-root of the sum of the squared differences of the quality levels of the HAS profiles (i.e., the Euclidean distance in the vector space of HAS profiles), or:

$$d(p_i, p_j) = \sqrt{\sum_{k=1}^{K} \left(QL_{i,k} - QL_{j,k}\right)^2}$$

with QLi,k the quality level of chunk k in profile pi, and K the number of chunks in the video (K = 20 for clip 1 and K = 23 for clip 2, with chunks of five seconds).
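The distance measure translates directly into code. In the sketch below the function names are illustrative, and profiles are assumed to be equal-length lists of integer quality levels.

```python
import math


def squared_distance(p, q):
    """Sum of squared quality level differences between two HAS profiles."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same number of chunks")
    return sum((a - b) ** 2 for a, b in zip(p, q))


def distance(p, q):
    """Euclidean distance between two HAS profiles."""
    return math.sqrt(squared_distance(p, q))


# Two profiles that differ by two quality levels in a single chunk.
print(distance([3, 4, 1], [3, 2, 1]))  # 2.0
```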

The profile with the minimum distance to the other profiles of that scenario, ignoring outlier HAS profiles (i.e., profiles that are quite different from the majority of the profiles) was selected as the typical profile. For most of the scenarios this was quite easy to do. Others, like the ones in Figure 5, sometimes required an arbitrary choice.
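One way to read this selection rule is as a medoid: the profile whose summed distance to all other profiles of the scenario is smallest. The sketch below makes that assumption explicit; the exact aggregation is not specified in the text, and outlier removal is omitted, so this is an illustration rather than the procedure actually followed.

```python
import math


def distance(p, q):
    """Euclidean distance between two equal-length HAS profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def typical_profile(profiles):
    """Return the profile with the smallest summed distance to the others.

    Assumption: "minimum distance to the other profiles" is taken to mean
    the minimum of the summed distances. Ties go to the first candidate.
    """
    totals = [sum(distance(p, q) for q in profiles) for p in profiles]
    return profiles[totals.index(min(totals))]


runs = [[2, 2, 2], [3, 3, 3], [4, 4, 4]]
print(typical_profile(runs))  # [3, 3, 3]
```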

To illustrate the variation in HAS profiles (i.e., the size of the cloud) associated with a scenario we calculated the maximum and average of d² between all pairs of HAS profiles associated with one scenario. The results of this are shown in Figure 6 and Figure 7, where the scenarios are ranked according to these metrics. These results have been obtained based on the 672 runs from the lab test. The average of d² is lower than eight for 42 percent of the scenarios, and larger than 30 in only nine percent of the scenarios. The maximum of d² is lower than eight for 18 percent of the scenarios and larger than 50 for 25 percent of the scenarios. The scatter plot of maximum and average of d² is shown in Figure 8. The correlation between maximum and average of d² is 90 percent.

Figure 6.

Average d² of HAS profiles of the same scenario.

Figure 7.

Maximum d² of HAS profiles of the same scenario.

Figure 8.

Maximum d² versus average d².

Selection of HAS Profiles for Subjective Testing

Once a typical HAS profile was selected for each of the scenarios, the set of profiles was reduced even further because some of the profiles were very similar.

First, we split the set of HAS profiles into two groups according to the clip. The HLS RDA reacted in a different way to the same radio parameters for the two clips since the complexity of the clips was different. Furthermore it can be expected that the MOS value for the same HAS profile will differ depending on the clip. We further reduced the number of HAS profiles by clustering them in such a way that the clusters contained only profiles with d² < 8. By keeping only one representative HAS profile within a sphere of radius √8, we could reduce the number of profiles for the subjective test from 42 to 13 for clip 1 and from 42 to 28 for clip 2.
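The clustering algorithm is not specified here. A simple greedy scheme that satisfies the stated constraint (one representative per sphere of squared radius 8) is sketched below as an assumption; its result depends on the order in which profiles are visited, so it will not necessarily reproduce the counts reported above.

```python
def squared_distance(p, q):
    """Sum of squared quality level differences between two HAS profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def select_representatives(profiles, threshold=8):
    """Greedy selection: keep a profile only if no kept profile is within
    a squared distance below `threshold` (hypothetical procedure)."""
    representatives = []
    for profile in profiles:
        if all(squared_distance(profile, r) >= threshold
               for r in representatives):
            representatives.append(profile)
    return representatives


candidates = [
    [3, 3, 3, 3],
    [3, 3, 3, 4],   # squared distance 1 to the first profile: dropped
    [6, 6, 6, 6],
    [5, 6, 6, 6],   # squared distance 1 to the third profile: dropped
    [1, 1, 1, 1],
]
print(select_representatives(candidates))
```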

For full coverage of all the HAS profiles seen in all the runs of the scenarios (672 runs in total), we also clustered all the HAS profiles (again split into two groups according to the clip, and d² < 8). The cluster representatives that were not within d² < 8 of one of the earlier selected HAS profiles were added to the profiles for the subjective tests. This way we added an additional three HAS profiles for clip 1 and an additional 12 profiles for clip 2. In total we have 56 HAS profiles (from the 672) that can be linked with at least one of the scenarios.

The profiles derived from the runs were complemented with artificially created profiles selected to investigate the impact of, e.g., different kinds of steps (steep versus gradual), ramps, oscillations, constant quality level profiles, and other parameters. In this way we added 34 synthetic profiles.

Subjective Test and QoE Model

We previously described the subjective tests that were performed. Each of the 90 videos (corresponding to the 90 HAS profiles) was subjectively scored by between 11 and 16 persons using a tablet (iPad) and by between 5 and 10 persons using a smartphone (iPhone). The iPhone coverage was very weak; more volunteer testers were iPad users. Based on these scores, we computed the MOS per profile and device. The results of these subjective tests were used in [4] to derive a QoE model that predicts the MOS on the basis of the HAS profile.

In [4] we compared linear models that predict the MOS for profile P, where a HAS profile P is referenced by a sequence of numbers (l1, …, lk, …, lK) where lk ∈ {1, 2, …, L} indicates that in segment k the video was played-out in quality lk with L the number of quality levels (l = 1 indicates the lowest quality and l = L the highest) and with K the length of the profile (with k = 1 indicating the most recent segment). To each chunk, characterized by a pair (k, l), a value Mk,l is associated that gives an indication of the quality of that chunk. Examples of Mk,l are the nominal bit rate, the average peak signal-to-noise ratio (PSNR) or structural similarity (SSIM) averaged over all frames of that chunk, a quality level or MOS value for that chunk.

In particular, the predicted MOS Mpred is given by:

$$M_{\mathrm{pred}} = \alpha \cdot \mu - \beta \cdot \sigma - \gamma \cdot \phi + \delta$$

where α, β, γ and δ are (tunable) parameters and µ (average of quality info), σ (standard deviation of quality info) and ϕ (frequency of switches) are metrics associated with the profile:

equation image

where 1{A} is the indicator function: 1{A} = 1 if A is true and 0 otherwise.
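The three profile metrics and the linear prediction can be sketched as follows. Several details are assumptions made for illustration: the standard deviation is normalized by K, the switching frequency is taken as the fraction of segment boundaries at which the quality level changes, the quality level itself is used as the chunk quality value when no other value is supplied, and the variability and switching terms are subtracted. The coefficients in the example are placeholders, not the tuned values from [4].

```python
import math


def profile_metrics(levels, chunk_quality=None):
    """Return (mu, sigma, phi) for a HAS profile.

    levels        -- quality level per segment
    chunk_quality -- optional quality value per chunk (e.g., a chunk MOS);
                     defaults to the quality levels themselves
    """
    values = list(levels) if chunk_quality is None else list(chunk_quality)
    k = len(values)
    mu = sum(values) / k
    # Assumption: population standard deviation (normalized by K).
    sigma = math.sqrt(sum((m - mu) ** 2 for m in values) / k)
    # Assumption: switches counted per segment boundary (K - 1 of them).
    switches = sum(1 for a, b in zip(levels, levels[1:]) if a != b)
    phi = switches / (k - 1) if k > 1 else 0.0
    return mu, sigma, phi


def predict_mos(mu, sigma, phi, alpha, beta, gamma, delta):
    """Linear model; the signs of the sigma and phi terms are assumptions."""
    return alpha * mu - beta * sigma - gamma * phi + delta


mu, sigma, phi = profile_metrics([3, 3, 4, 4, 2])
print(mu, sigma, phi)
print(predict_mos(mu, sigma, phi, alpha=1.0, beta=0.5, gamma=1.0, delta=0.0))
```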

In order to tune the parameters α, β, γ and δ and possibly Mk,l we minimize the root mean square error (RMSE) between what the model predicts and the measured subjective MOS values. In particular, let Msubj,n be the MOS associated with the n-th in a set of N HAS profiles (for a specific device and clip). Then, we tune the parameters by minimizing the RMSE, or minimize:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \left(M_{\mathrm{pred},n} - M_{\mathrm{subj},n}\right)^2}$$
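When only the linear weights are tuned (and the chunk quality values are kept fixed), minimizing the RMSE is equivalent to an ordinary least-squares fit. The sketch below solves the normal equations in plain Python. The function names and the synthetic data are illustrative, and the returned weights are raw coefficients of (µ, σ, ϕ, 1), so a penalty term shows up as a negative weight.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: features are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_weights(features, targets):
    """Least-squares weights for rows of (mu, sigma, phi) plus an intercept."""
    rows = [list(f) + [1.0] for f in features]
    n = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    aty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(n)]
    return solve_linear_system(ata, aty)


def rmse(features, targets, weights):
    """Root mean square error of the linear prediction."""
    preds = [sum(w * x for w, x in zip(weights, list(f) + [1.0]))
             for f in features]
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, targets))
                     / len(targets))


# Synthetic example: targets generated from known weights, then recovered.
features = [(3.0, 0.5, 0.2), (4.0, 0.0, 0.0), (2.5, 1.0, 0.4),
            (5.0, 0.2, 0.1), (1.5, 0.3, 0.3), (3.5, 0.8, 0.1)]
true_weights = [0.8, -0.5, -1.0, 0.3]
targets = [sum(w * x for w, x in zip(true_weights, list(f) + [1.0]))
           for f in features]
weights = fit_weights(features, targets)
print(weights, rmse(features, targets, weights))
```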

The conclusion in [4] is that the model based on MOS per quality level (or per chunk if available) yields the best results in terms of RMSE, and the least sensitivity to a change in content or device type. The RMSE is about 0.3 for the tablet and 0.4 for the iPhone (the value for the phone is higher because we obtained fewer scores). The model based on quality level is equivalent to the chunk-MOS model, provided that the quality levels are encoded at equidistant MOS (which approximately is the case for the two clips considered) and the MOS for lowest and highest level are known. A full analysis of the different QoE models investigated can be found in [4]. Being able to rely on the quality level makes it easy to monitor the quality of HAS streams in real time. In this paper we will use the best QoE model derived from that analysis in the subsequent sections to estimate the MOS of all the HAS profiles that were obtained during the lab test.

MOS Range Per Scenario

We previously described how we obtained a set of HAS profiles for each scenario, and how, based on a representative selection of all HAS profiles, we derived a QoE model that predicts a MOS for a given HAS profile. Combining these two results allowed us to associate a set of MOS values with each scenario by applying the QoE model to each of the HAS profiles associated with each of the scenarios. Let ps,i be the i-th profile of scenario s (s = 1, …, S), Ns the number of profiles corresponding with scenario s, and S the number of scenarios (S = 84). µs,i denotes the MOS value predicted by the QoE model for profile ps,i, µs the average MOS value corresponding with scenario s and σs the standard deviation of the MOS values (µs,i) for scenario s. DMOSs (delta MOS) denotes the spread of the MOS values, i.e., the difference between the maximum of µs,i and minimum of µs,i associated with one scenario. The values for µs, σs and DMOSs are calculated as:

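$$\mu_s = \frac{1}{N_s} \sum_{i=1}^{N_s} \mu_{s,i}$$

$$\sigma_s = \sqrt{\frac{1}{N_s} \sum_{i=1}^{N_s} \left(\mu_{s,i} - \mu_s\right)^2}$$

$$\mathrm{DMOS}_s = \max_i \mu_{s,i} - \min_i \mu_{s,i}$$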
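The per-scenario statistics can be computed with the standard library. The sketch assumes a population standard deviation (normalized by the number of profiles); the MOS values in the example are made up for illustration.

```python
import statistics


def scenario_stats(predicted_mos):
    """Return (mean, standard deviation, max-min spread) of the predicted
    MOS values of all profiles observed for one scenario."""
    mean = statistics.fmean(predicted_mos)
    std = statistics.pstdev(predicted_mos)   # assumption: normalized by N
    spread = max(predicted_mos) - min(predicted_mos)
    return mean, std, spread


# Made-up predictions for four runs of one scenario.
print(scenario_stats([3.0, 3.2, 3.4, 3.2]))
```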

The results for the standard deviation of the MOS values for each scenario are shown in Figure 9, where the scenarios have been ranked according to increasing standard deviation: 67 percent of the scenarios have a σs < 0.2, and for 8 percent of the scenarios σs > 0.3, with equation image equal to 0.199 for tablet and 0.212 for smartphone.

Figure 9.

Standard deviation of MOS values (σs) per scenario.

Figure 10 shows the DMOSs values: 62 percent of the scenarios have a DMOSs < 0.5, and for 5 percent of the scenarios DMOSs > 1.0. We can conclude that although a cloud of HAS profiles (which as we have seen, can be rather extended in the vector space of possible HAS profiles) is associated with each scenario, the variation of the corresponding MOS values is relatively low (more than 90 percent of scenarios have σs < 0.3) and lower than the estimated error of the QoE model (about 0.3 for tablet) [4]. As the values µs,i are only estimates with a prediction error σm of about 0.3 for tablet and 0.4 for smartphone, the prediction error of a QoE model operating on radio parameters σs,rm in µs is given by:

equation image

Figure 10.

DMOSs (max-min) per scenario.
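A common way to combine two error sources of this kind, assuming they are independent and zero-mean, is to add them in quadrature. The sketch below illustrates that assumption only; it is not necessarily the expression used by the authors, and the function name is illustrative.

```python
import math


def combined_error(sigma_model, sigma_scenario):
    """Combine the QoE model error with the per-scenario MOS spread,
    assuming independent errors that add in quadrature."""
    return math.hypot(sigma_model, sigma_scenario)


# Illustrative values: a model error of 0.3 and a scenario spread of 0.2.
print(round(combined_error(0.3, 0.2), 3))
```

Under this assumption the combined error can never be smaller than the model error alone, which is consistent with the argument that a model operating on radio parameters is less accurate than one operating on the HAS profile.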

Analysis

In the previous section we derived the statistics of the MOS (µs, σs and DMOSs) for each of the scenarios. In this section we will further analyze these results by comparing a QoE model based on the HAS profile and a QoE model based on radio parameters. Furthermore, we analyze the impact of both the individual radio parameters and of the interdependencies of the radio parameters on the quality perception (MOS).

Comparing QoE Models

In [4], we built a QoE model that predicts the MOS based on the HAS profile, using some additional information that was content- and device-dependent. This model is expected to be independent of the RDA, as the implications of the RDA are included in the HAS profiles. Changes could be expected only if another RDA produced a quite different type of HAS profile (meaning HAS profiles in a region of the vector space of HAS profiles where the HLS RDA did not produce any profiles, and hence for which the model of [4] was not tuned). We think these changes will be rather limited because, by making 38 percent of the tested HAS profiles synthetic, we broadened the space (i.e., the set of HAS profiles) over which the model was trained.

The question can be raised whether we can also build a QoE model that predicts the MOS based on the radio scenario parameters.

The model based on radio scenario parameters will depend not only on those radio parameters but also on the specific type of TCP (e.g., Reno, Vegas, CUBIC, Fluid) used and on the implementation details of the RDA algorithm (Figure 3). How TCP reacts to packet loss and RTT for a continuous byte stream is fairly well understood [9]: the simplest model being that the average throughput is inversely proportional to the RTT and the square-root of the packet loss. This however depends on the TCP chosen (Reno, Vegas, CUBIC, Fluid, … ). How the RDA reacts to TCP fluctuations depends on the design choices made in the RDA. (We noted an example in Figure 4 where the RDA reacts chaotically to a goodput trace). Therefore, we can conclude that a metric higher up in the protocol stack provides more information related to the MOS since it already includes the impact of the TCP layer and the RDA. If we collected details on the number of radio resources allocated to each video session by the eNodeB as well as the instantaneous SINR, then we would know the bit rate delivered on that scheduling cycle. But the volume of data is very high, and for the reasons above, it does not provide an accurate description of the TCP goodput or HAS RDA decision. We demonstrated above that, even though the different runs for one radio scenario can have very similar goodput evolutions, a cloud of HAS profiles can be associated with each scenario. However, the variation in MOS (e.g., σs) for those profiles is mostly moderate. Nevertheless, it further impacts such a QoE model. The error for the prediction based on radio parameters σs,rm is larger than the error for the prediction based on the HAS profile.
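The simple throughput relation mentioned above can be written as a one-line estimate. The sketch uses the well-known square-root approximation for loss-based TCP with a constant of about √(3/2); whether this is the exact model of [9] is an assumption, and the function name and example values are illustrative.

```python
import math


def tcp_throughput_bps(mss_bytes, rtt_s, loss_rate, c=math.sqrt(1.5)):
    """Approximate steady-state TCP throughput in bit/s.

    throughput ~ (MSS / RTT) * c / sqrt(p); valid only as a rough
    estimate for loss-based congestion control at small loss rates.
    """
    if rtt_s <= 0 or not 0 < loss_rate <= 1:
        raise ValueError("rtt_s must be positive and loss_rate in (0, 1]")
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(loss_rate)


# Same loss rate, the two RTT values used in the scenarios (20 ms, 100 ms).
fast = tcp_throughput_bps(1460, 0.020, 0.01)
slow = tcp_throughput_bps(1460, 0.100, 0.01)
print(fast / slow)
```

Under this approximation, going from an RTT of 20 ms to 100 ms divides the achievable throughput by five, and a fourfold increase in packet loss halves it.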

Moreover, a QoE model based on radio parameters is also much more complex. The model based on the HAS profile can be seen as a one-parameter model (a model with a single level of quality associated with each chunk). The QoE model based on radio parameters is a multi-parameter model (our scenarios have four radio parameters and one content parameter) and requires many more data points for training than we have available from the current experiments. Instead of two or three values for each parameter (and seven for the SINR parameter), we would need data for many more parameter values; furthermore, these parameters will also vary in time.
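To give a feeling for how quickly the required number of scenarios grows, the following sketch counts the scenarios in a full-factorial design. The level counts are assumptions for illustration: seven SINR values as stated above, two fading settings, two clips, and three values each for RTT and FTP users (the upper end of "two or three"). The denser grid of ten values per parameter is purely hypothetical.

```python
from math import prod

# Assumed number of tested values per scenario parameter (illustrative).
levels_tested = {"fading": 2, "sinr": 7, "rtt": 3, "ftp_users": 3, "clip": 2}

# Hypothetical denser sampling: ten values for every parameter.
levels_dense = {name: 10 for name in levels_tested}


def full_factorial_size(levels: dict) -> int:
    """Number of scenarios if every combination of parameter values is tested."""
    return prod(levels.values())


print(full_factorial_size(levels_tested))  # 2 * 7 * 3 * 3 * 2 = 252
print(full_factorial_size(levels_dense))   # 10 ** 5 = 100000
```

Even before time variation of the parameters is considered, a modest increase in the number of values per parameter multiplies the number of lab scenarios by several orders of magnitude.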

Finally, the HAS profile is also easier to capture than all the radio parameters. Based on the above discussion, we conclude that it is better to predict MOS based on the HAS profile than based on radio parameters.

Frequency of Low Scores as a Function of MOS

So far we have been looking into the average MOS value µs per scenario, and the variation of the MOS (σs and DMOSs) across the cloud of HAS profiles corresponding with a scenario. Another aspect is the probability that users will find the quality “bad” (score = 1) or “poor” (score = 2). Figure 11 shows the relative frequency of the subjective testers giving a score of 1 or 2 for the video/profiles they assessed as a function of the MOS (for both clip 1 and clip 2) for those video/profiles. From the diagram we can see that in order to have less than 30 percent of scores below 3, the MOS must be at least 3.0, while further reducing the threshold to 15 percent requires a MOS of at least 3.4. This figure can be used by the service provider to define the minimum MOS the system needs to be designed for. This threshold of minimum MOS is chosen based on what the service provider accepts as the probability for scores 1 and 2.

Figure 11. Relative frequency of score 1 or 2 as a function of MOS.
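The two quantities plotted against each other in Figure 11 can be computed per video/profile as in the following minimal sketch. The function name and the example ratings are hypothetical and are not data from the subjective test.

```python
def mos_and_low_score_share(scores):
    """Return (MOS, share of scores equal to 1 or 2) for one video/profile.

    `scores` is a list of integer ratings on the 1..5 scale.
    """
    if not scores:
        raise ValueError("at least one score is required")
    if any(s < 1 or s > 5 for s in scores):
        raise ValueError("scores must be on the 1..5 scale")
    mos = sum(scores) / len(scores)
    low_share = sum(1 for s in scores if s <= 2) / len(scores)
    return mos, low_share


# Hypothetical ratings from ten testers for one profile (not measured data).
ratings = [1, 2, 3, 3, 4, 4, 4, 5, 3, 2]
mos, low_share = mos_and_low_score_share(ratings)
print(f"MOS = {mos:.2f}, share of scores 1 or 2 = {low_share:.0%}")
```

Repeating this for every video/profile and plotting the share of low scores against the MOS gives the kind of curve from which a service provider can read off the minimum MOS for an acceptable share of low scores.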

Relation MOS and Scenario Parameters

In this section we investigate the relation between the scenario parameters and the corresponding MOS. This has to be done with care because, to limit the number of scenarios during the lab tests, the parameters of the scenario were not always chosen independently. For example, the fading conditions have been linked with the SINR profiles: fading of 0.3 km/h for fixed SINR profiles and fading of 30 km/h for varying SINR profiles. In contrast, the content (clip), number of FTP users, and RTT were chosen independently of the other parameters, so the sensitivity of the MOS to these parameters can be studied more easily.

In order to determine the scenario parameters that have the most impact on the MOS, we grouped scenarios based on some filter criterion. Table I shows the results of the average of µs over the scenarios that were grouped according to the filter criterion given in the first column. The first row (labeled “average”) is the average over all scenarios. From the table and the above discussion on (in)dependence of scenario parameters, we can conclude the following expected relations:

Table I. Average of µs values per scenario parameters.
  • Average MOS for clip 1 is higher than for clip 2 (which can be explained by the fact that clip 2 is harder to encode).

  • Average MOS increases when the number of FTP users decreases.

  • Higher MOS is seen with lower RTTs.

  • MOS increases when (average) SINR increases.

No conclusion can be made on the impact of fading. The table shows higher MOS for more fading, but this is due to the fact that the higher fading scenarios have higher average SINR in the radio scenarios we used.
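The grouping used for Table I, i.e., averaging µs over all scenarios that share one parameter value, can be sketched as follows. The scenario records, field names, and µs values are hypothetical and serve only to show the mechanics.

```python
from collections import defaultdict


def average_mu_by(scenarios, key):
    """Average mu_s over scenarios grouped by the value of one parameter."""
    groups = defaultdict(list)
    for scenario in scenarios:
        groups[scenario[key]].append(scenario["mu_s"])
    return {value: sum(mus) / len(mus) for value, mus in groups.items()}


# Hypothetical scenarios (illustrative values, not results from the paper).
scenarios = [
    {"clip": 1, "ftp_users": 0, "rtt_ms": 20, "mu_s": 4.0},
    {"clip": 1, "ftp_users": 4, "rtt_ms": 100, "mu_s": 3.0},
    {"clip": 2, "ftp_users": 0, "rtt_ms": 20, "mu_s": 3.5},
    {"clip": 2, "ftp_users": 4, "rtt_ms": 100, "mu_s": 2.5},
]

by_clip = average_mu_by(scenarios, "clip")
by_ftp = average_mu_by(scenarios, "ftp_users")
print(by_clip)  # {1: 3.5, 2: 3.0}
print(by_ftp)   # {0: 3.75, 4: 2.75}
```

Such per-parameter averages are only meaningful for parameters that were varied independently. In this toy data set, RTT and the number of FTP users change together, so grouping by either one mixes in the effect of the other, which is the same problem as the one noted for fading and SINR.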

Combination of Radio Parameters That Provide Sufficient Quality

Figure 12 shows the µs values for all the scenarios assessed on a tablet. The results are presented in such a way that the interdependencies of the scenario parameters can be observed. The color coding of the results uses the thresholds mentioned in the previous section: values of µs below 3.0 are shown in gray, values between 3.0 and 3.4 in light brown, and values above 3.4 in dark brown. From the figure we can clearly see that good quality will be perceived when the SINR is high enough, competing traffic is not too high, the content is not too complex, and RTT is low. The borders that split the regions of good, fair, and poor quality are quite complex and depend on all four parameters (SINR, FTP users, RTT, and content). It is not possible to provide minimum requirements on any one of the parameters for obtaining good video quality without specifying the others.

Figure 12. µs values for all scenarios (tablet).
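The color coding of Figure 12 amounts to a simple threshold rule on µs, sketched below. The function name is hypothetical, and the handling of values exactly equal to 3.0 or 3.4 is an assumption, since the text does not specify it.

```python
def quality_band(mu_s: float, lower: float = 3.0, upper: float = 3.4) -> str:
    """Map a scenario's average MOS to the color band used in Figure 12.

    Assumption: values exactly at `lower` or `upper` fall in the middle band.
    """
    if mu_s < lower:
        return "gray"         # poor quality
    if mu_s <= upper:
        return "light brown"  # fair quality
    return "dark brown"       # good quality


for value in (2.9, 3.2, 3.5):
    print(value, "->", quality_band(value))
```

The two thresholds are passed as parameters so that a service provider accepting a different share of low scores can substitute its own minimum MOS values.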

Conclusions

We have shown that to predict the QoE of a video streaming session, the HAS profile is sufficient and better than the radio scenario parameters. The HAS profile (which is a metric higher up in the protocol stack) provides more information since it already includes the impact of the TCP layer and the RDA.

Furthermore, it is much more cumbersome to build a model based on a larger set of network and media parameters, which are in most cases even more difficult to measure. Since this is a multi-parameter problem (four radio parameters versus only the quality level when considering the HAS profile), it would require significantly more data points and radio scenarios to tune the model. Indeed, for most parameters, we only have results for two or three values. The alternative approach would require a prohibitively large set of radio scenarios, and the lab tests would have to be redone for these scenarios.

We demonstrated that a cloud of HAS profiles corresponds with each radio scenario. The size (measured by the maximum and average d2) of the cloud of HAS profiles varies quite a bit. It can be very confined but can also be relatively extended. However, in terms of MOS, the standard deviation of the MOS scores σs for the HAS profiles of a scenario is typically moderate and lower than the error of the QoE model (about 0.3 for tablet). The prediction error for the MOS of a scenario σs,rm is higher than the prediction error for the model based on the HAS profile.

The results also confirmed the expected trend of the impact of a number of network parameters and content parameters. The MOS is higher for content that is easier to encode, and with a lower number of competing traffic sources (FTP users), lower RTT, and higher SINR. To obtain good quality there is a trade-off between SINR and competing traffic: lower SINR is acceptable provided there is not a lot of competition, whereas higher SINR is needed when there is a lot of competition. This trade-off between SINR and competing traffic is influenced by RTT and content.

Acknowledgement

This work is based on data collected in the context of the NGMN project P-SERQU (Project Service Quality Definition and Measurement).

(Manuscript approved August 2013)

*Trademarks

  1. EAST is a registered trademark of NetHawk OYJ.

  2. iPad, iPhone, and Mac are registered trademarks of Apple Inc.

  3. Propsim is a registered trademark of Elektrobit System Test OY, LLC.

  4. Wi-Fi is a registered trademark of the Wi-Fi Alliance Corporation.

  5. YouTube is a registered trademark of Google, Inc.

Biographical Information

JOHAN DE VRIENDT is a product architect in Alcatel-Lucent's Fixed Networks Division and is based in Antwerp, Belgium. He received his M.S. and Ph.D. degrees in electrical engineering from Ghent State University, Belgium. In 1994 he received the BARCO N.V. scientific award for his Ph.D. thesis, “Edge Detection and Motion Estimation in Image Sequences.” Since joining Alcatel-Lucent in 1994, his research has focused successively on mobile communication with General Packet Radio Service (GPRS) and Universal Mobile Telecommunications System (UMTS), on Next-Generation Network/IP Multimedia Subsystem (NGN/IMS) solutions, and on converged networks. He has participated in the European Telecommunications Standards Institute's Special Mobile Group (ETSI SMG), 3rd Generation Partnership Project Services & System Aspects 2 (3GPP SA2), and the Mobile Wireless Internet Forum (MWIF). His interests include video quality of experience and video content recommendation, predictive analytics, and application enablement.

DANNY DE VLEESCHAUWER is a senior research engineer with Alcatel-Lucent in Antwerp, Belgium. He received an M.Sc. in electrical engineering and a Ph.D. degree in applied sciences from Ghent University in Belgium. Prior to joining Alcatel-Lucent, Dr. De Vleeschauwer was a researcher at Ghent University. His early work was on image processing, and he worked later on the application of queuing theory in packet-based networks. His current research focus is on ensuring adequate quality for multimedia services offered over packet-based networks. He is a guest professor in the Telecommunications and Information processing department (TELIN) at Ghent University.

DAVE C. ROBINSON is a technology strategist in Alcatel-Lucent's IP Routing and Transport Division and is based in Maidenhead, United Kingdom. He holds a B.Sc. in computing science and a Ph.D., both from Imperial College in London. Dr. Robinson conducted research at GEC Research Laboratories on how high-speed wide area networks (WANs) will change inter-organization collaboration, and worked for several organizations including Digital Equipment Corporation and Oracle before joining Alcatel-Lucent to work on Internet Protocol television (IPTV) and Internet TV solutions. He has a long-standing interest in measuring and improving end user QoE of IPTV and over-the-top (OTT) video entertainment systems. This includes studying how caching, recommendation engines, and adaptive streaming enable improved end user experience. Recently he has been investigating the convergence of Web Real-Time Communications (WebRTC) with IP Multimedia Subsystem (IMS) to support real-time communications directly from the browser.
