Improved long‐term time‐series predictions of total blood use data from England

Red blood cells are essential for modern medicine but managing their collection and supply to cope with fluctuating demands represents a major challenge. As deterministic models based on predicted population changes have been problematic, there remains a need for more precise and reliable prediction of use. Here, we develop three new time‐series methods to predict red cell use 4 to 52 weeks ahead.


| INTRODUCTION
Red blood cells are necessary for modern medicine in elective and emergency surgery, major trauma, hemorrhage, cancer care, and to support patients with congenital or acquired anemia. 1 The call-up of donors, scheduling of donor sessions, and manufacturing and supply of red blood cells to hospitals must be coordinated to match demand. Managing the collection and supply of red blood cells to cope with the fluctuating demand presents a major challenge for blood services. In spite of this, there are few tools available to accurately predict demand for either short-term or long-term planning. Any improvement of prediction tools would allow greater efficiency in the use of resources as well as a more resilient and secure blood supply chain.
Weekly use of red blood cells can change by 30% from week to week in our dataset and annual use can change by 3%-7% from year to year. 2,3 Predicting use for red blood cells in a simple deterministic model using the age-structure of the population, the age-specific incidence of disease, and the requirement of blood by indication and procedure for each disease has been attempted. [4][5][6][7] However, such models have consistently underestimated the changes in medical and transfusion practice. [8][9][10] Predictions made using projected population growth, number, and type of transfusion episodes overestimated demands. 5 There have been a wide variety of changes in medical and surgical management, such as introducing less invasive surgery and lowering the hemoglobin threshold for transfusion, which have made deterministic modeling highly prone to substantial errors.
An alternative strategy for prediction is to use time-series methods, in which each element of the time-series is assumed to be linearly related to previous elements by some mathematical relation with parameters that can be estimated. The estimated parameters can then be applied to extend the series into the future. The use of time-series methods for prediction has a long history. [11][12][13][14][15] There is a wide variety of time-series methods. 16 These approaches have been successfully applied in many fields including statistics, 17 communications, 18 signal processing, 19 adaptive noise cancellation, 20 earthquake prediction, 21 mathematical finance, 22 brain studies, 23,24 speech communication, 25 weather forecasting, 26 and econometrics. 27 A previous study looked at time-series prediction of blood use, 28 and although these methods showed promising results, the deterioration of accuracy of predictions for long-term forecasting would limit long-term planning. In this paper we focus on the seasonality in the data with the aim of improving accuracy in long-term predictions of blood use. Seasonality is likely to be a significant factor owing to strong seasonal patterns of hospital activity, including variation in the number and type of admissions, the capacity available for elective surgery, and staffing.
Daily red blood cell use data are readily available. Although red blood cell use varies significantly on a daily and weekly basis, in practice the useful window for predictions of future use is 1 to 6 months, which allows donor appointments and donor sessions to be matched to predicted use. Predictions at longer intervals, such as a year ahead, are also useful to match the overall collection, manufacturing and issue capacity, and blood price to overall use, particularly as use falls.
Here we use three new time-series methods to predict red cell use 4 weeks to 52 weeks (1 year) ahead and demonstrate that the mean red cell use can be predicted with a standard deviation of the percentage error of 2.5% for 4 weeks ahead and 3.4% for 52 weeks ahead. By adjusting for recurring temporal and secular trends through a year including seasonal variation and holidays, significant improvements have been made from previous predictions, giving a standard deviation of the percentage error of 3.0% for 4 weeks ahead and 5.8% for 52 weeks ahead. 28 The proposed paradigm may form the basis for reliable short-term and long-term prediction of not only RBCs but also other components and even therapeutic procedures by blood services.

| MATERIALS AND METHODS
The focus of this paper lies in predicting the RBC use from 4 to 52 weeks ahead using a previously developed prediction paradigm, 28 but now incorporating three novel time-series methods. The three-stage prediction paradigm consists of: smoothing (eg, going from daily to monthly data) to reduce unhelpful noise; de-trending to extract and remove the long-term variations from the data; and time-series modeling to accurately predict the remaining variations.

| Smoothing -data preparation
Daily aggregates of red blood cell units used cover a period of 6.5 years from February 1, 2005 to July 31, 2011 and were obtained from the NHS Department of Blood and Transport. We use aggregated data from seven consecutive days or integer multiples of seven consecutive days. This avoids the effects of both day-to-day variability and variability between weekdays and weekends. A new set of non-overlapping weekly data was generated by summing the daily data over 7 days, ie, the first data point corresponds to the sum of Days 1-7, the second corresponds to Days 8-14, and so on; this new dataset of weekly blood use contains 338 data points. Most time-series methods used non-overlapping 4-week data, as shown in Figure 1. This was generated by summing the weekly data over 4 weeks and dividing by four, to give an average blood use per week over that 4-week period. In other words, the first data point is a weekly average blood use over Weeks 1-4, the second is a weekly average over Weeks 5-8, etc.; this non-overlapping 4-week dataset contains 84 data points.
FIGURE 1 Average weekly blood use for each non-overlapping 4-week period
Smoothed-overlapping data over a 52-week period, shown in Figure 2, was also used. This was generated by summing the weekly data over 52 weeks and dividing by 52, giving average blood use per week for that 52-week period, moving forward by 1 week each time. In other words, the first data point is a weekly average of Weeks 1-52, the second is a weekly average of Weeks 2-53, etc.; this overlapping 52-week dataset contains 287 data points. This generates a smoother time-series with less overall variation in average weekly number of blood units.
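The three aggregation steps described in this section can be sketched in a few lines of NumPy. This is only an illustration on synthetic counts: the function names (`weekly_totals`, `block_average`, `moving_average`) and the Poisson stand-in data are assumptions, not part of the original study.

```python
import numpy as np

def weekly_totals(daily):
    """Sum daily counts into non-overlapping 7-day totals (incomplete final week dropped)."""
    daily = np.asarray(daily, dtype=float)
    n_weeks = len(daily) // 7
    return daily[: n_weeks * 7].reshape(n_weeks, 7).sum(axis=1)

def block_average(weekly, block=4):
    """Average weekly use over non-overlapping blocks of `block` weeks."""
    weekly = np.asarray(weekly, dtype=float)
    n_blocks = len(weekly) // block
    return weekly[: n_blocks * block].reshape(n_blocks, block).mean(axis=1)

def moving_average(weekly, span=52):
    """Average weekly use over overlapping `span`-week windows, advancing one week at a time."""
    weekly = np.asarray(weekly, dtype=float)
    return np.convolve(weekly, np.ones(span) / span, mode="valid")

# Synthetic stand-in for the daily data (hypothetical values, 338 full weeks)
rng = np.random.default_rng(0)
daily = rng.poisson(3800, size=338 * 7)

weekly = weekly_totals(daily)
four_week = block_average(weekly, block=4)
smooth_52 = moving_average(weekly, span=52)
print(len(weekly), len(four_week), len(smooth_52))  # 338 84 287
```

With 338 weekly points, the block and moving averages reproduce the dataset sizes quoted in the text (84 and 287 points).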

| Detrending
In the previous study, which focused on standard linear prediction, it was demonstrated that removing the underlying trend in the data before applying time-series prediction methods results in a very significant improvement in the accuracy of the prediction. 28 The trend is determined using a polynomial fit to the most recent w data points, where w is referred to in this paper as the time-window size. Figure 3 shows a schematic of the steps taken to predict future blood use. Interestingly, as discussed later, for the one method that does not use standard linear prediction, removing the trend was found not to improve the prediction, and so only the mean was removed.
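A minimal sketch of the de-trending step, assuming an ordinary least-squares polynomial fit (`numpy.polyfit`) over the window and extrapolation of that polynomial to the prediction time. The function name `detrend_window` and the `steps_ahead` argument are illustrative; how the study maps its α onto the extrapolation distance is not stated here, so `steps_ahead = 1` is simply taken to mean the first point after the window.

```python
import numpy as np

def detrend_window(window, degree=2, steps_ahead=1):
    """Fit a polynomial trend to the window.

    Returns the residuals (window minus fitted trend) and the trend value
    extrapolated `steps_ahead` points beyond the last point of the window.
    With degree=0 only the mean is removed.
    """
    window = np.asarray(window, dtype=float)
    t = np.arange(len(window))
    coeffs = np.polyfit(t, window, degree)
    residual = window - np.polyval(coeffs, t)
    trend_ahead = np.polyval(coeffs, len(window) - 1 + steps_ahead)
    return residual, trend_ahead

# Example: a w = 26 window with a smooth quadratic trend plus a small ripple
t = np.arange(26)
window = 27000 - 20 * t + 0.5 * t**2 + 300 * np.sin(2 * np.pi * t / 13)
residual, trend_next = detrend_window(window, degree=2, steps_ahead=1)
print(residual.round(1)[:5], round(float(trend_next), 1))
```

A prediction would then be formed by applying a time-series method to `residual` and adding the extrapolated trend back.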

| Time series methods
In this paper three new methods for predicting RBC use are explored that focus around Minimum Mean Squared Error (MMSE), aiming to improve the long-term prediction accuracy of blood use.
Time-series prediction methods use a set of previous data points in the time-series to predict future values. In general, it is assumed that the predicted value, $\hat{x}$, is some function of the past m values, as shown by,
$$\hat{x}(n+\alpha) = f\big(x(n-1), x(n-2), \ldots, x(n-m)\big), \tag{1}$$
where n is the next time step in the series, α is the number of time steps ahead being predicted (α = 0 corresponds to the next time step), and x are the data points in the time-series. This defines m as the order of the prediction. In general, the function f is a non-linear function of the variables, but in this paper we restrict f to be a linear function of the variables; this is known as linear prediction, which is illustrated by,
$$\hat{x}(n+\alpha) = \sum_{i=1}^{m} a_i\, x(n-i), \tag{2}$$
where $a_i$ are a set of coefficients to be estimated. The error in this linear prediction, $e(n+\alpha)$, is defined to be,
$$e(n+\alpha) = x(n+\alpha) - \hat{x}(n+\alpha). \tag{3}$$
The linear time-series prediction problem lies in investigating methods for determining the $a_i$ coefficients. There are several well-developed algorithms for linear prediction, ie, methods for computing the coefficients $a_i$, eg, Minimum Mean Squared Error (MMSE) and Weighted Least Squares Error (WLSE). 12,16
However, there are circumstances when non-linear data analysis methods are required. Machine learning algorithms can be used to develop non-linear models for forecasting time-series data. [29][30][31][32] Examples of these algorithms include kernel-based machine learning, genetic programming, and artificial neural networks. Non-linear prediction methods are equally valid for the time-series data; however, they will not be considered in this paper.
First, MMSE provides an algorithm for determining the coefficients of the linear prediction based on minimizing the mean squared error, whose mathematical details are in Appendix A. This method is discussed in the previous study as Method 1. 28 Alternative methods, based on the observation that the 4-week data contains some large dips and peaks, aimed to improve the prediction by mitigating the effect of these outliers. This can be achieved by using WLSE; different amounts of weighting account for the differences between Methods 2, 3, and 4 in the previous study. 28 Overall, there was not much variation in the predictions from these four methods.
FIGURE 3 Schematic diagram of the processing steps involved in predicting future blood use. Rounded boxes represent data, while rectangles represent a processing stage. Variations to the processing for Methods 6 (in blue) and 7 (in red) are shown in the diagram.
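To make the linear prediction idea concrete, the sketch below estimates the coefficients of a predictor of the form x̂(n+α) = Σ a_i x(n−i) by ordinary least squares over a window, which minimizes the sample mean squared error. It is an assumption that this matches the MMSE formulation of Appendix A in detail; the function names `fit_linear_predictor` and `predict_next` are illustrative.

```python
import numpy as np

def fit_linear_predictor(series, m, alpha=0):
    """Least-squares estimate of a_1..a_m in x_hat(n + alpha) = sum_i a_i * x(n - i)."""
    x = np.asarray(series, dtype=float)
    rows, targets = [], []
    for n in range(m, len(x) - alpha):
        rows.append(x[n - m:n][::-1])      # x(n-1), x(n-2), ..., x(n-m)
        targets.append(x[n + alpha])       # value the row should predict
    coeffs, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return coeffs

def predict_next(series, coeffs):
    """Apply the coefficients to the last m points; n is taken as len(series)."""
    x = np.asarray(series, dtype=float)
    m = len(coeffs)
    return float(x[-m:][::-1] @ coeffs)    # x(n-1), ..., x(n-m) weighted by a_1..a_m

# Example on a pure sinusoid, which obeys an exact order-2 linear recurrence
k = np.arange(40)
x = np.cos(0.3 * k)
a = fit_linear_predictor(x, m=2, alpha=0)
print(a, predict_next(x, a), np.cos(0.3 * 40))
```

In the paradigm described here, such a predictor would be applied to the de-trended window rather than to the raw series.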
Here, three new methods (Methods 5, 6, and 7) are developed with the aim of improving long-term prediction accuracy. As discussed, Method 1 uses standard MMSE, which computes the coefficients $a_i$ from Equation (2) by minimizing the mean squared error. Method 5 instead flips the time-series data in the time-window, so that the most recent data point is at the beginning. The trend and mean are then removed before calculating the coefficients, which will be called $b_i$. The data used for prediction are the values d in the time-window, where w is the time-window size. Standard MMSE prediction is applied to the beginning of the window (Equation 4). However, this predicts the data point d(w + 1 − m), which is already known. The value to be predicted is d(w + 1), which can be found by rearranging Equation (4) to give Equation (5). This alternative method of predicting the next data point in the time series is referred to here as backward MMSE. In order to control the uncertainty in this prediction, as discussed in Appendix B, we fix $b_m$ to some chosen value of order unity, referred to in this article as β, and use MMSE to calculate the remaining (m − 1) coefficients, so that Equation (6) is used in place of Equation (4). The mathematical detail for this method can be found in Appendix C.
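The displayed equations for Method 5 are not reproduced in this text. The following is one reading that is consistent with the description, offered as a sketch rather than the authors' exact notation. It assumes that d(1), …, d(w) are the de-trended values in time order, with d(w) the most recent and d(w + 1) the value sought, and it covers only the one-step-ahead case.

```latex
\begin{align}
\hat{d}(w+1-m) &= \sum_{i=1}^{m} b_i\, d(w+1-m+i)
  && \text{backward prediction of a known point} \\
d(w+1) &\approx \frac{1}{b_m}\left[\, d(w+1-m) - \sum_{i=1}^{m-1} b_i\, d(w+1-m+i) \right]
  && \text{rearranged for the unknown point} \\
\hat{d}(w+1-m) &= \sum_{i=1}^{m-1} b_i\, d(w+1-m+i) + \beta\, d(w+1)
  && \text{with } b_m \text{ fixed to } \beta
\end{align}
```

Under this reading, the division by $b_m$ in the second line is what would magnify errors when $b_m$ is small, in line with the $1/b_5$ scaling described in the appendix for m = 5.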
Method 6 involves applying MMSE to the overlapping 52-week data described in Smoothing -data preparation. The 52-week dataset is smoother, making predictions easier; however, after prediction the result must be transformed into a 4-week prediction. Each data point in this dataset contains 1 week of new information; therefore, in order to predict the next 4 weeks it is necessary to predict four data points ahead in the 52-week smoothed data, ie, α = 0 for the 4-week data corresponds to α = 3 for the 52-week smoothed data.
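The text does not spell out how a 52-week smoothed prediction is converted back to a 4-week figure. One identity that would do it follows directly from the definition of the moving average: moving the window forward by 4 weeks drops the 4 oldest weeks and adds the 4 new ones. The sketch below uses that identity; the function name and arguments are illustrative assumptions.

```python
import numpy as np

def four_week_mean_from_smoothed(weekly, smoothed_now, smoothed_pred, span=52, horizon=4):
    """Recover the mean weekly use over the next `horizon` weeks.

    weekly        : known weekly values up to the present
    smoothed_now  : mean of the last `span` known weeks
    smoothed_pred : predicted `span`-week mean, `horizon` weeks later
    """
    weekly = np.asarray(weekly, dtype=float)
    # Weeks that leave the window when it advances by `horizon` weeks
    leaving = weekly[-span:-span + horizon].sum()
    # span * (change in mean) = (sum of entering weeks) - (sum of leaving weeks)
    entering = span * (smoothed_pred - smoothed_now) + leaving
    return entering / horizon

# Check the identity on synthetic data where the "future" is known
rng = np.random.default_rng(1)
full = rng.uniform(20000, 30000, size=64)
known = full[:60]
estimate = four_week_mean_from_smoothed(known, known[-52:].mean(), full[-52:].mean())
print(estimate, full[60:].mean())
```

Because the change in the smoothed value is multiplied by 52, any error in the smoothed prediction is amplified by a factor of 13 in the recovered 4-week mean, which is worth bearing in mind when comparing methods.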
All the methods so far have used the most recent m data points, as shown in Equation (2). However, when predicting long-term blood use, the volume of blood issued for the same week of the year in previous years may contain more useful information. With that in mind, Method 7 uses non-standard linear prediction by applying MMSE to the original 4-week data: instead of using the most recent m data points, it uses the most recent m data points at the same time in previous years (Equation 7). This takes advantage of the annual variation of the data by using blood use at the same time in previous years rather than the most recent information available.

| Figure of merit
Implementing each of the three new time-series methods, described in Time series methods, gives a set of predictions, $\hat{x}(n)$, for each of their corresponding known true data values, $x(n)$. The percentage error for each data point was calculated as $100\,\big(x(n) - \hat{x}(n)\big)/x(n)$. To assess quantitatively the accuracy of the prediction methods, the mean and the standard deviation of these percentage errors were calculated. Given that the mean percentage error is sufficiently small, it is more important that the standard deviation of the percentage errors is as small as possible, ie, the error in predictions does not vary by a large amount. Additionally, it is important to consider what proportion of the time the prediction lies within a reasonable range of the true value. For the final results we also quote the percentage of predictions that lie within ±5% of the true value. In this paper we compare our results from Methods 5, 6, and 7 to those using standard MMSE (Method 1).
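The three summary figures can be computed as in the sketch below. Two choices here are assumptions rather than details taken from the study: the error is taken as true minus predicted (the standard deviation and the ±5% share do not depend on this sign), and the sample standard deviation (`ddof=1`) is used.

```python
import numpy as np

def figure_of_merit(true_values, predictions, tolerance=5.0):
    """Mean and standard deviation of percentage errors, and % of predictions within tolerance."""
    true_values = np.asarray(true_values, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    pct_err = 100.0 * (true_values - predictions) / true_values
    return {
        "mean": float(pct_err.mean()),
        "std": float(pct_err.std(ddof=1)),
        "within": float(100.0 * np.mean(np.abs(pct_err) <= tolerance)),
    }

# Toy example with hypothetical values
result = figure_of_merit([100, 100, 100, 100], [98, 102, 90, 100])
print(result)
```

In the toy example the percentage errors are 2, −2, 10 and 0, so the mean is 2.5 and three of the four predictions (75%) fall within ±5%.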

| Optimizing the parameters
The prediction paradigm, incorporating four time-series methods, contains various parameters that can be altered, each of which affects the accuracy of the prediction. These parameters include the time-window size (w), the order of the prediction (m), and the order of the polynomial fit (d). For Method 5, there is an additional parameter, the fixed coefficient (β). An important advantage of this prediction paradigm is that parameters can be optimized for different situations. In the previous study, the optimal parameters for Methods 1-4 were found to be w = 26, m = 5, and d = 2. 28 As we are now using three new methods focusing on different aspects of the data, we must reconsider the optimal parameters for the new methods. Method 5 uses the same data as Methods 1-4, so the same parameter values are used. However, as discussed in Time series methods, Method 5 requires fixing the coefficient $b_m$ to some chosen value, β, in order to control the prediction error, and this value must be optimized for the current dataset. Table 1A shows a significant improvement in the prediction performance as β increases to five, but for β > 5 the quality of the prediction starts to plateau, ie, there does not seem to be an upper limit on β. Based on this investigation, Method 5 was carried out using a value of β = 9.
Before any of the prediction methods can be applied the trend in the data must be removed, as discussed in De-trending. As the data used in Methods 6 and 7 are different from those of previous methods, due to the data smoothing applied to leverage different aspects of the data, the order of the polynomial must be investigated for these new methods. Table 1B shows that a polynomial fit of d = 1 provides the best predictions for the 52-week smoothed data: for large values of α the error in the polynomial fit is exaggerated, so the second-order polynomial fit gives much greater errors than a linear fit. As Method 7 does not use the standard linear prediction method given by Equation (2), it can no longer be assumed that removing the trend before applying the prediction improves the prediction method. Table 1C shows that the predictions made using Method 7 are significantly improved when applying the method without removing the trend first. Therefore, Method 6 was carried out using a polynomial fit with d = 1, and Method 7 was carried out using d = 0, ie, the trend was not removed before applying the prediction method, only the mean was subtracted.
Due to the annual periodicity of the data, in Method 7 it may be beneficial to use a time-window size that is a multiple of 13 (corresponding to a year in 4-week data). According to Equation (7), the minimum window size is 13(m + 1). There are 84 data points in the 4-week data; as this is a small dataset, it would be beneficial to have as small a window size as possible to be able to predict more data points, ie, m < 3. Therefore, for Method 7 the window size could be either w = 26 or w = 39, ie, m = 1 or m = 2, respectively. A value of m = 1 corresponds to a two-parameter prediction and a value of m = 2 corresponds to a three-parameter prediction. Table 1D shows that using Method 7 with two parameters seems to be better for the prediction. Also, using two parameters corresponds to w = 26, which maintains consistency with the other methods and so allows valid comparisons. Therefore, Method 7 was carried out using two parameters (m = 1).
Final parameter values used for each of the Methods 1-7 are shown in Table 2.

| Comparison of the time-series methods
Each box in Table 3 shows the mean error, the standard deviation of the errors, and the percentage of predictions that lie within ±5% of the true value. These results are given for standard MMSE (Method 1) along with each of the three new prediction methods presented in this paper (Methods 5, 6, and 7). Predictions are made from one to seven 4-week periods ahead, as well as 1 year ahead (thirteen 4-week periods ahead), ie, 4-week, 8-week, 12-week, 16-week, 20-week, 24-week, 28-week, and 52-week. Plots of the predictions for 4 weeks ahead and 52 weeks (1 year) ahead are shown in Figures 4 and 5, respectively.
The total blood use data has been predicted for the next 4-week period with a standard deviation in the error of 2.5%, with 95% of the predictions lying within 5% of the true value. The predictions for 52 weeks ahead achieve a standard deviation in the error of about 3.4%, with 85% of the predictions lying within 5% of the true value. The methods show similar performance for short-term predictions (1-6 months ahead), but Method 7 shows significantly improved performance when predicting more than 6 months ahead. As there are seven different time-series methods in total, for each data point there exist seven different predictions. These can be combined by averaging the predictions from different methods, but this was found to show no significant improvement in the results.

| DISCUSSION
Here we have evaluated our proposed prediction paradigm, incorporating three new time-series methods, on past RBC use data to make predictions 4 weeks, 8 weeks, 12 weeks, 16 weeks, 20 weeks, 24 weeks, 28 weeks, and 52 weeks ahead. These results show significant improvements on previous predictions of blood use using time-series data, especially for long-term predictions of more than 6 months ahead. As such, the application of these methods may improve the effective planning of collection to the benefit of donors and blood services. The standard MMSE prediction (Method 1) performs almost as well as other methods for short-term predictions, but the performance degrades significantly when predicting long-term use beyond 6 months ahead. However, the performance of Method 7 (using data points from the same time in previous years) remains impressive for up to a year ahead, with the potential to extend predictions further ahead. The ability to accurately predict long-term blood use is important for planning changes to the future blood collection strategy. Being prepared for changes to blood demand a year ahead presents the opportunity to effectively predict income and plan a more efficient use of resources throughout the blood supply chain.
TABLE 3 Results for standard MMSE and each of the three new prediction methods applied to blood use data to predict 4 weeks ahead (α = 0), 8 weeks ahead (α = 1), 12 weeks ahead (α = 2), 16 weeks ahead (α = 3), 20 weeks ahead (α = 4), 24 weeks ahead (α = 5), 28 weeks ahead (α = 6), and 52 weeks ahead (α = 12). In each box, corresponding to each experiment, the first number is the mean percentage error, the second number is the standard deviation of the percentage errors, and the third number is the percentage of predictions that lie within ±5% of the true value
Method 7 provided predictions of aggregate use for 4 weeks ahead with a standard deviation of 2.5%, with 95% of the predictions lying within 5% of the true value, and for 52 weeks ahead with a standard deviation of 3.4%, with 85% of the predictions lying within 5% of the true value. For predicting 4 weeks ahead, of the 5% of predictions that lie outside 5% of the true value, a third overestimate use. The maximum surplus for any individual prediction was 2047 blood units, while the maximum deficit was 2048 blood units. For predicting 52 weeks ahead, of the 15% of predictions that lie outside 5% of the true value, 29% overestimate use (maximum surplus of 2260 units) and 71% underestimate use (maximum deficit of 2602 units).
These margins of error would be operationally acceptable, as the current average weekly use of RBC units in England is approximately 27000 units, or 3800 units per day, averaged over 9 months. Stock levels of red blood cells in the blood services and in hospitals are currently maintained at between 8 and 10 days' supply. Therefore, the blood supply chain could tolerate a fluctuation in stock of 4000 units in any 1 week. In practice, adjustments to the supply could be made to cover such variation by minor changes to the collection schedule to maintain stable stock levels.
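As a rough check on the figures quoted in this paragraph (simple arithmetic on the stated values, not an additional result):

```latex
\begin{align*}
27000 / 7 &\approx 3857 \ \text{units per day} \\
8 \times 3857 \approx 30900, \quad 10 \times 3857 &\approx 38600 \ \text{units held as stock} \\
4000 / 3857 &\approx 1.04 \ \text{days of use}
\end{align*}
```

On these numbers a 4000-unit swing amounts to about one day of use against a stock of 8 to 10 days, and the largest deficit quoted for the 52-week predictions (2602 units) is below the 4000-unit figure.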
Previous attempts at predicting medium-term use for a group of patients or within a region or country have relied on simple linear extrapolation of year-on-year trend. 33,34 Generally, these methods have predicted a rising demand for blood based on demographics where the proportion of people older than 75 years is rising, eg, in North America and Europe. In turn, these models generated concern about potential shortfall in the supply of blood from younger donors. 35,36 However, these attempts at medium-term forecasting have been inaccurate and were unable to predict the trends in reduced blood use due to changes in medical and surgical practice as well as patient blood management. 37,9 These methods have been unsuccessful in accurately predicting medium- or long-term trends, and short-term planning has relied on time-series methods from proprietary packages. This paper has developed time-series methods that produce accurate short-, medium-, and long-term predictions of blood use.
It is clear from this dataset that seasonality was a significant factor in modulating use. This is likely to be due to strong seasonal patterns of hospital activity, including variation in the number and type of admissions, the capacity available for elective surgery, and staffing. With this dataset it is not possible to establish the exact reasons for such seasonality, but further exploration using hospital-level data may help analyze these trends and delineate causal factors in the seasonality of blood use.
Further improvements could be made if information were available on changes in surgical procedure or practices in transfusion medicine and how they are being implemented in the different regions. The data could also be examined by location or blood group to provide a more targeted call-up of donors; however, the benefits of this may be offset by the increased random error in dealing with fewer blood units.
Also, there is certainly low but significant waste of whole blood at blood centers and in hospital transfusion services due to expiry of units beyond their mandated shelf-life. This study uses national data to look at overall blood use, where the primary purpose is to allow better short- and medium-term matching of collection and demand. Waste of blood in hospitals would be addressed by different models using information based on regional or hospital-level data, which is beyond the scope of this study. 38
These findings of improved predictions, especially long-term predictions, using several time-series methods that are tailored to the specific datasets, potentially represent a significant advance in the techniques available to predict use. The improved predictions with reduced errors could allow greater efficiency in the call-up of donors, scheduling of donor sessions, and manufacturing and supply of RBCs to match demand.
In conclusion, it is important to appreciate that a straightforward use of time-series methods would not have produced results as good as those presented in this paper. By exploiting the annual periodicity of the time-series, we were able to significantly improve long-term predictions of blood use, with anticipated commensurate improvement in the effectiveness and efficiency of collection. These methods are in principle capable of further improvements using more granular local data and by more precise alignment of the methods with the data.
How to cite this article: Nandi AK, Roberts DJ, Nandi AK. Improved long-term time-series predictions of total blood use data from England.
The error in the prediction is given by,
$$e(n+\alpha) = x(n+\alpha) - \hat{x}(n+\alpha).$$
To apply MMSE we want to minimize the mean squared error, MSE, […] Therefore, we need to solve the equations,
$$\frac{\partial f}{\partial a_j} = 0 \quad \text{for } j = 1, 2, \ldots, m.$$
The left-hand side is given by, […] Using the fact that, […]
We will assume $\delta a_1 = \delta a_2 = \delta a_3 = \delta a_4 = \delta a_5 \equiv \delta a$ and, equivalently, $\delta b_1 = \delta b_2 = \delta b_3 = \delta b_4 = \delta b_5 \equiv \delta b$. Consider the errors in the predictions for standard MMSE and backward MMSE, […] This shows that the error in the backward MMSE prediction is roughly scaled by a factor of $1/b_5$. Therefore, for $b_5$ less than unity, this will negatively affect the errors on the backward MMSE predictions. We address this problem by fixing the value of $b_5$, referred to as β, to be larger than unity. This parameter can be optimized for different datasets, but in this paper we use β = 9.