Are candle stick bars a good tool for data compression in natural science?

Authors

  • Alfred Hübler

    Corresponding author
    Director of the Center for Complex Systems Research at the University of Illinois at Urbana-Champaign, Urbana, IL 61801

Candle stick bars contain only the first, highest, lowest, and last value of a time series of dozens or thousands of data points (see Figure 1). Therefore, candle stick bars are a data compression tool. Mr. Homma in Sakata, Japan, introduced candle stick bars for trading rice around 1850 [1]. The Western version of the candle stick bar is the open–high–low–close bar (OHLC bar). It is commonly used to illustrate changes in the price of currencies, securities, and other financial instruments. The vertical line indicates the range of prices over a certain time period, which can be a day or an hour; the left tick is the price at the beginning of the time period (open), and the right tick is the price at the end of the time period (close).
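The compression step described above can be sketched in a few lines of Python. This is only an illustration: the names `Bar` and `to_bar` and the sample prices are assumptions made for this example, not part of the original article.

```python
from typing import NamedTuple, Sequence


class Bar(NamedTuple):
    """One candle stick / OHLC bar (hypothetical helper type)."""
    open: float   # first value of the period
    high: float   # largest value of the period
    low: float    # smallest value of the period
    close: float  # last value of the period


def to_bar(values: Sequence[float]) -> Bar:
    """Compress a non-empty time series into a single bar."""
    if not values:
        raise ValueError("need at least one value")
    return Bar(values[0], max(values), min(values), values[-1])


# Made-up sample data, purely for illustration.
prices = [10.0, 10.4, 9.8, 10.9, 10.6]
bar = to_bar(prices)
print(bar)
```

However long the input series is, the result always holds exactly four numbers, each of which is an actual value of the series.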

Figure 1.

Comparison between candle stick bars and open–high–low–close bars. The body of the candle stick bar is filled in black or red if the close is less than the open.

In natural science, different tools for data compression are more common. The time series is described by its mean value and the deviation from the mean. More sophisticated data compression techniques include a fit with a linear or nonlinear function, a Fourier transformation, or a fit of the flow vector field [2, 3]. Most data compression tools in natural science are integral methods that describe averaged properties of the data set and ignore extreme values and outliers. Data compression with candle stick bars does the opposite: it focuses on the extremes. The high and the low are the extremes in amplitude, and the open and the close are the extremes in time. Only the extremes are kept, whereas the rest of the time series is ignored. Why is that? Is there a distinct difference between financial time series and time series from natural science that justifies using different approaches? Is it just the simplicity of the method that makes it so appealing, or are there deeper reasons that make the description of a time series with candle stick bars superior to data fitting?

Candle stick bars represent the extreme values

Data compression tools in natural science ignore extreme values

Candle stick bars have an interesting property: they are exact representations of a time series, because they contain four values of the time series without approximation: the open, the high, the low, and the close. In contrast, fitting parameters such as the average and the variance approximate the overall behavior but do not provide the precise value of the original time series at a certain instant in time. But why would it be beneficial in any way to know the precise value of the time series?

Candle stick bars have another interesting property: they are additive. This means that two consecutive candle stick bars can be merged without using or even knowing the original time series. The open of the merged bar is the open of the first bar. The close of the merged bar is the close of the second bar. The high of the merged bar is the maximum of the highs of the two bars, and the low of the merged bar is the minimum of the two lows. But how beneficial is the additivity of candle stick bars for modeling time series?
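The merge rule just described can be written down directly. The following Python sketch uses hypothetical names (`Bar`, `to_bar`, `merge`) and made-up numbers; it only illustrates that merging two bars gives the same result as compressing the concatenated raw data.

```python
from typing import NamedTuple, Sequence


class Bar(NamedTuple):
    """One candle stick / OHLC bar (hypothetical helper type)."""
    open: float
    high: float
    low: float
    close: float


def to_bar(values: Sequence[float]) -> Bar:
    """Compress a non-empty time series into a single bar."""
    return Bar(values[0], max(values), min(values), values[-1])


def merge(first: Bar, second: Bar) -> Bar:
    """Merge two consecutive bars without access to the raw data."""
    return Bar(
        first.open,                    # open of the earlier bar
        max(first.high, second.high),  # larger of the two highs
        min(first.low, second.low),    # smaller of the two lows
        second.close,                  # close of the later bar
    )


# Made-up sample data for two consecutive periods.
morning = [10.0, 10.4, 9.8, 10.9, 10.6]
afternoon = [10.7, 11.2, 10.1, 10.3]

merged = merge(to_bar(morning), to_bar(afternoon))
direct = to_bar(morning + afternoon)
print(merged == direct)
```

Because the merge needs only the eight stored numbers, hourly bars can be combined into daily bars, and daily bars into weekly bars, long after the underlying data have been discarded.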

The chaotic logistic map is a simple toy model for financial time series, because it describes growth limited by finite resources. Chaotic logistic maps are important models in natural science too, because there are many situations where growth is limited by finite resources such as energy. For that reason, we will use the logistic map to discuss the benefits of data compression with candle stick bars and data fitting. Figure 2(a) shows a typical time series of the logistic map, and Figure 2(b) is a histogram of the values of a chaotic time series. It is remarkable that the highest and the lowest values occur more often than any other values. What is even more remarkable: for each value of the model parameter a, there is a unique high and vice versa (see Figure 3). Similarly, for each value of the model parameter a, there is a unique low and vice versa. This means that there are unique simple relations between the model parameter and the low and the high. With these relations, the model parameter can be computed from the high or the low. In contrast, the relation between the model parameter and the average value is complicated and not unique (see Figure 3). Imagine a trader has the task of summarizing a long time series by one value. Which value would the person use? If the data originate from a chaotic logistic map, a good trader would probably keep the high or the low because they provide an accurate estimate of the model parameter. Even after the rest of the time series is discarded, the person could use the model parameter and try to reconstruct the rest of the data by iterating the map. However, there is one problem: the candle stick bar contains no information about when the high occurred. For that reason, it is important to know the open. The open and the value of the model parameter (estimated from the high) provide sufficient information to reconstruct the entire time series by iterating the map. So, why is it necessary to store the low and the close in addition? The dynamics of the chaotic logistic map are sensitive to uncertainties in the initial conditions and in the parameter value. The low helps to reduce uncertainties in the parameter value, and the close helps to reduce the impact of the chaotic dynamics on the accuracy of the reconstructed data.
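A minimal Python sketch of this idea follows. It relies on one fact that the text does not spell out: the expression a x (1 − x) is largest at x = 1/2, where it equals a/4, so no iterate can exceed a/4 and the observed high gives the estimate a ≈ 4 × high. Treating the observed high as close to a/4 is an assumption of this sketch, and the function names, the initial value 0.3, and the printed diagnostics are illustrative choices rather than the author's procedure.

```python
from typing import List


def logistic_series(a: float, x0: float, n: int) -> List[float]:
    """Iterate x -> a * x * (1 - x), returning n values starting at x0."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(a * xs[-1] * (1.0 - xs[-1]))
    return xs


def estimate_a_from_high(high: float) -> float:
    """Assumes the observed high is close to the map's maximum a / 4."""
    return 4.0 * high


a_true = 3.8
series = logistic_series(a_true, x0=0.3, n=20000)

# Keep only the open and the high, as a candle stick bar would.
open_value = series[0]
high = max(series)

a_hat = estimate_a_from_high(high)
print("estimated a:", a_hat)

# Try to rebuild the series from the open and the estimated parameter.
rebuilt = logistic_series(a_hat, x0=open_value, n=len(series))
print("error after 5 steps:", abs(rebuilt[5] - series[5]))
print("error at the close: ", abs(rebuilt[-1] - series[-1]))
```

Comparing the two printed errors shows how a small uncertainty in the estimated parameter affects the rebuilt series at early and late times, which is the sensitivity the paragraph above attributes to the chaotic dynamics and the reason the low and the close are stored as well.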

Figure 2.

A typical time series $x_n$ of the chaotic logistic map $x_{n+1} = 3.8\,x_n(1 - x_n)$ with the corresponding candle stick bar (a) and a histogram of the $x_n$ values (b) for a time series with 20,000 data points. The highest and the lowest values occur most frequently.

Figure 3.

The blue line indicates the relation between the model parameter $a$ of the logistic map $x_{n+1} = a\,x_n(1 - x_n)$ and the high (a), the low (b), and the average value (c) of a time series $x_n$. The red area indicates the range of high values, low values, and average values for a time series with 10 data points for $a = 3.8$. The green area indicates the corresponding set of model parameters. It contains the correct value of the model parameter. If the high or the low is measured, the green area is small. This means that the range of model parameters is small, that is, the model can be determined accurately. In contrast, if only the average value is determined, the green area is large, because the dynamics for many parameter values have similar average values. This means that the uncertainty about the model is large.

In summary, candle stick bars are a superior data compression tool for chaotic time series, regardless of whether these are financial data or data from natural science. The high and the low in chaotic time series tend to have a simple, unique relation with the system parameters, in contrast to fitting parameters such as the average. The high and the low occur more frequently than other values. Therefore, they can be determined more accurately. The open and the close are important in reconstructing the original time series with iterative methods.

Last, but not least, knowing the model parameter is not just useful in reconstructing the original time series, but it also provides a quantitative model of the dynamics. Such models can be compared with models of other time series on an abstract level. Therefore, candle stick bars help to understand chaotic time series on an abstract level.
