Reduction of financial tick big data for intraday trading

Various neural network architectures are often used to forecast movements in financial markets. Most research in quantitative finance uses interval financial data, as this reduces the raw tick big data, but the averaging can lose key behaviour patterns. This work presents a novel alternative method to reduce raw tick data whilst retaining the information important for training, demonstrated with intraday trading of the EURO/USD currency pair. The time series reduction method focuses on the short periods preceding significant movements in financial features and allows the most popular neural network architectures to be applied using less powerful but more readily available computing resources. It is shown that the proposed data preprocessing method for machine learning and other AI techniques successfully reduced the size of the selected dataset, covering a three-year period (2018–2021), by 275 times.


| INTRODUCTION
In recent years various architectures of artificial neural networks have been used to analyse and predict financial markets (Jiang, 2021). The actual raw financial market data is big data in the form of tick information (the prices and volumes of each trade or settlement). A new tick occurs whenever there is a minimal change in the buy (bid) or sell (ask) price.
The training of a neural network using raw tick data typically requires a 3-D tensor of input which can be as large as 90 billion float-type numbers (Milke et al., 2020). This requires significant and prohibitive computing resources (Nison, 1994; Riyazahmed, 2021; Sandubete & Escot, 2020). The raw tick data is mainly owned by large quantitative investment funds and asset management companies that often do not publish their results (Pricope, 2021), and few brokerage and Forex companies provide free raw tick data to their clients. Instead, they provide free access to summarized versions of the raw tick data. This free data is sometimes referred to as interval or integrated data, as the raw tick data is summed over a specified time interval (1 min, 1 h, 1 day, etc.). A sequence of such time-interval integrated data for a period of interest summarizes the trading activity in the form of so-called 'Japanese candlesticks' (or bars) (Houssein et al., 2021). The accessibility and availability of the data in a form suitable for training a neural network without the need for significant computing resources explains the widespread use of interval financial tick data by researchers.
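As a minimal sketch of this interval summarization, the following hypothetical example aggregates a handful of bid-price ticks into one-minute Japanese candlesticks (OHLC bars) with pandas; the timestamps and prices are invented for illustration:

```python
import pandas as pd

# Hypothetical raw ticks: timestamp index and bid price
# (one feature shown for brevity).
ticks = pd.DataFrame(
    {"bid": [1.1000, 1.1004, 1.0998, 1.1002, 1.1006, 1.1003]},
    index=pd.to_datetime([
        "2020-03-02 09:00:12", "2020-03-02 09:00:41",
        "2020-03-02 09:00:55", "2020-03-02 09:01:10",
        "2020-03-02 09:01:30", "2020-03-02 09:01:52",
    ]),
)

# Summarize ticks into 1-minute Japanese candlesticks (open/high/low/close).
candles = ticks["bid"].resample("1min").ohlc()
print(candles)
```

Each row of `candles` is one candlestick; all intra-minute tick behaviour between the open, high, low and close is discarded, which is precisely the information loss discussed above.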
A consequence of using integrated data with neural networks is that the summarization loses information about the behaviour of market participants that was present in the raw tick data. This has a detrimental effect on the accuracy of the training of a neural net. A method to reduce the financial raw tick big data, so as to create manageable sample sizes for training without losing market-significant information, is a non-trivial problem. Using data from the EUR/USD Forex market as an example, Table 1 compares the number of financial raw tick data rows and the number and percentage of rows of integral data summarized at different time frames. Even for the most detailed one-minute time frame, integral data typically reflects only 2.86% of the raw tick big data.
The current work introduces a new alternative method to reduce financial raw tick big data that retains more of the significant data than an aggregation approach. This accelerates the training of neural networks without compromising accuracy and without the need for substantial computing resources. The approach is novel in deriving and utilizing a vector of probabilities of expected significant market movements that discriminates essential from non-essential data.
The remainder of the paper is organized in five sections. Section 2 describes the relevant background of existing research, including approaches to the analysis and reduction of different types of market data for intraday trading, and identifies the gap in research knowledge. Section 3 outlines the data used. Section 4 describes the novel method to reduce financial market raw tick big data using a vector of probabilities. Section 5 presents and discusses some experimental results, whilst Section 6 offers some perspectives for future research.

| BACKGROUND
In recent years the amount of data generated by financial markets has increased exponentially. Global financial and asset management companies are interested in finding new ways to compress and condense this enormous amount of information into actionable insights (Fang & Zhang, 2016).
The methods used to analyse and process financial data either take a statistical (model-driven) approach or an artificial intelligence (data-driven) approach. Researchers tend to work with interval financial data rather than with the raw data, firstly because it is more accessible and secondly because less computational processing power is required. As shown in Table 1, researchers use daily (Araújo et al., 2019; Bukhari et al., 2020; Fischer & Krauss, 2018; Garza Sepúlveda et al., 2022; Heryadi & Wibowo, 2021; Pande et al., 2021; Sornmayura, 2019) or hourly (Ali, 2018; Ersan et al., 2020; Jirapongpan & Phumchusri, 2020; Moews et al., 2019; Naeem et al., 2021) integral data over the historical period of interest (usually varying from decades to months) in the form of Japanese candlesticks to train neural networks. For the analysis of intraday trading, five-minute integral data is usually used (Barra et al., 2020; Chen et al., 2020; Markova, 2022; Mehtab & Sen, 2020; Zhao & Khushi, 2020).
Table 2 presents some specific examples and their advantages and disadvantages.
Prior to the exponential increase in the size of financial big data, classical statistical methods of time series analysis were popular, such as ARIMA and ARIMA-GARCH (Mohan et al., 2019; Murialdo et al., 2020; Siami-Namini et al., 2018). However, the majority of statistical methods of time-series analysis based on the frequency or time domain (such as spectral and wavelet analysis) do not have good predictive power when applied to financial markets, owing to their irregular fractal structure (Karp & Van Vuuren, 2019; Ku et al., 2020; Mandelbrot, 1982; Mandelbrot & Hudson, 2010), and may lose essential information if they utilize integral (interval) time series.
On the other hand, data-driven AI approaches must evaluate the training requirements and the data used for training. For example, the high-frequency trading (HFT) approach uses financial raw tick big data (including information from the order book), but the predictive interval of such methods is very short and neural networks are rarely used because of the high computational power needed (often supercomputer access is required) (Aloud, 2020; Carapuço et al., 2018; Degiannakis & Filis, 2018; Shakeel & Srivastava, 2021; Shintate & Pichl, 2019; Zhang, Chan, et al., 2019). The problem of insufficient processing power to train on raw tick big data extends even to publicly available GPU or TPU processors. For example, at the time of writing, the training of a neural network on raw financial data of 500 U.S. stocks containing information from the limit order book required the use of a cluster of 50 GPUs, which is not commonly available hardware (Sirignano, 2019).
TABLE 1 Comparing big tick financial data and interval data with different timeframes, according to Dukascopy Swiss Banking Group (2022).

For the majority of datasets that have many features (columns), the classical methods of reducing dimensionality, and therefore the total amount of data, include Principal Component Analysis (PCA), Independent Component Analysis (ICA), LASSO regression, the low dispersion filter, the high correlation filter and Random Forest. These can reduce dimensionality by eliminating features that might not influence the output. However, for other datasets, such as financial tick big data, the number of features is much lower but the number of rows (settlements) is enormous. On average, such data for a single financial instrument may contain more than 130 million samples (rows) per year (Zhang, Zohren, et al., 2019). In this case these classical methods are not efficient, and other approaches are required.
A variation on the 5-min interval approach is first to filter out 'bad' trades and aggregate the remainder into the intervals (Borovkova & Tsiamas, 2019). 'Bad' trades are either out of sequence, outside regular trading hours, corrected, or completed under special conditions. Applied to some of our data, this approach reduced the raw data by 5%-10%, but neural network training is still dependent on the derived integral 5-min Japanese candlesticks.
One means of reducing the size of the financial data is to manipulate (increase) the tick size. The tick size for some stocks and other financial instruments such as futures is sometimes changed directly by stock exchanges (Nurhayati et al., 2021). Whilst this reduces the amount of financial big data to analyse, it negatively affects the liquidity of these financial instruments and increases the margin between the buy and sell prices (Bourghelle & Declerck, 2004). This in turn reduces the potential profit of speculative transactions and drives an outflow of market participants to other financial instruments, so the usefulness of the approach is limited.
Reducing the size of a big data time series in other domains, such as energy systems, has included clustering methods (Kotzur et al., 2018) and relative entropy minimization approaches (Ghosh et al., 2017; Zhao et al., 2018). However, partial data loss is observed, which is undesirable both for mission-critical engineering systems and for intraday trading, again because crucial information about market participants' behaviour is lost.
The method we propose in this paper differs from the described works in that it reduces financial raw tick big data by selecting the most important ticks and discarding those that are not essential for a qualitative prediction of future short-term movements of financial instruments in intraday trading. This work fills a research gap between an approach that uses the full set of raw financial big data, retaining all information, and an approach that reduces the data size but with too much critical data loss during pre-processing.
One observation that is key to the approach taken in the current work is that financial markets are flat most of the time, which is not attractive for practical trading. Although financial markets have a chaotic structure (Dwyer & Hafer, 2013), this chaos is quasi-stationary, which means that it is possible to find patterns over a short-term period (Moghaddam & Momtazi, 2021). In real financial markets, settlements occur at irregular intervals, reflected in a varying intensity of transactions; the volumes of transactions, together with the time intervals between neighbouring transactions, form recognizable patterns. In particular, a pattern of financial markets discernible only at a detailed scale is that they do not behave like a sine wave. Figure 1 shows a typical 4-hour Japanese candlestick middle-term chart of a financial instrument during flat movements, and at this scale it is similar to a sine wave. However, if the chart scale increases then, in contrast to the middle term and before significant movements, the intraday ticks form a stepped pattern rather than a sine wave (Figure 2).

TABLE 2 A comparison of some example approaches to the reduction of raw and interval financial data.
An appreciation of this financial market behaviour provides a means of reducing the significant amount of financial raw tick big data that needs to be presented for training a deep neural network. In principle the raw tick data can be limited to the parameters of transactions located immediately preceding a significant movement (indicated by the red ellipse in Figure 2). Other flat tick data can be ignored, thereby significantly reducing the total size of the training data and consequently the time, processing power requirements and cost of training deep neural networks of any architecture.

| Data
The most common method for training a neural network is to take a supervised learning approach, which requires pre-classified labelled data. For financial tick big data this means finding and marking up the set of ticks and patterns of market behaviour that are desirable points for entering the market (for long or short positions). Dukascopy Bank (Switzerland) is unusual in that some raw tick data is available from their archives for free, including EUR/USD Forex raw tick data (Dukascopy Swiss Banking Group, 2022). Historical data from this resource for the period from 2018 to 2021 was used for the development and application of our raw tick data reduction approach. Data was obtained through the JForex terminal of Dukascopy Bank. The data consists of five features: the date and time of transactions with a resolution of milliseconds, the prices of supply and demand (ask and bid), and the volumes of supply and demand (ask volume and bid volume). The size of the volume is specified in millions of lots. Table 3 shows all features of the first six rows of the raw data. For the EURO/USD currency pair, the ask and bid prices represent, respectively, the prices of the best demand and the best supply in the FOREX market at each moment in time when at least one transaction is made, either at the best demand price (AskVolume), at the best supply price (BidVolume), or both. A more detailed scale (Figure 4) reveals that each tick can consist of many deals with different market participants, which take place at the same price at the same time. If any of the prices (bid or ask) or the time in milliseconds changes, then the respective deals belong to the next tick.
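The grouping of deals into ticks described above can be sketched as follows; the deal rows, column names and values are hypothetical stand-ins, not the actual Dukascopy export format:

```python
import pandas as pd

# Hypothetical deal-level rows in the five-feature layout: time (ms
# resolution), ask, bid, ask volume, bid volume.  Several deals can share
# the same millisecond and prices; they then belong to the same tick.
deals = pd.DataFrame({
    "time": ["2020-03-02 09:00:00.120"] * 2 + ["2020-03-02 09:00:00.245"],
    "ask": [1.10012, 1.10012, 1.10014],
    "bid": [1.10008, 1.10008, 1.10009],
    "ask_volume": [0.75, 1.20, 0.50],
    "bid_volume": [0.30, 0.00, 0.90],
})
deals["time"] = pd.to_datetime(deals["time"])

# Deals with identical time, ask and bid collapse into one tick;
# their volumes add up.
ticks = (deals.groupby(["time", "ask", "bid"], as_index=False)
              [["ask_volume", "bid_volume"]].sum())
print(ticks)
```

Here the first two deals share the same millisecond and prices and therefore form a single tick, while a change in either price or time starts the next tick.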
The unequal intervals between transactions and the major differences in the volumes of each transaction (tick) are significant. Whilst this feature is indicative of financial market behaviour, it is not directly observable without automation and some kind of AI analysis, due to the high level of noise in the data.
The popular rolling window approach is often used to transform the original time series tick data (raw or integral) into input data for training a neural network. At each step forward, the window evaluates a certain number of the last ticks to form a 3D tensor of input data. We have formulated Equation (1) to calculate the number of float-type elements of the 3D tensor fed to a neural network when using a rolling window:

N_elements = (Σ_{i=1}^{k} N_rows,i) × W × C,   (1)

where N_rows,i is the number of rows (ticks) of the data set in year i, k is the number of years in the big financial data set, W is the number of ticks in the rolling window, and C is the number of columns (features) of the big financial data set.

For the case of the EURO/USD Forex market data, the tensor size will be approximately 88 billion float-type numbers (87,993,711 rows × W × C, with a 200-tick window and five features). The next section describes our approach for significantly reducing the size of this 3D tensor of inputs.
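A minimal sketch of the Equation (1) count; the per-year row counts below are hypothetical round numbers summing to roughly the 88 million ticks implied above:

```python
# Tensor size from Equation (1): every rolling-window step contributes a
# W x C slice, so the input holds roughly (total rows) * W * C floats.
def tensor_elements(rows_per_year, window, n_features):
    return sum(rows_per_year) * window * n_features

# Illustrative figures: ~88 million ticks over four years, a 200-tick
# window and 5 features give ~88 billion float elements.
rows = [22_000_000] * 4          # hypothetical per-year row counts
n = tensor_elements(rows, window=200, n_features=5)
print(n)                          # 88_000_000_000
```

The ×1000 factor (200 ticks × 5 features) is what turns tens of millions of rows into tens of billions of tensor elements.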
TABLE 3 Raw data.
Stepped patterns in the tick-term GBP/USD Forex chart in February 2020. Screenshot from Dukascopy Swiss Banking Group (2022).

The batch gradient descent method was used, with all code written in the Python 3 programming language, utilizing the TensorFlow 2 framework with the Keras library and Google Colab. ReLU was used as the activation function for the CNN, the LSTM and the dense layers.

| Neural network architectures and prediction algorithm
FIGURE 3 Price and volume graphs of big financial EUR/USD Forex tick data (the first 3 months and 500 ticks). Each tick contains information about the time of the transaction (in milliseconds), the bid and ask prices, and the volumes of the transactions at the bid and ask prices at that moment in time.
A binary classification task is used to determine the current nature of the financial currency market as either 'trend' or 'flat'. Input time series data of normalized tick prices, volumes, and inter-tick intervals is transformed into arrays using a sliding window method and then analyzed using a one-dimensional CNN and a single-layer LSTM. When a new tick appears, the neural networks try to predict the subsequent behaviour of the financial market several steps ahead, looking 200 ticks back from the last tick, which on average corresponds to between 2 and 10 minutes, depending on the intensity of the market at that moment. Thus, the neural networks see only the prices, volumes and inter-tick spacing for the last 200 ticks at each step. The intention is to identify patterns that have appeared before and have changed the market's behaviour. In other words, they are trying to predict the beginning of a trend, that is, the beginning of a significant price movement.
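The sliding-window transformation described above can be sketched with NumPy; the window length and random feature values are toy stand-ins for the 200-tick windows of normalized prices, volumes and inter-tick intervals:

```python
import numpy as np

# Sliding-window transform: turn a (rows, features) tick series into a
# (samples, window, features) array suitable for a 1-D CNN or LSTM.
def make_windows(series, window):
    n = len(series) - window + 1
    return np.stack([series[i:i + window] for i in range(n)])

rng = np.random.default_rng(0)
ticks = rng.normal(size=(10, 3))      # 10 ticks, 3 normalized features
X = make_windows(ticks, window=4)     # toy window; the paper uses 200
print(X.shape)                         # (7, 4, 3)
```

Each sample is one window position, which is why the tensor size grows by the factor W in Equation (1).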

| A NOVEL METHOD TO REDUCE BIG FINANCIAL DATA
For practical intraday trading with any financial instrument, the essential objective is to predict the point of entry into the market with the highest probability of movement in the predicted direction. For practical trading there is no need to accurately predict the future price in the predicted direction, nor how long it will take to reach it, since all open trades must be closed before the end of the trading day. For intraday trading it is only necessary to predict the direction of a significant movement and an entry point into the market located as close as possible to the beginning of this movement. Thus, for intraday trading, only two conditions are essential:

• The target price should preferably be reached before the end of the current trading day (when all positions should be closed).
• The probability of movement in the opposite direction from the predicted one should be minimal (so that stop-loss orders are not activated).

Adopting these principles, the mathematical model of intraday trading differs significantly from the medium-term investment solution derived using a machine learning regression task. That approach utilizes a 'return on equity' (ROE) time-interval-dependent parameter (Ahsan, 2012) or other similar functions not applicable to intraday trading, where time is necessary only as a boundary condition, that is, the end of the day's trading session.

| Vector of probabilities
To create a mathematical model for practical intraday trading and reduce the size of big financial tick data as inputs to a neural network, this paper proposes the use of a vector of probabilities of future short-term movements of the financial instrument.
For calculating the vector of probabilities, the interval (depth) of a possible short-term price movement in both directions was divided into 16 irregular intervals (Figure 5). The boundaries of each interval, explained in this section and the following one (4.2), are based either on the statistical distributions shown in Figures 6-9 or on the integer ratios of take profit to stop loss that are often used in practical trading. The divisions between P2− and P2+, at the centre of the probability vector (Figure 5), are based on the statistical distribution of the intraday margin (the difference between ask and bid prices), as well as on the stochastic distribution of flat fluctuations, which are the most likely causes of stop-losses. Each interval has a probability of price movement from the current tick, which has prices ask_i and bid_i. The price movements may go up or down, corresponding to the P+ and P− probabilities. For the convenience of Python programming and the practical volatility of EURO/USD Forex, the whole probability interval is divided into a vector represented by an indexed array with the range [0, 16], which includes an additional zero level for initial labelling. Each of the above intervals of the probability vector (P8+, P7+, …, P8−) has its own rationale and purpose, as explained below. Due to the difference between the bid and ask prices when buying a financial instrument at the ask price, the ask level must rise before the position can be sold without a loss. Thus, to avoid losses at the time of entering the market, it is necessary to exclude from the targets all flat movements that are less than 0.00006 of the delta of EURO/USD tick prices. This approximately corresponds to two standard deviations of the margin calculated during the active trading session, excluding the two-hour break between the closure of the U.S. stock and foreign exchange markets and the opening of the Japanese stock exchanges. This is precisely the mirror-image situation with respect to a short-selling scenario; therefore, the split of the probability vector into the P1+ and P1− sectors also corresponds to the standard deviations of these margin distributions. An additional default field, P0 (zero level), is introduced in the middle of the probability vector and contains all ticks that remain within the minimum stop-loss over an arbitrary ten subsequent ticks. This means that the financial market is not yet ready to move and that these ticks do not need to be considered.

FIGURE 5 Vector of probabilities for intraday trading analysis.
For intraday trading, flat fluctuations are the most likely cause of stop-losses, so it is vital to reduce their number (Alves et al., 2018) as a pre-processing priority before training. To indicate the minimum stop-loss, the P2+ and P2− boundaries are created, which are approximately equal to two standard deviations of the flat movement (fluctuation) distributions. A significant reduction in data is achieved because activated stop-loss orders often feature as a continuous series. This approach significantly reduces the size of the dataset necessary for training and validation.
Due to the margin between the bid and ask prices, as well as the brokerage fees for each trade, movement goals of less than 0.001 delta of the EURO/USD exchange rate are virtually unprofitable. This is one of the reasons why deals in the Forex market are often unprofitable for clients: it is a 'negative-sum game' due to the various commissions, in contrast to the 'zero-sum game' (or antagonistic game) of game theory, where the total gain of all players is zero (Von Neumann & Morgenstern, 2020). This simple reasoning and calculation define the minimum price goal that needs to be tagged as the supervised learning output. After entering the market there are only two possible movement directions, up or down (buying or short selling), so the theoretical probability of randomly predicting the market movement is 50%. In this case, the minimum target for taking profit should be at least twice the stop loss (which defines the P3+ and P3− boundaries) to cover the possible loss from mispredictions. Thus, the minimum price goal that should be labelled in the training dataset as the supervised learning output is 0.002 of an increase (buying) or decrease (selling short) in the EURO/USD exchange rate.
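A sketch of how a signed future price movement could be mapped onto the [0, 16] index of the probability vector; the boundary values below are illustrative stand-ins loosely following the thresholds discussed (margin, minimum stop-loss, the 0.002 minimum target, stop-loss multiples, and the daily upper limit), not the paper's calibrated values:

```python
import bisect

# Illustrative boundaries (EUR/USD price deltas) for one half of the
# probability vector: flat noise, margin (~2 sigma), minimum stop-loss,
# the minimum profitable target (2x stop-loss), then 3x/6x/10x the
# stop-loss, and the 0.02 upper limit.
STOP_LOSS = 0.001
BOUNDS = [0.00003, 0.00006, STOP_LOSS, 2 * STOP_LOSS,
          3 * STOP_LOSS, 6 * STOP_LOSS, 10 * STOP_LOSS, 0.02]

def probability_index(delta):
    """Map a signed price delta to an index in [0, 16]; 8 is the P0 level."""
    level = bisect.bisect_right(BOUNDS, abs(delta))   # 0..8
    return 8 + level if delta >= 0 else 8 - level

print(probability_index(0.00001))   # 8  (flat zero level P0)
print(probability_index(0.0025))    # 12 (beyond the minimum target)
print(probability_index(-0.05))     # 0  (extreme downward move)
```

Symmetry around index 8 mirrors the P+ / P− halves of the vector, and the unbounded outermost regions correspond to P8+ and P8−.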
To explain the other boundaries of the probability vector, in the ranges between P4+ and P8+ and between P4− and P8−, it is necessary to evaluate the volatilities on different time frames within a daily trading session and to evaluate daily volatilities during a year.

| Volatility evaluation
Two types of volatility have been used: daily and hourly. The daily volatility is used to determine the outer borders of the vector of probabilities considered in the analysis. The hourly volatility is used to identify the range in the middle of the vector of probabilities whose data should be deleted to reduce the analyzed data size. For the determination of the extreme boundaries of the probability vector, it is necessary to estimate the maximum daily market volatility of the analyzed financial instrument. Absolute daily volatility can be calculated as the sum of the volatilities of short periods (e.g., one minute) during a trading day, as shown in Equation (2):

V_daily = max_k ( Σ_{i=1}^{N_k} [ Ask_{Δt,i}(max) − Bid_{Δt,i}(min) ] ),   (2)

where N_k is the number of one-minute intervals during day k, Δt is one minute, Ask_{Δt,i}(max) is the maximum value of the ask price in minute i, Bid_{Δt,i}(min) is the minimum value of the bid price in minute i, max(·)_k is the maximum value of the sum among all days of the analyzed period (for example, a year), and k = 1, 2, …, the number of trading (working) days during the year.
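Equation (2) can be sketched directly with pandas; the two-day tick sample below is hypothetical:

```python
import pandas as pd

# Absolute daily volatility per Equation (2): for each trading day, sum
# over one-minute intervals the (max ask - min bid) ranges, then take the
# maximum daily sum over the analyzed period.
ticks = pd.DataFrame(
    {"ask": [1.1002, 1.1010, 1.1005, 1.1020, 1.1015, 1.1040],
     "bid": [1.1000, 1.1006, 1.1001, 1.1016, 1.1010, 1.1030]},
    index=pd.to_datetime([
        "2020-03-02 09:00:10", "2020-03-02 09:00:50", "2020-03-02 09:01:20",
        "2020-03-03 09:00:05", "2020-03-03 09:00:40", "2020-03-03 09:01:30",
    ]),
)

# Per-minute max ask and min bid, dropping empty minutes between sessions.
per_minute = ticks.resample("1min").agg({"ask": "max", "bid": "min"}).dropna()
minute_range = per_minute["ask"] - per_minute["bid"]
daily_sums = minute_range.groupby(minute_range.index.date).sum()
max_daily_volatility = daily_sums.max()
print(round(max_daily_volatility, 6))
```

With this toy sample the first day sums to 0.0014 and the second to 0.0020, so the Equation (2) maximum is 0.0020.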
To speed up the calculations it is possible to determine the P7+ and P7− boundaries of the vector of probabilities using an approximate estimation of the daily volatility (Equation (3)). This level of volatility defines the upper target limit for intraday trading with this financial instrument. The second-to-last borders (P7+ and P7−) on both sides of the probability vector were defined as 0.02 delta of the EURO/USD exchange rate. The maximum values of the probability vector, denoted by P8+ and P8−, are not limited and correspond to values greater than 0.02 delta of the EURO/USD exchange rate.
In order to estimate the probability of reaching the minimum movement target, it is necessary to analyse the absolute hourly volatility within each trading day, either accurately using Equation (4) or approximately using Equation (5).
Here N_k is the number of one-minute intervals during hour k, Δt is one minute, Ask_{Δt,i}(max) is the maximum value of the ask price in minute i, Bid_{Δt,i}(min) is the minimum value of the bid price in minute i, max(·)_k is the maximum value of the sum among the hours of the analyzed period (for example, a year), and k = 1, 2, …, the number of trading (working) hours during the year.
Figures 8 and 9 show the distribution of the hourly volatilities as a boxplot and a histogram. It can be seen that volatility above 0.002 (the minimum target of an interesting market movement) lies above the third quartile. In 2019, during a period of relative economic stability, and in 2021, during the period of COVID-19 quarantines, the major currency markets were flat, and the third-quartile boundaries of hourly volatility for these years were slightly below 0.002. Since intraday trading in a flat period is likely to result in losses, it is necessary to recognize the flat periods more accurately so as to avoid trading during them. Thus, on average for the period from 2018 to 2021, intraday markets are in motion no more than 25% of the time, with prices flat for the remaining duration.
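The 25% "in motion" estimate can be reproduced on a toy sample: given hourly volatilities, count the share at or above the 0.002 minimum target. The sample below is invented so that exactly a quarter of the hours qualify:

```python
import numpy as np

# Hypothetical hourly volatility sample whose third quartile sits near
# 0.002, as reported for the 2018-2021 period.
hourly_vol = np.array([0.0008, 0.0011, 0.0014, 0.0016, 0.0018,
                       0.0019, 0.0021, 0.0025, 0.0007, 0.0013,
                       0.0015, 0.0031])

moving = hourly_vol >= 0.002           # hours reaching the minimum target
print(np.quantile(hourly_vol, 0.75))   # third quartile of hourly volatility
print(moving.mean())                   # fraction of hours "in motion"
```

Only the hours above the 0.002 threshold are worth labelling for training; the remaining flat hours are candidates for removal.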
In practical intraday trading, integer multiples of the size of stop-loss orders are often used as take-profit orders, with coefficients of 2 and 3 most commonly used (Cervelló-Royo et al., 2015; Di Lorenzo, 2012; Tureac et al., 2011). However, for market entry points that have been more accurately determined using advanced machine learning methods, values of 10 or even 100 have been used (Li & Zhou, 2021; Maratkhan et al., 2021). These integer multipliers vary for different analyzed financial instruments, so in our research, for practical trading, 3 and 10, along with a conditional midpoint integer of 6, are taken as the multiples of the stop-loss level used to split the remaining parts (from P4+ to P6+ and from P4− to P6−) of the probability vector. The sixth stop-loss multiple also corresponds to the beginning of the long thin tail of the hourly volatility distribution of Forex EURO/USD (Figures 5 and 9).
Figure 10 shows a comparison of the total (summed) volatilities calculated by Equations (3) or (6) at different time frames, in absolute value and as percentages, for one half of 2020. It can be seen that the total volatility of the 1-minute data is much larger (34.3) than that of the 1-day data (1.17). Thus, the potential targets when trading on a daily time frame are only 3.4% of those on a one-minute time frame, demonstrating a roughly thirty-fold increase in the potential profitability of short-term intraday trading.
It is impossible to decrease the time frame indefinitely, since the margin between bid and ask, as well as the brokerage fees for a huge number of transactions, would nullify all potential profit. Therefore, a reasonable trade frequency and target volatility must be chosen.

| RESULTS AND DISCUSSION
Figure 11 shows the big financial EURO/USD tick data for the first half of 2020. The period saw an increase in the medium-term volatility of financial markets associated with the beginning of the COVID-19 pandemic, reflected in the large daily number of ticks (settlements) in February and March; it reveals the most profitable entry points into the market and is a good visual demonstration of the potential and need for data reduction. Figure 12 shows that, despite the increased medium-term volatility associated with the COVID-19 pandemic, 90% of the time the market was flat (the zero level P0, index 8 of the vector of probabilities). This confirms the concept of a stepped model of the short-term movement of financial markets, where all significant movements take place quickly and within a short time frame.
Since GPU processing power and free cloud GPU resources are limited, and the number of analyzed tick-data rows runs to tens of millions, it is clearly beneficial to reduce the number of calculations in order to reduce the dependency on these resources. To achieve this, an additional variable L was introduced, equal to the number of ticks before a significant movement of the financial market begins. If the price does not go beyond the opposite stop loss during the L ticks after the considered tick, then the current tick is considered uninteresting, is excluded, and the algorithm moves to the next tick.
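A simplified single-price sketch of the L-tick filter; the function name and the use of a single price series rather than separate bid/ask columns are assumptions for illustration:

```python
# L-tick filter: a tick is kept only if, within the next L ticks, the price
# leaves the +/- stop-loss band around the current price; otherwise it is
# treated as flat and discarded.
def significant_ticks(prices, stop_loss, L=10):
    kept = []
    for i, p in enumerate(prices):
        lookahead = prices[i + 1:i + 1 + L]
        if any(abs(q - p) > stop_loss for q in lookahead):
            kept.append(i)
    return kept

prices = [1.1000, 1.1001, 1.1000, 1.1002, 1.1001, 1.1020, 1.1021]
print(significant_ticks(prices, stop_loss=0.001, L=3))   # [2, 3, 4]
```

Only the ticks immediately preceding the jump to 1.1020 survive; the flat ticks before and the post-jump ticks are dropped, which is the source of the data reduction reported below.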
Figure 13 shows the distribution of values of the vector of probabilities after removing all non-significant movements (those at the zero level of the vector of probabilities, index 8) during the next ten ticks for the first half of 2020, and demonstrates a ten-fold reduction in the size of the training dataset. Divisions 6, 7 and 9, 10 in Figure 13 correspond to the sectors P2+, P1+ and P1−, P2− of the probability vector, respectively. They are limited by the minimum stop-loss and constitute about 90% of the remaining dataset. As mentioned above, the stop-loss size was calculated based on the standard deviation of the flat fluctuation distribution. Consequently, a significant amount of uninteresting flat fluctuation is excluded from the training dataset.
Through a preliminary iterative approach, the most relevant value of L = 10 was chosen, producing a 100-fold decrease in the original big financial tick data. Whilst this reduction is significant, one disadvantage is that it could remove long recoilless movements that might otherwise be of interest. However, this is a very specific kind of loss, which we consider far less important than the benefits of our approach.
Figures 14 and 15 demonstrate the probability vector distribution with only the significant EURO/USD Forex market movements that are larger than double the stop-loss (without P3+ and P3−); this reduces the remaining dataset by a further 70%.
Similar calculations can be performed for any traded financial instrument for which big financial tick data has been obtained. Forex EURO/USD was chosen as the most liquid financial instrument in the largest financial market, foreign exchange.
Our data reduction algorithm can therefore be summarized as follows. For each sliding window iteration, the price movements in that window can be expressed as a frequency distribution. Assuming a normal distribution of ticks, the stop-loss price boundaries will vary according to the standard deviation of that distribution. For each sliding window iteration, the data reduction occurs in three stages. First, the non-significant ticks (the zero level of the vector of probabilities) are removed. Second, each successive tick within the current sliding window and within a given margin of the significant movement (the L-value, 10 in this case) is evaluated against the stop-loss price boundaries for that window, and this determines whether the current tick of interest is significant. Third, ticks within the medium levels of the vector of probabilities are also removed.

FIGURE 10 Comparisons of total volatilities of the EURO/USD Forex currency pair at different time frames.
Work by Borovkova and Tsiamas (2019) has demonstrated how data cleaning can reduce financial market big data by 10%, but preprocessing aimed at dataset size reduction is not widely undertaken or reported by other researchers in this domain. In Hirano et al. (2020) the financial data is not reduced, but the forecasting horizon is limited to one minute, which restricts the effectiveness of the forecast; in contrast, our method extends the forecasting time horizon to the end of the current trading day.
Some work on raw tick data by Sandubete and Escot (2020) has used a reinforcement learning approach where the pre-processing involves a ten-fold time delay that skips information and is confined to a one-month period of analysis. This was undertaken to save GPU power and mitigate prohibitive computing resources, but can exclude what might be important data. Similarly, in Carapuço et al. (2018) the authors used reinforcement learning but, again due to resource limitations, only analyzed every 5000th tick (equivalent to an analysis once every 2 h), significantly reducing the accuracy. Using the data reduction method presented in this current work, the authors consider that with the same hardware an analysis could occur every 17 ticks, corresponding to about every 30 s for 2003 EUR/USD tick data. Experiments presented by Fisichella and Garolla (2021) analyzed tick data for an 11-year period (01/01/2010–30/04/2021) with a 4-h step for the GBP/USD, EUR/USD, USD/CHF, USD/JPY, EUR/GBP and GBP/JPY currency pairs. Following our data reduction process, these experiments could be conducted by analyzing tick data for the same period and currency pairs but with a 50-second or one-minute step. In Tropmann-Frick and Tran (2023), the reinforcement learning training of convolutional neural networks was carried out on data aggregated from raw tick data with 15-min steps for the interval from January 2017 to December 2018 on the EUR/USD currency pair. Use of our data reduction pre-processing method would allow more data to be initially utilized to achieve comparable metrics with a 3-s step. Likewise, an agent-based reinforcement learning algorithm employed by Chen et al. (2018) used Taiwan stock index futures (TAIFEX) from 35 days (16th March 2017 to 19th April 2017) consisting of 396,000 ticks. Our data reduction pre-processing approach would allow the primary dataset to be extended to include a period of about 28 years.
Figure 16 shows a comparison of the accuracies obtained by the authors in this research and by Moghaddam and Momtazi (2021). Based on this summary diagram, two conclusions can be drawn. (1) CNNs are more accurate in finding the direction of the subsequent significant intraday price movements (accuracy = 84.57%), as they analyze short-term market participants' behaviour patterns, similar to the principles of analysis using Japanese candlesticks to find market turning points.

FIGURE 13 Distribution of the number of values of the vector of probabilities without the zero level.
FIGURE 14 The vector of probabilities with only significant financial market movements.
(2) LSTMs also show promising results (accuracy = 83.17%) in determining the market behaviour (trend or flat). This is due to the high importance of the sequence of data within each slice obtained by the sliding-window technique when an exit from the flat is expected.
The data preprocessing technologies proposed in this research contributed significantly to the high accuracy of the predictions (accuracy = 76.09%). The reduction algorithm reduces and balances the data, and therefore the prediction models gave better results than those of Moghaddam and Momtazi (2021). Conversely, the original big financial dataset is large and unbalanced and could not be used to train the learning model correctly without preliminary preprocessing and data reduction, or the use of supercomputers.
In summary, alternative approaches to tick data reduction that could be applied to financial data include, firstly, sampling the tick data at longer constant discrete intervals or, secondly, reducing the duration of the analyzed input data period. Whilst the predictive outcomes are comparable to the current work, the resource requirements of these alternative approaches (platform access and time) are a major limitation to widespread adoption, and one which our work addresses.

| CONCLUSION
Using the example of the first half of 2020 (the beginning of the COVID-19 pandemic) as the most volatile period (daily volatility) over the duration analyzed in this research, the application of our novel data reduction algorithm to the big financial tick data reduced the amount of data by 275 times (from 16.6 million to 60.1 thousand rows for the Forex EURO/USD exchange rate). This significantly reduced the amount of data for the subsequent neural network training, yielding a dataset of several thousand to tens of thousands of rows, which is typically the most convenient size for training neural networks. For other intervals from 2018 to 2021 the above data reduction ratios are even higher, since the daily volatility of those periods was less than in 2020.

FIGURE 15 Distribution of this probability vector with only significant EURO/USD Forex rates market movements for the first half of 2020.
FIGURE 16 Accuracy in predicting the trend or flat behaviour of the financial foreign exchange market, %.
Whilst our approach introduces an additional dropout parameter which could remove some data with potentially profitable movement (especially slow-starting movements), for intraday trading it is not essential to catch all such movements, but rather the movements with the highest probability of continuation and the least risk of triggering stop-losses. Thus this approach may be of limited use if the market is in a slow uptrend or downtrend without intraday volatility. This market behaviour is not typical in liquid foreign exchange markets; it usually occurs when there is a single dominant buyer, frequently the central bank that regulates that currency. It is also apparent that this method can be used only for financial datasets and cannot be adapted without significant changes to data from other machine learning application areas.
Considering that the number of calculations during training of a deep neural network grows roughly quadratically with the amount of data when gradient methods are used, a 300-fold decrease in the number of rows (transactions) equates to a roughly 100,000-fold decrease in the number of calculations, which in turn leads to huge improvements in processing time (five orders of magnitude observed in this work). This is significant, as the method makes it possible to transfer most similar tasks of training deep neural networks on big financial tick data from elite analysis on supercomputers to an accessible setting using a small set of GPUs.
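The quadratic-scaling estimate can be checked with back-of-envelope arithmetic using the dataset sizes quoted in this work; the assumption that training cost grows with the square of the row count is the one stated above.

```python
# Row counts for the first half of 2020, as quoted in the conclusion.
rows_before = 16_600_000   # raw ticks
rows_after = 60_100        # after the reduction algorithm

row_ratio = rows_before / rows_after          # ~275-fold row reduction
cost_ratio = row_ratio ** 2                   # quadratic cost assumption
print(f"row reduction: {row_ratio:.0f}x, "
      f"estimated cost reduction: {cost_ratio:.0f}x")
```

The result is on the order of 10^5, matching the "five orders of magnitude" improvement in processing time reported above.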
Improvement of the model is envisaged for future work by using unsupervised learning methods for more accurate clustering of intraday liquidity groups and by fine-tuning the L-ticks dropout parameter.
The time it takes to train a neural network is an essential (sometimes crucial) consideration for any data analysis where a classification and/or predictive model is to be generated from big data, because training could take months on high-performance computing resources whose access is limited and hugely expensive over the extended periods required. This limits research progress and practical applications in this field. Our work is innovative in offering a solution to this problem by proposing a method of non-trivial financial market data reduction without loss of essential information, effectively reducing training time to hours. From a practical perspective, computing resources and associated costs are reduced, opening up access to a far wider range of interested stakeholders in both pure research and the commercial sector.
Our novel financial data reduction approach could be used not only for the EUR/USD Forex currency pair but also for other financial market instruments such as other Forex currency pairs, financial indices, futures and shares.

Figure 3 charts the prices and volumes separately for the first three months and the first 500 ticks of 2020, allowing a relative appreciation of how the data is structured.

FIGURE 6 Daily volatility of the Forex EURO/USD from 2018 to 2021.
FIGURE 7 Boxplot of the daily volatility of the Forex EURO/USD from 2018 to 2021.

FIGURE 9 Hourly volatility of the Forex EURO/USD from 2018 to 2021.
FIGURE 8 Boxplot of the hourly volatility of the Forex EURO/USD from 2018 to 2021.

Figures 6 and 7 demonstrate the distribution, shown as a histogram and a boxplot, of the daily volatility of the Forex EURO/USD for four years, from 2018 to 2021 (the 2020 data shows increased volatility due to the COVID-19 pandemic). For this period, the maximum daily volatility, observed in 2020, did not exceed 0.035. The statistical outliers of the daily volatility of the Forex EURO/USD vary between 0.02 and 0.035 delta of the EURO/USD exchange rate.
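A minimal sketch of how such a daily-volatility figure can be computed from tick prices, taking volatility as the intraday range (high minus low) of the exchange rate; this reading of the "delta" units in Figures 6 and 7, and the function name, are our own assumptions.

```python
import numpy as np

def daily_volatility(day_prices):
    # Intraday range of the exchange rate: high minus low for one day.
    p = np.asarray(day_prices, dtype=float)
    return float(p.max() - p.min())

# Toy day: EUR/USD moving between 1.0800 and 1.0950.
vol = daily_volatility([1.0800, 1.0950, 1.0900, 1.0850])
print(f"daily volatility = {vol:.4f}")  # well below the 0.035 maximum noted above
```

Applying this per trading day over 2018–2021 and collecting the results would reproduce histograms and boxplots of the kind shown in Figures 6 and 7.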

FIGURE 11 Daily number of ticks (settlements) in the first half of 2020 (127 working days).
FIGURE 12 Initial distribution of the number of values of this probability vector.

Table 4 shows the hyperparameters for training two standard models of deep neural networks, namely the one-dimensional Convolutional Neural Network (CNN) and the Long Short-Term Memory (LSTM) model. These standard architectures have been used only to facilitate a comparison with the state of the art and have not been optimized in terms of the number and combination of layers. For training the networks,