Generative and self‐supervised ensemble modeling for multivariate tool wear monitoring

Development of an effective tool wear monitor requires maximum utilization of information from the associated data, especially in machine learning based modeling. However, this demands a large volume of varied annotated training data, which is not only expensive but often impractical to obtain. In the present work, a sequential approach of artificial data generation, followed by self-supervised pre-training, supervised model fine-tuning and final stacked generalized ensembling, is adopted to develop an effective tool wear monitor in a low data regime. Cross-validated results of the proposed methodology for tool wear prediction on an experimental data set of few labeled samples attained an averaged MAE of 0.035, RMSE of 0.045 and MAPE of 12.5% on the best-case ensemble, comparatively superior to a purely supervised deep model trained on the same data set, with an overall accuracy enhancement of over 25%. The proposed approach provides an effective experimental data augmentation technique while simultaneously minimizing aleatoric uncertainty and allowing for utilization of information from often ignored static cutting parameters.


Background and literature review
Tool wear monitoring plays a vital role in CNC machining by keeping track of a measure of cutting tool degradation as machining progresses. This is critical for safeguarding the dimensional and quality integrity of the machined part while concurrently being necessary for the next frontier of condition-based machine automation. The information used in a typical wear monitoring task is derived from two sources: the dynamic monitoring sensor signals and the static cutting parameters, such as cutting speed, feed and depth of cut, among others.[2,3] Despite the options in information usage, tool condition monitoring (TCM) is not a trivial task due to the challenge posed by continuous tool-work interaction, making the on-line wear determination task onerous. Artificial intelligence techniques, such as data-based machine learning (ML), have provided an avenue to tackle this challenge by relating a cutter's wear measure to indirect data features of sensor signals and cutting parameters.[4,5] This data-based approach, utilizing algorithms such as artificial neural networks (ANNs)[6,7] among many others,[8-10] has resulted in superior performance on different wear diagnostic and prognostic tasks as compared to previously used inflexible mathematical physics-based models.[11] ML-based deep modeling has extended this performance further by enabling on-line wear determination without the need for hand-crafted feature engineering.[12] Different deep modeling algorithms have been adopted for the wear estimation task, such as convolutional neural networks (CNNs),[13,14] recurrent cells[15,16] and varying combinations of the two,[17,18] among other algorithms.
[19,20] The accuracy of these models' predictions is generally anchored on the architecture used and optimized development. Variance in predictions obtained from different models on similar tasks can thus be attributed to algorithm (architecture) type, optimization, and even neural networks' sensitivity to different random initial weights at the start of training. Even though deep models have resulted in significant comparative performance enhancement, their deployment is inhibited in a low labeled data regime, where there exist significantly few experimental data instances with corresponding output indicators or labels. This is because their accurate performance relies on a comparatively huge volume of varied historical training data samples as compared to conventional shallow machine learning models. One solution to this data scarcity problem is to collect and annotate more experimental data. However, this is not only expensive but time and labor intensive. A viable alternative is to instead increase the data samples artificially. Generative modeling provides this avenue.
A trained generative model produces new varied data samples similar or relatable to the original training data set.[21] Such models typically use an encoder-decoder architecture, with the encoder learning useful representations of the training data while the decoder can be used for generative purposes. Generative models have been developed successfully for different applications in the fields of computer vision and natural language processing (NLP), such as in References 22,23 for audio generation and synthesis, by utilizing dilated causal convolution networks and various variations of generative adversarial networks (GANs). Studies in References 24,25 also use GAN variants, such as the information-theoretic extension, to learn generalized data representations for varying computer vision applications. For the tool wear monitoring task, the input data is usually a multivariate time series from several sensor channels and exhibits complex temporal relations. Attempts at generating synthetic time-series data for fields such as medicine[26] and finance[27] are reported in the literature. The work in Reference 28 utilizes the conditional sig-Wasserstein GAN for time-series generation based on explicit approximation of the signature of a path, and the usage of conditional GANs is widely reported for synthetic time-series data generation in various fields, as captured in References 29,30. For the tool condition monitoring task, studies such as References 31-34 have reported the utilization of GANs in synthetic data generation for sample augmentation, especially in low-labeled data scenarios, deploying such architectures as the singular GAN (SinGAN) and DCGAN, among other classical variants. Reported experimental results showed significant improvement in tool condition monitoring metrics as a result of augmenting experimental data with synthetic data in model training. On the other hand, other studies utilize the comparatively simpler to train restricted Boltzmann machine
(RBM)[35] and variational auto-encoder (VAE)[36,37] for univariate and multivariate time-series modeling and generation in tool condition monitoring. The GAN is favored for most generative modeling tasks, especially in computer vision, where it generates realistic images as though sampled from the true data set. However, its major shortcoming in practice is the tendency to produce samples with little diversity even when trained on a broad data set, a failure mode known as mode collapse. Most approaches proposed to address this challenge revolve around modifying model architectures, optimization algorithms and training loss functions, with as yet very little understanding of how they fix the problem.
Even though generated synthetic data would augment the available data set, it is unlabeled and as such cannot be used directly for a supervised task such as tool wear trending. Unsupervised learning techniques would thus need to be deployed on it in order for it to be useful. Unsupervised learning concerns extracting valuable information from unlabeled data to learn a representation that best exposes useful semantic features, which can then be easily decoded in a downstream task such as regression or classification. The combined successive usage of unsupervised pre-training of a model on unlabeled data, followed by re-using a portion of the model for the supervised learning stage, constitutes the semi-supervised learning approach.[38,39] This pre-training scheme is useful when a large volume of unlabeled data is available but annotated data is limited. Studies utilizing this paradigm for different condition monitoring tasks are reported in References 40,41. However, this conventional scheme is inflexible to downstream supervised task changes, as the knowledge learnt on the unlabeled data is specific to the associated task. A different alternative approach that would make use of generated synthetic data to produce a better disentangled, generalized model is the self-supervised learning (SSL) paradigm. In SSL, the representation of the structure of unlabeled data is learned through a pretext task, essentially turning an unsupervised learning problem into a supervised one.[42,43] A pretext task is a supervised learning problem formulated on pseudo-labels generated artificially for the unlabeled data. The knowledge derived from this pretext learning is then re-used for the main supervised learning problem. The SSL paradigm has gained significant popularity, especially in NLP, due to the success of generative pre-trained language models such as Bidirectional Encoder Representations from Transformers (BERT)[44] and GPT-3.
[45] These models were pre-trained on pretext tasks of predicting missing words in sentences sampled from vast text corpora. The models are then able to produce state-of-the-art results on various downstream tasks by simply training a single layer on top of the pre-trained network for the specific task. Attempts at using the paradigm for different time-series condition-based tasks are reported in References 36,46,47. In Reference 36, the pretext task aims to reconstruct data upon masking of some portions using an auto-encoder, before eventual usage for remaining useful life prediction of a machine tool. In References 47,48, contrastive approaches are used for the pretext task, whereby similar data samples are grouped closer together whereas diverse ones are pushed further apart with the aid of a similarity metric for distance measurement. The eventual tasks are bearing fault detection, time-series classification, and even change point detection. For a tool condition monitoring task, the work reported in Reference 49 utilizes comparative learning in model pre-training for useful feature extraction from color images of cutting force signals. The color images were developed by expanding each individual signal channel into grey-scale images via the Gramian angular field (GAF) technique and then stacking them into a color image. The extracted features are then used together with only a few labeled data samples to train a deep residual convolutional network, ResNet18, leading to attainment of enhanced classification precision. In all the aforementioned studies, the SSL pre-trained models outperformed purely supervised ones in low data regimes and in certain cases had competitive results even in high data scenarios. This clearly points to the promise of the approach. However, the challenge is the formulation of a pretext task which is relatable and useful for a specific downstream problem, with no clear guidelines available for pretext formulation.

Problem definition and contributions
Successful deployment of a data-based tool condition monitor in a practical machining environment requires the model to have been exposed to varied training data. Tool wear being a complex phenomenon with many intertwined variables would thus require collection of copious amounts of run-to-failure data for varied tool distributions. This is impractical for most cases. The availability of only limited annotated data with concurrent unavailability of unlabeled data presents a low labeled data scenario. Training and deployment of deep models in such common cases is inhibited, as they require both quantity and quality in training data variability. Additionally, static cutting parameters, such as feed rate, are generally ignored in deep wear modeling for on-line TCM despite the role they play in the tool wear rate. This is because they typically remain unchanged in a continuous machining operation, and their usage easily risks model overfitting at train time and poor generalization at test time. Moreover, variance in predictions by deep models is easily attributable to sensitivity to initial random model weights and the model's algorithm type. The methodology reported in this study seeks to address these problem scenarios. This work proposes a sequential methodology of generative modeling, followed by self-supervised pre-training and final supervised ensemble learning, for development of an end-to-end tool wear monitor on sensory data and cutting variables in a low annotated data regime. The available experimental sensory data is first used to train a generative model, with the trained model subsequently utilized to generate copious amounts of un-annotated data relatable to the experimental sensory data. The generated synthetic data is then used in the next methodology stage for self-supervised pre-training. A pretext task is formulated to this end for a generalized data structure representation, thus forming a pre-training framework for the downstream wear determination task. The
pre-trained model's weights from this stage are re-used as initial parameter weights for the succeeding supervised model fine-tuning using the few labeled experimental sensory data samples available. A single supervised fine-tuned model constitutes one base learner. Using different variations of pretext tasks, model algorithm types and initial random model weights, several base learners are trained. A stacked ensemble of the base learners is then created, to which a top-level meta learner is affixed to learn how best to combine the predictions of the individual learners. Additionally, static cutting variables of tool feed rate, depth of cut and encoded material type are fed into the meta learner for association derivation with tool wear. The meta learner thus takes two blocks of input: the predictions of the individual base learners and the static machining variables.
The main contributions of this work are, firstly, the adoption of the successive three-tier approach to enable deep model training for a tool wear monitoring task in a low data regime. Generative modeling allows for data augmentation by increasing training instances, though synthetic. This alleviates the high experimental costs of significant data collection, with no additional computational costs involved. Self-supervised pre-training, on the other hand, leads not only to utilization of the produced synthetic data but also to a generalized model for successful tool wear monitoring. The SSL pre-training also allows for the successful development of a supervised deep model on only a few labeled data samples. Additionally, the utilization of ensembling minimizes the propensity of deep models to be sensitive to parameter variation, which results in vastly different predictions. This provides for better generalization and accuracy. Moreover, the stacked generalization approach adopted allows for utilization of static cutting variables in wear modeling in a simplified final block, reducing the risk of model overfitting while simultaneously making use of the useful information they contain. The rest of this paper is organized into the following sections: theory, methodology, experiments, results and discussion, conclusion and finally references.

Generative modeling
A generative model is trained in an unsupervised environment to extract implicit abstractions from a dataset and use the learned knowledge to generate new data samples relatable to the original training set.[21] The VAE comprises an encoder-decoder architecture to encode an input x into a latent representation or coding h, followed by decoding of the hidden representation into a reconstruction x′ of the input.[50] However, instead of directly producing a coding for a given input, the encoder produces a mean coding μ and a standard deviation σ. The actual latent representation is then sampled randomly from a Gaussian distribution with mean μ and standard deviation σ. Thus, the VAE provides a probabilistic approach to latent vector representation, and hence its generative capacity: a random coding is simply sampled from a Gaussian distribution and decoded to produce a new instance that looks similar to the training samples. The goal in VAE training is two-part but pursued concurrently: find the parameters for the encoder and decoder that minimize the loss between the original input x and the reconstruction x′, while concurrently driving the latent representations to look as though they were sampled from a simple Gaussian distribution. The loss function in VAE training is thus a summation of the reconstruction loss and a latent loss computed via the Kullback-Leibler (KL) divergence, as provided in Equation (1):

L = L_recon − (β/2) Σ_{k=1}^{K} [ 1 + log(σ_k²) − σ_k² − μ_k² ]   (1)

where K is the number of latent variables and β is an adjustable hyper-parameter.
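As a minimal illustration of the sampling and latent-loss computation described above, the reparameterization trick and the KL term of Equation (1) can be sketched in NumPy (an illustrative sketch only, not the paper's implementation; the batch size, latent dimension K and β value below are placeholders):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I),
    # so that gradients can flow through the random sampling step
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_latent_loss(mu, log_var, beta=1.0):
    # KL divergence between N(mu, sigma^2) and N(0, I), summed over the K
    # latent variables and averaged over the batch (latent term of Equation 1)
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=1)
    return beta * np.mean(kl)

rng = np.random.default_rng(0)
mu = np.zeros((4, 8))        # batch of 4 samples, K = 8 latent variables
log_var = np.zeros((4, 8))   # log(sigma^2) = 0, i.e. sigma = 1 everywhere
z = reparameterize(mu, log_var, rng)
print(z.shape)                       # (4, 8)
print(kl_latent_loss(mu, log_var))   # 0.0 when the posterior equals the prior
```

When the encoder's posterior matches the standard Gaussian prior exactly, the latent loss vanishes; any deviation in μ or σ is penalized, which is what pulls the codings toward a distribution that can later be sampled for generation.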

Self-supervised learning
The general formulation of the SSL framework comprises:[51]

1. Pretext task definition: generating artificial labels for the unlabeled input data, based on an understanding of the data's structure.
2. Supervised pre-training: pre-training a model with the data-label pairs from the previous step.
3. Transfer learning: re-using the pre-trained model as initial weights to train for the specific downstream task of interest.
Various approaches have been adopted for SSL pretext task formulation. One is the generative approach, which involves recovery of the original information, such as by masking a token and trying to predict the masked token. Alternatively, there is the predictive approach, in which the artificial labels are designed based on clustering or augmentation of the data. A different approach is contrastive learning, in which a binary classification problem is set up based on positive and negative sample pairs generated by augmentation. A further approach uses bootstrapping, whereby two similar but different networks learn the same representation from augmented pairs of the same sample. However, there is no set framework for pretext task formulation that fits all schemes.

Ensemble modeling
Ensemble modeling involves the use of multiple models and aggregation of the predictions from the different predictors on the same input data set in order to improve accuracy on the particular prediction task.[52,53] Ensemble methods produce optimal results when the predictors are as independent from one another as possible, with one way of achieving this being the use of different algorithms. This increases the probability of the different models making varying errors, thus enhancing the ensemble's accuracy. Various approaches are utilized for ensemble modeling, with the simple methods involving max voting, simple averaging or weighted averaging. Generally, for the simplified approaches, the final output prediction is taken as the best of the rest or simply aggregated by some form of averaging. Alternatively, advanced ensemble methods are utilized, such as stacking, in which the predictions from each estimator are stacked together and used as input to a final estimator that computes the final prediction, with training of the final estimator accomplished via cross-validation. In the related blending approach, a holdout set from the training set is used to make predictions; the predictions and the holdout set are then used to build a final model that makes predictions on the test set, with other advanced options such as bagging or boosting also available. The choice of ensembling method thus depends on the accuracy and/or complexity desired.
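To make the simple aggregation options concrete, the following sketch averages the outputs of three regressors (all prediction values and weights below are made up for illustration; in practice the weights might come from validation performance):

```python
import numpy as np

# Hypothetical wear predictions from three base regressors on the same inputs
preds = np.array([
    [0.10, 0.22, 0.35],   # model A
    [0.12, 0.20, 0.33],   # model B
    [0.08, 0.24, 0.40],   # model C
])

# Simple averaging: equal trust in every model
simple_avg = preds.mean(axis=0)

# Weighted averaging: trust proportional to (hypothetical) validation accuracy
weights = np.array([0.5, 0.3, 0.2])
weighted_avg = weights @ preds

print(simple_avg)     # approximately [0.10, 0.22, 0.36]
print(weighted_avg)
```

Stacking replaces these fixed aggregation rules with a trained final estimator, which is the route taken later in this paper via the meta learner.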

Temporal analysis
In deep modeling, temporal associations in data are generally extracted via recurrent units, temporal convolution or attention-based networks, among other variants. A temporal convolution network (TCN)[54,55] for sequential data processing is premised on the 1D convolutional neural network (CNN), which applies multiple kernels across a sequence, strided along the time dimension, to output one feature map per kernel. Unlike the conventional CNN, though, the TCN has a broader receptive field through the use of dilated convolutions, thus allowing for longer sequence processing. A recurrent unit, on the other hand, such as a long short-term memory (LSTM) cell,[56,57] processes a sequence by outputting a value at each time step which is a function of the cell's input x_t and the value at the previous time step h_{t−1}. By incorporating both a short- and long-term state in its configuration, the LSTM can process a sequence and store relevant information in the respective states as need determines. As for the attention network,[58] it processes a sequence by developing focus on only a learned useful portion of the presented input, via weighted scoring based on normalizing the output scores of a feed-forward network at each time step, with global associations determined across elements in a sequence.
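The dilated causal convolution that gives the TCN its enlarged receptive field can be sketched in plain NumPy as follows (an illustration of the operation only, with a made-up kernel; real TCNs stack many such layers with learned kernels and growing dilation factors):

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation):
    # y[t] = sum_j kernel[j] * x[t - j*dilation]; left zero-padding keeps the
    # convolution causal, so no future time steps leak into the output
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(kernel[j] * xp[t + pad - j * dilation] for j in range(k))
        for t in range(len(x))
    ])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# With dilation 2, each output mixes the current step and the one 2 steps
# back, doubling the receptive field relative to an ordinary size-2 kernel
y = causal_dilated_conv1d(x, np.array([1.0, 1.0]), dilation=2)
print(y)  # [1. 2. 4. 6. 8.]
```

Stacking such layers with dilations 1, 2, 4, ... makes the receptive field grow exponentially with depth, which is why the TCN can process long sensory sequences efficiently.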

Notation
The input data for the tool wear prediction task comprises static scalars of cutting variables s_i ∈ R^m and multivariate time series of real values from N different sensor channels, {x_1, ..., x_L}, where m is the number of machining parameters, i is the data sample index, and L is the total count of data samples. Each time-series data sample is a 2D tensor x_i ∈ R^{l×N} of l time steps by N sensor channels. An input sample, denoted X_i, is thus a pairing of the scalars and the time-series sample, that is, X_i = {s_i, x_i}. For each input sample there is a corresponding real-valued scalar target y_i ∈ R of flank wear width. The wear monitoring task is thus formulated as a regression task of predicting the output value y_i for each input data sample X_i.
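The notation can be made concrete with a toy sample (shapes only; the dimensions l, N and m below are placeholders for illustration, not the paper's actual values):

```python
import numpy as np

l, N, m = 64, 6, 3                  # time steps, sensor channels, static parameters
x_i = np.zeros((l, N))              # one time-series sample, x_i in R^{l x N}
s_i = np.array([0.5, 1.5, 1.0])     # static cutting variables, s_i in R^m
X_i = {"series": x_i, "static": s_i}  # one input sample X_i pairs both blocks
y_i = 0.12                          # scalar flank wear target, y_i in R (hypothetical value)

print(X_i["series"].shape, X_i["static"].shape)  # (64, 6) (3,)
```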

Proposed methodology
The proposed methodology for building an effective tool wear monitor in a low data regime is as illustrated in Figure 1 and comprises successive stages of generative modeling, followed by SSL pre-training, then supervised fine-tuning and final generalized ensemble stacking via a meta learner. A VAE is utilized for the generative modeling stage because it allows for efficient Bayesian inference in probabilistic models, and is simpler in structure and training compared to GAN or transformer-based models. The architecture of the VAE used in this study is shown in Figure 2.
An input data instance is first passed through a temporal convolution network (TCN) block, followed by a batch normalization layer and then max pooling. The TCN block is used for determining time dependencies between data elements in the time-series sensory samples, whereas batch normalization is applied to ensure stability during training by re-centering and re-scaling the layer's inputs. Max pooling scales down the sequence length and introduces scale invariance for better generalization.

[Figure 1: Schematic block of proposed methodology approach. SL, supervised learning; SSL, self-supervised learning. X_exp is experimental sensory data while X_synth is generated synthetic data.]

Scaled exponential linear units (SELU) are used as activation functions in the layers due to their superior convergence performance compared to other activation functions. Equation (2) describes the SELU operation, with λ and α as hyper-parameters of choice:

SELU(x) = λx if x > 0; λα(e^x − 1) if x ≤ 0   (2)

Further processing is done through two fully connected neural layers to complete the encoder block. The output of the encoder is sent through a fully connected sampling layer to produce the mean and standard deviation codings. The decoder block then reverses the encoder processes. This involves data reshaping and up-sampling before final processing through a TCN and a 1-dimensional convolution layer to provide the reconstructions. The generative model is trained on the experimental sensory data X_exp as both input and output, with the produced reconstructions aimed at having minimal error loss with respect to the real inputs. The trained VAE model's decoder is then re-used to produce copious quantities of varied synthetic data X_synth resembling the experimental sensory data, by simply providing it with codings sampled from a random Gaussian distribution. The computation time required for generative VAE model training/testing depends on the computer resource utilized, the training sample count and the time-step length of each time-series sample. A comparatively high data sample count with an equally long sample series length results in significantly longer training/testing computation time. Generally, this computational time is of the order of minutes to an hour, all factors considered. However, once the model is trained, the sampling time for synthetic data generation is a fraction of the training/testing time, of the order of seconds to a minute, depending on the synthetic data count required.
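The synthetic-data step reduces to sampling Gaussian codings and pushing them through the trained decoder. A schematic sketch (the `toy_decoder` below is a stand-in callable with made-up dimensions, not the paper's trained network):

```python
import numpy as np

def generate_synthetic(decoder, n_samples, latent_dim, rng):
    # Draw random codings from N(0, I) and decode them into synthetic
    # sensory sequences resembling the experimental training data
    codings = rng.standard_normal((n_samples, latent_dim))
    return decoder(codings)

def toy_decoder(codings, seq_len=64, channels=6):
    # Stand-in decoder: a fixed random linear map plus a sigmoid, so outputs
    # land in [0, 1] like the normalized training data
    rng = np.random.default_rng(1)
    W = rng.standard_normal((codings.shape[1], seq_len * channels))
    out = 1.0 / (1.0 + np.exp(-(codings @ W)))
    return out.reshape(-1, seq_len, channels)

X_synth = generate_synthetic(toy_decoder, n_samples=100, latent_dim=16,
                             rng=np.random.default_rng(0))
print(X_synth.shape)  # (100, 64, 6)
```

Because no re-training is involved, generating additional synthetic samples is cheap relative to model training, matching the seconds-to-minute sampling times noted above.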
The SSL stage utilizes X_synth from the generative model in the pre-training step, but by first generating pseudo-labels y_synth for the data based on pretext task formulation. Two pretext tasks were formulated to this end. The first was designed as a multi-class classification problem of predicting the cluster id of a data sample, as provided after clustering the synthetic data X_synth using a time-series k-means classifier, chosen for its scalability and fast response. The second task was formulated as a forecasting task of predicting the masked final values of each sample instance. The task formulation was informed by the prior knowledge that a typical tool undergoes various distinctive wear progression stages during its life cycle. Having a model learn to agglomerate data samples into respective categorizations, or to learn masked sequence values, as upstream tasks can provide useful pre-training knowledge for temporal sequencing in the eventual tool wear prediction task. The model architecture chosen for the base learners used in the SSL stage comprises three levels, that is, data de-noising and feature selection, temporal feature extraction, and a top multi-layer perceptron predictor layer, with the selection guided by previously reported work.[59] The generalized base model configuration is as captured in the block diagram of Figure 3.
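The two pretext tasks can be sketched as follows (a simplified illustration: plain Lloyd's k-means on flattened series stands in for the time-series k-means used in the paper, and the sample counts, cluster count and mask length are arbitrary):

```python
import numpy as np

def kmeans_pseudo_labels(X, k, n_iter, rng):
    # Pretext task 1: cluster ids serve as pseudo-labels for a
    # multi-class classification pre-training problem
    X_flat = X.reshape(len(X), -1)
    centers = X_flat[rng.choice(len(X_flat), size=k, replace=False)]
    labels = np.zeros(len(X_flat), dtype=int)
    for _ in range(n_iter):
        d = ((X_flat[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X_flat[labels == j].mean(axis=0)
    return labels

def masked_forecast_labels(X, n_mask):
    # Pretext task 2: hide the final n_mask steps of each series; the model
    # is trained to forecast them from the visible prefix
    return X[:, :-n_mask, :], X[:, -n_mask:, :]

rng = np.random.default_rng(0)
X_synth = rng.random((50, 64, 6))          # 50 synthetic samples
y_cluster = kmeans_pseudo_labels(X_synth, k=4, n_iter=10, rng=rng)
X_vis, y_forecast = masked_forecast_labels(X_synth, n_mask=8)
print(y_cluster.shape, X_vis.shape, y_forecast.shape)  # (50,) (50, 56, 6) (50, 8, 6)
```

Either task turns the unlabeled synthetic set into a supervised pre-training problem, after which the learned weights initialize the downstream wear regressor.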
The base network in the de-noising and feature selection block is the Gated Residual Unit (GRU), which processes each input data stream individually in parallel in order to not only maximize information from all monitoring sensor streams but, importantly, also minimize the variance associated with varying scaling and noise in the input data. The softmax activation function is used for output processing of the concatenated channel, with its definition as represented in Equation (3):

softmax(z)_i = exp(z_i) / Σ_j exp(z_j)   (3)

For the temporal feature extractor, three model algorithms were explored and developed, that is, attention-, LSTM- and TCN-based feature extractors, with the choices informed by their usage in multiple reported works on temporal and sequential analysis. The feature extractor choices are as shown in Figure 4.
The top-layer MLP is simply a single fully connected layer followed by a dropout layer, which guards against overfitting by introducing randomness into the training process. The final output predictor layer has a varying number of nodes depending on the pretext task in question. The SSL pre-training step is thus a fitting of synthetic inputs X_synth to artificial labels y_synth in a supervised manner. The trained model from the SSL stage is re-used in the supervised learning stage to trend tool wear from the experimental time-series sensory data {X_exp, y_wear_truth}. The fine-tuning process of the SL stage is a fitting of experimental sensory inputs X_exp to experimental ground-truth wear values y_wear_truth, with training basically involving the top-layer MLP. Multiple models (base learners) are trained with different variations of the initial random seed, temporal feature extractor and SSL pre-training task, with these permutations constituting the various ensemble cases discussed in the subsequent experiments section. The ensembling technique used in this study involves stacking multiple trained predictors from the SL stage in a parallel configuration and adding a meta learner at the top of the structure to learn how best to agglomerate the wear predictions from the individual base learners. Additionally, the static machining scalars of feed rate, depth of cut and encoded material type are also fed to the meta learner to develop associations with tool wear. Ensembling is utilized to minimize prediction variance due to the uncertainty associated with deep models' sensitivity to random weight initialization, and also the model algorithm type used in sequential data analysis. This allows for comparatively better generalized models, which is essential for a tool wear monitor considering the varying wear distributions exhibited by different tools. The meta learner is simply a two-layer MLP of fully connected neural layers. In the ensembling stage,
the already trained base learners are not re-trained; only the meta learner is. Training of the meta learner is thus a fitting of the concatenated inputs of the individual base learners' predictions y_pred_i and the static machining variables X_static to the experimental ground-truth wear values y_wear_truth. The final tool wear monitor is thus an end-to-end stacked ensemble of multiple base learners with a top-level meta learner that takes as input the monitoring sensory data coupled with the static cutting variables, and outputs a wear prediction y_pred.
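The meta-learning step amounts to fitting a small model on the concatenation of base-learner predictions and static variables. The following sketch uses a linear least-squares model as a stand-in for the paper's two-layer MLP meta learner (all data below is made up for illustration):

```python
import numpy as np

def build_meta_inputs(base_preds, x_static):
    # Concatenate base-learner wear predictions with static machining
    # variables (feed rate, depth of cut, encoded material type)
    return np.hstack([base_preds, x_static])

def fit_linear_meta(X_meta, y):
    # Least-squares linear model with a bias column (stand-in for the MLP)
    A = np.hstack([X_meta, np.ones((len(X_meta), 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def predict_meta(w, X_meta):
    A = np.hstack([X_meta, np.ones((len(X_meta), 1))])
    return A @ w

rng = np.random.default_rng(0)
base_preds = rng.random((30, 5))   # 30 samples, 5 base learners
x_static = rng.random((30, 3))     # 3 static machining variables
y_true = base_preds.mean(axis=1)   # toy target the meta learner can recover
X_meta = build_meta_inputs(base_preds, x_static)
w = fit_linear_meta(X_meta, y_true)
print(np.allclose(predict_meta(w, X_meta), y_true))  # True
```

The base learners stay frozen throughout; only the weights w (or, in the paper, the MLP's parameters) are fitted, which keeps the static-variable block simple and reduces the overfitting risk noted earlier.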

Data description
Evaluation of the study methodology was carried out on the UC Berkeley milling data set acquired from the NASA Ames prognostics data repository.[60] It comprises 16 cases of milling tools used to cut metal in order to investigate tool wear under varying operating conditions of feed rate, depth of cut and material type. Monitoring data was obtained from three sensors, that is, acoustic emission, vibration and current sensors, stationed either at the machine table or the spindle. The three cutting parameters were varied over two levels each to provide eight case scenarios: feed rate of either 0.25 or 0.5 mm/rev, depth of cut of either 0.75 or 1.5 mm, and material type of either cast iron or stainless steel J45. The choice of levels for parameter variation was guided by industrial applicability and recommended manufacturer's settings. The experiments were repeated a second time with the same cutting parameters but different tools to provide the total of 16 cases. Table 1 summarizes the experimental conditions utilized for all the machining cases.
The monitoring sensor signals were captured at 250 Hz, with each cut having 9000 sampling points or time steps. A representative monitoring signal sample of a cut is shown in Figure 5 for cut number 100 in the data set, and clearly captures the tool entry, constant cutting and exit phases. However, certain captured signals, such as those corresponding to runs 17, 94 and 105, have significant anomalies and had to be excluded from the data set. These distorted signals, as shown in Figure 6, either have abnormally high signal amplitudes (Figure 6A,B) or a signature unrepresentative of a typical machining cycle (Figure 6C), and are thus outliers whose utilization would negatively impact convergence during model training. There are varying numbers of runs for each of the 16 cases, at which points the degree of flank wear was measured, up to a wear limit and sometimes beyond, but not always. There are a total of only 167 cuts for all the cases combined. Additionally, the flank wear values were not always recorded at the end of each run, leading to missing wear values. This further reduces the number of data samples available for analysis. The data set is thus not only significantly unbalanced but also fits perfectly into the low labeled data regime scenario, providing a basis for its usage in this study.
The experimental setup used for data collection is as illustrated in Figure 7, with the data collected on a Matsuura MC-510V machining center. The cutting tools used inserts of type KC710, with the work pieces sized 483 × 178 × 51 mm³. An MIO-16 (National Instruments) high-speed data acquisition board with a maximal sampling rate of 100 kHz was utilized for sampling the output sensory data via LabVIEW® software. The acoustic emission sensor used was a model WD 925 with a frequency range of up to 2 MHz, whereas the accelerometer was a model 7201-50 ENDEVCO with a frequency range of up to 13 kHz. For current measurements, a model CTA 213 current sensor was utilized.

Data preparation, models settings and experimental cases
The data sets used in model training for the various methodology stages vary and are thus handled differently. A typical monitoring signal, as depicted in Figure 5, captures three distinct cutting phases of tool entry, constant contact machining and tool exit. Visual inspection aided the initial processing of the experimental sensory data by selecting only the stable cutting region for use in model training. For the generative model, training is carried out on experimental sensory inputs only, without labels, as the aim is to generate a new synthetic dataset statistically resembling the experimental set. The data was pre-processed by scale normalization to lie in the range [0, 1], as provided by Equation (4):

$$ x_n^z = \frac{x_n - x_{\min}^{train}}{x_{\max}^{train} - x_{\min}^{train}} \quad (4) $$

where x_n is the time series of the n-th sensor channel, x_max^train and x_min^train are the maximum and minimum channel values as determined on the train set, and x_n^z is the normalized time series input data. The scale normalization fitted on the train set is applied unchanged to the test set, as per Equation (4). Input data normalization was done to ensure convergence and stable model training, which would otherwise be difficult when training on data sequences on different scales as captured from the multiple sensor channels. The scaling also allowed for the adoption of the binary cross-entropy loss for the reconstruction loss function, as opposed to the mean square error, for faster and better convergence. The binary cross-entropy log loss is described in Equation (5):

$$ L = -\frac{1}{m}\sum_{i=1}^{m}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right] \quad (5) $$

where y_i is the instance truth value, p_i the corresponding model prediction, and m the mini-batch samples count. The generative model's reconstruction training is thus modeled as a multi-label binary classification problem. A sample's time series length was chosen based on experimentation using sample lengths of 64, 128 and 256; the computational cost increases sharply with the sample length to be produced, with decreasing accuracy in reproduced samples. The sample length of 64 was
adopted for the significant comparative accuracy obtained in reconstruction. All successive methodology stages thus adopt the same sample sequence length. In the present study, a comparative computation time ratio of approximately 15:1 (training/testing time versus synthetic data generation time) was obtained for 40,000 generated samples versus a windowed training data count of 25,000. The selection of the generative VAE model's hyper-parameters was via a random search, as it was found to be more computationally efficient than a grid search, with 200 such iterations carried out. Initial parameter value ranges were guided by previously reported work on a closely similar architecture. The hyper-parameters in question were the number of filters used in the convolutions, the codings size, the dilations sequence and the kernel size of the convolutions. Evaluation of the generative model was based on its generated samples, with two metrics used for evaluation purposes in this study, that is, the maximum mean discrepancy (MMD) between the generated and experimental distributions, and the data usefulness metric. The MMD metric seeks to ascertain the relation between two distributions by checking whether any two sets of samples from different data sets were generated by the same distribution.61 The relation is established by comparing their statistics. The MMD represents the distance between two distributions as the distance between the mean embeddings of their features. Given two distributions P and Q over a set X, the MMD is defined by a feature map φ: X → H, where H is a reproducing kernel Hilbert space. Thus, in general, the MMD is given by Equation (6):

$$ \mathrm{MMD}(P, Q) = \left\lVert \mathbb{E}_{x \sim P}[\varphi(x)] - \mathbb{E}_{y \sim Q}[\varphi(y)] \right\rVert_{H} \quad (6) $$

F I G U R E 7
Schematic block representation of experimental setup used in data collection. RMS, root mean square.
which corresponds to the distance between the mean embeddings of P and Q based on a kernel function. MMD = 0 if and only if P = Q. The MMD is a value between 0 and 1, with a value close to zero indicating statistical closeness of the distributions. The MMD has found wide usage in applications such as detecting the distributional discrepancy in datasets, checking whether two distributions are the same, and as a loss function in ML model training, among other functions.61 The second metric, data usefulness, is based on the train-on-synthetic test-on-real (TSTR) paradigm. The generated synthetic data should be as useful as the real data when used for training a model for a predictive purpose. Evaluation of the generated data, and by extension the generative model, on the usefulness metric is based on its successful use in the downstream main task of tool wear prediction.
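As an illustration, the kernel form of Equation (6) can be estimated directly from two finite sample sets. The following is a minimal sketch, not the study's implementation; the Gaussian (RBF) kernel and the bandwidth `sigma` are illustrative assumptions:

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # Pairwise RBF kernel matrix between sample sets a (n, d) and b (m, d).
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def mmd(p, q, sigma=1.0):
    # Biased estimate of the squared MMD between samples p ~ P and q ~ Q:
    # mean within-set kernel values minus twice the cross-set mean.
    k_pp = gaussian_kernel(p, p, sigma).mean()
    k_qq = gaussian_kernel(q, q, sigma).mean()
    k_pq = gaussian_kernel(p, q, sigma).mean()
    return k_pp + k_qq - 2 * k_pq
```

Samples drawn from the same distribution yield an estimate near zero, while samples from clearly shifted distributions yield a markedly larger value, matching the interpretation used above.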
The SSL pre-training stage utilizes only generated synthetic data. The pseudo-labels generation is based on the pretext task. For the cluster id determination task, the clusters as determined by a time series k-Means classifier are adopted for y_synth, whereas the last six channel values per sample are adopted in the forecasting task. Upon annotation, training is then carried out in a supervised manner. The evaluation metrics for pretext task 1 are the accuracy, precision and recall, as provided in Equations (7), (8), and (9):

$$ \mathrm{Accuracy} = \frac{t_p + t_n}{t_p + t_n + f_p + f_n} \quad (7) $$

$$ \mathrm{Precision} = \frac{t_p}{t_p + f_p} \quad (8) $$

$$ \mathrm{Recall} = \frac{t_p}{t_p + f_n} \quad (9) $$

where t_p is the cluster true positive count, t_n the true negative count, f_p the false positive count, and f_n the false negative count respectively. In order to eliminate classification bias, the same sample size per cluster was picked, with the new balanced set then used in model pre-training. The softmax function was used for the layer output in task 1, as defined in Equation (3). Evaluation of pretext task 2 was based on minimizing the mean square error between the model predictions and the assigned pseudo-labels per data instance, as provided for in Equation (10):

$$ \mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(y_i^{truth} - y_i^{pred}\right)^2 \quad (10) $$

where y_i^truth are the assigned pseudo-labels per data instance, and y_i^pred are the corresponding forecast predictions for the same input data instance. For the supervised training of the eventual end-to-end model, experimental data from cases 1 to 8 were used for model training, with the exception of case 6, which only has one data instance. Due to the unbalanced nature of the data as a result of uneven runs per experimental case, samples corresponding to cases 15 and 16 were additionally added to the train set for augmentation. The remaining case samples, 9 through 14, were used as the test set. This resulted in 73 case samples used in training and 70 utilized for testing. This data split selection was informed by two facts: first, each experimental case was repeated a second time using similar machining conditions but with different tools, thus by using one case for model training,
the repeated case can be used for model testing. Secondly, at practical test time, a trained model is exposed to data from tools yet unseen to it; the data-split formulation chosen thus allows for better generalizability for model deployment. [63][64] The exclusion of case 6, with only one data sample, leaves 15 cases, providing the choice of hyper-parameter k in the cross-validation, with the stratified fold approach ensuring preservation of class distribution given the unbalanced nature of the experimental data. The 15-fold stratified cross-validation thus involved splitting the dataset into 15 folds corresponding to the experimental cases. In the first instance, the first 14 folds are used to train the model, while the 15th is used as the hold-out test set. The training/testing process is then repeated with a different hold-out test set fold until every fold has been used as the test set, providing for a total of 15 model evaluation runs. The averaged performance from these runs provides the final overall prediction results.
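The case-wise fold construction described above can be sketched as follows; the function name and list-based indexing are illustrative, not from the original implementation. Each fold holds out exactly one experimental case, so a test case never contributes samples to training:

```python
def case_wise_cv(case_ids):
    # case_ids: one experimental-case label per data sample.
    # Yields one fold per case: train on all other cases, hold the
    # remaining case out as the test set (guarantees no data leakage).
    for held_out in sorted(set(case_ids)):
        train_idx = [i for i, c in enumerate(case_ids) if c != held_out]
        test_idx = [i for i, c in enumerate(case_ids) if c == held_out]
        yield held_out, train_idx, test_idx
```

With 15 remaining cases this produces the 15 evaluation runs described, whose metrics are then averaged for the final result.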
Hyper-parameter selection for the base learner was based on random experimentation on different values for sensitivity, with no joint optimization carried out as yet. Four levels were adopted for each hyper-parameter choice: encoding size values of 16, 32, 64 or 128; attention head size of 64, 128, 256 or 512; head count of 1, 2, 4 or 8; supervised layer nodes of 16, 32, 64 or 128; and dropout values of either 0.2, 0.3, 0.4 or 0.5. The comparatively higher value choices led to significantly increased computational processing load due to parameter explosion, which inadvertently elevated data overfitting risks.
Thus, performance across these value variations in different experiments aided the selection. Table 2 summarizes the hyper-parameter selections for the VAE and a single base learner.
The loss function utilized in the supervised model training is the mean squared error between ground truth wear values and predictions, with the general formulation as provided in Equation (10). The adaptive momentum estimation (Adam) optimization function was used for model weight updates at train time, with an exponentially decaying learning rate from an initial value of 0.01. The choice of the initial learning rate value was from random experimentation. The evaluation metrics for model performance were the mean absolute error (MAE), mean absolute percentage error (MAPE) and root mean square error (RMSE) between the truth and predicted wear values, as provided by Equations (11), (12), and (13) respectively:

$$ \mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|y_i^{truth} - y_i^{pred}\right| \quad (11) $$

$$ \mathrm{MAPE} = \frac{100\%}{m}\sum_{i=1}^{m}\left|\frac{y_i^{truth} - y_i^{pred}}{y_i^{truth}}\right| \quad (12) $$

$$ \mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i^{truth} - y_i^{pred}\right)^2} \quad (13) $$
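The three evaluation metrics of Equations (11) to (13) can be expressed compactly as below. This is a generic sketch; the `mape` helper assumes no zero ground-truth wear values, since the percentage error divides by the truth:

```python
import numpy as np

def mae(y_truth, y_pred):
    # Mean absolute error, Equation (11).
    return np.mean(np.abs(y_truth - y_pred))

def rmse(y_truth, y_pred):
    # Root mean square error, Equation (13).
    return np.sqrt(np.mean((y_truth - y_pred) ** 2))

def mape(y_truth, y_pred):
    # Mean absolute percentage error, Equation (12);
    # assumes y_truth contains no zeros.
    return 100.0 * np.mean(np.abs((y_truth - y_pred) / y_truth))
```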
Using different variations of the pretext task, temporal feature extractor model and initial random seed, three ensemble cases were explored, resulting in the development of three stacked models (ensembles). Ensemble case 1 involved stacking multiple base learners pre-trained on the same pretext task and having the same architecture, but randomly initialized differently at the start of training. Ensemble case 2 involved stacking base learners utilizing different feature extractor algorithms but pre-trained on the same pretext task with the same initial random seed. As for ensemble case 3, the base learners were of the same architecture but pre-trained on different pretext tasks. Analysis of the different cases provides an enlightening insight into the effect of the proposed SSL pre-trained ensembling approach as related to the choices of model algorithm, pretext task, and variance associated with weight initialization. In the cases of selection of a similar feature model type and/or pretext task, the choice was based on experimentation, with the best performance guiding selection. The performance of the developed ensembles was compared against a model purely supervised trained on experimental data only. In order to ensure competitive comparison, the architecture of this model was chosen to be the same as the best performing architecture of the base learners in the ensembles. The attention-based learner was chosen to this end, with the details of its configuration as described in the methodology section. Additionally, strided data windowing was utilized for additional augmentation, with the same data set applied for the developed ensembles. Table 3 summarizes the ensemble cases and the choices therein. The models were developed using the TensorFlow Keras® deep learning library in a Python® environment. The computing resource utilized was an Intel® Core i5 3 GHz 4 GB RAM CPU, with additional hardware acceleration provided via a GPU through the Google® Colab platform.
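The stacked generalization step can be illustrated with a simple linear meta learner; this is a hypothetical sketch (the study's meta learner may differ), showing how the base learners' wear predictions are concatenated with the static cutting parameters (feed rate, depth of cut, material type) before the final regression:

```python
import numpy as np

def fit_meta_learner(base_preds, static_params, y):
    # Stacked generalization: regress the true wear on the base learners'
    # predictions plus the static cutting parameters (with a bias column).
    X = np.column_stack([base_preds, static_params, np.ones(len(y))])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def meta_predict(w, base_preds, static_params):
    # Ensemble prediction from fitted meta-learner weights.
    X = np.column_stack([base_preds, static_params,
                         np.ones(len(base_preds))])
    return X @ w
```

This design lets the often-ignored static parameters enter the model at the meta level rather than through the temporal feature extractors.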

Generative model evaluation
Training of the generative VAE model was aimed at lowering the absolute loss between reconstructions and corresponding original data samples, while simultaneously pushing the codings from its encoder toward a Gaussian distribution. Visual evaluation of the VAE's generated real valued samples vis-a-vis the original experimental dataset is difficult to infer from, as can be evidenced in the sample illustrations in Figure 8. Performance evaluation of a synthetic time-series data generator is significantly non-trivial as compared to the usage cases in computer vision and NLP. This is because, as an example for image generation in computer vision, simple visualization of the generator's output would provide feedback on whether a model's generated images are realistic or otherwise. This is not the case for synthetic time series data, and the longer the series, the greater the problem dimension. The generated data samples using the VAE's decoder were evaluated on two metrics, that is, the maximum mean discrepancy (MMD) and the usefulness metric, as previously described in the experiments section. By taking a fixed set of M generated samples and a similar number of experimental data samples, the MMD between the two distributions using a Gaussian kernel evaluated to 0.145, indicating a statistical closeness between the generated samples and the original experimental monitoring signals data. Evaluation on the generated data usefulness metric was based on the performance of a subsequent model trained on this synthetic dataset. If a model trained on the generated samples, when tested on actual real data, attains comparatively good and acceptable performance, then the dataset is considered useful. Analysis of this metric on the VAE's generated samples is captured in the subsequent sub-section when evaluating the final produced models on the wear prediction task. The variability of the synthetic set is important for useful adaptation downstream. An indication of the generated samples' variability can be inferred from their
cluster distribution as produced by the k-Means classifier for use in the cluster determination pretext task. The cluster distribution for the generated samples, as shown in Figure 9, indicates clear variability in the obtained samples.
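The TSTR usefulness check can be illustrated with a linear probe standing in for the downstream predictive model; the helper name and the linear model are assumptions for illustration only. A model is fitted on synthetic data and its error is measured on real data:

```python
import numpy as np

def tstr_score(x_synth, y_synth, x_real, y_real):
    # Train-on-synthetic, test-on-real (TSTR): fit a simple linear probe
    # on the synthetic set, then report its MAE on the real set. A low
    # score means the synthetic data is "useful" for the downstream task.
    A = np.column_stack([x_synth, np.ones(len(x_synth))])
    w, *_ = np.linalg.lstsq(A, y_synth, rcond=None)
    B = np.column_stack([x_real, np.ones(len(x_real))])
    return np.mean(np.abs(B @ w - y_real))
```

If the synthetic data follows the same input-target relation as the real data, the TSTR error stays close to the train-on-real baseline.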
F I G U R E 9 Generated samples cluster distribution. Time series k-Means classifier used for the clustering.

Influence of self-supervised learning
The two pretext tasks formulated for the SSL stage, on the synthetic dataset obtained from the generative model, translated to time series clustering and forecasting tasks respectively. Successful performance of the subsequent model pre-trained on either of these tasks required the best performance on the associated metrics for each task, in order to maximize the learnt knowledge in the downstream wear prediction task. For the series cluster id identification task, this meant attainment of a high score on each of the class specific metrics of precision and recall, with the maximum attainable score of 1.00 indicative of 100% accurate classification. The obtained metric scores for the clustering pre-training task on a hold-out test set are summarized in Table 4, with an averaged classification accuracy of 94% attained. On the other hand, for the series forecasting task, attainment of a low mean squared error loss between predictions and pseudo-labels was the aim. With no set lower limit or guarantee of attainment of the same, repeated experimentation on hyper-parameters to achieve the lowest possible error metric sufficed for this case. The performance, on the single exclusive hold-out test set and 15-fold cross-validated, of the three developed model ensembles versus the base comparison model on the different evaluation metrics is summarized in Table 5.
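A minimal sketch of the two pseudo-label constructions is given below. Both helpers are illustrative: the plain k-means on flattened windows is a stand-in for the time series k-Means classifier actually used, and the forecasting targets are interpreted here as the last six time steps of every channel:

```python
import numpy as np

def forecasting_pseudo_labels(x, horizon=6):
    # Pretext task 2: split each window (samples, steps, channels) so the
    # last `horizon` time steps of every channel become forecast targets.
    return x[:, :-horizon, :], x[:, -horizon:, :]

def cluster_pseudo_labels(x, k=4, iters=20, seed=0):
    # Pretext task 1: assign a cluster id to each window using plain
    # k-means on flattened windows (stand-in for time series k-Means).
    flat = x.reshape(len(x), -1)
    rng = np.random.default_rng(seed)
    centers = flat[rng.choice(len(flat), k, replace=False)]
    for _ in range(iters):
        d = ((flat[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = flat[labels == j].mean(0)
    return labels
```

The resulting cluster ids (task 1) and forecast targets (task 2) then serve as y_synth for the supervised-style pre-training described above.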
It is observed that the results obtained from the 15-fold cross-validation exhibit comparatively higher prediction errors than the single hold-out test set results, but with general closeness in valuation, a deviation of approximately ±20%. The comparatively higher errors are attributable to increased variance from exposure to a wider data set, which though provides for lowered bias and hence more reliable results. The closeness in prediction values of the two validation approaches is attributable to the lack of data leakage, with a respective test case never being used in model training, providing a good estimate of the model's performance on yet unseen data. Irrespective of the validation approach, it is observed that all three SSL pre-trained ensemble models completely outperformed the base comparison model, which was supervised trained on the few labeled experimental data only, on all evaluation metrics. The influence of knowledge learnt from the copious amounts of varied synthetic data is seen in the eventual performance enhancement obtained by the SSL pre-training approach, as evidenced by reduced mean absolute errors and percentage errors for all experimental test cases. Further illustration of model performance is captured in the wear trends of Figures 10 and 11, with the wear trends in Figure 10 comparing the truth plots versus the prediction plots of the supervised-only trained model and ensemble 1 only, for simplicity in comparative analysis. The wear plots of all the models compared in Table 5 are captured in Figure 11. The predicted wear trends of all the stacked ensembles closely trace the truth plots for all experimental test cases, with significant variation mainly for test case 11. This can be attributed to the unbalanced nature of the data set causing irregular exposure. The comparatively better predictive results were obtained for stacked ensemble 1, with the averaged 15-fold cross-validated performance on the data set providing
an MAE of 0.035, RMSE of 0.045, and MAPE of 12.5%, as compared to the supervised-only trained model with an averaged MAE of 0.115, RMSE of 0.175 and MAPE of 40%.

F I G U R E 11 Regressive wear plots; truth versus predicted, all cases. All three stacked ensemble prediction plots included.

Influence of ensembling
The influence of ensembling is best captured by analyzing the performance of the individual base learners making up a stacked ensemble. The averaged performance on all test data of the constituent base models in each ensemble case versus the stacked ensemble, on the MAE metric, is shown in Figure 12, with extra evaluation indices summarized in Table 6.
In analyzing ensemble case 1, the effect of different random weight initializations is seen in the varying MAE and MAPE values obtained, clearly evidencing aleatoric uncertainty. The stacked ensemble, though, smooths out this variance and results in even lower MAE and MAPE values, partly also due to the additional information gained from the static cutting variables of feed rate, depth of cut and material type. For ensemble case 2, the influence of different base learners in terms of algorithm (architecture) is captured in the completely varied results. The attention-based learner and the LSTM appear to perform relatively better compared to the TCN base learner. This is attributable to the memory capacity of the two in temporal analysis as compared to the TCN. The stacked ensemble of the three with a meta learner, though, offsets the significant performance variation, allowing for different model architectures with varying strengths to be adopted. Analysis of ensemble case 3 provides an insight into the effect of the choice of pretext task for the SSL stage. The cluster determination pretext appears to produce a better generalized model for the downstream wear determination task as compared to the model pre-trained on forecasting. This is attributable to the fact that pretext task 2 essentially constituted multi-variate forecasting on a mean absolute loss, in which it is difficult to attain best convergence as compared to the multi-class classification of pretext task 1. Moreover, the forecasting feature may not generalize well for the wear determination task, as it does not fully relate to wear trending. The cluster identification task, on the other hand, appears to correlate different series to wear phases and the varying experimental cases. The performance of an SSL pre-trained model is thus

TA B L E 6
Base learners' averaged performance evaluation on different indices. Note: model name notations as referenced in Table 3.
heavily influenced by the formulated pretext task. However, for real valued time series data, there is no guideline on how to best achieve an effective formulation; it is thus dependent on the task at hand. The performance of the ensemble 3 model, though, shows that multiple tasks can be combined to leverage the different information learnt, thus minimizing the variance associated with pretext task choice. Conversely, an unhelpful pretext task choice could significantly lower overall model performance. All the developed ensembles, though, provide enhanced model performance, allowing a deep model to be trained on only a few labeled data samples. Based on the best ensemble cross-validated results, the averaged performance enhancement over the supervised-only trained model constituted error reductions of 0.08 in MAE, 0.13 in RMSE, and 27.5 percentage points in MAPE respectively. As a further validation of the proposed methodology, the performance of the developed ensemble models was compared with other work reported in the literature on the same experimental data set. However, different reported works utilize varying experimental data train/test scenarios, making it difficult to realize a direct inference. The approaches reported in those works, though, rely on the effectiveness of the optimized developed models' architecture or algorithm for effective results in the condition monitoring tasks, an approach in essence directly similar to the base comparison model already utilized in this work. The approach reported in this work thus still provides a performance enhancement in comparison to reproduced cases from the literature on the test set used in this study.

CONCLUSION
This study proposed a contiguous approach for the development of an end-to-end on-line tool wear monitor on both sensory and static cutting parameters for a low annotated experimental data regime, while concurrently addressing challenges associated with prediction variance due to different model algorithms and aleatoric uncertainty. Generative modeling allowed synthetic data use to augment the available few experimental samples, essentially negating the comparatively expensive need to collect and annotate vast data samples. The performance of the developed stacked ensembles has shown that prediction variance associated with model algorithm type and sensitivity to weight initialization can be minimized significantly using the stacked approach, leading to enhanced model accuracy. Moreover, adoption of information related to the static cutting parameters via the meta learner allows for a simplified utilization of this knowledge in real time wear trending, leading to comparatively better performance. The self-supervised pre-training approach showed that a better generalized model can be developed via the approach utilizing a synthetic data set, thus enabling the eventual successful development of a deep supervised learner on only a few labeled data samples. The success of the self-supervised pre-training, though, hinges greatly on the formulated pretext task(s). The proposed approach is still viable even for the case where vast unlabeled data is available with concurrent few labeled samples, as the only step that would then not be required is generative modeling. All the developed self-supervised pre-trained ensembles completely outperformed a purely supervised trained model on the same few experimental data set. Future work will involve model interpretability as related to the determination of the influence of different cutting parameters on tool wear as provided by a deep neural net, which is still a black box in terms of its explainability. Additionally, exploration of more pretext
tasks will also be carried out, coupled with models' hyper-parameter optimization.
F I G U R E 2 Schematic illustration of developed VAE architecture. X_input is the sensory input data whereas X_recon are the reconstructions of X_input.

F I G U R E 3
Schematic illustration of a base learner's architecture. The three main functional blocks are the denoising and feature selection block, the feature extractor network, and the top level multi-layer perceptron (MLP) block.

F I G U R E 8
Schematic illustration of a generated versus an experimental sensory sample; (A) is a normalized generated sample while (B) is its re-scaled version.

F I G U R E 10
Regressive wear plots; truth versus predicted, simplified. Only stacked ensemble 1 prediction plots included for simplified comparison.

F I G U R E 12
Models performance comparison on the MAE metric. att, attention; foc, forecasting; clust, clustering.
It thus estimates the probability p(x) of observing observation x, and requires no labels. However, if the data is labeled, it estimates the conditional distribution p(x|y), where y is the label. The training dataset comprises examples x_1, …, x_n, which are samples from a true data distribution p(x). At the start of training, the generative model outputs a random distribution, such as a unit Gaussian distribution. The goal is then to find the model's parameters θ that produce a distribution closely matching the true data distribution. The most prominent deep generative models are the generative adversarial network (GAN), the variational auto-encoder (VAE), and the transformer-based models, for example, the generative pre-trained transformer (GPT) models.
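A VAE fits its parameters by balancing a reconstruction term against a regularization term on the encoder codings. The following numpy sketch of a per-sample VAE objective (binary cross-entropy reconstruction plus KL divergence to a unit Gaussian, matching the losses described in the experiments section) is illustrative, not the study's implementation:

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var, eps=1e-7):
    # Per-sample VAE objective: binary cross-entropy reconstruction term
    # (inputs scaled to [0, 1]) plus the KL divergence that pushes the
    # codings, with mean `mu` and log-variance `log_var`, toward a unit
    # Gaussian. Returns the summed loss for one sample.
    x_recon = np.clip(x_recon, eps, 1 - eps)  # avoid log(0)
    bce = -np.sum(x * np.log(x_recon)
                  + (1 - x) * np.log(1 - x_recon), axis=-1)
    kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var), axis=-1)
    return bce + kl
```

When the codings already match the unit Gaussian (mu = 0, log_var = 0), the KL term vanishes and only the reconstruction error remains; any shift of the coding mean away from zero adds a positive penalty.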
Schematic illustration of model choices utilized for the feature extractor network; (A) is an attention-based network, (B) is an LSTM network, while (C) is a temporal convolution network (TCN). Abstract features input is derived as the output of the denoising and feature selection block.

TA B L E 1 Experimental conditions.
F I G U R E 6 Schematic illustration of distorted sensory signals; (A) and (B) are the captured signals for cut numbers 17 and 94 respectively, with abnormally high amplitude values; (C) is the captured signal for cut number 105, with an uncharacteristic signature unrepresentative of a typical machining cycle.
TA B L E 3 Ensemble cases summary.
TA B L E 4 Classification report on clustering task.

TA B L E 5 Models performance evaluation on different indices.