Application of nonparametric quantifiers for online handwritten signature verification: A statistical learning approach

This work explores the use of nonparametric quantifiers for the online handwritten signature verification problem. We used the MCYT-100 (MCYT Fingerprint subcorpus) database, widely used in signature verification studies. The discrete-time x-axis and y-axis position sequences provided in the database are preprocessed, and time-causal information is extracted with nonparametric quantifiers: entropy, statistical complexity, Fisher information, and a trend statistic. We also evaluate these quantifiers on the first and second derivatives of each position sequence to capture the dynamic behavior of the velocity and acceleration regimes, respectively. The signatures in the MCYT-100 database are classified via Logistic Regression, Support Vector Machines (SVM), Random Forest, and Extreme Gradient Boosting (XGBoost), with the quantifiers used as input features to train the classifiers. To assess the ability of the nonparametric quantifiers to distinguish forged from genuine signatures, we used variable selection criteria: information gain, analysis of variance, and the variance inflation factor. The performance of the classifiers was evaluated with measures such as specificity and area under the curve. The results show that the SVM and XGBoost classifiers present the best performance.


INTRODUCTION
For centuries, handwritten signatures have served as a fundamental method for personal identification and authorization. A signature, representing a handwritten rendition of an individual's name or mark, is utilized to verify legal and financial transactions and establish authorship [1]. The significance of handwritten signatures extends into diverse applications, encompassing online banking, credit cards, and cheque processing mechanisms [2]. Moreover, biometric systems play a pivotal role in the authentication and validation of passports, employing methods such as signature verification [3,4] and user behavioral characteristic verification in the realm of digital forensics [5-7]. Signing is a minimally invasive form of identification compared to methods like DNA, fingerprint, and blood analysis. Because of their widespread use, automatic signature verification techniques are critical to validate and authenticate signatures [8]. However, handwritten signatures are prone to forgery. With sufficient practice, forgers can produce forgeries that closely resemble legitimate signatures, making them hard to detect [9-11]. Therefore, signature verification poses a formidable challenge. Not only must systems account for a forger's skills, but also the natural variation within a single person's signatures over time. A person's signature changes gradually due to physical, psychological, and mental factors, thereby increasing variability [12].
Signatures possess several key characteristics, including shape, movement, and variation [13]. Of these attributes, movement is particularly significant since it arises from the muscles controlling the fingers, hand, wrist, and shoulder. Generally speaking, a genuine signature exhibits ballistic movement: fast, predetermined movement without positional feedback that originates in the brain. As such, it cannot be performed slowly. In contrast, a forged signature represents a deliberate attempt to reproduce a signature with the aid of positional feedback [14]. Whereas an authentic signature flows spontaneously from the brain's motor control centers, a forgery is carefully constructed in a step-by-step fashion.
There are two main approaches for capturing and analyzing signatures: offline methods and online methods. In the online approach, a person signs using a digital device like a graphics tablet, tablet computer, or smartphone. These devices track hand motion and convert it into sequential handwriting data. Depending on the specific hardware used, additional information such as stroke order, pen pressure, and hand speed can be captured [15]. Signature verification using online methods tends to produce higher accuracy rates [16].
Signature verification entails classifying a signature as either genuine or forged for a given individual. This pattern recognition process comprises four key phases: data preparation, feature extraction, feature selection, and classification [17].
In this work, we explore using nonparametric quantifiers for signature classification in a case study of handwritten signatures. We build upon our previous work on handwritten signature classification and verification [18,19] by evaluating nonparametric quantifiers on the X and Y coordinates of the time series representing signature data, along with their first and second derivatives. The paper is organized as follows: Section 2 reviews relevant literature on signature verification and summarizes key methods used in this work; Section 3 describes the signature database used and the methods applied for signature verification using nonparametric features extracted from time series; Section 4 presents the results obtained from this study; and Section 5 concludes the paper with final remarks.

BACKGROUND
There has been extensive research in recent years on online signature verification using various preprocessing techniques, feature extraction methods, and verification algorithms. Comprehensive surveys of signature verification can be found in Plamondon and Lorette, Fierrez et al., Plamondon and Srihari, and Tolosana et al. [12, 20-22]. Many studies utilize feature-based and function-based approaches, deriving global features from signature trajectories and time series that capture local signature properties. Common measures include total signature duration, number of pen-ups, Fourier descriptors, and more [23-25]. Recently, Okawa [26] proposed an online signature verification method using single-template matching with time-series averaging and gradient boosting, along with dynamic time warping for distance measurement. Generalized KNN and Freeman chain code classifier approaches were put forth by Saleem and Kovari [27]. Bibi et al. [28] reviewed techniques for both offline and online signature verification based on a classification model taxonomy. Santos et al. [19] proposed a signature verification method using network analysis where centrality metrics serve as predictive features. Studies by Foroozandeh et al. [30], Choudhary et al. [31], and Rosso et al. [18] proposed online signature verification using permutation entropy, Fisher information, and other metrics extracted from time series. In summary, a variety of feature-based, function-based, and machine learning techniques have been proposed for online signature verification: global features from signature trajectories and time series are commonly used, as are measures of signature duration, pen-ups, and frequency-domain properties, while more recent work applies dynamic time warping, network analysis, gradient boosting, and recurrent neural networks.

HANDWRITTEN SIGNATURES: MATERIAL AND METHODS
In this section, we describe the database used in this work and the methods used to obtain and select features from the time series, which are then applied to classifiers for signature verification.

Handwritten signatures database
For this study, the MCYT-100 (MCYT Fingerprint subcorpus) was used, a freely available database of handwritten signatures. We adhered to the main protocol and the methodological details of the acquisition of the MCYT database published by Ortega-Garcia and collaborators in Fierrez et al. and Ortega-Garcia et al. [20,32], as well as to the "license agreement for non-commercial research use of MCYT-100 signature corpus." This database contains information on 100 people; for each individual, 25 genuine signatures were captured, and 25 forgeries were produced for each user [32]. Each signature was obtained using an online method with the help of a digital tablet and an electronic pen. The device used to obtain these data was the WACOM © Intuos A6 USB tablet, which has a resolution of 100 lines/mm and an accuracy of ±0.25 mm. The sampling frequency is configured at 100 Hz, which adheres to the Nyquist sampling criterion, considering that the maximum frequencies of the associated biomechanical sequences consistently remain below 20-30 Hz [33].
This device stores, as a function of discrete time t, the x-axis position X(t) and the y-axis position Y(t). Moreover, the device also captures the pressure applied to the pen p(t), the azimuth angle of the pen γ(t), and the altitude angle of the pen in relation to the tablet φ(t). In the present work, we only use the time series X(t) and Y(t), since these quantities are provided by even the simplest signature capture devices.
The time series in the database have different lengths. Therefore, to facilitate the analysis, a preprocessing step was applied to each time series such that the coordinates were rescaled to the unit square [0, 1] × [0, 1]. Finally, considering these rescaled values, a cubic Hermite polynomial interpolator was used to smooth the signatures so that the total number of data points in each series is M = 5000. In this way, for the k-th individual (k = 1, ..., 100) and the ℓ-th signature (ℓ = 1, ..., 25) of type τ, two time series were considered, denoted by $\tilde{x}^{(k;\ell)}_{\tau}(t)$ and $\tilde{y}^{(k;\ell)}_{\tau}(t)$, where the type τ ∈ {T, F} refers to the genuine (T) or forged (F) signatures and the tildes indicate the interpolated values [18]. Figure 1 displays three signatures from each of two individuals, two of which are genuine (left, blue) and one forged (right, red).
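As an illustration, this preprocessing step can be sketched as follows. This is a minimal sketch assuming each raw signature is given as two NumPy arrays with the sampled pen positions; SciPy's PCHIP routine is used here as a cubic Hermite interpolator, which is one possible choice rather than the exact implementation used in the paper.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator  # shape-preserving cubic Hermite interpolator

def preprocess_signature(x, y, m=5000):
    """Rescale a raw signature to the unit square and resample it to m points.

    x, y : 1-D arrays with the raw pen positions X(t), Y(t).
    Returns two arrays of length m with the interpolated coordinates.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Rescale each coordinate to [0, 1] (unit square).
    x = (x - x.min()) / (x.max() - x.min())
    y = (y - y.min()) / (y.max() - y.min())

    # Original and target (discrete) time grids.
    t_old = np.arange(len(x))
    t_new = np.linspace(0, len(x) - 1, m)

    # Cubic Hermite (PCHIP) interpolation of each coordinate.
    x_i = PchipInterpolator(t_old, x)(t_new)
    y_i = PchipInterpolator(t_old, y)(t_new)
    return x_i, y_i
```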

Feature extraction via nonparametric quantifiers
One of the important steps in the machine learning (ML) process is selecting relevant information to represent the data [34]. To extract features from time series, we require resources that capture the maximum information present in these series so we can understand their dynamics. In physics, dynamics, that is, speed and acceleration, are widely used concepts. There are several approaches to distinguish different dynamical regimes in complex systems [35-39]. Among them, entropy, Fisher information, and complexity are time-causal quantifiers [18, 40-43] that extract global and local information from the dynamics.
The nonparametric quantifiers used in this work were the Wallis and Moore trend statistic (W) [44] and quantifiers based on information theory: the normalized permutation Shannon entropy (H), the permutation Fisher information (F), and the permutation statistical complexity (C). These quantifiers have been shown to be useful for identifying time series dynamics [45].
Wallis and Moore proposed a phase frequency test [44,46,47] used to test series for randomness. Given a time series $\{X(t)\}_{t=1}^{N}$ of length N, consider the sequence of first (non-null) differences $X(t) - X(t-1)$ for $t = 2, \ldots, N$. We can then define the test statistic W as the number of positive differences in this sequence:
$$W = \sum_{t=2}^{N} s(t), \qquad s(t) = \begin{cases} 1, & X(t) - X(t-1) > 0, \\ 0, & \text{otherwise.} \end{cases}$$
These counts have an expectation of $(N-1)/2$ and a variance of $(N+1)/12$. Using a two-tailed test, the null hypothesis of randomness is tested against the presence of a trend; if the alternative hypothesis considers an upward (downward) trend, the null hypothesis is rejected when W is very large (small).
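A minimal sketch of this statistic, following the definition and moments given above, might look like the code below; the function name and the standardized two-tailed test are ours.

```python
import numpy as np
from scipy.stats import norm

def wallis_moore(x):
    """Wallis and Moore-type trend statistic as defined above.

    Counts the positive (non-null) first differences of the series,
    standardizes the count with mean (N-1)/2 and variance (N+1)/12,
    and returns the statistic with a two-tailed p-value.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    diffs = np.diff(x)
    diffs = diffs[diffs != 0.0]        # keep non-null differences only
    w = np.sum(diffs > 0)              # number of positive differences

    z = (w - (n - 1) / 2.0) / np.sqrt((n + 1) / 12.0)
    p_value = 2.0 * (1.0 - norm.cdf(abs(z)))   # two-tailed test of randomness
    return w, z, p_value
```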
The use of information theory quantifiers presupposes some prior knowledge about the system under study. In fact, to obtain these quantifiers, it is necessary to provide the most appropriate probability distribution associated with the time series [48]. Several methodologies are available to determine the probability distribution function from the time series. In this work, we use the Bandt and Pompe approach [49] to obtain a probability distribution (P). This is a simple and robust symbolic methodology that stands out by taking into account the causality of the time series [49].
This procedure can be illustrated as follows: we first transform the time series dynamics into a sequence of patterns, also known as words. These words are built by comparing consecutive values of the time series [49]. The number of possible words depends on their dimension, that is, on how many consecutive values we consider to construct the words. Words are assigned depending on the relative magnitudes of consecutive points in an arbitrary scalar time series $\{X(t)\}_{t=1}^{N}$ of length N. If D is the dimension of the words, then for dimension D = 2, for example, we compare two consecutive values, $\{X(t), X(t+1)\}$. If $X(t) < X(t+1)$ we assign the word 01; if $X(t+1) < X(t)$ we assign the word 10. Similarly, for dimension D = 3, if $X(t) < X(t+1) < X(t+2)$ we assign the word 012; if $X(t) < X(t+2) < X(t+1)$ we assign the word 021, and so on. We have a total of D! different words for dimension D. The probability of each ordinal pattern can then be estimated by simply computing the relative frequencies of the D! possible permutations $\pi_i$:
$$p_i = \frac{C(\pi_i)}{N - (D-1)\tau}, \qquad i = 1, \ldots, D!,$$
where $C(\pi_i)$ is the number of occurrences of the i-th ordinal pattern and the lag $\tau$ is the time separation between elements ($\tau \in \mathbb{N}$). In this way, an ordinal pattern probability distribution $P = \{p_i,\ i = 1, \ldots, D!\}$ is obtained. A notable result of Bandt and Pompe is an improvement in the performance of the information quantifiers obtained using their P generation algorithm [50]. The symbolic data are used to classify the series values and define the reordering of the inserted data in ascending order, which is equivalent to a reconstruction of the phase space with embedding dimension D and lag $\tau$. In this way, it is possible to quantify the diversity of patterns in a time series [51,52]. We adopted D = 5 and lag $\tau = 1$ to obtain a reliable estimation of P [18,53,54].
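The Bandt and Pompe symbolization can be sketched as follows; this is an illustrative implementation of the ordinal pattern histogram, not the exact code used in the study.

```python
import numpy as np
from itertools import permutations
from math import factorial

def ordinal_distribution(x, d=5, tau=1):
    """Bandt-Pompe ordinal pattern probability distribution.

    For each window of d values separated by lag tau, the word is the
    permutation that sorts the window in ascending order; probabilities
    are the relative frequencies of the d! possible words.
    """
    x = np.asarray(x, dtype=float)
    n_windows = len(x) - (d - 1) * tau
    # Map each of the d! permutations to an index in the histogram.
    index = {perm: i for i, perm in enumerate(permutations(range(d)))}

    counts = np.zeros(factorial(d))
    for t in range(n_windows):
        window = x[t : t + d * tau : tau]
        word = tuple(np.argsort(window, kind="stable"))
        counts[index[word]] += 1
    return counts / n_windows   # probability of each ordinal pattern
```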
From the sequences of patterns we calculate the probability of each word and compute the normalized Shannon permutation entropy (H) as
$$H[P] = \frac{S[P]}{S_{\max}} = -\frac{1}{\ln D!} \sum_{i=1}^{D!} p_i \ln p_i,$$
where $S[P]$ is the Shannon entropy of P and $S_{\max} = \ln D!$ is its maximum value, attained by the uniform distribution. Here, $0 \le H \le 1$ is a measure of the global behavior of the dynamics, robust to changes in the distribution at small scales, and invariant to the way we order the D! words.
The Fisher information measure (F) is another quantifier of complexity. It measures the gradient of the distribution, which makes it sensitive to small, localized changes. For a distribution with N possible values (e.g., the ordinal pattern probability distribution, for which N = D!), it can be defined as
$$F[P] = F_0 \sum_{i=1}^{N-1} \left( \sqrt{p_{i+1}} - \sqrt{p_i} \right)^2 .$$
Here, the normalization constant $F_0$ reads
$$F_0 = \begin{cases} 1, & \text{if } p_{i^*} = 1 \text{ for } i^* = 1 \text{ or } i^* = N \text{ and } p_i = 0 \ \forall i \ne i^*, \\ 1/2, & \text{otherwise,} \end{cases}$$
and it is the best choice for a discrete time series [55]. F is a powerful tool to identify and characterize complexity in nonlinear dynamical systems [56-58]. The H and F measures complement each other, as the former extracts information about the dynamics at a global scale, while the latter does so at a local scale. By using the ordinal patterns approach, one can reveal different levels of complexity and structure when using words of different dimensions. In this sense, the statistical complexity (C) is a measure based on the Jensen-Shannon divergence (J) between the associated probability distribution P and the uniform distribution $P_u$ (the trivial case of minimum knowledge about the process), and is defined by
$$C[P] = Q_J[P, P_u] \, H[P],$$
where H is the normalized Shannon entropy defined in Equation 2 and the disequilibrium $Q_J$ is the normalized Jensen-Shannon divergence,
$$Q_J[P, P_u] = Q_0 \, J[P, P_u] = Q_0 \left( S\!\left[\frac{P + P_u}{2}\right] - \frac{S[P]}{2} - \frac{S[P_u]}{2} \right),$$
where $S[p]$ is the Shannon entropy with respect to a PDF p and $Q_0$ is the normalization constant, equal to the inverse of the maximum possible value of $J[P, P_u]$ [45,59].
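Given an ordinal pattern distribution P, the three information theory quantifiers can be computed as in the sketch below, which follows the standard normalized definitions recalled above; the function names are ours.

```python
import numpy as np

def shannon(p):
    """Shannon entropy of a discrete distribution (0 log 0 := 0)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def quantifiers(p):
    """Normalized permutation entropy H, Fisher information F and
    statistical complexity C for an ordinal pattern distribution p."""
    p = np.asarray(p, dtype=float)
    n = len(p)                      # n = D! for ordinal patterns

    # Normalized Shannon permutation entropy.
    h = shannon(p) / np.log(n)

    # Discrete Fisher information measure; F0 = 1/2 unless all the
    # probability sits on the first or last pattern (then F0 = 1).
    f0 = 1.0 if (p[0] == 1.0 or p[-1] == 1.0) else 0.5
    f = f0 * np.sum((np.sqrt(p[1:]) - np.sqrt(p[:-1])) ** 2)

    # Statistical complexity: normalized Jensen-Shannon divergence to the
    # uniform distribution, multiplied by the normalized entropy.
    pu = np.full(n, 1.0 / n)
    js = shannon((p + pu) / 2.0) - shannon(p) / 2.0 - shannon(pu) / 2.0
    q0 = -2.0 / (((n + 1) / n) * np.log(n + 1) - 2.0 * np.log(2 * n) + np.log(n))
    c = q0 * js * h
    return h, f, c
```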
Speed relates to the time it takes a body to travel a given distance, while acceleration is the rate of change of speed with respect to time, that is, a way of quantifying how the speed of a given object changes. Thus, by extracting the first derivative X′(t) and the second derivative X′′(t) of the time series, it is possible to obtain new time-causal information representing the behavior of its velocity and acceleration regimes, respectively. Accordingly, in this work we also evaluate the quantifiers W, H, F, and C over these new sequences obtained by differentiation.
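Putting these pieces together, the feature vector of one preprocessed signature could be assembled as follows, reusing the helper functions sketched above; the resulting ordering of the 24 features is our illustrative choice, not prescribed by the paper.

```python
import numpy as np

def signature_features(x, y, d=5, tau=1):
    """Nonparametric feature vector of one (preprocessed) signature:
    W, H, F and C for each coordinate series, its first derivative
    (velocity) and its second derivative (acceleration): 4 x 3 x 2 = 24
    features in total."""
    features = []
    for series in (x, y):
        for s in (series, np.diff(series), np.diff(series, n=2)):
            w, _, _ = wallis_moore(s)                    # trend statistic
            p = ordinal_distribution(s, d=d, tau=tau)    # Bandt-Pompe histogram
            h, f, c = quantifiers(p)                     # H, F, C
            features.extend([w, h, f, c])
    return np.array(features)
```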

Feature selection
After normalizing the nonparametric quantifiers extracted from the time series, it is necessary to verify which of them are actually relevant to the study. The presence of irrelevant features impairs the performance of the classification algorithms, both in their ability to predict correctly and in their computational cost [60].
To retain only the relevant features, the following feature selection techniques were used:
• Information Gain (IG): This method uses the entropic gain of each variable in the explanatory matrix, calculated based on Shannon's entropy [61], to select the most significant explanatory variables with respect to the response variable.
• ANOVA: Variables are selected by analysis of variance (ANOVA) [62]; those that reach statistical significance at the 5% and 1% levels are chosen. This provides an indication of which variables significantly influence the response variable.
• Variance inflation factor (VIF): This consists of using a logistic regression model to retain all variables that do not exhibit multicollinearity (strong correlation between two or more explanatory variables) [63]. According to Menard [64], VIF values greater than 10 indicate strong multicollinearity, which in turn affects the model's estimates [65].

Classifiers
For this work, the following classification methods were used:
• Logistic regression (LR): LR is a parametric classification model in which the dependent variable is dichotomous (binary) and the explanatory variables can be categorical or numeric. LR allows us to estimate the probability associated with the occurrence of a given event based on a set of input features. To do so, a transformation is applied to make the dependent variable continuous; among the transformations available in the literature, the logit, probit, and Cauchy links are popular [66]. In this work, we used the logit transformation.
• Support vector machines (SVMs): SVMs have been proposed as an effective statistical learning method for classification and regression that often shows better predictive performance than classic neural networks [67-69].
The main idea consists of mapping the input space to a high-dimensional feature space (generally a suitable Hilbert space) through a nonlinear transformation and producing optimal separating hyperplanes (OSHs) that separate cases with different class labels.
• Random Forests (RFs): RFs are an extension of decision tree models, consisting of a set of decorrelated trees whose individual decisions are combined by a simple vote, resulting in an estimate or classification. The trees are built on numerous resamples drawn with replacement by the bootstrap [70,71]. Advantages of RFs include robustness to outliers, low bias, and the ability to capture complex interactions in the data.
• Extreme Gradient Boosting (XGBoost): The XGBoost classifier is an ensemble method based on decision trees that incorporates the concept of gradient boosting, building on the resampling ideas described by Breiman in his work on bagging [72]. The approach relies on the boosting technique, in which classifiers are fitted repeatedly to resampled versions of the data so that each new classifier benefits from the classification performed in the previous step, as outlined in the work by Friedman [70]. The final result is obtained by combining the outputs of all the classifiers using a weighting scheme that accounts for the classification performance of each individual model.
Studies show that these methods are quite competitive in both regression and classification [73,74]. Therefore, we used them to classify genuine and forged handwritten signatures. For classification, a method is used that automatically separates the classes under study based on individual features [75]. In this work, to assist classification, we used the holdout method to perform simple random sampling of the datasets obtained according to the feature selection criteria. Each dataset was divided into a training set and a test set; as in most works, we used 70% of the data for training and 30% for the test phase [76]. To obtain estimates of the averages and dispersions of the evaluation measures used to choose the best classifier, we used multiple holdouts [77].
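A minimal sketch of one such holdout run is shown below, assuming a feature matrix X of nonparametric quantifiers and binary labels y (0 = forged, 1 = genuine); the scikit-learn and xgboost estimators and hyperparameters shown are illustrative defaults, not the exact configuration used in the paper.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

def holdout_run(X, y, seed=0):
    """One holdout iteration: 70% training / 30% test, four classifiers."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed)

    models = {
        "LR": LogisticRegression(max_iter=1000),
        "SVM": SVC(probability=True),
        "RF": RandomForestClassifier(n_estimators=500, random_state=seed),
        "XGBoost": XGBClassifier(eval_metric="logloss", random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        scores[name] = accuracy_score(y_te, model.predict(X_te))
    return scores
```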

Evaluation metrics
To comprehensively evaluate the performance of our classification procedure, we employed several metrics. Let TN denote the number of true negatives, corresponding to class 0 (forged signature) correctly classified; FN the false negatives, class 0 (forged signature) misclassified as 1 (genuine signature); FP the false positives, class 1 (genuine signature) misclassified as 0 (forged signature); and TP the true positives, class 1 (genuine signature) correctly classified. The metrics are:
• Accuracy: a fundamental performance measure that calculates the proportion of correctly classified instances, regardless of their class, providing an overall assessment of the classification procedure's correctness:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
• Sensitivity: also known as recall or true positive rate, the proportion of correctly classified positive instances out of all actual positive instances. It captures the model's ability to identify positive cases accurately:
$$\text{Sensitivity} = \frac{TP}{TP + FN}.$$
• Specificity: the proportion of correctly classified negative instances out of all actual negative instances. It measures the model's capability to correctly identify negative cases:
$$\text{Specificity} = \frac{TN}{TN + FP}.$$
• Area under the curve (AUC): a widely used measure of the discriminative power of a classification model. It represents the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance; a higher AUC value indicates better classification performance. It can be computed as
$$\text{AUC} = \int_{0}^{1} \text{TPR} \, d(\text{FPR}),$$
where TPR is the true positive rate (also known as sensitivity) and FPR is the false positive rate.
To mitigate the risk of overfitting, which arises when a classification method fits the training data well but performs poorly on unseen data in the testing stage, the multiple holdout method [77] with 100 iterations was implemented. Over these iterations we report the following averaged measures:
• Average accuracy: the estimated average accuracy is given by
$$\mu_{\text{Accuracy}} = \frac{100}{\eta} \sum_{\lambda=1}^{\eta} \frac{TN_\lambda + TP_\lambda}{TN_\lambda + TP_\lambda + FN_\lambda + FP_\lambda},$$
where $TN_\lambda$, $FP_\lambda$, $FN_\lambda$, and $TP_\lambda$ correspond to the TN, FP, FN, and TP values in the $\lambda$-th of the $\eta$ iterations of the multiple holdout.
• Average sensitivity: the average percentage of true positives, representing the signatures correctly classified within the group of genuine signatures in each iteration of the multiple holdout:
$$\mu_{\text{Sensitivity}} = \frac{100}{\eta} \sum_{\lambda=1}^{\eta} \frac{TP_\lambda}{TP_\lambda + FN_\lambda}.$$
• Average specificity: the average percentage of true negatives, representing the forged signatures correctly classified in each iteration of the multiple holdout:
$$\mu_{\text{Specificity}} = \frac{100}{\eta} \sum_{\lambda=1}^{\eta} \frac{TN_\lambda}{TN_\lambda + FP_\lambda}.$$
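The multiple holdout averages defined above can be computed as in the following sketch for a single classifier (any estimator exposing predict and predict_proba); the 100 iterations and the 70/30 split follow the setup described earlier, while the function name is ours.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score

def multiple_holdout(model, X, y, n_iter=100):
    """Average accuracy, sensitivity, specificity and AUC (in %) over
    n_iter random 70/30 holdout splits (multiple holdout scheme)."""
    acc, sen, spe, auc = [], [], [], []
    for it in range(n_iter):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.30, stratify=y, random_state=it)
        model.fit(X_tr, y_tr)
        y_hat = model.predict(X_te)
        tn, fp, fn, tp = confusion_matrix(y_te, y_hat, labels=[0, 1]).ravel()
        acc.append(100 * (tp + tn) / (tp + tn + fp + fn))
        sen.append(100 * tp / (tp + fn))
        spe.append(100 * tn / (tn + fp))
        auc.append(100 * roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
    return {name: (np.mean(v), np.std(v)) for name, v in
            [("accuracy", acc), ("sensitivity", sen),
             ("specificity", spe), ("AUC", auc)]}
```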

Flowchart of the verification system
Figure 2 displays the flowchart of the verification system. The creation of the verification system drew inspiration from prior studies discussing the implementation of biometric systems and machine learning methods [5,19,60,78-82]. This approach aligns with a multiverse analysis [83] in that we used several statistical learning methods for both the feature selection and the machine learning phases. Signatures are the input: they are first scaled to fit the unit square and interpolated so that all subjects have the same number of data points. Then the time series (curves) and the first and second derivatives of both the horizontal and vertical writing processes are extracted. These time series are represented in a nonparametric manner using trend and time-causal descriptors via the Wallis and Moore quantifier and the Bandt and Pompe symbolization. A histogram of these symbols is built for each coordinate, and the information theory quantifiers are computed from these histograms: normalized Shannon entropy (H), Fisher information measure (F), and statistical complexity (C). After an exploratory data analysis, we use feature selection methods such as VIF, IG, and ANOVA to reduce the number of attributes passed to the classification algorithms. The datasets with the selected attributes were used to train and test four popular ML algorithms for binary classification (genuine and forged signatures): logistic regression (LR), support vector machines (SVM), random forest (RF), and extreme gradient boosting (XGBoost). Subsequently, the results were compared and analyzed via performance measures. It is feasible to employ selection methods within each machine learning method, ensuring that feature selection considers the limitations of the methodology; in practice, however, the expectation is that the system learns from the previously selected characteristics, so that only these features need to be extracted for new datasets. The implementation of the verification system in this form may be further developed in future research, and meta-analysis studies could facilitate the evaluation of the effectiveness of different proposals.

RESULTS
For a better understanding of quantifier behavior, we present a descriptive analysis of the features extracted from the time series. It is worth emphasizing that all features were standardized so that they are on the same scale: for each feature, we subtracted the mean and divided by the standard deviation. Tables 1-4 contain the mean (μ), standard deviation (σ), and median (m̃) for entropy, complexity, Fisher information, and the Wallis and Moore statistic, respectively. These are shown for the original time series, first derivative, and second derivative for the x- and y-axes of each signature. Each table presents the measures separately for forged (0) and genuine (1) signatures.
It is possible to observe in all tables that the mean and median values have different signs for genuine and forged signatures. Moreover, regarding the standard deviation, we note that in all tables the genuine signatures show less variability than the forged ones.
To better visualize how the features behave, Figure 3A-L present scatter plots with the marginal density of each feature classified by signature type (forged or genuine) for both the x and y coordinates. In all plots, the positive correlation between the x and y coordinates is evident, as expected. We note that all features are less dispersed for the genuine than for the forged signatures, a clear signal of the separability between them. The density curves on the marginal axes show the probability distributions of each feature for each coordinate of both types of signatures. These plots, despite being limited by their marginal nature, reveal several modes for each feature and suggest different dispersion patterns, as discussed in Rosso et al. [18]. If we fix a given characteristic, such as entropy, we see that its distributional behavior is very similar when we take the first and second derivatives of the coordinates (left vertical panel: Figure 3A,E,I), and this behavior is repeated throughout the other features (seen in the vertical panels for each feature). We can see in the plots that entropy and the Wallis and Moore statistic are the most concentrated among all the characteristics of genuine signatures; this may indicate that they are more stable characteristics and can serve as good discriminators between genuine and forged signatures.
The process of extracting time series features plays an important role. However, it is necessary to verify whether these quantities are, in fact, representative. A measure used to assess the quality of these attributes is consistency [84]:
$$d_s(i) = \frac{\left| \mu_{s;1}(i) - \mu_{s;0}(i) \right|}{\sqrt{\sigma^2_{s;1}(i) + \sigma^2_{s;0}(i)}}, \qquad s = 1, \ldots, T,$$
where T is the total number of features extracted from the time series, $d_s(i)$ is the consistency of feature s for individual i, $\mu_{s;1}(i)$ is the average of feature s over the genuine signatures of individual i only, $\mu_{s;0}(i)$ is the corresponding average over the forged signatures of individual i, and $\sigma^2_{s;1}(i)$ and $\sigma^2_{s;0}(i)$ represent the sample variances for the genuine and forged signatures of individual i, respectively. This yields k consistency values for each feature, where k is the number of individuals in the study. Therefore, to consolidate the consistency information for each feature, the mean, standard deviation, and median over the k individuals are calculated [85]. With regard to consistency, a good feature should have a high average and a low standard deviation [85].
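A sketch of this consistency computation for a single feature and a single individual, under the formula stated above, could be:

```python
import numpy as np

def consistency(feature, labels):
    """Consistency d_s(i) of one feature for one individual:
    absolute difference between the genuine and forged means divided by
    the square root of the sum of the two sample variances.

    feature : values of the feature for all signatures of individual i.
    labels  : 1 for genuine signatures, 0 for forged ones.
    """
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    genuine = feature[labels == 1]
    forged = feature[labels == 0]
    return abs(genuine.mean() - forged.mean()) / np.sqrt(
        genuine.var(ddof=1) + forged.var(ddof=1))
```

Averaging d_s(i) over the k individuals then gives the per-feature mean, standard deviation, and median summarized in Table 5.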
Table 5 refers to the consistency measure, for which we calculated the average of the original features (μ), the standard deviation of the original features (σ), the average of the first derivative of the features (μ′), the standard deviation of the first derivative of the features (σ′), the average of the second derivative of the features (μ′′), and the standard deviation of the second derivative of the features (σ′′), for the features obtained from the x- and y-axes. It can be observed that the features do not show much variation and, consequently, have mean and standard deviation values close to each other. In general, the Wallis and Moore statistic (W) presented the lowest mean in all of its forms, but this is compensated by its standard deviation also being the smallest in all categories.
The effect of interpolation and sampling frequency has been extensively studied in the literature. Martinez-Diaz et al. (2007) employed a hidden Markov model (HMM)-based system to investigate the impact of sampling frequency on signatures, concluding that sampling at 100 Hz improved verification accuracy [86]. Vivaracho-Pascual et al. (2009) proposed an online signature recognition system that reduces the number of signature points in the database without performance loss [87]. Saleem and Kovari (2020) explored the choice of individual sampling frequencies for each signer, proposing a signer-dependent sampling-based signature verification system [88]. In this work, we assessed the effect of the number of points used for interpolation in order to examine the stability of the nonparametric quantifiers (entropy, complexity, and Fisher information). Different numbers of points were tested: 250, 500, 1000, 2500, and 5000. We did not assess the Wallis and Moore quantifier, since results from asymptotic statistical theory assure the stability of this estimator [89]. The results indicate that, in terms of the distributional effect on location and scale (see Figure 4), the number of interpolation points does not have a significant effect. However, analyzing the nonparametric quantifiers based on information theory (see Figure 5), we observe high variability when the number of points is less than 1000, especially for entropy. This variability is expected, as the reconstruction of the permutation distribution becomes distorted and fails to capture meaningful patterns. The other quantifiers show small changes in location that decrease as the number of points increases. All quantifiers stabilize from 2500 points onward, as expected given the large number of points. Therefore, our entire verification system is based on interpolation with 5000 points, which assures a robust reconstruction of the quantifiers.

Several groups of quantifiers were well correlated. The heatmap of the feature correlation matrix can be seen in Figure 6. This matrix allowed us to evaluate the relationships between the quantifiers: highly positively correlated quantifiers are grouped in dark blue, while quantifiers in dark red show negative or nonsignificant correlations. The results revealed a simple pattern of high correlations. The first group shows the correlation between the entropy and complexity features; a strong positive correlation can be noted, since the ellipses in the figure are more flattened. The second group, composed of the entropy and complexity features in relation to the Fisher information and Wallis and Moore statistics, has a negative correlation, which is strong for most forms of the features. Finally, the last group is composed of the Fisher information and Wallis and Moore statistic features; here the correlation is positive but weaker than in the first group, since the ellipses are more dispersed. The patterns identified in these correlations motivate preselecting an optimized set of features to be input to the classifiers. Feature selection is a crucial step in the implementation of many biometric systems, as it facilitates the identification of useful features while eliminating redundant information. The resulting reduction in training cost, along with improved detection performance, contributes to a lightweight detection system, which is essential for online personal recognition [81,90]. Based on the guidelines in [60] for the practical application of ML methods, we conducted the selection procedures described next.

For feature selection, we used four methods: information gain, ANOVA 1%, ANOVA 5%, and the variance inflation factor. From these methods, four datasets were obtained. Each dataset is composed of the target variable (signature class) and the features selected according to the corresponding selection criterion.

FIGURE 6 Heatmap of the correlation matrix of the nonparametric quantifiers. The correlation coefficients are color coded from deep red (−1) to deep blue (1).
For information gain, we calculated the information gain ratio for all features obtained from the time series. Afterward, we obtained the average information gain ratio. Finally, the features selected to compose the set of nonparametric quantifiers according to this criterion were those with an information gain ratio above the average.
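A sketch of this criterion is shown below; here scikit-learn's mutual information estimator stands in for the information gain computation, which is an assumption about the implementation rather than the exact procedure used in the paper.

```python
from sklearn.feature_selection import mutual_info_classif

def select_by_information_gain(X, y, feature_names):
    """Keep the features whose information gain (approximated here by the
    mutual information with the class label) is above the average gain."""
    gain = mutual_info_classif(X, y, random_state=0)
    keep = gain > gain.mean()
    return [name for name, k in zip(feature_names, keep) if k]
```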
Regarding selection by ANOVA, using logistic regression with the logit link function, we estimated the coefficients of a model containing all nonparametric quantifiers. The features selected according to this criterion were those significant at the 1% and 5% levels.
To calculate the VIF, we used logistic regression with the logit link function and computed the VIF for the covariates of this model. The feature with the highest VIF value was removed from the analysis, the logistic model was re-estimated, and the VIF recalculated. This was repeated until all remaining features had VIF values below 10; these features were selected for the quantifier set according to the VIF criterion.
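This stepwise procedure can be sketched as follows; the statsmodels VIF routine used here relies on auxiliary linear regressions among the covariates, a common stand-in for the model-based VIF described above.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def select_by_vif(X, threshold=10.0):
    """Iteratively drop the covariate with the largest VIF until all
    remaining covariates have a VIF below the threshold."""
    X = pd.DataFrame(X).copy()
    while X.shape[1] > 1:
        exog = sm.add_constant(X).values                 # intercept for the auxiliary regressions
        vif = np.array([variance_inflation_factor(exog, j + 1)   # skip the constant column
                        for j in range(X.shape[1])])
        worst = int(np.argmax(vif))
        if vif[worst] < threshold:
            break
        X = X.drop(columns=X.columns[worst])
    return list(X.columns)
```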
In Table 6, the symbols x, x′, x′′, y, y′, and y′′ refer, respectively, to the original features and the first and second derivatives for the x- and y-axes. It can be seen that the features $C_y$, $F_y$, $F_{y'}$, and $W_{y'}$ were the most relevant, being contained in three of the four datasets formed according to the selection criteria. They were excluded only by VIF, the most conservative criterion, which retained only two features. In the classification step, we used the classifiers with the holdout method and the feature sets selected according to each criterion.
Figure 7 shows the accuracy averages and standard deviations, both in the training stage (Figure 7A) and in the test stage (Figure 7B), for the different classification methods adjusted to each dataset according to the selection criteria. In the training stage, the XGBoost classifier obtained the best fit, with average accuracies of 78.24%, 79.3%, 78.6%, and 71.86% for the ANOVA 1%, ANOVA 5%, information gain, and VIF datasets, respectively. The corresponding standard deviations were ±0.515%, ±0.541%, ±0.522%, and ±0.502%.

TABLE 6 Features selected according to each criterion (columns: selection criteria, selected features, removed features).
FIGURE 7 Accuracy for each classification method according to the feature selection criteria. Error bars represent M ± 1SE. Extreme Gradient Boosting tends to overfit the training set, leading to an increase in accuracy that may not be reflected in the test set; the Support Vector Machine shows more accurate performance on the test dataset.
However, in the test group, the XGBoost classifier did not stand out as much as in the training group. In general, there was little difference in accuracy between methods; the random forest and SVM classifiers had the highest accuracies, with values close to each other for the ANOVA 1%, ANOVA 5%, and information gain criteria. Only for the VIF dataset was the random forest classifier the one with the lowest accuracy, while logistic regression, SVM, and XGBoost showed similar accuracy values. To observe the performance of the classifiers for each signature class, Figure 8 shows classifier performance in terms of sensitivity, defined here as a method's ability to correctly classify forged signatures. Figure 8A shows classifier performance in the training phase: the XGBoost method stands out for all feature selection criteria, followed by random forest, whereas logistic regression had the worst performance in this group.
In the test phase, for the ANOVA 1%, ANOVA 5%, and information gain criteria, the XGBoost and random forest classifiers had similar performance. For VIF, these two methods also had the best performance, with XGBoost being the more sensitive. Table 8 shows the average sensitivity of the classification methods across the four feature selection criteria for the training and testing groups, respectively. The bolded numbers in the table indicate the best performance in terms of average sensitivity: XGBoost on the training datasets and RF on the test datasets, both obtained with ANOVA feature selection at the 5% significance level.
Similarly, specificity is the ability to correctly classify genuine signatures. As shown in Figure 9, for the training group and the ANOVA 1%, ANOVA 5%, and information gain criteria, logistic regression and SVM performed best. We observe that all methods exhibit comparable performance in both the training and testing phases, and there is no explicit standard specifying the best measure; it is noteworthy that performance depends strongly on the selected feature set. In a biometric system such as the one proposed here, if the model is adjusted in the training phase to optimize specificity, this emphasis can result in a lower ability to detect forged signatures in the testing phase. This highlights the importance of finding a balance between sensitivity and specificity, depending on the priorities and the consequences associated with accepting forged signatures or incorrectly rejecting genuine ones. Adjustments to model parameters or to the training methodology should be considered to improve the system's ability to detect forged signatures without overly compromising the correct identification of genuine signatures.
For AUC, as shown in Figure 10, logistic regression and SVM performed best according to this metric in both the training and test phases. We observed that XGBoost tends to overfit the training set, resulting in a performance increase that is not consistently reflected in the test set; by contrast, SVM is more accurate on the test dataset. These contrasting behaviors highlight the need for a careful examination of model generalization and overfitting, suggesting the importance of considering alternative modeling approaches or tuning the XGBoost parameters to achieve better performance on unseen data. To visualize the receiver operating characteristic (ROC) curve behavior, Figure 11 shows an example of the classifiers adjusted to each dataset according to the feature selection criteria. For these, the ANOVA 5% criterion showed the best performance for random forest, SVM, and XGBoost, whereas for logistic regression the feature set from information gain had the highest AUC.
Table 11 shows the average time in seconds required to estimate the classification for each dataset according to the feature selection criteria. Random forest, followed by SVM, demanded the most time, around 20 and 6 seconds, respectively (RF requires more computing time for several reasons, including the number of trees in the ensemble, the depth of the trees, the size of the dataset, limited parallelism, exhaustive hyperparameter search, and the complexity of the problem). In contrast, logistic regression and XGBoost were the fastest, taking less than 1 second each; notably, logistic regression took less than 0.01 seconds to obtain a classification estimate.

CONCLUSION
The relevance of signatures in legal transaction processes and the risk of increasingly sophisticated forgeries highlight the importance of research to improve the recognition of forged and genuine signatures. In this work, we used the MCYT database (MCYT Fingerprint subcorpus) to verify online handwritten signatures. Before classifying the signatures, we extracted nonparametric quantifiers directly from the time series, such as the Wallis and Moore trend statistic. Furthermore, using the Bandt and Pompe technique to obtain the ordinal pattern distribution of the time series, we calculated the permutation entropy, statistical complexity, and Fisher information. We also calculated first- and second-order derivatives to assess signature dynamics and again extracted the quantifiers.
We used feature selection techniques on the database of information quantifiers to form datasets composed of the most relevant quantifiers for classification, yielding four feature sets according to the selection criteria. Notably, the Fisher information and the Wallis and Moore statistic were the most relevant features, with at least one of their forms present in all datasets. Although the classifiers showed similar performance overall, XGBoost performed best in the training stage, while in the test stage SVM and random forest performed better for most metrics. For the objective of classifying signatures, we want a classifier that rigorously classifies forged signatures correctly; thus, sensitivity is a relevant metric here, and random forest showed the best sensitivity performance among all classifiers.
The main contribution of this work is the proposal of a set of nonparametric quantifiers for online handwritten signature verification. We also compared feature selection criteria and classification algorithms on a real dataset. Importantly, the objective was not to outperform other state-of-the-art verification systems.
Collecting data on a WACOM tablet is common in handwritten signature studies, but variations arise when using different devices like store pads. Pressure sensitivity, sensor resolution, and stroke dynamics can differ between devices. To ensure system robustness, standardization and strict protocols are crucial when selecting collection devices. Some biometric characteristics are expected to be robust across devices, allowing for valid comparisons, particularly in environments with diverse devices. A recent study [91] analyzed pressure curves in different WACOM devices, highlighting variations in pressure saturation and dynamic range between styli. This underscores the importance of considering such nuances when comparing data collected on different devices. In the context of potential new acquisition technologies, regulated equipment and acquisition, following solid privacy policies for both commercial and noncommercial use, are crucial to ensure reliable biometric data. Ongoing research should lead to more uniform devices, facilitating inter-device comparisons.
In future work, the proposed nonparametric quantifiers could be added to state-of-the-art systems to analyze potential accuracy gains. Other work on time series classification problems may also benefit from using the proposed nonparametric quantifiers. Finally, the multiverse statistical learning approach presented here (see Figure 2) is applicable to other fields where hand line drawings are used to identify a person's physical and psychological "signature." For example, the proposed approach could examine Rey-Osterrieth complex figure drawings to reduce false positives in classifying potential deficits in children's cognitive development and dementia in adults [92,93].

FIGURE 1 Signatures of two individuals from the database: two genuine (left, blue) and one forged (right, red), with the time series of the X(t) and Y(t) axes, respectively.


FIGURE 2 Flowchart of the verification system.

FIGURE 3 Dispersion diagrams with marginal densities of the two coordinates of the time series X(t) and Y(t) for entropy, complexity, Fisher information, and the Wallis and Moore statistic in their original form (first row), first derivatives (second row), and second derivatives (third row).

FIGURE 4 Kernel density estimates of the statistical mean and standard deviation for all genuine and forged signature coordinates when interpolated using n points.
FIGURE 5 Kernel density estimates of the information-theory-based nonparametric quantifiers for all genuine and forged signature coordinates when interpolated using n points.

FIGURE 10 Area under the curve of the feature selection criteria for the different classification methods. Error bars represent M ± 1SE.
ACKNOWLEDGMENTS
This research is partly supported by the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) under grants 307556/2017-4 and 308980/2021-2 (L.C.R.), 303192/2022-4 and 402519/2023-0 (R.O.), and by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) in Brazil under financing code 001. Open access publishing facilitated by University of
TABLE 1 Entropy means, standard deviations, and medians.
TABLE 2 Complexity means, standard deviations, and medians.
TABLE 3 Fisher information measure means, standard deviations, and medians.
TABLE 4 Wallis and Moore statistic means, standard deviations, and medians.
TABLE 5 Feature consistency means and standard deviations.

TABLE 7 Average accuracy (%) for each classification method according to the feature selection criteria for the training and testing groups.
FIGURE 8 Sensitivity for each classification method according to the feature selection criteria. Error bars represent M ± 1SE. Both Extreme Gradient Boosting and the Support Vector Machine tend to overfit the training set, leading to a consistent sensitivity increase that is mirrored in the test set.
TABLE 8 Average sensitivity (%) for each classification method according to the feature selection criteria for the training and testing groups.
FIGURE 9 Specificity for each classification method according to the feature selection criteria. Error bars represent M ± 1SE. All methods exhibit comparable performance during both the training and testing phases, with no clearly defined best measure.
TABLE 9 Average specificity (%) for each classification method according to the feature selection criteria for the training and test groups. Note: Numbers in bold show the highest average specificities.

Table 10 presents the average AUC. The bold numbers in the table indicate the best performance: XGBoost on the training datasets and SVM on the test datasets, obtained with ANOVA feature selection at the 5% significance level.

TABLE 10 Average area under the curve (%) for each classification method according to the feature selection criteria for the training and testing groups. Note: Numbers in bold show the highest percentages of area under the curve.
TABLE 11 Average classification processing time in seconds.
FIGURE 11 ROC curves for each classification method applied to the datasets obtained according to the feature selection criteria. The classifiers' ROC curves show superior performance when the algorithms are trained with features obtained through analysis of variance at the 1% significance level.