Password policy characteristics and keystroke biometric authentication

Behavioural biometrics have the potential to provide an additional or alternative authentication mechanism to those involving a shared secret (i.e. a password). Keystroke timings are the focus of this study, where key press and release timings are acquired whilst monitoring a user typing a known phrase. Many studies exist in keystroke biometrics, but there is an absence of literature aiming to understand the relationship between characteristics of password policies and the potential of keystroke biometrics. Furthermore, benchmark datasets used in keystroke biometric research do not enable useful insights into the relationship between their capability and password policy. Herein, uppercase, numeric, and special character substitutions, and their combination, are considered in passwords derived from English words. Timings for 42 participants for the same 40 passwords are acquired. A matching system using the Manhattan distance measure with seven different feature sets is implemented, culminating in an Equal Error Rate of between 6% and 11% and accuracy values between 89% and 94%, demonstrating comparable accuracy to other threshold-based systems. Further analysis suggests that the best feature sets are those containing all timings and trigraph press to press. Evidence also suggests that phrases containing fewer characters have greater accuracy, except for those with special character substitutions.


1 | INTRODUCTION
The use of digital resources throughout modern society has mandated the need to adequately restrict access to authorised users. Restricting access requires a robust and reliable means of performing user authentication. In most restriction mechanisms, a user will state an identity and then provide information that can verify that they have indeed claimed their correct identity. The most common implementation is a password system, where the password is kept secret and used to authenticate a user. Although password-based systems provide a good first-line defence, there is a continuing struggle between the password complexity needed for strong security and the user's desire for ease of access. To withstand brute-force attacks, a strong password policy is often enforced, requiring a combination of lower and upper case, numeric, and special characters, with the password changed frequently [1]. However, passwords generated to this specification become hard to remember, resulting in the user finding ways to increase memorability, which include performing systematic variation (e.g. increasing a number at the end) or simply writing the password down so that it does not need to be remembered. It is widely acknowledged that there is a trade-off to be achieved between memorability and security, and it is challenging to create a policy that guides the user to good practice without being overly restrictive and making passwords difficult to remember [2,3]. Evidence suggests that there is good and increasing use of passwords using substitutions [4].
The challenges of password-based authentication systems have inevitably resulted in the exploration of approaches that have strong security properties whilst reducing the burden of remembering on the user. Biometric systems provide an alternative approach to authentication whereby physiological, behavioural, and chemical characteristics are sensed and used to authenticate a user as a replacement for (or in addition to) a password. Biometric systems have been heavily researched, and there is a wealth of literature on the different types and their applications, as presented by extensive surveys [5,6]. For example, fingerprints and facial features are physiological characteristics, voice and keystroke dynamics are behavioural characteristics, and DNA and odour are chemical characteristics. Recent advancements also include using behavioural biometrics extracted from wearable fitness trackers [7], and using them for clinical purposes, such as stress detection [8]. Biometric systems impact on their users in different ways, depending on whether they are invasive or passive for the participant. For example, facial recognition is passive as it can be captured from video or an image of the user without them taking special action, whereas a fingerprint system is invasive as the user needs to physically connect with a sensing device.
Herein, we revisit the topic of keystroke biometrics for authentication. Keystroke timings can be acquired passively from the user whilst they type their password. The use case is that the keystroke information is acquired as the user enters their password, thus providing a two-factor approach and helping to protect their account against being compromised by a brute-force attack or being used should their password be acquired by an adversary. This is a well-established biometric area and its first use can be traced back to analysts in World War II aiming to track operators sending Morse Code messages via radio communication [9]. There have since been many studies on keystroke biometrics, focusing on the extraction of different features, using different phrase lengths, and different learning algorithms [10][11][12].
Although many studies exist that provide useful insight, there is a lack of empirical analysis investigating how keystroke biometric systems could be affected by the different password policy characteristics (uppercase, numeric, special character, and length) that are often introduced to increase password security. The majority of key works in this area use datasets acquired from typing either the same phrase or from monitoring all user typing input, consisting of free-text input and no single password phrases. Furthermore, the phrases extracted in these datasets would not meet the majority of password policies, where an uppercase, numeric, or special symbol substitution is required. As we are investigating the use case of keystroke biometric authentication as an additional authentication layer on the user's password, it is necessary to gain timings from phrases containing characteristics that are aligned with common password policies. There is a need to understand how keystroke systems perform based on different password characteristics, to ensure that a balance can be achieved whereby the combined use of password and keystroke provides security beyond that of each single component.
To investigate the relationship between password policy characteristics and keystroke biometrics, an empirical study has been performed, involving acquiring timings from 42 participants, which are then systematically analysed using a distance measure and threshold approach. The following are the main contributions provided in this work:
• The design of a rigorous keystroke biometric data collection exercise with varying password policy characteristics (length and character substitution), resulting in a publicly available dataset containing four sets of timings for 42 individuals for 40 passwords.
• Equal Error Rates are achieved between 6% and 11% with accuracy between 89% and 94% for all feature sets. Empirical observations reveal the impact of the number of characters and types of character substitutions, providing new knowledge on their relationship.
The remainder is organised as follows: In Section 2, related work is presented and discussed. Section 3 details the considered features and the methodology followed in this research. Section 4 discusses acquired data and details enrolment template creation. Section 5 details the matching mechanism used in this research followed by the process for empirical testing. Section 6 details the results and considers the impact of different password policy components, and finally, a conclusion and area for future work are provided.

2 | RELATED WORK
Behavioural biometrics are considered a security-strengthening addition to existing authentication schemes. The domain of behavioural biometrics is an active research discipline and is continuously evolving in capability. For example, a recent study proposed a new smartphone authentication system, called AnswerAuth, which is based on the way a user unlocks their phone and brings it towards their ear [13]. The evaluation was performed on a data set of 85 users, containing 10,200 patterns, and achieved 98.98% accuracy using a Random Forest classifier. Another recent work investigated the feasibility of using gaze-based behavioural indices to perform user classification [14]. The developed solution was tested on nine participants and obtained 97% accuracy. Many research studies also propose the fusion of multiple behavioural biometrics (i.e. multi-modal techniques) to gain better performance. For example, integrating eye movement and mouse dynamics achieved 92.9% accuracy for a short subject registration time of 20 s [15].
Previous behavioural biometric studies predominantly use publicly available datasets, which contain a large number of samples (repetitions) for a large number of users. In keystroke biometrics, these datasets are often acquired from monitoring users typing free-text input. For example, the Aalto University Dataset contains key timings for 168K participants collected over a 3-month period [16]. In collecting the data, users typed 15 sentences from a large sentence pool. The sentences are in English and are not provided as a single phrase, such as a password. Furthermore, the data has no substitutions. Therefore, there is an absence of data sets where the participants are typing phrases with different types of character substitutions in a controlled manner. This is because the large datasets have often been acquired passively through key logging from willing participants when typing phrases longer than passwords. The lack of a suitable dataset could be a motivating reason as to why the relationship between password policy and keystroke capabilities has received less attention. The research presented herein is positioned to provide knowledge on this important topic.
In terms of the development and progression of keystroke dynamic biometric systems, one of the earliest studies was performed on 50 subjects, where each subject chose their own password of 6-15 characters in length [17]. In the study, the researchers achieved 91% accuracy through the use of Multilayer Perceptron (MLP) and Probabilistic Neural Network (PNN) algorithms. The achievement of this study is significant as it used representative password phrases. Subsequent studies were able to improve in terms of accuracy. For example, one study [18] also applied MLPs on 20,400 data samples using 34 features and achieved an Equal Error Rate (EER) of 4.45%. The dataset was acquired from 51 subjects typing .tie5Roanl as the password, which contains a special character, upper and lowercase characters, as well as a number. Although these results are promising, the single-phrase password, which does not closely represent a word, provides no insight into how the EER might change should the password change in terms of length and substitution. Another research work developed a new, highly scalable keystroke-based biometric system called TypeNet [19]. TypeNet performs user authentication using a Siamese Recurrent Neural Network (RNN). The evaluation of TypeNet was performed on one of the largest keystroke datasets, containing 136 million keystrokes from 168,000 users, and achieved EERs ranging from 9.53% to 3.33%.
Other studies have continued to build upon earlier success, using the same dataset. For example, another study [20] used a combination of Support Vector Machine (SVM) and Deep Learning algorithms trained on keystrokes (key press/release events) from the .tie5Roanl password. The data was acquired from 51 subjects, where each subject performed 400 repetitions of the password. The solution demonstrated an accuracy of 92.60%. Another paper used SVMs to identify users based on keystroke biometrics [21]. A set of 155 features was extracted from the .tie5Roanl password using different combinations of timings. Using a data set gathered from 94 subjects, the developed solution demonstrated over 97% accuracy.
Research has also been undertaken using different datasets. In one work, the authors collected 4126 data samples from 22 subjects when typing the numeric password 766420 [22]. In another study [23], a database of 7600 sequences was acquired over the duration of 4 months from 300 users. The sequences are users' first name and surname typed together. The features used in this study are hold time, release-press latency, and press-press latency. A total of 31 machine learning algorithms were used for evaluation. The solution achieved an EER as low as 5.32%. A different study [24] used .tie5Roalnb, . aeihoz246@, .nzkla29zah.#, and aeR5t.ilnb passwords to acquire 30,000 samples from 1000 voluntary subjects. The dataset was used to train several classifiers, demonstrating accuracy of >98% and an ability to manage large datasets. This study shows that the characteristics of passwords (e.g. length, types of letters, entropy, etc.) impact on the feature extraction process, and consequently the accuracy and performance of the overall authentication system. Another study introduced the use of a Partially Observable Hidden Markov Model (POHMM) [25]. The authors used five publicly available keystroke datasets to evaluate the POHMM. The POHMM consistently achieved higher performance and better accuracy than existing solutions in terms of user identification and (continuous) verification.
Keystroke authentication systems have also been developed for mobile devices. One such work used the .tie5Roanl password to acquire a keystroke dataset from 42 subjects [26]. Each subject typed the same password 60 times on an Android device. Following this, keystroke dynamics were used to extract 71 features made up from different timing combinations. After testing multiple algorithms, it was concluded that the Random Forest obtained the highest accuracy of 82.53%. A recent study proposed the consolidation of 2-graph and 3-graph time-based features and achieved a 4.19% false acceptance rate and a 4.59% false-positive rate [27]. The keystroke data was acquired from 152 subjects using a digit-only password, containing 17 characters. The password was typed on a mobile touchscreen 10 times by each subject. A different study built a solution to handle typing errors whilst entering passwords [28]. This included collecting timings from 10 subjects making intentional mistakes while typing the password [Mohammed-63] on a mobile device. Compared with a random forest classifier that showed a 12% EER, the proposed solution, which is based on time-based keystroke features, achieved a 9% EER.
A recent study [29] proposed the fusion of keystroke dynamics with gait patterns to perform continuous user authentication on mobile devices. The data acquisition occurs while the user simultaneously walks and inputs text. Based on an evaluation dataset collected from 20 participants, the developed solution achieved 99.11% accuracy using the multilayer perceptron classifier. Similarly, another study [30] demonstrates that using touch dynamics in addition to a PIN-based authentication system significantly increases the level of security. If the PIN is fully compromised, the probability of a successful impersonation attempt is reduced from 100% to only 9.9%.
There are many other applications that have used similarity measures as opposed to machine learning techniques. These work by scoring a new sample against an enrolled template to determine if it is a positive match, within a predefined threshold [31]. Similarity measures include examples such as Gaussian mixture model-based matching [32] and distance measuring [31]. These techniques also demonstrate 90% accuracy, but there is the added challenge of determining decision boundaries. However, although their accuracy might be slightly lower than machine learning approaches, they have the advantage of requiring no training phase, resulting in lower computational overheads. Furthermore, from a research perspective, using a similarity measure approach can help with understanding any relationships in the underlying datasets that could be hidden by the generalised approach of machine learning mechanisms.
The absence of any research and available dataset in the area of understanding characteristics of password policies and keystroke dynamic systems motivates the design and collection of a suitable dataset. Furthermore, it is evident that a vast range of matching mechanisms can be used in keystroke dynamic systems with varying accuracy, but more traditional distance measuring approaches can still achieve good accuracy, which is significant considering their ease of use and transparent inner workings.

3 | METHOD
To gain a comprehensive understanding of how password policy affects the performance of keystroke biometric systems, we have performed the following process:
• Establish a set of 40 passwords of varying length constructed from English dictionary words. Substitutions of different types are then made to some of the passwords. These variations are uppercase, numeric, special symbol, and a combination of all three;
• Collect timings for all 40 passwords for 42 participants. Each participant types each password four times;
• Implement a distance-based scoring technique and threshold approach to compare test data sets with enrolled templates;
• Use three of a participant's timings to create an enrolled template. The fourth sample is reserved for testing purposes, where each participant's test sample is compared against all enrolled templates to see whether it matches the correct participant or not. The experiment is repeated with multiple threshold values; and finally
• Analyse the results to identify accuracy characteristics of the system with regard to the different password characteristics and feature sets.
The remainder of this section provides more detail on the above outlined process.

3.1 | Passwords
The passwords are all generated by selecting English dictionary words of varying complexity using the following criteria, which are guided by a survey into user practice in passwords [4]. In that study of six million passwords, 75% were between 8 and 10 characters in length. Furthermore, in excess of 40% are native English words, either in their unchanged form or with substitution. However, that study uses passwords generated from a system in Asia (acquired from the Chinese Software Developer Network) and therefore has a high portion (67%) of non-native English speakers. Other studies focusing on systems with a higher portion of English-speaking users report percentages [33]. In the study with six million passwords, 12.4% are alphabet only, 39% include at least one number, and 3.2% include the use of a symbol. Note that the password policy of the Chinese Software Developer Network did not require the use of a symbol. For these reasons, and in the interest of developing a representative password set, the following specification is followed to generate passwords. In total, 40 passwords of varying complexity are used to acquire timings from the user.
• Varying character lengths of 6, 8, 10, and 12. There will be two words for each character length;
• Two more English words for each length will be selected and one uppercase letter substitution will be performed in each word;
• Two more English words for each length will be selected and one numeric letter substitution will be performed in each word;
• Two more English words for each length will be selected and one special character substitution will be performed in each word; and
• Two more English words for each length will be selected and one uppercase, one numeric, and one special character substitution will be performed in each word.
The full combination of passwords is included in Appendix 9.1. As seen in the list of passwords, uppercase substitutions are used at different locations, whereas numeric and special character substitutions are used to replace characters where the shape of the number or special character is sufficient to act as a substitution for the normal letter. The specific special characters used in this experiment are: !, Âč, |, $, ∖, #, and @. Each participant will be asked for the password a total of four times. The entire list will circulate three times in the same order, and finally, in the fourth iteration the ordering of the passwords will be randomised to the same order for all participants.
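The specification above can be checked with a few lines of arithmetic. The following is a minimal sketch that only enumerates the design (four lengths, five substitution categories, two words per combination); the category labels and the function name `password_slots` are illustrative assumptions, and the actual words are those listed in the appendix.

```python
from itertools import product

# Design parameters taken from the specification in the text.
LENGTHS = (6, 8, 10, 12)
# Category labels are hypothetical names chosen for this sketch.
CATEGORIES = ("none", "uppercase", "numeric", "special", "all three")
WORDS_PER_COMBINATION = 2


def password_slots():
    """Return one (length, category, index) tuple per password in the design."""
    return [
        (length, category, index)
        for length, category in product(LENGTHS, CATEGORIES)
        for index in range(WORDS_PER_COMBINATION)
    ]


slots = password_slots()
print(len(slots))  # 4 lengths x 5 categories x 2 words
```

Each length therefore contributes ten passwords and each substitution category contributes eight, which gives the 40 passwords used in the study.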

3.2 | Acquired timings
A software tool has been developed in C# to (1) present the password to the user, (2) acquire and record timestamps as they type the password, and (3) perform validation to ensure the user types the phrase correctly, and if not, ask them to repeat it. However, as the participants will perform their experiment unsupervised, there is the possibility that they can exit the software and not complete the exercise. The software tool records millisecond timings for the key down and key up actions while the password is displayed to the participant, until they select a 'next' button or press the carriage return key to inform the software they have finished. Figure 1 displays a graphical illustration of the individual key down and key up timestamps whilst the participant is typing the password action. There are in total 12 timestamps recorded from the participant typing the six-letter word. Each character will generate a timestamp when a key is pressed ($t_{\vee 1}$) and one when the key is released ($t_{\wedge 1}$).

3.3 | Features
The list below provides a description of each of the different features, and Figure 1 provides an illustration:
• Full Timings (f-t): the duration between each key action in sequential order, regardless of whether it is an up or down action (same as merging both p-r and r-p).
• Press to Release (p-r): the duration between the time a key is pressed and released, which in Figure 1 is $t_{\wedge 1} - t_{\vee 1}$. Other studies refer to this as the hold or dwell time.
• Press to Press (p-p): the duration between the time a key is pressed and the next key in the password is pressed. In the example presented in Figure 1, this feature is calculated by $t_{\vee 2} - t_{\vee 1}$. Other studies also refer to this feature as a digraph press time.
• Release to Press (r-p): the duration between a key being released and the next one being pressed. In Figure 1, this feature is calculated by $t_{\vee 2} - t_{\wedge 1}$. In some studies, this feature is referred to as a dwell, release, or seek time.
• Release to Release (r-r): the duration between the release of a key and the release of the next key in the password phrase. In the example presented in Figure 1, this is calculated by $t_{\wedge 2} - t_{\wedge 1}$. Other studies also refer to this feature as a digraph release time.
• Trigraph press to press (trigraph p-p): the duration between the time a key is pressed and the second next key in the password is pressed. In the example presented in Figure 1, this feature is calculated by $t_{\vee 6} - t_{\vee 4}$.
• Trigraph release to release (trigraph r-r): the duration between the release of a key and the release of the second next key in the password phrase. In the example presented in Figure 1, this is calculated by $t_{\wedge 6} - t_{\wedge 4}$.
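The seven feature sets can be derived from the raw key down and key up timestamps. The following is a minimal sketch, not the study's C# tool: the function name `keystroke_features`, the list-based input format, and the example timestamps are assumptions made for illustration. Full timings are built by interleaving p-r and r-p, following the description in the list above.

```python
def keystroke_features(down, up):
    """Compute the seven feature sets from key timestamps in milliseconds.

    down[i] and up[i] are the press and release times of the i-th key
    of the password (hypothetical input format for this sketch).
    """
    if len(down) != len(up):
        raise ValueError("each key needs one press and one release timestamp")
    n = len(down)
    p_r = [up[i] - down[i] for i in range(n)]                  # hold time
    p_p = [down[i + 1] - down[i] for i in range(n - 1)]        # digraph press
    r_p = [down[i + 1] - up[i] for i in range(n - 1)]          # seek time
    r_r = [up[i + 1] - up[i] for i in range(n - 1)]            # digraph release
    tri_p_p = [down[i + 2] - down[i] for i in range(n - 2)]    # trigraph press
    tri_r_r = [up[i + 2] - up[i] for i in range(n - 2)]        # trigraph release
    # Full timings: p-r and r-p merged in typing order.
    f_t = []
    for i in range(n):
        f_t.append(p_r[i])
        if i < n - 1:
            f_t.append(r_p[i])
    return {
        "f-t": f_t, "p-r": p_r, "p-p": p_p, "r-p": r_p, "r-r": r_r,
        "trigraph p-p": tri_p_p, "trigraph r-r": tri_r_r,
    }


# Made-up timestamps for a six-letter password such as "action".
example_down = [0, 150, 310, 480, 620, 790]
example_up = [90, 230, 400, 560, 700, 880]
features = keystroke_features(example_down, example_up)
for name, values in features.items():
    print(name, values)
```

For a password of n characters this yields n p-r values, n - 1 digraph values, n - 2 trigraph values, and 2n - 1 full timings, so the feature sets differ in dimensionality as well as in content.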

4 | DATA ANALYSIS
In this section, and before considering the utility of the acquired data and the different features described in Section 3.3, it is necessary to analyse a participant's four responses for each password, as detailed in Section 3. In total, 42 participants took part in the data collection exercise. All participants are undergraduate students on a Computing-related degree and therefore have strong typing proficiency. In terms of completeness, the majority of the participants did provide a full series of samples; however, upon analysis, it is evident that in some instances recordings are missing due to the user stopping and restarting the test. As a result, some datasets are incomplete, in some cases with only one completed sample for a password. In total, 87% of all password and participant combinations have four samples. For example, participant 24 is missing responses for passwords 20, 23, 30, and 31. There are also many other instances where a user has provided two or three samples per password. This has occurred when a user exits the data collection exercise. However, as the work here presents an empirical analysis, the data is still used during the analysis as it can be used to gain an understanding of the technique's capabilities. There is also one instance (participant 34) where more samples have been provided than necessary. This is because the user has restarted the data collection exercise, resulting in too many timings being acquired. Table A2 provides the number of repetitions for each password, per participant. For most passwords, it is evident that 30 or more users provide four or more samples. However, this is not the case for passwords 6, 11, 14, and 15, where 28-29 users provide four or more samples. Interestingly, these passwords have either no substitution or a single uppercase character. Another interesting finding from the table is that there are not more users providing a lower number of samples for passwords including all substitution types. This could suggest that the participant is taking more time and care with the difficult task.
To demonstrate variation in a user's timings, Figure 2a provides the timings for participant number 2 when typing the phrase action. The up and down timings for each key are displayed relative to the previous timestamp ($t_{t\vee n} - t_{t\wedge n+1}$). There are some important observations that can be established from looking at this example. Attempt number three has a delay between $t_{t\wedge}$ and $t_{o\vee}$, which is not typical when considering the other three samples. Similarly, the difference between $t_{c\wedge}$ and $t_{t\vee}$ is higher than with the other three samples. Both these timings are release to press (r-p) timings (seek timings) and would be ignored if this feature was not utilised. Furthermore, the fourth sample is the quickest sample, which intuitively suggests that the participant has become more familiar with the phrase.

4.1 | Template creation
As illustrated in Figure 2a, although there is good repetition in the samples provided by participant 2 when typing action, there is an obvious outlier for $t_{i\vee}$ that is significantly different from the other $t_{i\vee}$ values. As biometric systems aim to establish a representative dataset to enrol a user, it is necessary to remove this outlier. In this example, we consider $t_{i\vee}$ as demonstrated in Figure 2a, as it contains an outlier. We use the M-score measure, which is a variation on the well-known z-score, as a means to identify outliers. The reason behind using the M-score is its suitability for small sample sizes, which is pertinent considering that in most instances there are four attempts at each password by each participant [34,35].
FIGURE 1 Key down ($t_{\vee}$) and key release ($t_{\wedge}$) timestamps acquired from a participant whilst typing action, as well as demonstrating different features
PARKINSON ET AL.
The median is calculated for each password character up and down action, which for example is $\tilde{X}_{t_{i\vee}}$. We then calculate the Median Absolute Deviation (MAD) for each character action, $\mathrm{MAD}_{t_{i\vee}}$. We then calculate an M-score for each $t^{i}_{i\vee}$ (the superscript indexes the attempt and the subscript denotes the key down action for the character 'i') by the following:
$$M^{i}_{i\vee} = \frac{0.6745\,\bigl(t^{i}_{i\vee} - \tilde{X}_{t_{i\vee}}\bigr)}{\mathrm{MAD}_{t_{i\vee}}}$$
For normally distributed data, the MAD converges to the median of the half-normal distribution, which is the 75th percentile of the normal distribution, hence the constant 0.6745. For further information on the equation for calculating the MAD, readers are directed to early work by Iglewicz and Hoaglin [34].
We then replace $t^{i}_{i\vee}$ with $\tilde{X}_{t_{i\vee}}$ if the absolute value of the calculated $M^{i}_{i\vee}$ is greater than 3.5. Figure 2b shows an updated series of attempt timings for participant 2 when typing action. It is evident that the $t_{i\vee}$ outlier has been removed. It is however noticeable that there is still a large degree of variation in $t_{n\wedge}$. This is to be expected as there is a large degree of variation amongst all attempts, with no majority being closely aligned. As with all biometric systems, a template is created and stored for future matching. In this approach, the mean of the first three samples (outliers removed) is used. The purpose of only using the first three is to keep the fourth sample 'unseen' for empirical testing (Section 5). In some instances, the participants did not supply four timings per password. Full information on the number of times a participant provided repetitions for each password can be seen in Tables A1 and A2. Where a user has provided fewer than four samples, and provided that the total number of samples for the password is greater than or equal to 2, the last sample is reserved for testing and the remaining samples are used during template generation. If only one sample has been provided, the same sample is used to create both the template and the test set. The mean of each up and down key timing is calculated by:
$$\bar{t}_{i\vee} = \frac{1}{j}\sum_{n=1}^{j} t^{n}_{i\vee}$$
where $j$ is the number of $t_{i\vee}$ samples. Furthermore, $t^{n}_{i\vee}$ is the individual key down timing for the character 'i'. In the example, we have $t^{1}_{i\vee}$, $t^{2}_{i\vee}$, and $t^{3}_{i\vee}$. Figure 2c demonstrates the template for the continuing timing set of participant 2 typing action. Furthermore, the figure also includes templates for the first five participants to illustrate their difference.
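The outlier handling and template creation described in this section can be sketched as follows. This is an illustrative reading rather than the study's implementation: the function names are hypothetical, the M-score is compared by absolute value as in Iglewicz and Hoaglin's modified z-score, outlier replacement is applied across all provided attempts before the last one is held out, and values are left untouched when the MAD is zero. The last two points are assumptions, since the text does not specify them.

```python
from statistics import mean, median

M_SCORE_LIMIT = 3.5      # cut-off stated in the text
NORMAL_CONSTANT = 0.6745  # 75th percentile of the standard normal


def replace_outliers(samples):
    """Replace values whose absolute M-score exceeds the limit with the median."""
    med = median(samples)
    mad = median(abs(x - med) for x in samples)
    if mad == 0:
        # Assumption: with no measurable spread the M-score is undefined,
        # so the samples are returned unchanged.
        return list(samples)
    cleaned = []
    for x in samples:
        m_score = NORMAL_CONSTANT * (x - med) / mad
        cleaned.append(med if abs(m_score) > M_SCORE_LIMIT else x)
    return cleaned


def build_template(attempts):
    """Return (template, test_sample) for one participant and password.

    attempts is a list of equal-length timing vectors, oldest first.
    The last attempt is held out for testing; the template is the mean
    of the remaining attempts. A single attempt serves as both.
    """
    columns = [replace_outliers(column) for column in zip(*attempts)]
    cleaned = [list(row) for row in zip(*columns)]
    if len(cleaned) == 1:
        enrolment, test_sample = cleaned, cleaned[0]
    else:
        enrolment, test_sample = cleaned[:-1], cleaned[-1]
    template = [mean(column) for column in zip(*enrolment)]
    return template, test_sample


# Made-up timings: four attempts, three timing values each.
attempts = [[100, 80, 120], [104, 82, 118], [300, 79, 121], [98, 81, 119]]
template, test_sample = build_template(attempts)
print(template, test_sample)
```

In the made-up data, the value 300 in the third attempt has a median of 102 and a MAD of 3 for its position, so its M-score far exceeds 3.5 and it is replaced by 102 before the template is averaged.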

5 | EXPERIMENT
In this section, an experiment is conducted to determine how effectively key timing information can be used as a biometric. This specifically involves investigating:
• An approach to match and score a set of timings (test set) against a template set and determine whether or not it is a match
• Accuracy information when using the fourth samples of each participant to match against all enrolled templates
• The impact on performance of using the seven different feature sets, as discussed in Section 3.3.

5.1 | Matching
In this work, the Manhattan distance measure [36] is used to score how well a testing sample matches an enrolled template. A distance measure and threshold approach has been adopted as previous research findings detail that both these mechanisms are part of systems that are capable of generating results comparable with those obtained by applying general learning mechanisms (e.g. deep learning). In addition, general learning mechanisms [38] can produce results that are difficult to interpret [39], and they are computationally expensive to use. Herein, the aim is to gain an understanding of the relationship between password characteristics and accuracy. This understanding could be easily lost when applying a general learning mechanism and therefore, we adopt a systematic and easy to interpret approach. The Manhattan measure is used over other available distance measures (e.g. Euclidean) as it has been demonstrated to be more suitable for data with a higher number of dimensions [40]. This is calculated by:
$$d(T, E) = \sum_{i=1}^{n} \lvert t_i - e_i \rvert$$
The testing set is denoted as $T = t_1, t_2, \ldots, t_n$ and the enrolled sample as $E = e_1, e_2, \ldots, e_n$, where both $t_i$ and $e_i$ are individual timings and $n$ is the size of both the test and enrolled set. The Manhattan distance is measured for each testing set against all participant samples and recorded in $d_{k,l}$, where $k$ is the participant number and $l$ is the password template being used for comparison.
A decision function is established to determine whether the $d_{k,l}$ value for a specific participant and password is sufficient to determine a valid match. In this research, we use a threshold, $t$, as the decision function. If $d_{k,l} < t$, then the test sample matches the specific template; otherwise a non-match is returned. However, determining the correct threshold value is not trivial as it will vary between password and feature set. For example, in Figure 3, $d_{k,l}$ scores can be seen when measuring the test set acquired from participant 1 for action against each enrolled participant's template for action. Furthermore, measurements for templates generated using each feature set are included. It is expected that $d_{1,1}$ (1 on the x-axis) is the lowest as that is the correct match; however, as evident from the figure, there are instances where the $d_{k,l}$ scores for other, incorrect templates are similarly low. It is therefore clear that determining an effective threshold is challenging and will inevitably result in false rejections and false acceptances. In the remainder of this section, empirical analysis is performed for all seven feature sets with a changing threshold value ($t$).
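The scoring and decision step can be sketched in a few lines. The function names and the example timing vectors below are hypothetical; the only elements taken from the text are the Manhattan distance and the strict less-than comparison against the threshold.

```python
def manhattan(test, enrolled):
    """Sum of absolute differences between a test set and an enrolled template."""
    if len(test) != len(enrolled):
        raise ValueError("test set and template must have the same size")
    return sum(abs(t - e) for t, e in zip(test, enrolled))


def is_match(test, enrolled, threshold):
    """Return True when the distance is strictly below the threshold."""
    return manhattan(test, enrolled) < threshold


# Made-up templates for two participants typing the same password.
templates = {1: [102, 80, 120], 2: [140, 95, 101]}
test_sample = [98, 81, 119]  # hypothetical test set from participant 1

for participant, template in templates.items():
    distance = manhattan(test_sample, template)
    print(participant, distance, is_match(test_sample, template, threshold=10))
```

In this toy example the genuine template scores 6 and the other template scores 74, so any threshold between those values separates them. The difficulty discussed above arises when genuine and impostor distances overlap, as they do for some participants in Figure 3.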

| Process
In this empirical analysis, templates are generated for each participant and password combination, as detailed in Section 4.1. As previously described, the last of the user's samples is removed and stored for testing. This results in the creation of an enrolled set, E, and a test set, T. In the majority of instances, three samples are used to create an individual participant and password template, e, and one sample is used as the test set, t. During testing, each test set is compared against each enrolled template to determine whether it is a suitable match. In our datasets, we have 1492 enrolled samples and 1670 test sets (generated from 4324 individual samples), which is approximately a one-third enrolled/testing split. The number of enrolled samples is smaller than the number of test sets because in some instances the participants did not supply four timings per password, as previously discussed in Section 4.1. As we are comparing each test set to each enrolment set, the experiment involves a total of 6,229,100 authentication decisions. In this analysis, the decision threshold, t, starts at 0 and is incremented by 0.1 until 100, where no further changes in performance were identified across all feature sets. This involves running the experiment on each feature set a total of 1000 times. Furthermore, as the experiment considers seven different feature sets, approximately 43.6 billion authentication decisions (6,229,100 × 1000 × 7) need to be made when performing the entire experiment. This takes around 2 h to complete on a computer with a 3.6 GHz i7 CPU and 16 GB of available RAM.
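The enrolment and testing split described above can be sketched in Python as follows. Reserving the last sample for testing follows the text. Combining the remaining samples by an element-wise mean is an assumption made purely for illustration, since the actual template construction is defined in Section 4.1. All names and timings are hypothetical.

```python
from statistics import fmean
from typing import List, Sequence, Tuple


def split_enrol_test(
    samples: Sequence[Sequence[float]],
) -> Tuple[List[Sequence[float]], Sequence[float]]:
    """Reserve the last sample for testing and keep the rest for enrolment."""
    if len(samples) < 2:
        raise ValueError("at least two samples are needed to enrol and test")
    return list(samples[:-1]), samples[-1]


def build_template(enrol_samples: Sequence[Sequence[float]]) -> List[float]:
    """ASSUMPTION: the template is the element-wise mean of the enrolled samples."""
    return [fmean(column) for column in zip(*enrol_samples)]


# Four hypothetical samples of two timings each (milliseconds).
samples = [[100, 200], [110, 190], [120, 210], [130, 220]]
enrolled, test_sample = split_enrol_test(samples)
template = build_template(enrolled)
```

With four samples, three contribute to the template and one is held back, matching the majority case described in the text.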

| Evaluation metrics
In this experiment, we are systematically comparing each test sample with all enrolled templates. It is therefore possible to establish performance measures including the false acceptance rate (far), the false rejection rate (frr), and the equal error rate (eer), which is the point at which the far and frr are equal.

Figure 4b shows the same information as Figure 4a, focusing on the lower section where far values are less than 0.4 and frr values are less than 0.2 to allow the reader to more easily see the differences between the feature sets. From analysing Figures 4a and 4b, it is immediately evident that, although there are differences between the results for each feature set, the results overall are not too dissimilar, with variation often within 10%. The best results are obtained with the trigraph press to press (trigraph p-p) feature set, with the feature set containing all up and down timings (f-t) following very closely behind. The feature set demonstrating the worst results is press to press (p-p), followed by release to release (r-r) and release to press (r-p). The next two best are trigraph release to release (trigraph r-r) and press to release (p-r).

Table 1 presents the performance measures for each feature set at the point where the far and frr become equal to within 1%, which represents the eer. The table also demonstrates the same orderings as illustrated in Figures 4a and 4b. The best eer value is observed to be 10% for both the feature set containing all timings (f-t) and trigraph press to press (trigraph p-p). Table 1 also provides the threshold value at which the eer was achieved. A general observation from these threshold values is that the higher the value, the lower the eer value. These results are promising and demonstrate the potential of a dynamic biometric system.
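To make the relationship between the threshold, the far, and the frr concrete, the following Python sketch sweeps a threshold from 0 to 100 in steps of 0.1 and reports the point where the two rates are closest. It is an illustrative sketch only: the function names and distance scores are hypothetical, and taking the eer as the mean of far and frr at the closest point is an assumption rather than the procedure used to produce Table 1.

```python
from typing import Sequence, Tuple


def far_frr(
    genuine: Sequence[float], impostor: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Return (far, frr) for distances to correct and incorrect templates."""
    false_accepts = sum(1 for d in impostor if d < threshold)
    false_rejects = sum(1 for d in genuine if not d < threshold)
    return false_accepts / len(impostor), false_rejects / len(genuine)


def find_eer(
    genuine: Sequence[float],
    impostor: Sequence[float],
    stop: float = 100.0,
    step: float = 0.1,
) -> Tuple[float, float]:
    """Sweep thresholds from 0 to `stop` and return (threshold, eer estimate)."""
    best = None
    for i in range(round(stop / step) + 1):
        threshold = i * step  # multiply rather than accumulate to limit float drift
        far, frr = far_frr(genuine, impostor, threshold)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, threshold, far, frr)
    _, threshold, far, frr = best
    # ASSUMPTION: report the midpoint of far and frr where they are closest.
    return threshold, (far + frr) / 2


# Hypothetical distance scores, for illustration only.
genuine_scores = [10, 12, 15, 30]
impostor_scores = [14, 25, 35, 40, 45, 50, 60, 70]
eer_threshold, eer = find_eer(genuine_scores, impostor_scores)
```

Because the threshold moves in discrete steps, the far and frr may not be exactly equal at the reported point, which is why a tolerance is needed when identifying the eer.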

| RESULTS AND DISCUSSION
In the remainder of this section, the effect of password length, as well as the substitution of uppercase, special and numeric characters, is further investigated and discussed. Individual ROC illustrations are used (Appendix 9.3) for each of the 40 passwords (Appendix 9.1). Table 2 provides both the number of times a feature set has been identified as best for a participant's test data set and the threshold and accuracy values for the best overall feature set. The best overall values are determined by calculating accuracy values across all participant test data sets. The table also shows the number of participants for whom each feature set produces the best performance for each password. This has been established by identifying the feature set resulting in the lowest eer, which is the point at which the far and frr intersect. However, it should be noted that they are not exactly equal (i.e., far = frr) because a step size of 0.1 was used to reduce the computational overhead when performing the experiment. In Table 2, only five feature sets are included because, for all participant data sets, neither p-r nor r-r is ever identified as being the best. More specifically, they never produce an eer that is lower than that of every other feature set when taking a participant's test set and comparing it against all enrolled samples.
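The per-participant selection of the best feature set by lowest eer, and the resulting counts of the kind reported in Table 2, can be sketched as follows. The eer values below are invented for illustration and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict


def best_feature_set(eer_by_feature: Dict[str, float]) -> str:
    """Return the feature set with the lowest eer."""
    return min(eer_by_feature, key=eer_by_feature.get)


def count_best(per_participant: Dict[int, Dict[str, float]]) -> Counter:
    """Count how often each feature set is best across participants."""
    return Counter(best_feature_set(eers) for eers in per_participant.values())


# Hypothetical eer values for one password and three participants.
eers_for_password = {
    1: {"f-t": 0.08, "trigraph p-p": 0.09, "p-p": 0.12},
    2: {"f-t": 0.10, "trigraph p-p": 0.07, "p-p": 0.15},
    3: {"f-t": 0.06, "trigraph p-p": 0.09, "p-p": 0.11},
}
best_counts = count_best(eers_for_password)
```

A feature set that is never the minimum for any participant, as reported for p-r and r-r, would simply be absent from the resulting counts.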
Before proceeding to analyse relationships between password features and key timing gestures, an immediate observation evident in Table 2 is that the full timings set (f-t) and trigraph press to press (trigraph p-p) are identified as the best feature sets for all passwords. However, this was to be expected as both these feature sets have the best overall performance, as provided in Table 1. In total, f-t is identified as the best feature set for 32 passwords and trigraph p-p for 8. Another general observation is that the best identified threshold values for trigraph p-p (average 27) are lower than for f-t (average 49). This is because trigraph p-p contains fewer timings for comparison, which results in a lower distance measure. Another general observation identified during the experimentation is that release to press (r-p) was the worst performing feature set and was only selected as the best for five users for passwords of length 6 with no substitutions.
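The observation that trigraph p-p contains fewer timings than f-t can be illustrated with a short Python sketch. The feature definitions here are assumptions, as the study's exact definitions are given elsewhere in the paper: press to press is taken as the interval between consecutive key presses, trigraph press to press as the interval between presses two keys apart, and the full timings set as every press and release time relative to the first press. The key events are invented.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: one (press_time, release_time) pair per key, in milliseconds.
KeyEvent = Tuple[float, float]


def press_to_press(events: Sequence[KeyEvent]) -> List[float]:
    """ASSUMPTION: interval between the presses of consecutive keys."""
    return [events[i + 1][0] - events[i][0] for i in range(len(events) - 1)]


def trigraph_press_to_press(events: Sequence[KeyEvent]) -> List[float]:
    """ASSUMPTION: interval between the presses of keys i and i + 2."""
    return [events[i + 2][0] - events[i][0] for i in range(len(events) - 2)]


def full_timings(events: Sequence[KeyEvent]) -> List[float]:
    """ASSUMPTION: all press and release times, relative to the first press."""
    start = events[0][0]
    timings: List[float] = []
    for press, release in events:
        timings.extend((press - start, release - start))
    return timings


# Invented events for a six-character password.
events = [(0, 80), (150, 240), (310, 390), (450, 540), (600, 690), (760, 850)]
pp = press_to_press(events)
tri_pp = trigraph_press_to_press(events)
ft = full_timings(events)
```

Under these assumptions a six-character password yields 4 trigraph p-p timings but 12 f-t timings. As the Manhattan distance is a sum over all timings, a vector with fewer terms will tend to produce smaller distances and therefore a lower threshold.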
Next, the impact of the password characteristics on the capabilities of the proposed system is examined. Appendix 9.3 presents 20 individual ROC graphs in which the results are grouped in pairs by password characteristic. More specifically, the two passwords generated with the same characteristics are grouped together. For example, the two passwords of six characters in length with no substitutions (action and return) are presented together in Figure A1.
In this discussion, we do not cross-reference each ROC figure in order to keep the text easy to read; however, the reader is recommended to review the ROC figures throughout the discussion as an illustrative aid. The ROC figures are clearly labelled with their associated password. For example, Figure A1a-d presents the ROC for password lengths of 6, 8, 10, and 12, respectively. Furthermore, the ROC figures are grouped in five rows based on characteristics. The first row (the four figures previously cross-referenced) has no substitution, followed by a row including uppercase substitution, and so forth (Figures A2-A5). In each figure, the ROC is provided for the seven different feature sets for the two passwords of each length.

Figure 3: $d_{k,l}$ scores when matching the testing set of participant 1, password 1 (action), to all template sets for password 1, considering all different feature sets.

Table 2 demonstrates that for passwords with no substitutions, the feature sets identified as being best for the most participant test datasets are the full timings set (f-t) and trigraph press to press (trigraph p-p). It is evident that trigraph p-p has been identified as being best only for passwords with a length of 6 (action and return), and f-t for passwords of lengths 8, 10, and 12. Interestingly, it is also evident that both action and return have the largest variety of feature sets identified as being best in comparison with all other passwords, irrespective of length and substitutions. In terms of accuracy, there is a gradual decrease from 94% to 91% as length increases. This is also evident in the eer, where there is an increase from 6% to 9%.
In terms of uppercase substitutions, each of the passwords has a single uppercase character substitution made in a different location. Interestingly, the results for all phrases generate similar accuracy and eer rates of around 90% and 9%-10% (only 0.7% variation), respectively. Similar to instances with no uppercase substitution, f-t has been identified as the best feature set for five out of eight of the passwords, and trigraph p-p has been identified as best for three of the passwords, which are of 6 and 10 characters in length. An interesting observation is that, for those passwords where trigraph p-p has been identified as best, the position of the uppercase substitution is either the first or the last character.
Following the same process, numeric substitutions have been made, where two of the passwords, of lengths 6 and 10 (cr1sis and adressin9), have the best performance with the trigraph p-p feature set. In terms of accuracy and eer, the values are quite consistent, showing a slight decrease (<1%) as password length increases.
Similar to instances with either no substitution or uppercase substitution, f-t has been identified as the best feature set for the remaining six of the eight passwords. An interesting observation is that any other feature set is identified as the best for only two participants: the trigraph r-r set for the password brok3n.
In terms of special character substitutions, f-t has been identified as best for the majority of password instances (7 out of 8) and trigraph p-p is identified as the best for one password of 6 characters in length (fri!end). No other feature sets have been identified as best for any participant data sets, apart from p-p for two passwords, one of length eight and one of length 10. These passwords are de∖ivering and bre$thtaking. Accuracy and eer values remain consistent, with a less than 1% decrease across the different lengths.

As demonstrated in Table 2, passwords containing all three types of substitution are interesting because the f-t feature set has been identified as best for all participant/password combinations. Furthermore, it is also interesting that this is the only substitution type whereby the accuracy values remain within 0.5% variance, yet the eer shows a slight improvement.
In terms of the average eer for passwords with different types of substitution, the best are those with no substitution, followed by uppercase, numeric, special character, and finally, a combination of all three. However, it is worth noting that the difference is small, moving from 91% accuracy and an eer of 8% with no substitution to 89% accuracy and an eer of 11% with all three (rounded to 0 decimal places).

Table 3 provides results on an aggregated level for the different password lengths. This includes a count of how many times trigraph p-p and f-t are identified as the best feature set for a password. Furthermore, the table includes average accuracy values for all passwords of that length, irrespective of substitution type. A general observation here is that f-t is the dominant feature set, except for passwords of 6 characters in length. Another observation is that the eer gradually worsens as the password length increases; however, this change is small and less than 1%.

The results discussed so far focus on identifying the feature set with the lowest overall eer for all participant/password combinations. Figure 5 shows the number of times each of the five feature sets is identified as the best for matching a test password against the enrolled template. The figure is interesting as it demonstrates that there is good consistency, with f-t being the best feature set, followed by trigraph p-p, for all participants. Any variation is often in the balance of f-t and trigraph p-p, but it is clear that f-t is always the best.

| Summary of key findings
In summary of the previous section, where results are presented and discussed, the list below provides the key findings from this study:
• Improved accuracy is achieved where no substitutions are present in the password (2% improvement). Furthermore, where no substitutions are present, the accuracy is best for passwords of a shorter length. More specifically, accuracy is 94% for passwords 6 characters in length, decreasing to 90% for those 12 characters in length. Where substitutions have been made, length has little significance, with all substitution types having around 1% variation in accuracy and eer. This demonstrates that length is an important factor when utilising passwords with no substitutions;
• The full-timing (f-t) feature set is identified as being the best for the greatest number of participants, accounting for 69% of all participant/password combinations. Trigraph press to press (trigraph p-p) is the second best, accounting for 27%. The other three feature sets identified as best for some participant/password combinations account for 4% of the total. In addition, press to release (p-r) and release to release (r-r) are never identified as the best feature set for any participant/password combination. Considering that f-t has been identified as best in the majority of instances, there is no advantage to converting the data into different feature sets;
• Passwords containing all three types of substitution demonstrate a small decrease in performance compared with any individual substitution instance. Furthermore, when compared with those with special character substitutions, the eer shows a slight improvement; and
• There is a small decrease in accuracy as length increases (1%) across all feature sets, except where no substitutions are made. This demonstrates that length has little significance in relation to accuracy when a substitution of any type is introduced.

| CONCLUSION
A systematic study of the key timings for 42 individuals for 40 different phrases with varying length and with different substitutions is presented. The presented approach involves removing outliers and creating a template for each user and phrase instance, which resulted in the refinement of individual samples. One of the user's samples is excluded from the template as it is used for testing. The processed enrolled templates and test data sets are then used to produce different feature sets (e.g. press to press). The Manhattan distance measure is then used to quantify how closely a test sample matches any single template. A threshold mechanism is then applied to determine whether a test sample's Manhattan distance measure results in a match or a rejection. Experimentation is performed to determine the threshold and feature set that yield the best performance. An eer of between 10% and 11% and accuracy values of 89%-90% were observed for all feature sets, with the lowest eer being obtained with the feature set containing all timings and with trigraph press to press. Further empirical analysis was then performed to gain an understanding of how phrase size and substitutions impact on accuracy. Key observations have been determined and provide a useful insight for those designing keystroke biometric systems, as well as those doing further research into their capabilities. Our study can help inform their experimental design. Of our key findings, it is interesting to have identified that the capabilities of a keystroke dynamics system are at their best when the password has no substitutions and is shorter in length (6). This means that implementing a keystroke biometric system alongside a strong password policy might have a negative impact on the biometric system.
Although this study provides useful insights into password policy and its impact on biometrics, it has limitations. The first limitation is the number of participants and their similar technical competence. The second is the number of phrases asked and the potential for more repeated samples to be acquired to further investigate repeatability. A further limitation is that each participant provided the timings sequentially on the same computer hardware; gaining timings using different hardware over a prolonged period of time would help provide a more representative dataset. However, notwithstanding these limitations, this study has merits and the data is supportive of the findings. In future work, the authors intend to address the aforementioned limitations as well as consider the use of different matching mechanisms.