A multi-step finite-state automaton for arbitrarily deterministic Tsetlin Machine learning

Due to the high arithmetic complexity and scalability challenges of deep learning, there is a critical need to shift research focus towards energy efficiency. Tsetlin Machines (TMs) are a recent approach to machine learning (ML) that has demonstrated significantly reduced energy consumption compared with comparable neural networks, while providing comparable accuracy on several benchmarks. However, TMs rely heavily on energy-costly random number generation to stochastically guide a team of Tsetlin Automata (TAs) in TM learning. In this paper, we propose a novel finite-state learning automaton that can replace the TA in the TM, for increased determinism. The new automaton uses multi-step deterministic state jumps to reinforce sub-patterns, without resorting to randomization. A determinism parameter d finely controls the trade-off between the energy cost of random number generation and the accuracy gained from randomization. Randomization is controlled by flipping a coin before every d'th state jump, ignoring the state jump on tails. For example, d = 1 makes every update random, while d = ∞ makes the automaton completely deterministic. Both theoretically and empirically, we establish that the proposed automaton converges to the optimal action almost surely. Further, when it is used in the TM, only substantial degrees of determinism reduce accuracy. Energy-wise, random number generation contributes to the switching energy consumption of the TM, and curtailing it saves up to 11 mW of power for larger datasets with high d values. Our new learning automaton approach thus facilitates low-energy ML.

Tsetlin Machines (TMs; Granmo, 2018) are a recent approach to machine learning (ML) that has demonstrated significantly reduced energy usage compared with comparable neural networks (Lei et al., 2020; Wheeldon et al., 2020). Using a linear combination of conjunctive clauses in propositional logic, the TM has obtained competitive performance in terms of accuracy (Abeyrathna, Granmo, Zhang, Jiao, & Goodwin, 2019; Berge et al., 2019; Granmo et al., 2019), memory footprint (Granmo et al., 2019), energy (Wheeldon et al., 2020) and learning speed (Granmo et al., 2019; Wheeldon et al., 2020) on diverse benchmarks (image classification, regression and natural language understanding). Furthermore, the rules that TMs build seem to be interpretable, similar to the branches in a decision tree (e.g., of the form: if X satisfies condition A and not condition B then Y = 1; Berge et al., 2019). The reported small memory footprint and low energy consumption make the TM particularly attractive for addressing the scalability and energy challenges in ML.

| Recent progress on TMs
Recent research reports several distinct TM properties. The TM can be used in convolution, providing competitive performance on MNIST, Fashion-MNIST and Kuzushiji-MNIST in comparison with CNNs, K-Nearest Neighbour, SVMs, Random Forest, Gradient Boosting, BinaryConnect, Logistic Circuits and ResNet (Granmo et al., 2019). The TM has also achieved promising results in natural language processing, such as text classification (Berge et al., 2019), word sense disambiguation (Yadav et al., 2021a) and sentiment analysis (Yadav et al., 2021b). By introducing clause weights, it has been demonstrated that the number of clauses can be reduced by up to 50×, without loss of accuracy (Phoulady et al., 2020). Further, hyper-parameter search can be simplified with multi-granular clauses, eliminating the pattern specificity parameter (Gorji et al., 2019). By indexing the clauses on the features that falsify them, up to an order of magnitude faster inference and learning has been reported (Gorji et al., 2020). Additionally, regression TMs compare favourably with Regression Trees, Random Forest Regression and Support Vector Regression (Abeyrathna, Granmo, Zhang, & Goodwin, 2019). In Abeyrathna et al. (2021), stochastic searching-on-the-line automata (Oommen, 1997) learn integer clause weights, performing on par with or better than Random Forest, Gradient Boosting and Explainable Boosting Machines. While TMs are binary throughout, thresholding schemes open up for continuous input (Abeyrathna, Granmo, Zhang, & Goodwin, 2019). Finally, TMs have recently been shown to be fault-tolerant, completely masking stuck-at faults (Shafik et al., 2020). The convergence properties of the TM have recently been studied in Jiao et al. (2021) and Zhang et al. (2020).

| Paper contributions
TMs rely heavily on energy-costly random number generation to stochastically guide a team of TAs to a Nash equilibrium of the TM game. In this paper, we propose a novel finite-state learning automaton that can replace the TAs of the TM, for increased determinism. The new automaton uses multi-step deterministic state jumps to reinforce sub-patterns. Simultaneously, flipping a coin to skip every d'th state update ensures diversification by randomization. The d-parameter thus allows the degree of randomization to be finely controlled. Both theoretically and empirically, we establish that the proposed automaton converges to the optimal action almost surely when trained over an infinite time horizon with an infinite number of memory states. We further evaluate the performance of the TM with this new automaton empirically on five datasets, demonstrating that the d-parameter can be used to trade off accuracy against energy consumption.

| Paper organization
In Section 2, we introduce our new type of learning automaton (LA): the multi-step variable-structure finite-state LA (MVF-LA). The convergence of the MVF-LA is studied both theoretically and empirically in Section 3. Replacing the TA with the MVF-LA, we describe the Arbitrarily Deterministic TM (ADTM) in Section 4. Then, in Section 5, we evaluate the ADTM empirically using five datasets. The performance of the ADTM is investigated by varying the d-parameter, contrasting it against the regular TM and seven other state-of-the-art ML algorithms. The effect of determinism on energy consumption is discussed in Section 6. We conclude our work in Section 7.

| A MULTI-STEP FINITE-STATE LEARNING AUTOMATON
The origins of LA (Narendra & Thathachar, 2012) can be traced back to the work of M. L. Tsetlin in the early 1960s (Tsetlin, 1961). The objective of an LA is to learn the optimal action through trial and error in a stochastic environment. Various types of LA are available, depending on the nature of the application (Thathachar & Sastry, 2004). Due to their computational simplicity, we here focus on two-action finite-state LA, which we extend by introducing a novel periodically changing structure (variable structure).
An LA interacts with its environment iteratively. In each iteration, the action that a finite-state LA performs next is decided by its present state (the memory). The environment, in turn, randomly produces a reward or a penalty according to an unknown probability distribution, responding to the action selected by the LA. If the finite-state LA receives a reward, it reinforces the action performed by moving to a 'deeper' state. If the action results in a penalty, it instead changes state towards the middle state, to weaken the performed action, ultimately switching to the other action. In this manner, with a sufficient number of states, a finite-state LA converges to selecting the action with the highest probability of producing rewards (the optimal action) with probability arbitrarily close to 1.0 (Narendra & Thathachar, 2012).
The transitions between states can be deterministic or stochastic. Deterministic transitions occur with probability 1.0, while stochastic transitions are performed randomly based on a preset probability. If the transition probabilities change, we have a variable-structure automaton; otherwise, we have one with fixed structure. The pioneering TA, depicted in Figure 1, is a deterministic fixed-structure finite-state automaton (Tsetlin, 1961). The state transition graph in the figure depicts a TA with 2N states. States 1 to N map to Action 1 and states N + 1 to 2N map to Action 2.
While the TA changes state in single steps, the deterministic Krinsky Automaton introduces multi-step state transitions (Narendra & Thathachar, 2012). The purpose is to reinforce an action more strongly when it is rewarded, and more weakly when it is penalized. The Krinsky Automaton behaves as a TA when the response from the environment is a penalty. However, when the response is a reward, any state from 2 to N transitions to state 1, and any state from N + 1 to 2N − 1 transitions to state 2N. In effect, N consecutive penalties are needed to offset a single reward.
Another LA variant is the Krylov Automaton, which makes both deterministic and stochastic single-step transitions (Narendra & Thathachar, 2012). The state transitions of the Krylov Automaton are identical to those of a TA for rewards. However, when it receives a penalty, it performs the corresponding TA state change randomly, with probability 0.5.
We now introduce our new type of LA, the MVF-LA, shown in Figure 2. The MVF-LA has two kinds of feedback: strong and weak. As covered in the next section, strong feedback is required by the TM to strongly reinforce frequent sub-patterns, while weak feedback is required to make the TM forget infrequent ones. To achieve this, weak feedback only triggers one-step transitions, whereas strong feedback triggers s-step transitions. Thus, a single strong feedback is offset by s instances of weak feedback. Further, the MVF-LA has a variable structure that changes periodically. That is, the MVF-LA switches between two different transition graph structures, one deterministic and one stochastic. The deterministic structure is as shown in the figure, while the stochastic structure introduces a transition probability of 0.5 for every transition. The switch between structures is performed so that every d'th transition is stochastic, while the remaining transitions are deterministic.
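
To make the update rule concrete, here is a minimal Python sketch of an MVF-LA along these lines. It is our illustrative reading of Figure 2, not the authors' implementation; all names are our own.

```python
import random

class MVFLA:
    """Sketch of the multi-step variable-structure finite-state LA (MVF-LA).

    States 1..N select action 1; states N+1..2N select action 2.
    """

    def __init__(self, n_states, s, d, rng=None):
        self.N = n_states              # memory states per action
        self.s = s                     # jump size for strong feedback
        self.d = d                     # every d'th jump is gated by a coin flip
        self.state = n_states          # start at the boundary of action 1
        self.jumps = 0                 # counts scheduled state jumps
        self.rng = rng or random.Random()

    def action(self):
        return 1 if self.state <= self.N else 2

    def update(self, reward, strong):
        self.jumps += 1
        # Every d'th state jump is stochastic: skipped on tails.
        if self.jumps % self.d == 0 and self.rng.random() < 0.5:
            return
        step = self.s if strong else 1
        if self.action() == 1:
            self.state += -step if reward else step   # deeper = towards state 1
        else:
            self.state += step if reward else -step   # deeper = towards state 2N
        self.state = max(1, min(2 * self.N, self.state))

la = MVFLA(n_states=100, s=3, d=10)
la.update(reward=True, strong=True)   # reinforce action 1 by s steps
print(la.action(), la.state)          # -> 1 97
```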

| PROOF OF THE CONVERGENCE OF MVF-LA
In this section, we discuss the convergence of the proposed MVF-LA. In Section 3.1, we use a Markov chain model to analyse the convergence properties of the MVF-LA. Thereafter, we simulate the MVF-LA and illustrate its convergence under different conditions in Section 3.2.

| Proof of the convergence of MVF-LA using Markov chain

To build the Markov chain, we utilize the memory states of the MVF-LA, that is, 1 to 2N, to represent the state space of the Markov chain. The transition probability matrix, P, of this Markov chain is then to be established. A transition from any state i to another state j in the MVF-LA can happen due to one of four types of feedback: strong reward, strong penalty, weak reward and weak penalty. Apart from boundary conditions, the state transition from i to j may also fail to happen, since every d'th update is made with probability 0.5. Considering these conditions, the probability of making the transition from i to j, p_{i,j}, can be calculated as follows.
The transition probability due to a strong reward, P_sr, can be calculated as

P_sr = P_Trans × P_s × (1 − c).   (1)

Here, P_Trans is the probability that any transition to another state happens. It covers two possibilities: first, transitions are made deterministically in d − 1 out of every d updates; second, transitions are made with probability 0.5 in the remaining 1 out of every d updates. The overall probability of any transition, P_Trans, can therefore be calculated as

P_Trans = (d − 1)/d + 0.5 × 1/d.   (2)

The variable c in (1) is the penalty probability: it is the penalty probability of action 1 (c_1) if the starting state i of the transition from i to j lies in the state space of action 1, that is, 0 < i ≤ N, and the penalty probability of action 2 (c_2) if state i lies in the state space of action 2, that is, N < i ≤ 2N. The probability P_s in the same equation is the probability of getting strong feedback. The transition probabilities due to a weak reward, a strong penalty and a weak penalty are obtained analogously, replacing P_s by (1 − P_s) for weak feedback and (1 − c) by c for penalties.
Using the above transition probabilities, we form the transition probability matrix, P, for the MVF-LA in Figure 2. Matrix P exhibits the Markov chain property that the probabilities in each row sum to one: Σ_j p_{i,j} = 1. For instance, consider the MVF-LA in Figure 2. When the starting state of a transition is N, a strong reward moves the state from N to N − 3 (s = 3). Similarly, a weak reward moves the state from N to N − 1. Weak and strong penalties, on the other hand, move the state N from the state space of action 1 into the state space of action 2: a weak penalty moves the state from N to N + 1, while a strong penalty moves it to N + 3. The automaton stays in the same state with probability P_non, where P_non equals (1 − P_Trans). At the boundaries, if any of the above transitions cannot be made, that transition probability is also accommodated in P_non.
Algorithm 1 shows the step-by-step procedure for building the transition matrix of the MVF-LA and calculating its stationary distribution.
Clearly, this Markov chain is recurrent and aperiodic, and thus it is an ergodic Markov chain. As in Jiao (2020), the probability of being in a particular state at time n, π(n), can be computed as

π(n) = π(0) P^n.   (6)

Algorithm 1 Calculating the stationary distribution of the Markov chain for MVF-LA. Input: number of states per action, N; number of strong jumps, s; determinism parameter, d; probability of getting strong feedback, P_s; penalty probabilities for actions 1 and 2, c_1 and c_2. Output: the limiting matrix. The algorithm initializes a 2N × 2N transition probability matrix P, fills it with the transition probabilities above, and handles the boundary cases in which strong, or both strong and weak, updates cannot be made and the automaton stays in the same state. [Full listing omitted.]

When n goes to infinity, we obtain the steady-state probabilities, π*. The steady-state probabilities π* are independent of the initial state. Hence, we can also calculate π* from

π* = π* P.   (7)

Theoretically, from (6) and (7), π* can be obtained by multiplying P by itself an infinite number of times, π* = (P)^∞. In practice, we multiply P by itself a sufficiently large number of times, until its entries converge, to obtain the steady-state probabilities. Once the steady-state probabilities are known, we sum up those that correspond to the action with the lowest penalty probability. If this sum converges to 1 as N approaches infinity, we can conclude that, with a sufficiently large number of memory states per action, the LA converges to the correct action. For the MVF-LA, based on the calculation of Algorithm 1, we conclude that as long as the best action's penalty probability is less than 0.5, the action selection probability converges to 1 as N goes to infinity.
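
The following Python sketch reconstructs this computation end to end. It is our illustrative reading of Algorithm 1 and Equations (1), (2), (6) and (7), not the paper's code; blocked boundary jumps and skipped coin-flip jumps are folded into the self-loop probability P_non, as described above.

```python
import numpy as np

def transition_matrix(N, s, d, P_s, c1, c2):
    """Build the 2N x 2N transition matrix of the MVF-LA (cf. Algorithm 1)."""
    P_trans = (d - 1) / d + 0.5 / d           # equation (2)
    P = np.zeros((2 * N, 2 * N))
    for i in range(1, 2 * N + 1):             # states are 1-indexed
        c = c1 if i <= N else c2              # penalty probability of current action
        deeper = -1 if i <= N else 1          # direction that reinforces the action
        events = [
            (P_s * (1 - c), deeper * s),      # strong reward, equation (1)
            ((1 - P_s) * (1 - c), deeper),    # weak reward
            (P_s * c, -deeper * s),           # strong penalty
            ((1 - P_s) * c, -deeper),         # weak penalty
        ]
        for prob, step in events:
            j = i + step
            if 1 <= j <= 2 * N:
                P[i - 1, j - 1] += P_trans * prob
            else:
                P[i - 1, i - 1] += P_trans * prob   # blocked jump stays put
        P[i - 1, i - 1] += 1 - P_trans              # skipped (coin-flip) jumps
    return P

P = transition_matrix(N=10, s=3, d=10, P_s=0.67, c1=0.1, c2=0.9)
pi = np.linalg.matrix_power(P, 100_000)[0]   # a row of P^n for large n approximates pi*
print("Pr[Action 1] =", pi[:10].sum())
```
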
To illustrate the convergence properties of the MVF-LA, we form different transition matrices with distinct parameter configurations using Algorithm 1. We keep s, P_s and d constant at 3, 0.67 and 10, respectively, throughout the analysis. Without loss of generality, we always set action 1 as the best action. The steady-state probability distribution over the states of the MVF-LA when N = 10 is shown in Figure 3, together with the sum of the steady-state probabilities of action 1, Pr[Action 1], for the different penalty probability configurations. Even with only N = 10 memory states, the action selection probability for action 1 is convincingly high (> 0.94) for two of the configurations. The probability of selecting action 1 for the remaining penalty probability setups is higher than 0.75. However, the probability distributions of these two setups show that the MVF-LA has not confidently settled on action 1, as the steady-state probabilities of the end states of action 1 are lower than those of the first two setups.
Nevertheless, theoretically, the probability of selecting action 1 increases with N and reaches 1 as N goes to infinity, given that the best action has a penalty probability of less than 0.5. This is verified by the plots in Figure 4, where the probability of selecting action 1 reaches 1 with increasing N in all cases except c_1 = 0.6 and c_2 = 0.9. This is because the lowest penalty probability (c_1 in this case) is not less than 0.5.
Our Markov chain-based analysis thus uncovers convergence properties similar to those of the traditional Tsetlin Automaton (TA). However, with increasing determinism, the stochasticity of learning is reduced. This reduction makes individual learning runs more predictable.
FIGURE 3 The steady-state probabilities of an MVF-LA with different penalty probabilities when N = 10

FIGURE 4 The increase of the probability of selecting the correct action with N

Additionally, the Markov property of TM learning has been utilized previously to analyse learning convergence (Jiao et al., 2021; Zhang et al., 2020). Because TM learning can be formulated as a Markov chain, we can mathematically prove convergence properties. Indeed, apart from proving convergence of the individual learning elements (TA or MVF-LA) within the TM, one can also analyse the complete TM as one unit. Other state-of-the-art ML algorithms do not necessarily decompose into a simple Markov chain, making exact convergence analysis more difficult.

| Simulation analysis of MVF-LA
In this section, we simulate the MVF-LA to see whether it behaves in accordance with the convergence properties stated above. First, we build the MVF-LA and iteratively update its states by stochastically generating feedback for the MVF-LA's actions according to known penalty probabilities. Then we analyse the behaviour of the MVF-LA and compare it with the theoretical results.
Here we introduce the quantity M(n), the average penalty after n training iterations. For a two-action automaton, M(n) = c_1 × Pr[Action 1] + c_2 × Pr[Action 2], evaluated at iteration n (Narendra & Thathachar, 2012). According to the theory stated in Section 3.1, when n and N go to infinity, the probability of selecting the action with the lowest penalty probability reaches 1 (and, consequently, the probability of selecting the other action goes to 0). Therefore, when n and N go to infinity, the average penalty M(n) should approach the lowest penalty probability.
In our simulation, to simplify the analysis, we always assign the lowest penalty probability to action 1. We first analyse the variation of Pr[Action 1] with the number of training iterations, n. Figure 5 depicts the 20-iteration moving average of Pr[Action 1] against the number of iterations. In each experiment round, the number of training iterations, n, is increased and the final Pr[Action 1] is recorded. The parameters N, s, d and P_s in this simulation are fixed at 20, 3, 10 and 0.67, respectively. As expected, Pr[Action 1] increases with n. The case with c_1 = 0.1 and c_2 = 0.9 quickly approaches probability 1. The probabilities of selecting action 1 in the experiments with c_1 = 0.4, c_2 = 0.6 and with c_1 = 0.2, c_2 = 0.4 approach 1 more slowly; of these two, the Pr[Action 1] curve for c_1 = 0.4 and c_2 = 0.6 is the more stable. The Pr[Action 1] curve for the experiment with c_1 = 0.6 and c_2 = 0.9 stabilizes around 0.8.

The change of the average penalty, M(n), over n for the same experiments is illustrated in Figure 6. Except for the experiment with c_1 = 0.6 and c_2 = 0.9, M(n) approaches the lowest penalty probability as n grows. The five-iteration moving average for the experiment with c_1 = 0.2 and c_2 = 0.4 is again unsteady. The reason is that both c_1 and c_2 are lower than 0.5, so there is a high chance of receiving a reward for either action.
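
For reference, a self-contained Monte Carlo sketch of this experiment is given below (our own code, with the parameter values quoted above; M(n) is estimated here as the running average penalty over all runs and iterations, a simplification of the per-iteration expectation).

```python
import random

def simulate(n_iters, N=20, s=3, d=10, P_s=0.67, c1=0.4, c2=0.6,
             runs=1000, seed=42):
    """Monte Carlo estimate of Pr[Action 1] and the average penalty M(n)."""
    rng = random.Random(seed)
    final_action1, penalties = 0, 0
    for _ in range(runs):
        state, jumps = N, 0                    # start on the action-1 side
        for _ in range(n_iters):
            action1 = state <= N
            penalty = rng.random() < (c1 if action1 else c2)
            penalties += penalty
            jumps += 1
            if jumps % d == 0 and rng.random() < 0.5:
                continue                       # every d'th jump skipped on tails
            step = s if rng.random() < P_s else 1
            if action1:
                state += step if penalty else -step
            else:
                state += -step if penalty else step
            state = max(1, min(2 * N, state))
        final_action1 += state <= N
    return final_action1 / runs, penalties / (runs * n_iters)

pr_a1, avg_penalty = simulate(n_iters=500)
print(f"Pr[Action 1] ~ {pr_a1:.3f}, running average penalty ~ {avg_penalty:.3f}")
```
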
In the next experiment, the change of Pr[Action 1] with n is studied for distinct N values. Here, c_1 and c_2 are fixed at 0.4 and 0.6, respectively. As expected, Figure 7 shows that the Pr[Action 1] of an MVF-LA with higher N reaches the highest possible probability faster.

| THE ARBITRARILY DETERMINISTIC TM
In this section, we introduce the details of the ADTM, shown in Figure 8, where the TA is replaced by the MVF-LA. The purpose of the ADTM is to control the amount of stochasticity generated, thus allowing management of energy consumption during learning.

| Input features
Like the TM, an ADTM takes a feature vector of o propositional variables as input, X = [x_1, x_2, …, x_o], to be classified into one of two classes, y = 0 or y = 1. These features are extended with their negations to produce a set of 2o literals: L = [l_1, …, l_{2o}] = [x_1, …, x_o, ¬x_1, …, ¬x_o].

| Clauses

Patterns are represented by m conjunctive clauses. As shown for Clause-1 in Figure 8, a clause in the TM comprises 2o MVF-LAs, each controlling the inclusion of a specific literal. Let the set I_j, I_j ⊆ {1, …, 2o}, denote the indexes of the literals that are included in clause j. When evaluating clause j on the input literals L, the included literals are ANDed: c_j = ∧_{k ∈ I_j} l_k, j = 1, …, m. Note that the output of an empty clause, I_j = ∅, is 1 during learning and 0 during inference.
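
As a concrete illustration of clause evaluation, consider the following sketch (our own code; it uses 0-based literal indexes, whereas the paper's I_j is 1-based):

```python
def literals_of(x):
    """Extend the o features with their negations: L = [x_1..x_o, NOT x_1..NOT x_o]."""
    return x + [1 - v for v in x]

def clause_output(literals, include, learning=True):
    """AND the included literals: c_j = AND over k in I_j of l_k."""
    if not include:                        # empty clause, I_j is the empty set
        return 1 if learning else 0
    return int(all(literals[k] for k in include))

x = [1, 0, 1]                              # o = 3 propositional inputs
L = literals_of(x)                         # 2o = 6 literals
print(clause_output(L, include={0, 4}))    # x_1 AND NOT x_2 -> 1
```
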
| Classification
In order to identify the sub-patterns associated with each of the two classes of a two-class ADTM, the clauses are grouped in two. The number of clauses employed is a user-set parameter, m. Half of the clauses are assigned positive polarity (c_j^+); the other half are assigned negative polarity (c_j^−). The clause outputs, in turn, are combined into a classification decision through summation and thresholding using the unit step function, u(v) = 1 if v ≥ 0 and 0 otherwise:

ŷ = u(Σ_j c_j^+ − Σ_j c_j^−).

That is, classification is based on a majority vote, with the positive clauses voting for y = 1 and the negative ones for y = 0.
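
A compact sketch of this vote (illustrative names only):

```python
def classify(pos_clause_outputs, neg_clause_outputs):
    """Unit-step majority vote: y_hat = u(sum c_j^+ - sum c_j^-)."""
    v = sum(pos_clause_outputs) - sum(neg_clause_outputs)
    return 1 if v >= 0 else 0

print(classify([1, 1, 0], [0, 1, 0]))   # v = 2 - 1 = 1 -> class 1
```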

| The MVF-LA game and orchestration scheme
The MVF-LAs in the ADTM are updated by so-called Type I and Type II feedback. The type of feedback is decided by the class of the current training sample (X, y) and the polarity of the clause (positive or negative). Clauses with positive polarity receive Type I feedback when the target output is y = 1, and Type II feedback when the target output is y = 0. For clauses with negative polarity, Type I feedback replaces Type II, and vice versa. In the following, we focus only on clauses with positive polarity.

| Type I feedback
The number of clauses that receive Type I feedback is controlled by selecting them stochastically according to (9):

P(Type I feedback) = (T − max(−T, min(T, v))) / (2T),   (9)

where v = Σ_j c_j^+ − Σ_j c_j^− is the aggregated clause output and T is a user-set parameter that decides how many clauses should be involved in learning a particular sub-pattern. Increasing T proportionally with the number of clauses introduces an ensemble effect, for increased learning accuracy. Type I feedback consists of two kinds of sub-feedback: Type Ia and Type Ib. Type Ia feedback stimulates recognition of patterns by reinforcing the include action of MVF-LAs whose corresponding literal value is 1, but only when the clause output also is 1. Note that an action is reinforced either by rewarding the action itself or by penalizing the other action. Type Ia feedback is strong, with step size s (Figure 2).
Type Ib feedback, on the other hand, combats over-fitting by reinforcing the exclude actions of MVF-LAs when the corresponding literal is 0 or when the clause output is 0. Type Ib feedback is weak (Figure 2) to facilitate learning of frequent patterns.
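
The sketch below captures the Type I decision logic per literal automaton, assuming the standard TM clause-selection formula in (9); the ('action', 'strength') tuples are our shorthand for rewarding the named action or penalizing the opposite one.

```python
import random

def select_for_type_i(v, T, rng=random):
    """Select a clause for Type I feedback with probability (T - clamp(v, -T, T)) / (2T)."""
    return rng.random() < (T - max(-T, min(T, v))) / (2 * T)

def type_i_decisions(clause_output, literals):
    """Per literal automaton: which action to reinforce, and how strongly."""
    decisions = []
    for lk in literals:
        if clause_output == 1 and lk == 1:
            decisions.append(("include", "strong"))   # Type Ia: step size s
        else:
            decisions.append(("exclude", "weak"))     # Type Ib: step size 1
    return decisions

print(type_i_decisions(1, [1, 0, 1, 0]))
```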

| Type II feedback
Clauses are also selected stochastically for receiving Type II feedback, with probability

P(Type II feedback) = (T + max(−T, min(T, v))) / (2T).

Type II feedback combats false positive clause output by seeking to alter clauses that output 1 so that they instead output 0. This is achieved simply by penalizing exclusion of literals of value 0: when the clause output is 1 and the corresponding literal value of an MVF-LA is 0, the exclude action of the MVF-LA is penalized. Type II feedback is strong, with step size s. Recall that in all of the above MVF-LA update steps, the parameter d decides the determinism of the updates.
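
A matching sketch for Type II feedback, under the same assumptions and shorthand as the Type I sketch above:

```python
import random

def select_for_type_ii(v, T, rng=random):
    """Select a clause for Type II feedback with probability (T + clamp(v, -T, T)) / (2T)."""
    return rng.random() < (T + max(-T, min(T, v))) / (2 * T)

def type_ii_decisions(clause_output, literals):
    """Penalize the exclude action for 0-valued literals of clauses that output 1."""
    if clause_output == 0:
        return [None] * len(literals)   # nothing to correct
    return [("penalize_exclude", "strong") if lk == 0 else None
            for lk in literals]

print(type_ii_decisions(1, [1, 0, 1, 0]))   # updates only the 0-valued literals
```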

| EMPIRICAL EVALUATION
We now study the performance of the ADTM empirically using the Bankruptcy, Balance Scale, Breast Cancer, Liver Disorders and Heart Disease datasets.¹ Note that, since we seek a trade-off between TM accuracy and interpretability, we have selected datasets that facilitate interpretation.
The ADTM is compared against the regular TM to assess to what degree learning accuracy suffers from increased determinism. The ADTM is also compared against seven other state-of-the-art ML approaches: Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), Decision Trees (DTs), K-Nearest Neighbour (KNN), Random Forest (RF), Gradient Boosted Trees (XGBoost; Chen & Guestrin, 2016) and Explainable Boosting Machines (EBMs; Nori et al., 2019). For comprehensiveness, three ANN architectures are used: ANN-1, with one hidden layer of 5 neurons; ANN-2, with two hidden layers of 20 and 50 neurons; and ANN-3, with three hidden layers of 20, 150 and 100 neurons.
The performance of these predictive models is summarized in Table 6. We compute both the F1-score (F1) and accuracy (Acc.) as performance measures.
However, due to class imbalance, we emphasize F1-score when comparing the performance of the different predictive models.

| Bankruptcy
The Bankruptcy dataset contains historical records of 250 companies.² The outcome, Bankruptcy or Non-bankruptcy, is characterized by six categorical features. We binarize the features using thresholding (Abeyrathna, Granmo, Zhang, Jiao, & Goodwin, 2019) before feeding them into the ADTM. We first tune the hyper-parameters of the TM, and the best performance, obtained with m = 100 (number of clauses), s = 3 (step size for the MVF-LA) and T = 10 (summation target), is reported in Table 1. Each MVF-LA contains 100 states per action. Table 1 also reports the impact of varying levels of determinism. As seen, performance is indistinguishable for d-values 1, 10 and 100, where the ADTM achieves its highest classification accuracy. However, notice the slight decrease in F1-score and accuracy when determinism is further increased to d = 500, 1000 and 5000.
Figure 9 shows how training and testing accuracy evolve over the training epochs. Only high determinism seems to influence learning speed and accuracy significantly. The performance of the other considered ML models is compiled in Table 6. Among these, the best F1-score is obtained by ANN-3. However, ANN-3 is outperformed by the ADTM for all d-values except d = 5000.
TABLE 1 Performance of TM and ADTM with different d on Bankruptcy

FIGURE 9 Training and testing accuracy per epoch on Bankruptcy

| Balance scale
The Balance Scale dataset³ contains three classes: balance scale tips to the right, tips to the left, or is in balance. The class is decided by the size of the weights on both sides of the scale and the distance of each weight from the centre. Hence, the classes are characterized by four features. To make the output binary, we remove the 'balanced' class, ending up with 576 data samples. The ADTM is equipped with 100 clauses. Each MVF-LA is given 100 states per action. The remaining two parameters, s and T, are fixed at 3 and 10, respectively. Table 2 contains the results obtained with the TM and the ADTM. Even though the ADTM uses the same number of clauses as the TM, its F1-score and accuracy are better when all updates of the MVF-LAs are stochastic. The performance of the ADTM remains the same until the determinism parameter surpasses 100; after that, performance degrades gradually.
The progress of training and testing accuracy per epoch can be found in Figure 10. Each ADTM setup reaches its peak training and testing accuracy and becomes stable within a few training epochs. As can be seen, accuracy is maintained up to d = 100, thus reducing random number generation to 1% without accuracy loss. Among the other ML approaches listed in Table 6, EBM achieves the highest F1-score and accuracy.

| Breast cancer
The Breast Cancer dataset⁴ contains the records of 286 patients, related to the recurrence of breast cancer (201 without recurrence and 85 with recurrence). The recurrence of breast cancer is to be estimated using nine features: Age, Menopause, Tumour Size, Inv Nodes, Node Caps, Deg Malig, Side (left or right), Position of the Breast and Irradiation. Some patient samples have missing feature values; these samples are removed from the dataset in the present experiment. The ADTM is arranged with the following parameter setup: m = 100, s = 5, T = 10, and 100 states per MVF-LA action. The classification accuracy of the TM and ADTM is summarized in Table 3. The performance of both the TM and the ADTM is here considerably lower than on the previous two datasets, and it further decreases with increasing determinism.
However, the F1-scores obtained by all the other considered ML models are also low, that is, less than 0.500. The training and testing accuracy progress per epoch is reported in Figure 11, showing a clear degradation of performance with increasing determinism.

| Liver disorders
The Liver Disorders dataset⁵ was created by BUPA Medical Research and Development Ltd. during the 1980s as part of a larger health-screening database. The dataset consists of seven attributes. However, McDermott and Forsyth (2016) claim that many researchers have used the dataset incorrectly, treating the Selector attribute as the class label. Following the recommendation of McDermott and Forsyth, we here instead use the Number of Half-Pint Equivalents of Alcoholic Beverages as the dependent variable, binarized using the threshold ≥ 3. The Selector attribute is discarded. The remaining attributes represent the results of various blood tests, and we use them as features. Here, the ADTM is given 10 clauses per class, with s = 3 and T = 10. Each MVF-LA action possesses 100 states. The performance of the ADTM for different levels of determinism is summarized in Table 4. For d = 1, the F1-score of the ADTM is better than that of the standard TM. In contrast to the performance on the previous datasets, the F1-score of the ADTM on the Liver Disorders dataset does not decrease significantly with d; instead, it fluctuates around 0.690. As shown in Figure 12, unlike on the other datasets, the ADTM with d = 1 requires more training rounds than with larger d-values before it settles on the final MVF-LA actions. It is also unable to reach the training accuracy obtained with higher d-values. Despite the diverse learning speeds, testing accuracy becomes similar after roughly 50 training rounds. The other considered ML models obtain somewhat similar F1-scores; however, only DT, RF and EBM surpass an F1-score of 0.700.

| Heart disease
The Heart Disease dataset⁶ concerns the prediction of heart disease. To this end, 13 features are available, selected from an original set of 75. Of the 13 features, 6 are real-valued, 3 are binary, 3 are nominal and 1 is ordered.
In this case, the ADTM is built with 100 clauses. The number of state transitions on strong feedback, s, equals 3, while the target T equals 10. The number of states per MVF-LA action is 100.
As one can see in Table 5, the ADTM provides better performance than the TM in terms of F1-score and accuracy when d = 1. The F1-score then increases with d and peaks at d = 100. After some fluctuation, it drops to 0.605 when d = 5000.
Figure 13 shows similar training and testing accuracy for all d-values, apart from the significantly lower accuracy for d = 5000.
Of the other ML algorithms, EBM provides the best F1-score, as summarized in Table 6. Even though ANN-1, ANN-2, DT, RF and XGBoost obtain better F1-scores than the TM, the F1-scores of the ADTM for d equal to 1, 10, 100, 500 and 1000 are higher.

| EFFECTS OF DETERMINISM ON ENERGY CONSUMPTION
The energy consumption of any TM implementation can be reduced by using the ADTM, since random choices are a key mechanism in TM learning (see Section 4). This effect is especially notable in ASIC implementations aimed at low-energy on-chip learning applications, where energy overheads are low (e.g., compared with a personal computer). While software implementations of the TM use a centralized pseudorandom number generator (PRNG) to facilitate the random choices (Figure 14a), the ASIC implementation uses many smaller PRNGs localized to individual TAs to maximize parallelism (Figure 14b). In the ASIC implementation of the TM, linear feedback shift registers (LFSRs) are used as PRNGs due to their small size and simplicity (Wheeldon et al., 2020).
Power is consumed by the PRNGs in the process of generating a new random number. This is referred to as switching power. In the TM, every TA update is randomized, and switching power is consumed by the PRNGs on every cycle. Additionally, power is also consumed by the PRNGs whilst idle. We term this leakage power. Leakage power is always consumed by the PRNGs whilst they are powered up, even when not generating new numbers.
In the ADTM, with its hybrid TA, where the determinism parameter d is introduced, d = 1 is equivalent to a TM in which every TA update is randomized, while d = ∞ means the ADTM is fully deterministic and no random numbers are required from the PRNGs. If a TA update is randomized only on every d'th cycle, the PRNGs need only be actively switched (and therefore consume switching power) for a 1/d fraction of the entire training procedure. The switching power consumed by the PRNGs accounts for 7% of the total system power when using the traditional TA (equivalent to d = 1). With d = 100 this is reduced to 0.07% of the system power, and with d = 5000 it is reduced further to 0.001%. As d increases in the ADTM, the switching power consumed by the PRNGs thus tends to zero.
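
Plugging the quoted figures into this 1/d scaling reproduces the reported shares (a back-of-the-envelope check; 0.0014% is rounded to 0.001% in the text):

```python
# PRNG switching-power share scales as (share at d = 1) / d.
base_share_percent = 7.0   # switching share of total system power at d = 1
for d in (1, 100, 5000):
    print(f"d = {d:>4}: PRNG switching ~ {base_share_percent / d:.4g}% of system power")
```
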
In the special case of d = ∞, the PRNGs are no longer required for TA updates, since the TAs are fully deterministic: we can omit the PRNGs from the design and prevent their leakage power from being consumed. The leakage power of the PRNGs accounts for 32% of the total system power. On top of the switching power savings, this equates to 39% of the system power, meaning that large power, and therefore energy, savings can be made with the ADTM.
Figure 15 shows the number of randomization events for different d values on the Heart Disease dataset. As expected, for higher d values, the number of events drops drastically. For example, in the first iteration this number is reduced by 4219× from d = 1 to d = 5000. Notice how the number of these events also reduces for both cases as the number of iterations increases (Granmo, 2018). The reduced number of events can be leveraged for power minimization.
Table 7 shows comparative training power consumption per datapoint (i.e., all TAs being updated concurrently) for two different d values: d = 1 and d = 5000. Typically, the overall power is higher for bigger datasets, as they require an increased number of concurrent TAs as well as PRNGs. As can be seen, the increase in d value reduces the power consumption by 11 mW in the case of the Heart Disease dataset. This saving is made by reducing the switching activity in the PRNGs, as explained above. Further savings are made with larger d values, as the concurrent switching activity of the PRNGs is reduced.

FIGURE 1 Transition graph of a two-action Tsetlin Automaton with 2N memory states

FIGURE 2 Transition graph of the multi-step variable-structure finite-state learning automaton

FIGURE 5 The variation of Pr[Action 1] against the number of training iterations, n, for different penalty probabilities

FIGURE 6 The variation of the average penalty, M(n), against the number of training iterations, n, for different penalty probabilities

FIGURE 7 The variation of Pr[Action 1] against the number of training iterations, n, for different numbers of states per action, N

FIGURE 8 The ADTM structure

FIGURE 11 Training and testing accuracy per epoch on Breast Cancer

FIGURE 14 PRNG strategies for (a) software TM and (b) hardware TM

FIGURE 15 Number of randomization events per epoch for the Heart Disease dataset

TABLE 3 Performance of TM and ADTM with different d on Breast Cancer

TABLE 4 Performance of TM and ADTM with different d on Liver Disorders

TABLE 5 Performance of TM and ADTM with different d on Heart Disease

Note: Bold values are the best F1-scores and accuracies for each dataset.