Adjusting misclassification using a second classifier with an external validation sample

Administrative data may suffer from delays or mistakes in reporting. To adjust for the resulting measurement errors, it is often necessary to combine data from related sources, such as sample surveys, administrative or ‘big’ data. However, the additional measure variable usually has a different definition and errors of its own, and the available joint data set may not have a completely known sampling distribution. We develop a modelling approach that capitalizes on one's knowledge and experience with the data source where they exist, and apply it to register- and survey-based Employed status. Comparisons are made to adjustments by hidden Markov models. Our approach is applicable to similar situations involving big data sources.


INTRODUCTION
Making greater use of data originating from administrative sources for statistical purposes has become an increasingly important topic for many National Statistical Offices (NSOs). Administrative data are not perfect, and combination of data from multiple sources is often needed to overcome various known deficiencies, such as when capture-recapture methods are applied to multiple registers to adjust for their combined under-coverage, or when latent class analysis is applied to adjust for the discrepancies among similarly defined variables from different sources. See Zhang (2012), di Zio et al. (2017) and Hand (2018) for broad discussions of the relevant statistical topics, methods and challenges. In our motivating application later, we have a register-based Employed status, denoted by X, which is known for everyone in the population, based on an administrative source that delivers data to Statistics Norway on a monthly basis. Due to delays and misreporting, X may be erroneous for the true register status, denoted by Y, for a considerable period after the reference month, although the two will eventually reach agreement over time. A concurrent Employed status is also available from the continuous Labour Force Survey (LFS), denoted by Z, which follows a different definition of employment, so that Z and Y can differ regardless of whether Z is a true measure of its own definition. The setting is generically summarized in Table 1, where R = 1 if a unit is in the joint sample and 0 otherwise.
We shall develop and apply a modelling approach for adjusting the misclassification of X by making use of a second classifier Z that is also subject to misclassification. We assume that, based on the knowledge and experience about the source of X, it is possible to define the part of the population where one is confident that X is either correct or nearly so, denoted by B = 1, while the errors of X are much more likely in the rest of the population, denoted by B = 2. We call B a discriminant. As Table 2 shows, the discriminant creates a validation sample of (Z, Y) in the subpopulation of B = 1 so that Pr(Z|Y, B = 1) can be estimated. Our model allows one to apply the estimated Pr(Z|Y, B = 1) to the subpopulation of B = 2, in order to estimate Pr(Y|X, B = 2) and then Pr(Y|B = 2).
The discriminant provides a previously unexplored possibility to make use of one's knowledge and experience of the source. This is often possible with administrative data. For the aforementioned register Employed status X, many indicators in the administrative sources can be used for the construction of the discriminant, such as the type of job or position, the length of employment history, the level of income and so on. Register-based highest level of education is another example, where, for instance, native-born persons with a completed education history in the relevant registers can be reliably classified. As a third example, the registered address of a person (or family) may be mistaken due to lack of updating or misreporting. Provided it is possible to obtain various 'signs of life' addresses via licence registration, social or health services, utility bills, etc., one may be able to distinguish which registered addresses are most likely to be correct from the rest. As a final example here, dwelling addresses registered for primary residence or recreation uses may be mistaken or not in use in reality. Provided it is possible to obtain relevant activity data from electricity smart meters, mobile phone signals or airborne laser scanning, one may be able to identify those most likely to have correctly classified uses. In all such or similar situations, the discriminant creates a validation sample for a second fallible classifier Z, which is external to the subpopulation where adjustment of X is needed. As will be demonstrated later, in practice, X only needs to be nearly always correct for the tacit validation-sample assumption to lead to useful adjustments for the rest of the population, even if the assumption is not completely true. The remaining task of modelling is to enable one to usefully apply Pr(Z|Y, B = 1) estimated in the validation sample.
There exists a large body of literature on categorical data analysis in the presence of misclassification. See the excellent reviews of Kuha and Skinner (1997) and Kuha et al. (1998), who note in particular a strong tradition in medical studies. Bross (1954) shows how conclusions drawn from 2 × 2 tables can be affected by misclassification. Tenenbein (1970) introduces the double sampling methodology for binary classifiers. It is assumed that a (simple) random sample is classified by a cheap but fallible classifier. A subsample is taken, from which a more costly true classifier is obtained. The subsample, from which we can learn the misclassification mechanism, is a validation sample. It is shown that making use of the fallible classifier observed outside the validation sample is more efficient than using only the true classifier in the validation sample.
The basic double sampling method can be extended to allow for multinomial variables (Tenenbein, 1971, 1972). Hochberg (1977) considers hypothesis testing for multidimensional jointly observed data. Hochberg and Tenenbein (1983) and Chen et al. (1984) extend double sampling to triple sampling, where only the true classifier is observed in one sample, only the fallible in another and both jointly in a third sample. See also Chen (1989) for a review of the related methods. Swensen (1988) considers the setting where register-based measure variables are fallible and survey variables are true. Haitovsky and Rapp (1992) study efficient sampling design of the validation sample beyond simple random sampling. Chen (1979) introduces the framework of log-linear models to the double sampling methodology, where one specifies a log-linear model of the misclassification probability matrix, as well as another log-linear model of the true classifications. The log-linear model framework facilitates maximum likelihood estimation (MLE) using the EM-algorithm (Dempster et al., 1977). See for example Chen (1979, 1989) and Espeland and Odoroff (1985) for applications of the EM algorithm to misclassification problems.
For situations involving two fallible classifiers without a true classifier, identifiability of model parameters requires additional assumptions. Hui and Walter (1980) assume the misclassification mechanisms of two diagnostic tests are the same for all the units in the population. A partition of the population is introduced, where the case prevalence varies across the subpopulations and the number of subpopulations is such that there are enough degrees of freedom to allow for parameter identification. Lie et al. (1994) study binary variables from two different health registers, both of which are subject to misclassification errors, under the assumption that the positive cases missed by either classifier will all be correctly classified by the other. Qiu et al. (2018) propose two models for confidence interval procedures of the population proportion. Under both models, they assume there are no false positives for either classifier.
Meanwhile, multiple fallible classifiers have been studied in situations involving repeated measures. For instance, for estimating gross labour flows, misclassification models of reinterview data have been considered by Abowd and Zellner (1985), Poterba and Summers (1986), Chua and Fuller (1987) and Singh and Rao (1995). Zhang (2005) proposes a special sparse misclassification model, which does not require reinterviews.
There is a huge body of literature on latent class analysis or structural equation models of multiple fallible classifiers, provided there are enough degrees of freedom in the observed data, which usually requires a longitudinal setting. Hidden Markov models (HMMs) are often applied for misclassification adjustment (e.g. Biemer & Bushery, 2000; Magidson et al., 2009; Van de Pol & Langeheine, 1997; Vermunt, 2010; Vermunt et al., 2007). See for example Yoon (2009) for a review in the context of biological sequence analysis. In particular, Pavlopoulos and Vermunt (2015) apply an extended HMM to survey and administrative register data on temporary employment. We will apply the HMM in an off-the-shelf manner to provide a comparison to the method developed in this paper.
The rest of the paper is organized as follows. The model we propose is developed in Section 2, together with the estimation methods. The HMM used for comparison is also briefly described. The application is presented in Section 3. Finally, Section 4 contains a summary and an outline of some topics for future research.

MODELS
Denote by U = {1, … , N} the population that the variables (X, Y ) are associated with, denoted by (X i , Y i ) for i ∈ U. Let X and Y both take values 1, … , K. Under the setting of Table 1, we observe two fallible classifiers X and Z (of the true Y ) jointly in a sample s, but only X outside s. Let R be the binary observation indicator, where R i = 1 if i ∈ s and 0 if i ∈ U ⧵ s. The selection mechanism of R may be unknown generally.
To focus on the central issues, we assume {X i ∶ i ∈ U} to be known for the whole population, which is the case in our application later. But the modelling approach developed below is also applicable in the case of double sampling, where the joint sample s is a subset of a larger probability sample of X from U.

Modelling given discriminant
We introduce our model for the setting of Table 2 in two steps. First, we introduce the discriminant B and the simplest assumptions of Z and R, so that Pr(Y|X, B = 2) is identifiable in the joint subsample where (B, R) = (2, 1), and Pr(Y|B = 2) can be estimated. This leads to two simple models, which contain some strong assumptions about the sample observation and misclassification mechanisms. Next, additional covariates are introduced to relax these assumptions, yielding a model that is more generally applicable.

For the classifier X, we assume there exists a known binary discriminant, denoted by B = 1 or 2, such that, in the population,

Pr(X = Y | B = 1) = 1.    (1)

The idea is simple. Given that X = Y conditional on B = 1, the joint subsample of (Z, X) is a validation sample of (Z, Y) where (B, R) = (1, 1). Provided suitable assumptions of R, so that Pr(Z|Y) is transportable from those with B = 1 to the others with B = 2, one would be able to disentangle the conditional distribution Pr(Y|X, B = 2) from the joint distribution Pr(Z, X|B = 2) in the subsample where (B, R) = (2, 1).

Figure 1 gives the independence graphs (Edwards, 2012) of two models of (Y, X, Z, B, R), where two groups of variables are independent of each other if they are unconnected in the graph, and two (groups of) variables are conditionally independent given the variables that separate them in the graph. In the terminology of Rubin (1976), R is missing completely at random (MCAR) under the first model M 0, and it is missing at random (MAR) given (X, B) under the second model M B. Under either model, Z is independent of (X, B, R) conditional on Y, that is,

Pr(Z | Y, X, B, R) = Pr(Z | Y).    (2)

In the terminology of Kuha and Skinner (1997), misclassification by Z is nondifferential with respect to (X, B, R) under (2).
Moreover, under either model, (Z, Y) are conditionally independent of R given (X, B), that is,

Pr(Z, Y | X, B, R) = Pr(Z, Y | X, B),    (3)

and likewise for Z alone by integrating out Y from Pr(Z, Y | X, B). The misclassification probabilities φ_zy = Pr(Z = z | Y = y) are said to be transportable (Kuha & Skinner, 1997) from B = 1 to B = 2 by virtue of (2). The probabilities Pr(Y|X, B = 2) are referred to as the calibration probabilities, denoted by θ_yx = Pr(Y = y | X = x, B = 2). The conditional probabilities Pr(Z|X, B, R = 1), which follow from (2) or (3) for the two subsamples given B = 1 or 2, are summarized in Table 3. Neither of the two classifiers is necessarily more accurate than the other in the subpopulation of B = 2.
MCAR for R is a strong assumption that may be unrealistic in many applications. When it comes to MAR for R in Figure 1, allowing B in addition to X is unlikely to be a useful relaxation of using only X to control for R, since the discriminant B is defined with respect to misclassification by X. On the other hand, the fact that X is a known 'proxy' of Y is usually favourable to the MAR assumption. In the extreme case, if X = Y, then R must be independent of Y given X. Or, heuristically speaking, whatever effect Y has on R, it would be largely controlled for given X if X contains much information about Y. Still, a reasonable approach is to introduce additional covariates, as is common in the literature on modelling survey nonresponse or nonprobability sample selection in the absence of misclassification, not least because this would also enable one to relax the assumptions that Pr(Z|Y) is the same for everyone in the population and that Pr(Y|X) is the same for everyone in the subpopulation of B = 2.

First, for the population calibration probabilities Pr(Y|X), we modify the discriminant assumption (1) to include the additional known covariates, denoted by A,

Pr(X = Y | A, B = 1) = 1.    (4)

Next, denote by M AB the model whose independence graph is given in Figure 2, where we allow A to be connected to all the other variables in the graph. Under M AB, misclassification by Z is nondifferential with respect to (X, B, R) conditional on A, that is,

Pr(Z | Y, X, A, B, R) = Pr(Z | Y, A).    (5)

Note that (5) is similar to (2), albeit with conditioning on A in addition. Moreover, similarly to (3),

Pr(Z, Y | X, A, B, R) = Pr(Z, Y | X, A, B).    (6)

The model M AB defined by (4), (5) and (6) thus encompasses the model M B defined by Equations (1)-(3). Of course, these assumptions of M AB may still not hold completely in applications. The sensitivity of the resulting estimator of the target proportions will be investigated later in the application as well as by a simulation study.

Estimation
Provided the sample size accommodates it, one may let A be a population stratification variable based on the relevant covariates. A so-called matrix method (Kuha & Skinner, 1997) follows immediately. For A = a, let Φ_a be the matrix of probabilities φ_zy|a = Pr(Z = z | Y = y, A = a), Λ_a that of λ_zx|a = Pr(Z = z | X = x, A = a, B = 2, R = 1) and H_a that of θ_yx|a = Pr(Y = y | X = x, A = a, B = 2). Given (B, R) = (2, 1), we have Λ_a = Φ_a H_a under the model M AB. Provided the inverse matrix exists, an estimator of H_a is given by

Ĥ_a = Φ̂_a^{-1} Λ̂_a,

where Φ̂_a is estimated from the subsample of (Z, X) = (Z, Y) given (A, B) = (a, 1), and Λ̂_a from the subsample of (Z, X) given (A, B) = (a, 2). Next, let π_X|ab be the vector of subpopulation proportions of X given (A, B) = (a, b), which is known for the population, and let

π̂_Y|a2 = Ĥ_a π_X|a2.

The estimator π̂_Y|a2 is easily consistent, as all the stratum sample sizes tend to infinity asymptotically. An estimator of the overall proportions is then given by

π̂_Y = Σ_ab (N_ab / N) π̂_Y|ab,

where N_ab is the stratum subpopulation size with (A, B) = (a, b), and π̂_Y|a1 = π_X|a1 by the discriminant assumption (4).
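As an illustration, the matrix method can be sketched in a few lines for a single stratum A = a. The function below is a hypothetical sketch (not the paper's implementation); it assumes classifiers coded 0, …, K−1 and simply inverts the estimated misclassification matrix of Z.

```python
import numpy as np

def matrix_method(z_val, y_val, z_adj, x_adj, p_x, K=2):
    """Matrix-method estimate of the proportions of Y given B = 2, one stratum.

    z_val, y_val : (Z, Y) from the validation subsample (B = 1), where Y is
                   taken equal to the observed X by the discriminant assumption.
    z_adj, x_adj : (Z, X) from the adjustment subsample (B = 2).
    p_x          : length-K vector of known population proportions of X
                   given (A, B) = (a, 2).
    """
    # Phi-hat: column y holds the estimated Pr(Z = z | Y = y, A = a).
    Phi = np.zeros((K, K))
    for z, y in zip(z_val, y_val):
        Phi[z, y] += 1
    Phi /= Phi.sum(axis=0, keepdims=True)

    # Lambda-hat: column x holds the observed Pr(Z = z | X = x, B = 2, R = 1).
    Lam = np.zeros((K, K))
    for z, x in zip(z_adj, x_adj):
        Lam[z, x] += 1
    Lam /= Lam.sum(axis=0, keepdims=True)

    # H-hat = Phi^{-1} Lambda: column x estimates Pr(Y = y | X = x, B = 2).
    H = np.linalg.solve(Phi, Lam)
    # Estimated proportions of Y given B = 2 in this stratum.
    return H @ p_x
```

Note that nothing constrains Ĥ_a to lie in [0, 1]; as stated above, the method also requires Φ̂_a to be invertible.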
We adopt model-based inference in this paper, where the possibly complex sampling design of s is ignorable conditional on the model covariates. Under the model M AB, we treat the realized stratum sample sizes n_x|ab as fixed, where n_x|ab is the number of units with (X, A, B, R) = (x, a, b, 1), and we treat the associated Z as random given B = 1 and the associated (Z, Y) as random given B = 2. This is justifiable, because R is conditionally independent of (Y, Z) given (X, A, B) and X is known for the population. From each subsample (x, a, b, 1), we draw n_x|ab units with the associated Z randomly and with replacement, to obtain a corresponding bootstrap replicate subsample. Repeating this separately for each combination of (x, a, b) then yields an entire bootstrap replicate sample, based on which we obtain a corresponding bootstrap replicate estimate of π_Y|+2 for the subpopulation with B = 2. The bootstrap variance estimator of π̂_Y|+2 can be obtained based on a sufficient number of repetitions of the procedure.
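The bootstrap procedure above can be sketched as follows, with hypothetical names; the point estimator is passed in as a function, and the cell sizes n_x|ab are held fixed by resampling within cells.

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_se(cells, estimator, n_boot=200):
    """Bootstrap standard error, treating the cell sizes n_{x|ab} as fixed.

    cells     : dict mapping (x, a, b) -> array of the Z values observed in
                that cell; each replicate resamples within every cell, with
                replacement, keeping the cell size fixed.
    estimator : function taking such a dict and returning a point estimate.
    """
    reps = np.empty(n_boot)
    for r in range(n_boot):
        boot = {key: rng.choice(z, size=len(z), replace=True)
                for key, z in cells.items()}
        reps[r] = estimator(boot)
    return reps.std(ddof=1)
```

In the application below, `estimator` would compute π̂_Y|+2 from the replicate cells via the matrix method.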
One can also consider MLE, using the stratum-specific misclassification probabilities φ_zy|a of Pr(Z|Y, A = a) and ψ_xy|a of Pr(X|Y, A = a, B = 2), with θ_y|a2 = Pr(Y = y | A = a, B = 2). The stratum likelihood is

L_a = ∏_{z,y} φ_zy|a^{n_zy|a1} ∏_{z,x} ( Σ_y φ_zy|a ψ_xy|a θ_y|a2 )^{n_zx|a2} ∏_x ( Σ_y ψ_xy|a θ_y|a2 )^{m_x|a2},

where n_zx|ab is the sample cell count given (A, B) = (a, b), and m_x|a2 is the corresponding out-of-sample total of X = x. As shown in Appendix A, the MLE of θ_y|a2 is the same as the matrix method estimator above. However, the MLE is also applicable given more parsimonious specifications of φ_zy|a and ψ_xy|a in terms of the covariates A when A is not a stratification variable, say, φ_zy|a = φ_zy(a, β) and ψ_xy|a = ψ_xy(a, γ) with respective parameter vectors β and γ. The likelihood is then given by L = ∏_a L_a.
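To illustrate the EM computation, the sketch below treats Φ_a as known (say, estimated from the validation subsample) and runs EM for the calibration probabilities θ_yx|a only. This is a simplified special case for illustration, not the full likelihood above; the function name and the fixed-Φ assumption are ours.

```python
import numpy as np

def em_calibration(n_zx, Phi, n_iter=200):
    """EM for theta[y, x] = Pr(Y = y | X = x, B = 2), given Phi.

    n_zx : K x K array of joint sample counts of (Z, X) given B = 2.
    Phi  : K x K known misclassification matrix, Phi[z, y] = Pr(Z=z | Y=y).
    """
    K = n_zx.shape[0]
    theta = np.full((K, K), 1.0 / K)                    # start from uniform
    for _ in range(n_iter):
        # E-step: posterior Pr(Y = y | Z = z, X = x) for every observed cell,
        # proportional to Phi[z, y] * theta[y, x].
        post = Phi[:, :, None] * theta[None, :, :]      # axes (z, y, x)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: reallocate the cell counts to the latent classes.
        theta = (n_zx[:, None, :] * post).sum(axis=0)   # axes (y, x)
        theta /= theta.sum(axis=0, keepdims=True)
    return theta
```

With Φ equal to the identity (Z a perfect classifier), the EM reproduces the observed conditional distribution of Z given X, as one would expect.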

Estimation using hidden Markov models
Here we briefly outline the HMM considered by Pavlopoulos and Vermunt (2015) and Pankowska et al. (2018), which we will later apply in an off-the-shelf manner to provide a comparison to the model M AB developed above.
Under an HMM, the true variable follows a latent Markov chain Y = (Y_1, … , Y_T), where T denotes the current time point of interest. This true sequence is unobserved, or hidden. At each time point one observes one or more fallible classifiers (or measure variables) that depend on the true variable. It is common to assume that the true and measure variables are independent across different units.
For the setting of two fallible classifiers in Table 1, we observe X t for everyone in the population at time t, as well as Z t for anyone in the sample at time t. Let R t = 1 if Z t is observed or 0 otherwise. We assume MAR for R t given the observed data, and the indicators R = (R 1 , … , R T ) can be omitted in the HMM path diagrams.
Consider the path diagram of HMM 0 in Figure 3, where X_t is conditionally independent of all the other variables given (Y_t, A), and similarly for Z_t. Notice that A is omitted in Figure 3 to avoid cluttering the diagrams, as it would be connected to all the variables. In the literature this is often referred to as the assumption of independent classification errors (ICE). The joint probability of (X, Z, Y) given A and R = r can be written as

Pr(X, Z, Y | A, R = r) = Pr(Y_1 | A) ∏_{t=2}^{T} Pr(Y_t | Y_{t-1}, A) ∏_{t=1}^{T} Pr(X_t | Y_t, A) ∏_{t: r_t=1} Pr(Z_t | Y_t, A).

Notice that, given the time points t = 1, … , T, the term involving Z varies according to r, that is, the times anyone is actually in the sample over time.
However, the ICE assumption is most likely too simplistic for our application later. In particular, delay or mistake of reporting is the cause of misclassification by X, so that whether an error has already occurred at t − 1 is likely to affect the misclassification probability of X_t, that is, whether the error is repeated or corrected at t. Consider instead the path diagram of HMM in Figure 3, where X_t is conditionally independent of the other variables given (Y_{t-1}, X_{t-1}) in addition to (Y_t, A). This type of HMM has been considered by Pavlopoulos and Vermunt (2015) and Pankowska et al. (2018). Here we use a simple specification for X_t given Y_t and I(X_{t-1} = Y_{t-1}). The joint probability of (X, Z, Y) given A and R = r can then be written as

Pr(X, Z, Y | A, R = r) = Pr(Y_1 | A) Pr(X_1 | Y_1, A) ∏_{t=2}^{T} Pr(Y_t | Y_{t-1}, A) Pr(X_t | Y_t, I(X_{t-1} = Y_{t-1}), A) ∏_{t: r_t=1} Pr(Z_t | Y_t, A).

For estimation, Pavlopoulos and Vermunt (2015) apply pseudo MLE which incorporates the sampling design weights. We adopt model-based inference in this paper. For parameter estimation under the HMM, we use the Baum-Welch algorithm (Baum et al., 1970; Vermunt et al., 2007); see Appendix B. We can incorporate the discriminant B where, at each iteration of the Baum-Welch algorithm, we simply set Pr(X_T = Y_T | B = 1, A) = 1 in the subpopulation of B = 1, in which case the model is denoted by HMM B.

FIGURE 3 Path diagrams for HMM 0 (left) and HMM (right), conditional on A
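As a minimal sketch of the likelihood computation underlying the Baum-Welch algorithm, the function below evaluates one unit's log-likelihood under the simpler ICE model HMM 0 by the scaled forward recursion. The time-homogeneous transition matrix and the treatment of missing Z_t are our simplifying assumptions for illustration.

```python
import numpy as np

def forward_loglik(x, z, pi0, P, px_given_y, pz_given_y):
    """Log-likelihood of one unit's sequences under HMM_0 (ICE assumption).

    x   : observed X_1..X_T (codes 0..K-1)
    z   : observed Z_1..Z_T, with None where R_t = 0
    pi0 : initial distribution of Y_1
    P   : K x K transition matrix, P[i, j] = Pr(Y_t = j | Y_{t-1} = i)
    px_given_y, pz_given_y : K x K emission matrices, entry [k, y] being
        Pr(X_t = k | Y_t = y) and Pr(Z_t = k | Y_t = y).
    """
    alpha = pi0.copy()
    loglik = 0.0
    for t, (xt, zt) in enumerate(zip(x, z)):
        if t > 0:
            alpha = alpha @ P                # propagate the hidden chain
        emit = px_given_y[xt, :].copy()      # X_t is always observed
        if zt is not None:                   # Z_t contributes only if R_t = 1
            emit *= pz_given_y[zt, :]
        alpha = alpha * emit
        c = alpha.sum()                      # scale to avoid underflow
        loglik += np.log(c)
        alpha /= c
    return loglik
```

The Baum-Welch E-step uses the same forward quantities together with a backward pass; the sketch is only meant to show how the term involving Z_t drops out at time points with R_t = 0.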
For variance estimation, we use basically the same bootstrap procedure described in Section 2.2. The only difference is that for each group of respondents in the last quarter with given (x_T, a, b), there is also a group of nonrespondents with the same (x_T, a, b), who have responded previously and contribute to the likelihood. Separate bootstrap resampling is applied to these last-quarter nonrespondents.

Data and setting
Statistics Norway publishes monthly register-based employment statistics, the chief source of which is an administrative service coordinated by the Norwegian Labour and Welfare Administration, the Norwegian Tax Administration and Statistics Norway. Since its introduction in 2015, all employers have been legally obliged to report all the contractual employer-employee relationships and various related payments every month. However, each month a certain number of reports are actually corrections of reports in earlier months, which may be due to delays or mistakes. Consequently, an erroneous employment relationship in the data at time t, say, may be removed at some later time point, whereas a missing employment relationship at time t may appear at some later time point. Take the binary variable of Employed or not, denoted by 1 or 0. The left half of Table 4 illustrates the measurement errors in the register Employed status. The reference time point is November 2018, where X is the Employed status based on the data that are available 2 weeks after, and Ỹ is that based on the data available 6 weeks after, which is the basis of the monthly publication. The proportion of Employed changes from 0.646 by X to 0.638 by Ỹ, given the corrections during the month between them.

TABLE 4 Register Employed status for November 2018, 2 weeks after November (X) or 6 weeks after November (Ỹ); LFS Employed status Z, September-November 2018
To provide a context, the difference is about four times the standard error of the Norwegian LFS estimator of the proportion of Employed. Note that such progressive reporting is the case for many administrative data sources on tax, benefits, migration, etc. The difference from one situation to another is merely the extent of the resulting measurement errors, rather than their existence. For instance, Zhang and Fosen (2012) examine the administrative sources for employment statistics which existed before the current service was introduced in 2015, where progressive measurement errors are noticeable even years after the reference time point.
A binary Employed status Z is available from the LFS. The Norwegian LFS sample is a quarterly rotating panel consisting of eight rotation groups, where close to 20,000 persons are surveyed every quarter. The design at the time of these data is geographically stratified single-stage cluster sampling, where the clusters are families in the Central Population Register. The LFS Employed status follows the ILO definition, which differs from the register Employed status based on contractual employer-employee relationships.
The register and the LFS sample can be linked at the individual level using the unique person identification number, which exists in many register-rich countries including Norway. The discrepancies between Z and X can be seen in the right half of Table 4. Clearly, discrepancies are also unavoidable between Z and the true register Employed status Y. Notice that we do not require Z to measure the same construct as X (and Y) in our approach, but simply model the misclassification error of Z statistically, where a misclassification error occurs if Z ≠ Y. However, because Z has a different definition from X (and Y), assuming the same Pr(Z|Y) for the whole population is unlikely to be realistic, which is why additional covariates A are needed, as remarked earlier in Section 2.
Since X and Z are available at about the same time, the question arises whether one can adjust the errors of X given the additional information provided by Z, even though Z is also subject to misclassification. Notice that, since Z is treated as fallible, the fact that the values collected in the LFS in September or October are not entirely concurrent with X for November is not an obstacle in principle, compared to the more important variance reduction from using the quarterly instead of the monthly LFS sample in this context.
Finally, the Norwegian LFS does suffer from survey nonresponse (e.g. Hamre & Heldal, 2013;Thomsen & Villund, 2011). Previous studies by Zhang (1999) and Zhang et al. (2013) suggest that nonresponse in the LFS is not MCAR, for example, the proportion of Z = 1 is most likely to be lower among the nonrespondents. This makes it necessary to model the selection mechanism of R, in addition to the misclassification mechanisms.
Below we shall first introduce (B, A) and then apply the model M AB to these data, to estimate the true proportion of register Employed Y . Provided this is possible, one may, for example, consider producing monthly flash estimates at an earlier time point than the current practice, whereas the completely register-based quarterly or yearly statistics can be published at a later time point, allowing more time for the progressive source to settle.
To apply the HMM for comparison, we use two successive quarterly LFS samples. This is the option requiring the least amount of extra data compared to applying the model M AB . Let T = 6 be the month of interest, such as November 2018 in Table 4. Instead of a separate misclassification mechanism for X 1 , we simply set X 0 ≡ Y 0 , which allows one to use the same model for X t given Y t and I(X t−1 = Y t−1 ), for all t = 1, … , T, under a model with fewer parameters. Pankowska et al. (2018) use the same approach. Finally, we shall let A be the same population stratification variable as for the model M AB .

Choice of (B, A)
For this study we have access to (X, Ỹ, Z) for June-November 2018. To define the discriminant based on the available data, we let B = 1 if an individual's register Employed status shows no change at all in terms of Ỹ for July-October and X for November, and B = 2 otherwise, where X for November and Ỹ for October both become available 2 weeks after November. The intuition is that the true status Y is less likely to change in November for someone with a stable status leading up to November, in which case the observed status X for November is also less likely to be erroneous. The population probabilities Pr(X|Ỹ, B) for B = 1 or B = 2 are shown in Table 5. It can be seen that the agreement between X and Ỹ is much better given B = 1 than B = 2. Since the variable Ỹ is naturally closer to the true Y, we take this to indicate that the misclassification errors of X are indeed much lower given B = 1 than given B = 2.
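This stability rule can be sketched as follows, with hypothetical array names; B = 1 requires an unchanged status over the preceding months together with agreement with the current X.

```python
import numpy as np

def discriminant(status_history, x_current):
    """B = 1 if the register status shows no change over the history months
    and agrees with the current X; B = 2 otherwise.

    status_history : (n, m) array, columns holding the register status for
                     the m months preceding the reference month
    x_current      : length-n array of X for the reference month
    """
    # no change across the history months
    stable = (status_history == status_history[:, :1]).all(axis=1)
    # and the current X continues the same status
    agrees = status_history[:, -1] == x_current
    return np.where(stable & agrees, 1, 2)
```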
It is unnecessary to be overly concerned with the chosen length of register history when defining B above, since making greater use of the relevant administrative data, such as those mentioned in Section 1, is likely more effective for further reducing the errors of X given B = 1. However, such data are not available to this study. Moreover, should Statistics Norway decide to produce monthly flash estimates, it would surely involve many other details that we cannot cover here. Thus, we consider the definition of B here to be acceptable for demonstrating the potential of the proposed methodology.
When it comes to the choice of additional covariates A with respect to nonresponse, Nguyen and Zhang (2020) evaluate empirically reweighting methods for nonresponse adjustment in the Norwegian LFS. The register Employed status is the most effective covariate in this respect. In addition, age, gender, level of education, county, income and nationality are found to be among the most relevant ones. Many of these variables are commonly mentioned in household survey nonresponse studies. After some experimentation with the available variables, we find including age (in addition to X) to be nearly as effective as other more elaborate choices. Notice that age is also an important aspect of the definitional difference between Z and Y. A parsimonious stratification by age is chosen as A, where A = 1 for age 15 to 24, A = 2 for age 25 to 49 and A = 3 for age 50 to 74. The left half of Table 6 gives the conditional distribution of X given Ỹ in the population according to the chosen A and B. The agreement between X and Ỹ remains much better given B = 1 in each stratum defined by A. The right half of Table 6 shows the conditional distribution of Z given Ỹ in the sample. Under the assumption (5), Pr(Z|Y, A = a) should be the same in each stratum by A whether B = 1 or B = 2. However, in light of Table 6, this assumption is unlikely to hold in reality. Indeed, comparing (Ỹ, X) to (Ỹ, Z) within each stratum of A, one can notice that (a) the observed probabilities of Z given Ỹ differ more between B = 1 and B = 2 than those of X given Ỹ, and (b) the misclassification errors by Z are greater than those by X.
Thus, it is intriguing to see whether useful adjustments for the misclassification errors of X given B = 2 can nevertheless be achieved by incorporating a second classifier Z that is less accurate (or weaker) than X itself, when the assumption that makes the misclassification mechanism of Z fully transportable from B = 1 to B = 2 is not quite true. Table 7 gives the data for the application of the model M AB, where the indicator R = 1 associated with the sample is suppressed to save space. Outside of the sample, where R = 0, we have the population counts of X given (A, B). In addition, to apply the HMM and HMM B, we need the quarterly LFS sample for June-August 2018 and the corresponding monthly X for June-October 2018.

Results
TABLE 7 Data for adjustments by model M AB, November 2018

The sub-population with B = 2 constitutes about 11% of the population for employment statistics. For simplicity, denote by θ_2 the true proportion of register Employed (Y = 1) given B = 2, and by θ̂_2^method its estimate using a given method. The estimates by Ỹ, X, model M AB, HMM and HMM B are shown in the last row of Table 8, where (A, B) = (+, 2) refers to the subpopulation of B = 2 and (A, B) = (+, +) refers to the whole population. Applying the model of Hui and Walter (1980) with the 3 subpopulations by A given B = 2 does not yield plausible adjustments of θ̂_2^X, which are omitted here.
The difference θ̂_2^MAB − θ̂_2^Ỹ = 0.6% given B = 2 is comparable to θ̂_1^X − θ̂_1^Ỹ = 0.5% given B = 1, which can be obtained from Table 5, and the estimate of the overall proportion using the model M AB (Table 8) is about as precise as θ̂_1^X is for θ_1 in the subpopulation of B = 1. Since the discriminant and transportability assumptions are not fully satisfied in this application, as noted before, these results suggest that the exact assumptions (4), (5) and (6) can potentially be replaced by the approximate conditions

Pr(X = Y | A, B = 1) ≈ 1 and Pr(Z | Y, A, B = 1) ≈ Pr(Z | Y, A, B = 2).    (8)

Given (8) one can apply the model M AB with a classifier Z that could be weaker than X, and obtain useful adjustments where X is worst according to the discriminant B.
The results of HMM differ considerably from those of HMM B, overall as well as within each stratum by A, where HMM B, which incorporates the discriminant B, does seem to be an improvement over HMM. Although it seems not straightforward to obtain good results here using these models, this does not mean that it is impossible to achieve better adjustments using other HMMs. Notice that the HMMs tend to place relatively strong assumptions on the 'time homogeneity' of the latent Markov transition and of both the misclassification mechanisms. Another concern, common to all latent class analysis, is that the model itself cannot tell which of the two estimated latent classes corresponds to Y = 1. We simply assume that, within each stratum, the total of misclassifications must be lower than that of correct classifications, in order to assign one latent class to Employed.
Finally, the estimated standard errors shown in the parentheses (Table 8) are computed from 200 bootstrap replicate samples. It is clear that the estimated standard errors of either HMM are dominated by their respective biases, so that the associated uncertainty is underestimated unless the bias is taken into account. Meanwhile, the estimated standard errors under the model M AB are much larger, because it uses a much smaller amount of data than the HMMs. Although the estimators under the model M AB here cannot be unbiased in truth, the confidence intervals based on the estimated standard errors are not unreasonable. For instance, the nominal 95% interval (0.509 ± 1.96⋅0.013) = (0.483, 0.535) seems quite likely to cover the true θ_2.

A simulation study
We include here a small simulation study with set-ups that are close to the application above. The aim is to explore the sensitivity of the $M_{AB}$-estimator to departures from the discriminant and transportability assumptions, as well as to check the performance of the associated bootstrap variance estimator.
Let the population and sample be those in stratum A = 1, as shown in Table 7, with the target proportions $(\theta_1, \theta_2) = (0.537, 0.441)$ given B = 1 or 2. Based on the observed sample of (Z, Ỹ) and the subsamples given B = 1 or 2, we obtain the observed classification probabilities of Z. We have $(\lambda_{11}^{B=1}, \lambda_{00}^{B=1}) = (0.999, 0.969)$ for the actual subpopulation probabilities Pr(X = Ỹ | Ỹ, B = 1). In terms of these values, three simulation set-ups (a), (b) and (c) are considered, which explore departures from the transportability assumption, the discriminant assumption, and both of them, respectively. The set-up (a) explores departures from the transportability assumption. The results are given in Figure 4, as box plots of the ratio $\hat\theta_2^{M_{AB}} / \theta_2$ for different combinations of $(\lambda_{11|2}, \lambda_{00|2})$, based on 10,000 simulations for each combination. The horizontal dashed lines mark the region where an estimate is closer to $\theta_2$ than $\hat\theta_2^{X}$ is. We notice the following.
1. The combination $(\lambda_{11|2}, \lambda_{00|2}) = (\lambda_{11}^{B=2}, \lambda_{00}^{B=2})$ is sandwiched between the first two box plots in the second panel from the left, according to which it is more likely than not that the model $M_{AB}$ would yield an improvement to $\hat\theta_2^{X}$ in the application, provided the discriminant assumption holds exactly. Despite a small departure from the discriminant assumption, where $(\lambda_{11}^{B=1}, \lambda_{00}^{B=1}) = (0.999, 0.969)$ instead of (1, 1), the actual $\hat\theta_2^{M_{AB}}$ is most likely closer to $\theta_2$ than $\hat\theta_2^{X}$ is, as discussed in Section 3.3.

2. The combination $(\lambda_{11|2}, \lambda_{00|2}) = (\lambda_{11}^{B=1}, \lambda_{00}^{B=1})$ is sandwiched between the last two box plots in the second panel from the right, according to which it is most likely that the model $M_{AB}$ would yield an improvement over $\hat\theta_2^{X}$, under a true model $M_{AB}$ that is close to the assumed one in the application. For instance, the results obtained for $(\lambda_{11|2}, \lambda_{00|2}) = (0.9, 0.9)$ suggest MSE$(\hat\theta_2^{M_{AB}})$ < MSE$(\hat\theta_2^{X})$ under this model $M_{AB}$; and similarly for $(\lambda_{11|2}, \lambda_{00|2}) = (0.85, 0.85)$ or (0.95, 0.95).

3. The estimator $\hat\theta_2^{M_{AB}}$ performs better than $\hat\theta_2^{X}$ when $\lambda_{11|2} \approx \lambda_{00|2}$ in all the panels. The likely reason is that $\lambda_{11}^{B=1} \approx \lambda_{00}^{B=1}$ in this simulation study. This suggests that the model $M_{AB}$ is likely more robust against departure from the transportability assumption, as long as $\lambda_{11|2} / \lambda_{11|1} \approx \lambda_{00|2} / \lambda_{00|1}$.

The set-up (b) explores departures from the discriminant assumption. The results from 10,000 simulations are given in Figure 5, as box plots of $\hat\theta_2^{M_{AB}} / \theta_2$ for different combinations of $(\lambda_{11}, \lambda_{00})$ in the subpopulation of B = 1, where the horizontal dashed lines mark the region where an estimate is closer to $\theta_2$ than $\hat\theta_2^{X}$ is. Clearly, provided the transportability assumption holds, the improvements of the estimator $\hat\theta_2^{M_{AB}}$ over $\hat\theta_2^{X}$, both in terms of bias and MSE, are quite robust against small departures from the discriminant assumption.
Indeed, one only needs to be concerned with the results where $|\hat\theta_1^{X} - \theta_1|$ is sufficiently small. For instance, the deteriorating results for $\lambda_{11} = \lambda_{00} = 0.95$ in the leftmost panel are not a worrisome issue in practice, because they imply Pr(X ≠ Y | B = 1) = 0.05, which is unlikely to be acceptable for a definition of the discriminant. Recall that in the application earlier, we have $|\hat\theta_1^{X} - \theta_1| = 0.005$, which is a property of the discriminant B that can be tracked and verified retrospectively over time.
The set-up (c) explores departures from both the discriminant and the transportability assumptions at the same time. For each combination of $(\lambda_{11}, \lambda_{00}, \lambda_{11|2}, \lambda_{00|2})$, the proportion over 10,000 simulations where $|\hat\theta_2^{M_{AB}} - \theta_2| < |\hat\theta_2^{X} - \theta_2|$ is indicated in Figure 6. The separate conclusions above remain largely the same under both types of departure at the same time. In particular, for the panels in the bottom-right corner, where the violation of the discriminant assumption is the least, the estimator $\hat\theta_2^{M_{AB}}$ outperforms $\hat\theta_2^{X}$ when the transportability assumption holds exactly, and the improvement is quite robust against departures from the transportability assumption as long as $\lambda_{11|2} / \lambda_{11|1} \approx \lambda_{00|2} / \lambda_{00|1}$.
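The mechanics of this kind of comparison can be sketched in a stylized binary form. The snippet below simulates a naive register-based estimator of $\theta_2$ alongside an estimator adjusted with a classifier Z whose accuracies are assumed known (as if estimated where transportable); it uses the scalar special case of the matrix-method inversion, Pr(Z = 1) = $\lambda_{11}\theta + (1 - \lambda_{00})(1 - \theta)$. All sample sizes and accuracies are invented for illustration, and the paper's actual estimator is the full $M_{AB}$ model, not this shortcut.

```python
import numpy as np

# Stylized comparison: naive X-based estimator vs. a misclassification-adjusted
# estimator using Z, in a subpopulation where X is a poor classifier.
# theta2 matches the stratum A = 1 value in the text; the accuracies of X and Z
# (x11, x00, lam11, lam00) and n are assumptions made for this sketch.
rng = np.random.default_rng(7)
theta2 = 0.441
lam11, lam00 = 0.90, 0.90      # Pr(Z=1|Y=1), Pr(Z=0|Y=0), taken as known here
x11, x00 = 0.80, 0.90          # Pr(X=1|Y=1), Pr(X=0|Y=0): X is badly biased
n = 2000

err_naive, err_adj = [], []
for _ in range(1000):
    y = rng.binomial(1, theta2, n)
    z = np.where(y == 1, rng.binomial(1, lam11, n), 1 - rng.binomial(1, lam00, n))
    x = np.where(y == 1, rng.binomial(1, x11, n), 1 - rng.binomial(1, x00, n))
    # binary matrix-method inversion of Pr(Z=1) = lam11*theta + (1-lam00)*(1-theta)
    adj = (z.mean() - (1 - lam00)) / (lam11 + lam00 - 1)
    err_naive.append(x.mean() - theta2)
    err_adj.append(adj - theta2)

rmse = lambda e: float(np.sqrt(np.mean(np.square(e))))
```

With these values the naive estimator carries a systematic bias while the adjusted one is approximately unbiased, so its RMSE is smaller despite the extra variance from the inversion.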
Lastly, we use the double bootstrap to investigate the performance of the proposed bootstrap variance estimator. At the outer level, the replicate sample of Z is simulated by parametric bootstrap; at the inner level, the bootstrap variance procedure described in Section 2.2 is applied to the simulated sample to yield an estimate of $V(\hat\theta_2^{M_{AB}})$, denoted by $\hat\sigma^2$ here. We consider two scenarios for the outer level. In scenario I, all the model assumptions are satisfied: Y has a Bernoulli probability $\theta_1$ given B = 1 or $\theta_2$ given B = 2, Z is generated from the Bernoulli distributions with $(\lambda_{11}, \lambda_{00})$, and X = Y if B = 1 while X is generated using $(\lambda_{11}^{B=2}, \lambda_{00}^{B=2})$ if B = 2. Scenario II is created in an ad hoc manner, where the transportability assumption is not satisfied. We fix Y to be the observed Ỹ in the population. We set X = Y if B = 1 and fix X as observed if B = 2. We generate Z from the Bernoulli distributions with $(\lambda_{11}, \lambda_{00})$ given B = 1, whereas given B = 2 we generate Z using $(\lambda_{11}^{B=2}, \lambda_{00}^{B=2})$, which are the observed conditional probabilities Pr(Z | X, B = 2). For each scenario we have 10,000 repetitions at the outer level, whereas the variance estimate $\hat\sigma^2$ at the inner level is based on 200 resamples, as in the application reported above. The results of this double bootstrap are given in Table 9, which summarizes the distribution of $\hat\sigma^2$ compared to $\sigma^2 = V(\hat\theta_2^{M_{AB}})$. We note that $\hat\sigma^2 = 0.00029$ in the stratum of A = 1 in the application reported earlier. In scenario I, the model assumptions are satisfied and $\sigma^2$ is the unconditional variance of $\hat\theta_2^{M_{AB}}$. The conditional bootstrap variance estimator is essentially unbiased in these simulations, where 9477/10,000 of the intervals $\hat\theta_2^{M_{AB}} \pm 1.96\hat\sigma$ contain the target parameter value $\theta_2$. Scenario II violates the transportability assumption; nor is $M_{AB}$ exactly the data generation model otherwise. Nevertheless, the bootstrap variance estimator remains essentially unbiased for the actual $V(\hat\theta_2^{M_{AB}})$.
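The double-bootstrap logic can be sketched compactly: an outer parametric-bootstrap loop generates replicate samples, an inner nonparametric bootstrap estimates the variance on each replicate, and the inner estimates are then compared with the actual outer-level variance and the resulting interval coverage. A sample proportion again stands in for the $M_{AB}$ estimator, and the loop sizes are deliberately small invented values, not the 10,000 × 200 used in the paper.

```python
import numpy as np

# Double bootstrap: outer parametric replicates, inner bootstrap variance.
# theta matches theta_2 in the text; n, R_outer and R_inner are illustrative.
rng = np.random.default_rng(3)
theta, n = 0.441, 800
R_outer, R_inner = 300, 100

est = np.empty(R_outer)
var_hat = np.empty(R_outer)
cover = 0
for r in range(R_outer):
    z = rng.binomial(1, theta, n)                 # outer: parametric replicate sample
    est[r] = z.mean()
    inner = np.empty(R_inner)
    for b in range(R_inner):                      # inner: nonparametric bootstrap
        inner[b] = rng.choice(z, n, replace=True).mean()
    var_hat[r] = inner.var(ddof=1)                # variance estimate for this replicate
    half = 1.96 * np.sqrt(var_hat[r])
    cover += (est[r] - half <= theta <= est[r] + half)

mean_var_hat = var_hat.mean()                     # average inner variance estimate
outer_var = est.var(ddof=1)                       # actual outer-level variance
coverage = cover / R_outer                        # empirical interval coverage
```

If the variance estimator is well calibrated, `mean_var_hat` tracks `outer_var` and the coverage is near the nominal 95%, mirroring the 9477/10,000 result above.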
We conclude that the potential bias due to model misspecification is a more critical element of the proposed adjustment method than the bootstrap variance estimator.

SUMMARY
In the above we have developed a modelling approach for adjusting two fallible classifiers jointly observed in a nonprobability sample. A key innovation is the introduction of the discriminant B, which allows one to separate out the part of the population where the first classifier X is much worse than in the rest of the population, and where misclassification adjustment is most effective for improving the estimation of the true classification. The bias caused by misclassification of X can be removed if a second classifier Z, together with X, satisfies the assumptions (4), (5) and (6) exactly. Admittedly, this may not be the case in reality, as is common with any treatment of non-sampling errors. The application demonstrates that useful adjustment can nevertheless be achieved when these assumptions are relaxed to (8), such that the proposed approach may potentially be helpful in many situations.
To implement the approach for producing official flash estimates that account for the errors of X arising from the progressive nature of the administrative source, the model $M_{AB}$ applied in Section 3 may be refined in two respects. First, we believe it is possible to improve the definition of the discriminant, by incorporating more extensively the relevant information available in the statistical data infrastructure at the NSO. Next, a more thorough process may be implemented to select the covariates A, in order to improve the transportability of the estimated misclassification mechanism of the second classifier Z. At the same time, one may look for a more parsimonious model specification, which can improve the trade-off between bias adjustment and associated variance.
Two issues are then worth attention in practice. First, to obtain a more truthful assessment of the uncertainty of the adjusted flash estimator, it will be useful to examine retrospectively the errors given by $e_t = \hat\theta_{2t}^{M_{AB}} - \hat\theta_{2t}^{\dot Y}$, where $\dot Y$ is the Employed status based on a sufficiently mature version of the register data, say, 3-6 months later than t. Analysis of $e_t$ over time may also suggest other possibilities for improving the flash estimator.
Next, while the Norwegian labour market is normally by no means volatile, shocks do occur from time to time due to global events such as a financial crisis or a pandemic. In particular, we plan to apply the model $M_{AB}$ to the data from 2020, when the labour market was subject to considerable dynamics due to Covid-19. It would be interesting to study both the level and the change estimates given by the flash-estimation methodology, in comparison with the LFS-based employment statistics, which are traditionally considered the leading indicator of changes in the labour market.

APPENDIX A. MLE UNDER M AB WITH STRATIFICATION VARIABLE A
Since A is a stratification variable, the likelihood can be maximized separately within each stratum by A, where we need to show that the MLE of $\theta_{y|2a}$ is given by the matrix method in Section 2.2. Thus, we can conveniently drop $a$ in the notation, that is, as if the population consisted of a single stratum. The likelihood can then be factorized into three terms, which pertain to the (Z, X)-subsample given B = 1, the (Z, X)-subsample given B = 2 and the population margin of X given B = 2, respectively. Similarly to Tenenbein (1972), re-parameterization and the invariance property of the MLE lead to the result. Write $\lambda_{zy} = \Pr(Z = z \mid Y = y)$, $\zeta_{xy} = \Pr(X = x \mid Y = y, B = 2)$ and $\eta_{yx} = \Pr(Y = y \mid X = x, B = 2)$. Since $\zeta_{xy}\theta_{y|2} = \eta_{yx}\theta_{x|2}$, we have $\sum_{y=1}^{K} \zeta_{xy}\theta_{y|2} = \big(\sum_{y=1}^{K} \eta_{yx}\big)\theta_{x|2} = \theta_{x|2}$ for the third term above, and $\sum_{y=1}^{K} \lambda_{zy}\zeta_{xy}\theta_{y|2} = \big(\sum_{y=1}^{K} \lambda_{zy}\eta_{yx}\big)\theta_{x|2} = \lambda^{*}_{zx}\theta_{x|2}$ for the second term. Since the first two terms of the likelihood refer to B = 1 and B = 2 separately, the MLEs of the parameters $(\lambda, \lambda^{*})$ are given by the corresponding subsample proportions of (Z, X), that is, the matrix method estimator of $(\Lambda, \Lambda^{*})$. Next, by the invariance of the MLE, the matrix method estimator $\hat H = \hat\Lambda^{-1}\hat\Lambda^{*}$ is the MLE of the matrix of $\eta$, and $\hat\theta_{Y|2} = \hat H\, \hat\theta_{X|2}$ is the MLE of $\theta_{Y|2}$. This completes the proof. ■
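The matrix-method computation that the appendix identifies as the MLE can be sketched numerically. Here $\Lambda$ holds Pr(Z | Y), $\Lambda^{*}$ holds Pr(Z | X), and the estimator is $\hat H = \hat\Lambda^{-1}\hat\Lambda^{*}$ applied to the X-margin. The probability matrices below are invented for illustration (in practice $\hat\Lambda$ and $\hat\Lambda^{*}$ come from the B = 1 and B = 2 subsample proportions of (Z, X)).

```python
import numpy as np

# Matrix-method estimator for the binary case (K = 2), with made-up probabilities.
# Columns of Lambda index Y, columns of H_true index X; rows index Z (or Y for H).
Lambda = np.array([[0.95, 0.10],    # Pr(Z=0|Y=0), Pr(Z=0|Y=1)
                   [0.05, 0.90]])   # Pr(Z=1|Y=0), Pr(Z=1|Y=1)
H_true = np.array([[0.92, 0.15],    # Pr(Y=0|X=0), Pr(Y=0|X=1)
                   [0.08, 0.85]])   # Pr(Y=1|X=0), Pr(Y=1|X=1)
Lambda_star = Lambda @ H_true       # Pr(Z|X) = sum_y Pr(Z|Y) Pr(Y|X)
theta_x2 = np.array([0.55, 0.45])   # known X-margin given B = 2 (assumed value)

# H_hat = Lambda^{-1} Lambda_star; solve() avoids forming the inverse explicitly.
H_hat = np.linalg.solve(Lambda, Lambda_star)
theta_y2 = H_hat @ theta_x2         # adjusted distribution of Y given B = 2
```

With exact input probabilities the inversion recovers H exactly; with subsample proportions, sampling error in $\hat\Lambda$ and $\hat\Lambda^{*}$ propagates through the inverse, which is the source of the larger standard errors discussed in the application.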

APPENDIX B. BAUM-WELCH ALGORITHM FOR HMM
The Baum-Welch algorithm is a special case of the EM algorithm, which uses the forward-backward algorithm in the E-step. Below we outline the algorithm in terms of the sample units, for which the notation is simpler. In practice, the different units are grouped by distinct paths, which constitute the sufficient statistic. Given sample unit $i$ at time $t$, let $\alpha_{it}$ and $\beta_{it}$ be the forward and backward quantities of the forward-backward algorithm, respectively. The forward sequence $\alpha_{i,1:T}$ is given by $\alpha_{i1}(j) = \pi_j \Pr(z_{i1} \mid j)$ for $t = 1$, and the recursive formula $\alpha_{it}(j) = \big(\sum_{k} \alpha_{i,t-1}(k)\, p_{kj}\big) \Pr(z_{it} \mid j)$ for $t = 2, \dots, T$, where $\pi$ is the initial latent distribution and $p_{kj}$ the latent transition probability.
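The forward recursion for one sample unit can be sketched as follows, using the standard HMM notation rather than the paper's (pi = initial latent distribution, P = latent transition matrix, E[j, o] = probability of observation o in latent state j). Scaling each step guards against underflow and yields the log-likelihood as a by-product; the example parameter values are made up.

```python
import numpy as np

def forward(obs, pi, P, E):
    """Scaled forward pass for one unit: returns alpha (T x K) and log-likelihood."""
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    c = np.zeros(T)
    alpha[0] = pi * E[:, obs[0]]              # t = 1: alpha_1(j) = pi_j * Pr(z_1 | j)
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]                          # scale to sum to 1
    for t in range(1, T):                     # recursion: alpha_t = (alpha_{t-1} P) * e_t
        alpha[t] = (alpha[t - 1] @ P) * E[:, obs[t]]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    return alpha, float(np.log(c).sum())      # log-likelihood from the scaling factors

# Two latent states (e.g. Employed / not Employed), one binary observed classifier.
pi = np.array([0.6, 0.4])
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
E = np.array([[0.95, 0.05],                   # Pr(obs=0 | state), Pr(obs=1 | state) by column
              [0.10, 0.90]]).T
alpha, loglik = forward([0, 0, 1, 1], pi, P, E)
```

The backward sequence is computed analogously in reverse, and the two together give the posterior state and transition probabilities used in the M-step.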