Algorithm to satisfy l‐diversity by combining dummy records and grouping

Universities and corporations frequently use personal information databases for diverse objectives, such as research and marketing. The use of these databases inherently intersects with privacy issues, which have been the subject of extensive research. Traditional anonymization techniques predominantly focus on removing or altering identifiers and quasi-identifiers (QIDs), the latter of which, although not unique, are closely correlated with individuals. However, this modification of QIDs can often impede data analysis. In this study, we introduce an innovative anonymization algorithm that combines the dummy-record addition technique with a grouping method while circumventing the modification of QIDs. This fusion reduces the number of dummy records required for effective anonymization, which is the principal contribution of this study. The proposed algorithm not only retains a high degree of data usefulness but also adheres to the l-diversity standard, a critical metric in privacy protection. The experimental findings demonstrate that the proposed method offers a more equitable balance between safety and utility than existing technological solutions.


Contribution
This research's most significant contribution lies in proposing an algorithm that achieves higher utility compared with existing studies. The proposed algorithm, which applies a grouping technique to an algorithm 21 known for its high accuracy, successfully reduces the number of added dummy records and further enhances accuracy. Using this algorithm, it is possible to conduct analyses at a high level while satisfying l-diversity, as demonstrated through verification experiments.
In addition, the proposed algorithm offers the following benefits:
1. Robustness against an increase in the number of data attributes is achieved via the dummy-record addition method.
2. The simplicity of the method allows adaptability to diverse scenarios, including varying data types and potential attackers.
3. The method remains effective regardless of data size, from small databases with minimal records and QIDs to large databases.
4. High data utility is attainable through the use of the analysis algorithm.
The remainder of this study is organized as follows: Section 2 offers the requisite background information, Section 3 delves into related works, and Section 4 elucidates the algorithm proposed herein. Sections 5 and 6 are dedicated to the empirical validation of this algorithm. Section 7 provides a conclusion to the study.

BACKGROUND
A personal information database comprises identifiers, quasi-identifiers (QIDs), and sensitive attributes. 22,23 Identifiers are attributes that uniquely identify an individual, such as name and address. Quasi-identifiers are attributes such as birth date and gender that, when aggregated, can uniquely identify an individual. Conversely, sensitive attributes refer to data that an individual would prefer to remain confidential, including purchase history, medical conditions, and religious beliefs. For illustrative purposes, Table 1 displays a segment of a disease database, categorizing data into identifiers (patient names), QIDs (gender, age, zip code), and a sensitive attribute (disease information).
In addition, let us consider a scenario wherein an organization holding personal information (hereinafter referred to as the data holder) attempts to furnish data to another organization (the data user). As providing unprocessed data could compromise patient privacy, data anonymization becomes imperative.
Subsequently, let us assume that the data user intends to scrutinize the relationship between a set of QIDs and a sensitive attribute. For example, the disease database shown in Table 1 is anonymized before being provided to a data user interested in determining the prevalence of lung cancer among different age groups and genders. The analysis algorithm proposed here is tailored for such investigations. Moreover, excessive QIDs in histogram creation can obscure analytical clarity; most histogram values may be nullified, rendering meaningful analysis unfeasible. As such, this study contemplates scenarios in which the data user designates a specific subset of QIDs for analysis. First, the data holder who possesses the original database uses an anonymization algorithm to anonymize it, creating an anonymized database, which is then made public or shared with the data user. Next, the data user uses an analysis algorithm to create inference histograms from the anonymized database, tailored to their respective needs, and then uses them. The general schema outlining the processes of anonymization and analysis in this study is depicted in Figure 1.

Threat model
The threat model in this study aligns with the assumptions mentioned earlier and follows the conventions of existing research. The data holder is postulated to be reliable, whereas the data user's trustworthiness remains variable. In the context of this study, a data user endeavoring to deduce the sensitive attribute values of individuals in an anonymized database is termed an "attacker." This attacker presumably possesses complete knowledge of all QID values for a specific individual and uses this information to predict sensitive attribute values. Furthermore, the term "attacker" is not limited to honest-but-curious scenarios; it also encompasses situations in which the attacker unintentionally infers sensitive information about an individual. Therefore, for the purposes of this study, all data users are potentially attackers.

k-anonymity
k-Anonymity 22,34 is one of the principal privacy indicators.
Definition 1 (k-anonymity). For a natural number k, a database T satisfies k-anonymity if the following holds: for any record r in T, there exist at least k − 1 other records whose combination of QID values is identical to that of r.
Definition 2 (QID group). A QID group is defined as a set of records with the same combination of QID values.
To prevent attackers from deducing sensitive attributes, the data holder releases a database that adheres to k-anonymity. Thus, even if an attacker possesses comprehensive knowledge of an individual's QID values, the individual's unique record remains indeterminate because at least k records share the identical QID combination. Table 2 displays a segment of the database extracted from Table 1 after a generalization-based anonymization technique was applied to achieve 2-anonymity. The generalization method, which categorizes QID values into groups, can be executed using diverse algorithms. 20,22,35 For instance, if an attacker attempts to identify Alice and knows her complete QID values ("female," "24 years old," "ZIP code 12345-6789"), they would determine that records A and B in Table 2 align with Alice's QID values. Consequently, the attacker fails to uniquely identify Alice's record. Records C, D, and E also present identical QID combinations, ensuring that all QID groups in the database have a minimum of two records. Therefore, this database complies with 2-anonymity.
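The k-anonymity condition of Definition 1 can be verified by a short sketch that counts records per QID combination. The toy records below are illustrative only (they mimic the shape of Table 2 rather than reproduce its exact values):

```python
from collections import Counter

def satisfies_k_anonymity(records, qids, k):
    """Check whether every QID combination occurs in at least k records."""
    counts = Counter(tuple(r[q] for q in qids) for r in records)
    return all(c >= k for c in counts.values())

# Illustrative records after generalization (not the paper's exact Table 2)
table2 = [
    {"gender": "F", "age": "20-29", "zip": "12345-*", "disease": "influenza"},    # A
    {"gender": "F", "age": "20-29", "zip": "12345-*", "disease": "influenza"},    # B
    {"gender": "M", "age": "30-39", "zip": "54321-*", "disease": "asthma"},       # C
    {"gender": "M", "age": "30-39", "zip": "54321-*", "disease": "lung cancer"},  # D
    {"gender": "M", "age": "30-39", "zip": "54321-*", "disease": "influenza"},    # E
]
print(satisfies_k_anonymity(table2, ["gender", "age", "zip"], 2))  # True
```

With k = 3 the same data fails, because the first QID group contains only two records.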
Definition 3 (l-diversity). For a natural number l, a database T satisfies l-diversity if the following holds: for any QID group in T, the maximum occurrence rate of a sensitive attribute value does not exceed 1∕l.
Although a database conforming to k-anonymity inhibits the unique identification of a target individual's record, it does not necessarily prevent the identification of sensitive attributes. To elucidate this, an attacker aiming to identify Alice in Table 2 cannot discern whether record A or B pertains to Alice; however, both records carry the same sensitive value ("influenza"), thereby rendering Alice identifiable as an individual afflicted with influenza. The l-diversity indicator prevents this type of attribute disclosure.
Table 3 presents a database that complies with 2-diversity. Two QID groups are evident: the first comprises records A and D, and the second includes records B, C, and E. In particular, the maximum occurrence frequencies of the sensitive attribute are 1∕2 and 1∕3 in the first and second QID groups, respectively. Therefore, the likelihood of inferring an individual's sensitive attribute value from this database does not exceed 1∕2. In this study, we employ l-diversity as a metric to assess the robustness of anonymization methods.
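Definition 3 likewise admits a direct check: group records by their QID combination and bound the relative frequency of the most common sensitive value. The sketch below collapses the QIDs into a single group label for brevity; the data only mirrors the two-group structure of Table 3:

```python
from collections import defaultdict

def satisfies_l_diversity(records, qids, sensitive, l):
    """Return True if, in every QID group, the most frequent sensitive
    value has a relative frequency of at most 1/l (Definition 3)."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in qids)].append(r[sensitive])
    return all(
        max(vals.count(v) for v in set(vals)) <= len(vals) / l
        for vals in groups.values()
    )

# Toy data shaped like Table 3: QID groups {A, D} and {B, C, E}
table3 = [
    {"qid": "g1", "disease": "influenza"},   # A
    {"qid": "g2", "disease": "asthma"},      # B
    {"qid": "g2", "disease": "influenza"},   # C
    {"qid": "g1", "disease": "asthma"},      # D
    {"qid": "g2", "disease": "lung cancer"}, # E
]
```

Here `satisfies_l_diversity(table3, ["qid"], "disease", 2)` holds, whereas l = 3 does not, since the first group has only two records.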

An algorithm that satisfies l-diversity
The algorithm for anonymization proposed by Xiao et al. 19 constitutes one of several approaches that fulfill l-diversity. Analogous to our method, the algorithm proposed by Xiao et al. incorporates identifiers and QIDs. However, it diverges by including multiple sensitive attributes. The database conceived by these authors resembles Table 1 but integrates "physician" as an additional sensitive attribute. As acknowledged by Xiao et al., existing research does not accommodate the variable security requirements across different sensitive attributes within a multiattribute database. For instance, when both "Disease" and "Physician" are classified as sensitive attributes, it stands to reason that "Disease" warrants a higher level of protection compared to "Physician." Nevertheless, current research methodologies safeguard both attributes at an indiscriminate security level. To address this gap, Xiao et al. developed a three-tier security system and introduced L sl -diversity, an extension of l-diversity that considers the security level. L sl -diversity allows for setting differentiated security levels according to the attribute. The authors stated that L sl -diversity offers a more balanced trade-off between utility and security compared with existing methodologies. Jayapradha et al. 41 introduced an algorithm denominated heap bucketization-anonymity (HBA), which emphasizes the interrelationship among attributes. HBA incorporates Anatomy, 25 a renowned anonymization strategy, to safeguard privacy. This is achieved by segmenting the data into QID tables (T QID ) and sensitive attribute tables (T SA ). When Anatomy is applied to Table 1 to achieve 2-diversity, the outcome manifests as Tables 4 and 5. The HBA workflow begins with preprocessing of the data. The effectiveness of HBA manifests itself in its resilience against a range of attacks, namely: (i) background knowledge attack, (ii) QID attack, (iii) membership attack, (iv) nonmembership attack, and (v) fingerprint correlation attack. Although HBA is implemented based on k-anonymity herein, it can satisfy l-diversity owing to its use of Anatomy.
Herein, we discuss the most recent advancement in anonymization technology, adapted Mondrian (AM), introduced by Dosselmann et al. 42 AM is a refined version of the earlier anonymization algorithm Mondrian. 22 Dosselmann and associates categorize sensitive attributes into "sensitive values" and "nonsensitive values" to minimize information loss. For instance, in the context of a sensitive attribute such as "Disease," "HIV" is designated as a sensitive value, whereas "common cold" is not considered so. Such categorization restricts the unique identification of individuals specifically carrying "HIV," thereby diminishing the necessity for extensive anonymization. As a derivative of Mondrian, AM presents appreciable merits, including accelerated execution speed, reduced memory consumption, and streamlined implementation. Although Mondrian is a more antiquated technique, research initiatives such as AM have continued to evolve. 16,20 Notably, Mondrian and its derivatives exhibit a measurable discrepancy in accuracy when compared to algorithms like Anatomy. 43 Sei et al. 21 introduce a method involving the addition of dummy records to safeguard privacy, thereby circumventing the need for QID processing or record deletion. These dummy records, possessing the same QID combination but divergent sensitive attribute values, are appended to the database. The sensitive attribute value in each dummy record is randomized and selected from values disparate from those of the actual record. This approach satisfies l-diversity by generating and incorporating l − 1 dummy records per actual record. As demonstrated in Table 6, which meets the 2-diversity criteria, this method was applied to Table 1. The pseudo-IDs in Table 6, denoted by apostrophes, are indiscernible from genuine records, thereby affirming the security of this approach. The method proposed by Sei et al. augments the utility of the database, as it eschews the processing of QIDs and the deletion of records. To mitigate the drawbacks of an excessive number of dummy records, Sei et al. also introduced a complementary analysis algorithm; combined with their anonymization technique, it enhances the utility of the database more effectively than competing methods. Nonetheless, their anonymization algorithm still adds an excessive number of dummy records, and the resulting loss of utility remains an unresolved challenge.
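The per-record dummy-addition idea can be sketched as follows. This is a simplified illustration rather than the authors' implementation; the attribute names and the use of sampling without replacement (so that the l − 1 dummies for one record carry distinct sensitive values, assuming |S| ≥ l) are our assumptions:

```python
import random

def add_dummies_per_record(records, sensitive, domain, l, seed=0):
    """Sketch: for each real record, append l - 1 dummy records that copy
    its QID values but carry distinct sensitive values different from the
    real one. Assumes len(domain) >= l."""
    rng = random.Random(seed)
    out = list(records)
    for r in records:
        others = [s for s in domain if s != r[sensitive]]
        for s in rng.sample(others, l - 1):  # distinct substitute values
            dummy = dict(r)
            dummy[sensitive] = s
            out.append(dummy)
    rng.shuffle(out)  # real and dummy records become indistinguishable
    return out
```

After this step, every QID combination is covered by at least l records with l distinct sensitive values, which is why the output satisfies l-diversity.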
Aminifar et al. 44 advanced a method that concurrently fulfills k-anonymity, l-diversity, and t-closeness. 45 Their methodology was articulated as a constrained optimization problem. Similar to the focus of this study, their target database encompasses identifiers, QIDs, and a single sensitive attribute. They posited a database with a binary sensitive attribute, categorized as either positive (protected) or negative (unprotected). In their approach, they established a QID space, allocated points corresponding to each record within this space, and grouped these points into QID clusters. Subsequently, the QID values of the records within these clusters were replaced by the centroid values of each QID group. This clustering process was optimized under constraints to yield a database satisfying k-anonymity, l-diversity, and t-closeness. In contrast, our proposed method likewise groups records but retains the original QID values. Within this QID processing paradigm, increasing the attribute count inherently escalates the volume of values requiring processing, thereby compromising the utility of the database. 16

Symbol definition
Let T and T′ denote the original and anonymized databases, respectively, and let the ith record in the database be r i (i = 1, … , N). Let G represent an ordered set of QID groups, and let G(i) represent both the ith QID group and the ordered set of records it contains, with g(i,j) denoting the jth record in G(i). Let S indicate an ordered set of possible sensitive attribute values s i (i = 1, … , |S|). As stated in Section 2, data users select the set of QIDs for analysis. Let Q represent an ordered set of QIDs selected by the data user, with Q(i) denoting the ith selected QID. Furthermore, the set of QID value combinations selected by the data user is denoted as C, and the ith combination of QID values in C is represented as c i (i = 1, … , |C|); the combinations in C are obtained from the domains of the selected QIDs Q(i). For instance, if S is {Influenza, Asthma, … }, then s 1 is Influenza. Suppose Alice's record is r 1 and the QID group G(1) includes Alice; then G(1) is {r 1 , … } and g(1,1) is r 1 . If Q(1) is gender, its domain is {Female, Male}. Furthermore, considering the combination of the QIDs gender and age, c 1 could be (Female, 0) and c |C| (Male, 99).

Anonymization algorithm
In this subsection, we introduce an anonymization algorithm that combines grouping with the dummy-record addition method proposed by Sei et al. Although the dummy-record addition method ensures l-diversity by adding l − 1 dummy records to every record, certain databases may thereby receive an excessive number of dummies. Table 7 illustrates a database that incurs overproduction of dummy records. Using the prevalent method of appending dummy records, satisfying 2-diversity requires 6 × (2 − 1) = 6 dummy records. Note that records A and C in Table 7 share identical combinations of QID values, as do records D and E. Essentially, QID groups exist a priori, eliminating the need for additional dummy records for these entries. Consequently, only a single dummy record (for record B) is required to fulfill the 2-diversity requirement in this particular case. The conventional method, which indiscriminately appends dummy records, generates five superfluous records. Reducing this surplus aligns the database more closely with its original state, thereby enhancing its utility. The algorithm proposed here therefore constructs QID groups before adding any dummy records. First, the records in the database are sorted to facilitate the grouping of QIDs. A record is incorporated into a QID group G(i) provided that it satisfies the following criteria:
1. The QID values of the records in G(i) precisely correspond to the QID values of the record in question.
2. The number of records contained in G(i) is less than l.
3. The sensitive attribute values of all records in G(i) differ from those of the record to be added.
If no QID group fulfills these conditions, a new QID group is instantiated, and the record is appended to this newly formed group.
Subsequently, dummy records are added to each QID group until the group comprises l records. The generation of these dummy records adheres to the established method: a dummy record possesses the same QID value combination as the records within its QID group but diverges in terms of its sensitive attribute value. The proposed method adds at most N × (l − 1) dummy records to the database, compared with exactly N × (l − 1) dummy records in the existing method. The proposed anonymization algorithm is presented as Algorithm 1.
The first segment of Algorithm 1 is devoted to the creation of QID groups and the sorting of records within the database. Function check(G, r i ) evaluates the existence of an appropriate QID group for r i that satisfies all three constraints described above. If such a group is present, the function returns "true" and r i is incorporated into that group by function add(G, r i ). If no such group is present, the function returns "false" and r i is incorporated into a new group (lines 9-11). The second half adds the necessary number of dummy records to each QID group. Lines 16-23 summarize the information for the dummy records to be added, and from line 24 onwards, the dummy records are created and added to the QID groups. Finally, shuffle(G) randomly retrieves all records from G; line 31 of Algorithm 1 assigns them to T′. At this stage, original and dummy records in T′ are indistinguishable, and T′ is output as the anonymized database.
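The grouping-then-dummy procedure can be condensed into the following sketch. It is a simplified reading of Algorithm 1, not the paper's implementation: the preliminary record sorting is omitted (groups are scanned linearly), the attribute names are illustrative, and |S| ≥ l is assumed so that enough distinct sensitive values exist:

```python
import random

def anonymize(records, qids, sensitive, l, domain, seed=0):
    """Sketch of Algorithm 1: build QID groups first (three criteria),
    then top each group up to l records with dummies carrying distinct
    unused sensitive values. Returns (records, dummies per group)."""
    rng = random.Random(seed)
    groups = []
    for r in records:
        placed = False
        for g in groups:  # check(G, r): the three grouping criteria
            same_qids = all(g[0][q] == r[q] for q in qids)
            if same_qids and len(g) < l and all(x[sensitive] != r[sensitive] for x in g):
                g.append(r)  # add(G, r)
                placed = True
                break
        if not placed:
            groups.append([r])  # open a new QID group
    out, dummy_counts = [], []
    for g in groups:
        used = {x[sensitive] for x in g}
        spare = [s for s in domain if s not in used]
        need = l - len(g)  # dummies required for this group
        dummy_counts.append(need)
        for s in rng.sample(spare, need):  # distinct sensitive values
            dummy = dict(g[0])
            dummy[sensitive] = s
            g.append(dummy)
        rng.shuffle(g)  # originals and dummies indistinguishable
        out.extend(g)
    return out, dummy_counts
```

On a Table-7-like input (two pre-existing 2-record QID groups plus one singleton), only the singleton's group receives a dummy, matching the single-dummy argument above.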
Here, the following theorem holds.
Theorem 1. A database anonymized by Algorithm 1 satisfies l-diversity.
Proof. By construction, every QID group G in the output contains exactly l records whose sensitive attribute values are mutually distinct. Therefore, the maximum occurrence frequency of a sensitive attribute value in QID group G is 1∕l. Because this condition holds for all QID groups, Theorem 1 follows from Definition 3. ▪
Here, we reasonably assume that |S| ≥ l. Under this assumption, the dummy-record addition in Algorithm 1 can always append up to l − 1 dummy records with distinct sensitive attribute values to each QID group, and the anonymization therefore satisfies l-diversity. If |S| < l, l-diversity cannot be satisfied in principle, by either our method or the existing methods.
Moreover, the subsequent analysis algorithm employs the "number of dummy records added per QID group" in its computations. Accordingly, the proposed anonymization algorithm yields not only the anonymized database but also the metric "number of dummy records per QID group." Illustrative examples are presented in Tables 8 and 9. For instance, by examining Group ID 3 in the aforementioned tables, where the number of dummy records is zero, an attacker may discern that values C and D are genuine. Nevertheless, given that the attacker aims to identify specific diseases, it remains indeterminate which records pertain to the targeted individuals. Importantly, the "number of dummy records per QID group" output does not compromise security, as assessed by the l-diversity criterion employed in this study as an indicator of security.
Furthermore, the running cost of Algorithm 1 is O(l × N), which is comparable to the computational complexity of the algorithm proposed by Sei et al. Jayapradha et al.'s algorithm is O(N) and requires fewer calculations than the proposed algorithm; however, the verification experiment section shows that the proposed algorithm is superior in accuracy. In this study, the assumption is not real-time anonymization but a one-time anonymization process after data collection.

Analysis algorithm
Herein, we introduce an apt analysis algorithm tailored to the proposed anonymization algorithm. The method for analyzing an anonymized database depends on the data user. When generating histograms from a database, the potential number of histograms grows exponentially with the number of attributes present. Evidently, preemptively creating all possible histograms before their dissemination to the data user is unfeasible. Therefore, we propose an algorithm that empowers the data user to construct and scrutinize histograms that feature attributes of interest from the received anonymized database. Upon receipt of the anonymized database, the data user autonomously selects QIDs and sensitive attributes for histogram creation. A histogram generated without employing an analysis algorithm merely represents the anonymized database and includes a plethora of dummy values along with original values, rendering it unsuitable for analytical purposes. The proposed analysis algorithm extrapolates a histogram devoid of dummy records from an anonymized database replete with them. Rather than isolating and excising dummy records, the algorithm adjusts the statistical values through inference and probability. This study posits that histograms generated by the data user invariably encompass sensitive attributes, whereas the inclusion of QIDs remains discretionary.
Using the analysis algorithm, we define three histograms: the original histogram H created from the original nonanonymized database, the anonymized histogram H′ created directly from the anonymized database containing dummy records, and the inference histogram H* that the analysis algorithm estimates from the anonymized database as an approximation of H.
We propose two analysis algorithms: a base analysis algorithm and an improved analysis algorithm.

Base analysis algorithm
This subsection delineates the base analysis algorithm. Within the anonymized database, for the records in any given QID group G(i), let F(G(i)) and D(G(i)) denote the numbers of original and dummy records, respectively. If F(G(i)) = l when the QID group is created, then D(G(i)) = 0 because no dummy record is required. Such a group mirrors the original database, which is devoid of dummy records. Conversely, if F(G(i)) = 1, the group mandates the incorporation of dummy records up to the maximal number D(G(i)) = l − 1. QID groups devoid of dummy records obviate the need to factor in their effects, whereas those containing dummy records engender inference histograms. These inference histograms, which factor in the count of dummy records, surpass the anonymized histograms in accuracy. Accordingly, the base analysis algorithm initially arranges the QID groups according to the number of included dummy records and subsequently constructs histograms based on these counts. An overview of this algorithm, common to both the base and improved analysis algorithms, is presented in Figure 2. To derive inference histograms, the algorithm performs computations on each per-dummy-count histogram, culminating in the output of the collective inference histogram. Let H′ i (i = 0, … , l − 1) represent the anonymized histograms created from QID groups with i dummy records. These histograms are generated from the anonymized database. The anonymized histogram generated from QID groups with no dummy records is H′ 0 , and that generated from QID groups with the maximum number of dummy records is H′ l−1 . The inference histograms H* i and the overall H* are generated from the inferred values described below.
Let H′ i denote a histogram generated from the QID groups with i dummy records in an anonymized database T′ that satisfies l-diversity after executing the proposed anonymization algorithm. Let x′(j,k) indicate the total count in H′ i of values for which the sensitive attribute value is s j and the QID value combination is c k . The inference histogram is then generated by estimating the number of true values as

x̂(j,k) = x′(j,k) × (l − i)∕l, (4)

where x(j,k) denotes the true value and x̂(j,k) the inferred value. The factor (l − i)∕l in Equation (4) corresponds to F(G(m))∕|G(m)| for a QID group G(m) containing i dummy records. The calculation in the base analysis algorithm considers only the numbers of dummy and original records. For instance, let us consider anonymization with l = 2. First, because H′ 0 involves no dummy records and requires no computation, it is output directly as x̂(j,k) = x′(j,k). Next, let us consider H′ 1 . Because each QID group contains one original record and one dummy record, we obtain x̂(j,k) = x′(j,k) × 1∕2, indicating that 1∕2 of each value in H′ 1 is counted from dummy records. Thus, all true values x(j,k) can be inferred as x̂(j,k) using Equation (4).
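The base algorithm thus amounts to scaling each per-dummy-count histogram by the genuine-record fraction (l − i)∕l and summing. A minimal sketch, in which histograms are represented as dicts keyed by (sensitive value, QID combination) — a representation of our own choosing:

```python
def base_inference(hist_by_dummy_count, l):
    """Base analysis sketch: scale each histogram H'_i (from QID groups
    with i dummy records) by (l - i)/l, then sum into the estimate H*."""
    estimate = {}
    for i, hist in enumerate(hist_by_dummy_count):  # index i = dummy count
        scale = (l - i) / l
        for key, count in hist.items():             # key = (s_j, c_k)
            estimate[key] = estimate.get(key, 0) + count * scale
    return estimate
```

For l = 2, a count observed in H′ 1 contributes only half its value, mirroring the worked example above.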

Improved analysis algorithm
This subsection describes the improved analysis algorithm. Although the base algorithm constructs histograms based on the number of dummy records in each QID group and subsequently produces inference histograms, the resulting inference histograms have limited utility. This limitation arises because they are derived using a rudimentary computational formula. To address this shortcoming, we introduce an improved analysis algorithm that yields more precise inference histograms. Analogous to its base counterpart, the improved algorithm also constructs histograms according to the number of dummy records in the QID group. Let us consider a histogram H′ i generated from the QID groups with i dummy records in an anonymized database T′ that satisfies l-diversity after executing the proposed anonymization algorithm. The total count x′(j,k) of values with the sensitive attribute value s j and QID value combination c k is the sum of the counts contributed by nondummy and dummy records. In the improved algorithm, the contribution of the dummy records is represented by its expected value, which involves the probability q that a particular sensitive attribute value is selected for a dummy record, as well as the counts of the other sensitive attribute values:

x′(j,k) = x̂(j,k) + q × Σ m≠j x̂(m,k). (5)

The likelihood q is derived as follows. Note that s j is never designated as the sensitive attribute value of a dummy record within a QID group already containing a record with s j . When a QID group that includes a record with sensitive attribute s m (m ≠ j) incorporates a dummy record, a sensitive attribute value distinct from s m is randomly selected; the probability of s j being selected is therefore 1∕(|S| − 1). In a QID group G(p) with i dummy records, the ratio D(G(p))∕F(G(p)) of dummy records to original records is i∕(l − i), such that the probability is

q = i ∕ ((l − i) × (|S| − 1)). (6)

Based on the count x̂(m,k) attributable to the original records with s m and the probability q, the expected number of dummy records with s j added to QID groups containing a record with s m can be evaluated. Since each x̂(m,k) can similarly be expressed through the expected values of the other sensitive attribute values, Equation (5) can be written for all s j (j = 1, … , |S|) in c k . Therefore, x̂(j,k) is inferred by solving the simultaneous linear equations in |S| unknowns. Following the same process, we can develop a set of equations using Equation (5) for k = 1 to k = |C| and generate H* i by solving |C| systems of simultaneous linear equations in |S| unknowns.
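As we read the derivation, for one QID combination the observed counts satisfy a linear system x′(j) = x̂(j) + q Σ m≠j x̂(m) with q = i ∕ ((l − i)(|S| − 1)), solvable by Gaussian elimination. The sketch below realizes the gauss() step with a textbook elimination routine; the list layout and function names are our assumptions, not the paper's code:

```python
def solve_gauss(A, b):
    """Gaussian elimination with partial pivoting (the gauss() step)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def improved_inference(observed, i, l):
    """For one QID combination c_k: observed[j] holds x'(j,k) from H'_i.
    Solve x'(j) = x_hat(j) + q * sum_{m != j} x_hat(m), with
    q = i / ((l - i) * (|S| - 1))."""
    n = len(observed)  # n = |S|
    q = i / ((l - i) * (n - 1))
    A = [[1.0 if j == m else q for m in range(n)] for j in range(n)]
    return solve_gauss(A, observed)
```

For i = 0 the system degenerates to the identity, so groups without dummies pass through unchanged, consistent with the base algorithm.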
The methodology of the improved analysis algorithm is encapsulated in Algorithm 2, which returns an inference histogram H* from an anonymized database T′ and a combination C of QIDs that the data user intends to use for histogram creation. Function create(T′, D, S, C, i) constructs a histogram involving the records in QID groups with i dummy records within T′. Function gauss(f(x)), where f(x) is Equation (5), solves the simultaneous equations.

Experimental data
For empirical validation, four publicly accessible datasets were extracted from the UCI machine-learning repository. 46 These include the Adult dataset (Adult), which is commonly employed in existing research and is based on United States Census data. Additional datasets include the Bank Marketing Dataset (Bank), which records sales activity in Portuguese banks; the Credit Dataset (Credit), which documents credit card transactions in Taiwan; and the Census-Income Dataset (KDD), which closely mimics real-world data. All datasets feature educational background among other sensitive attributes and contain multiple QIDs, such as gender and age. Data preprocessing involved the removal of records with missing or uniquely valued attributes. The formatted datasets are presented in Table 10. Histograms focusing on QIDs and sensitive attributes within these four datasets were generated. The key QIDs were age and gender, and the sensitive attributes are listed in Table 10. Subsequently, each dataset was subjected to anonymization, resulting in anonymized databases. These anonymized databases were further processed using the analysis algorithms to yield inference histograms.

Comparison algorithms
To assess the efficacy of the proposed algorithm, we juxtaposed its performance against that of the (l 1 , … , l q )-diversity algorithm, the analysis algorithm proposed by Sei et al., 21 and the HBA algorithm proposed by Jayapradha et al. 41 For clarity, the algorithms proposed by Sei et al. are henceforth referred to as LQDA. LQDA was chosen as a comparative method because it uses a dummy-record addition technique, and HBA was chosen because it represents a state-of-the-art approach. HBA attempts to achieve k-anonymity by merging the QID table and the sensitive attribute table; however, in this study, the focus is on l-diversity. Therefore, to optimize accuracy under l-diversity, the merging process was omitted, and the algorithm was executed with k = 1. For both LQDA and the proposed algorithm, the only parameter is l, which was set according to the experiments. The specific values assigned to l are explained in the following experimental descriptions. The methodology presented by Aminifar et al. 44 was deliberately excluded from this comparative analysis. This omission is predicated upon the method's fundamental incompatibility with our proposed anonymization algorithm. Notably, Aminifar et al.'s approach undertakes a binary classification of sensitive attributes, neglecting to secure attributes classified as negative. In addition, although AM, 42 a technology grounded in the Mondrian algorithm, represents a recent advancement, it was also excluded from this study. As discussed in the related works section, Anatomy possesses superior accuracy compared to Mondrian; therefore, we opted for a comparison with HBA, which is predicated upon the Anatomy algorithm.
In this study, both foundational and enhanced versions of the proposed anonymization algorithm were implemented, and their respective performances were evaluated. The methodologies employed in this comparison are listed in Table 11. All empirical tests were conducted under the following computational conditions: Intel (R) Core (TM) i7-7700 processor with a clock speed of 3.6 GHz, 8 GB of RAM, Windows 10 Pro 64-bit operating system, and Java as the programming language.

Evaluation index
To facilitate a quantitative assessment of the proposed algorithm compared with the benchmark algorithms, both the mean-squared error (MSE) and mean-absolute error (MAE) were computed between the original histograms and the corresponding inference histograms. The resulting errors increase with l, a pattern consonant with the inherent trade-off between data security and utility that characterizes anonymization procedures.
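The two evaluation indices can be computed directly from paired histogram bins. The sketch below shows the standard definitions of MSE and MAE over two equal-length histograms; the example bin counts are illustrative only.

```python
# MSE and MAE between an original histogram and an inference histogram,
# computed bin by bin (both are lists of bin counts of equal length).
def mse(h_orig, h_est):
    """Mean-squared error between two histograms."""
    return sum((a - b) ** 2 for a, b in zip(h_orig, h_est)) / len(h_orig)

def mae(h_orig, h_est):
    """Mean-absolute error between two histograms."""
    return sum(abs(a - b) for a, b in zip(h_orig, h_est)) / len(h_orig)

orig = [10, 20, 30, 40]   # illustrative original bin counts
est  = [12, 18, 33, 37]   # illustrative inferred bin counts
print(mse(orig, est))  # -> 6.5  (mean of 4, 4, 9, 9)
print(mae(orig, est))  # -> 2.5  (mean of 2, 2, 3, 3)
```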
In the specific context of the KDD dataset, which comprises a substantial number of attributes, the I-Proposal also manifests significant utility. Importantly, elevated performance metrics were observed for all methods on the KDD dataset compared with the other datasets. This observation can be attributed to the voluminous record count of the KDD. The proposed anonymization algorithm operates on the principle of maximum likelihood estimation; consequently, the error approaches zero as the database size expands.
Furthermore, as mentioned in Section 4.2, the computational complexity of the proposed anonymization algorithm is O(l × N). The most computationally demanding scenario in this study, the anonymization of the KDD dataset with l = 16, completed in approximately 25 min in our environment. This indicates that the running cost of the proposed anonymization algorithm is sufficiently realistic.

Evaluation against common privacy threats
As a confirmation against common privacy threats, Figure 5 presents the prediction accuracy of attackers on the anonymized datasets. The prediction accuracy for all four anonymization algorithms was consistently below the theoretical value of 1∕l, demonstrating that none of the algorithms pose privacy issues. It is noteworthy that the value for HBA falls clearly below 1∕l. This is attributed to the possibility of record deletion operations in HBA, leading to the inclusion of records that cannot be identified. Considering the trade-off between privacy and data utility, this indicates an overemphasis on privacy protection, which explains the lower data utility observed in the other experiments.
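The 1∕l bound can be illustrated with a small simulation. This is a hedged sketch, not the paper's attack model: it assumes an attacker who knows a target's QID group and guesses uniformly among the l distinct sensitive values observed in that group, so the expected success rate is 1∕l. The groups and true values below are invented.

```python
import random

def attacker_accuracy(groups, true_values, trials=10000, seed=0):
    """Simulate an attacker who knows the target's QID group and guesses
    one of the sensitive values observed in that group uniformly at random."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        i = rng.randrange(len(groups))        # pick a random target
        guess = rng.choice(groups[i])         # uniform guess within the group
        hits += guess == true_values[i]
    return hits / trials

# Two hypothetical QID groups satisfying l = 3 (three distinct sensitive values each).
groups = [["flu", "cold", "asthma"], ["hiv", "flu", "cancer"]]
truth  = ["flu", "cancer"]
acc = attacker_accuracy(groups, truth)
print(acc)  # close to 1/3
```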
FIGURE 5 Result of experiment against common privacy threats.

MSE and MAE with varying data size
The results for MSE and MAE when anonymizing data with varying dataset sizes are shown in Figures 6 and 7. In both cases, the values for the I-Proposal were consistently the smallest. This observation indicates that the I-Proposal ensures high accuracy across datasets of different sizes, from small to large. In addition, the results of this experiment align with the discussion on dataset size in Section 6.1, lending further consistency and support to the findings.

Upper limit of expected value of MSE
Here, as a theoretical analysis, we determined the upper limit of the expected value of the MSE. The MSE used in this study is given by Equation (7). The expected L2 error of LQDA, which is based on the dummy-record addition method, is denoted by E[L2 LQDA(N)]. Therefore, the expected MSE for LQDA, which targets only the sensitive attribute without considering QIDs, is given by E[L2 LQDA(N)] 2 ∕|S|. In the scenario of this study, a data user defines a QID set Q to be analyzed and generates a histogram based on C according to Equation (1). The number of records corresponding to each element c j of C is N j , and the average of the errors over all c j is the expected MSE. In addition, whereas LQDA adds l − 1 dummy records for every record, the I-Proposal adds at most l − 1. Therefore, the expected MSE for LQDA serves as an upper limit for the expected MSE of the I-Proposal. Considering these factors, the upper bound for the expected MSE of the I-Proposal is as follows:
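The bound described above can be written compactly as follows. This is a sketch reconstructed from the surrounding prose: E[L²_LQDA(N)] denotes the expected L2 error of LQDA from the cited analysis, and the averaging over the per-cell record counts N_j is suppressed.

```latex
\mathbb{E}\!\left[\mathrm{MSE}_{\text{I-Proposal}}\right]
\;\le\;
\mathbb{E}\!\left[\mathrm{MSE}_{\text{LQDA}}\right]
\;=\;
\frac{\mathbb{E}\!\left[L^{2}_{\mathrm{LQDA}}(N)\right]^{2}}{|S|}
```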

The experiments conducted in this study confirm that the obtained MSE values are consistently below the derived upper limit.

Other discussions
Table 12 delineates the number of QID groups for l = 5 across the various datasets. For instance, in the Adult dataset, 146 QID groups were present within H ′ 0 , indicating that the dataset sufficiently satisfied the l = 5 criterion through the aggregation of 730 records. Consequently, both the Adult and Bank datasets manifest a plethora of QID groups while requiring minimal insertion of dummy records, thereby enhancing accuracy. In contrast, the Credit and KDD datasets predominantly feature QID groups that require the addition of at least three dummy records. Accordingly, their accuracy is comparable with that of LQDA, an algorithm that relies solely on the dummy-record addition technique, because substantial groupings are scarcely formed. The Credit dataset exhibits marked variability in the distribution of QID values, whereas the KDD dataset contains a high cardinality of QIDs (38), complicating the formation of extensive QID groups. Nevertheless, the proposed algorithm retains its significance even in these challenging contexts, as it maintains, if not improves, accuracy compared with existing methodologies.
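The dummy-record savings discussed above follow from padding each QID group only up to l. The sketch below computes the per-group dummy count d = max(0, l − |G|) implied by Algorithm 1; the group sizes are illustrative (730 echoes the Adult example above, the rest are invented).

```python
def dummies_needed(group_sizes, l):
    """Dummy records required to pad each QID group up to l records
    (zero when the group already contains at least l records)."""
    return [max(0, l - g) for g in group_sizes]

# A large group needs no dummies; sparse groups need l - |G| each.
print(dummies_needed([730, 5, 3, 1], l=5))  # -> [0, 0, 2, 4]
```

By contrast, a pure dummy-record approach such as LQDA adds l − 1 dummies per record regardless of group size, which is what the grouping step avoids.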
The proposed algorithm operates on the premise of l-diversity and does not engage with the principles of differential privacy or t-closeness. As articulated in the introduction, each of these metrics is optimally suited to specific sets of challenges and contexts; they are not mutually exclusive but rather complementary. Thus, while the proposed algorithm

Theorem 1 .
If |S| ≥ l, then the anonymized database T ′ obtained by executing Algorithm 1 on database T satisfies l-diversity in all instances.

Proof. Let T ′ denote the anonymized database generated by executing Algorithm 1 on database T, and consider an arbitrary QID group G in T ′ . By construction, G comprises exactly l records. Because the records in G carry mutually distinct values of the sensitive attribute, the set of sensitive attribute values contained in G has l elements in all instances; hence, every QID group satisfies l-diversity.

Algorithm 1. Anonymizing algorithm for database T
1: Input: Database T, domain of a sensitive attribute S, and privacy level l
2: Output: Anonymized database T ′ , number of dummy records per QID group D
3: Create empty sets G, D
4: …
5: for i = 1, … , N do
6:   if … then …
⋮
15: for i = 1, … , |G| do
16:   d ⇐ l − |G(i)|
17:   D(i) ⇐ d
18:   Create empty sets S G and Q G
19:   for j = 1, … , |G(i)| do
20:     S G ← S G ∪ {g(i) j 's sensitive attribute value}
21:   Q G ← QID values that are the same as those in the records of G(i)
⋮
24:   for j = 1, … , d do
25:     s ← a sensitive attribute value randomly chosen from S G
26:     r ← create a dummy record with Q G and s
27:     S G ← S G − {s}
28: …
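The grouping-plus-dummy idea in Algorithm 1 can be sketched in a few lines. This is a simplified, hypothetical variant, not a faithful transcription: it pads each QID group with dummy records carrying sensitive values drawn from outside the group's observed values (so distinct-value l-diversity holds), and the grouping step that merges sparse QID combinations is omitted.

```python
import random
from collections import defaultdict

def anonymize(records, sa_domain, l, seed=0):
    """Simplified sketch: group records by QID, then pad each group with
    dummy records until it holds at least l distinct sensitive values.
    Returns the padded table and the per-group dummy counts D."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for qid, sa in records:                  # qid is a tuple of QID values
        groups[qid].append((qid, sa))
    out, dummy_counts = [], {}
    for qid, grp in groups.items():
        seen = {sa for _, sa in grp}         # sensitive values already in the group
        d = max(0, l - len(seen))            # dummies needed to reach l
        dummy_counts[qid] = d
        unused = [s for s in sa_domain if s not in seen]
        for s in rng.sample(unused, d):
            grp.append((qid, s))             # dummy record shares the group's QIDs
        out.extend(grp)
    return out, dummy_counts

records = [(("34", "M"), "flu"), (("34", "M"), "flu"), (("52", "F"), "cold")]
anon, d = anonymize(records, sa_domain=["flu", "cold", "hiv", "asthma"], l=2)
print(d)  # -> {('34', 'M'): 1, ('52', 'F'): 1}
```

A data user would then correct the histogram counts using the returned dummy counts D, as in the analysis algorithm's input below.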

1: Input: Anonymized database T ′ , number of dummy records per QID group D, QIDs Q, domain of a sensitive attribute S, and privacy level l
2: Output: The inference histogram H *
3: …

FIGURE 3 Result of MSE with varying l.
FIGURE 4 Result of MAE with varying l.

FIGURE 6 Result of MSE with varying data size (l = 5).
FIGURE 7 Result of MAE with varying data size (l = 5).
2. Anatomization of the table into T QID and T SA . 3. Calculating the correlation separately for both T QID and T SA .
TABLE 2 Disease database satisfying 2-diversity (QID table).
TABLE 3 Disease database satisfying 2-diversity (sensitive attribute table).
TABLE 4 Disease database 3.
Disease database generated by the anonymizing algorithm. Numbers of dummy records.
TABLE 12 Number of QID groups (l = 5).