Abstract

  1. Top of page
  2. Abstract
  3. Introduction
  4. A Multi-Layer Model Framework
  5. Inference Algorithms
  6. Case Study
  7. Data Privacy
  8. Anonymization Library
  9. Anonymization for Customer Experience Models
  10. Conclusion/Discussion
  11. References
  12. Biographical Information

Today, data collected by service providers can track an individual user's experience in detail, at flow or packet level in real time. However, we still lack analytics methods that can translate this information into a comprehensive and ever-evolving representation of the user experience. In this paper, we provide a layered dynamic model that addresses the problem of how to relate low-level network performance metrics to a user's perception of network service and their subsequent actions. Using time-stamped observations from networks, devices, and customer care, we build probabilistic models to link network performance to an inferred state of customer satisfaction, and then to explicit and implicit customer disengagement events. We provide inference algorithms for the model parameters, and report test results on synthesized datasets based on real, but incomplete, observations. We discuss how popular anonymization techniques such as data masking, encryption, k-anonymization, and differential privacy can be used to protect sensitive and private user data without impacting the user experience inference. © 2014 Alcatel-Lucent.


Introduction

Communications service providers (CSPs) are facing fierce competition as they strive to win consumer and enterprise business with their fixed, mobile, and video services. Intense economic pressures, escalating consumer demands, and increasingly complex technologies are raising the stakes, forcing service providers to work harder than ever to attract customers and keep them happy. Previous areas of competitive differentiation such as higher bandwidth, unique services, and device innovation have largely disappeared with the advent of ultra-broadband, multi-play services, third generation/fourth generation (3G/4G) smartphones, and over-the-top applications. Customer experience remains one key differentiator, since network reliability, coverage, care, provisioning, and billing all have an impact on the customer's perception of their service provider. A market segment has emerged for customer experience management (CEM) analytics to address this need.

In order to better understand factors that affect customer experience and to better manage the customer relationship, service providers have looked to mining the vast collections of data originated from network operations, customer care logs, and billing records. Typically the analysis focuses on a particular task, such as churn prediction [16], where the probability for someone to exit a service is modeled as a response to variables from the network such as number of dropped calls, quality of service indicators, as well as service usage, plan type, time to contract end, along with competitor offers and customer demographics. Some studies added social network connections [19] and complaint data [11] into the mix. Standard statistical methods like logistic regression and survival analysis have been applied to the task of prediction, as have machine learning algorithms. In some cases, estimates of customer lifetime value (CLV) [20] are factored into the cost model for churn prediction and loyalty management decisions [10]. The Net Promoter Score* [18] and various indicators from customer satisfaction surveys are also popular means for monitoring customer satisfaction. Statistical models have been developed to help operators respond to potential impacting factors [13].

Panel 1. Abbreviations, Acronyms, and Terms
3G—Third generation
4G—Fourth generation
AES—Advanced Encryption Standard
CEM—Customer experience management
CLV—Customer lifetime value
CSP—Communications service provider
GLM—Generalized linear model
IMEI—International mobile station equipment identity
IMSI—International mobile subscriber identifier
IP—Internet Protocol
KPI—Key performance indicator
MLE—Maximum likelihood estimator
MOS—Mean opinion score
NP—Non-deterministic polynomial-time
PII—Personally identifiable information
SSN—Social security number
TAC—Type allocation code
WNG—Wireless Network Guardian

Most studies in this area address the problem by using data accumulated up to a specific time and a single-stage model for the relevant prediction. The model can be one with interpretable parameters or a black box, general-purpose prediction algorithm. However, customer experience evolves over time as a result of instantaneous network performance, the customer's own activities and tolerance levels, as well as prior experience and the quality of care. In this work we therefore take a different approach that focuses on modeling the customer experience by building a unifying, dynamic multi-layer model. Based on this model, we can monitor customer experience in real time as well as anticipate customer engagement and disengagement actions, for example, increasing services or, respectively, churning to another carrier.

In the sections that follow, we first present the model by introducing each layer, then show an inference algorithm.

A Multi-Layer Model Framework

User experience is a subjective measure of how a customer feels about the service he or she receives. The service provider would like to quantify this measure, to monitor it, to learn how network performance affects it, and to anticipate customer actions based on it. Below we describe the model framework (shown in Figure 1) to achieve these goals. We start with the bottom layer, the network condition.

Figure 1. The multi-layer framework and an example of network condition and session performance affecting customer experience leading to customer behavior and action.

  • Layer 1. Network condition

    • Description. Overall network condition that affects customers in aggregate. For example: load and presence of congestion in a cell.

    • Measured by. Network key performance indicators (KPIs).

    • Remark. Network condition does not necessarily affect individual users to the same extent, because each user interacts with the network differently. For example user A may need to download a substantial amount of data during busy hours, in which case a congested cell will translate into a poor experience, while user B who accesses the network at non-peak hours may be rarely affected. This leads us to the second layer.

  • Layer 2. Individual session performance

    • Description. Session-specific network performance metrics experienced by individual users.

    • Measured by. A vector of quantities describing the session experience such as throughput, loss, and delay, as well as application-specific metrics, for example the length of stalls encountered when streaming video. We denote this measure for user i at time t by a vector x_i(t).

  • Layer 3. User experience

    • Description. Individual user experience as a function of time. Unobserved and to be inferred from other layers. The main goal of inference in the model.

    • Measured by. Opinion score s_i(t). It is the main quantity of interest in this model. It is a function of the individual session experience

      • equation image(1)

      with parameter θ which can be the weights of the multiple metrics if, for example, f(.) is a linear additive function. Therefore determining the functional form of f(.) and value of θ is the main goal of the sections that follow.

    • Remarks. The opinion score is closely related to the concept of mean opinion score (MOS) [2], which has been widely used as a measure of the quality of service in telecommunications. It is a function of key quantities relating to the quality of the transmission. Most available literature on mean opinion scores originated in the context of voice quality, and more recently, MOS has been defined for video quality as well. Conventionally, the mean opinion score is calibrated by customer surveys in tightly controlled experimental settings. However surveys are very costly to conduct and other ways to calibrate customer opinion are being investigated, leading to the next layers.

  • Layer 4. Customer behavior

    • Description. A short term observable customer-initiated action reflecting his/her satisfaction regarding his/her experience, such as canceling a slow download, or complaining about service quality by calling customer service.

    • Remarks. We denote such user behavior or the lack thereof as r_i(t) = 1 or 0. We can calibrate the latent opinion score with the following relationship (the parameter ϕ includes the intercept and linear coefficients if g(.) is of linear form)

      • equation image(2)
  • Layer 5. Customer action

    • Description. Customer-initiated changes that are longer term and more consequential than customer behavior, for example, changing service plans and churning.

    • Remark. In an unpublished Bell Labs study of churning customers from a United States (U.S.) mobile provider, a statistically significant relationship was detected between the decision to churn and the service quality, as measured in loss, delay, and throughput, that the user experienced in that month. Since churning is an important decision, it is usually not based on instantaneous experience, but on an accumulation of past experience. For example, a customer may churn only after he or she has had several bad experiences or unaddressed complaints. Therefore we consider customer actions to be dependent on the cumulative opinion score, i.e., the integral of s_i(u) over time up to t, rather than the instantaneous score s_i(t). That is (the parameter φ includes the intercept and linear coefficients if h(.) is of linear form),

      • equation image(3)

Note that both customer behavior and customer action may be influenced by other factors such as individual users' tolerance level or contract terms. We will consider these factors in the case study section.
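For orientation, one way to write relationships (1) through (3) that is consistent with the definitions above is sketched below. This is a reconstruction under stated assumptions rather than the published equations: the logit link is assumed because the inference section later reduces the model to a logistic regression, the action indicator a_i(t) is a hypothetical symbol introduced here, and the lower integration limit of 0 is assumed.

```latex
\begin{align}
  s_i(t) &= f\bigl(x_i(t);\,\theta\bigr)
    && \text{(opinion score from session metrics)} \\
  \operatorname{logit} P\bigl(r_i(t) = 1\bigr) &= g\bigl(s_i(t);\,\phi\bigr)
    && \text{(behavior from instantaneous score)} \\
  \operatorname{logit} P\bigl(a_i(t) = 1\bigr) &= h\Bigl(\int_0^t s_i(u)\,du;\,\varphi\Bigr)
    && \text{(action from cumulative score)}
\end{align}
```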

Inference Algorithms

The goal of inference for the model above is the latent opinion score. This involves inferring the function forms linking the second through the last layers f(.), g(.), h(.), as well as the model parameters θ, ϕ, φ. In this paper we assume that the function forms are known or can be approximated by widely used forms but the parameters need to be estimated. Specifically, the parameters can be estimated by maximizing the likelihood which equals

  • equation image(4)

More specifically, if we assume f(.), g(.), h(.) are linear functions and θ, ϕ, φ are the coefficients to be inferred, then we have

  • equation image(5)
  • equation image(6)

In (6) we replaced the integral in (3) with a summation, since the experience pertains to individual sessions k = 1, …, n_i(t), where n_i(t) is the total number of sessions user i initiated up until time t, and is therefore discrete in time. Below we do away with the continuous time t and use j to denote the session number. Note that each user has a different number of sessions.

The log-likelihood derived from (4) can then be written in discretized form (where the time t is replaced with session number j)

  • equation image(7)

This is equivalent to a logistic regression likelihood where the output variable is the vector of equation image the design matrix is equation image where the first columns of X correspond to the vectors of instantaneous experience metrics and the rest of the columns to the cumulative experience metrics. The maximum likelihood estimate of the parameters is then the maximum likelihood estimator (MLE) of the logistic regression. Notice that the parameter θ cannot be identified separately from φ or ϕ. As a result, the opinion score f(.) can only be inferred up to a linear transformation.
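The logistic regression view, and the identifiability remark in particular, can be illustrated with a minimal sketch. It assumes a logit link with an intercept and fits ticket indicators to three metric-specific scores by plain gradient ascent, which stands in for a GLM routine and is not the authors' implementation. The function names and the toy values ϕ = (−2, −0.5) and θ = (1, 1, 3) are hypothetical choices for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, steps=200, lr=1.0):
    """Maximize the Bernoulli log-likelihood by batch gradient ascent.

    X: list of feature rows (no intercept column); y: list of 0/1 outcomes.
    Returns [intercept, slope_1, ..., slope_p].
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(steps):
        grad = [0.0] * (p + 1)
        for row, target in zip(X, y):
            z = beta[0] + sum(b * v for b, v in zip(beta[1:], row))
            err = target - sigmoid(z)  # score contribution of one observation
            grad[0] += err
            for k, v in enumerate(row):
                grad[k + 1] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def simulate(n, theta, phi0, phi1, rng):
    """Toy data: metric scores in {-2..2}, tickets from an assumed logistic model."""
    levels = [-2, -1, 0, 1, 2]
    weights = [0.025, 0.135, 0.68, 0.135, 0.025]
    X, y = [], []
    for _ in range(n):
        row = rng.choices(levels, weights=weights, k=len(theta))
        score = sum(w * v for w, v in zip(theta, row))  # s = theta . x
        X.append(row)
        y.append(1 if rng.random() < sigmoid(phi0 + phi1 * score) else 0)
    return X, y


rng = random.Random(0)
X, y = simulate(2000, theta=(1, 1, 3), phi0=-2.0, phi1=-0.5, rng=rng)
beta = fit_logistic(X, y)
# The slopes estimate phi1 * theta, so only their ratios carry information on theta.
print([round(b / beta[1], 2) for b in beta[1:]])
```

Because the regression only sees the products of ϕ and θ, rescaling θ and dividing the slope in ϕ by the same constant leaves the likelihood unchanged, which is the sense in which the opinion score is recovered only up to a linear transformation.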

Accounting for Inhomogeneity

In the formulation above, we have assumed that the customers are homogeneous, i.e., given the same session performance, their behaviors and actions have the same probability distribution, so that φ and ϕ are the same for all customers. This can be easily extended to a case in which subpopulations exist, and the parameters differ across these subpopulations. If the grouping is known to the service provider, then the log-likelihood can be written as a sum of the log-likelihoods of the individual groups, and the MLE can be obtained in the same manner as above.

However, it is also possible that the grouping is unknown to the provider. For example, there may be a group of frequent customer service callers versus a group of infrequent callers, i.e., under the same problematic session performance, the first group is more likely than the second group to call in a trouble ticket. We do not directly observe which group a customer belongs to. In this case, if we assume that the number of groups is known (when it is unknown, one can start with two groups to see if such a grouping exists), we augment the dataset with the grouping

  • equation image

where G_i denotes the group membership of the i-th user. Let β be the model parameters to be inferred, including a combination of θ, ϕ, φ as well as the random effects associated with each subpopulation; then the augmented likelihood can be written

  • equation image(8)

where equation image is a matrix with the first few columns identical to X and the last column being the vector G.

To obtain the parameter estimates we use a Gibbs sampler [9]. We first sample β^(l) | G^(l), X, y from a multivariate normal approximation to the above generalized linear model (GLM) likelihood (8). Then we sample G^(l+1) | β^(l), X, y from

  • equation image(9)

The posterior mean of the parameters can then be estimated by alternately sampling from the above two full conditional distributions until the chain converges.
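The group-membership step of such a sampler can be sketched as follows. Since equation (9) is not reproduced here, the sketch uses the standard full conditional for a two-component mixture of logistic models, in which each group differs only by its intercept; this form, the known mixing proportions, and all function names are assumptions made for illustration.

```python
import math
import random


def log_sigmoid(z):
    """log(1 / (1 + exp(-z))), computed stably for either sign of z."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def user_loglik(tickets, scores, intercept, slope):
    """Log-likelihood of one user's 0/1 ticket history under a logistic model."""
    total = 0.0
    for r, s in zip(tickets, scores):
        z = intercept + slope * s
        # log P(r = 1) = log sigma(z); log P(r = 0) = log sigma(-z)
        total += log_sigmoid(z) if r else log_sigmoid(-z)
    return total


def membership_probs(tickets, scores, group_intercepts, slope, prior):
    """P(G_i = g | beta, x_i, r_i) for each candidate group g."""
    logs = [math.log(pi) + user_loglik(tickets, scores, a, slope)
            for a, pi in zip(group_intercepts, prior)]
    top = max(logs)  # subtract the maximum before exponentiating
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]


def sample_membership(probs, rng):
    """Draw one group index from the full conditional."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical groups: about 1 ticket vs. about 6 tickets per 30 days at a neutral score.
intercepts = [math.log((1 / 30) / (29 / 30)), math.log(0.2 / 0.8)]
heavy_caller = membership_probs([1] * 6 + [0] * 24, [0.0] * 30,
                                intercepts, slope=-0.5, prior=[0.8, 0.2])
print(heavy_caller, sample_membership(heavy_caller, random.Random(1)))
```

In a full sampler this draw would alternate with a draw of β given the current memberships; only the membership half is shown here.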

Case Study

In this section we present a case study in which the above model and associated inference algorithm are applied to monitoring user experience and predicting user actions.

Data

Our approach is designed to exploit an end-to-end dataset that contains data at different layers, observed over a sufficiently long duration, of say one month or more. In current practice, such data typically reside in different, unconnected systems. The integration of operational data and business data remains a challenge. However, there is increased effort in the industry to link and uncover value from existing data systems such that the availability of end-to-end datasets may soon become a reality. To demonstrate how our model can extract value from such a dataset, we proceeded with a sample of real-world network performances and activities from a large collection of users, and enhanced it with simulation. The simulation concatenates user activities to approximate an individual's activities across multiple days, and adds probabilistic responses of the individual in the form of complaints and churn events. We parameterize the simulator in a way that can produce several scenarios with different types and levels of variability.

Specifically, we are interested in the user's experience with data service in a mobile network. We do not have data on the network condition layer but luckily in this case study that does not affect our inference, since we have all the information needed on individual session performance. Data on the individual session performance are based on a large dataset from the Alcatel-Lucent 9900 Wireless Network Guardian (WNG) product. From per-user data session performance measures on a single day, we generate a 30-day history of session experiences for 17,000 customers by concatenating one-day experience from different customers. Metrics for user i on day t include daily average loss rate, round-trip delay and throughput:

  • equation image

The instantaneous opinion scores for user i on day t are calculated as a discretized function of x_i(t). Specifically, for each metric we let a metric-specific score take a value in {−2, −1, 0, 1, 2}, which respectively signify {“very unhappy,” “unhappy,” “neutral,” “happy,” “very happy”}. We specified that the top and bottom 2.5 percent of all measurements are assigned “very happy” and “very unhappy,” the next top and bottom 13.5 percent are assigned “happy” and “unhappy,” and the remaining 68 percent are “neutral.” Then we combine the scores by pre-specified linear weights θ = (1, 1, 3). The value of θ is chosen to be approximately proportional to that estimated from a proprietary internal study on the relationship of churn to these metrics. Note that we chose to discretize first and then combine linearly, instead of the reverse, because the distribution of each metric is highly non-normal and asymmetric with different means. Discretizing first serves as an approximate normalizing procedure. Such a way of assigning opinion scores is also consistent with conventional ways of scoring subjective experiences.
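The discretize-then-combine step can be sketched as below. The sketch assumes that quantiles are taken over all measurements of a metric, that loss and delay are scored with high values as bad and throughput with high values as good, and that the weights θ = (1, 1, 3) apply in the order (loss, delay, throughput); none of these details is stated explicitly above, and the function names are hypothetical.

```python
import bisect


def metric_score(value, sorted_values, higher_is_better):
    """Score one measurement in {-2, ..., 2} from its empirical quantile."""
    n = len(sorted_values)
    below = bisect.bisect_left(sorted_values, value)
    not_above = bisect.bisect_right(sorted_values, value)
    q = (below + not_above) / (2.0 * n)  # mid-rank quantile of the measurement
    if not higher_is_better:
        q = 1.0 - q  # flip so that a small q always means a bad experience
    if q < 0.025:
        return -2  # very unhappy: worst 2.5 percent
    if q < 0.16:
        return -1  # unhappy: next 13.5 percent
    if q <= 0.84:
        return 0   # neutral: middle 68 percent
    if q <= 0.975:
        return 1   # happy: next 13.5 percent
    return 2       # very happy: best 2.5 percent


def opinion_score(metric_scores, theta=(1, 1, 3)):
    """Linear combination of the metric-specific scores."""
    return sum(w * s for w, s in zip(theta, metric_scores))


population = list(range(1000))  # stand-in for one metric's sorted measurements
scores = (
    metric_score(999, population, higher_is_better=False),  # worst loss
    metric_score(500, population, higher_is_better=False),  # typical delay
    metric_score(900, population, higher_is_better=True),   # good throughput
)
print(scores, opinion_score(scores))
```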

Simulator

We assume that customer behavior is based on instantaneous opinion scores s_i(t), whereas the more consequential customer actions are based on a cumulative opinion score, i.e., the sum of the daily scores s_i(u) up to day t. We also assume that customers will voice their opinion right away. For example, when network experience is simulated on a daily aggregate basis, and the behavior to be simulated is calling in a trouble ticket, then as long as the call is placed on the same day the user experienced trouble in the network, the behavior is considered to be concurrent with network conditions. If not, a delay factor can be built into the simulator; however, that is outside the scope of this paper.

Two possible approaches have been considered in simulating the events: one is that an event is generated each time a threshold has been surpassed. This approach may be intuitive but it can generate problems for inference if there is one outlier, e.g., if no complaint is generated when experience metrics are obviously worse than at other times when users did call with complaints. Therefore we adopt an approach where we generate the events probabilistically. The probabilities are higher when the opinion score is lower.

In particular we simulate network-related customer care tickets r_i(t) = 0 or 1 based on the instantaneous opinion scores. Tickets are generated as follows

  • equation image(10)

Other events, if considered, can be generated in a similar fashion.

We simulate customer actions, specifically the event of churn based on the cumulative opinion score

  • equation image(11)
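A probabilistic event generator of the kind described above can be sketched as follows. Equations (10) and (11) are not reproduced here, so the logistic form, the coefficient values, and the function names are assumptions chosen only to illustrate tickets driven by the instantaneous score and churn driven by the cumulative score.

```python
import math
import random


def event_probability(score, intercept, slope):
    """Logistic event probability; a negative slope makes low scores riskier."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * score)))


def simulate_user(daily_scores, phi, varphi, rng):
    """Return (tickets, churn_day) for one user's history of daily opinion scores.

    phi and varphi are (intercept, slope) pairs for tickets and churn.
    The history stops on the day the user churns.
    """
    tickets, cumulative, churn_day = [], 0.0, None
    for day, s in enumerate(daily_scores, start=1):
        cumulative += s  # running sum of the daily opinion scores
        ticket = rng.random() < event_probability(s, *phi)
        tickets.append(1 if ticket else 0)
        if rng.random() < event_probability(cumulative, *varphi):
            churn_day = day
            break
    return tickets, churn_day


rng = random.Random(7)
history = [rng.choice([-5, -1, 0, 0, 0, 1]) for _ in range(30)]
tickets, churn_day = simulate_user(history, phi=(-3.4, -0.5),
                                   varphi=(-7.0, -0.1), rng=rng)
print(sum(tickets), churn_day)
```

Generating events from probabilities rather than from a hard threshold means that a bad day does not always produce a ticket, which matches the motivation given above for avoiding threshold rules.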

Simulation Scenarios

We discussed the possibility of subpopulations of customers, where customers are homogeneous within each subpopulation but behave or act differently across subpopulations. Below we describe how the simulator can generate these scenarios.

  • Baseline scenario: homogeneous. When customers are homogeneous, under the same session performance, they have the same rate of customer service events and the same rate of churn. We fix the value of ϕ and φ according to the desired service inquiry rate and churn rate.

  • Known subpopulation scenario: contract term random effect. Under the same experience, customers who are in a fixed contract may be less likely to churn than someone whose contract has expired. We simulate this scenario by including a contract term effect.

    • equation image(12)

    contract(i) indicates different contract status 1 through nG, nG being the total number of distinct groups or contract status.

    • equation image
  • Unknown subpopulation scenario: tolerance level random effect. When faced with the same experience, some users may be more likely to call customer service to complain than others. We call this difference the tolerance level effect and create this effect by including a corresponding random effect.

    • equation image(13)

    tol(.) = 1 or 2 for frequent/infrequent callers.

    • equation image

Results

We have data from a large U.S. mobile provider showing that the monthly churn rate is between two percent and four percent. We assume a monthly average churn rate of three percent. We do not have data on the average rate of customer service events and adopt a rate of 0.3 trouble tickets per month, or roughly one trouble ticket per quarter.

We use several means to verify the accuracy of the inference, the first of which is whether the multiplying parameters were recovered up to a scaling constant. More importantly, we wish to recover the opinion score up to a scaling and translation constant. To measure whether this is achieved, we use two measures: Pearson's correlation and concordance. Concordance is a widely used measure, which in this context is defined as the proportion of all pairs of scores where the pairs of inferred scores have the same ordering as the true scores. A concordance of 0.5 means that the inferred score does not reflect the ordering in the true score, whereas a concordance of 1 means that the inferred score fully recovers the ordering, making further decisions based on quantiles most precise.
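The concordance measure can be computed with a short sketch like the one below. The treatment of ties is an assumption, since it is not specified above: pairs tied in the true score are skipped because they define no ordering, and pairs tied only in the inferred score count as half agreement so that an uninformative score yields 0.5.

```python
import itertools


def concordance(true_scores, inferred_scores):
    """Share of comparable pairs whose inferred ordering matches the true ordering."""
    agree, comparable = 0.0, 0
    pairs = zip(true_scores, inferred_scores)
    for (t1, p1), (t2, p2) in itertools.combinations(pairs, 2):
        if t1 == t2:
            continue  # no true ordering to recover for this pair
        comparable += 1
        if p1 == p2:
            agree += 0.5  # inferred tie: counted as a coin flip
        elif (t1 < t2) == (p1 < p2):
            agree += 1.0
    if comparable == 0:
        raise ValueError("all true scores are tied")
    return agree / comparable


print(concordance([-2, 0, 1, 3], [-1.7, 0.2, 0.1, 2.5]))
```

This pairwise loop is quadratic in the number of scores, which is fine for a sample but would need a rank-based method for millions of user-days.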

In the baseline (homogeneous) scenario, the parameter θ is estimated by applying the glm function in R to the model induced by equation (7). The estimates are accurate up to a scaling constant. The correlation between the inferred score and the “true” score is 0.89, and the concordance is 0.85. Note that the correlation is not higher because the “true” scores are discrete, whereas the inferred scores are continuous for higher resolution monitoring (and can be discretized as desired).

For the contract term random effects, we generate nine subpopulations of sizes ranging from 50 to 6000. The random effects are generated from N(0, 1) on the logit scale. All random effects are correctly estimated (larger subpopulations with smaller standard errors). The correlation between simulated and inferred opinion score was 0.90 (see Figure 2). The concordance was 0.85.

Figure 2. A customer's 30-day history of network experience, opinion score trajectory, and simulated customer service events and (absence of) churn.

For the tolerance level random effects, we generate two hidden subpopulations of frequent callers and infrequent callers. An infrequent caller generates trouble tickets at the same rate as the “enriched” baseline scenario: approximately one ticket per 30-day period. In contrast, a frequent caller on average generates six tickets per 30-day period. We randomly assigned 20 percent of customers to the frequent caller group; the rest follow the infrequent caller rates. We applied the Gibbs sampler to this dataset. The Gibbs sampler converged quickly, and opinion scores were recovered with a correlation coefficient of approximately 0.90 and a concordance of 0.85. Group memberships can also be inferred as a byproduct of the Gibbs sampler.

We also explored inference results for when the models were misspecified, i.e., the random effects were ignored. When the contract term random effects were ignored and only the parameter θ was estimated by a naïve GLM, the correlation and concordance were the same as when the random effects were taken into account. This means that the main effects can be correctly estimated regardless of whether the random effects were included. The same was observed when we ignored the tolerance level random effects. This gives us confidence that even the basic (naïve) GLM can be applied to effectively monitor the opinion score. However, caution needs to be exercised when predicting customer actions. A user-specific random effect may be inferred from his/her prior behaviors/actions, if available.

Data Privacy

We used a simulator for the work above to address any lack of data. Below we discuss how we can approach the data problem by enhancing data privacy such that CSPs are more willing to share information given the right tools to anonymize user data.

CSPs have access to a lot of subscriber data (e.g., demographics, call records, location history, and calling patterns) and have an obligation to protect the privacy of their subscribers based upon the specific regulatory requirements in the country, region, or jurisdiction in which they operate. However the regulations do not mandate any particular anonymization technique and CSPs are free to use any one that meets their needs. Below we provide an overview of the most widely used anonymization techniques: data masking, k-anonymity [21], and differential privacy [7], and an anonymization library as a tool for providers.

Data Masking

Data masking refers to replacing a data item from domain A with a data item from another domain B. It is represented as a function f: A → B, called the masking function. Typically a data masking function f is used to replace the personally identifiable information (PII) (e.g., Internet Protocol (IP) address or phone number) in the data with non-identifiable values that cannot be mapped back to the PII, i.e., given a value f(b), an attacker cannot compute f⁻¹(f(b)).

Example: Table I provides an example of demographics for a fictional set of mobile subscribers along with their most frequented websites. The PII (phone number) in each row is masked by replacing the last four digits of the phone numbers with strings “xxxx” (of length four). The result of this data masking is shown in Table II. Note that it is not possible to obtain the phone numbers from the resulting strings without any additional information.

Table I. Unanonymized data.
Table II. Anonymized data.
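The phone number masking in the example can be sketched in a few lines. The phone number format is an assumption (any string whose last four characters are digits), and the function name is hypothetical.

```python
def mask_phone(phone):
    """Replace the last four digits of a phone number with 'xxxx'."""
    if len(phone) < 4 or not phone[-4:].isdigit():
        raise ValueError("expected a phone number ending in four digits")
    return phone[:-4] + "xxxx"


print(mask_phone("908-555-0123"))
```

Many distinct numbers map to the same masked string, which is why the original cannot be recovered without additional information.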

There are several classes of data masking functions and the choice depends upon the legal requirements and application that consumes the data:

  • Reversible. It is possible to obtain the original data f⁻¹(f(b)) from the masked data f(b) using some additional information, e.g., key-based symmetric encryption such as Advanced Encryption Standard (AES).

  • Irreversible. It is not possible to obtain the original data from the masked data, i.e., it is not possible to compute f⁻¹(f(b)), e.g., a salted hash function.

  • Syntactic. The output of the data masking function is syntactically compatible with its input. Such functions are typically used for generating anonymized data for software development and testing, e.g., replacing a social security number (SSN) with a random integer.

  • Semantic. The output of the data masking function preserves the desired semantic properties of the masked data, e.g., encrypting the last seven digits of an international mobile station equipment identity (IMEI) number while preserving the type allocation code (TAC).

  • Pros and cons. Typical data masking functions are simple and efficient, making them a good choice for anonymizing real time/streaming data and big data. They also meet the legal requirements in many countries. Unfortunately, privacy breaches due to the AOL* search data release [3] and Netflix* challenge [17] have shown that masking PII (e.g., an IP address) is not sufficient for preserving an individual's privacy, and it cannot be used for attributes such as location which may be required by an application.
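Two of the masking classes above can be sketched with the Python standard library. The first function is an irreversible salted hash. The second is a semantic mask that keeps the 8-digit TAC of a 15-digit IMEI; note that it swaps in a keyed hash (HMAC) for the encryption mentioned above, so it is not reversible, distinct devices can collide, and the check digit is not recomputed. The function names, the key handling, and the sample IMEI are illustrative assumptions.

```python
import hashlib
import hmac


def irreversible_mask(value, salt):
    """Salted SHA-256 digest of a value; the salt must be kept secret."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


def mask_imei_keep_tac(imei, key):
    """Keep the 8-digit TAC and replace the last seven digits with a pseudonym."""
    if len(imei) != 15 or not imei.isdigit():
        raise ValueError("expected a 15-digit IMEI")
    tac, tail = imei[:8], imei[8:]
    digest = hmac.new(key, tail.encode("ascii"), hashlib.sha256).digest()
    pseudonym = int.from_bytes(digest, "big") % 10**7  # seven pseudo-random digits
    return tac + f"{pseudonym:07d}"


masked = mask_imei_keep_tac("490154203237518", key=b"demo-key")
print(masked)
print(irreversible_mask("908-555-0123", salt=b"demo-salt"))
```

Because the TAC survives, analyses by device model still work on the masked data, which is the semantic property the masking is meant to preserve.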

k-Anonymity

Some seemingly unidentifiable data such as age, gender, ZIP code* and date of birth can, in combination, uniquely identify an individual [21]. Such a set of attributes, also known as a pseudo-identifier (or quasi-identifier), can be used in a linking attack, i.e., joining it with an external database (e.g., a voter registration list) that contains the same attributes of individuals. This can be prevented by k-anonymizing [21] the data. The principle of k-anonymity is to hide the secret in a set of k possible secrets: the values of the pseudo-identifier for an individual must be the same as those of at least k−1 other individuals in the data. We show the definition of k-anonymity from [21], for data corresponding to individuals represented as a relational database table.

Definition. Let T[A_1 … A_m] be a table of data where each row corresponds to an individual and QI be the set of columns that constitute a pseudo-identifier. T satisfies k-anonymity if ∀t ∈ T, ∃t_1, …, t_{k−1} ∈ T, distinct from t and from each other, such that t[QI] = t_i[QI] for i = 1, …, k−1.

Example. Table II shows a 3-anonymous form of Table I with respect to the quasi-identifier QI = {Age, Gender, ZIP}. For each row, there are at least two other rows that have the same values for the columns in QI. The sets of rows with the same values on QI are [{1, 2, 6}, {3, 4, 9}, {7, 8, 10}, {5, 11, 12}].
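The definition translates directly into a check that every combination of quasi-identifier values occurs at least k times. The sketch below assumes rows are dictionaries; the column names and sample rows are made up for illustration and are not the contents of Table II.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if each row shares its quasi-identifier values with at least k-1 other rows."""
    counts = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return all(size >= k for size in counts.values())


rows = (
    [{"age": "31–35", "gender": "*", "zip": "0790*", "site": "news"}] * 3
    + [{"age": "36–40", "gender": "F", "zip": "0791*", "site": "video"}] * 2
)
print(is_k_anonymous(rows, ["age", "gender", "zip"], k=2))
```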

Most algorithms for k-anonymity use generalization and/or suppression. Generalization refers to replacing a specific value with a less specific value, while suppression removes that value completely, e.g., in row three of Table II age “33” is generalized to age group “31–35” and gender is suppressed using the special character “*”. Since generalization and suppression reduce the quality of the data, the key challenge in k-anonymity is to keep the generalization and/or suppression to a minimum. We refer to this problem as the utility preserving k-anonymity problem. Unfortunately, the problem of optimum utility preserving k-anonymity is non-deterministic polynomial-time hard (NP-hard) [15], but a number of approximation algorithms [1, 4, 6] have been proposed that can work in practice.
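Generalization and suppression themselves are simple transformations, as the following sketch shows. The five-year buckets starting at 1 (1–5, 6–10, …) are inferred from the “31–35” example and are otherwise an assumption, as are the function names.

```python
def generalize_age(age, width=5):
    """Replace an exact age with its bucket, e.g. 33 -> '31–35'."""
    if age < 1:
        raise ValueError("this sketch only buckets ages of 1 and above")
    low = ((age - 1) // width) * width + 1
    return f"{low}–{low + width - 1}"


def suppress(_value):
    """Remove a value completely, leaving the placeholder character."""
    return "*"


row = {"age": 33, "gender": "M"}
anonymized = {"age": generalize_age(row["age"]), "gender": suppress(row["gender"])}
print(anonymized)
```

Wider buckets make it easier to reach a given k but discard more detail, which is the utility tradeoff described above.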

Pros and cons. The k-anonymity model is simple to understand, and it protects against direct and indirect identification. It also provides a configurable model (via “k” and generalization) to address the privacy and utility tradeoff. However, since generalization reduces the quality of the data, it may be difficult to create a single k-anonymous form that satisfies the requirements of two different applications. It is also criticized for only preventing attacks that are known, i.e., where the anonymizer knows what additional background knowledge an attacker has on an individual or a population, which is rarely the case.

Differential Privacy

One way of handling the problem above is by instituting a privacy model that is independent of any background knowledge that can be gained by an attacker. Differential privacy [7] achieves this by ensuring that a function's output over a dataset changes minimally due to the presence or absence of one individual in the dataset. This is typically achieved by adding random noise to the output of the function. There are several variants of differential privacy, and below we describe the most popular, ε-differential privacy.

Definition. A randomized function Q satisfies ε-differential privacy if for every pair of datasets D_1 and D_2 that differ by exactly one data item, and for every set S of possible outputs, Pr[Q(D_1) ∈ S] ≤ e^ε × Pr[Q(D_2) ∈ S].

While both data masking and k-anonymity state a condition on the released dataset, differential privacy states a condition on the function used to release the data.

Example. Given a dataset D of music preferences in a population, let the function Fb return the number of persons in D who like Mozart. A differentially private output of this function can be obtained by adding random noise drawn from a Laplace distribution with scale parameter λ = 1, i.e., Fb(D) + Lap(λ).
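
The Mozart example can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the mechanism used by any particular system: the dataset, the function names, and the parameter `epsilon` are hypothetical. A counting query changes by at most 1 when one person is added or removed, so Laplace noise with scale 1/ε gives ε-differential privacy; the λ = 1 of the example corresponds to ε = 1. The noise is drawn as the difference of two exponential variates, which follows a Laplace distribution.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential variates."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def count_likes(dataset, composer):
    """The exact (non-private) count: number of persons who like the composer."""
    return sum(1 for preferences in dataset if composer in preferences)

def private_count(dataset, composer, epsilon, rng):
    """Noisy count; a counting query has sensitivity 1, so the scale is 1/epsilon."""
    return count_likes(dataset, composer) + laplace_noise(1.0 / epsilon, rng)

# Hypothetical music preferences, one set per person.
population = [
    {"Mozart", "Bach"},
    {"Mozart"},
    {"Beethoven"},
    {"Mozart", "Beethoven"},
    {"Bach"},
]

rng = random.Random(0)  # seeded only to make the sketch repeatable
true_count = count_likes(population, "Mozart")
noisy_count = private_count(population, "Mozart", epsilon=1.0, rng=rng)
print(true_count, round(noisy_count, 2))
```

A single noisy answer may be off by a person or two, but the noise has mean zero, so aggregate statistics over large populations remain useful.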

Pros and cons. Differential privacy assures an individual that the output of a differentially private function over a population's data is nearly the same whether or not their data is used in the input to the function. There are several data analytics functions that can be computed in a differentially private way, e.g., a histogram, decision tree classifier [12], k-core clustering [8], network trace analysis [14], and frequent item set mining [5]. However, it only guarantees that an attacker cannot reliably determine whether an individual's data is in the dataset; it does not guarantee that an attacker will not learn any new information about an individual. It has also been criticized as being too restrictive, i.e., very few functions have been shown to have a differentially private form, and there is no known mechanism for obtaining a differentially private form of an arbitrary function.

Anonymization Library

We created a library of functions for anonymizing structured data to: 1) meet the legal requirements of anonymization and 2) ensure that the utility of the data is preserved for the analytic tasks at hand.

Anonymization requirements vary by country and by region. Since a single privacy model or anonymization technique cannot meet every possible requirement, we support all the previously described anonymization techniques while maintaining an architecture that is extensible to new anonymization methods.

The quality of the anonymized data must meet the requirements of the application, but these requirements vary from one application to another, so it is difficult to provide an optimum k-anonymizing algorithm a priori. Since it is easier to understand and specify the output of the anonymization than the value k, we allow the data owner to specify, through the anonymization selector, the acceptable level of anonymization for the various columns in the data; along with the anonymized data, we output the degree k that it achieves. Figure 3 shows an instance for five attributes: phone number, age, gender, ZIP code, and date of birth. The data owner wants to mask the phone number, ZIP code, and date of birth and k-anonymize age and gender. Based on the selections, the anonymization selector generates a configuration that specifies how each column in the data is to be transformed to achieve the desired anonymization. The configuration also includes additional parameters, such as the masking string for data masking and the generalization hierarchy and level of generalization for k-anonymity.

Figure 3. Anonymization selector.

Anonymizer. The anonymizer uses the configuration to create an array of data transformation functions, one function per column in the data. Using the input schema, it reads one record at a time from the input data and applies the corresponding transformation function to each column in the record to obtain an anonymized record. This approach is geared towards a row-oriented data model and works well with big data processing models such as MapReduce that process data serially. It also works well for anonymizing streaming data and helps to meet the legal requirements of some countries, which specify that data must be anonymized before it can be stored.
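
The record-at-a-time design described above can be sketched as follows. The configuration format, the method names (`mask`, `generalize_age`, `suppress`), and the sample record are hypothetical; the library's actual configuration syntax and interface are not given in this paper. The sketch only illustrates the idea of one transformation function per column applied to a stream of records.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Transform = Callable[[str], str]

def make_mask(masking_string: str) -> Transform:
    """Data masking: replace any value with a fixed masking string."""
    return lambda _value: masking_string

def make_age_generalizer(width: int) -> Transform:
    """Generalization: replace an exact age with a bucket of the given width."""
    def generalize(value: str) -> str:
        low = ((int(value) - 1) // width) * width + 1
        return f"{low}-{low + width - 1}"
    return generalize

def suppress(_value: str) -> str:
    """Suppression: drop the value entirely."""
    return "*"

# Hypothetical registry mapping a method name in the configuration to a factory.
FACTORIES: Dict[str, Callable[[dict], Transform]] = {
    "mask": lambda p: make_mask(p.get("masking_string", "XXXX")),
    "generalize_age": lambda p: make_age_generalizer(p.get("width", 5)),
    "suppress": lambda p: suppress,
}

def build_transforms(schema: List[str], config: Dict[str, dict]) -> List[Transform]:
    """Create one transformation function per column, in schema order."""
    return [
        FACTORIES[config[col]["method"]](config[col].get("params", {}))
        for col in schema
    ]

def anonymize(records: Iterable[List[str]], transforms: List[Transform]) -> Iterator[List[str]]:
    """Read one record at a time and apply the per-column functions to it."""
    for record in records:
        yield [fn(value) for fn, value in zip(transforms, record)]

schema = ["phone", "age", "gender", "zip", "dob"]
config = {
    "phone": {"method": "mask"},
    "age": {"method": "generalize_age", "params": {"width": 5}},
    "gender": {"method": "suppress"},
    "zip": {"method": "mask"},
    "dob": {"method": "mask"},
}
stream = iter([["9085550100", "33", "M", "07974", "1980-02-01"]])
anonymized = list(anonymize(stream, build_transforms(schema, config)))
print(anonymized)
```

Because `anonymize` is a generator that never looks beyond the current record, the same loop can sit inside a map task or in front of a storage layer for streaming input.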

The anonymizer can be extended and/or customized in two ways: 1) by using user-defined data transformation functions (conforming to a simple interface) in the array of functions applied to each record and 2) by providing user-defined generalization hierarchies for k-anonymization. Thus CSPs can fit custom generalization hierarchies or anonymization functions to obtain anonymized data that meets the utility requirements of their applications.
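
A user-defined generalization hierarchy might look like the sketch below. The two hierarchies (ZIP code prefixes and age buckets), the level numbering, and the function signatures are assumptions made for illustration; the library's actual extension interface is not specified here. Each hierarchy maps a value and a generalization level to a coarser value, and `functools.partial` fixes the level so the result can serve as an ordinary per-column transformation function.

```python
from functools import partial

def zip_prefix_hierarchy(value: str, level: int) -> str:
    """Level 0 keeps the ZIP code; each higher level hides one more trailing digit."""
    level = max(0, min(level, len(value)))
    return value[: len(value) - level] + "*" * level

# Assumed bucket widths for age levels 0, 1, and 2; any higher level suppresses.
AGE_BUCKET_WIDTHS = [1, 5, 10]

def age_hierarchy(value: str, level: int) -> str:
    """Generalize an age to the bucket width of the requested level."""
    if level >= len(AGE_BUCKET_WIDTHS):
        return "*"
    width = AGE_BUCKET_WIDTHS[level]
    if width == 1:
        return value
    low = ((int(value) - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"

# Fix the level chosen in the configuration to obtain single-argument functions.
generalize_zip = partial(zip_prefix_hierarchy, level=2)
generalize_age = partial(age_hierarchy, level=1)

print(generalize_zip("07974"), generalize_age("33"))
```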

Anonymization for Customer Experience Models

In our case study we used data masking to anonymize our data. Specifically, we applied a hash function to the PII, which is the user's international mobile subscriber identity (IMSI). In addition, we did not ask for other PII or pseudo-PII such as age, gender, or address, as they are not required by the current analysis. In a more refined version of the analysis where these variables are taken into account, we could use k-anonymity, for example by generalizing user ages to intervals. Depending on the amount of generalization, the parameter estimates may have increased uncertainty (i.e., wider confidence intervals). However, this is less of an issue if the model is not too richly parameterized for the amount of data available.
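
Hashing an identifier such as the IMSI can be sketched as below. The text does not say which hash function was used or whether it was keyed; the keyed SHA-256 (HMAC) construction, the key, and the sample IMSI values here are assumptions chosen for illustration. A keyed hash maps each IMSI to a stable pseudonym, so records belonging to the same user can still be linked over time, while someone without the key cannot simply hash every possible IMSI to reverse the mapping.

```python
import hashlib
import hmac

def pseudonymize_imsi(imsi: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for an IMSI using HMAC-SHA256 (assumed scheme)."""
    return hmac.new(secret_key, imsi.encode("ascii"), hashlib.sha256).hexdigest()

# Hypothetical key and test IMSIs; a real key would be kept by the data owner.
key = b"example-key-held-by-the-data-owner"
first = pseudonymize_imsi("001010123456789", key)
again = pseudonymize_imsi("001010123456789", key)
other = pseudonymize_imsi("001010123456780", key)

print(first == again, first == other)
```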

Conclusion/Discussion

We describe the design of a multi-layer, dynamic model for customer experience evolution. The model reflects the reality that customer experience is a subjective variable that is not directly observed, and it allows statistical inferences to be made on its values based on observable factors that either cause the experience or are consequences of it. We believe that this model will prove to be a useful tool for extracting business intelligence from the integrated repositories of operational and business data that many service providers are striving to develop. Depending upon the type of data (e.g., IP address or location) and the needs of the analytic task at hand, one of many anonymization techniques can be used to preserve the privacy of an individual without interfering with the analysis.

(Manuscript approved October 2013)

*Trademarks

  1. AOL is a registered trademark of America Online, Inc.
  2. Net Promoter Score is a trademark of Satmetrix Systems, Inc., Bain & Company, Inc., and Fred Reichheld.
  3. Netflix is a registered trademark of Netflix, Inc.
  4. ZIP code is a trademark of the United States Postal Service.

References

  • [1] G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, and A. Zhu, “Anonymizing Tables,” Proc. 10th Internat. Conf. on Database Theory (ICDT '05) (Edinburgh, UK, 2005), LNCS vol. 3363, pp. 246–258.
  • [2] M. Andrews, J. Cao, and J. McGowan, “Measuring Human Satisfaction in Data Networks,” Proc. 25th IEEE Internat. Conf. on Comput. Commun. (INFOCOM '06) (Oakland, CA, 2006).
  • [3] M. Barbaro and T. Zeller, “A Face Is Exposed for AOL Searcher No. 4417749,” New York Times, Aug. 9, 2006, p. A1, <http://www.nytimes.com/2006/08/09/technology/09aol.html?pagewanted=all&_r=0>.
  • [4] R. J. Bayardo and R. Agrawal, “Data Privacy Through Optimal k-Anonymization,” Proc. 21st Internat. Conf. on Data Eng. (ICDE '05) (Tokyo, Jpn., 2005), pp. 217–228.
  • [5] R. Bhaskar, S. Laxman, A. Smith, and A. Thakurta, “Discovering Frequent Patterns in Sensitive Data,” Proc. 16th ACM SIGKDD Internat. Conf. on Knowl. Discov. and Data Mining (KDD '10) (Washington, DC, 2010), pp. 503–512.
  • [6] J.-W. Byun, A. Kamra, E. Bertino, and N. Li, “Efficient k-Anonymization Using Clustering Techniques,” Proc. 12th Internat. Conf. on Database Syst. for Adv. Applic. (DASFAA '07) (Bangkok, Tha., 2007), LNCS vol. 4443, pp. 188–200.
  • [7] C. Dwork, “Differential Privacy,” Proc. 33rd Internat. Colloquium on Automata, Languages and Programming (ICALP '06) (Venice, Ita., 2006), LNCS vol. 4052, pp. 1–12.
  • [8] D. Feldman, A. Fiat, H. Kaplan, and K. Nissim, “Private Coresets,” Proc. 41st ACM Symp. on Theory of Comput. (STOC '09) (Bethesda, MD, 2009), pp. 361–370.
  • [9] A. E. Gelfand and A. F. M. Smith, “Sampling-Based Approaches to Calculating Marginal Densities,” J. Amer. Statist. Assoc., 85:410 (1990), 398–409.
  • [10] N. Glady, B. Baesens, and C. Croux, “Modeling Churn Using Customer Lifetime Value,” Eur. J. Oper. Res., 197:1 (2009), 402–411.
  • [11] J. Hadden, A. Tiwari, R. Roy, and D. Ruta, “Churn Prediction Using Complaints Data,” Internat. J. World Acad. Sci., Eng., Technol., 19 (2008), 809–814.
  • [12] G. Jagannathan, K. Pillaipakkamnatt, and R. N. Wright, “A Practical Differentially Private Random Decision Tree Classifier,” Proc. IEEE Internat. Conf. on Data Mining Workshops (ICDMW '09) (Miami, FL, 2009), pp. 114–121.
  • [13] D. R. Jeske, T. P. Callanan, and L. Guo, “Identification of Key Drivers of Net Promoter Score Using a Statistical Classification Model,” Efficient Decision Support Systems—Practice and Challenges from Current to Future (C. Jao, ed.), InTech, Rijeka, Cro., New York, 2011, Chapter 8.
  • [14] F. McSherry and R. Mahajan, “Differentially-Private Network Trace Analysis,” Proc. ACM SIGCOMM Conf. on Data Commun. (SIGCOMM '10) (New Delhi, Ind., 2010), pp. 123–134.
  • [15] A. Meyerson and R. Williams, “On the Complexity of Optimal k-Anonymity,” Proc. 23rd ACM SIGMOD-SIGACT-SIGART Symp. on Principles of Database Syst. (PODS '04) (Paris, Fra., 2004), pp. 223–228.
  • [16] M. C. Mozer, R. Wolniewicz, D. B. Grimes, E. Johnson, and H. Kaushansky, “Predicting Subscriber Dissatisfaction and Improving Retention in the Wireless Telecommunications Industry,” IEEE Trans. Neural Networks, 11:3 (2000), 690–696.
  • [17] A. Narayanan and V. Shmatikov, “Robust De-Anonymization of Large Sparse Datasets,” IEEE Symp. on Security and Privacy (SP '08) (Oakland, CA, 2008), pp. 111–125.
  • [18] F. Reichheld, The Ultimate Question: Driving Good Profits and True Growth, Harvard Business School Press, Boston, MA, 2006.
  • [19] Y. Richter, E. Yom-Tov, and N. Slonim, “Predicting Customer Churn in Mobile Networks Through Analysis of Social Groups,” Proc. SIAM Internat. Conf. on Data Mining (SDM '10) (Columbus, OH, 2010), pp. 732–741.
  • [20] S. Rosset, E. Neumann, U. Eick, and N. Vatnik, “Customer Lifetime Value Models for Decision Support,” Data Min. Knowl. Discov., 7:3 (2003), 321–339.
  • [21] L. Sweeney, “k-Anonymity: A Model for Protecting Privacy,” Internat. J. Uncertain. Fuzziness Knowledge-Based Syst., 10:5 (2002), 557–570.

Biographical Information

SINING CHEN is a member of technical staff in the IP Platforms Research Program at Bell Labs in Murray Hill, New Jersey. She received a B.S. in applied mathematics from Tsinghua University, Beijing, China, and a Ph.D. in statistics from Duke University, Durham, North Carolina. Her research interests include Bayesian methods and forecasting. Prior to joining Bell Labs, she was an associate professor at the Department of Biostatistics, Rutgers Biomedical and Health Sciences.

TIN KAM HO leads the Statistics of Communication Systems Research Activity in Bell Labs at Murray Hill. She pioneered research in multiple classifier systems, random decision forests, and data complexity analysis, and pursued applications of automatic learning in many areas of science and engineering. She also led major efforts on modeling and monitoring large-scale optical transmission systems. Recently she worked on wireless geo-location, video surveillance, smart grid data mining, and customer experience modeling. Her contributions were recognized by a Bell Labs President's Gold Award and two Bell Labs Teamwork Awards, a Young Scientist Award in 1999, and the 2008 Pierre Devijver Award for Statistical Pattern Recognition. She is an elected Fellow of IAPR (International Association for Pattern Recognition) and IEEE, and served as editor-in-chief of the journal Pattern Recognition Letters from 2004 to 2010. She received a Ph.D. in computer science from State University of New York (SUNY), Buffalo.

AVINASH VYAS is a member of technical staff in Bell Labs' IP Platforms Research Program in Murray Hill, New Jersey. For the past 12 years he has worked on personalization, privacy, and other data management issues in telecommunication services (e.g., location-based services). He is also interested in Extensible Markup Language (XML) data management, distributed systems, and programming languages. He received his master's degree in computer science and engineering from the Indian Institute of Technology, Kanpur, India, and his Ph.D. in computer science from University of California at San Diego.

JIN CAO is a distinguished member of technical staff in the IP Platforms Research Program at Bell Labs, Murray Hill, New Jersey. Dr. Cao earned a B.A. in applied mathematics from Tsinghua University, China, and a Ph.D. in statistics from McGill University, Montreal, Canada. Her thesis was on the statistical analysis of brain images. Since joining Bell Labs, Dr. Cao has done research in various areas, mostly focusing on statistical problems arising from data networks, for example, network tomography, traffic modeling and simulation, network monitoring and performance analysis, and data streaming algorithms.

JEFFREY SPIESS is director, product management in Alcatel-Lucent's Networks and Platforms group in Plano, Texas. He is currently managing the Motive Analytics product line for customer experience analytics and care analytics. Since joining Alcatel-Lucent in 1997, he has held numerous positions managing carrier applications, including software development, systems engineering, product and solution management, product and field marketing, strategy, solutions architecture and consulting. He holds three patents in telecommunications technology. Prior to joining Alcatel-Lucent, Mr. Spiess was employed with Texas Instruments where he designed hardware and software systems for defense and telecom applications. He holds a bachelor of science degree in electrical engineering from the University of Cincinnati, Ohio, and a master of science degree in computer science engineering from the University of Texas at Arlington.