Personality computing: New frontiers in personality assessment

Personality Computing (PC) is a burgeoning field at the intersection of personality and computer science that seeks to extract personality-relevant information (e.g., on Big Five trait levels) from sensor-assessed information (e.g., written texts, digital footprints, smartphone usage, non-verbal behavior, speech patterns, game-play, etc.). Such sensor-based personality assessment promises novel and often technologically sophisticated ways to unobtrusively measure individual differences in a highly precise, granular, and faking-resistant manner. We review the different conceptual underpinnings of PC; survey how well different types of sensors can capture different types of personality-relevant information; discuss the evaluation of PC performance and psychometric issues (reliability and validity) of sensor-derived scores as well as ethical, legal, and societal implications; and highlight how modern personality and computer science can be married more effectively to provide practically useful personality assessment. Together, this review aims to introduce readers to the opportunities, challenges, pitfalls, and implications of PC.

Novel machine learning (ML) approaches offer the tools needed to appropriately analyze vast amounts of sensor-based data and can be used to predict personality scale scores (e.g., Ilmini & Fernando, 2017; Stachl et al., 2020a, 2020b; Vinciarelli & Mohammadi, 2014a). Sensor-based personality assessment is a promising way to unobtrusively measure individual differences in a highly precise, granular, and faking-resistant manner (Harari et al., 2015, 2016). This new area of research may inform personality theory in an unprecedented fashion through the detection of complex patterns that cannot be retrieved by traditional statistical approaches. PC also holds the potential to yield real-world applications that might improve health and well-being through personalized interventions (Alexander et al., 2020). There has been an increase in PC studies in recent years (Mehta et al., 2019), originating not only from personality psychology but also from computer science, given the centrality of highly complex data-analytical methods.
The purpose of this article is to give a basic introduction to PC. First, we briefly introduce personality science and discuss the conceptual underpinnings of PC. We then review the field of PC, focusing on the performance of different approaches, limitations regarding performance, and psychometric issues. We conclude with a short overview of some ethical, legal, and societal implications and discuss the role of personality psychologists in PC.
Throughout this article, we use the term Personality Computing, coined by Vinciarelli and Mohammadi (2014a), to describe ML approaches to personality research and assessment. However, other terms such as computational personality traits assessment (e.g., Ilmini & Fernando, 2017), digital phenotyping (e.g., Onnela & Rauch, 2016), psychoinformatics (e.g., Markowetz et al., 2014), personality sensing (e.g., Harari et al., 2020a, 2020b), automatic personality detection (e.g., Mehta et al., 2019), ML personality assessment (e.g., Bleidorn & Hopwood, 2019), and automatic personality assessment (e.g., Kedar & Bormane, 2015) have been used to describe similar approaches.
Self-reports tap how people see themselves, which is part of their identity; these self-views may or may not correspond to how we behave and how others see us. Other-reports tap how other people see a target-person, which is part of someone's reputation. For example, an acquaintance rates a target-person on the item "She/He is someone who is outgoing, sociable." Together, identity and reputation are integral parts of personality (Hogan, 1996; Hogan & Roberts, 2000). Though moderately correlated (Connelly & Ones, 2010), they do not fully overlap and each harbor unique insights and predictive value (Vazire, 2010). Each source has different information available (McAbee & Connelly, 2016) and may be contaminated with different forms of "biases" (the self: e.g., self-serving bias, social desirability, impression management; others: e.g., familiarity, liking, relationship). A multimethod approach considering both self- and other-reports may provide a better, or more accurate, picture of target-persons' "true" personalities (Funder, 1995; Kenny, 1994). For example, composite trait scores capturing shared variance between self- and other-reports can be used as proxies for "actual" traits (i.e., what is known to the self and others).
Besides self- and other-reports, there are numerous other, more behavior-linked data sources that personality researchers could tap. For example, some bio-physiological data (e.g., pulse, heart rate, skin conductance) may be indicative of people's transient affective states but also their enduring traits (DeYoung, 2015; Eysenck, 1967). These can be understood as involuntary and automatic reactions to circumscribed sets of stimuli. Furthermore, behavioral products that remain in an environment after a behavior has been performed (e.g., purchase decisions, room arrangements, clothing) and digital footprints (e.g., Facebook Likes, Tweets, emails) have also been used to infer people's personality traits with some accuracy (Gosling et al., 2002; Kosinski et al., 2013; Kulkarni et al., 2018; Youyou et al., 2015). Such novel methods have far-reaching consequences because people may be unaware they are giving off personality-relevant information and thus would not be able to fake their scores. However, personality is expressed in any form of behavior (Table 1)-be it real-time physiological reactions, complex movements, or lasting remnants.

| PERSONALITY COMPUTING
The field of PC is mainly concerned with three broad problems (Pianesi, 2013; Vinciarelli & Mohammadi, 2014a): Automatic Personality Recognition, Automatic Personality Perception, and Automatic Personality Synthesis.
Automatic Personality Recognition and Automatic Personality Perception approaches infer trait scores from machine-detectable personality-expressive signals (for an overview of behavioral modalities, see Table 1). These signals-pertaining to (re-)action patterns or behavioral processes-can be derived, for example, from information encapsulated in text, audio, video, mobile phone data, digital footprints, wearables, and online games. This information is processed (automatically) by machines without human raters. ML algorithms are then trained to predict self-reported personality trait scores (Automatic Personality Recognition) or other-reported trait scores (Automatic Personality Perception) from machine-detected signals. Targets' self-reports and other-reports serve as benchmarks for accuracy. Automatic Personality Synthesis attempts to generate artificial personalities via virtual or embodied agents (e.g., avatars, robots, smart homes).
Vinciarelli and Mohammadi (2014a) place these three approaches within a Brunswikian lens framework (Brunswik, 1956) based on attribution and externalization processes in interpersonal perception and human-machine interaction (see Figure 1). In this framework, aspects of an individual's personality (e.g., traits) are externalized via machine-detectable behavioral signals (i.e., distal cues). Behavioral signals can be actual behavior (e.g., speech, movements, gestures, facial expressions, etc.) or behavioral products. Automatic Personality Recognition is concerned with the automatic inference of a person's self-reported traits based on these signals, or cues.
The relationship between (self-reported) traits and distal cues is called ecological validity (Brunswik, 1956; not to be confused with external validity; Araújo et al., 2007). Importantly, not all behaviors are trait expressions. All experiences and behavior are always contextualized and determined by both the person and the situation (Rauthmann, 2016). Depending on the situation, the same behavior can vary in how indicative it is of an individual's trait and, thus, differ in its ecological validity.
Note (Table 1): Modalities can also be combined for multimodal PC. Abbreviations and explanations: EAR = electronically activated recorder; LIWC = Linguistic Inquiry and Word Count (text analysis program to categorize percentages of words in pre-defined categories); MRC PD = Medical Research Council Psycholinguistic Database (machine-readable dictionary containing over 150,000 words, including [psycho-]linguistic attributes for each); N-grams = sequences of N items (e.g., words, letters) in a text; openSMILE = open-source Speech and Music Interpretation by Large-space Extraction (toolkit for audio feature extraction and music and speech signal classification); prosody = properties of speech and linguistic functions (e.g., intonation, tone, stress).
The lens represents sensory-perceptual processes transforming distal cues into proximal cues, the cues that an observer actually perceives. Naturally, proximal and distal cues overlap, but some behavioral signals may be omitted in perceptual processes, and signals that were not expressed can be perceived erroneously. Proximal cues activate attribution processes and are used to form perceptual judgments about a target's personality (i.e., trait levels attributed by observers). The relationship between a person's reputation and proximal cues is called representation validity.
Automatic Personality Perception theoretically concerns the automatic inference of a person's reputation from proximal cues. In practice, however, it is usually not possible to determine which cues were actually perceived, so most Automatic Personality Perception research uses distal cues to approximate proximal cues.
Furthermore, Automatic Personality Perception usually aims at the prediction of average judgments (e.g., of a target's trait) across multiple raters. This means Automatic Personality Perception identifies cues that best predict average ratings, possibly ignoring relevant associations between certain cues and individual perceptual judgments.
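In practice, both Automatic Personality Recognition and Automatic Personality Perception thus reduce to a supervised learning problem: predicting reported trait scores from machine-extracted cues. The following is a minimal, purely illustrative sketch in Python with scikit-learn; the data are simulated, and the feature set, sample size, and model choice are hypothetical rather than taken from any reviewed study.

```python
# Minimal sketch of an Automatic Personality Recognition setup (illustrative only).
# X: machine-extracted behavioral cues (e.g., smartphone-usage or linguistic features);
# y: self-reported trait scores (for Automatic Personality Perception, y would instead
#    be aggregated other-reports). All data below are simulated placeholders.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_persons, n_cues = 300, 50
X = rng.normal(size=(n_persons, n_cues))                        # behavioral cues
y = X[:, :5] @ rng.normal(size=5) + rng.normal(size=n_persons)  # simulated trait scores

model = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13)))

# Out-of-sample performance, estimated with 10-fold cross-validation over persons
r2 = cross_val_score(model, X, y, cv=10, scoring="r2")
print(f"Mean cross-validated R^2: {r2.mean():.2f}")
```

Regularized regression on standardized features is only one of many possible model choices; the key point of the sketch is that performance is estimated on held-out persons rather than on the data used for training.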
Lastly, in Automatic Personality Synthesis, machines (e.g., robots, avatars, virtual agents) express automatically generated distal cues to elicit attributions of certain trait levels in human observers. The main goal is to elicit attributions as planned by machine designers (Vinciarelli & Mohammadi, 2014a). Effective Automatic Personality Synthesis requires further advancements and improvements in Automatic Personality Recognition and Automatic Personality Perception. For example, household robots, smart homes, and assisted programs (e.g., in physiotherapy or psychotherapy) can be built to express certain personalities-either normatively to everyone or tailored idiographically to a user after "reading" the user's trait levels (i.e., Automatic Personality Recognition). Automatic Personality Synthesis informed by Automatic Personality Recognition and Automatic Personality Perception could achieve more trust and compliance as well as more efficiency and effectiveness in human-machine interactions (Salem et al., 2015).

FIGURE 1 Brunswik lens of processes underlying personality computing. Adapted from Vinciarelli and Mohammadi (2014a). μ_S = inference of self-reports from distal cues; μ_P = inference of other-reports from proximal cues; ρ_EV = ecological validity = correlation between self-reported personality traits and distal cues; ρ_RV = representation validity = correlation between proximal cues and other-reported personality traits.
In one recent review of PC studies, reported prediction accuracies varied considerably across studies and modalities; however, most hovered around ∼60%-70%. Among unimodal approaches, visual modalities (e.g., images or videos) yielded the highest accuracies (∼91%). The authors further highlighted that for most modalities the best performance had been achieved using deep learning techniques (compared to traditional or "shallow" ML) and that these can be expected to perform even better in the future.
A recent meta-analysis by Azucar et al. (2018) focused solely on Automatic Personality Recognition from digital footprints on social media and reported meta-analytic correlations ranging from r = 0.29 for agreeableness to r = 0.40 for extraversion. To our knowledge, an extensive and systematic meta-analysis across all signal modalities has not yet been attempted.
In conclusion, Automatic Personality Recognition and Automatic Personality Perception increasingly result in above-chance predictions of personality trait scores. However, several factors impede comparability across studies, highlighting the need for a systematic meta-analysis to assess average performance for each trait and across different modalities. Furthermore, there are some problematic common practices in PC that complicate the evaluation of performance. We will elaborate on these issues in the next section.
To evaluate PC performance appropriately, it is also important to clearly distinguish between Automatic Personality Recognition and Automatic Personality Perception, given that the self and others have unique insights into a person's personality (Vazire, 2010). In some studies, Automatic Personality Perception showed stronger performance than Automatic Personality Recognition, but this awaits further systematic replication. Furthermore, studies must move beyond binary classifications and acknowledge the continuous, hierarchical, and multifaceted nature of dimensional traits (Wright, 2014). Moreover, multimodal and deep learning approaches are promising avenues to further improve automatic personality prediction. We refer readers interested in the particulars of selected studies to the review articles cited here and to Finnerty et al. (2016).

| Evaluating PC performance

It is likely that performance estimates in PC are overly optimistic due to overfitting (Mønsted et al., 2018; Stachl et al., 2020b). A central objective of ML is to find models that generalize from a sample (i.e., training data) to new observations drawn from the same population (i.e., test data). To achieve this, ML models are trained to minimize overfitting. Overfitting means that a statistical model erroneously fits sample-specific noise as signal, yielding a performance estimate in the training data set that does not generalize to a test data set.
This especially tends to occur with large numbers of predictors and relatively small effect and sample sizes (Yarkoni & Westfall, 2017), which is often the case in PC studies (Mønsted et al., 2018). However, distorted performance estimates can also result from underfitting, which occurs when a statistical model does not adequately represent the true underlying function in the data (e.g., fitting a linear function if the true relationship is exponential). This leads to poor performance both in the training and test data sets (Lantz, 2019).
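As a purely illustrative sketch of the overfitting side of this trade-off (simulated data, not from any PC study): with many predictors, a small sample, and no true signal, an unregularized model can look nearly perfect in the training data while cross-validated performance is at or below chance.

```python
# Illustrative only: overfitting with many predictors, a small sample, and no true signal.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 60, 50                        # small sample, many predictors
X = rng.normal(size=(n, p))          # noise "features"
y = rng.normal(size=n)               # outcome unrelated to X

ols = LinearRegression().fit(X, y)
print(f"R^2 in the training data: {ols.score(X, y):.2f}")        # close to 1.0 (overfit)

cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(f"Mean cross-validated R^2: {cv_r2.mean():.2f}")           # near zero or negative
```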
Balancing the error sources associated with overfitting and underfitting is therefore a central goal in ML (Yarkoni & Westfall, 2017).
Another problematic practice is the dichotomization of continuous trait scores (e.g., via median splits) for binary classification, even though most people score near the middle of trait distributions (Schmitt et al., 2007). This leads to poor performance in binary classification tasks (Mariooryad & Busso, 2015). Moreover, using Monte Carlo simulations, DeCoster et al. (2009) showed that using continuous instead of dichotomized variables is generally preferable. Furthermore, dichotomizing can result in the underestimation of effect sizes, reduce statistical power, and increase false positives (Cohen, 1983; MacCallum et al., 2002). Even in cases where effect sizes increase after dichotomization, this is likely due to sampling error, especially when sample or effect sizes are small (MacCallum et al., 2002). Lastly, dichotomizing impacts the measurement and representation of individual differences and inevitably entails a loss of information, possibly affecting the psychometric properties of a measure (MacCallum et al., 2002). Indeed, true differences between observations that are present in continuous variables might disappear through dichotomization. There are approaches that circumvent the problems accompanying polytomization or dichotomization (e.g., regression-based binary classifiers, weighted sum loss functions, Bayesian-optimal classifiers, etc.; see Mariooryad & Busso, 2015), but these are rarely used in PC research.
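The information loss from dichotomization can be made concrete with a small simulation (hypothetical numbers, not taken from the reviewed studies): the association between a predictor and a trait shrinks once the trait is median-split, in line with Cohen (1983) and MacCallum et al. (2002).

```python
# Illustrative only: dichotomizing a continuous trait attenuates observed associations.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
predictor = rng.normal(size=n)                            # e.g., one behavioral feature
trait = 0.4 * predictor + rng.normal(size=n)              # continuous trait score
trait_split = (trait > np.median(trait)).astype(float)    # median split: "high" vs. "low"

r_continuous = np.corrcoef(predictor, trait)[0, 1]
r_dichotomized = np.corrcoef(predictor, trait_split)[0, 1]
print(f"r with continuous trait scores: {r_continuous:.2f}")    # ~.37
print(f"r after a median split:         {r_dichotomized:.2f}")  # noticeably smaller (~.30)
```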
When evaluating PC performance estimates, researchers should also note that these probably have a ceiling.
PC performance can be regarded as the degree of convergent validity between ML measures and questionnaire-based measures of personality. In "traditional" assessment, convergent validities across data sources, measurement methods, or same-source measures with different content are usually moderate due to method artifacts as well as unique information captured by different measures and data sources (Wiernik et al., 2020). Convergent validities of ML measures should therefore not be expected to exceed this moderate ceiling.
Another issue that further challenges PC research is a reproducibility crisis in ML (Hutson, 2018; Raff, 2019).
ML code is often not shared openly and has to be recreated from published descriptions. Moreover, modeling decisions affecting performance (e.g., hyperparameter settings) are frequently not reported (Hutson, 2018).
Reproducibility is also hampered by inaccessibility of data and hardware (Barber, 2019). However, the culture in ML is slowly beginning to change towards open science. For example, reproducibility checklists (Pineau et al., 2020) as well as frameworks for open and reproducible code (Forde et al., 2018; Van Lissa et al., 2020) are being implemented.
We conclude that despite some promising advances in PC, performance estimates should be interpreted with caution. Surprisingly high estimates might result from overfitting issues and not replicate. It should be noted, however, that PC cannot be evaluated on the basis of performance alone. To truly understand PC and its utility, researchers will also have to evaluate ML measures' psychometric properties. We will elaborate on this issue in the next section.

| PSYCHOMETRIC ISSUES
Besides performance issues, personality scientists have lamented the atheoretical "black box" approach and lack of psychometric validation in PC (Bleidorn & Hopwood, 2019; Harari et al., 2020b). Algorithm-based personality trait measures rarely undergo the same rigorous psychometric testing as "traditional" assessment tools. Performance measures usually provide some insights into the convergent validity of ML measures. However, few studies go beyond this and focus on (different forms of) reliability or other types of validity (content, discriminant, factorial, criterion, incremental).
Notably, there are a few differences in the psychometric evaluation of ML measures compared to traditional measures. For example, Stachl et al. (2020b) proposed a distinction between internal and external convergent and discriminant validities. For a supervised ML task, they distinguished between correlations of model-predicted values with the target variable (i.e., internal convergent validity) and correlations with external measures of the same construct (i.e., external convergent validity) or other constructs that were not considered during training (i.e., external discriminant validity). Furthermore, they suggested modifying a model's loss function to minimize associations with theoretically distinct constructs (i.e., internal discriminant validity). High external convergent correlations and low external discriminant correlations would provide more convincing evidence for the validity of an ML measure than internal validity correlations alone. Moreover, as mentioned earlier, researchers should expect a moderate ceiling for convergent validities due to method artifacts and unique construct information in different data sources (Wiernik et al., 2020). Furthermore, ML measures could be particularly prone to discriminant validity issues because the same indicators may be substantially linked to many traits, leading to high intercorrelations between distinct ML measures. Therefore, researchers should assess a broad enough range of indicators so that variables can contribute uniquely to distinct constructs (Wiernik et al., 2020).
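A minimal sketch of how such coefficients could be computed is given below, assuming one has model-predicted scores for held-out persons, the self-report target used in training, an external measure of the same construct (e.g., other-reports), and a measure of a theoretically distinct construct. All arrays are simulated placeholders and the variable names are hypothetical.

```python
# Illustrative computation of internal/external convergent and discriminant correlations
# in the sense of Stachl et al. (2020b). All data below are simulated placeholders.
import numpy as np

rng = np.random.default_rng(2)
n = 200
y_self = rng.normal(size=n)                              # self-reported extraversion (training target)
y_pred = 0.5 * y_self + rng.normal(scale=0.8, size=n)    # model predictions for held-out persons
y_other = 0.4 * y_self + rng.normal(scale=0.9, size=n)   # other-reported extraversion (external measure)
y_distinct = rng.normal(size=n)                          # e.g., self-reported conscientiousness

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

print(f"Internal convergent   r(pred, training target):  {corr(y_pred, y_self):.2f}")
print(f"External convergent   r(pred, external measure): {corr(y_pred, y_other):.2f}")
print(f"External discriminant r(pred, distinct trait):   {corr(y_pred, y_distinct):.2f}")
```

Internal discriminant validity, by contrast, would require intervening in the training itself (e.g., penalizing associations with distinct constructs in the loss function), which this sketch does not attempt.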
That being said, the intended goal of an ML measure needs to be considered to determine whether it should obey traditional psychometric standards. If an ML measure is created to capture a latent variable from classical or probabilistic test theory, psychometric properties (i.e., objectivity, reliability, validity) should be evaluated. Also, its nomological network (Cronbach & Meehl, 1955) should be examined to determine the causes, concomitants, and consequences of the measured variable. If the purpose of an ML measure is to optimize the prediction of a criterion (e.g., job performance), psychometric evaluation may be secondary. For a detailed discussion and recommendations on the psychometric evaluation of ML measures, we refer to Bleidorn and Hopwood (2019) and Stachl et al. (2020b).
The reliability and validity of sensor-based features that enter ML models should also be considered.
For example, even a simple variable such as "step count"-measured via different pedometers, wearables, and smartphones-varies between devices and deviates from the steps counted by human raters (Case et al., 2015). Knowledge about the reliability and validity of features is important to inform appropriate data collection and the interpretation of the resulting model's content validity (Alexander et al., 2020). Much research is still to be done to ensure that ML personality measures appropriately measure the target constructs. Only then will it be possible to move beyond a black-box approach and substantially inform personality theory.
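To close this section with a concrete illustration of feature-level reliability and validity, the sketch below compares simulated daily step counts from two hypothetical devices against a reference count: rank-order agreement between devices can be high even when one device systematically undercounts, and both aspects matter for the features that enter PC models. The data are placeholders, not the measurements reported by Case et al. (2015).

```python
# Illustrative only: between-device agreement and bias for a simple sensed feature (daily steps).
import numpy as np

rng = np.random.default_rng(3)
reference = rng.integers(2000, 15000, size=100).astype(float)    # reference daily step counts
device_a = reference * rng.normal(1.00, 0.05, size=100)          # device A: small random error
device_b = reference * rng.normal(0.90, 0.10, size=100)          # device B: undercounts on average

r_ab = np.corrcoef(device_a, device_b)[0, 1]
bias_b = (device_b - reference).mean()
print(f"Between-device correlation:                {r_ab:.2f}")           # high rank-order agreement
print(f"Mean deviation of device B from reference: {bias_b:.0f} steps/day")  # systematic undercount
```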

| ETHICAL, LEGAL, AND SOCIETAL ISSUES
Despite its potential, and also because of its limitations, PC research, and in particular its application, raises some ethical, legal, and societal issues. Privacy, autonomy, and fairness are of particular concern, and we briefly touch upon these issues here.
Privacy has been a longstanding issue in psychological testing, pertaining to an individual's right to decide which, and with whom, personal information is shared and to know one's own test results (Anastasi, 1980). These concerns are amplified in the light of passive data collection and automatized personality prediction. Digital traces may be collected involuntarily and without informed consent to make inferences about people's private traits.
Nonconsensual personality profiling not only undermines people's autonomy and freedom of choice but could also pose a threat to democracy and informed self-determination on a societal level (Boyd et al., 2020; see the Cambridge Analytica scandal: Confessore, 2018).
Another pressing issue is whether PC produces fair results, that is, whether it performs equally well across different groups (Alexander et al., 2020), which may pertain to race, ethnicity, gender, sexuality, disability, religion, class, or culture. Any kind of algorithm-based discrimination might amplify social inequalities due to systematic discrepancies in access to personalized services or targeted manipulation (Kusner & Loftus, 2020). For example, commercial algorithms used in the US health care system have been shown to discriminate against Black patients by assigning them the same risk scores as White patients even though the Black patients were considerably sicker (Obermeyer et al., 2019). These risk scores are used to target patients with complex health needs to improve their care by providing additional resources. This racial bias arises because the algorithm predicts health costs rather than health needs, which is problematic because differences in the amount of money spent on care result from unequal access in the first place.
Unfortunately, there have been many more reports of instances where discriminatory biases have permeated ML algorithms. For instance, facial recognition algorithms have been shown to perform worse for non-Caucasian faces, voice recognition systems were found to be biased against female voices, and targeted ads for high-paying jobs on Google were displayed far more often to men than to women (for an overview, see Howard & Borenstein, 2017).
From a psychometric perspective, biases in ML algorithms are reflected in issues with measurement invariance across different groups of people. More research is needed to determine whether algorithms that are trained with samples from certain groups generalize to other samples. Another related issue affecting measurement invariance and, thus, fairness is selection bias in the samples studied. There may be class barriers, such as owning a (certain type of) smartphone, that prevent certain groups from participating in or being recruited for studies. While the issues discussed here relate to ML research in general, the same problems also apply to PC research.
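As a minimal, hypothetical sketch of the kind of fairness check implied here, a PC model's predictive accuracy can be computed separately per group and compared; markedly lower accuracy in one group flags potential bias. The groups and data below are simulated, and a full measurement-invariance analysis would go considerably further.

```python
# Illustrative only: checking whether a PC model performs equally well across groups.
# Simulated data; group membership and effect sizes are arbitrary placeholders.
import numpy as np

rng = np.random.default_rng(4)
n = 400
group = rng.integers(0, 2, size=n)                        # two hypothetical groups (0 and 1)
y_true = rng.normal(size=n)                               # self-reported trait scores
noise = np.where(group == 0, 0.6, 1.2)                    # model simulated to work worse in group 1
y_pred = 0.6 * y_true + rng.normal(size=n) * noise        # model predictions

for g in (0, 1):
    mask = group == g
    r = np.corrcoef(y_true[mask], y_pred[mask])[0, 1]
    print(f"Group {g}: prediction-criterion r = {r:.2f} (n = {mask.sum()})")
```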
To combat some of the issues mentioned above, data protection laws have been put in place to ensure informational self-determination and regulate data protection (e.g., the EU General Data Protection Regulation; European Union, 2016). However, these laws differ substantially between nations (Guzzo et al., 2015), and PC researchers working with Big Data are prompted to carefully consider the laws and standards of the nation where data are collected, stored, and analyzed. Above and beyond such legal implications, researchers should consider the ethical and societal implications of their work. For example, conferences can request statements on the societal consequences of submitted work, and research can be conducted on topics such as de-identification or data masking (Boyd et al., 2020). Especially in applied settings, such as personnel selection (Tippins et al., 2021), it is important that researchers and practitioners together develop and implement professional guidelines to ensure that legal, ethical, and societal standards are being met.

| THE ROLE OF PERSONALITY SCIENCE IN PC
There is a discrepancy in how and why PC research is conducted in personality psychology and in computer science.
Another central role of personality scientists is to assist in the interpretation of results. Interpretation (e.g., regarding the relationships between features and personality constructs; embedding results into theory) requires expertise in personality psychology and cannot be provided by computer scientists. Also, personality scientists could improve the communication of results to the scientific community by providing conceptual clarity as well as concise and consistent terminology. Theoretically founded interpretations could improve future ML models by pointing data collection and modeling towards robust effects. Furthermore, personality-psychological expertise is necessary to identify future directions that are relevant to psychological research questions (see also Vinciarelli & Mohammadi, 2014b). For example, personality experts supported the investigation of constructs other than the Big Five (e.g., facets, personality states, values, narratives).
Another important point was that personality scientists should determine what should be used as the "ground truth" in PC. Self- and other-reports are often treated as the "gold standard"; however, people's actual trait levels might be reflected more accurately by composite scores of shared variance between self- and other-ratings or by the latent variance from multiple data sources. Researchers should even consider whether broad traits are appropriate targets in the first place. "Big Few" traits (i.e., Big Five, Five-Factor Model, or HEXACO; Ashton & Lee, 2020; Goldberg, 1990; McCrae & John, 1992) have been instrumental in organizing and integrating personality research.
However, if the ultimate goal is to use personality information to predict important outcomes, the question arises why the multi-indicator information (i.e., the indicators of personality themselves) cannot be used directly instead of first estimating personality traits as an intermediate step. In this vein, Mõttus et al. (2020) showed that out-of-sample predictive accuracy increases with higher resolutions of predictors (e.g., a set of individual items predicted outcomes better than their aggregated facets or domains); a sketch of this comparison follows below. Furthermore, experts mentioned that the distinction between measures and constructs should be clarified. Other relevant points included contributions of expertise in construct validation and test development; generalizability; promotion of good measurement practices; knowledge about normed distributions of personality constructs; perceptual biases; and choosing adequate methodological and theoretical approaches depending on whether the research goal is explanation, description, or prediction (see also Mõttus et al., 2020).
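The following hypothetical sketch mirrors the kind of comparison Mõttus et al. (2020) describe: an outcome is predicted out of sample once from individual items and once from their aggregated domain score; when item-outcome associations are heterogeneous, the item-level model tends to predict better. All data and effect sizes below are simulated and arbitrary.

```python
# Illustrative only: out-of-sample prediction from individual items vs. an aggregated score.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
n, k = 600, 10
factor = rng.normal(size=(n, 1))                            # shared trait variance
items = 0.7 * factor + rng.normal(scale=0.7, size=(n, k))   # k items of one hypothetical domain
weights = rng.uniform(-0.3, 0.6, size=k)                    # heterogeneous item-outcome links
outcome = items @ weights + rng.normal(size=n)

domain_score = items.mean(axis=1, keepdims=True)            # aggregated domain score

r2_items = cross_val_score(RidgeCV(), items, outcome, cv=10, scoring="r2").mean()
r2_domain = cross_val_score(RidgeCV(), domain_score, outcome, cv=10, scoring="r2").mean()
print(f"Cross-validated R^2, item-level predictors:   {r2_items:.2f}")
print(f"Cross-validated R^2, aggregated domain score: {r2_domain:.2f}")
```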
Lastly, personality scientists are equipped to collect personality-relevant behavioral data in relevant contexts coupled with personality measures that are necessary to build powerful and accurate PC models. These data sets need to be large, clearly labeled, and would ideally include multi-modally sensed behavior (however, computer scientists might be more experienced in collecting sensor data). Optimally, such data sets would be made publicly available (e.g., on a joint platform where data, information on sensing, etc. could be shared) while ensuring privacy and data security. Ultimately, promoting PC might not only require transdisciplinary collaborations between personality and computer scientists but also multi-lab efforts within personality science to provide high-quality data.

| CONCLUSION
PC opens up new and exciting avenues for gaining insights into complex behavioral aspects of personality, thereby also fostering theory-building and widespread applications. Despite the current limitations of PC and its reproducibility issues, much progress can be expected with continuous improvements in technology and computational power, modeling approaches, and an increasing cross-disciplinary integration of computer and personality science. However, personality scientists should be aware that PC research is and will be conducted without their involvement. To ensure conceptual precision and methodological soundness, personality science will have to enhance its influence on PC research. Personality scientists should especially promote the psychometric evaluation and nomological validation of ML-based measures. Accordingly, it is important that personality scientists assist in the interpretation of results in PC research to clarify what aspects of personality are being measured (Rauthmann, 2020), but they should also participate in designing studies and data collection. Moreover, state-of-the-art knowledge in personality science is essential to move beyond a mere prediction of personality trait scores and to steer PC towards descriptive and explanatory personality research as well as relevant research questions revolving around personality dynamics. To effectively foster personality theory and to develop safe and sound personalized applications, PC research must be a truly interdisciplinary endeavor.