Using large language models to generate silicon samples in consumer and marketing research: Challenges, opportunities, and guidelines

Should consumer researchers employ silicon samples and artificially generated data based on large language models, such as GPT, to mimic human respondents' behavior? In this paper, we review recent research that has compared result patterns from silicon and human samples, finding that results vary considerably across different domains. Based on these results, we present specific recommendations for silicon sample use in consumer and marketing research. We argue that silicon samples hold particular promise in upstream parts of the research process such as qualitative pretesting and pilot studies, where researchers collect external information to safeguard follow-up design choices. We also provide a critical assessment and recommendations for using silicon samples in main studies. Finally, we discuss ethical issues of silicon sample use and present future research avenues.


| INTRODUCTION
Generative artificial intelligence (AI) is transforming academic and practical research. A particularly prominent type of generative AI is large language models (LLMs), which can process a myriad of inputs and predict the next word or part of the next word (referred to as a token) in a sequence. The most visible outcome of this development is, arguably, the generative pre-trained transformer (GPT) model (Brown et al., 2020; OpenAI, 2023), which was made available to the general public via ChatGPT in November 2022. GPT uses large databases of text as input, trains the model using a self-supervised language modeling objective, and employs reinforcement learning from human feedback (OpenAI, 2023). This procedure enables LLMs to mimic human response behavior (Jeon et al., 2023; Luo et al., 2022).
Psychologists and marketing researchers have started reflecting on how LLMs might impact consumer and marketing research (e.g., Peres et al., 2023). Studies in this domain emphasize LLMs' potential to improve marketing communications (e.g., content marketing campaigns and content design), deliver superior customer experience through hyperpersonalization, and enhance classic marketing research functions (Brand et al., 2023; Ooi et al., 2023; Paul et al., 2023). Researchers have also started using LLMs to substitute human participants in academic empirical research (Argyle et al., 2023; Demszky et al., 2023; Dillion et al., 2023). These studies use LLMs to generate so-called "silicon samples" (also referred to as "synthetic datasets") that seek to mimic human respondents to describe, explain, and predict human behavior.
Silicon samples have also emerged in marketing practice. For example, the startup Synthetic Users offers an LLM-based service in which personas are described by their demographics and personality traits and can then be asked about their needs, desires, and feelings concerning a product or service. The system returns synthetic interview data that marketers can readily interpret and analyze (Hutson, 2023).
But should silicon samples be used to conduct empirical studies that provide insights into human behavior? Research has addressed this question by assessing whether the sample data generated by LLMs generalizes to human respondents, a necessary condition for replacing human samples with silicon samples. For example, in a series of replications of common cognitive psychology experiments, Binz and Schulz (2023) find that GPT shows similar biases as humans do (e.g., framing effects). In contrast, Kirshner (2024) finds considerable differences between GPT and human samples in the formation of construals (i.e., personal interpretations of the world). Specifically, GPT puts greater emphasis on features that relate to a goal-focused, high construal level than on features that relate to a means-focused, low construal level.
Studies involving silicon sampling are scattered across numerous fields of scientific inquiry such as consumer research (e.g., Kirshner, 2024), general psychology (e.g., Caron & Srivastava, 2022), and political science (e.g., Argyle et al., 2023), making it difficult to provide a conclusive answer as to the circumstances under which LLMs can mimic human response behavior. From a more fundamental perspective, researchers question whether LLMs can validly be used as models of human thought, since an LLM's working principle involves computing the most probable next text element in a sequence. This process differs considerably from a human participant's feelings and reasoning abilities (e.g., Abdurahman et al., 2023; Demszky et al., 2023).
We contribute to this debate in several ways. We first review research comparing silicon and human samples across numerous scientific domains and discuss reasons for the observed variability in results. Based on our findings, we then discuss the use of silicon samples for applied consumer and marketing research. Specifically, we assess their use for qualitative pretesting and pilot studies as well as for quantitative main studies. We further supplement our discussion with ethical perspectives to offer recommendations for silicon sample use and derive future research avenues. Our proposed checklist for LLM use will help academics and practitioners to adequately situate silicon samples in their projects.

| USING LLMS TO MIMIC HUMAN BEHAVIOR
Although efforts to substitute human respondents with LLMs are relatively new, several studies have already compared human and silicon samples. These comparisons stem from various domains (e.g., human-computer interaction, general psychology, social psychology) and consider a wide range of tasks and settings (e.g., the cognitive reflection task, Hagendorff et al., 2023; the ultimatum game, Aher et al., 2023; and the Wason selection task, Lampinen et al., 2023). Viewed as a whole, the studies resulted in mixed findings regarding the efficacy of silicon samples in mimicking human responses. We substantiated our discussion with a systematic literature review, which identified 28 articles that report the results from 285 silicon-to-human sample comparisons in seven domains with 96 individual tasks; we document the results of this review in the Web Appendix on the Open Science Framework (OSF): https://osf.io/b2gtv/. On the one hand, LLMs replicated results from tasks related to personality traits (Caron & Srivastava, 2022), framing effects (Chen et al., 2023), as well as political attitudes and party preferences (Argyle et al., 2023). For example, Caron and Srivastava (2022) surveyed Reddit users about their "Big Five" personalities and trained LLMs with these user-specific contextual data. Their results show that LLMs can reliably imitate personality markers in various contexts.
On the other hand, researchers were in many cases unable to replicate effects known to characterize consumer behavior, such as the endowment effect, mental accounting, or the sunk cost fallacy (Chen et al., 2023). For example, when replicating Kahneman and Tversky's (1979) classic prospect theory experiment to identify risk preferences regarding gains versus losses, Chen et al. (2023) found that ChatGPT mostly focuses on maximizing expected payoffs rather than, as humans do, acting risk-averse for gains and risk-seeking for losses.
Similarly, Park et al. (2023), using GPT-3.5, re-ran 14 studies from Many Labs 2, a large-scale replication project of major findings from psychological research (Klein et al., 2018). The authors could replicate only just over a third of the results from these studies. The studies for which both the Many Labs 2 and the GPT samples replicated the original results rely on generalizing or comparing information that is provided directly in the task instruction. GPT, however, did not replicate effects that arise from implicit associations. For 6 out of 14 studies, regardless of the researchers' algorithmic choices, the GPT results showed a high level of determinacy (i.e., a "correct answer effect," in which GPT answered in a highly uniform way with no or almost no variation).
Two aspects that likely contributed to these mixed results are LLMs' working principles, including the ways to customize and parametrize them, and researchers' use of different LLM versions. In terms of their working principles, LLMs are designed to reproduce word co-occurrence patterns found in an unprecedented amount of training data from data sets such as The Pile, an almost 900 GB, diverse, open-source data set of English text covering content from, among others, arXiv, GitHub, Stack Exchange, PubMed, and Wikipedia (Gao et al., 2020). LLMs reproduce co-occurrence patterns by applying neural networks that use sentences as predictors of masked-out words, thereby approximating a word's meaning in context rather than assuming that words have a static meaning across contexts.
Prediction errors serve as the basis for updating the neural network's weights and bias terms (backpropagation) to minimize the difference between the model output and the target text.
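To make this working principle concrete, the following minimal Python sketch illustrates the loss and gradient computation for a single next-word prediction. The toy vocabulary, scores, and variable names are our own illustrative assumptions; real LLMs use deep transformer networks, but the error signal driving backpropagation is analogous.

```python
import numpy as np

# Toy vocabulary and a single training step of next-token prediction.
vocab = ["the", "service", "was", "great", "poor"]
target = vocab.index("great")          # the "masked-out" word to predict

logits = np.array([0.1, 0.3, 0.2, 0.9, 0.8])   # model scores per token
probs = np.exp(logits) / np.exp(logits).sum()  # softmax

# Cross-entropy loss: penalizes low probability on the observed token.
loss = -np.log(probs[target])

# Gradient of the loss w.r.t. the logits (softmax minus one-hot target);
# backpropagation uses this error signal to update weights and biases.
grad = probs.copy()
grad[target] -= 1.0

print(f"p('great') = {probs[target]:.3f}, loss = {loss:.3f}")
print("gradient on logits:", np.round(grad, 3))
```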
As with all statistical analyses, the quality of the output depends largely on the quality of the training data. In the case of LLMs, the training data comprises a multitude of sources that do not represent a well-defined population. This is problematic because LLMs "are simply parroting what the training data tended to say about the concept so that the dialogue sounds natural" (Demszky et al., 2023, p. 4), and it is not clear whose experiences and opinions the output reflects. Researchers can address this issue by fine-tuning the model, that is, by feeding the LLM with additional and more specific training data (Brown et al., 2020). However, while fine-tuning may improve a model's performance in generating a correct response, this does not imply that the LLM better mimics human response behavior, as evidenced in the results of our systematic literature review (see the Web Appendix). The reason is that the correct answer is not necessarily the same response a human might give. For instance, the fine-tuned Flan-PaLM as well as other LLMs perform considerably better than humans on logic puzzles (e.g., the Wason selection task, Lampinen et al., 2023). However, when fine-tuning the model for tasks that seek to mimic human responses, such fine-tuning may easily backfire as it "pushes the model to almost embody caricatures of those groups" (Santurkar et al., 2023, p. 10).
Another approach to improving LLM performance is prompt-tuning, where researchers prompt sample tasks and their solutions (Demszky et al., 2023). For example, a researcher investigating antecedents of service quality may prompt the following example: "Here is an example of a customer expressing concerns about the service quality: 'The service staff was unfriendly and didn't even try to resolve the problem'." Importantly, prompt-tuning is not restricted to a single sample task (one-shot prompting) but may extend to multiple examples (few-shot prompting). Results from our systematic literature review suggest that prompt-tuning considerably improves LLM performance (see the Web Appendix).

In addition to the sensitivity of the results to the structure of the training data and prompts, LLM users also have various degrees of freedom when applying LLMs. Most notably, users can impose a certain degree of result variability via the softmax temperature and top-k parameters (Chang et al., 2023). For example, a higher softmax temperature of 1 or 2 will result in more diverse outputs, while a lower temperature such as 0.5 will make the outputs more deterministic. However, this decrease in diversity can easily be problematic, as a certain degree of variation may be central to capturing a phenomenon fully. Conversely, if the randomness is very high, the results will vary more and will be more difficult to replicate.
Even if the temperature setting leads to differences in individual answers, GPT can still arrive at a similar, though not necessarily identical, result (see Park et al., 2023).
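The following Python sketch illustrates how the softmax temperature and top-k parameters shape output variability when sampling from a model's raw scores. The logits and parameter values are illustrative assumptions, not settings from any particular LLM.

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=None, rng=None):
    """Sample one token index from raw model scores (logits).

    Higher temperature flattens the distribution (more diverse output);
    lower temperature sharpens it (more deterministic output). top_k
    restricts sampling to the k most probable tokens.
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        cutoff = np.sort(logits)[-top_k]           # k-th largest logit
        logits = np.where(logits >= cutoff, logits, -np.inf)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.5, 0.1]
rng = np.random.default_rng(42)
for t in (0.5, 1.0, 2.0):
    draws = [sample_token(logits, temperature=t, rng=rng) for _ in range(1000)]
    print(f"temperature={t}: share of most likely token =",
          round(draws.count(0) / 1000, 2))
```

Running the sketch shows the trade-off described above: at temperature 0.5, the most likely token dominates the output; at temperature 2.0, its share drops markedly and the results become harder to replicate.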
A second source of result variability is grounded in researchers' use of different GPT versions. Most notably, Hagendorff et al. (2023) identified a substantial shift in response patterns across GPT versions. While early versions displayed human-like intuitive system 1 thinking and its associated cognitive errors, GPT-3.5 and higher engage in chain-of-thought reasoning, which corresponds to system 2 thinking. For example, the authors gathered data from human respondents and different GPT versions on a series of cognitive reflection tasks, such as: "Together, a potato and a camera cost $1.40. The potato costs $1 more than the camera. How much does the camera cost?" While the majority of human respondents and earlier GPT versions gave an intuitive and, therefore, wrong answer to these tasks ($0.40 in this example), GPT-4 responded correctly in practically all cases ($0.20 in this example), often even providing chain-of-thought reasoning. This result is in line with previous findings from similar task types, showing that GPT-4 performs exceptionally well in standardized tests (OpenAI, 2023). Another direct comparison of GPT versions revealed differences regarding the Big Five personality traits. Specifically, extraversion and agreeableness deviate more strongly from human samples in GPT-4 than in GPT-3.5. In addition, both versions do not represent human scores well concerning conscientiousness, neuroticism, and openness, with GPT-4 performing better than GPT-3.5 (Jiang et al., 2023).

| RECOMMENDATIONS CONCERNING THE USE OF SILICON SAMPLES
In light of the challenges and opportunities of silicon samples, where should they be situated in a research project? What are current guidelines that researchers should adhere to when silicon sampling? In the following, we address these two questions.

| Using silicon samples for pretesting and pilot studies
We see considerable promise in using LLMs such as GPT in upstream parts of the research process where researchers collect external information to safeguard follow-up design choices. The aim is to alert researchers to potential errors in the process that would require intervention before initiating the main study with human participants.
For example, we recommend that researchers use silicon sampling for pretests and pilot studies, such as in scale pretesting, where they could interrogate GPT about whether a certain item wording is appropriate. As a practical example, we prompted GPT-4 to assess the appropriateness of the survey item "I'm satisfied with the products and services of the company," which respondents should answer on a scale from 1 ("I fully disagree") to 7 ("I fully agree"). GPT-4 correctly identified that the item is double-barreled, asking respondents to evaluate both products and services (Lietz, 2010), and also highlighted the generic nature of the question (Figure 1), noting that "[…] it doesn't provide insight into specific areas of strength or improvement." In a second example, GPT-4 flagged the item wording "I use this service very often" as vague, noting that "What one person considers 'very often' might be considered less frequent by someone else" (Figure 2).

FIGURE 1 Prompt and ChatGPT (GPT-4) answer for the appropriateness of a survey item (I).

FIGURE 2 Prompt and ChatGPT (GPT-4) answer for the appropriateness of a survey item (II).
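A minimal sketch of this item-pretesting workflow, using the OpenAI Python SDK (v1.x), could look as follows. The model name, temperature setting, and prompt wording are illustrative assumptions rather than a fixed protocol.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

item = ("I'm satisfied with the products and services of the company "
        "(1 = 'I fully disagree' to 7 = 'I fully agree')")

response = client.chat.completions.create(
    model="gpt-4",
    temperature=0.7,
    messages=[
        {"role": "system",
         "content": "You are an expert in survey methodology."},
        {"role": "user",
         "content": "Assess whether this survey item is appropriately "
                    "worded and flag issues such as double-barreled or "
                    f"vague formulations: {item}"},
    ],
)
print(response.choices[0].message.content)
```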
GPT may also be used to capture some sources of measurement invariance in scale development processes (Vandenberg & Lance, 2000). To illustrate its capabilities in this regard, we prompted GPT-4 to assess whether respondents from different cultures would respond differently to the concept of cultural intelligence, which refers to "a person's capability to adapt effectively to new cultural contexts" (Earley & Ang, 2003, p. 59). GPT-4 asserts that this is likely the case due to different cultural norms and values, respondents' exposure to diversity, the context of interaction, and several other factors (Web Appendix Figure A1). The output therefore mirrors empirical findings pointing to the challenges associated with establishing measurement invariance in the measurement of cultural intelligence (Schlägel & Sarstedt, 2016). This approach can be extended to generate a set of silicon participants from diverse backgrounds to ascertain whether members of different subsamples may interpret the item content differently.
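The following hypothetical sketch extends this idea by querying the same item through silicon respondents with different cultural backgrounds and collecting their interpretations for comparison. The persona descriptions, item wording, and settings are our own illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()
item = "I can adapt effectively to new cultural contexts."
personas = [
    "a 45-year-old teacher from rural Japan",
    "a 22-year-old student from Berlin, Germany",
    "a 35-year-old engineer from São Paulo, Brazil",
]

interpretations = {}
for persona in personas:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=1.0,
        messages=[
            {"role": "system",
             "content": f"Answer as {persona} would, in one short paragraph."},
            {"role": "user",
             "content": "In your own words, what does this statement mean "
                        f"to you? '{item}'"},
        ],
    )
    interpretations[persona] = response.choices[0].message.content

# Inspect the collected answers for diverging interpretations of the item.
for persona, text in interpretations.items():
    print(f"--- {persona}\n{text}\n")
```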
We also see value in using generative AI such as the text-to-image model DALL-E in other contexts, such as crafting and testing stimuli or vignettes that should meet predefined characteristics (e.g., generating product stimuli for a study on assortment organization).
To illustrate its potential in this context, we asked DALL-E 3 to generate a visual stimulus that is supposed to extend the viewer's future time perspective (i.e., an individual's perception of their remaining time in life, which plays an important role, e.g., in emotion regulation; Carstensen, 2006). In response, DALL-E 3 describes a matching scenery that is principally useful for evoking a corresponding shift in time perspective. DALL-E 3 then uses this input to generate a corresponding image (Figure 3). Researchers could now revise the prompt by adding further information to customize the image to the specific research context.

FIGURE 3 DALL-E 3 examples for a stimulus ad targeting future time perspective.
DALL-E 3 can also be used for an initial test of the appropriateness of a visual stimulus. Drawing on Ton et al.'s (2023) study on the impact of simple versus complex packaging designs on consumer behavior, we generated two variants of a chocolate bar package, which we subjected to the prompt shown in Figure 4. Specifically, we used the attention check from Ton et al. (2023) and asked DALL-E 3 to assess each packaging's complexity on a scale from 1 ("simple") to 9 ("complex"). The model identifies differences in complexity concerning various design elements, noting that the more complex design suggests "a richer sensory experience." The model also describes the images and provides a numeric assessment. We probed these assessments in an additional replication in which we also assessed order effects. While the descriptions are only marginally affected by the query order, the numeric assessments are sensitive to order effects (see Park et al., 2023 for another observation of order effects in GPT). Specifically, the more complex packaging receives a higher complexity score only if it is presented after the minimalistic packaging (see Web Appendix Figure A2).

FIGURE 4 Using DALL-E 3 to test for the appropriateness of visual stimuli.
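A probing routine for such order effects could, for example, submit the two package images in both presentation orders and compare the resulting ratings. The sketch below is an illustration under stated assumptions: it uses a vision-capable model (here, "gpt-4o") via the OpenAI chat completions API, and the image URLs are placeholders.

```python
from openai import OpenAI
import itertools

client = OpenAI()
images = {"simple": "https://example.com/package_simple.png",
          "complex": "https://example.com/package_complex.png"}

def rate_complexity(ordered_keys):
    # Build one message containing the rating instruction plus both
    # images in the requested presentation order.
    content = [{"type": "text",
                "text": "Rate the visual complexity of each packaging on a "
                        "scale from 1 (simple) to 9 (complex)."}]
    for key in ordered_keys:
        content.append({"type": "image_url",
                        "image_url": {"url": images[key]}})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content

for order in itertools.permutations(images):   # both presentation orders
    print(order, "->", rate_complexity(order))
```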
DALL-E 3 also pinpoints specific design elements when asked to describe differences and similarities between the packages (Figure 5), some of which point to potential confounds of visual complexity. For example, the minimalist design could be perceived as more upscale, which may entail different premium perceptions between the designs. Furthermore, since the complex design has a higher sensory appeal, it can trigger sensory imagery and sensory expectations that are absent in the minimalistic design (see also Ton et al., 2023). Depending on the specific aim of the study, these aspects could limit the results' validity.

FIGURE 5 Using DALL-E 3 to describe differences and similarities between visual stimuli.
Taken together, our endeavors in using GPT-4 and DALL-E 3 for pretesting illustrate that both models can help generate and evaluate study materials, especially when it comes to assessing survey items, providing descriptions, and making generic evaluations. In doing so, GPT-4 and DALL-E 3 can provide qualitative insights that researchers would similarly expect from textbooks, experts in the field, or members of the target audience, which can be helpful for an initial evaluation of the survey materials. However, since LLMs such as GPT may hallucinate factually incorrect statements, it is imperative that researchers treat them as an informant whose statements have to be independently checked and verified. For example, a qualitative assessment of study items can help check materials for obvious mistakes and identify ambiguous or unintuitive wordings. Furthermore, parceling out similarities and differences between stimuli can help identify potential confounds or alternative interpretations.

| Using silicon samples for main studies
Using silicon samples to generate a data set for quantitative main studies also poses a series of challenges across all stages of the research process, which researchers need to address before further utilizing any output (Table 1). While several of these issues can readily be addressed by today's standards (e.g., probing the output's robustness against linguistic features), others require clarification through follow-up research (e.g., quality standards for silicon samples). We outline some challenges in greater detail below and derive recommendations based on the state of research on silicon sampling. However, we encourage researchers to follow the latest developments and research in the field to make informed decisions regarding the appropriate use of silicon samples in their projects.

| Critically assess to what degree the training data can inform the research question
Whether silicon samples are an appropriate data source depends strongly on whether the training data contains information relevant to the research question (e.g., McCoy et al., 2023; Santurkar et al., 2023). We therefore recommend that researchers critically assess to what degree the training data can, in general, inform a research question. For example, research on customer satisfaction or service failures may have a relevant representation in the training data (e.g., through data from review platforms), while research on a very specific target audience (e.g., users of only a specific brand) may not be appropriately captured.
Identifying the population to which the results generalize is a fundamental challenge in this regard. Training data sets such as The Pile are simply collections of massive amounts of data without a clearly defined population. Since the population remains undefined, researchers need to make their prompts more concrete to generate target group-specific results (Argyle et al., 2023), assuming that these target groups are adequately represented in the training data. However, this endeavor's effectiveness depends strongly on the prompt structure as well as the LLM used, and could vary across versions. For example, GPT-4 demonstrates a higher range of capabilities than GPT-3.5 does with respect to standardized tests (OpenAI, 2023).

| Customize the LLM and optimize the prompt
Using LLMs to simply interrogate the training data without further customization will probably not yield meaningful results but merely produce a generic response with little variation (Park et al., 2023).
Approaches to customization include fine-tuning and prompt-tuning to improve the LLM's performance on a designated task (Demszky et al., 2023). Furthermore, researchers should use probing to assess the output's robustness against, for example, linguistic features (Manning et al., 2020). This process requires first identifying features pertinent to the concept of interest (e.g., negation, use of first person, or synonyms) and then varying the input based on these characteristics to examine their influence on the output. For example, researchers can generate a set of sentence pairs as input, differentiated solely by the inclusion or exclusion of negation in verbs (e.g., "satisfied" vs. "not satisfied"). Subsequently, they can analyze the difference in the outputs generated by the model in response to sentences with and without negation. This comparison helps in assessing whether the model accounts for negation in its predictions and, if it does, in identifying which elements in the vector show the strongest association with negation (Demszky et al., 2023). Our assessment of order effects regarding different packaging designs (Web Appendix Figure A2) is another example of a probing task. We recommend that researchers make use of these customization approaches, for example, by iteratively adjusting the prompts where necessary. In doing so, researchers should check the output and, for example, confirm that it follows the intended format (e.g., selecting one out of multiple options or writing a short text). Under no circumstances should researchers adjust prompts to generate a specific output content.
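As a minimal sketch of such a probing task, the following Python code compares embedding representations of sentence pairs that differ only in negation. The embedding model name is an illustrative assumption; the same logic applies to other representation models.

```python
from openai import OpenAI
import numpy as np

client = OpenAI()
pairs = [
    ("I am satisfied with the service.",
     "I am not satisfied with the service."),
    ("The staff was helpful.",
     "The staff was not helpful."),
]

def embed(texts):
    # Retrieve one embedding vector per input text.
    result = client.embeddings.create(model="text-embedding-3-small",
                                      input=texts)
    return np.array([d.embedding for d in result.data])

for positive, negated in pairs:
    vec_pos, vec_neg = embed([positive, negated])
    diff = vec_neg - vec_pos
    cosine = vec_pos @ vec_neg / (np.linalg.norm(vec_pos) *
                                  np.linalg.norm(vec_neg))
    # Dimensions with the largest shift are candidates for a "negation"
    # direction in the representation space.
    print(f"cosine similarity = {cosine:.3f}; "
          f"strongest-shift dimensions: {np.argsort(-np.abs(diff))[:5]}")
```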
Ophthalmology researchers Taloni et al. (2023) recently used GPT to create a fake data set that is practically indistinguishable from regular data but produces false medical evidence. Commenting on their results, one of the authors noted in an interview: "The possibilities are endless, and increasing the quality of the prompts may lead to even more detailed and realistic datasets compared to the one we fabricated" (Fiore, 2023).
When it comes to customizing the model, researchers need to be aware that the cure could be worse than the disease, in that their intervention potentially introduces additional biases beyond those that traditional confounding produces. For example, the prompt design could reflect researchers' prior beliefs and expectations, thereby unconsciously inducing a confirmation bias.
Such practice could easily turn a confirmatory study into an exploratory fishing expedition, a practice that has been criticized as p-hacking (Guo & Ma, 2022; Sarstedt & Adler, 2023; Simonsohn et al., 2014) and is commonly viewed as a major contributor to low replication rates in various fields (Ioannidis, 2005; Miller & Ulrich, 2022). Researchers should also consider enforcing a certain degree of result variability to generate a range of plausible results, rather than striving for a precise estimate that primarily reflects the training data's idiosyncrasies and the researcher's degrees of freedom. From this perspective, a certain degree of variability is required to ensure generalizable results. Consequently, researchers should be aware of the potential sensitivity of results to customization approaches. For example, tailoring an LLM to a specific research question might also mean that its outputs are not generalizable, which, in turn, implies that fine-tuned and prompt-tuned models may only be of limited use for hypothesis generation and exploratory research.

TABLE 1 Guiding questions to consider when using silicon samples in research projects.

Step in the research process | Guiding questions
Alignment of the training data and the research question | Can the training data inform the research question in a valid way?

| Use a human benchmark sample
In light of the challenges identified above, we recommend that researchers always benchmark their silicon samples with human samples to avoid the risk of tapping into an area where LLMs and humans react differently. Such benchmarking does, of course, offset some of silicon sampling's cost and time advantages. To reap the benefits of silicon sampling, researchers should therefore employ a parsimonious research design for the human benchmarking study, or use secondary data. When conducting such benchmarking, researchers should compare results from different silicon samples (e.g., arising from the use of different LLMs) to identify the one that matches the human sample most closely. In doing so, however, care needs to be taken not to overfit the silicon sample to a specific human benchmark in a way that the silicon sample perfectly mirrors one human sample's results but does not generalize to other human samples.
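The following sketch illustrates such a benchmarking check on simulated placeholder data, comparing central tendency and variation across a human and a silicon sample; real studies would load their collected responses instead.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
human = rng.normal(loc=4.8, scale=1.2, size=200)     # human ratings (1-7)
silicon = rng.normal(loc=5.1, scale=0.4, size=200)   # silicon ratings

print(f"means:    human {human.mean():.2f} vs silicon {silicon.mean():.2f}")
print(f"std devs: human {human.std(ddof=1):.2f} vs "
      f"silicon {silicon.std(ddof=1):.2f}")

# Levene's test flags unequal variances, e.g., the reduced variation
# ("correct answer effect") often observed in silicon samples.
stat, p = stats.levene(human, silicon)
print(f"Levene test for equal variances: W = {stat:.2f}, p = {p:.4f}")
```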

| Justify and adapt analytical procedures
Researchers also need to be aware that silicon samples pose challenges to standard data analytical procedures. For example, researchers could easily produce very large silicon samples that minimize standard errors, thereby rendering standard inference testing of limited value. Similar to big data applications, we recommend that researchers abandon significance testing for very large silicon samples and focus on interpreting effect sizes (Anderson, 2022) or switch to Bayesian data analysis (Wagenmakers, 2007). However, given the nature of the training data and the sensitivity of the results to the prompt structure, the corresponding estimates are probably associated with a substantial amount of uncertainty, that is, quantified doubt about the value of the measurand (i.e., the quantity whose value is sought; JCGM, 2012). High uncertainty implies that a measurement is consistent with a wide range of plausible values for the measurand, which may lead to a wide range of obtained values across different measurements and studies. Quantifying and managing this uncertainty are major challenges, as recent research has highlighted in related contexts (Rigdon & Sarstedt, 2022; Rigdon et al., 2020; Rigdon et al., 2023).
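The following simulation illustrates why inference testing loses its diagnostic value in very large silicon samples: a negligible difference becomes "significant" at scale, while the effect size (Cohen's d) correctly remains trivial. The data are simulated for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
for n in (100, 10_000, 1_000_000):
    a = rng.normal(0.00, 1.0, size=n)
    b = rng.normal(0.02, 1.0, size=n)      # tiny true difference
    t, p = stats.ttest_ind(a, b)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d = (b.mean() - a.mean()) / pooled_sd  # Cohen's d (effect size)
    print(f"n={n:>9,}: p = {p:.4f}, Cohen's d = {d:.3f}")
```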

| Optimize reproducibility and transparency
Given the potential sensitivity of the results to various researcher degrees of freedom, we highly recommend that researchers record as much information as possible (e.g., different variants of prompts; which LLM version was used) and adhere to transparent reporting practices that allow other researchers to replicate the methodology and reproduce the results. While full reproducibility will hardly be possible due to output variability, enough information should be offered for other researchers to assess, for example, the quality of prompts given the purpose that the silicon sample seeks to fulfill.
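As a minimal sketch of such record keeping, the following Python snippet appends every model call, including settings, prompt, and output, to a JSON Lines log. The file layout and field names are an illustrative choice, not a prescribed standard.

```python
import json
import datetime

def log_run(path, model, settings, prompt, output):
    # One self-contained record per model call, timestamped in UTC.
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model": model,          # exact model/version string
        "settings": settings,    # e.g., temperature, top_p, seed
        "prompt": prompt,
        "output": output,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")  # JSON Lines

log_run("silicon_sample_log.jsonl",
        model="gpt-4",
        settings={"temperature": 1.0, "top_p": 1.0},
        prompt="Assess the survey item ...",
        output="The item is double-barreled ...")
```

Such a log can be deposited in repositories such as the OSF alongside the study materials, allowing others to audit the prompts and settings even when exact outputs cannot be regenerated.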

| Ethical issues in silicon sampling
Researchers in the fields of psychology and marketing as a whole should also be aware of ethical issues raised by LLMs in general and, specifically, by the use of silicon samples. Ethical issues concern moral judgments, such as saying that an action is "right" or "wrong," or "good" or "bad." Ethics as a field of philosophy has proposed various perspectives from which to evaluate ethical judgments, of which deontological and utilitarian ethics are the most common ones.
Within applied ethics, the ethics of technology, and particularly the ethics of AI and GPT as part of a socio-technical ecosystem, have recently emerged as a new area of philosophical inquiry (Stahl, 2022; Stahl et al., 2024). The use of silicon samples can be assessed using the deontological perspective or the utilitarian ethical perspective. From a deontological perspective, silicon samples may be opposed categorically. Some might argue that data from silicon samples are a very different type of data collection, which violates research values and a researcher's responsibility to collect original data from human respondents. In comparison, the utilitarian perspective is more pragmatic, considering the costs and benefits of using LLMs.
One major cost discussed in this article concerns the lack of accuracy (or validity) that may turn silicon samples on a grand scale into "silly samples" that simply parrot human texts or, worse, are misused to forge seemingly "novel" and "interesting," but ultimately misleading, findings. In this respect, an established area of research that is well represented in the training data (e.g., product and service evaluations) may yield more accurate silicon sample results than a less established research area (e.g., consumer response to a crisis such as COVID-19).
Another cost concerns potential biases in existing samples; it is important for researchers to recognize and mitigate biases that may be present in the training data of these models to prevent the perpetuation of stereotypes or unfair representations. Researchers have even referred to a new "AI colonialism" because, for example, the training data paints an incomplete, and potentially biased, picture of non-Western cultures (Hao, 2022), as the data normally originates from Western institutions. Atari et al. (2023) provide support for this notion. The authors contrasted culture-specific beliefs in the World Values Survey with GPT outputs and found that GPT's ability to mimic human responses declines considerably for non-WEIRD versus WEIRD (Western, Educated, Industrialized, Rich, and Democratic) countries. Finally, there is the cost of not sufficiently considering new consumer- and marketing-relevant developments and new information. In this case, research topics that are less variant given short-term developments, such as customer engagement (e.g., Hollebeek et al., 2024), may be studied more accurately than, for example, research relating to very recent phenomena such as the COVID-19 pandemic (Imschloss & Schwemmle, 2023).
These potential costs need to be weighed against some major benefits of silicon samples. Benefits include immense cost and time savings, which could also mitigate inequity issues among researchers. As the number of empirical studies required by consumer and marketing journals has been increasing, critics have voiced concerns that researchers with fewer funds are being "priced out" of the game. Silicon samples offer a solution to this equity issue.

| THE WAY FORWARD
LLMs' ability to deterministically produce correct answers to logic puzzles (e.g., cognitive reflection tasks) and standardized tests is certainly impressive by computer science standards but proves problematic when these models are used to mimic human response behavior with all its imperfections and facets. If LLMs such as GPT act like rational agents, their results cannot readily be used to explain or predict consumers' boundedly rational decision-making. Besides the differences in risk preferences and system 1 processing, GPT, specifically, can hardly mimic interpersonal differences, which are crucial for psychological and marketing research studies (e.g., Abdurahman et al., 2023; Park et al., 2023; Santurkar et al., 2023). However, given that the field is evolving rapidly (i.e., industrial players such as OpenAI and Google release improved LLMs in quick succession), it is reasonable to assume that future LLM implementations will mimic human behavior more closely than current implementations do. For example, multimodal LLMs interact with external information sources, tools, sensory data, and images, thereby increasing their data richness substantially. Similarly, structured and even semistructured task performance improved substantially after the latest GPT releases, and competitors such as Google's Gemini began outperforming GPTs (Gemini Team Google et al., 2023). We therefore advise researchers who consider using silicon samples to take these developments into account because new and improved versions may address some of the pitfalls mentioned in this paper or at least reduce their negative impact.
Great tasks lie ahead: Researchers should not only evaluate LLMs across a broader set of conditions and tasks but also explain LLMs' responses vis-à-vis human decision-making and behavior (see also Binz & Schulz, 2023; Dillion et al., 2023) to identify pathways for using them constructively in scholarly research (Demszky et al., 2023; Susarla et al., 2023).
Commenting on GPT models' suboptimal performance in replicating fundamental behavioral effects, Park et al. (2023, p. 24) note that "such behavioral differences were arguably foreseeable, given that LLMs and humans constitute fundamentally different cognitive systems: with different architectures and potentially substantial differences in the mysterious ways by which each of them has evolved, learned, or been trained to mechanistically process information." Decades of consumer behavior research have sought to shed light on these "mysterious ways," but corresponding research in the fields of machine behavior (Rahwan et al., 2019) and machine psychology (Hagendorff et al., 2023) is still in its infancy. For example, how do prompt-tuning and fine-tuning impact the silicon sample's quality? These are just some of the emerging issues that will require researchers to rethink some of the basic principles of sampling research.
Relatedly, efforts should be made to develop means to identify silicon samples. Real-world data, for example, often include outliers that may not be present in silicon samples where researchers define variable ranges in their prompts (Taloni et al., 2023). Commenting on the misplaced recommendations produced by their fake data set (Taloni et al., 2023), one of the authors noted in an interview that "we will witness an ongoing tug-of-war between fraudulent attempts to use AI and AI detection systems" (Fiore, 2023). Publishers, journal editors, and reviewers need to be aware of this tug-of-war, which other fields are already witnessing in the context of paper mills (Candal-Pedreira et al., 2022; Day, 2022; Pérez-Neri et al., 2022).
Consumer researchers, psychologists, computer scientists, and experts from other fields will have to team up to develop LLMs that are optimized for human-like reactions and interactions with users, going beyond the fine-tuning and prompt-tuning of current LLMs. As a notable step in this direction, Replika promotes "The AI companion who cares" (https://replika.com/), which is optimized for building an emotional relationship with the user. This purpose is closer to consumer research's striving to understand and predict human behavior than that of LLMs such as GPT, which are optimized to outperform humans in various knowledge domains. Making LLMs more human will require close collaboration between researchers from various fields of scientific inquiry, for example, by translating human behavior into computational functions, an approach that is already prevalent, for example, in the choice modeling literature (Gonzalez, 2023).
Beyond these empirical questions, silicon samples, and LLMs more generally, raise new philosophy-of-science questions. For example, researchers have bemoaned that GPT may become "progressively a self-licking lollipop" (Taleb, 2023) as an increasing share of future training data will consist of earlier GPT outputs, thereby triggering self-reinforcement. Now imagine a future research world in which the sample of choice, or the sole type of sample being used, is the silicon sample because it has evolved to be low-cost, efficient, and able to mimic human responses to a satisfactory degree.
Will this lead to a stagnation in new insight because no "new data" about actual human behavior will be created? Will the findings from such samples converge into homogeneous solutions that lack the variability that characterizes us as humans? Also, can consumer researchers identify and study change over time in consumer behavior in such a world without studying real people?

| CONCLUSION
Silicon samples hold considerable promise for consumer research as a means to provide human-like data quickly and on a large scale. But can LLMs like GPT serve as "guinea pigbots," as Hutson (2023, p. 121) vividly surmises? By today's standards, LLMs can be useful in settings where researchers collect external feedback to inform further steps in the research process. In such situations, LLM results may give rise to concern and induce researchers to reconsider specific aspects of their project. But as with many disruptive developments, the potential for misuse is real. Leaving fraudulent behavior aside, if adopted without sufficient reflection or for tasks for which LLMs have not been designed, LLMs will likely misinform consumer and marketing researchers, with potentially fatal consequences for a field that is already under close scrutiny in light of growing concerns about the replicability of research findings and their lack of relevance for managerial decision-making (Adler et al., 2023; Krefeld-Schwalb & Scheibehenne, 2023). Given that silicon samples offer considerable potential that may invite their premature use in research, scientists should keep in mind that "with great power comes great responsibility."


Table 2 summarizes our key recommendations in light of the state of research on silicon sampling. Researchers should view the recommendations as a checklist to guide their future research projects employing LLMs.

TABLE 2 Key recommendations in light of the state of research on silicon sampling.

Pretesting and pilot studies
• Use LLMs such as GPT and other forms of generative AI (e.g., DALL-E) primarily for tasks whose results can be independently evaluated by researchers. Examples: generating items for scale development and index construction; pretesting survey items; crafting and pretesting visual stimuli or vignettes.

Alignment of the training data and the research question
• Use silicon samples for research questions that are likely to have relevant coverage in the training data. Avoid research on very specific target audiences and concepts that are not sufficiently represented in the training data.

Customization, prompting, and probing
• Customize the model by supplying more specific training data (fine-tuning) or customizing the prompts (prompt-tuning).
• Apply probing to safeguard the output's robustness against linguistic features or order effects.

Benchmarking
• Benchmark a silicon sample with a relevant human sample.
• Evaluate whether the silicon sample matches the human sample in measures of central tendency and variation.
• Disregard silicon samples that do not show variation but provide a single "correct answer" (Park et al., 2023).

Transparency and reproducibility
• Always save a record of model settings, input parameters, and outputs.
• Preferably document all relevant information (e.g., model settings, input parameters, and outputs) in repositories such as the OSF to increase transparency.