A Reliability Generalization Meta-Analysis of the Kaufman Domains of Creativity Scale

The Kaufman Domains of Creativity Scale (K-DOCS) is a self-report rating scale that measures creative behaviors in five areas. Despite the vast amount of research on the scale, the internal consistency reliability of K-DOCS scores has not been examined. Specifically, there is no study on the overall reliability coefficients, the variation in the reliability of scores, and reliability induction. In the present study, reliability generalization meta-analyses were conducted to address these issues. The sample consisted of 56 studies that produced 60 Cronbach's alpha coefficients in total. The pooled alpha values were estimated to be .904 (total scale), .825 (Self/Everyday subscale), .858 (Scholarly subscale), .887 (Performance subscale), .867 (Scientific subscale), and .861 (Artistic subscale). The moderator analyses showed that the reliability estimates of K-DOCS total scores and Self/Everyday subscale scores did not differ with respect to any of the moderator variables. On the other hand, certain variables affected the alpha coefficients for the Scholarly (educational level, SD of the age, and mean age), Performance (continent, educational level, ethnicity, Caucasian percentage, SD of the age, and mean age), Scientific (language, test version, continent, country, ethnicity, SD of the age, and mean age), and Artistic (continent, language, country, mean age, and total mean score) subscale scores. Finally, the reliability induction rate was 39.62%, and there were no statistically significant differences between the inducing and reporting studies with respect to the continuous variables (mean of the total score, means and SDs of the sample age, and percentages of female and Caucasian respondents). Our findings indicate that the K-DOCS provides highly reliable scores. However, certain variables contribute to systematic errors in Scholarly, Performance, Scientific, and Artistic subscale scores. Hence, scores on these subscales should be interpreted with caution. Due to the high amount of variation in the reliability coefficients, reliability induction is not advised for the K-DOCS.

Creativity plays a pivotal role across diverse domains, spanning from the arts and sciences to business and education. For instance, in artistic domains, creativity leads to artwork that evokes emotions, challenges conventions, and captures the essence of the human experience. In business, creativity enables organizations to adapt, differentiate, and seize opportunities in markets. At its core, creativity empowers individuals to perceive the world through fresh lenses, envision new horizons, and generate original as well as useful ideas, solutions, perspectives, and products (Funke, 2009; Mumford, 2003; Runco, 2007; Runco & Jaeger, 2012; Starko, 2014; Stein, 1953).
Creativity manifests itself in two forms: creative potential and creative performance (Guilford, 1966; Hinton, 1968). Creative potential refers to an individual's capacity to generate novel and functional ideas or products (Guilford, 1966; Runco et al., 2001). On the other hand, creative performance pertains to the tangible outcomes (e.g., a song or a novel) and practical manifestation of creative potential (Guilford, 1966). Therefore, the concept of creativity is shaped by the interplay between creative potential and creative performance.

DOMAIN-GENERALITY AND DOMAIN-SPECIFICITY OF CREATIVITY
There are three different views on the nature of creativity. Some researchers argue for the domain-general view, which suggests that individuals must possess domain-general skills to produce creative work in any domain (e.g., Chen et al., 2006; Plucker, 1998, 1999, 2004; Simonton, 2017). Research on the characteristics of creative people provides supporting evidence for this standpoint (Silvia et al., 2009). Some other researchers propose that creativity is domain-specific (e.g., Baer, 1998, 2012, 2015; Boccia et al., 2015; Dow & Mayer, 2004). In other words, each domain requires a different set of thinking skills, and creative thinking skills in one domain cannot be transferred to another. Research on creative performance offers evidence that supports this perspective (Silvia et al., 2009). Finally, there are researchers who hypothesize that creativity entails both domain-general and domain-specific requirements (e.g., Baer & Kaufman, 2005; Plucker & Beghetto, 2004; Sternberg, 2009).
(The Journal of Creative Behavior, Vol. 0, Iss. 0, pp. 1-26. © 2023 The Authors. The Journal of Creative Behavior published by Wiley Periodicals LLC on behalf of the Creative Education Foundation (CEF). DOI: 10.1002/jocb.620. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.)
An example of this third view is the Amusement Park Theory (Baer & Kaufman, 2005), which presents a four-level model. As the levels move forward from the first to the last, each subsequent level becomes more domain-specific. According to the model, a person needs to possess both domain-general skills (e.g., general intellectual ability and desire to generate work) and domain-specific skills (e.g., style of thinking) in thematic areas (e.g., the arts and science), domains (e.g., music and literature domains in the arts), and microdomains (e.g., piano or guitar in the music domain) to produce creative work.
The Amusement Park Theory highlights that researchers need to identify specific factors in a particular area, domain, or micro-domain to develop instruments that can reliably, validly, and fairly measure creative potential or creative performance. However, there are numerous areas (e.g., the arts and sports), domains (e.g., music and painting in the arts), and micro-domains (e.g., piano and guitar in music) to consider (Baer & Kaufman, 2005). Additionally, there are various ways to measure creativity, such as employing divergent thinking tasks, observing behaviors, and rating products. Hence, the development of reliable, valid, and fair instruments to measure creativity across different domains poses considerable challenges.
ASSESSMENT OF CREATIVITY

Different instruments or methods are available for measuring creativity in various domains. Instruments or methods that intend to assess individuals' achievements (e.g., the Creative Achievement Questionnaire; Carson et al., 2005) and products (e.g., the Consensual Assessment Technique; Amabile, 1982) are two examples. The Creative Achievement Questionnaire (Carson et al., 2005) asks respondents to report their past achievements in 10 domains and measures creative performance. The Consensual Assessment Technique (Amabile, 1982) makes it possible to evaluate creative products based on expert judgment.
Another approach involves the use of instruments that focus on everyday creative behaviors in certain domains. The Creative Behavior Inventory (Hocevar, 1979) and the Kaufman Domains of Creativity Scale (K-DOCS; Kaufman, 2012) are examples of this final approach. The K-DOCS differs from the Creative Behavior Inventory in that the former requires respondents to rate their creativity level on certain behaviors (e.g., "Making up rhymes;" Kaufman, 2012, p. 307) compared to other people, while the latter asks respondents to report whether they displayed certain behaviors (e.g., "Designed a game;" Hocevar, 1979, p. 15).
Although both are respected instruments, considerably more researchers prefer to use the K-DOCS to measure creative behavior. As of August 21, 2023, the number of studies that cited the K-DOCS (n = 467) is much higher than the number of studies that cited the Creative Behavior Inventory (n = 248), based on the results on Google Scholar. This is possibly due to the psychometric qualities (see Brauer et al., 2022; Kapoor et al., 2021; Kaufman, 2012; Miroshnik et al., 2022), easy-to-use structure, and recency of the K-DOCS. Additionally, the K-DOCS stands out among other instruments, as it was developed to measure everyday creative behaviors from a layperson's perspective in a short amount of time (Kaufman, 2012). Furthermore, even if respondents have not exhibited a certain behavior covered in a particular item, the K-DOCS allows respondents to rate themselves based on behaviors similar to those targeted by the item (Kaufman, 2012).
Hence, the present study examines the K-DOCS. Specifically, this study investigates the internal consistency reliability of K-DOCS scores. We followed the guidelines explained in the REGEMA (reliability generalization meta-analysis) checklist for conducting the study (see Sánchez-Meca et al., 2021).
When developing the scale, Kaufman (2012) created an item pool composed of 94 items based on the Creativity Domain Questionnaire (Kaufman et al., 2009). Those 94 items were analyzed through exploratory factor analysis (EFA) to determine the best items and identify the factors. The sample was composed of 2,318 undergraduate students in the United States. Fifty items with factor loadings over .45 were retained after the analyses. Those 50 items loaded on five factors, each of which had an eigenvalue greater than 2.00. Finally, considering the target behaviors of the items, those five factors were labeled as the "Self/Everyday Creativity," "Scholarly Creativity," "Performance Creativity," "Mechanical/Scientific Creativity," and "Artistic Creativity" factors (Kaufman, 2012, pp. 300-302).
The retained 50 items constitute the original version of the K-DOCS. Each item is on a 5-point scale, ranging from "much less creative" to "much more creative" (Kaufman, 2012, p. 307). The lowest response option is given 1 point, while the highest option is given 5 points. There is no hierarchy among the items within a subscale.
The Self/Everyday subscale focuses on intrapersonal (e.g., "Understanding how to make myself happy;" Kaufman, 2012, p. 307) and interpersonal (e.g., "Helping other people cope with a difficult situation;" Kaufman, 2012, p. 307) creative behaviors. The Scholarly subscale is related to the use of knowledge in certain domains that involve verbal or written communication (e.g., "Debating a controversial topic from my own perspective;" Kaufman, 2012, p. 307). The Performance subscale focuses on behaviors that require some level of psychomotor activity in certain domains, such as music and writing (e.g., "Learning how to play a musical instrument;" Kaufman, 2012, p. 307). The Scientific subscale encompasses certain behaviors in science, engineering, and math (e.g., "Writing a computer program;" Kaufman, 2012, p. 307). Finally, the Artistic subscale covers art-related behaviors in certain domains such as painting, photography, and sculpture (e.g., "Making a sculpture or piece of pottery;" Kaufman, 2012, p. 307).
The items require respondents to compare themselves to other people regarding certain behaviors and to rate themselves in those behaviors (see Kaufman, 2012 for details). The following instructions are given to the respondents: "Compared to people of approximately your age and life experience, how creative would you rate yourself for each of the following acts? For acts that you have not specifically done, estimate your creative potential based on your performance on similar tasks" (Kaufman, 2012, p. 300). Kaufman (2012) did not explain the target populations of the K-DOCS. However, he stated that the original study "allow[ed] only a certain amount of extrapolation to the general population" (Kaufman, 2012, p. 304). Thus, it can be inferred that the target populations consist of individuals from the general population, regardless of age and field of interest or expertise.
Different versions of the K-DOCS are available. In addition to the full version with 50 items, there are validated 42-item (Şahin, 2016a) and 20-item (Tan et al., 2021) versions. Note that the former is in Turkish and the latter is in English. The short versions also measure creative behaviors in five areas. Apart from these versions, some researchers use modified versions of the scale; note that those modifications are not based on any statistical analysis (e.g., Ashraf et al., 2019; Jung et al., 2021).

Psychometric properties of the K-DOCS

Validity of the original version
The evidence for the structural and generalizability aspects of the validity of the K-DOCS is satisfactory. The five-factor model identified by Kaufman (2012) and measurement invariance were supported in previous studies. Kapoor et al. (2021) analyzed data obtained from 22,013 respondents to examine the structural and generalizability aspects of validity. Note that the data were collected from university students in the United States in previous studies. A confirmatory factor analysis (CFA) and measurement invariance analyses were employed. The five-factor model received statistical support from the CFA. Measurement invariance analyses showed that strict invariance held for gender. Kapoor et al. (2023) addressed the structural and generalizability aspects of validity using data obtained from 15,868 respondents. Note that these data were also collected from university students in the United States in previous studies. The CFA provided supporting evidence for the five-factor model. Supporting evidence for strict invariance was obtained for ethnicity (i.e., European American, African American, Asian American, and Hispanic American respondents). Finally, Brauer et al. (2022) conducted a CFA using the responses of 511 German respondents and measurement invariance analyses using the responses of 502 German respondents. The findings showed that the K-DOCS was a five-factor instrument and that scalar invariance held for gender.
The external aspect of validity was also examined for K-DOCS scores. Kaufman (2012) estimated correlations among the five subscales and five personality traits (Extraversion, Agreeableness, Conscientiousness, Emotional Stability, and Openness to Experience) to examine convergent validity. Consistent with previous research (Feist, 2019a, 2019b; Grajzel et al., 2023; King et al., 1996), statistically significant (p < .001) positive correlation coefficients were estimated between extraversion and the Self/Everyday (r = .19), Scholarly (r = .13), and Performance (r = .24) subscales, as well as between openness to experience and the Self/Everyday (r = .15), Scholarly (r = .42), Performance (r = .31), and Artistic (r = .19) subscales. McKay et al. (2017) and Miroshnik et al. (2022) also examined the relationship between the five subscales of the K-DOCS and certain personality traits. McKay et al. (2017) conducted a study with 802 American and 500 Polish respondents. The correlation coefficients among the five K-DOCS subscales and extraversion as well as openness to experience were statistically significant at the p < .05 level, except for the coefficient between extraversion and the Scholarly subscale. The coefficients ranged from .12 to .51. Miroshnik et al. (2022) collected data from 1,313 Russian respondents to examine the relationship between scores on the K-DOCS and scores on the short version of the Creative Behavior Inventory (Dollinger, 2003) as well as certain personality traits (e.g., curiosity). Statistically significant (p < .001) positive correlation coefficients were estimated between the following: the Artistic subscale and the Visual Arts (r = .54) as well as Crafts (r = .30) subscales of the inventory; the Performance subscale and the Literature (r = .43) subscale of the inventory; and the Scholarly subscale and curiosity (r = .24). Kandemir and Kaufman (2020) worked with 1,215 university students in Türkiye to compare K-DOCS scores across different majors. The students were pursuing degrees in various fields (e.g., Primary Education, Social Science Education, and Mathematics). After conducting two-way ANOVA analyses, Kandemir and Kaufman (2020) found no statistically significant differences among majors for the Self/Everyday and Scholarly subscales. On the other hand, on certain subscales, students majoring in art (the Performance and Artistic subscales) and mathematics (the Scientific subscale) scored significantly higher, providing evidence for convergent validity.

Validity of the 42-Item version
Şahin (2016a) conducted a study with two different samples. One sample consisted of 241 and the other of 254 Turkish high school students. Şahin ran an EFA and a CFA. After conducting the EFA on the responses from the first sample, Items 1, 5, 16, 28, 33, 41, 42, and 49 were removed from the scale due to low factor loadings or loading on more than one factor. The CFA conducted on the responses from the second sample provided supporting evidence for the five-factor, 42-item scale.

Validity of the 20-Item version
Tan et al. (2021) worked with three different undergraduate student samples in Malaysia. Initially, an EFA was conducted using 484 responses to the 50-item version. Based on the results, the four items with the highest factor loadings were retained for each subscale, resulting in 20 items in total. Afterwards, the 20-item and 50-item versions were compared based on model fit indices. Responses obtained from 724 students were analyzed through a CFA. The 20-item version had superior model fit indices compared to the 50-item version. Measurement invariance was also tested for ethnicity (Malay vs. non-Malay) in the second sample. Supporting evidence for strict invariance was obtained. Finally, Tan et al. collected data from 201 students using the 20-item version and examined convergent validity. Correlation coefficients between scores on the 20-item version and scores on the Creative Self-Efficacy (Beghetto, 2006) and Self-Perceived Creativity (Zhou & George, 2001) measures, story writing, and the Product Improvement Task (Torrance, 1974) were estimated. Statistically significant (p < .001) positive correlation coefficients were obtained between scores on the K-DOCS and scores on Creative Self-Efficacy (rs ranged from .19 to .39) as well as Self-Perceived Creativity (rs ranged from .26 to .41).

Reliability of the original version
A range of reliability coefficients has been reported for K-DOCS scores. Kaufman (2012) estimated Cronbach's alpha (α; Cronbach, 1951) coefficients for internal consistency and test-retest reliability coefficients for temporal stability. The α coefficients were as follows: .83 (Artistic), .86 (Self/Everyday, Scholarly, and Scientific), and .87 (Performance). The test-retest reliability coefficients were .80 (Self/Everyday), .76 (Scholarly), .86 (Performance), .78 (Scientific), and .81 (Artistic). Kaufman (2012) did not report the reliability coefficients for the total score due to the multidimensional structure of the scale (J. C. Kaufman, personal communication, July 27, 2022). The reliability coefficients reported in other studies for the full version were between .69 and .98 for the total score and between .63 and .94 for the subscale scores.
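For readers unfamiliar with the coefficient, Cronbach's α for a k-item scale is k/(k − 1) × (1 − Σs²ᵢ/s²ₜ), where s²ᵢ is the sample variance of item i and s²ₜ is the variance of the total score. The following is a minimal sketch of that formula; the data in the usage example are purely illustrative and come from none of the studies reviewed here.

```python
def cronbach_alpha(items):
    """Cronbach's alpha from item-level scores.

    items: a list of k columns, each a list of one item's scores
    for the same n respondents.
    """
    k = len(items)
    n = len(items[0])

    def var(xs):  # sample variance (n - 1 denominator)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_variance_sum = sum(var(col) for col in items)
    # Each respondent's total score is the sum across the k items.
    totals = [sum(col[i] for col in items) for i in range(n)]
    return k / (k - 1) * (1 - item_variance_sum / var(totals))
```

For instance, `cronbach_alpha([[1, 2, 3], [1, 2, 3], [1, 2, 3]])` returns a value of 1 (up to floating-point rounding), since three identical items are perfectly consistent.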

Reliability of the short versions
The reliability of scores on the short versions of the K-DOCS was also acceptable. The α coefficients for K-DOCS total and subscale scores were between .77 and .90 in Şahin (2016a). The α coefficients reported in other studies ranged from .76 to .92 for the 42-item version. The α coefficients for scores on the subscales were between .67 and .86 in the study by Tan et al. (2021). The α coefficients reported in other studies were between .64 and .86 for the 20-item version.

Variation among the reliability coefficients
As was just presented, there is considerable variation in the reliability of K-DOCS scores. This variation can be attributed to two distinct sources of measurement error: random error and systematic error (American Educational Research Association et al., 2014). Random error, stemming from unpredictable factors (e.g., health issues of respondents), contributes to the scattered distribution of scores around the true score (American Educational Research Association et al., 2014). This type of error could lead to inconsistent results upon repeated administration of an instrument. On the other hand, systematic error, which arises due to consistent biases in measurement, can be influenced by various factors such as culture, ethnicity, or gender (American Educational Research Association et al., 2014).
A quick investigation of studies on the K-DOCS reveals that the studies differed from each other with respect to certain variables, including Caucasian percentage, continent, country, culture, educational level, ethnicity, female percentage, language, mean of the age of respondents, mean of the total score, population, publication type (i.e., published vs. unpublished), publication year, scale version (i.e., full vs. short), SD of the age of respondents, and study focus (i.e., applied vs. psychometric). There is a possibility that some of these variables significantly impact the reliability of K-DOCS scores and explain a certain amount of systematic error, because the reliability of scores on an instrument may vary across subgroups (e.g., different countries and ethnicities) and may be affected by certain variables (e.g., Deng et al., 2019; Graham et al., 2006, 2011; Vicent et al., 2019).
It should be noted that several studies did not estimate reliability coefficients for the K-DOCS and committed reliability induction (i.e., the practice of not estimating score reliability with the data at hand). There are two types of induction: reliability induction by report (i.e., reporting the reliability estimates of prior research) and reliability induction by omission (i.e., reporting no reliability estimate). Both types of induction are discouraged and warrant caution among researchers, because reliability induction has the potential to lead to erroneous interpretations of score reliability for a given instrument and to a large number of errors in reliability generalization meta-analysis studies (Sánchez-Meca et al., 2021).

PURPOSE OF THE STUDY
The K-DOCS has been examined in different aspects, such as its factorial structure (e.g., Şahin, 2016a), validity (e.g., Awofala & Fatade, 2015), and measurement invariance (e.g., Kapoor et al., 2023). However, no study has focused on the overall reliability coefficients and addressed the variability of the reliability of K-DOCS scores. This situation represents an important research gap because previous studies showed that the reliability of scores on an instrument varies across subgroups and is impacted by certain variables (Deng et al., 2019; Graham et al., 2006, 2011; Vicent et al., 2019). Moreover, there has been a lack of research exploring reliability induction for the K-DOCS and comparing the characteristics of studies that omitted reliability with those that reported reliability. Due to the absence of evidence, it is not certain whether the inducing and reporting studies have similar characteristics or whether there is a reporting bias problem with regard to the reliability of K-DOCS scores (Sánchez-Meca et al., 2021).
In the present study, we aim to address these issues using the reliability generalization approach (Henson & Thompson, 2002; Vacha-Haase, 1998). Reliability generalization (RG) is a method that allows researchers to estimate an overall reliability coefficient, examine reliability for different subgroups, identify the sources of the variability in the reliability coefficients, and investigate reliability induction for an instrument (Sánchez-Meca et al., 2021; Vacha-Haase, 1998).
The following research questions are answered in the present study:
1. What are the pooled α coefficients for the K-DOCS total score and subscale scores?
2. How do the characteristics of the samples (Caucasian percentage, continent, country, culture, educational level, ethnicity, female percentage, language, mean of the age, SD of the age, mean of the total score, and population), research (publication type, publication year, and study focus), and instrument (scale version) impact the overall reliability coefficients?
3. Are the characteristics of the inducing studies different from the characteristics of the reporting studies with respect to the mean of the total score, means and SDs of the sample age, and percentages of female and Caucasian respondents?

METHOD
SEARCH STRATEGIES

In order to find the empirical studies on the K-DOCS, we conducted an internet search in several databases, including Academic Search Complete, ERIC, ProQuest, PsycARTICLES, PsycINFO, and Web of Science. The following keywords were searched in the entire paper: "Kaufman Domains of Creativity Scale," "K-DOCS and Kaufman," and "K-DOCS." Additionally, we conducted a search on Google Scholar using these keywords as well as the cited-by option. Overall, 1,001 studies were found.
In order to find additional studies, we scanned four prestigious journals on creativity. Those journals were Creativity Research Journal; The Journal of Creative Behavior; Psychology of Aesthetics, Creativity, and the Arts; and Thinking Skills and Creativity. We obtained an additional 26 studies through this process.
Finally, we conducted a backward search using the references cited in the reporting and inducing studies. We found four studies: Two of them were not written in English, while the other two were work-in-progress papers. As of September 1, 2022, the internet search produced 1,031 studies that mentioned the K-DOCS. See Figure 1 for the internet search procedure.
SELECTION CRITERIA

Studies were expected to meet certain criteria regarding research format, test version, language, publication year, and reliability coefficient to be included in the analyses. We included in the analyses the empirical studies in which the 50-item version, the 42-item version, or the 20-item version was administered. The reason is that these three versions were validated in previous studies. In some studies, researchers used the 50-item version but focused on certain subscales; we included those studies in the analyses as well because we conducted separate analyses for each subscale (see Table S1).
We selected the studies reported in English. The reasons are that most of the studies were published in English and that the authorship team does not speak any other languages besides their mother tongue. We kept the studies that have been conducted since 2012 because Kaufman (2012) published an article on the development of the K-DOCS in that year. See Table S1 for the publication years of the studies.
Finally, we kept the studies that reported α coefficients. Initially, we also intended to conduct RG analyses for other types of reliability coefficients. However, the overwhelming majority of the studies (87.5%) reported α coefficients. Only seven studies reported other types of reliability coefficients, including test-retest reliability coefficients (four studies) and omega coefficients (three studies). Separate RG analyses could not be conducted for test-retest and omega coefficients, and thus, those seven studies were excluded.

SELECTION OF STUDIES
Duplicates, nonempirical studies, and non-English studies were excluded. We kept 471 studies for consideration. Three hundred and sixty-five studies only mentioned the K-DOCS, and thus, they were excluded. The scale was administered in 106 studies. However, the reliability coefficients were not estimated in all 106 studies. Upon an initial examination of the studies, we identified 62 studies in which a reliability coefficient was reported.
The full version (original or translated) was used in 41 studies, the 42-item version was used in six studies, and the 20-item version was used in two studies. In nine studies, either one subscale or a few subscales were administered. In one study, respondents were given 54 items. Finally, a researcher-modified version of the scale was used in five studies.
We sent emails to the corresponding authors of the papers that did not include reliability coefficients for the entire scale and/or for each subscale. Fourteen corresponding authors emailed back and shared their estimations for either the K-DOCS total score or subscale scores; note that five of those authors had committed reliability induction. We obtained an additional 39 reliability coefficients through email. Overall, there were 67 studies whose α coefficients were available.
The α coefficients of four studies (Awofala & Fatade, 2015; Lee & Portillo, 2022; Tan et al., 2016, 2021) whose samples were the same as another study's were excluded; note that more than one set of α coefficients was reported in some studies. We did this to avoid the dependency problem, because when multiple effect sizes are reported in the same study based on the same sample, a dependency problem may arise (van den Noortgate et al., 2015). The α values were calculated from different samples in the rest of the studies that reported multiple α coefficients. Thus, we treated those studies as independent studies.
Five studies (Ashraf et al., 2019; Dong et al., 2022; Dousay & Weible, 2019; Jung et al., 2021; Magnusson, 2018) were removed from the analyses due to item number; note that the researchers used neither the full version nor the short versions and did not focus on any subscales in these studies. Susanto et al. (2018) administered 54 items, and this study was excluded. Another study (Aznar et al., 2021) was excluded because the parents filled out the K-DOCS for their children.
The final sample consisted of 56 studies. It should be noted that because some studies reported multiple α coefficients from independent samples, a total of 60 α values were obtained in this study. In the labeling of studies that reported more than one α coefficient, the source name was kept the same, with numerical identifiers added to the end (e.g., McKay et al., 2017_1 and McKay et al., 2017_2). It should also be noted that the inducing studies were saved in the coding sheet because we compared the inducing studies with the reporting studies.

ANALYSES

Publication bias
Publication bias was examined before the main analyses. We made use of Egger's regression test (Egger et al., 1997) as well as Begg and Mazumdar's (1994) rank correlation test; p-values greater than .05 indicate a lack of publication bias. We also utilized the trim-and-fill method (Duval & Tweedie, 2000) and the funnel plot approach.
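Egger's test checks funnel-plot asymmetry by regressing each study's standardized effect on its precision; an intercept far from zero signals asymmetry. Below is a minimal sketch of the regression core only (the accompanying t-test on the intercept is omitted, and the inputs in the test example are hypothetical; real analyses would use a dedicated package such as metafor).

```python
def egger_intercept(effects, ses):
    """Egger's regression core: regress z_i = effect_i / se_i on
    precision x_i = 1 / se_i via ordinary least squares and return
    (intercept, slope). An intercept far from zero suggests
    funnel-plot asymmetry.
    """
    z = [e / s for e, s in zip(effects, ses)]
    x = [1 / s for s in ses]
    n = len(z)
    mean_x = sum(x) / n
    mean_z = sum(z) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxz = sum((xi - mean_x) * (zi - mean_z) for xi, zi in zip(x, z))
    slope = sxz / sxx
    intercept = mean_z - slope * mean_x
    return intercept, slope
```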
Reliability generalization meta-analyses

Before conducting the RG analyses, the α coefficients were converted to Bonett's T-values (Bonett, 2002b). This procedure was employed to normalize the distributions of the α coefficients. We used the following formula for the transformation: T = ln(1 − α). We conducted random-effects meta-analyses with inverse variance weights (Vacha-Haase, 1998) to estimate the overall α values for K-DOCS total scores and subscale scores. Residual heterogeneity among the α coefficients was examined with Q-statistics (Cochran, 1954) and I² values (Higgins et al., 2003). High I² values and a statistically significant Q-test indicate heterogeneity. As suggested by Faggion Jr. et al. (2021), prediction intervals (PIs) were also estimated for the pooled reliability estimates under the random-effects model. PIs show the range of the reliability coefficients predicted for a future study and are estimated based on the population of studies already included in the meta-analysis (Nagashima et al., 2019).
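As a rough illustration of this pipeline, the following sketch transforms α values with Bonett's T = ln(1 − α), pools them with inverse-variance weights under a DerSimonian-Laird random-effects model, and computes Cochran's Q and I². The sampling variance 2k/[(k − 1)(n − 2)] follows Bonett (2002); the α values in the test example are hypothetical, and the actual analyses in this study were run with metafor in R.

```python
import math

def pool_alphas(alphas, ns, k_items):
    """Pool Cronbach's alphas under a DerSimonian-Laird
    random-effects model after Bonett's transformation.

    alphas: alpha coefficient of each study
    ns: sample size of each study
    k_items: number of items (assumed equal across studies here)
    Returns (pooled alpha, Cochran's Q, I-squared in percent).
    """
    # Bonett's (2002) transformation and its sampling variance.
    T = [math.log(1 - a) for a in alphas]
    v = [2 * k_items / ((k_items - 1) * (n - 2)) for n in ns]

    # Fixed-effect (inverse-variance) pooled value, needed for Q.
    w = [1 / vi for vi in v]
    t_fixed = sum(wi * ti for wi, ti in zip(w, T)) / sum(w)

    # Heterogeneity: Cochran's Q, I-squared, and DL tau-squared.
    q = sum(wi * (ti - t_fixed) ** 2 for wi, ti in zip(w, T))
    df = len(T) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)

    # Random-effects weights, then back-transform to the alpha scale.
    wr = [1 / (vi + tau2) for vi in v]
    t_pooled = sum(wi * ti for wi, ti in zip(wr, T)) / sum(wr)
    return 1 - math.exp(t_pooled), q, i2
```

With identical α values the pooled estimate reproduces that value and Q is zero; with differing values the pooled α lies between the extremes.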
We conducted moderator analyses to examine the effect of several variables on the α coefficients. The studies included in the moderator analyses were coded based on the coding scheme seen in Table S2. There were 10 categorical moderator variables: continent, country, culture, educational level, ethnicity, language, publication type, population, scale version, and study focus. The subcategories of the categorical variables can be seen in Table S2. There were six continuous moderator variables: publication year, female percentage, Caucasian percentage, mean of the age, SD of the age, and mean of the total score (see Tables S1 and S3 for details). We performed moderator analyses for both the categorical and continuous variables using univariate meta-regression models as implemented in the metafor package (Viechtbauer, 2010). The categorical variables were recoded into dichotomous dummy variables for conducting categorical moderator analyses.
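A univariate meta-regression of this kind reduces to weighted least squares on the transformed coefficients, with categorical moderators recoded as 0/1 dummies. The sketch below is a simplified stand-in for metafor's mixed-effects machinery (standard errors and tests omitted); the moderator values and the residual heterogeneity estimate tau2 are hypothetical.

```python
def meta_regression(T, v, x, tau2=0.0):
    """Univariate weighted least squares for T_i = b0 + b1 * x_i
    with weights 1 / (v_i + tau2), mimicking the core of a
    mixed-effects meta-regression on transformed reliability
    coefficients. Returns (b0, b1).
    """
    w = [1 / (vi + tau2) for vi in v]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    mt = sum(wi * ti for wi, ti in zip(w, T)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxt = sum(wi * (xi - mx) * (ti - mt) for wi, xi, ti in zip(w, x, T))
    b1 = sxt / sxx
    b0 = mt - b1 * mx
    return b0, b1

# Recoding a categorical moderator into a 0/1 dummy
# (hypothetical study locations):
continents = ["Asia", "Europe", "Asia", "America"]
is_europe = [1 if c == "Europe" else 0 for c in continents]
```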
Some studies failed to report the values of certain moderator variables. No imputation was performed for those values; they were simply coded as missing. All the analyses were conducted in R with the metafor package.

Reliability induction
In order to evaluate reliability induction by report and reliability induction by omission, we divided the studies into categories based on estimating, reporting, and omitting reliability coefficients. Then, we calculated the reliability induction rates for the K-DOCS. Afterward, the inducing studies and the reporting studies were compared with independent-samples t-tests with respect to the mean of the total score, the means and SDs of the sample age, and the percentages of females and Caucasians.
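The group comparison can be sketched as follows. We use Welch's unequal-variance form as an assumption, since the exact t-test variant is not specified here; the function name is ours.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's independent-samples t statistic and its
    Satterthwaite-approximated degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

Comparing, say, the mean sample ages of inducing versus reporting studies amounts to calling welch_t on the two lists of study-level means and checking the resulting t against the appropriate critical value.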

CHARACTERISTICS OF THE STUDIES
Tables S1 through S4 provide information about the characteristics of the studies (k = 56). Table S1 shows the publication year (between 2012 and 2022), number of items (between 9 and 50), number of respondents (between 34 and 2,318), and α coefficients for the total score (between .69 and .99) as well as the subscale scores (between .64 and .95) for each study. Table S2 presents the following categorical study characteristics: the test version (i.e., original, validated translation, validated short, or mere translation) used in each study, as well as the publication type (i.e., article, thesis, or conference paper), location (i.e., country and continent), population, language, educational level of the sample, ethnicity percentages of the sample, culture (i.e., East or West), and focus (i.e., applied or psychometric) of each study.

PUBLICATION BIAS
The trim-and-fill procedure implied that the Self/Everyday subscale and the Scientific subscale should have nine and four α coefficients imputed, respectively. No imputation was needed for the total scale and the other three subscales. On the other hand, the funnel plots indicated no publication bias. The funnel plots are presented in the supplementary material (see Figures S1-S6).
Tests based on rank correlation and regression analyses also indicated a lack of publication bias. The rank correlation analyses provided nonsignificant Kendall's τ values between the α coefficients and the corresponding standard errors for the total scale as well as the subscales. The Kendall's τ values were not statistically significant for the total scale (τ = −0.220, p = .101) or for the Self/Everyday (τ = −0.195, p = .059), Scholarly (τ = −0.139, p = .182), Performance (τ = −0.011, p = .919), Scientific (τ = −0.139, p = .193), and Artistic (τ = −0.048, p = .657) subscales. Egger's tests of funnel plot symmetry were not statistically significant for the total scale (z = −1.208, p = .238) or for the Scholarly (z = −0.509, p = .613), Performance (z = 0.059, p = .953), and Artistic (z = 0.723, p = .474) subscales. However, statistically significant results were obtained for the Self/Everyday (z = −2.202, p = .033) and Scientific (z = −2.163, p = .031) subscales based on Egger's tests. In sum, we concluded that publication bias did not pose a substantial risk to the reliability of K-DOCS scores.
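Begg and Mazumdar's test is built on Kendall's rank correlation between the effect estimates and their standard errors. The correlation itself can be sketched as below (the tau-a variant, with ties ignored for simplicity; a full implementation would handle ties and compute the associated p-value).

```python
def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return (conc - disc) / (n * (n - 1) / 2)
```

A τ near zero between the α coefficients and their standard errors, as reported above, is what indicates a lack of small-study (publication) bias.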

MEAN AND POOLED RELIABILITY COEFFICIENTS
As shown in Table 1, 28 α coefficients were available for K-DOCS total scores. It should be noted that not all studies reported reliability coefficients for the total scale, as some studies reported α coefficients for only the subscales. The numbers of α coefficients obtained for the subscales were 45 (Self/Everyday), 44 (Scholarly), 41 (Performance), 42 (Scientific), and 42 (Artistic). First, we calculated the mean α value based on the α values reported in the studies (i.e., raw α coefficients) without applying any transformation. The mean α value of the raw coefficients was .887 (SD = 0.066) for the K-DOCS total score. The mean α values calculated for the Self/Everyday (.812, SD = 0.066), Scholarly (.846, SD = 0.067), Performance (.882, SD = 0.029), Scientific (.860, SD = 0.044), and Artistic (.854, SD = 0.049) subscales were also over .800. Second, transformed α values based on Bonett (2002a) were used with random-effects meta-analysis to estimate pooled α coefficients (from this point on, the term pooled α is used). Table 1 presents the pooled α coefficients estimated for the K-DOCS total score as well as the subscale scores. Under the random-effects model, the pooled α value for the K-DOCS total score was estimated to be .904 (95% CI: [.880, .923] and PI: [.686, .971]). The pooled α value was statistically significant for the total score at the p < .001 level. The pooled α values for the Self/Everyday (.825, 95% CI: [.805, .844]), Scholarly (.858), Performance (.887), Scientific (.867), and Artistic (.861) subscale scores were also statistically significant at the p < .001 level.
Note. k is the number of α values; α+ is the mean coefficient α; CI is confidence interval; PI is prediction interval; LL is lower limit of 95% confidence interval for α+; UL is upper limit of 95% confidence interval for α+; Q is heterogeneity statistic, df = (n − 1); τ² is estimated total heterogeneity; I² is heterogeneity index. ***p < .001.

MODERATOR VARIABLES
The Q-test statistics (see Table 1) and I² values (> 90%) denoted considerable heterogeneity among the α coefficients for the K-DOCS total score as well as the subscale scores. The variability detected among the α coefficients can also be seen in the forest plots (see Figures S7-S12). To examine potential sources of this heterogeneity, we carried out moderator analyses. The results of the moderator analyses are displayed in Tables 2-8.
Note. k is the number of α values; LL is the lower limit of the 95% confidence interval; UL is the upper limit of the 95% confidence interval; QB is the between-groups heterogeneity statistic.
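The I² index can be recovered directly from a Q statistic and its degrees of freedom (Higgins et al., 2003); a one-line sketch:

```python
def i_squared(q, df):
    """Higgins' I^2: percentage of variability beyond sampling error,
    floored at zero when Q falls below its degrees of freedom."""
    return max(0.0, (q - df) / q) * 100.0
```

For example, a Q of 270 on 27 degrees of freedom yields I² = 90%, the kind of value that signals the considerable heterogeneity reported here.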

AN RG META-ANALYSIS OF THE K-DOCS
Categorical variables
Regarding the categorical variables, none of the moderator variables significantly impacted the α coefficients for K-DOCS total scores or Self/Everyday subscale scores. In other words, the α coefficients estimated for the subgroups did not differ from one another for K-DOCS total scores and Self/Everyday subscale scores. The findings can be seen in Tables 2 and 3.

Scholarly subscale
For the Scholarly subscale, only educational level was a statistically significant moderator variable (QB = 9.898, df = 3, p < .05). The pooled α value estimated for the samples with graduate students (.893) was higher than the pooled α value calculated for the mixed samples (.890). The pooled α values estimated for these two subgroups were higher than the α values estimated for the other subgroups (undergraduate [.852] and high school [.812]). Table 4 shows the full results for the Scholarly subscale.

Performance subscale
Educational level was also a statistically significant moderator variable for the Performance subscale (QB = 9.924, df = 2, p < .01). The pooled α value for the samples composed of mixed educational levels (.903) was higher than the pooled α values for the samples composed of undergraduate (.883) or high school (.857) respondents. Continent (QB = 12.802, df = 2, p < .01) and ethnicity (QB = 12.811, df = 2, p < .01) were the other statistically significant moderators. The pooled α value for the studies conducted in North America (.896) was higher than the α values for the studies conducted in Asia (.895) or Europe (.858). The pooled α value for the Asian samples (.908) was higher than the α values for the ethnically mixed (.894) and Caucasian (.861) samples. The full results for the Performance subscale are presented in Table 5.

Scientific subscale
Ethnicity was also a statistically significant moderator variable for the Scientific subscale (QB = 8.830, df = 2, p < .01). The pooled α value for the ethnically mixed samples (.880) was higher than the α values for the Asian (.859) and Caucasian (.840) samples. The other statistically significant moderators were continent (QB = 13.409, df = 2, p < .001), country (QB = 8.663, df = 1, p < .01), language (QB = 6.356, df = 1, p < .05), and test version (QB = 9.661, df = 1, p < .05). The pooled α value for the North American samples (.885) was higher than the α values for the Asian (.854) and European (.838) samples. The pooled α value for the U.S. samples (.885) was higher than the α value for the non-U.S. samples (.851). Similarly, the pooled α value for the studies conducted with the English version (.879) was higher than the α value for the studies conducted with the non-English versions (.849). Finally, the pooled α value for the studies conducted with the original version (.881) was higher than the α values for the studies conducted with the other versions (valid translation, .859; mere translation, .855; valid short_1, .850; and valid short_2, .813). The full results for the Scientific subscale are presented in Table 6.

Artistic subscale
For the Artistic subscale, statistical significance was detected for continent (QB = 14.235, df = 2, p < .001), country (QB = 5.826, df = 1, p < .05), and language (QB = 6.579, df = 1, p < .05). The pooled α value for the North American samples (.877) was higher than the α values for the Asian (.867) and European (.822) samples. The pooled α value for the U.S. samples (.877) was higher than the α value for the non-U.S. samples (.847). The pooled α value for the English version (.874) was higher than the α value for the non-English versions (.842). The full results for the Artistic subscale are in Table 7.
Note. k is the number of α values; LL is the lower limit of the 95% confidence interval; UL is the upper limit of the 95% confidence interval; QB is the between-groups heterogeneity statistic. *Indicates a statistically significant p-value.
Continuous variables
With respect to the continuous variables, statistical significance was not detected for any of the variables for K-DOCS total scores and Self/Everyday subscale scores (see Table 8). On the other hand, the mean age of the study sample was found to be a statistically significant predictor of the α values for the Scholarly (p = .019), Performance (p = .001), Scientific (p = .019), and Artistic (p = .046) subscales. The SD of the sample age was a statistically significant moderator for the Scholarly (p = .038), Scientific (p = .010), and Performance (p = .001) subscales. Finally, the percentage of Caucasians and the total mean score were statistically significant moderators for the Performance (p = .024) and Artistic (p = .017) subscales, respectively.

RELIABILITY INDUCTION
Reliability induction was committed in several studies in which the K-DOCS was administered. Twenty-two studies reported reliability coefficients estimated in previous studies and thereby committed reliability induction by report. Most of these studies reported the α coefficients estimated by Kaufman (2012). On the other hand, 20 studies did not mention the reliability of K-DOCS scores at all and committed reliability induction by omission. The induction rates were 39.62% (total induction rate), 20.75% (induction by report), and 18.87% (induction by omission).
We examined the characteristics of the inducing studies, as suggested by Sánchez-Meca et al. (2021). The following variables were investigated for the inducing studies: publication type, country, continent, culture, year of publication, focus of the study, language, scale version, and population type of the sample. The samples were composed of Asian (12.70%), Caucasian (20.60%), or ethnically mixed (28.20%) respondents. Approximately 39% of the studies did not specify the ethnicities of the respondents. Students (56.40%), individuals from the general population (34.90%), or a combination of these two groups (7.70%) comprised the respondents. With respect to education, the respondents were undergraduate (43.60%), high school (2.60%), middle school (2.60%), or mixed-level (25.60%; e.g., undergraduate and graduate students) students. Both the original version (46.20%) and translated versions (33.30%) were used in the inducing studies. Approximately 21% of the studies did not mention the language in which the scale was administered. The K-DOCS was administered in psychometric studies (7.7%) as well as applied studies (92.3%).
We used an independent-samples t-test to compare the inducing and reporting studies on certain continuous variables. Those variables included the mean of the total score, the means and SDs of the sample age, and the percentages of females and Caucasians. As seen in Table S4, no statistically significant differences were found between the inducing and reporting studies for any of the variables.

DISCUSSION
Reliability signifies the consistency and dependability of the measurements an instrument produces. The impact of sample characteristics (e.g., culture, ethnicity, and gender) and instrument characteristics (e.g., full vs. short version) on reliability is inevitable (American Educational Research Association et al., 2014). Nevertheless, it is imperative to emphasize that reliability must remain a non-negotiable attribute of scores on any instrument, regardless of the systematic impact of certain variables in favor of certain subgroups. When certain variables significantly impact the reliability of scores on an instrument, RG analyses should be conducted to identify potential sources of systematic error. In the present study, we conducted RG analyses that focused on the reliability of K-DOCS scores, investigated variables that significantly impact the reliability coefficients across samples, and examined reliability induction.

OVERALL RELIABILITY COEFFICIENTS FOR THE K-DOCS
The RG meta-analyses were employed to answer the first question, which addresses the overall α values. The mean α values estimated in the present study are lower than the raw α coefficients reported by Kaufman (2012). Nevertheless, the mean α values for K-DOCS total scores as well as subscale scores are still over .80 and denote good reliability (O'Rourke et al., 2005). The findings indicate that the error (i.e., random and systematic errors) in K-DOCS scores is relatively low and that the scale provides adequately reliable measures of respondents' creative behaviors. Our findings suggest that the K-DOCS can be used for research purposes, as the mean α coefficients are over .80 (Nunnally & Bernstein, 1994).

IMPACT OF THE MODERATOR VARIABLES
Some variables may significantly affect the reliability of K-DOCS scores across samples and explain a certain amount of systematic error in scale scores. Ten categorical and six continuous variables were analyzed to investigate the sources of variance in the α coefficients and to answer the second research question, which addresses the effect of moderator variables on the mean α coefficients.
Prior to conducting the moderator analyses, we focused on the variation in the reliability of K-DOCS scores. The fluctuations in the α coefficients and CIs implied that the variation in the α values might be substantial. To examine this issue, we ran heterogeneity analyses. The Q-test statistics and I² values estimated in the present study indicate a considerable amount of variation. This finding suggests that some moderator variables significantly impact the reliability of K-DOCS scores (Higgins et al., 2003; Thompson & Vacha-Haase, 2000). Thus, it is neither sufficient nor advisable to merely report the reliability estimates of previous studies. Researchers using the scale should estimate reliability coefficients for each application because, as indicated by the PIs, the reliability coefficients for the total scale and subscales may fall anywhere between .636 and .971 in future assessments.
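Under the random-effects model, an approximate prediction interval is computed on the transformed scale and back-transformed to the α metric. A sketch (our function name; t_crit is the t quantile with k − 2 degrees of freedom, supplied by the caller, and τ² and the standard error come from the fitted model):

```python
import math

def prediction_interval(t_bar, se, tau2, t_crit):
    """Approximate prediction interval for a future study's alpha, computed
    on Bonett's transformed scale (T = ln(1 - alpha)) and back-transformed.
    Because alpha = 1 - exp(T) is decreasing in T, the bounds swap."""
    half = t_crit * math.sqrt(tau2 + se ** 2)
    lo_alpha = 1 - math.exp(t_bar + half)
    hi_alpha = 1 - math.exp(t_bar - half)
    return lo_alpha, hi_alpha
```

Because the interval width depends on τ² and not only on the standard error, a precisely estimated pooled α can still have a wide prediction interval, which is exactly what the [.636, .971] range above reflects.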
Variables with a nonsignificant impact
None of the moderator variables affected the α coefficients for K-DOCS total scores and Self/Everyday subscale scores. Our findings imply that the reliability of total scores and Self/Everyday subscale scores does not significantly change across samples in terms of the moderator variables considered in the present study (Botella et al., 2010). This means that the moderator variables do not contribute any systematic error to K-DOCS total scores and Self/Everyday subscale scores.
Unlike the other subscales, the Self/Everyday subscale does not require training in a particular domain. Consider the following items: "Finding something fun to do when I have no money" and "Choosing the best solution to a problem" (Kaufman, 2012, p. 307). This situation negates the impact of formal and informal training on the Self/Everyday subscale items to a certain extent. The items on this subscale are possibly not as vulnerable as the items on the other subscales to the impact of training originating from certain variables, such as country, ethnicity, and educational level. This point is supported by a study (Kandemir & Kaufman, 2020) in which there were no differences among students from different majors in Self/Everyday subscale scores. This aspect of the Self/Everyday subscale may be a reason why the reliability coefficients for this subscale are not affected by the moderator variables. Note that the impact of training may have affected the reliability of scores on the other subscales to a certain extent.
Another reason for obtaining statistically nonsignificant results for the Self/Everyday subscale may be publication bias. Note that a statistically significant result was obtained for the Self/Everyday subscale based on Egger's test of symmetry. Moreover, we found that the Self/Everyday subscale should have nine α coefficients imputed to it.
However, the Q-test statistics (see Table 4) and I² values (> 90%) imply that some other variables may impact the reliability of total scores and Self/Everyday subscale scores. Specifically, the instructions given at the beginning of the scale draw attention. Respondents are instructed that "[f]or acts that [respondents] have not specifically done, [they need to] estimate [their] creative potential based on [their] performance on similar tasks" (Kaufman, 2012, p. 300). It is possible that respondents' estimations of their creative behaviors based on similar tasks vary, and this situation may lower the stability of responses for the Self/Everyday subscale as well as the other subscales. That is because if an instrument's scoring is based on some type of rating and if the raters (the respondents, in the case of the K-DOCS) are not well informed on the use of the instrument, the reliability of scores varies considerably from one respondent to another (American Educational Research Association et al., 2014).
Four categorical (publication type, culture, population, and focus of the study) and two continuous (publication year and female percentage) variables failed to reach statistical significance with respect to the reliability of K-DOCS total scores and subscale scores. These findings show the absence of systematic error and indicate that the scale does not favor any particular group in terms of item interpretation or response patterns with regard to those variables. The K-DOCS seems to produce equally reliable scores in different types of studies (e.g., published vs. unpublished studies), in different study settings (i.e., exploratory research vs. reliability or validity research), and in different cultures (i.e., Eastern vs. Western cultures).
The literature contains evidence of similarities in creativity between Eastern and Western cultures. For instance, Chen et al. (2002) worked with European American (n = 50) and Chinese (n = 48) university students to compare creativity levels based on drawings of geometric shapes. The products were assessed through the Consensual Assessment Technique. The drawings were found to be similar for both samples regarding creativity level. Guo et al. (2021) examined the measurement invariance of divergent thinking tasks across cultures. University students from China (n = 316) and the United States (n = 302) were the participants. Configural and metric invariance were supported for fluency and originality scores. Consistent with these results, our findings suggest that similarities between Eastern and Western cultures produce equally reliable K-DOCS scores. Therefore, the K-DOCS can be used for cross-cultural research.
It has been 10 years since the K-DOCS was developed. However, the scale still provides reliable estimates of individuals' creativity levels. Based on this finding, we can argue that the K-DOCS can be used for longitudinal studies. Finally, the K-DOCS can be administered regardless of the female percentage in the study. This point is not surprising because measurement invariance across gender was supported for the K-DOCS subscale scores (Brauer et al., 2022; Kapoor et al., 2021). The items seem to work the same for female and male respondents with respect to giving stable responses on the perceptions of one's own creative behavior.
Variables with a significant impact
On the other hand, certain moderator variables significantly impacted the reliability of Scholarly, Performance, Scientific, and Artistic subscale scores. Before discussing those variables further, we want to emphasize that the statistically significant results obtained for those variables may not be due to systematic errors originating from certain variables; rather, the results may be due to unevenly distributed cells or to merging certain subgroups (e.g., different countries) into a larger subgroup (e.g., Eastern countries). Thus, the differences should be interpreted with caution due to the low number of studies for certain moderator variables (e.g., test version and culture).
The North American, English-speaking, and U.S. samples always had the highest α coefficients compared to the coefficients estimated for their counterparts on the Scientific, Performance, and Artistic subscales. These results are not surprising considering that the K-DOCS was developed in the United States. However, we do not attribute the differences regarding these three variables to culture, because culture did not have a statistically significant impact on the reliability of K-DOCS scores in our study. Rather, our findings indicate that the accuracy of K-DOCS scores is reduced during the adaptation of the scale outside the United States. It is likely that some information is lost during the translation process and that this missing information lowers the reliability of K-DOCS scores outside the United States and in non-English-speaking countries.
Caucasian samples produced the lowest α coefficients for the Performance and Scientific subscales. This is an eye-catching finding because the K-DOCS was developed in the United States, where the majority of the population is Caucasian. Nevertheless, the literature contains evidence aligned with this result. For instance, in Paletz and Peng's (2009) study, the α coefficients estimated for the Caucasian participants (between .42 and .80) were lower than the α coefficients estimated for the Asian American participants (between .64 and .85).
There are some possible reasons for obtaining such a result for the Performance subscale. One reason may be related to the content of the items. Consider the following items: "Spontaneously creating lyrics to a rap song" and "Writing a poem" (Kaufman, 2012, p. 307). The literature indicates that rap music is more closely associated with African-American individuals than with Caucasian individuals (Charry, 2012; Elligan, 2000; Keyes, 1996). With respect to creative performance in literature, differences between Caucasians and other ethnicities have been reported previously (Kaufman et al., 2004). Therefore, it is likely that some items in this subscale are more or less difficult for Caucasian respondents and that the difficulty levels of the items lower the variability for samples with high Caucasian percentages (American Educational Research Association et al., 2014). It is possible that samples with mixed ethnicities have higher variability due to varying levels of creative behavior and that this variation leads to more reliable scores.
With respect to the Scientific subscale, we cannot pinpoint any particular item content that leads to low reliability for Caucasian respondents. However, the literature shows some differences among ethnicities regarding engagement with, as well as achievement in, math and science (Lewis et al., 2009; Riegle-Crumb et al., 2011). These differences may underlie our findings. It is also possible that the items in these two subscales are more or less difficult for the Caucasian samples and that the difficulty levels of the items lower the variability (American Educational Research Association et al., 2014). Another reason may be that the Caucasian respondents are more familiar with the content of the items in these two subscales and that this familiarity lowers the variability (American Educational Research Association et al., 2014).
The educational levels of the samples significantly impacted the α coefficients for the Scholarly and Performance subscales. These results are not surprising because creative performance is displayed at higher levels as years of education and age increase (Simonton, 1997). The variability among respondents with higher levels of education seems to produce more consistent scores. Note that the high school samples produced the lowest overall α values. The reason is possibly that the items in these two subscales are more difficult for high school students and that the mismatch between creative ability and item difficulty lowers the variability.
Furthermore, neither the Scholarly nor the Performance subscale is robust against the impact of training. Consider the following items: "Researching a topic using many different types of sources that may not be readily apparent," "Gathering the best possible assortment of articles or papers to support a specific point of view," and "Learning how to play a musical instrument" (Kaufman, 2012, p. 307). Although Kandemir and Kaufman (2020) found no statistically significant differences among majors for Scholarly subscale scores, our findings suggest that individuals exposed to education in a specific domain interpret the items more consistently and provide more stable responses on both the Scholarly and Performance subscales.
Finally, the test version had an impact on the α coefficients for the Scientific subscale. As expected, the original version provided the highest α coefficients (American Educational Research Association et al., 2014). Validated translations of the full version provided the second-highest coefficients. Interestingly, the 20-item version provided more reliable results than the 42-item version. However, the 20-item version was administered in English, while the 42-item version was administered in Turkish. It is likely that the accuracy of the Scientific subscale items was reduced during the Turkish adaptation, even though the Turkish version included more items; note that the 20-item version has four such items and the 42-item version has 10. This finding makes more sense considering that English-speaking samples produced higher reliability coefficients than non-English-speaking samples. Nevertheless, it should be kept in mind that the numbers of studies that administered the 20-item and 42-item versions were quite small. If there were more studies on these versions, the results might be different.
In terms of the continuous variables, the percentage of Caucasians had a negative impact on the α coefficients for the Performance subscale. This finding is aligned with the finding obtained for ethnicity. Note that the Caucasian samples produced the least reliable scores for the Performance subscale.
As the mean and SD of the pooled sample age increased, so did the α coefficients for the Scholarly, Performance, Scientific, and Artistic subscales. These results are not surprising considering that cognitive development and creative performance are expected to increase with age (Groslambert & Mahon, 2006; Piaget & Inhelder, 1966; Simonton, 1997). Note that this finding is aligned with the findings obtained for educational level.

RELIABILITY INDUCTION
In order to answer the third research question, which addresses reliability induction, we estimated the reliability induction rates and compared the inducing studies with the reporting studies on certain variables. The induction rate estimated in our study (39.62%) is lower than the induction rate reported by Sánchez-Meca et al. (2021; 78.6%). Nevertheless, our findings show that considerable data are still unavailable for estimating the overall α coefficients of the K-DOCS and examining the effects of the moderator variables. This situation poses a substantial risk because when several studies commit reliability induction, the pooled reliability coefficient is estimated with a large amount of error (Sánchez-Meca et al., 2021). This should be kept in mind when interpreting our findings.
Furthermore, reliability induction may impact the refinement of the K-DOCS, because reliability induction may have led to an overly optimistic view of the reliability of K-DOCS scores. When information on reliability is missing and reliability is treated as a property that remains constant throughout all administrations of an instrument, researchers miss the opportunity to identify and address potential sources of error in scores on the instrument (Sánchez-Meca et al., 2021).
It is noteworthy that our study found no statistically significant differences between the inducing and reporting studies on several characteristics, indicating that reliability induction does not threaten the validity of our findings (Sánchez-Meca et al., 2021). However, this should not diminish the importance of reporting reliability, as the reliability of scores on an instrument varies across samples (American Educational Research Association et al., 2014). Therefore, it is imperative that researchers using the K-DOCS estimate and report reliability coefficients for each application. This practice is vital not only for the K-DOCS's credibility but also for enhancing the reliability and validity of scale scores.

CONCLUSIONS, LIMITATIONS, AND FUTURE DIRECTIONS
The present study is the first to examine the overall reliability coefficients, the variability of the reliability of K-DOCS scores, and reliability induction for the K-DOCS. Our findings show that the K-DOCS produces highly reliable total scores and subscale scores. The reliability of scores on the K-DOCS seems to be resistant to several variables, including publication type, culture, population, focus of the study, publication year, and female percentage. However, a few other variables (e.g., ethnicity, test version, and mean age) may affect the reliability of certain subscale scores. Researchers and practitioners should be cautious about the effects of those variables when using the K-DOCS and focusing on the Scholarly, Performance, Scientific, and Artistic subscales.
This study has some limitations. The first is the type of reliability coefficient considered for the analyses. We conducted the RG analyses with the α coefficients because we could not conduct RG analyses for test-retest reliability and omega coefficients due to the small number of such coefficients. The second limitation is the number of included studies. This limitation is a result of both reliability induction and the language criterion; note that only studies reported in English were kept. Another limitation is the missing data for certain moderator variables; note that some studies failed to report the values of certain variables. Finally, creating subgroups for each categorical variable is also a limitation, because certain information may have been lost when the subgroups were generated.
Future RG meta-analysis studies are needed to examine the K-DOCS. Future studies should be conducted with other transformation methods and multivariate models. Additionally, after new studies are published, RG meta-analyses should be conducted to investigate the impact of other moderator variables. Moreover, when the number of studies on each short version of the scale is sufficient, the 20-item and 42-item versions should be examined separately. Finally, when there are a sufficient number of studies that report a reliability coefficient other than α, that type of reliability coefficient should be examined through the RG approach.
At this point, we want to emphasize that reporting omega coefficients would be better practice for the K-DOCS in future studies. This is because the K-DOCS items are possibly not tau-equivalent, the response options are on an ordinal scale, and the scale is multidimensional. Note that the omega coefficient provides better estimates when tau-equivalence is violated (Deng & Chan, 2017). Furthermore, there is a version of omega for estimating the reliability of multidimensional instruments (Şimşek & Noyan, 2013), and the omega coefficient yields more precise reliability estimates when the data are on an ordinal scale (McNeish, 2018).
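The difference between alpha and omega can be illustrated with the standard formula for McDonald's omega under a single-factor (congeneric) model, which weights items by their factor loadings rather than assuming equal loadings (tau-equivalence). The toy loadings below are illustrative only; they are not K-DOCS estimates.

```python
def mcdonald_omega(loadings, uniquenesses):
    """McDonald's omega for a single-factor (congeneric) model:
    omega = (sum of loadings)^2 / ((sum of loadings)^2 + sum of uniquenesses).
    Unlike Cronbach's alpha, omega does not assume tau-equivalence
    (i.e., equal factor loadings across items)."""
    num = sum(loadings) ** 2
    return num / (num + sum(uniquenesses))

# Toy example with unequal standardized loadings (tau-equivalence violated)
loadings = [0.8, 0.7, 0.6, 0.5]
uniquenesses = [1 - l ** 2 for l in loadings]  # standardized items
print(round(mcdonald_omega(loadings, uniquenesses), 3))
```

For multidimensional instruments such as the K-DOCS, hierarchical variants of omega extend this logic to a general factor plus group factors; in applied work these are usually obtained from software such as the R psych package rather than computed by hand.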

Figure S7. Forest plot displaying the alpha coefficients (and 95% confidence intervals) for the K-DOCS Total score.

Figure S8. Forest plot displaying the alpha coefficients (and 95% confidence intervals) for the Self/Everyday subscale.

Figure S9. Forest plot displaying the alpha coefficients (and 95% confidence intervals) for the Scholarly subscale.

Figure S10. Forest plot displaying the alpha coefficients (and 95% confidence intervals) for the Performance subscale.

Figure S11. Forest plot displaying the alpha coefficients (and 95% confidence intervals) for the Mechanical/Scientific subscale.

Figure S12. Forest plot displaying the alpha coefficients (and 95% confidence intervals) for the Artistic subscale.

TABLE 2. Results of the categorical moderator analyses for the total scale

TABLE 3. Results of the categorical moderator analyses for the Self/Everyday subscale

TABLE 4. Results of the categorical moderator analyses for the Scholarly subscale

TABLE 5. Results of the categorical moderator analyses for the Performance subscale. Note. k is the number of α values; LL is the lower limit of the 95% confidence interval; UL is the upper limit of the 95% confidence interval; Q_B is the heterogeneity statistic. *Indicates significant p values.

TABLE 6. Results of the categorical moderator analyses for the Mechanical/Scientific subscale

TABLE 7. Results of the categorical moderator analyses for the Artistic subscale. Note. k is the number of α values; LL is the lower limit of the 95% confidence interval; UL is the upper limit of the 95% confidence interval; Q_B is the heterogeneity statistic. *Indicates significant p values.

TABLE 8. Results of the continuous moderator analyses for the total scale and subscales. Note. k is the number of α coefficients; b_j is the unstandardized regression coefficient; SE is the standard error; R² is the proportion of variance explained; Q_E is the statistic testing residual heterogeneity. *Indicates significant p values. ***p < .001.