Systematic review of measurement properties of methods for objectively assessing masticatory performance

Abstract The objective of this study was to identify methods for objectively assessing masticatory performance (MP) and to evaluate their measurement properties. A secondary objective was to identify any reported adverse events associated with the methods to assess MP. Bibliographic databases were searched, including MEDLINE, Embase, Web of Science Core Collection, Cochrane, and CINAHL. Eligible papers that satisfied predefined inclusion and exclusion criteria were appraised independently by two investigators. Four other investigators independently appraised any measurement properties of the assessment method according to the consensus‐based standards for the selection of health measurement instruments (COSMIN) checklist. The qualities of the measurement properties were evaluated using predefined criteria. The level of evidence was rated by using data synthesis for each MP assessment method, where the rating was a product of methodological quality and measurement properties quality. All studies were first quality assessed separately and subsequently for each method. Studies that described the use of an identical assessment method received an individual score, and the pooled sum score resulted in an overall evidence synthesis. The level of evidence was synthesized across studies with an overall conclusion, that is, unknown, conflicting, limited, moderate, or strong evidence. Forty‐six out of 9,908 articles were appraised, and the assessment methods were categorized as comminution (n = 21), mixing ability (n = 23), or other methods (n = 2). Different measurement properties were identified, in decreasing order: construct validity (n = 30), reliability (n = 22), measurement error (n = 9), criterion validity (n = 6), and responsiveness (n = 4). No adverse events associated with any assessment methods were reported.
In a clinical setting or as a diagnostic method, there are no gold standard methods for assessing MP with a strong level of evidence for all measurement properties. All available assessment methods with variable level of evidence require lab‐intensive equipment, such as sieves or digital image software. Clinical trials with sufficient sample size, to infer trueness and precision, are needed for evaluating diagnostic values of available methods for assessing masticatory performance.


| INTRODUCTION
A primary goal of dental treatment is to restore dental and oral function, including the ability to masticate food. Masticatory performance is defined as the ability to comminute or mix test food (van der Bilt, 2011). The most common method for assessing masticatory performance is a comminution method using a sieve. Test food is masticated, and the food particles are then separated using sieves with varying aperture sizes; the smaller the particle size, the better the masticatory performance. Dahlberg and Manly were among the first to introduce the sieve method (Dahlberg, 1942; Manly & Braley, 1950). They used test foods such as peanuts and carrots; silicone-based materials were introduced later.
Degree of mixing, measured by degree of color change, is assessed subjectively with a color scale or objectively with a colorimeter/scanner and digital software. Bolus shape is assessed with a bolus scale (Schimmel, Christou, Herrmann, & Muller, 2007;Wada et al., 2017).
To our knowledge, the measurement properties of the many different methods for assessing masticatory performance have never been critically appraised and reported. The objective of this systematic review is to identify studies that describe measurement properties of one or more methods intended to objectively assess masticatory performance and to establish their methodological quality by using a validated appraisal tool. Consequently, our systematic review intended to:
• Identify methods for objectively assessing masticatory performance;
• Evaluate measurement properties of the identified methods;
• Compare measurement properties of the identified methods;
• Identify adverse events during development or validation of methods that were studied.

| Design
This systematic review is reported as per PRISMA guidelines (Moher et al., 2015). The protocol was published and registered in the PROSPERO database (Ref: CRD42016037700; Elgestad Stjernfeldt, Wardh, Trulsson, Faxen Irving, & Bostrom, 2017). The original protocol was modified in two ways. First, the original aim, that is, "To evaluate psychometric properties (such as validity and reliability) of the identified methods" (Elgestad Stjernfeldt et al., 2017), was changed to "To evaluate measurement properties of the identified methods." The rationale was to clarify that the review intends to evaluate measurement properties and not specifically psychometric methods. Second, the original protocol stated "… describes development of a method that objectively assesses clinical masticatory performance or evaluates measurement properties," which was changed to "… describes a method that objectively assesses clinical masticatory performance and evaluates measurement properties in adults." The changes were made because the study's aim was to evaluate measurement properties of various methods, rather than briefly describing them.

| Information sources and literature search strategy
During each review phase, regular team meetings were held to discuss criteria. Several abstracts and articles were pilot-tested to ensure agreement. Discussion and consensus resolved disagreements among reviewers.

| Methodological quality assessment
The methodological quality of included studies was evaluated using a modular checklist, that is, Consensus-based Standards for the selection of health Measurement INstruments (COSMIN; Terwee et al., 2012). COSMIN contains 12 boxes that are used to assess methodological quality of studies of measurement properties. Four domains are specified in COSMIN: validity, reliability, responsiveness, and interpretability, with related measurement properties and their characteristics. For each measurement property, COSMIN contains 5 to 18 items that cover methodological standards. Each item is rated on a four-point scale (i.e., poor, fair, good, and excellent; Terwee et al., 2012). An overall score is generated separately for each measurement property by taking the lowest rating of any item in the corresponding box. A study is thus rated as poor, fair, good, or excellent regarding methodological quality for each of the assessed measurement properties.
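The lowest-rating rule described above can be illustrated with a minimal sketch. The names `Rating` and `box_score` and the example item ratings are hypothetical and are not taken from the COSMIN checklist or from the included studies; the sketch only assumes that the four rating levels are ordered from poor to excellent.

```python
from enum import IntEnum


class Rating(IntEnum):
    """Four-point COSMIN item scale, ordered from worst to best."""
    POOR = 0
    FAIR = 1
    GOOD = 2
    EXCELLENT = 3


def box_score(item_ratings):
    """Overall rating for one box: the lowest rating given to any item."""
    ratings = list(item_ratings)
    if not ratings:
        raise ValueError("a box needs at least one rated item")
    return min(ratings)


# Hypothetical item ratings for the reliability box of a single study.
reliability_items = [Rating.EXCELLENT, Rating.GOOD, Rating.FAIR, Rating.GOOD]
print(box_score(reliability_items).name)  # FAIR
```

A single fair item is enough to cap the whole box at fair, which is why one weakly reported design element can dominate a study's methodological rating.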

| Definitions
The COSMIN panel defines validity as "the degree to which an instrument truly measures the construct(s) it purports to measure" (HCWd, Terwee, Mokkink, & Knol, 2015; Mokkink et al., 2010). Criterion validity indicates the degree to which a measurement instrument's scores adequately reflect another method or instrument that is considered a gold standard. Criterion validity can only be assessed when a gold standard is available (HCWd et al., 2015; Mokkink et al., 2010). Construct validity is defined as "the degree to which the scores of an instrument are consistent with hypotheses." Validation requires the formulation of specific hypotheses to acquire evidence that the instrument is measuring what it claims to measure (HCWd et al., 2015; Mokkink et al., 2010). Responsiveness is defined as "the ability of an instrument to detect change over time in the construct to be measured" (Mokkink et al., 2010). Reliability is defined as "the degree to which the measurement is free from measurement error" (Mokkink et al., 2010). Measurement error is defined as "the systematic and random error of a patient's score that is not attributed to true changes in the construct to be measured" (Mokkink et al., 2010).

| Measurement properties quality
The qualities of measurement properties were established according to criteria developed by Terwee and colleagues (Terwee et al., 2007; Table 1). According to this framework, measurement properties are rated as positive, negative, or indeterminate. In the current systematic review, one reviewer rated all measurement properties, and the review team confirmed the ratings.

| Evidence levels
Data synthesis for each method for assessing masticatory performance combined the methodological quality of included studies with the measurement properties quality (Table 1). First, all studies were quality assessed separately, and then for each method. Studies that evaluated the same method were given an individual score, and the results were then pooled in an overall evidence synthesis. The level of evidence was synthesized across the studies with an overall conclusion, namely, unknown, conflicting, limited, moderate, or strong level of evidence. Table 1 describes rating criteria.
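As an illustration of how methodological quality and measurement property ratings might be combined into one level of evidence, the following sketch applies a set of rules that are an assumption here, patterned on commonly used best-evidence criteria; the actual criteria of this review are those in Table 1 and may differ. The function name `synthesize` and the example inputs are hypothetical.

```python
def synthesize(studies):
    """Combine (quality, rating) pairs for one method and one property.

    quality: "poor", "fair", "good", or "excellent"
    rating:  "+" (positive), "-" (negative), or "?" (indeterminate)

    Assumed rules (not taken from Table 1):
      unknown     - only poor-quality studies or indeterminate ratings
      conflicting - usable studies disagree in direction
      strong      - one excellent study, or at least two good studies
      moderate    - one good study, or at least two fair studies
      limited     - a single fair study
    """
    usable = [(q, r) for q, r in studies if q != "poor" and r in ("+", "-")]
    if not usable:
        return "unknown"
    if len({r for _, r in usable}) > 1:
        return "conflicting"
    qualities = [q for q, _ in usable]
    n_good = sum(q in ("good", "excellent") for q in qualities)
    if "excellent" in qualities or n_good >= 2:
        return "strong"
    if n_good == 1 or qualities.count("fair") >= 2:
        return "moderate"
    return "limited"


# Hypothetical inputs: two fair-quality studies, both rated positive.
print(synthesize([("fair", "+"), ("fair", "+")]))  # moderate
```

Under these assumed rules, a poor-quality study never raises the level of evidence, and a single disagreement between usable studies yields a conflicting conclusion regardless of study quality.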

| RESULTS
The PRISMA diagram (Figure 1) illustrates the inclusion process of articles. The present systematic review included 46 articles that represent 46 studies (Table 2). The updated literature search in December 2017 yielded no new eligible articles. Data S1 lists all excluded full-text articles. Data S2 lists non-English studies that were identified during screening of references, but were not included.
Table 1 (excerpt). Quality criteria for measurement properties (+ positive, ? indeterminate, − negative rating)
Construct validity (continued)
? Solely correlations determined with unrelated constructs
− Correlation <0.50 with an instrument measuring the same construct, or <75% of the results in accordance with the hypotheses, or correlation with related constructs is lower than with unrelated constructs
Content validity
+ The target population considers all items in the questionnaire to be relevant, or considers the questionnaire to be complete
? No target population involvement
− The target population considers all items in the questionnaire to be irrelevant, or considers the questionnaire to be incomplete
Criterion validity
+ Convincing arguments that gold standard is "gold" and correlation with gold standard ≥ 0.70
Table 2 (excerpt). Characteristics of included studies
… for each sieve/total particle wgt. collected from all sieves (%).
Felicio et al. 2008. Evaluation of encapsulated fuchsine beads as a method to assess MP. Measurement properties: validity (hypotheses testing), reliability. Participants: n = 19 (9 M, 10 F), age 18-28. Method: Comminution. Test food: Capsules containing fuchsine beads. Amount of …
… linear relationship between CS and particle wgt. and numbers on each sieve.
Only studies with methodological quality rated as fair, good, or excellent are reported in the results section. Studies rated as poor are described in Table S1.

| Comminution methods
Comminution methods include all methods during which test food is comminuted into smaller particles, and particle sizes/volumes are assessed. Smaller particle sizes would indicate a better masticatory performance.
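To make the idea of assessing particle sizes with multiple sieves concrete, the sketch below computes the weight percentage retained on each sieve and a median particle size obtained by linear interpolation on the cumulative weight distribution. This summary statistic is an assumption chosen for illustration and is not necessarily the one used in the included studies; the function names, aperture sizes, and weights are hypothetical.

```python
def weight_percent(retained_g):
    """Weight retained on each sieve as a percentage of the total weight."""
    total = sum(retained_g)
    if total <= 0:
        raise ValueError("total particle weight must be positive")
    return [100.0 * w / total for w in retained_g]


def median_particle_size(apertures_mm, retained_g):
    """Aperture through which 50% of the particle weight would pass.

    The collecting pan is entered as aperture 0. Material retained on a
    sieve is assumed to be larger than that sieve's aperture, so the
    percentage passing an aperture is the weight found on all finer
    sieves (and the pan) divided by the total weight.
    """
    pairs = sorted(zip(apertures_mm, retained_g))
    total = sum(w for _, w in pairs)
    if total <= 0:
        raise ValueError("total particle weight must be positive")
    passing = []
    cumulative = 0.0
    for aperture, weight in pairs:
        passing.append((aperture, 100.0 * cumulative / total))
        cumulative += weight
    # Linear interpolation between the two apertures that bracket 50%.
    for (a_lo, p_lo), (a_hi, p_hi) in zip(passing, passing[1:]):
        if p_lo <= 50.0 <= p_hi and p_hi > p_lo:
            return a_lo + (50.0 - p_lo) / (p_hi - p_lo) * (a_hi - a_lo)
    raise ValueError("median lies above the coarsest sieve")


# Hypothetical sieve stack: pan (0 mm), 1 mm, 2 mm, and 4 mm apertures.
apertures = [0.0, 1.0, 2.0, 4.0]
retained = [1.0, 2.0, 4.0, 3.0]  # grams retained on each level
print(weight_percent(retained))
print(median_particle_size(apertures, retained))
```

A smaller median particle size after a fixed number of chewing cycles would then correspond to better masticatory performance, in line with the interpretation given above.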
Definitions: Comminution methods fall into four categories:
• Sieve or optical scanning methods that assess fragmentation and particle-size distribution with either single or multiple sieves or through some type of optical scanning and digital image analysis.
• Gummy jelly (GJ) methods that involve measuring glucose released from chewed GJ; the amount of released glucose is associated with the degree to which test food is fragmented and hence with masticatory performance.
• Fuchsin beads methods that use encapsulated fuchsin beads as test food to assess masticatory performance; fuchsin dye is released into the capsule when the beads are chewed, and the concentration of released dye, which is proportional to masticatory performance, is quantified with a spectrophotometer.
• Colorimetric methods that assess test food fragmentation through release or binding of dye from a solution; dye concentration, which is proportional to masticatory performance, is assessed with a spectrophotometer.
One fair-quality study (Mahmood et al., 1992) (Ohara et al., 2003). Another fair-quality study reported reliability of a method using carrots as test food and analyzing particle size with a single sieve [positive rating] (Kapur et al., 1964).

| Sieve and optical scanning methods
One fair-quality study reported measurement error; the study used

| Gummy jelly
Two studies, one of fair and one of good quality, evaluated construct validity of a GJ as a test food. The studies assessed masticatory performance using a glucose meter (Ikebe et al., 2005) or a visual scale (Nokubi et al., 2013), respectively [both positive rating].
One good-quality study reported reliability of a visual scale that was used with a GJ as test food [positive rating] (Nokubi et al., 2013).
No studies reported on responsiveness or measurement error.

| Fuchsin beads
No studies of fair, good, or excellent quality reported on validity. One good-quality study (Sanchez-Ayala et al., 2016)

| Colorimetric methods
No studies of fair, good, or excellent quality reported on any of the measurement properties.

| Mixing ability methods
For assessing masticatory performance, mixing ability methods involve two-color gum or wax (as test food) and color-changeable gum. The included studies described assessment of various digital analysis software apps and subjective color or bolus scales (Table S1).

| Two-color gum
An excellent-quality study reported construct validity of two-color gum analyzed using MATLAB 2015b [positive rating] (Vaccaro et al., 2016). Two fair-quality studies reported construct validity regarding use of ViewGum© for assessing masticatory performance with various types of two-color gums [positive rating] (Halazonetis et al., 2013; Schimmel et al., 2015). One study of fair methodological quality using Adobe Photoshop CS2 reported conflicting research findings based on the age of the study participants, that is, negative findings were noted for young participants and positive findings for elderly participants.
Several studies have attempted to establish the reliability of visual color or bolus scales that are used to assess masticatory performance with two-color gums. One good-quality study (Silva et al., 2018) reported that a two-color gum visual scale enables reliable masticatory performance assessment as per visual and electronic colorimetric analyses [positive rating]. One fair-quality study (Schimmel et al., 2015) assessed the same visual scale [positive rating]. A fair-quality study (Endo et al., 2014)

| Two-color wax
One fair-quality study (Sugiura et al., 2009) and one good-quality study (Ikebe et al., 2010) reported construct validity when a two-color wax was used in combination with a mixing ability test (Sugiura et al., 2009) or index (Ikebe et al., 2010) [positive rating for both]. One good-quality study (Speksnijder et al., 2009) (Asakawa et al., 2005) reported responsiveness of a two-color wax (Asakawa et al., 2005)

| Color-changing gum
Reliability was reported in two studies using the same gum. Both methods rate the color change of the gum using two different color scales [positive rating] (Hama et al., 2014b; Kamiyama et al., 2010). Both studies were of fair quality. No studies of fair, good, or excellent quality reported on measurement error.

| Best evidence synthesis
The level of evidence is based on combining the studies' methodological quality and measurement properties rating (Table 3).

| GJ methods
Limited/unknown level of evidence was reported for criterion validity or construct validity (Ikebe et al., 2005; Nokubi et al., 2013). Unknown level of evidence was reported regarding construct validity when using a glucose meter. Moderate level of evidence was reported regarding reliability of a 1- to 10-point visual scale that was used with GJ test food (Nokubi et al., 2013).

| Colorimetric methods
Two studies reported unknown level of evidence regarding construct validity (Gunne, 1985; Huggare, 1997). One study reported unknown level of evidence for reliability (Huggare, 1997).
Two studies reported moderate level of evidence for construct validity when using two-color wax and a mixing ability index to assess masticatory performance in fully dentate or partially edentulous persons (Sato et al., 2003; Sugiura et al., 2009). Moderate level of evidence for construct validity was also reported in one study that used a two-color, blue-red wax and digital image software to analyze the standard of intensity of distribution (Speksnijder et al., 2009) in dentate persons or in persons with dentures, overdentures, or full dentures. Yet another study reported moderate level of evidence for construct validity regarding two-color wax (van der Bilt et al., 2012).
Unknown level of evidence was reported for assessment of a two-color mixture of a food bolus using videoendoscopy (Abe et al., 2011).
Only three studies reported limited/unknown level of evidence for all mixing ability methods (Asakawa et al., 2005;Ishikawa et al., 2007;Wada et al., 2017).
Moderate level of evidence was reported for reliability of a visual color scale and a bolus scale used to assess mixing ability and masticatory performance (Silva et al., 2018). Limited/unknown level of evidence was reported for all other types of mixing ability methods, regardless of whether the method involved optical scanning/photography and digital image analysis or subjective assessment using visual scales (Endo et al., 2014; Hama et al., 2014b; Kamiyama et al., 2010; Liedberg & Owall, 1995; Sato et al., 2003; Schimmel et al., 2007; Schimmel et al., 2015; van der Bilt et al., 2012; Weijenberg et al., 2013). Seven studies reported unknown level of evidence for measurement error (Endo et al., 2014; Halazonetis et al., 2013; Matsui et al., 1996; Prinz, 1999; Schimmel et al., 2015; Speksnijder et al., 2009; Sugiura et al., 2009).

| Other methods

| Eichner index and odor sensor device
Limited/unknown level of evidence was reported for construct validity regarding two different methods for assessing masticatory performance: Eichner index (Ikebe et al., 2010) and an odor sensor device (Goto et al., 2016). Unknown level of evidence was also reported for measurement error for the odor sensor device (Goto et al., 2016).
To summarize, the studies reporting methods using two-color chewing gums and digital analysis revealed moderate to strong level of evidence for construct validity (Halazonetis et al., 2013; Schimmel et al., 2015; Vaccaro et al., 2016), and moderate level of evidence for reliability using a visual scale (Silva et al., 2018). Moderate level of evidence was also reported for construct validity using two-colored wax (Speksnijder et al., 2009; Sugiura et al., 2009; van der Bilt et al., 2012).

| DISCUSSION
The present systematic review investigated 46 studies that reported measurement properties of methods for assessing masticatory performance. These studies included persons aged ≥18 years, with varying dentitions and tooth replacements. No study reported findings associated with all four measurement properties. The present systematic review found that for:
• Construct validity, moderate-to-strong levels of evidence were reported for two-color gum or wax via digital software analyses.
Limited level of evidence was reported regarding comminution, GJ, and fuchsine beads.
• Reliability, moderate level of evidence was reported regarding a visual scale in a clinical setting with two-color chewing gum as test food. Moderate-to-strong level of evidence was reported for (a) silicone cubes and particle analysis with sieves for the comminution method and (b) a visual scale with the GJ.
Three reviews have addressed masticatory efficiency, performance, and function (Boretti, Bickel, & Geering, 1995;Oliveira et al., 2014;Tarkowska, Katzer, & Ahlers, 2017). However, these reviews have not attempted to identify specifically studies that use methods for objectively assessing masticatory performance or evaluated the measurement properties of methods for assessing masticatory performance with a validated appraisal tool such as COSMIN.
Our findings corroborate the conclusion in one of these reviews that a two-color chewing gum method is valid and reliable and can be used in different populations (Tarkowska et al., 2017). However, one of the other reviews considered the comminution/sieve method to be the gold standard when assessing masticatory performance in denture wearers (Oliveira et al., 2014). Finally, one older review from 1995 emphasized a sociopsychologic approach rather than a biomedical one. Thus, assessment of patients' subjective masticatory ability is stressed in contrast to masticatory performance, especially for patients using dentures (Boretti et al., 1995).
Moderate level of evidence was reported for reliability of a 1- to 10-point visual scale (used with GJ test food; Nokubi et al., 2013). This method seems to be best suited for clinical settings.
One study compared two test foods and methods: (a) fuchsine beads and ultraviolet-visible spectrophotometry and (b) silicone cubes and multiple sieving as the gold standard (Sanchez-Ayala et al., 2016).
The study reported moderate but negative level of evidence for criterion validity in a younger study population where the sieve method, with Optosil Comfort® as test food, was used as gold standard. Here also, the methods require lab equipment.

| Mixing ability method
Regarding construct validity, six studies reported moderate level of evidence (Halazonetis et al., 2013;Sato et al., 2003;Schimmel et al., 2015;Speksnijder et al., 2009;Sugiura et al., 2009;van der Bilt et al., 2012), and one study reported strong level of evidence (Vaccaro et al., 2016). Regarding reliability, moderate level of evidence was reported in one study that used a visual bolus/color scale (Silva et al., 2018).
There seems to be evidence for construct validity and reliability for two-color gum and wax used in populations with (a) complete or compromised dentitions and (b) complete or implant-supported dentures.
That said, the method mostly requires optical and image processing.
A visual bolus/color scale is probably useful in a clinical setting.
The next section addresses measurement properties of methods for assessing masticatory performance.

| Measurement properties
The studies reported two types of validity: construct and criterion validity.
Construct validity is often tested with predefined hypotheses, but many studies reported vague or no specific hypotheses. Hypotheses often concern the relationship between the scores of the instrument and scores of other instruments that measure similar or dissimilar constructs (convergent and discriminant validity), or differences between subgroups of patients. Similar constructs, in this case, often included bite force, other methods for assessing masticatory performance, electromyography activity, and chewing cycles. The studies categorized participants into age groups, dentition groups, and/or prosthetic treatment groups.
Hypotheses should state the magnitude and direction of measurement scores. This is a problem because no quantifiable criteria or defined distinctions exist that would allow discrimination between different functional levels of masticatory performance. That said, efforts have been made to develop such a universal indicator (Woda et al., 2010). The following questions are raised: What food particle size or color mixture should a masticatory performance test be able to discriminate? What magnitude of difference would be clinically relevant (i.e., minimal important changes) for patients? What is necessary for a method to be considered better than another?
Some methods were assessed for criterion validity, namely, the degree to which the score of the tested instrument correlates with a gold standard that measures the same construct. Studies that evaluated criterion validity used the comminution and sieve method as a gold standard (Eberhard et al., 2012; Eberhard et al., 2015; Kobayashi et al., 2006; Sanchez-Ayala et al., 2016) or a colorimeter when assessing color mixture (Hama et al., 2014b; Kamiyama et al., 2010). However, criterion validity could be questionable because comminution and mixing ability methods may not measure the same characteristics of the masticatory process.
Only four studies reported on responsiveness. These provided limited/unknown level of evidence because of low sample size (Fauzza & Lyons, 2008; Ishikawa et al., 2007; Mahmood et al., 1992), vaguely formulated hypotheses (Asakawa et al., 2005), and insufficient clarity regarding whether a change occurred among the study participants (Asakawa et al., 2005; Fauzza & Lyons, 2008). The low level of evidence for responsiveness is a problem because adequate methods are needed for assessing the effects of interventions for enhancing masticatory performance, particularly in the aging population. Studies have revealed a possible association between good nutritional status and oral health regarding dental condition in the elderly (Van Lancker et al., 2012).
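The correlation-based judgment of criterion validity can be sketched as follows. The threshold of 0.70 for a positive rating comes from Table 1; the handling of the indeterminate and negative cases is an assumption made for illustration, as are the function names and the paired scores, which do not come from any included study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / sqrt(sxx * syy)


def rate_criterion_validity(r, gold_standard_justified):
    """Rate criterion validity as "+", "?", or "-".

    Assumed logic: without convincing arguments that the comparator is a
    true gold standard the rating is indeterminate; otherwise r >= 0.70 is
    positive and anything lower is negative.
    """
    if not gold_standard_justified:
        return "?"
    return "+" if r >= 0.70 else "-"


# Hypothetical paired scores: new method versus a sieve-based comparator.
new_method = [1.0, 2.0, 3.0, 4.0, 5.0]
sieve_method = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(new_method, sieve_method)
print(round(r, 3), rate_criterion_validity(r, gold_standard_justified=True))
```

The second argument makes explicit the concern raised above: a high correlation supports criterion validity only if the comparator can credibly be regarded as a gold standard for the same construct.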

| Reliability and measurement error
Reliability indicates the degree to which an instrument can distinguish patients from each other, while measurement error addresses magnitude of measurement error (HCWd et al., 2015). Reliability is an important factor if the instrument is to distinguish between poor, mediocre, and good masticatory performance, while quantification of measurement error is important to discern if a change in score is real or caused by measurement error (de Vet, Terwee, Knol, & Bouter, 2006;HCWd et al., 2015). Although measurement error is an important parameter for assessments, it is clear from this review that reliability is the preferred measurement property to assess. Five studies assessed measurement error but none defined minimal important changes or smallest detectable change. Measurement error can be derived from the intraclass correlation coefficient formula, but this was usually not reported.
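One widely used way to derive measurement error from a reported intraclass correlation coefficient (ICC) is through the standard error of measurement, SEM = SD × √(1 − ICC), from which a smallest detectable change at the 95% level can be obtained as SDC = 1.96 × √2 × SEM. The sketch below applies these two formulas; it assumes that the SD is the between-subject standard deviation of the scores to which the ICC refers, and the input values are hypothetical rather than taken from any included study.

```python
from math import sqrt


def standard_error_of_measurement(sd, icc):
    """SEM = SD * sqrt(1 - ICC), with the ICC expected to lie in [0, 1]."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("ICC is expected to lie between 0 and 1")
    if sd < 0:
        raise ValueError("standard deviation cannot be negative")
    return sd * sqrt(1.0 - icc)


def smallest_detectable_change(sem, z=1.96):
    """SDC = z * sqrt(2) * SEM; sqrt(2) reflects two measurements per person."""
    return z * sqrt(2.0) * sem


# Hypothetical values: SD of 10 score units and a reported ICC of 0.91.
sem = standard_error_of_measurement(sd=10.0, icc=0.91)
sdc = smallest_detectable_change(sem)
print(round(sem, 2), round(sdc, 2))  # roughly 3.0 and 8.32
```

In this hypothetical case, a change in an individual's score smaller than about 8.3 units could not be distinguished from measurement error, even though an ICC of 0.91 would usually be regarded as high reliability.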

| METHODOLOGICAL CONSIDERATIONS
The publication period of the included studies ranged from 1964 to 2018. Articles published during the latter third of this period, especially during 2010-2018, tended to report study design and methodology (e.g., choice of included statistical models) more explicitly and more in accordance with the COSMIN standards. Hence, these studies were generally rated with higher methodological scores. Traditional methods (e.g., comminution/sieve methods) generally received lower ratings for methodological quality because their measurement properties were assessed in studies published during the earlier part of this period. It is possible that comminution methods would be rated higher if the methodology had been described more explicitly, as it usually is in studies published in the last 10 years.
COSMIN was originally designed to assess measurement properties of health-related and patient-reported outcomes and has been used in other systematic reviews to evaluate diagnostic tests and methods to establish performance (Dunaway Young et al., 2016; Kroman, Roos, Bennell, Hinman, & Dobson, 2014). COSMIN was therefore considered relevant for assessing the measurement properties of methods for assessing masticatory performance.
In the included studies, sample size had to be considered because power calculations and confidence interval data, which could have indicated statistical precision, were lacking. COSMIN requires a sample size of n ≥ 30 for a fair and n ≥ 50 for a good grade of methodological quality (Terwee et al., 2012). In addition, two-thirds of the studies had low sample sizes, and the methods varied too much in their mechanics or study populations to pool studies that assessed similar methods.
Because the COSMIN guidelines were originally created to evaluate questionnaires, the sample size requirements do not necessarily apply to studies reporting on performance-based measures. Here, smaller sample sizes may produce a large enough effect size, but this review followed COSMIN requirements.