A semiautomated risk assessment method for consumer products

In this study, we develop a model that assesses product risk using online reviews from Amazon.com. We first identify unique words and phrases capable of indicating hazards. Second, we estimate risk severity using hazard type weights, and risk likelihood using total reviews as a proxy for sales volume. Third, we obtain expert assessments of product hazard risk (risk likelihood and severity) for a sample of high- and low-risk consumer products identified by the computerized risk assessment model we have developed, and we assess the validity of our computerized product risk assessment scoring model using the experts' survey responses. We find that our model is especially consistent with expert judgments of hazard likelihood but less consistent with expert judgments of hazard severity. This model helps organizations determine the risk severity, risk likelihood, and overall risk level of a specific product, and it is helpful for product safety practitioners in product risk identification, characterization, and mitigation.

The CPSC is unable to monitor every product on the market and usually takes action on consumer products with safety concerns only after multiple hazardous incidents and actual injury events have already occurred (Nasri et al., 2018). By utilizing real-time data, automated approaches for detecting potential hazards and assessing product safety-related risks would allow regulators to make faster and more targeted recommendations in consumer product safety. Because online reviews are posted in real time and are openly available to the public prior to online purchase, regulators and consumers should take advantage of this resource to rapidly assess the risks associated with specific products.
The CPSC generates annual National Electronic Injury Surveillance System (NEISS) data. Each short NEISS narrative mentions the specific hazard type (in bold) and some basic context, such as the victim's age (58), gender (M), and brief details of the circumstance in which the injury occurred ("went to bathroom; lost his balance"). In contrast, the Amazon review contains not only the hazard type but also may contain additional, free-form consumer-written narrative, such as the hazard circumstances ("mother … little unsteady"), a product defect description ("not secured to rim…tips due to person leaning … cannot just slide"), and remediation suggestions ("must have secure attachment other than a few sticky pads").

TABLE 1 Examples of hazard reports (bold: hazard type; italic: product safety concerns) of a similar product type from two different data sources.

Data source: NEISS
Example hazard report: 58 YOM went to bathroom; lost his balance, fell to floor, hit upper back on toilet

Data source: Amazon.com
Example hazard report: Bought the item for my mother. She is little unsteady on her feet and fell off this due to fact not secured to rim. It tips due to the person leaning forward to get off causing them to go off balance … for older people it cannot just slide; unit must have secure attachment other than few sticky pads

Abbreviation: NEISS, National Electronic Injury Surveillance System.
Examining online reviews may serve as a fruitful alternative to consulting only NEISS narratives. In December 2016, a Pew Research Center survey 1 found that 79 percent of US adults purchased a product directly through online media. Customers post product reviews on retailer websites (e.g., Amazon.com and Walmart.com) and discussion forums (e.g., Patient.info and Toyotanation.com) for sharing their complaints with other customers regarding product defects (Abrahams et al., 2012; Zaman et al., 2021; Zhang et al., 2019) and safety hazards (Nasri et al., 2018).
Table 1 shows two hazard reports for a similar product type collected from two different data sources, namely, the NEISS and Amazon datasets. The portions in italics illustrate the specific safety hazard, for example, a fall hazard. Hence, integrating a text analytic approach into our risk assessment model would benefit safety regulators and product designers by extracting these hazard category-specific sentences (shown in italics) and estimating the risk level of the product.
The CPSC developed a strategic plan 2 for accomplishing a set of goals and objectives in identifying hazards, including a "risk assessment methodology to detect and prioritize hazards." Consistent with this objective, this study first untangles the various types of narratives describing safety hazard types across diverse product categories and then uses the finely labeled text to estimate an overall risk level for the product. We adopt a text-classification approach for revealing hazard-specific reports from online (real-time) reviews. We then incorporate this method into a risk assessment scoring model. This model assesses product risk using online reviews from Amazon.com. The risk scores generated by our model are compared with scores assigned by product safety experts completing an online survey. This model also helps stakeholders prioritize risk by computing risk scores and risk levels of a certain product. In addition, this study aims to conduct expert assessments of product hazard risk (hazard likelihood and severity) for a sample of high- and low-risk consumer products identified by the computerized risk assessment model we developed. The experts' responses are utilized for calibrating and validating the computerized product risk assessment scoring model.

LITERATURE REVIEW
This section explains related work on using structured data sources, online reviews, and text classification.

Structured hazard data
Many developed countries have organized risk surveillance, risk assessment, and risk alerting systems for monitoring and estimating product risk considering reported safety incidents related to consumer products and informing stakeholders of these risks. Examples include NEISS and RAM 3 by the CPSC in the United States, CAP 4 by the BSI in the United Kingdom, the Risk RCM 5 by Health Canada in Canada, the Australian/NZ Nomograph by the ACCC 6 in Australia, and RAPEX 7 and RAG 8 in the European Union (EU). Table 2 shows that each system falls under risk assessment, risk alerting, or risk surveillance.
3 RAM. Report on international consumer product safety risk assessment practices. Retrieved from: https://www.oecd.org/sti/consumer/Report%20on%20International%20Consumer%20Product%20Safety%20Risk%20Assessment%20Practices.pdf
4 BSI. Corrective action: The closed-loop system. Retrieved from: https://www.bsigroup.com/LocalFiles/en-GB/entropy/BSI-Corrective-and%20Preventive-Actions-Whitepaper-EN-GB-UK.pdf
5 Health Canada. Health Canada decision-making framework for identifying, assessing, and managing health risks. Retrieved from: https://www.canada.ca/en/healthcanada/corporate/about-health-canada/reports-publications/health-products-foodbranch/health-canada-decision-making-framework-identifying-assessing-managinghealth-risks.html

These national product safety agencies effectively store information on safety hazard-related incidents. Furthermore, the data on these incidents are useful for analyzing hazard risk. For example, the consumer safety complaints dataset9 (Saferproducts.gov) and the hospitalization dataset10 (NEISS) are structured data sources utilized by government agencies for identifying product safety risks and informing decisions about recalls, policymaking, or updating technical standards. The attributes present in the consumer complaints and hospitalization data sources are updated annually or biannually. In addition, it is a time-consuming and expensive task for government employees to gather narratives mentioning hazard-related incidents from different hospitals around the country. Furthermore, to protect manufacturer confidentiality during investigations, the public NEISS dataset does not mention the manufacturer name, the specific product name that caused the injury, or the hazard category-specific reports for each product. Indeed, some have advocated for Section 6(b) of the Consumer Product Safety Act to be rescinded so that consumers may be informed more rapidly by the CPSC of safety concerns presented by specific products. The present study introduces an alternative mechanism for using online reviews to rapidly detect hazard category-specific reports for specific products and adopts text-classification methods to automate the risk assessment of consumer products.

Text-based risk assessment
Past studies have focused on disentangling online reviews to reveal semantic trends (Abrahams et al., 2012; Goldberg & Abrahams, 2018). These works showed that general defects do not usually elicit strong emotions, and therefore, they are not emphasized, even if they are implicit and potentially hazardous. Similarly, the text of online reviews may not always be emotive when it comes to explaining safety hazards, making it difficult for traditional text analysis techniques, such as sentiment analysis, to detect safety hazards (Abrahams et al., 2012, 2013, 2015). For example, online consumers repeatedly post negative comments on product usage, but most of these negative comments are complaints about the product (e.g., value and effectiveness), and few of them concern safety hazards. A series of past studies concentrated on applying text-classification methods to defect discovery and safety surveillance (Abrahams et al., 2015; Adams et al., 2017; Goldberg & Abrahams, 2018; Law et al., 2017; Mummalaneni et al., 2018; Nasri et al., 2018; Winkler et al., 2016; Zaman et al., 2021). Nasri et al. (2018) investigated online videos of customers who used multiple product categories to detect the existence of a safety hazard in general.
Rather than only identifying narratives related to safety hazards across multiple products, the current study also aims to classify reports indicating various categories of safety hazards, such as chemical, choking, fall, and fire hazards. In addition, natural language processing rules are utilized in the current study to classify these hazard-specific narratives into subcategories by adapting the semiautomatic n-gram scoring technique from earlier studies (Abrahams et al., 2012, 2013, 2015; Goldberg & Abrahams, 2018; Nasri et al., 2018; Zaman et al., 2021).

RESEARCH CONTRIBUTIONS
Table 3 includes past research focused on the risk assessment of consumer products.
Past research has used conventional and structured datasets that are not updated frequently. The researchers also retrieved risk factors and built risk assessment frameworks based on these conventional data sources only. Conversely, online (real-time) reviews grow rapidly in number every day (Zhang et al., 2019). Most research studies focused on only one domain to identify risk factors; the one exception is Pan et al. (2014), who investigated various consumer products using government agencies' hazard surveillance data sources, such as EU RAPEX and CPSC NEISS. These datasets also represent conventional data, which are updated with incidents related to different hazard and injury cases once or twice a year.
Table 4 lists various risk metrics that are used in different regulators' risk assessment frameworks. The current study adopts the risk metrics of the risk assessment flowchart from the EU guidelines for classifying risk level, severity level, and the likelihood of injury. As shown in Table 4, our study adopts the risk metrics of the regulators' risk frameworks to construct our risk assessment framework for safety hazard surveillance in online reviews. The risk metrics (in italics in Table 4) are new elements that are only available from online reviews, not from conventional structured data sources. For example, our framework considers the total count of customer reviews of a specific product from Amazon.com as a proxy for product popularity and, consequently, likelihood of injury. As Liu et al. (2017) mentioned, the number of online reviews is directly proportional to the sales volume of a product. Hence, we added the review count as one of the elements in our risk assessment scoring model for determining the overall risk likelihood level of a specific product.
Our model is computer-supported (partially automated) for the rapid discovery of subcategories of safety hazards from real-time (online) reviews. Henceforth, the novel overall product risk score computed in this study is referred to as the ZAGG risk score, in recognition of the various contributors to its conceptualization. A human coding (i.e., tagging) protocol is implemented in this study for classifying safety narratives into different hazard categories. We apply hazard category-specific "smoke" terms to score an unseen (holdout) dataset of reports, where the scores indicate the relative frequency (prevalence) of language specific to a hazard type (e.g., chemical hazard) versus not specific to that hazard type (e.g., no chemical hazard). To generalize our risk assessment framework, we consider the smoke terms of each hazard type generated from the hazard-specific reports of different product categories. We focus on executing the following research contributions (RCs).
o RC 1: Constructing a semiautomated risk assessment model by adopting the risk metrics from regulators' frameworks.
o RC 2: Creating and applying the smoke word dictionary of each hazard code type to compute the ZAGG severity and likelihood of each hazard type for determining the risk level of a specific product.
o RC 3: Showing the applicability and validity of our risk assessment model by checking the human (expert) judgment of risk level against the automated judgment of the risk assessment of each product.

METHODOLOGY
The following subsections describe the data collection, data coding and processing, sentence-level tagging procedure, generation of smoke words, and risk scoring procedure.

Data collection and data coding
The availability of representative data is a primary requirement for effectively retrieving hazard-specific narratives. As a first step, a dataset of online reviews was obtained from Amazon.com. Our dataset contains reviews of 17 product categories. We randomly collected the customer reviews of these product categories; as a result, they represent a diverse cross-section of products that customers purchase via the internet. This diverse sample allows us to study product safety hazard occurrences across multiple industries, spanning from baby products to countertop appliances, health/personal care, sports equipment, and others. Reviews pertaining to safety hazards were manually coded (or "tagged") for further analysis. Students at a major research university manually tagged a random set of reviews by reading and labeling each review according to whether a safety hazard was mentioned. In total, we studied 124,289 unique reviews (refer to Table A1) across 17 product categories. We instructed each undergraduate student volunteer to tag 200 reviews, and we randomly assigned these 200 reviews to each volunteer. The undergraduates applied a total of 181,999 tags (see Table A1, Column 2 grand total; some reviews were tagged more than once). A total of 4938 reviews, or 3.97% of the total, were tagged as referring to a safety concern (refer to Table A1, Column 4 grand total).
To verify the tagging accuracy of the students, the 4938 reviews were randomly distributed to a second group of undergraduates, who indicated whether a safety concern incident was mentioned. This second group labeled only 1389 of the 4938 reviews as safety concern reports. Then, in a third round, to assess the tagging accuracy of the undergraduates, graduate students (including the lead author) verified that 740 of the 1389 reports truly represented safety concern narratives. Table A2 (Amazon reviews subtotal) shows the distribution of the 740 safety concern reports among the product categories utilized in the analysis of this study.
Each tagger received their randomized tagging assignments in five-review chunks through an automated online system. Taggers submitted their tags for those five reviews before receiving their next randomized chunk. In assessing the typical time required for taggers to analyze each review, we did not consider time intervals of over 15 min, which often coincided with a tagger taking a break and resuming later. Across the remaining tags, we observed an average time of 169.2 s for each five-review chunk, or 33.8 s per review.
In a prior study, Mummalaneni et al. (2018) confirmed that the proportion of online discussions mentioning safety concerns explicitly is quite small. As such, this large pool of data is crucial for producing a sample size suitable for studying hazard-specific narratives for a variety of products. Hence, these 740 hazard-tagged reviews were used to analyze the risk level of each product in the remainder of this study, for the specific categorization of different hazard types. Following prior studies' suggestions (Zaman et al., 2020, 2021), we focused on sentence-level tagging instead of review-level tagging. This study disentangles the hazard-tagged reviews by manually tagging each sentence of these reviews to reveal specific discussions of hazard categories across a diverse set of 10 product categories.

Sentence-level tagging procedure
This study adapted the coding instructions from Health Canada's RCM11 and the EU RAPEX12 guidelines for manually categorizing different types of hazards. To execute this tagging assignment, a group of students tagged the focal sentences of the hazard reports. In the Amazon dataset, a total of 5499 sentences (extracted from the 740 reviews with verified safety concerns) were manually tagged for each hazard type. We observed an average time of 109.5 s for each five-review chunk, or 21.9 s per review.
Both tagger A and an authoritative tagger, tagger B, individually tagged a random set of focal sentences for each code value in each safety concern category. For the Amazon dataset, authoritative tagger B tagged all 5499 sentences to classify the code values of the hazard type. Then, tagger A tagged the random sets to classify the code values, as necessary, for validation purposes. First, tagger A tagged a small portion to check whether the tagging instructions were clear enough to follow. To verify the interrater agreement (i.e., reliability) between taggers A and B, we computed Cohen's kappa for each code value, which indicated substantial agreement (Landis & Koch, 1977). After resolving conflicts between the taggers, we kept the tags on which taggers A and B agreed for the smoke term generation.
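To make the reliability check concrete, the following is a minimal sketch of the interrater agreement computation, assuming the two taggers' binary labels are available as parallel lists (the labels shown are hypothetical, not our data):

```python
# Minimal sketch: interrater reliability between taggers A and B.
# The labels below are hypothetical; 1 = hazard code value present.
from sklearn.metrics import cohen_kappa_score

tagger_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
tagger_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

kappa = cohen_kappa_score(tagger_a, tagger_b)
# Landis and Koch (1977): 0.61-0.80 indicates "substantial" agreement.
print(f"Cohen's kappa: {kappa:.2f}")
```

In practice, this computation would be repeated once per code value, as described above.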

Generation of smoke terms
Using the Amazon.com dataset, we performed significant-word analysis to identify which phrases (n-grams, or smoke terms) were present in the individual code types of hazard reports. We ran a machine learning algorithm that uses the prevalence measure, or correlation coefficient (CC) score, developed by Fan et al. (2005). Past studies, such as Abrahams et al. (2012, 2013, 2015), Adams et al. (2017), and Zaman et al. (2021), discuss the smoke term methodology.
Several prior studies have conducted comparative analyses of the smoke term scoring methodology against commonly used machine learning approaches, such as deep learning and sentiment analysis, in the context of detecting sparsely populated target categories. The findings of these studies demonstrate that the smoke term scoring methodology often outperforms or is on par with these "black box" approaches, including deep learning word embedding models, in various domains (Brahma et al., 2021; Goldberg et al., 2022; Zaman et al., 2021). Moreover, Goldberg, Gruss et al. (2022) noted that the primary advantage of the smoke term scoring approach over black box methods is its interpretability, which the latter often lack; this limitation could hinder the adoption of black box approaches by practitioners.
The CC scoring algorithm assigns each term a value based on its document relevance, or the frequency of the term in relevant documents compared to the frequency of the term in irrelevant documents. Higher scores indicate terms that are better predictors of relevant documents, as these terms occur very frequently in relevant documents and very infrequently in irrelevant documents (Goldberg & Abrahams, 2018). See Appendix B for a detailed explanation of this process. Importantly, these methods are robust to imbalances in data if there are sufficient target class observations, which we ensured was the case in our experiments.
First, the reviews tagged with each code value were divided to include 80 percent in a training set. To avoid overfitting, we randomly reserved the remaining 20 percent of the reviews as our holdout set, following past studies (Abrahams et al., 2015; Zaman et al., 2020).13 From the training set, this machine learning algorithm (Fan et al., 2005) rapidly identifies words that are more prevalent in a specific code value (e.g., chemical hazard) than in other code values (e.g., no chemical hazard). After extracting initial candidate smoke terms, we manually curated the terms to improve generalizability. Following the same approach as prior studies (Abrahams et al., 2015; Zaman et al., 2020), we removed common English words (e.g., "and," "but," and "it") and words containing specific brand names. Additionally, we removed words with CC scores less than or equal to 0. At the final stage, words with high prevalence in a specific code value are stored in the "n-gram dictionary" along with the CC scores assigned to each prevalent word.
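As an illustration of this curation step, the sketch below filters a set of candidate terms under the stated rules (CC > 0, no common English words, no brand names); the terms, scores, and brand list are hypothetical placeholders, not our actual dictionary:

```python
# Hedged sketch of candidate smoke term curation; all values hypothetical.
STOPWORDS = {"and", "but", "it"}
BRAND_NAMES = {"acme"}  # placeholder for the curated brand list

def curate_terms(cc_scores):
    """Keep terms with CC > 0 that are neither stopwords nor brand names."""
    return {
        term: score
        for term, score in cc_scores.items()
        if score > 0
        and term not in STOPWORDS
        and not any(brand in term for brand in BRAND_NAMES)
    }

cc_scores = {"tips over": 4.2, "and": 0.1, "acme gate": 3.0, "fell": 2.5}
print(curate_terms(cc_scores))  # {'tips over': 4.2, 'fell': 2.5}
```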
We evaluated each smoke term scoring model based on the area under the curve (AUC) value (indicated in Table 5) for detecting the code values of each hazard category across different data sources, such as the Amazon and Fortune 50 retailer datasets. AUC represents the machine learning model's ability to distinguish between positive and negative cases. AUC is scaled from 0 to 1, where a random chance classifier scores 0.5.
For further robustness, in addition to AUC, we also consider AUPRC, or area under the precision-recall curve. This latter metric considers the tension between retrieving true positives while avoiding false positives, which is especially appropriate for imbalanced data, as a random chance classifier could potentially be quite accurate. Although each AUPRC value is scaled from 0 to 1, a random chance classifier will have an AUPRC score that matches the rate of true positives (for example, in our data, if 5 percent of the reviews were labeled fire hazards, then the random chance AUPRC is 0.05). To provide further context for these scores, Table 5 also shows the number of reports of the target class available in the dataset (i.e., the number of positives for that code value in the dataset) and the proportion of the dataset that these records comprise. Finally, Table 5 shows AUPRC-lift, or the degree of improvement offered by a model relative to a random chance classifier, defined as actual AUPRC divided by random chance AUPRC. A random chance classifier would score 1; scores below 1 are worse than random chance, and scores above 1 are superior to random chance. For example, an AUPRC-lift score of 2 implies that the model's performance is double the random chance expectation.

13 As a further robustness check against overfitting, we also implemented a fivefold cross-validation scheme. Each iteration of the model generates similar but slightly different smoke terms; so, for brevity, we focus on one specific iteration in our discussion. Overall, our results were quite stable across the cross-validation scheme; for instance, the Amazon unigram AUC value for falls varied between 0.86 and 0.89, compared to the value of 0.88 reported in Table 5.

TABLE 5 Area under the curve (AUC); area under the precision-recall curve (AUPRC); number of positives; rate of positives; and AUPRC-lift for the smoke-scoring techniques in discovering code values of each hazard category across different data sources.
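The following sketch shows how the three metrics relate, using scikit-learn; the labels and scores are hypothetical stand-ins for one hazard code value:

```python
# Sketch of the Table 5 metrics: AUC, AUPRC, and AUPRC-lift.
from sklearn.metrics import roc_auc_score, average_precision_score

y_true = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]   # 1 = target hazard code value
y_score = [0.1, 0.2, 0.0, 0.3, 0.1, 0.8, 0.2, 0.9, 0.3, 0.7]

auc = roc_auc_score(y_true, y_score)
auprc = average_precision_score(y_true, y_score)
positive_rate = sum(y_true) / len(y_true)  # random-chance AUPRC
auprc_lift = auprc / positive_rate         # >1 beats random chance

print(f"AUC={auc:.2f}, AUPRC={auprc:.2f}, lift={auprc_lift:.1f}")
```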

The number of positives indicates where the AUC estimate is robust: where the total number of reports for the target class (i.e., positives) is greater than 20. However, the AUC estimate is more brittle (less reliable) where the total number of reports in the target class is less than 20. In several cases, no AUC is available because there are no reports of that target class in that dataset (i.e., no positives for that code value in the dataset). Moreover, when considering AUPRC in the context of AUPRC-lift scores, it is apparent that the models are substantially more predictive than random chance, even when accounting for imbalance in the dataset.
As safety concerns are rare in online reviews, severe class imbalance is typical of safety concern discovery problems (Goldberg & Abrahams, 2018; Nasri et al., 2018). Term prevalence computation is only sensitive to class imbalance if there are too few target class items to determine which terms are statistically significantly more prevalent in the target class. For example (see Table 5), we found the total number of reports in the target class to be less than 20 for the strangulation, swallowing, and suffocation hazard categories. Future studies need more data collection and coding until a sufficient number of target class items have been identified in each of the hazard categories that currently have insufficient target class observations.
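A simple screening step of this kind can flag the brittle categories before modeling; the counts below are hypothetical, not the Table 5 values:

```python
# Flag hazard categories with too few positives for stable term
# prevalence estimates; counts are hypothetical illustrations.
MIN_POSITIVES = 20  # threshold discussed above

positives_by_category = {
    "fall": 214, "fire": 57, "chemical": 48,
    "strangulation": 9, "swallowing": 12, "suffocation": 7,
}

for category, n_pos in positives_by_category.items():
    status = "ok" if n_pos >= MIN_POSITIVES else "brittle: collect more data"
    print(f"{category:>14}: {n_pos:>3} positives -> {status}")
```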

Risk scoring procedure
To assess the severity of food safety hazards, Goldberg et al. (2020) considered the smoke term scores for reviews of products containing smoke terms. In their model, the smoke term score (the amount of hazard-related language) was shown to be associated with severity. Similar to Goldberg et al.'s (2020) approach to estimating the overall risk score, we applied (see Figure 1) the smoke term (word) generation of each hazard code value to our risk assessment scoring model for computing the likelihood and severity of a hazard code value. The results obtained from computing the specific hazard likelihood and severity were incorporated into determining the overall risk level of a specific product.
As delineated in the following formula, the word list of each hazard code value was used to score each focal sentence of a product from the validation set. To compute the standardized score of each focal sentence, we divided the score of each focal sentence by the maximum smoke score sentence for that product. This sentence smoke score standardization procedure was only used if the maximum product smoke score exceeded 10,000; if no sentence score exceeded that threshold, all standardized sentence smoke scores were set to 0. We chose 10,000 as an arbitrarily high threshold indicating substantial prevalence of smoke terms. Following Goldberg and Abrahams (2018), we conducted a sensitivity analysis at different thresholds and found 10,000 to be one of many suitable thresholds. The scored sentences were manually reviewed to identify false positives (i.e., positive-scored sentences that did not mention true hazard incidents were discounted). We set the standardized sentence smoke scores to zero for the false positives found during this manual review. Though this step introduced some manual intervention into the procedure, the overall effort is modest, as comparatively few sentences from the overall dataset need to be manually reviewed. The payoff from the additional manual review is substantial, as false positives that may bias the risk assessment scores can be weeded out. After identifying the true positive sentences, we then took the summation of the standardized scores of the true positives, s_i, of the specific hazard code value, c, of a product, p, to arrive at a standardized product smoke score S_c,p:

S_c,p = Σ_{i=1}^{n} s_i    (1)

where n is the total number of scored sentences that are true positives of a specific hazard code type, c, of a product, p.
For example, if n = 2, then S_c,p = s_1 + s_2. That is, if there are two hazard-mentioning sentences (s_1 and s_2) for a particular product, each with a standardized smoke score, the standardized smoke scores of the two sentences can be added, with simple addition, to obtain a standardized total smoke score for the product, denoted S_c,p.
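The standardization, thresholding, false-positive zeroing, and summation steps can be sketched as follows; the raw sentence scores are hypothetical, and the function assumes false positives have already been identified by manual review:

```python
# Hedged sketch of Formula (1): standardized product smoke score S_c,p.
THRESHOLD = 10_000  # high-prevalence threshold discussed above

def product_smoke_score(raw_sentence_scores, false_positive_idx=()):
    """Standardize sentence scores and sum the true positives."""
    max_score = max(raw_sentence_scores, default=0)
    if max_score <= THRESHOLD:
        return 0.0  # no sentence reaches the prevalence threshold
    standardized = [s / max_score for s in raw_sentence_scores]
    for i in false_positive_idx:   # zero out sentences flagged as
        standardized[i] = 0.0      # false positives in manual review
    return sum(standardized)

raw = [15_000, 12_000, 800]  # hypothetical smoke scores, three sentences
print(product_smoke_score(raw, false_positive_idx=(2,)))  # 1.8
```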

Estimating hazard likelihood for a product
To estimate the hazard likelihood, we mapped the count of Amazon reviews, which is a proxy for product sales (Liu et al., 2017) and hence may approximate product penetration, to the EU risk likelihood scale. To implement this mapping and approximate a risk likelihood, we took the standardized smoke score (S_c,p, defined above) of reviews mentioning a hazard of a specific type (c) for the product (p) on Amazon.com and divided it by the total count of customer reviews, R_p, for that product (p) from Amazon.com:

Likelihood of a hazard code type = S_c,p / R_p    (2)

Estimating hazard severity for a product

Goldberg et al. (2020) indicated that higher smoke term scores increase the likelihood of serious safety hazards being present in a product. As a result, they used the cumulative value of the smoke term scores as evidence for severity. We note that other variables could supplement these measures in other contexts; for example, Graham and Chang (2015) used "annual injury count" as one of the input variables in their "switch-point cost model" to estimate the economic costs of automatic protection systems that could be designed to reduce the risk of injury from using a specific product, a table saw. To compute the severity level of the hazard code type of a product, we first considered the count of sentences mentioning a specific hazard code type across all product reviews for that product model. As shown in Table 6, we count the number of sentences mentioning the specific hazard code type of a product and then assign a weight, w_s, which escalates as the number of sentences mentioning a hazard increases but plateaus when this count reaches 3 (since additional mentions likely have little marginal utility).
We combined the weight, w_c, of risk for each hazard code type with the assigned weight, w_s, of the sentence(s) mentioning the hazard code type of a product to compute the hazard severity level of the product. We assigned a weight to each hazard type before the survey and compared our pre-survey weights, w_c, with the weights assigned by five experts in the survey described in the next section. Hence, we incorporated the experts' average weight for each hazard code type into our risk assessment model to determine the severity level of the hazard of each product:

SL_p = w_c × w_s    (3)
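The two formulas can be sketched as follows. The sentence-count weights and the hazard-code weight shown are illustrative only (Table 6's actual values are not reproduced here), and reading "combined" as multiplication in Formula (3) is our interpretation:

```python
# Hedged sketch of Formulas (2) and (3); all input values illustrative.
def likelihood(S_cp, review_count):
    """Formula (2): likelihood of hazard code c for product p."""
    return S_cp / review_count

def sentence_weight(n_sentences):
    """w_s escalates with hazard-mentioning sentences, plateauing at 3."""
    weights = {0: 0.0, 1: 0.25, 2: 0.50}   # illustrative, not Table 6
    return weights.get(n_sentences, 0.75)  # plateau for counts >= 3

def severity(w_c, w_s):
    """Formula (3): severity level SL_p, reading 'combined' as a product."""
    return w_c * w_s

print(likelihood(S_cp=1.8, review_count=600))     # 0.003
print(severity(w_c=0.8, w_s=sentence_weight(2)))  # 0.40
```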

Estimating severity for a hazard type
We assumed that the risk of each hazard code type (w_c) would be weighted differently. For example, for the code value of a hazard category, a fire hazard could receive a higher weight than a chemical hazard. In contrast, a fire hazard could have a lower weight than a strangulation hazard. We surveyed product safety experts, who manually decided the weight of risk for each hazard code value. We emailed 54 expert attendees of a major international event for product safety experts: the 2019 Annual Symposium of the International Consumer Product Health and Safety Organization (ICPHSO). A small subset of attendees was solicited to participate, and snowball sampling was used (participants were asked to recommend colleagues). Five of the 54 experts completed the survey. Similarly, using expert-level survey responses, Goldberg et al. (2020) validated the performance of a food safety monitoring system that used review-based factors to determine the overall risk score of a product. The average number of years of experience in product safety and product risk assessment among our respondents was about 21 and 19 years, respectively. We believe the hazard weights solicited from this small group of highly experienced experts hold greater validity than a large set of human assessment results from nonexperts. To determine severity weights for each hazard code value, we simply averaged the weights assigned by the five respondents for that hazard code value.

Procedure for evaluating the risk assessment model
Once the online reviews were classified into hazard categories, the risk model was then defined and applied, and finally, the risk model was evaluated, as shown in Figure 2.
The risk levels and coding schemes for the hazard code types were adapted from the coding manuals and risk assessment frameworks of Health Canada and the EU guidelines. Our model first included the automated approach for generating the words that were most prevalent in each code type.
Step (1): The automated approach for scoring hazard-specific sentences was incorporated into our risk assessment model to score each sentence with respect to a hazard code type.
Step (2): The score of each sentence was standardized to obtain a value scaled from 0 to 1. Using Formula (1), we computed the summation of all standardized scores for sentences that mentioned an incident specific to a hazard code type, where the maximum sentence smoke score exceeded the 10,000 threshold.
Step (3): We then included the raw count of online reviews of a particular product from Amazon.com.
Step (4): Using Formula (2), we computed the likelihood of the hazard code type.
Step (5): Referring to the assigned weights, w_s, shown in Table 6, we used the existence of hazard-specific sentences by counting the number of sentences mentioning the specific hazard code type of a product.
Step (6): Using Formula (3), we combined the weight of the risk for each hazard code type, w_c, with the assigned weight of the sentence(s) mentioning the hazard code type of a product, w_s, to compute the hazard severity level of the product, SL_p.
Step (7): The model thus measured two different risk scores, the likelihood of the specific hazard code type and the severity level of a hazard, for determining the risk level (ranging from low to serious) of a product. This study adapted the risk matrix chart from the EU guideline to locate the risk level of the product (see Table 7).
Step (8): To calibrate our risk scores with the risk matrix of the EU guideline, we scaled the four categorical hazard

RESULTS
The following subsections report our results in each phase of model building and assessment.

Descriptive analysis
Each product review in the Amazon.com dataset was tagged for the existence of a safety concern. The safety concern analysis in our study primarily focused on discovering whether or not a hazard incident exists across the multiple product reviews (i.e., the 740 hazard-tagged reviews) that contain 5499 focal sentences.
Figure 3 shows the hazard code types of multiple product categories regulated by the CPSC (except for the automotive product category, which is regulated by NHTSA); the product reviews were obtained from the Amazon website for our study. All the hazard code types mentioned in the chart were obtained through manual tagging (see Figure 3). Figure 3 shows that various code values of the hazard category are mostly present among baby products. Note that although we initially considered 17 product categories, the prevalence of safety hazard reports was far greater in some categories than in others. After performing successive rounds of tagging and identifying 740 reviews with verified hazards, only 10 of the 17 product categories were represented. The remaining categories were excluded from further analysis. Likewise, some hazard types were far more common than others. We depict all observed hazard types in Figure 3, but in the ensuing analyses, we excluded the biological and electrical hazard types due to insufficient sample size.
A high count of sentences mentioned the fall hazard code type for baby and health and personal care products. The hazard reports for baby products frequently mentioned that babies and toddlers tend to fall, mostly from toddler beds, booster seats, and car seats. Compared with other product categories, there is also a high count of reports for health and personal care products mentioning the fall code value, as reviewers mentioned a tendency to fall when using products like wheelchairs, walking sticks, and toilet safety rails. On the other hand, the lowest count of hazard code reports is present in the automotive product category: three fire hazard-related reports and one biological hazard-related report. We excluded the hazard code reports in the automotive product category, again due to insufficient sample size. In total, we included all the hazard code types of 9 product categories from Amazon.com along with 1 seasonal product category from a Fortune 50 retailer (see Table A2).

Severity weights for hazard codes
As we assigned a weight to each hazard type before the survey, we compared our pre-survey weights, w_c, with the weights assigned by the five experts in the survey described previously (see Table 9). Table 9 shows the weights assigned by each expert, represented as w_1, w_2, w_3, w_4, and w_5. To detect outliers in each weight, we calculated the first quartile, Q_1, and the third quartile, Q_3, of the experts' assigned weights. As we did not find outliers for any of the weights assigned by the experts, we incorporated the simple mean weight (the average weight of all five experts for that hazard type) of each hazard code type into our risk assessment model to compute the severity level of the hazard of each product using Formula (3).
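The outlier screen and averaging can be sketched as below; the five weights are hypothetical, and the 1.5 × IQR fences are a standard convention we assume here rather than one stated in the text:

```python
# Hedged sketch: IQR-based outlier screen on five expert weights.
import numpy as np

expert_weights = np.array([0.70, 0.80, 0.75, 0.85, 0.90])  # w_1..w_5

q1, q3 = np.percentile(expert_weights, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed standard fences
outliers = expert_weights[(expert_weights < lower) | (expert_weights > upper)]

if outliers.size == 0:
    # No outliers: use the simple mean weight in the severity formula.
    print(f"Mean weight = {expert_weights.mean():.2f}")
```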
We note that the mean usefully considers all expert opinions even in instances of disagreement. It is perhaps expected that experts with varying backgrounds would arrive at slightly differing conclusions on hazard severity. Rather than choosing one specific expert's judgment, the mean reconciles the entire distribution of opinions.

Positioning products on the EU risk matrix
After performing the risk score calculations for the low-risk product G and the high-risk product H, we mapped the likelihood and severity level of a fall hazard to locate the overall risk level of each product. For the low-risk product G, the ZAGG likelihood of a fall hazard was 0.0008, and the ZAGG severity level of the hazard was 0.20. For the high-risk product H, the ZAGG likelihood of a fall hazard was 0.003, and the ZAGG severity level of the hazard was 0.40. As a result, the risk matrix indicated a low risk level for product G because the likelihood of a fall hazard was greater than 0.0001 and the ZAGG severity level was in the range of 0-0.25, corresponding to severity level 1 according to the EU guideline's risk assessment matrix. In contrast, the risk matrix (see Table 10) indicated a high risk level for product H because the likelihood of a fall hazard for this product was greater than 0.001 and the ZAGG severity level was in the range of 0.25-0.50, which corresponds to severity level 2 according to the EU guideline's risk assessment matrix. Hence, our study adapts the risk assessment matrix from the international standards of the EU guidelines (see Table 10).
We identified the risk levels of various products across multiple hazard code types (chemical, choking, fall, fire, etc.) after computing the risk scores (i.e., likelihood and severity level) and then mapping these computed risk scores onto the risk matrix chart adapted from the EU guideline. Based on the specific risk level of a product, a safety regulator can take proper action (as listed under Table 10).
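A hedged sketch of this matrix lookup, covering only the two bands exercised by products G and H above (the full Table 10 matrix is richer than this), is:

```python
# Illustrative risk-matrix lookup; thresholds follow the worked
# examples for products G and H, not the complete EU matrix.
def severity_level(sl):
    """Map a ZAGG severity score to an EU severity level (1-4)."""
    cuts = [0.25, 0.50, 0.75]  # level 1: 0-0.25, level 2: 0.25-0.50, ...
    return 1 + sum(sl > c for c in cuts)

def risk_level(likelihood, sev_score):
    sev = severity_level(sev_score)
    if likelihood > 0.001 and sev >= 2:
        return "high"
    if likelihood > 0.0001 and sev == 1:
        return "low"
    return "consult full matrix (Table 10)"

print(risk_level(0.0008, 0.20))  # product G -> low
print(risk_level(0.003, 0.40))   # product H -> high
```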
We selected 14 random products with low (L) overall risk according to the ZAGG model and 16 random products with medium-to-serious overall risk according to the ZAGG model. These 30 products were assigned at random to the 5 survey respondents to elicit their opinions as to each product's hazard likelihood, hazard severity, and overall risk. Each expert was presented with 1 product (for training) followed by 10 random products for assessment. Random subsamples of 10 products for each expert were chosen to reduce assessment fatigue, as complete assessments of all 30 products would have been excessively burdensome for individual experts. Each expert was first asked to assess each product's hazard likelihood, severity, and overall risk by consulting only the raw Amazon reviews for that product, directly from Amazon.com. Then, the expert was presented with specific hazard narratives for the particular product, which were highlighted by our smoke-scoring procedure, and asked to reassess the product's hazard likelihood, severity, and overall risk, given the specific narratives we highlighted.

Comparison of expert-assessed risk level with ZAGG-computed risk level
For the 14 low overall risk products selected using the ZAGG model, experts agreed in 11 out of 24 assessments (i.e., 46% of post-judgments) that the product was low overall risk (see Table 11). (Note that there are more product assessments than products, as some products were presented to multiple experts, due to the randomization procedure for selecting products for presentation to experts.)

TABLE 11 Assessing the agreement between the pre-calibrated ZAGG-computed risk level, with default weights of hazard codes, and the expert-assessed risk level.

Table 11 shows the agreement between the pre-calibrated ZAGG-computed risk level and the risk level assessed by experts. These pre-calibrated ZAGG risk levels and their pre-calibrated ZAGG risk scores were determined using the default weights of the hazard codes, w_c. For instance, for the 7 products with serious overall risk according to the pre-calibrated ZAGG risk model, we observed that the experts agreed in 8 out of 17 evaluations that the product had a serious overall risk. (Again, there were more evaluations than products, as each product may have been presented to multiple experts, due to the randomization procedure.) Therefore, the experts agreed 47.1% of the time with the pre-calibrated ZAGG-computed risk model that a product had a serious overall risk.

Table 12 shows the revised, post-calibration ZAGG scores that were computed using the experts' average weight of each hazard code, w̄_c, as determined after the expert survey (rather than the default, pre-calibration weights). We then compared the agreement between the post-calibrated ZAGG-computed risk level and the risk level assessed by the experts.
For example, for the 9 products with serious overall risk according to the post-calibrated ZAGG risk model, we found that the experts agreed in 9 out of 12 evaluations that the product had a serious overall risk. (There were more evaluations than products, as each product may have been distributed to multiple experts, due to the randomization procedure.) Therefore, the experts agreed 75.0% of the time with the post-calibrated ZAGG-computed risk model that a product had a serious overall risk. After using the experts' average weight of each hazard code, we observed an improvement (from 47.1% to 75.0%) in the experts' agreement with our post-calibrated ZAGG risk level in assessing serious overall risk. On the other hand, there was a drop in agreement (from 6.7% to 0.0%) between the experts' medium overall risk assessments and the post-calibrated ZAGG overall risk level.
We attribute the low percentage agreement among experts partly to the fact that the incidents are self-volunteered consumer reviews, and hence incomplete information is available on each incident. For example, the eventual consumer health outcome is often unknown. Further information on incidents can only be obtained via in-depth interviews with the participant (e.g., in person, via telephone, or via email). These in-depth investigations would typically be undertaken by the manufacturer (which is unwilling to provide us with this information due to its sensitivity) or the regulator (which is unable to provide us with this confidential investigation information by law, under Section 15(b) of the Consumer Product Safety Act). The identities of the consumer complainants are not available to us, and we do not have the internal resources to conduct in-depth investigations. Thus, some amount of disagreement must be tolerated, due to the inherent incompleteness of available information on each incident.
Table 13 shows the pre-calibrated ZAGG-computed risk levels of each product using the default hazard code weights, w_c. Moreover, Table 13 displays the post-calibrated ZAGG-computed risk levels of each product using the experts' average weight of each hazard code, w̄_c. Additionally, Table 13 shows the 5 experts' assessments of the likelihood and severity scores of the hazards for each of the 30 products in our survey. To compare the ZAGG (computer-assessed) scores to the experts' judgments, Table 13 displays the average scores of each risk metric (i.e., likelihood and severity) when the experts had only read the product reviews directly from Amazon, and then also shows the average scores after the experts reassessed the risk by reading the specific hazard narratives for the particular product that were highlighted by our computerized smoke-scoring procedure.

TABLE 13 Risk level of each product for its specific hazard code value.
§ The table contains 28 unique products that are anonymized by randomly assigning distinct product numbers to each product; each product contains at least one hazard code.
b Human expert before-average risk score is given by reading reviews from Amazon.
c Human expert after-average risk score is given after reading the hazard snippet.
d Post-calibrated ZAGG-computed risk level using the experts' average weight of each hazard code.
e Indicates the experts, on average, changed their risk assessment score after being presented with hazard snippets.
f Indicates the pre-calibration ZAGG-computed risk level changed when including the experts' average weight of each hazard code in the post-calibrated ZAGG risk scoring model.
We show the conflict between expert opinions in Tables 12 and 13. Substantial conflict between expert opinions of hazard severity is inevitable. Future studies may resolve such conflict through various mechanisms, such as (1) taking the most conservative opinion (e.g., if any one expert regards a hazard as severe, regard the hazard as severe) or (2) undertaking additional investigation, with additional information gathering or additional experts, of incidents where experts currently disagree.
Out of 54 participants, 5 experts fully responded to the online survey assessing the hazard risk of 10 randomly assigned products. As a result, the determination of risk level was not uniform across all the products (see Table 13). For instance, by random chance, only 1 expert assessed product number 7, determining the risk level for chemical hazards. That expert recorded the same risk level after reading the specific hazard snippets of the product. On the other hand, again by random chance, all five experts received the survey question assessing the risk level of product number 21. Hence, we received a series of risk levels, low/low/medium/low/low, from the judgments of the five experts for product number 21.
Table 13 shows a conflicting result regarding the determination of the risk level by the computer versus the experts. For example, one expert determined a serious hazard risk level for product number 6, whereas our risk assessment scoring model determined a low risk level for the same product after computing a low likelihood score for the fall hazard (0.001) along with a low severity score (0.17). The low hazard risk level for this particular product is due to our ZAGG risk model discovering only one hazard snippet. Contrary to the ZAGG risk assessment, the expert's rationale for the product's risk assessment is given in the following statement: "There is a hazard from it being too easy for a toddler to climb over and fall; there was one report of a child landing on his head after scaling the gate. There was also at least one report of a part suddenly breaking loose and shooting into the air, striking a child." Both the computer and the expert scored the same likelihood of fall hazard, 0.001. However, the statement given by the expert confirms that the expert gave a heavier weight (0.75) to the hazard severity of this specific product than the computer discovered. Our ZAGG-computed risk methodology was successful in rapidly detecting these high-scored sentences prevalent in hazard-specific narratives, as the expert gained knowledge of the fall hazard incidents after reading the particular hazard snippets of product number 6.

Risk model evaluation results
To evaluate the validity of our risk assessment model, we examine correlations between the computer algorithm-derived score and the average (human) score of the experts for each risk metric. We conducted correlation analyses comparing the pre- and post-calibration computer (ZAGG) scores against the human judgments of risk when the experts read only raw Amazon reviews ("Human Expert before") versus after reading computer-discovered hazard snippets ("Human Expert after"). For likelihood values, we observed nonlinear relationships, so we used Spearman's rank correlation coefficient, which is robust to nonlinearity. For severity values, we did not observe nonlinear relationships, so we used Pearson correlations (see Table 14).14

The Spearman correlations between the pre-calibrated ZAGG-computed likelihood and the expert-assessed likelihood scores before and after reading the hazard-specific narratives were 0.51 and 0.52, respectively. These values indicate relatively strong concordance between the model and expert likelihood assessments. Conversely, the Pearson correlations between the post-calibrated ZAGG-computed severity and the expert-assessed severity level before and after reading the hazard-specific narratives were 0.06 and 0.19, respectively. After assigning the experts' average weight of the hazard code to the post-calibrated ZAGG risk model, we observed an overall improvement in agreement with the experts' judgment in assessing the hazard severity of a product. For instance, an increase in the correlation (from −0.02 to 0.19) indicates a boost in agreement on the hazard severity level between the post-calibrated ZAGG model and the experts' judgment after reading the hazard-specific snippets of a product. Even though there is an increase in correlation, the low correlations between the post-calibrated ZAGG scores and human scores of severity (0.06 and 0.19) indicate that the computer scoring method is still limited in assessing a product's risk level. However, our computerized method for detecting hazard-specific sentences from a large pool of data did cause experts to alter some of their assessments. As indicated by the "e" superscripts in Table 13, our results indicate that the experts occasionally, though rarely, changed their decisions on hazard likelihood, hazard severity, and overall risk level after reading the hazard-specific sentences presented by the computerized method.

14 Only risk severity is shown pre- versus post-calibration, as calibration does not alter the risk likelihood score.
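The correlation analyses can be reproduced in outline with SciPy; the paired scores below are hypothetical stand-ins for the Table 14 inputs:

```python
# Sketch of the validity correlations: Spearman for likelihood
# (robust to nonlinearity), Pearson for severity; data hypothetical.
from scipy.stats import spearmanr, pearsonr

zagg_likelihood = [0.0008, 0.003, 0.0005, 0.002, 0.001]
expert_likelihood = [0.2, 0.8, 0.1, 0.6, 0.5]

zagg_severity = [0.20, 0.40, 0.17, 0.35, 0.50]
expert_severity = [0.25, 0.75, 0.30, 0.40, 0.60]

rho, _ = spearmanr(zagg_likelihood, expert_likelihood)
r, _ = pearsonr(zagg_severity, expert_severity)
print(f"Spearman rho (likelihood)={rho:.2f}, Pearson r (severity)={r:.2f}")
```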
The computer model is limited, especially in assessing severity; this implies that human judgment should be relied on for assessing severity. However, the hazard snippets highlighted by the computer model did cause humans to change their ratings up or down, so the machine approach is still powerful. Experts do not often alter their severity rating after seeing the highlighted snippets, but they do change their likelihood and overall assessments.

DISCUSSION AND CONCLUSIONS
In this study, we developed a model for determining the risk level of each product and explored the text classification of hazard code types. We tested our risk assessment model after conducting the semiautomatic categorization of sentences using online reviews of products across multiple product categories. In addition, we included risk factors adapted from practitioner risk assessment frameworks and integrated online reviews with text analytics to determine the computerized risk scores (i.e., the likelihood of a hazard code type and the severity level) and overall risk level of a product. Hence, these risk scores helped us map the risk level of a specific product onto the risk assessment matrix chart of the EU guideline. The semiautomated framework not only considered the content of the online (real-time) reviews but also utilized the raw counts of customer reviews from Amazon's online data source. Ultimately, the risk-level matrix may help support decision-making for stakeholders like compliance officers, risk-management officers, and other risk-prevention experts; the framework may help them decide whether they should mitigate the risks or hazards related to specific product models.
Our study is subject to several limitations. One limitation is that our sample was limited to 30 random products. Regarding sample selection, our sample consisted of products with various levels of risk according to our computer model. However, we did not select a control group of products with zero smoke scores for all sentences in their product reviews. This omission could have hindered our ability to observe stronger correlations between the computerized and manual risk assessment scores of products. Our survey also omitted very high-risk products (e.g., products that were recalled), as Amazon.com removes listings for those products when the recall is announced.
Our survey was hindered by low participation, possibly due to the voluntary nature of the task. Moreover, though 10 participants started the survey, only 5 completed it and signed the re-consent at the end, possibly due to fatigue from a long (90 min) survey with no compensation. In the future, we may provide a shorter survey (e.g., 5 random products instead of 10) and/or offer remuneration to survey participants for their effort.
A prospective study may aim to explore an alternative set of research questions by examining the product characteristics that influence the ZAGG-identified level of risk. Specifically, this investigation may examine the impact of various product attributes, such as product category, manufacturer, materials, and other related factors, on the perceived level of risk. Such an inquiry may provide a valuable understanding of the specific product features that affect risk and could have significant implications for product development, marketing, and consumer safety.
Future research could update the risk assessment model by further refining the severity weights for hazard code types. For instance, a review mentioning key terms such as "infant," "fall," "doctor," or "fracture," or mentioning groups at heightened risk of product injury, could vary the severity weight. The updated model could potentially focus on classifying various severity levels, such as minor, major, and life-threatening.
We observed high uncertainty for some hazard types, which is a limitation of our study. Future work may gather further incidents of these high-variance hazard types and submit them for additional expert coding to determine whether factors relating to the hazard description (e.g., omitted diagnosis, omitted treatment, or other information omissions or inclusions) explain the variance. Additionally, further background information on each incident would help reduce variance. However, such supplementary information is typically available only to retailers, manufacturers, and federal regulators, as the identities of the consumer complainants are not available to us as researchers.
To enlarge our sample size of hazard reports, we plan to merge multiple data sources in addition to Amazon.com in future work. For example, Walmart, Target, and eBay.com also contain large pools of online reviews for consumer products. Executing the semiautomatic classification procedures on a sample gathered from multiple sources will increase the chance of collecting a high number of unique hazard reports.
Our findings have implications for both research and practice. First, this study opens the opportunity for safety hazard researchers to build a knowledge hub of hazard narratives by shifting from privately used hospital datasets to real-time (e.g., online) datasets. Second, we introduced an application that categorizes real-time hazard narratives and assesses the hazard risk levels of various products. Third, we executed the risk assessment methodology using the risk metrics adapted from the EU guidelines. In terms of the theoretical contribution of this study, safety hazard researchers can delve into extending "uncertainty theory" (Wu et al., 2017), as we incorporated risk metrics like likelihood estimation, severity estimation, and hazard identification. These risk metrics were computed to build our risk assessment scoring model for determining the hazard risk level of consumer products. We hope that the results of this study can provide a reference for future researchers extending uncertainty theory from the perspective of using the specific hazard incidents mentioned in online reviews of consumer products. As a result, regulators could use our risk assessment methodology for detecting and prioritizing the hazards of consumer products. By using the risk matrix chart, additional stakeholders, such as insurance underwriters providing product liability insurance, could estimate insurance claim likelihood and magnitude using the likelihood of hazard and severity level for a specific product model. Hence, our risk assessment methodology benefits a wide range of stakeholders in rapidly identifying and mitigating the specific hazards of consumer products.

APPENDIX A
The smoke term scoring methodology employs an information retrieval technique, the correlation coefficient (CC) score, to generate initial candidate smoke terms, as proposed by Fan et al. (2005). The CC scoring algorithm assigns a value to each term in the training set, with higher values indicating terms that are more predictive of relevant documents; specifically, these terms appear more frequently in relevant documents than in irrelevant ones. To explicate the process of scoring the smoke terms, the mathematical equation and matrix operations from prior studies (Goldberg & Abrahams, 2018; Goldberg, Gruss, et al., 2022) are adopted. In this method, each term in the smoke term dictionary is indexed by i, and each review is indexed by j. The number of hazard category-tagged reviews in which term i occurs is denoted as A, and the number of no-hazard category-tagged reviews in which term i occurs is denoted as B. C is the number of hazard category-tagged reviews in which term i does not occur, and D is the number of no-hazard category-tagged reviews in which term i does not occur. The total number of reviews, A + B + C + D, is represented as N. Based on the chi-square distribution, the CC score expresses the relevance of term i as follows:

CC_i = √N (AD − BC) ∕ √((A + B)(C + D)(A + C)(B + D))

The use of the CC score may be thought of as a hyperparameter for the smoke term model. Other relevance scores may be utilized in its place, and in certain applications these may provide superior performance. In this work, we experimented with several methods discussed in Fan et al. (2005), including Robertson's selection value and the relevance correlation value. In our application, we found that the CC score offered the best performance, and thus we utilize it in our modeling.
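To make the computation concrete, the following Python sketch tallies the four counts and computes the CC score for a single term over a toy labeled training set. The example reviews and labels are hypothetical, not data from the study.

```python
import math

def cc_score(a: int, b: int, c: int, d: int) -> float:
    """Chi-square-based CC score for one term.

    a: hazard-tagged reviews containing the term
    b: no-hazard reviews containing the term
    c: hazard-tagged reviews lacking the term
    d: no-hazard reviews lacking the term
    """
    n = a + b + c + d
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return math.sqrt(n) * (a * d - b * c) / denom if denom else 0.0

def term_counts(term, tokenized_reviews, hazard_labels):
    """Tally a, b, c, d for one term over (token set, hazard label) pairs."""
    a = b = c = d = 0
    for tokens, hazardous in zip(tokenized_reviews, hazard_labels):
        present = term in tokens
        if hazardous:
            a += present
            c += not present
        else:
            b += present
            d += not present
    return a, b, c, d

# Hypothetical toy training set: token sets with hazard labels.
reviews = [{"chair", "fell", "apart"}, {"leg", "fell", "off"},
           {"great", "chair"}, {"love", "it"}]
labels = [True, True, False, False]
print(cc_score(*term_counts("fell", reviews, labels)))  # 2.0 for this toy set
```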
We designate D as a row vector corresponding to the candidate smoke term dictionary. Each term in the dictionary is represented as d_i, with the terms indexed from 1 through m, where m denotes the total number of unique terms in the dictionary. Hence, D = [d_1 d_2 ⋯ d_m]. Similarly, we define C as a row vector in which each entry is the CC score for the ith term in the smoke term dictionary, denoting the weight applied to that term. Each CC score is represented as c_i, and the vector is indexed from 1 through m; thus, C = [c_1 c_2 ⋯ c_m]. For example, a four-term dictionary would be written D = ["fall" "popped" "slid" "frustrated"]. Subsequently, we define X as the document-term matrix that records the frequency at which each term in the smoke term dictionary occurs in each review in the dataset. The columns of X are indexed by the terms in the smoke term dictionary D, from 1 through m, and the rows of X are indexed by the reviews in the dataset, from 1 through n. Thus, X is an n × m matrix, with n rows (one row for each review) and m columns (where the ith column of X pertains to the ith term in D). Due to the relatively infrequent occurrence of hazard category-specific reports, most smoke terms do not appear in most reviews; hence, X is a sparse matrix, with a considerable number of its values being zero. We represent Y as the column vector of smoke term scores for the reviews, where the smoke term score of each review is y_j, indexed from 1 through n; thus, Y = [y_1 y_2 ⋯ y_n]^⊤. Y is computed by multiplying X by C^⊤, which increments the smoke term score of a review by c_i every time d_i appears in that review. Upon scoring the reviews in the holdout set, we sort all text documents by their smoke term scores, from highest to lowest; higher smoke term scores indicate the presence of hazard category-specific language. A repository of Python code used to perform these operations is provided in Goldberg, Gruss et al. (2022).
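The following self-contained Python sketch mirrors these operations end to end. The dictionary, reviews, and CC weights are illustrative stand-ins rather than the study's data; the authoritative code is the repository cited above.

```python
import numpy as np
from scipy.sparse import lil_matrix

# Illustrative smoke term dictionary D (m terms) and tokenized reviews (n reviews).
D = ["fall", "popped", "slid", "frustrated"]
reviews = [
    "completely slid off the chair before the fall happened".split(),
    "springs on the seat popped out".split(),
    "we were so frustrated we decided to return it".split(),
    "cup holder will not stay on".split(),
]
C = np.array([2.0, 1.5, 1.2, 0.8])  # hypothetical CC scores for the four terms

# Build the sparse n-by-m document-term matrix X:
# X[j, i] counts how often term d_i occurs in review j.
n, m = len(reviews), len(D)
X = lil_matrix((n, m), dtype=np.float64)
index = {term: i for i, term in enumerate(D)}
for j, tokens in enumerate(reviews):
    for token in tokens:
        i = index.get(token)
        if i is not None:
            X[j, i] += 1

# Score each review: Y = X C^T adds c_i to a review's score per occurrence of d_i.
Y = X.tocsr() @ C

# Rank reviews from highest to lowest smoke term score.
for j in np.argsort(-Y):
    print(f"review {j + 1}: smoke term score = {Y[j]:.1f}")
```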

FIGURE Risk scoring procedure.
TABLE 6 Assigned weight of sentences mentioning a hazard code type.
The likelihood of a hazard code type is estimated from the number of hazard-specific snippets, S_c,p, for hazard code type c and the total number of reviews, R_p, for that product (p) from Amazon.com:

Likelihood of a hazard code type = S_c,p ∕ R_p    (2)
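A minimal sketch of this likelihood computation follows; the snippet counts and review total are hypothetical values for one product, not figures from the study.

```python
# Hypothetical counts for one product p: hazard-specific snippets per hazard
# code type (S_c,p) and the product's total number of Amazon reviews (R_p).
snippet_counts = {"fall": 12, "tip-over": 4, "pinch": 1}  # illustrative S_c,p
total_reviews = 480                                       # illustrative R_p

# Equation (2): likelihood of a hazard code type = S_c,p / R_p
likelihood = {code: s / total_reviews for code, s in snippet_counts.items()}
print(likelihood)  # {'fall': 0.025, 'tip-over': 0.0083..., 'pinch': 0.0020...}
```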

FIGURE Overall study methodology: defining and evaluating the risk assessment model.
a Number of the hazard-specific snippet(s) of a product.
b Pre-calibrated ZAGG-computed risk score/level using default hazard code weights.

TABLE 3 Research studies on risk assessment for consumer products: risk metrics compared across risk assessment frameworks (CPSC,a Health Canada,b the EU guideline,c and the BSId) and our study.
c European Union. Guidelines for the management of the European Union Rapid Information System 'RAPEX'. Retrieved from: https
TABLE 7 Risk matrix chart: probability of damage during the foreseeable lifetime of the product versus severity of injury. Source: Adapted from the EU guideline.
TABLE Experts' assigned weights for each hazard code type.
TABLE 9 Count of sentences with hazard existence by hazard code type and product category.
TABLE Severity level (EU) of each hazard code type from the EU guideline and the range of computerized severity levels (s) of the hazard code type.
TABLE 11 Mapping of the computerized risk level on the risk assessment matrix. Source: Adapted from the EU guideline.
TABLE Correlations of the ZAGG scoring method with human judgment on risk scores.
TABLE A1 Amazon dataset: total reviews and unverified safety concerns in 17 product categories.
TABLE Verified safety concerns by product category.