An analysis of causative factors for road accidents using partition around medoids and hierarchical clustering techniques

Insufficient progress in the development of national highways and state highways, coupled with a lack of public awareness regarding road safety, has resulted in prevalent traffic congestion and a high rate of accidents. Understanding the dominant and contributing factors that may influence road traffic accident severity is essential. This study identified the primary causes and the most significant target‐specific causative factors for road accident severity. A modified partitioning around medoids model determined the dominant road accident features. These clustering algorithms will extract hidden information from the road accident data and generate new features for our implementation. Then, the proposed method is compared with the other state‐of‐the‐art clustering techniques with three performance metrics: the silhouette coefficient, the Davies–Bouldin index, and the Calinski–Harabasz index. This article's main contribution is analyzing six different scenarios (different angles of the problem) concerning grievous and non‐injury accidents. This analysis provides deeper insights into the problem and can assist transport authorities in Tamil Nadu, India, in deriving new rules for road traffic. The output of different scenarios is compared with hierarchical clustering, and the overall clustering of the proposed method is compared with other clustering algorithms. Finally, it is proven that the proposed method outperforms other recently developed techniques.

The impact of road accidents is far-reaching, affecting victims, families, and the economy through premature deaths, injuries, disabilities, and lost income potential. Preventing accidents is crucial, but fatalities still occur despite everyone's best efforts. Therefore, data mining techniques, especially clustering algorithms, offer a promising avenue to uncover valuable insights from massive traffic accident datasets. Data mining, also known as knowledge discovery in data, enables the extraction of patterns and essential information from vast datasets. Unlike business intelligence, which focuses on analyzing business data, data mining employs various methods and algorithms to identify relationships and patterns within the data.
In the context of road accidents, data mining can assist in predicting future accident patterns based on historical data. For this research, the authors utilized clustering analysis to make predictions about road accidents, explicitly focusing on fatal accidents. Clustering analysis is one of the techniques used to study the contributory factors of road traffic accidents (RTAs), and different clustering algorithms have been proposed in the literature for identifying these factors. The K-medoids method 1 was used to determine the critical pre-crash events at T- and four-legged junctions, which can be used to verify the safety of autonomous driving systems. The dataset consisted of 1056 junction crashes in the UK and yielded 13 T-junction clusters and 6 four-legged junction clusters. 1 The authors performed a cluster analysis of the accident-prone areas in Semarang city to find an area's vulnerability. 2 According to their findings, most accidents in Semarang occurred on weekdays. Two years of crash data from New Mexico were used to examine injury severity in intersection-related crashes. The k-means technique was used to cluster the road data, and hierarchical Bayesian random intercept models were developed to identify the contributing factors in every cluster. The findings reveal how crash-level, vehicle/driver-level, and cross-level interactions significantly affect driver injury severity and how these findings help prevent crashes. Another study examined the understanding of crash potential among teen drivers using a large dataset (roughly 88,000 respondents) of teen survey data obtained in Texas. Taxicab correspondence analysis was used to analyze the data and showed that males with provisional or unrestricted licenses are among the highest risk groups.
The authors concentrated on identifying potential factors that may be linked to varying levels of pedestrian injury severity resulting from train-pedestrian collisions (excluding suicides) at highway-rail grade crossings (HRGCs). 3 To conduct their analysis, they utilized 10-year data from the Federal Railroad Administration and employed latent class clustering (LCC) as the method of clustering analysis. Results showed that, regardless of the HRGCs' parameters, higher train speed was linked to a higher risk of severe injury. All other factors elevated pedestrian injury severity levels, with differing effects in different clusters.
Traditional statistical models in road safety research have limitations in handling complex datasets, leading researchers to adopt machine learning (ML) approaches. Clustering and classification algorithms such as K-means, support vector machines, and decision trees are commonly used for accident severity prediction. However, more comparative analysis and exploration of hierarchical clustering's potential in road safety research is needed. The related works section presents a detailed literature survey. From this survey, we found that all the proposed clustering algorithms were applied to the entire dataset and then grouped the data based on the homogeneity of the attributes. The main contribution of this article is to analyze six different scenarios (different angles of the problem) in the road accidents dataset, which provides deeper insights into the problem and can help the transport authorities in Tamil Nadu, India, derive new rules for road traffic. To achieve this, the proposed work analyzes causative factors for road accidents in Tamil Nadu using partitioning around medoids (PAM) and hierarchical clustering algorithms, and then compares the results with other state-of-the-art methods. The flow diagram for the proposed method is given in Figure 1.
The article is organized as follows: In Section 2, we discuss the related works and address the gaps identified. In Section 3, we discuss the methodologies used in this research, such as Gower distance, silhouette width, PAM clustering, and hierarchical clustering. The performance analysis of the proposed methods with others is discussed in Section 5, and finally, a conclusion is given in Section 6.

RELATED WORKS
Recently, many researchers have contributed traditional statistical model-based methods to predict accident fatality and severity. Some of the conventional statistical model-based techniques are the logit model, 4 the logic model approach, 5,6 and the ordered probit model, 7 which predict accident fatality and severity in terms of independent and dependent accident factors such as bad road conditions, weather conditions, lack of traffic indication, drunken driving, children driving, vehicle problems, driver's attitude, and so on.
FIGURE 1 The flow diagram for the proposed method.
The following provides an overview of several studies that utilize clustering and data mining techniques in road safety research. These studies investigate various aspects, such as identifying risky driving behaviors, understanding the relationship between dangerous behaviors and accidents, analyzing injury patterns, and predicting accident causes. Applying clustering and data mining methods in these studies has yielded valuable insights and contributed to developing effective road safety strategies. Lastrucci et al. 8 utilized cluster analysis to identify risky driving behaviors among adolescent drivers in Italy and their association with RTAs. This approach allowed them to identify distinct patterns of risky behaviors and their impact on accident occurrence, providing valuable insights for targeted intervention strategies. Similarly, Hassanzadeh et al. 9 investigated motorcycle riders' riding patterns and risky behaviors in a specific district in Iran using regression analytic methods. This analysis helped them understand the relationships between dangerous behaviors and other factors, contributing to a better understanding of the risk factors involved in motorcycle accidents. Fueyo et al. 10 focused on accident injury patterns, employing unsupervised clustering algorithms on crash data. By classifying seriously injured individuals into clusters, the study opened new possibilities for vehicle safety, potentially leading to improved safety features. In the survey of medical expenses and costs related to motor vehicle crashes in Puerto Rico, 11 K-means clustering played a crucial role in grouping the data, facilitating the identification of the best clustering, which maximized the distance among groups and minimized the distance within groups. This approach contributed to a better understanding of the factors influencing medical expenses and costs associated with injuries in road accidents. Moreover, the survey of data mining methods for road accident analysis 12 presented various clustering and classification methodologies, with the self-organizing map (SOM) being used to uncover multiple patterns and predict accident causes. The application of SOM led to improved analysis accuracy compared to k-means clustering, demonstrating the effectiveness of SOM in handling road accident data. In the context of road accidents in Haridwar, India, 13 Sachin Kumar et al. proposed a data mining technique that employed LCC and the k-modes clustering technique to reduce heterogeneity in the dataset. This approach helped reveal crucial facts about the accidents and paved the way for better solutions and targeted interventions. Furthermore, Kim and Yamashita 14 discussed the utility of K-means clustering in safety research and its application in analyzing spatial patterns of pedestrian-involved crashes in Honolulu. They suggested that both K-means and hierarchical clustering techniques are valuable tools in the arsenal of spatial analytic methods for road safety research. Clustering techniques are not only used in predicting road accident severity; they can also be used in several other fields such as management, arts, engineering, and medicine. 15
More recently, Sivasankaran and Balasubramanian 26 studied the patterns in road crashes in Tamil Nadu from 2009 to 2017 reported in the Road Accidents Database Management System (RADMS) to explore the injury severity levels of bicycle-vehicle crashes. Latent class clustering (LCC) models and binary logit models were combined to identify significant demographic, vehicle, and environmental factors in the crashes. Sivasankaran and Balasubramanian 27 used the same RADMS database to identify associations between pedestrian hit-and-run causes. The same team used multiple correspondence analysis (MCA) 28,29 to identify associations between various contributing factors of pedestrian crashes. Pedestrians in the 25-34 age group were associated with crashes at traffic signals where the drivers exhibited non-respect for right-of-way rules. In addition, driving violations such as driving against traffic flow and risky driving behaviors such as changing lanes without due care and dangerous overtaking were associated with pedestrian-vehicle crashes. 29 Similar studies were conducted in which the factors associated with the risky overspeeding behavior of drivers were studied using logistic regression. 30 With a majority of crash fatalities in Tamil Nadu involving motorcycles, 31 an ordered logit model was used to identify significant contributing factors in single-vehicle motorcycle fatalities. 32
In road safety research, traditional statistical model-based techniques have long been utilized to predict accident fatalities and severity. However, conventional statistical models have limitations, particularly in dealing with complex and multidimensional datasets. Most researchers have now turned to ML approaches to overcome these challenges because of their predictive superiority, efficiency, and ability to handle informative datasets. The notable works are as follows: Kwon et al. 23 used decision trees and Naive Bayes to classify road accidents with data collected between 2004 and 2010; they also compared the classification results with linear regression. Sharma et al. 24 demonstrated road accident prediction through a support vector machine and a multi-layered perceptron; they considered only two parameters, namely drunken driving and the speed of the vehicle. AlMamlook et al. 25 utilized Naive Bayes, AdaBoost, random forest, and logistic regression methods for road accident predictions. Ester et al. 26 proposed a density-based algorithm for discovering clusters in large spatial databases with noise (DBSCAN), and Ankerst et al. 27 proposed the ordering points to identify the clustering structure (OPTICS) clustering algorithm.
ML has applications in various domains, including construction, occupational accidents, agriculture, education, sentiment analysis, banking, and insurance. Data mining, ML, and deep learning algorithms have been extensively used in road accident prediction. Notable clustering and classification algorithms have been employed to build accident severity models, such as K-means, support vector machines, K-nearest neighbors, decision trees, artificial neural networks, convolutional neural networks, and logistic regression. The literature lacks a comprehensive comparative analysis of the performance of different clustering algorithms under standardized evaluation metrics. From the literature survey, we found that all the proposed clustering algorithms were applied to the entire dataset and then clustered the data based on the homogeneity of the attributes. The significant contribution of this article is to analyze six different scenarios (different angles of the problem) in the road accidents dataset, which provides deeper insights into the problem and can help the transport authorities in Tamil Nadu, India, derive new rules for road traffic. To achieve this, the proposed work analyzes causative factors for road accidents in Tamil Nadu using PAM and hierarchical clustering algorithms, and then compares the results with other state-of-the-art methods.

Dataset
This article uses the road accident data management system (RADMS) data and GIS-based software for collecting, comparing, and analyzing road accident data for testing. This database is maintained by the State Transport Planning Commission of Tamil Nadu and is the official source that offers complete information on accident circumstances (please refer: https://data.gov.in/catalog/road-accidents-india-2019). Trained police officials compile the crash data across the state with the same instruction manual. The World Health Organization has also advocated using RADMS as an ideal system for nations lacking databases that store accident data. The RADMS 2019 dataset has 48,470 data points and 34 attributes; another dataset contains 2821 data points with 32 attributes.

Accidents involving grievous injuries and vehicle damage only (non-injury) on national and state highways of Tamil Nadu state in India in 2019 have been considered. Table 1 illustrates the data spread across the various variables for the RADMS dataset.

Techniques used
Clustering algorithms and techniques play an essential role in analyzing traffic accidents. They can identify groups of road users, vehicles, environmental factors, and other such attributes, which helps in arriving at conclusions and appropriate countermeasures. 14

Gower distance
Choosing the right metric to calculate the distance between two data points is very important, especially when clustering the data. The RADMS data has both numerical and categorical attributes and is therefore mixed data. Mixed data require dedicated metrics for calculating the distance between data points. Gower distance is one such metric used on diverse data. 28 Gower distance is a dissimilarity-based distance metric computed as the mean of partial dissimilarities between data points.
In the R programming language used for this study, the daisy function has been used to compute Gower distance. To calculate the Gower distance matrix, the daisy function proceeds as follows. Each numeric variable (column) is standardized by subtracting the column minimum from each data point and dividing by the range of the corresponding attribute, which scales the data so that the range becomes [0, 1]. A partial dissimilarity measure is then computed for each variable and each pair of data points. If the variable is numeric, the measure is the absolute value of the difference divided by the range. If the variable is not numeric, the measure takes the value 1 if the data points differ and 0 if they are the same. Gower distance is the average of all these measures.
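The calculation described above can be illustrated with a minimal Python sketch. It is not the daisy implementation used in the study: it ignores missing values, ordinal variables, and variable weights, and the three accident records, their column meanings, and the function name gower_matrix are hypothetical values chosen only for illustration.

```python
from typing import List, Sequence, Union

Value = Union[int, float, str]


def gower_matrix(rows: Sequence[Sequence[Value]],
                 numeric: Sequence[bool]) -> List[List[float]]:
    """Pairwise Gower dissimilarities for rows with mixed column types."""
    n_cols = len(numeric)
    # Range (max - min) of every numeric column; None for categorical columns.
    ranges = []
    for c in range(n_cols):
        if numeric[c]:
            column = [float(r[c]) for r in rows]
            ranges.append(max(column) - min(column))
        else:
            ranges.append(None)

    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = 0.0
            for c in range(n_cols):
                if numeric[c]:
                    # Absolute difference scaled by the column range;
                    # a constant column (range 0) contributes nothing.
                    if ranges[c]:
                        total += abs(float(rows[i][c]) - float(rows[j][c])) / ranges[c]
                else:
                    # Categorical: 0 if the values match, 1 otherwise.
                    total += 0.0 if rows[i][c] == rows[j][c] else 1.0
            # Gower distance is the mean of the partial dissimilarities.
            dist[i][j] = dist[j][i] = total / n_cols
    return dist


# Hypothetical records: (speed limit, road category, light condition).
records = [
    (40, "state", "daylight"),
    (80, "national", "daylight"),
    (60, "state", "darkness"),
]
D = gower_matrix(records, numeric=[True, False, False])
```

For the first two records, the speed limits differ by the full range (partial dissimilarity 1), the road categories differ (1), and the light conditions match (0), so the sketch gives a Gower distance of 2/3.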

Silhouette width
Silhouette analysis is used to decide the optimum number of clusters. 29 The silhouette value describes how similar an object is to its own cluster compared to other clusters. The silhouette plot has the number of clusters on the x-axis and the silhouette width on the y-axis, as shown in Figure 2A,B. The higher the silhouette width, the better the clustering. Silhouette width values lie in the range of −1 to 1. A value of 1 indicates a considerable distance from the sample to its neighboring clusters. A value of 0 indicates that the sample lies on the boundary of two clusters. A negative value indicates that the sample has been assigned to the wrong cluster. The optimum number of clusters is chosen with the help of the silhouette plot and used in the PAM algorithm; Figure 2 shows this for two data subsets, namely the grievous injury subset (Figure 2A) and the no injury subset (Figure 2B).
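As an illustration of how the silhouette width behaves, the following Python sketch computes it from a precomputed dissimilarity matrix (for example, a Gower matrix). It uses the standard definition s = (b − a) / max(a, b), where a is the mean dissimilarity of a sample to the other members of its own cluster and b is the smallest mean dissimilarity to any other cluster. The four one-dimensional points and both labelings are hypothetical, and the function assumes at least two clusters.

```python
from typing import Dict, List, Sequence


def silhouette_widths(dist: Sequence[Sequence[float]],
                      labels: Sequence[int]) -> List[float]:
    """Silhouette width of every sample, given a dissimilarity matrix."""
    clusters: Dict[int, List[int]] = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    widths = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            # Convention: a singleton cluster gets a width of 0.
            widths.append(0.0)
            continue
        # a: mean dissimilarity to the other members of the same cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: mean dissimilarity to the nearest other cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


# Hypothetical one-dimensional samples forming two obvious groups.
xs = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in xs] for p in xs]

good = silhouette_widths(dist, [0, 0, 1, 1])   # natural grouping
bad = silhouette_widths(dist, [0, 1, 0, 1])    # deliberately mixed grouping
avg_good = sum(good) / len(good)
avg_bad = sum(bad) / len(bad)
```

With the natural grouping the average width is close to 1, whereas the deliberately mixed grouping yields a negative average, which matches the interpretation of the silhouette width given above. In practice, the average width would be computed for each candidate number of clusters and the value with the highest average chosen.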

Partition around medoids clustering
The partition around medoids (PAM) clustering algorithm finds objects called medoids around which clusters are built. PAM aims to minimize the average dissimilarity of data points to their closest medoid. A similarity coefficient can be used to evaluate the similarity between the various attributes. 30 If the value of the similarity coefficient is high, the similarity between the attributes is high; otherwise, the dissimilarity is greater. In this case, the dissimilarity can be estimated by using a relation, defined for i ≠ j, in which a is the number of attributes that are equally important to clusters i and j; b is the number of attributes that are required in cluster i but not in j; c is the number of attributes that are required in cluster j but not in i; d is the number of attributes that are needed for neither cluster i nor cluster j; and n is the total number of attributes. Equivalently, the sum of dissimilarities can also be minimized. 31 The algorithm has two phases: a build phase and a swap phase. The k medoids are selected during the build phase, and the clusters are improved in the swap phase by exchanging selected medoids with better replacements from the non-medoids, if any. PAM is a more robust version of K-means because it accepts a dissimilarity matrix and minimizes the sum of dissimilarities instead of the sum of squared Euclidean distances. Figure 3 shows dendrograms for the grievous injury (Figure 3A) and the no injury (Figure 3B) cases.
The main advantages of the PAM clustering algorithm over other clustering algorithms are that (i) PAM can effectively deal with noise and outliers present in the given dataset, (ii) PAM uses medoids rather than centroids to partition attributes into clusters, and (iii) PAM clusters the overall data rather than selected samples from the given dataset.
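The build and swap phases can be sketched in a few lines of Python operating on a dissimilarity matrix. This is a simplified illustration of the classical PAM idea, not the modified PAM model or the R implementation used in this study; the six one-dimensional points and the names pam and total_cost are hypothetical, and the exhaustive search is only practical for small datasets.

```python
from typing import List, Sequence, Tuple


def total_cost(dist: Sequence[Sequence[float]], medoids: Sequence[int]) -> float:
    """Sum of dissimilarities from every object to its closest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def pam(dist: Sequence[Sequence[float]], k: int) -> Tuple[List[int], List[int], float]:
    n = len(dist)

    # Build phase: greedily add the object that lowers the total cost most.
    medoids: List[int] = []
    while len(medoids) < k:
        candidates = [c for c in range(n) if c not in medoids]
        medoids.append(min(candidates, key=lambda c: total_cost(dist, medoids + [c])))

    # Swap phase: apply the best medoid/non-medoid exchange until none helps.
    cost = total_cost(dist, medoids)
    while True:
        best_swap = None
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = total_cost(dist, trial)
                if trial_cost < cost - 1e-12:
                    cost, best_swap = trial_cost, trial
        if best_swap is None:
            break
        medoids = best_swap

    # Assign every object to the position of its closest medoid.
    labels = [min(range(k), key=lambda p: dist[i][medoids[p]]) for i in range(n)]
    return medoids, labels, cost


# Hypothetical one-dimensional samples forming two groups.
xs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = [[abs(p - q) for q in xs] for p in xs]
medoids, labels, cost = pam(dist, k=2)
```

On this toy input the build phase first selects the objects at indices 2 and 4, and the swap phase then replaces index 2 with index 1, so the final medoids are the middle objects of each group. Because the function only needs a dissimilarity matrix, a Gower matrix for mixed data could be passed in directly.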

Hierarchical (divisive) clustering
Hierarchical clustering consists of two approaches, namely, agglomerative and divisive clustering algorithms. Agglomerative clustering follows a bottom-up approach, where the "n" individual data points are initially treated as "n" clusters, each a cluster on its own. It then finds similarities between them and groups them. 32 All the data points aggregate and form one final cluster in the end. The divisive clustering algorithm follows the top-down approach. The whole dataset starts as one cluster, which is divided into sub-clusters until the splits end at the individual data points. Dendrograms are an essential tool that helps decide which of the two approaches in hierarchical clustering should be chosen by gauging the amount of balance or imbalance in the graph. A balanced dendrogram indicates that the particular algorithm can cluster the data better. Figure 4 shows the number of hierarchical clusters obtained using the silhouette width for the grievous injury (Figure 4A) and the no injury (Figure 4B) cases.
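To make the bottom-up procedure concrete, the following Python sketch implements agglomerative clustering on a dissimilarity matrix. Average linkage is an assumption made here for illustration; the text does not state which linkage was used, and the divisive (top-down) variant is not shown. The six one-dimensional points and the function name agglomerative are hypothetical.

```python
from typing import List, Sequence, Tuple


def agglomerative(dist: Sequence[Sequence[float]],
                  n_clusters: int) -> Tuple[List[List[int]], List[Tuple[List[int], List[int], float]]]:
    """Bottom-up clustering with average linkage on a dissimilarity matrix."""
    # Start with every data point as its own cluster.
    clusters: List[List[int]] = [[i] for i in range(len(dist))]
    merges = []  # history of (cluster A, cluster B, linkage height)

    def linkage(a: List[int], b: List[int]) -> float:
        # Average linkage: mean dissimilarity over all cross-cluster pairs.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the pair of clusters that are closest to each other.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                height = linkage(clusters[x], clusters[y])
                if best is None or height < best[0]:
                    best = (height, x, y)
        height, x, y = best
        merges.append((clusters[x], clusters[y], height))
        merged = clusters[x] + clusters[y]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical one-dimensional samples forming two groups.
xs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = [[abs(p - q) for q in xs] for p in xs]
clusters, merges = agglomerative(dist, n_clusters=2)
```

The merge history holds the information a dendrogram displays: each entry records which two clusters were joined and at what height. Running the loop down to a single cluster would give the full tree.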

PERFORMANCE ANALYSIS OF THE PROPOSED PAM AND HIERARCHICAL CLUSTERING
Upon analyzing the data, the results can be split into six unique scenarios, four for the grievous injuries subset and two for the no injuries subset. Each scenario consists of PAM clusters and the relevant cluster from hierarchical clustering, which validates those results. Some scenarios appear only in PAM clusters and are not shown by the hierarchical clustering method. All the factors found by hierarchical clustering are common to the clusters of both algorithms, but the PAM clusters give additional details that are not described by the hierarchical clusters. This makes PAM a more robust algorithm. The total number of accidents with grievous injuries is 1643, and the number with no injuries is 1178. Figure 5 shows the number of PAM clusters obtained using the silhouette width for the grievous injury (Figure 5A) and the no injury (Figure 5B) cases.

Grievous injuries
In this section, we discuss the results in terms of the uniting and unique factors in four different scenarios, which are given below.
Hierarchical clusters validating this scenario: Cluster 1 (size = 713 accidents).
List of uniting factors: junction control, shoulder type, and road vertical characteristics.
List of unique factors: collision type, road category, traffic restriction, road narrow row, location type, landmark, collision type code, collision description code.
In PAM clustering, two collision types occurred: hit pedestrian (Cluster 1 = 30.9%) and hit from rear (Cluster 3 = 66.08%, Cluster 10 = 48.51%). Accidents involving hitting a pedestrian mostly happen on state highways (75.15%), where the roads are narrow (87.27%), with two-way traffic (81.21%), in the absence of police (95.15%), and with heavy vehicles prohibited from entering (67.88%). Specifically, these accidents happened near bus stops (35.15%) in municipality areas (80.61%), and drivers collided with pedestrians who were walking along the road (20.61%). We infer that narrow roads and pedestrians walking along the roads are the leading causes of the accidents here. Police presence to create awareness among the public about using footpaths, making traffic movement one-way on narrow roads, and initiating road widening wherever necessary can be suitable countermeasures to bring down accidents of this type.
Accidents involving collisions from the rear occur on national (Cluster 3 = 80%) and state (Cluster 10 = 78.3%) highways. There was no traffic restriction (58.26%). The national highways were near a panchayat (74.78%) area, and accidents occurred near a bridge (22.61%). Police were not present (82.61%), which leads us to suggest installing some police force, traffic rules, and signages so that entering or leaving the national highway can be smoother and without the risk of any accidents. The state highways were near a municipality (76.24%), and the accidents occurred near a school/college (24.75%). Careless driving (95.05%) was reported to describe the collision. Installing a police force to control and curb careless driving can help reduce these accidents.
In hierarchical clustering, the location was not a junction (81.48%), and if it was, there was not any control present (14.30%), or there was a give way sign (1.40%). The roads had paved shoulders (89.60%) and were flat (97.89%). All these factors are shared between PAM and hierarchical clusters, while PAM clusters further give more details as described above. Table 2 provides an in-depth analysis of Scenario 1: "Give way sign present, paved shoulder, flat roads, taking inattentive turn" for grievous injury.
In PAM clustering, three collision types took place: hit from the side (34.4%), head on (74.5%), and hit from the rear (51.15%). The accidents involving hitting from the side happened in daylight (70.4%), with speed restrictions (88.8%) present, near a traffic signal (17.6%), and the cause reported was careless driving (62.4%). It can be inferred that a sweep from the side could have taken place near the traffic signal despite the restrictions present. A suitable remedy would be to install rumble strips at regular intervals before the signal, as this can help slow vehicles and increase caution.
TABLE 2 In-depth analysis of Scenario 1 (give way sign present, paved shoulder, flat roads, taking inattentive turn) for grievous injury.

The accidents involving head-on collisions happened in darkness with street lights on (59.6%), with the entry of heavy vehicles prohibited (78.7%), and in a bazaar area (31.9%). Countermeasures for this situation include widening the roads and placing barricades so that speed will automatically be slowed down. The accidents involving hits from the rear happened in daylight (83.2%), with the entry of heavy vehicles prohibited (76.3%) and near bus stops (22.9%). In such a situation, a suitable countermeasure is having designated parking spaces and imposing fines for parking in a no parking area.
In hierarchical clustering, it was found that the road's vertical characteristics were flat (97.40%) or had a gentle incline (1.94%). The accident causes reported were injuries due to human error (67.38%), dangerous overtaking (15.98%), inattentive turn (9.07%), and driving against the flow of traffic (3.45%). Accidents mostly occurred in urban areas (88.76%) where either no junction was involved (71.49%), no control was present at the junction (22.03%), or there was a traffic signal (3.45%). All these factors are shared between PAM and hierarchical clusters, whereas PAM clusters further give more details as described above. This makes PAM a more robust algorithm. Table 3 provides an in-depth analysis of Scenario 2: "Traffic signals, darkness with street lights, paved and unpaved shoulders, roads are flat or have a gentle incline, dangerous overtaking, driving against the flow of traffic, and happening in urban areas" for grievous injury.

TABLE 3 In-depth analysis of Scenario 2 (traffic signals, darkness with street lights, paved and unpaved shoulders, roads are flat or have a gentle incline, dangerous overtaking, driving against the flow of traffic, and happening in urban areas) for grievous injury.

Hierarchical clusters validating this scenario: Cluster 3 (size = 467 accidents).
List of uniting factors: central divider, junction control, shoulder type, accident cause, contributory factor.
List of unique factors: collision type, road category, location type, traffic movement, traffic restriction, police present, footpath, landmark, collision type code, and collision description code.
Clusters resulting from PAM clustering can be divided into two cases based on the collision type: hit pedestrian (60.23%, 66.67%) in clusters 2 and 12, and head-on collision (96.7%, 72%) in clusters 5 and 9. Accidents involving hitting a pedestrian happened on state highways (67.05%, 60.5%) in panchayat areas (73.3%, 95.1%), with two-way traffic (89.2%, 98.8%) and footpaths present (76.7%, 76.54%). When these accidents took place near bus stops (39.2%), we observed that there was no restriction on traffic (59.66%), and pedestrians were walking along the road (30.68%). In contrast, when accidents occurred near a bridge (46.91%), pedestrians were crossing the road from left to right (29.63%), and there was a restriction on the entry of heavy vehicles (87.65%). We infer from our findings that there is some inconvenience for pedestrians. Hence, a crossing signal, a traffic signal at the end of the bridge, or a skywalk (before or after the bridge) for the pedestrians to cross can be some countermeasures to mitigate these accidents.
Head-on collisions occurred on national (89%) and state (76%) highways. On a national highway, they came under a municipality area (98.9%). The roads had one-way traffic (98.9%) and did not have footpaths (71.43%). Careless driving (85.71%) was reported as a description of the collision. These accidents happened near a bus stop (34.1%). We infer from this situation that drivers/riders could have been more careful. Considering the state highway accidents, the factors were more alarming. Happening mainly in panchayat areas (76.8%), the roads had two-way traffic (89.6%) with no footpath (98.4%) present and head-on collisions (56%) reported as the collision description. A commonly occurring landmark was a bus stop (35.2%). Imposing speed limits, fines for violations, widening roads, and building a bus bay wherever necessary can help reduce these accidents. Two-way traffic can be converted into one-way traffic if the roads are very narrow. Installing street lights on narrow main roads of villages can also contribute to reducing these accidents.
Hierarchical clustering found that the roads had a paved shoulder (62.31%) and the central divider was absent (91.01%). Accident sites were not junctions (69.16%); in cases where they were junctions, there was no control (27.19%) at the junction. The accident cause was reported as injured in accidents due to human error (95.29%), and the contributory factor was the fault of the driver/rider (95.93%). All these factors are shared between PAM and hierarchical clusters, whereas PAM clusters further give more details as described above. Table 4 provides an in-depth analysis of Scenario 3: "Central divider absent, no junction control, paved shoulder, non-respect of rights of way, pedestrians involved, fault of the driver, or driver of another vehicle" for grievous injury.

TABLE 4 In-depth analysis of Scenario 3 (central divider absent, no junction control, paved shoulder, non-respect of rights of way, pedestrians involved, fault of the driver, or driver of another vehicle) for grievous injury.

List of unique factors: traffic restriction, location type, collision description code.
This unique scenario is evident only in the PAM clustering results and includes two clusters with accidents involving head-on collisions (59.64%, 82.87%). These accidents happened when central dividers (95.1%, 78.7%) and footpaths (91.03%, 80.56%) were present, police were absent (91.48%, 74.54%), and mostly near bus stops (30.04%, 38.42%). In municipality areas (61.43%), there was a restriction on the entry of heavy vehicles (68.61%), but the collision took place due to rash driving (74.44%). In panchayat areas (72.68%), by contrast, there was no traffic restriction (68.05%), and the description of the collision was head-on (68.05%). Based on the whole scenario, the countermeasures can be the presence of police in municipality areas and the imposition of some traffic restrictions in panchayat areas. Table 5 provides an in-depth analysis of Scenario 4: "Central divider present, head-on collision, near bus stops" for grievous injury.

No injuries
In this section, we discuss the results for two scenarios under the no injuries category, which are given below.
List of unique factors: collision type, traffic restriction.

TABLE 5 In-depth analysis of Scenario 4 (central divider present, head-on collision, near bus stops) for grievous injury.
We infer that most of the factors do not point toward any severe causative factor, which could imply careless driving. Driver education during license issue, renewal, or vehicle registration at RTA offices, as well as on hoardings and in advertisements, is one countermeasure that could be brought into effect immediately to curb this category of accidents.
List of unique factors: collision type.
The PAM cluster has accidents involving head-on collisions (52%) with entry of heavy vehicles prohibited (52.41%).
The hierarchical cluster has accidents involving head-on collisions (42.60%) and hitting from the rear (22.23%), with no traffic restriction (44.97%). The remaining factors give the same results in both the PAM and hierarchical clustering methods, as follows: central divider was absent (PAM = 71.5%, HC = 81.26%), footpath was absent (PAM = 72.9%, HC = 85.01%), head-on collision reported as collision description code (PAM = 28.11%, HC = 21.7%), two-way traffic (PAM = 86.75%, HC = 84.81%). We infer that the absence of footpaths and dividers could be an essential factor responsible for such accidents. Suitable countermeasures include making traffic one-way on roads that see many such accidents and reducing traffic volume. Table 7 provides an in-depth analysis of Scenario 2: "Central divider absent, footpath present, head-on collision" for no injury data.
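The percentages quoted for each cluster are within-cluster shares of a factor value. A minimal sketch of that tabulation is given below. The records, the field names (`central_divider`, `traffic`), and the helper `factor_shares` are made up for illustration and are not taken from the RADMS data or from the authors' code.

```python
from collections import Counter, defaultdict

def factor_shares(records, labels, factor):
    """Percentage of each value of `factor` within each cluster."""
    counts = defaultdict(Counter)
    for record, label in zip(records, labels):
        counts[label][record[factor]] += 1
    return {
        label: {value: round(100 * n / sum(c.values()), 2) for value, n in c.items()}
        for label, c in counts.items()
    }

# Hypothetical accident records and hypothetical cluster labels.
records = [
    {"central_divider": "absent", "traffic": "two-way"},
    {"central_divider": "absent", "traffic": "two-way"},
    {"central_divider": "present", "traffic": "two-way"},
    {"central_divider": "present", "traffic": "one-way"},
]
labels = [0, 0, 0, 1]

shares = factor_shares(records, labels, "central_divider")
for label, dist in sorted(shares.items()):
    print(label, dist)
```

Running the same function on the labels from two clustering methods gives side-by-side shares of the kind reported as "PAM = ..., HC = ..." above.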

TABLE 6
In-depth analysis of Scenario-1 (central divider present, footpath present, careless driving) for no injury.

PERFORMANCE ANALYSIS OF THE PROPOSED METHOD WITH OTHER STATE-OF-THE-ART METHODS
The efficiency of a clustering algorithm can be measured by internal cluster validation metrics (ICVMs) and by time complexity. Most clustering studies report only ICVMs, because they are considered sufficient to test the performance of a clustering algorithm. Three such metrics are available in the literature: the silhouette coefficient, the Davies-Bouldin index, and the Calinski-Harabasz index. The silhouette coefficient lies between −1 and +1. It measures how similar an object is to the objects in its own cluster compared with the objects in other clusters. A higher silhouette value indicates that the object is well matched to its own cluster and poorly matched to other clusters. The Calinski-Harabasz index, or variance ratio criterion, is the ratio of between-cluster dispersion to within-cluster dispersion, each summed over all clusters. The higher the Calinski-Harabasz index, the better the clustering. The Davies-Bouldin index is an internal evaluation scheme: it validates how well the clustering has been done using quantities and features inherent to the dataset. In contrast to Calinski-Harabasz, the lower the Davies-Bouldin index, the better the clustering.
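The three metrics can be illustrated with a short, standard-library sketch. It uses the textbook definitions with Euclidean distance on made-up two-dimensional points, so it assumes numeric features. It is not the authors' evaluation code, and the function and variable names are illustrative only.

```python
from math import dist
from statistics import fmean

def group(points, labels):
    """Map each cluster label to the list of points carrying it."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    return clusters

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(fmean(c) for c in zip(*points))

def silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over all points; lies in [-1, 1]."""
    clusters = group(points, labels)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:              # convention: singleton clusters score 0
            scores.append(0.0)
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)   # mean intra-cluster distance
        b = min(fmean([dist(p, q) for q in other])          # nearest other cluster
                for key, other in clusters.items() if key != label)
        scores.append((b - a) / max(a, b))
    return fmean(scores)

def davies_bouldin(points, labels):
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid distance."""
    clusters = group(points, labels)
    cents = {k: centroid(v) for k, v in clusters.items()}
    scatter = {k: fmean([dist(p, cents[k]) for p in v]) for k, v in clusters.items()}
    return fmean([
        max((scatter[i] + scatter[j]) / dist(cents[i], cents[j])
            for j in clusters if j != i)
        for i in clusters
    ])

def calinski_harabasz(points, labels):
    """Between-cluster over within-cluster dispersion, each scaled by its degrees of freedom."""
    clusters = group(points, labels)
    n, k = len(points), len(clusters)
    overall = centroid(points)
    between = sum(len(v) * dist(centroid(v), overall) ** 2 for v in clusters.values())
    within = sum(dist(p, centroid(v)) ** 2 for v in clusters.values() for p in v)
    return (between / (k - 1)) / (within / (n - k))

# Toy data: two well-separated groups, labeled once sensibly and once badly.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = [0, 0, 0, 1, 1, 1]
bad = [0, 1, 0, 1, 0, 1]

for name, labels in (("good", good), ("bad", bad)):
    print(name, silhouette(points, labels),
          davies_bouldin(points, labels), calinski_harabasz(points, labels))
```

On this toy data, the sensible labeling has the higher silhouette coefficient and Calinski-Harabasz index and the lower Davies-Bouldin index, which matches the direction of each metric described above.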
The performance of the proposed method relative to the other methods is given in Figure 6A-C for all three metrics.
For the purpose of the performance analysis, PAM and hierarchical clustering (HC) were run on the entire dataset. In all three panels of Figure 6, the proposed PAM algorithm performs better than the other clustering algorithms.

CONCLUSIONS
Tamil Nadu, a state of India, records the highest number of accidents. Compared to other states, Tamil Nadu ranks among the top three in all types of accidents, including those involving fatal, grievous, and mild injuries. Therefore, analyzing Tamil Nadu's accident data and finding countermeasures for every scenario can help Tamil Nadu and other states understand and address in advance the problems that cause accidents. PAM is a robust, unsupervised hard-clustering algorithm. It randomly selects medoids from the dataset, calculates the distances from the data points to them (using a distance measure of our choice), computes a total cost, and recalculates the distances and cost after each medoid swap until no swap lowers the cost.
The algorithm works well with categorical variables, so we chose these from our dataset. Hierarchical clustering was also applied to the same dataset.
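The PAM steps summarized above can be sketched as follows. This is a minimal illustration of textbook PAM with a simple-matching (Hamming) distance on categorical records, under the assumption that a swap is accepted whenever it lowers the total cost. It is not the modified PAM used in this study, and the records and function names are made up.

```python
import random

def hamming(a, b):
    """Number of attributes on which two categorical records disagree."""
    return sum(x != y for x, y in zip(a, b))

def total_cost(data, medoids, distance):
    """Sum of each record's distance to its nearest medoid."""
    return sum(min(distance(row, data[m]) for m in medoids) for row in data)

def pam(data, k, distance=hamming, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(range(len(data)), k)      # random initial medoids
    cost = total_cost(data, medoids, distance)
    improved = True
    while improved:                                # stop when no swap lowers the cost
        improved = False
        for pos in range(k):
            for candidate in range(len(data)):
                if candidate in medoids:
                    continue
                # Try replacing one medoid with a non-medoid record.
                trial = medoids[:pos] + [candidate] + medoids[pos + 1:]
                trial_cost = total_cost(data, trial, distance)
                if trial_cost < cost:
                    medoids, cost, improved = trial, trial_cost, True
    # Assign every record to its nearest medoid.
    labels = [min(range(k), key=lambda j: distance(row, data[medoids[j]]))
              for row in data]
    return medoids, labels, cost

# Hypothetical records: (central divider, collision type, traffic movement).
records = [
    ("absent", "head-on", "two-way"),
    ("absent", "head-on", "two-way"),
    ("absent", "rear-end", "two-way"),
    ("present", "side-swipe", "one-way"),
    ("present", "side-swipe", "one-way"),
    ("present", "head-on", "one-way"),
]

medoids, labels, cost = pam(records, k=2)
print(medoids, labels, cost)
```

Each pass evaluates every medoid and non-medoid pair against all records, which is why the cost grows quickly with dataset size.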
Our results show 14 different clusters that fall into six scenarios for our subset of the data (accidents on national and state highways with a severity of grievous injury or vehicle damage only (non-injury)), and we have suggested suitable countermeasures for each scenario. We used hierarchical clustering (divisive approach) to validate the results of the six scenarios. In addition, PAM and hierarchical clustering were applied to the entire dataset, and the resulting values were compared with those of other state-of-the-art methods. In this performance analysis, the proposed PAM methodology performs better than the other clustering models. This article uses a novel and robust technique to contribute to solving a national issue of public interest. Our results and suggested countermeasures can help reduce the rising number of accidents and save lives and property.
One limitation of the proposed PAM clustering algorithm is that it is unsuitable for large datasets due to its high computational requirements. Therefore, our study had to be restricted to a smaller subset of the data. Another limitation is that the algorithm can produce different clusters each time it runs. We finalized our clusters after running the algorithm many times and observing a trend in the clusters and the silhouette width. We then saved them to a file for further study. This study and algorithm can also be extended to similar problems involving unsupervised clustering.
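One way to quantify the run-to-run variation described above is to compare the label vectors from repeated runs with a pair-counting agreement score. The sketch below uses the Rand index for this. That choice is an assumption made for illustration, since the study reports only that a trend in the clusters and the silhouette width was observed, and the label vectors are made up.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Share of record pairs on which two labelings agree
    (both group the pair together, or both keep it apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def mean_pairwise_agreement(runs):
    """Average Rand index over all pairs of runs."""
    scores = [rand_index(a, b) for a, b in combinations(runs, 2)]
    return sum(scores) / len(scores)

# Hypothetical label vectors from three runs on the same six records.
runs = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],   # same partition, cluster ids swapped
    [0, 0, 1, 1, 1, 1],   # one record assigned differently
]

print(rand_index(runs[0], runs[1]))
print(rand_index(runs[0], runs[2]))
print(mean_pairwise_agreement(runs))
```

Because the score depends only on which records are grouped together, it is unaffected by the arbitrary numbering of clusters from one run to the next.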

FIGURE 2
The optimum number of clusters using the silhouette plot for grievous injury and no injury. (A) Grievous injury subset. (B) No injury subset.

FIGURE 3
The dendrograms for grievous injury and no injury. (A) Grievous injury subset. (B) No injury subset.

FIGURE 4
The hierarchical clustering for grievous injury and no injury. (A) Grievous injury subset. (B) No injury subset.

FIGURE 5
The PAM clusters for grievous injury and no injury. (A) Grievous injury subset. (B) No injury subset.

2 "Traffic signals, darkness with street lights, paved and unpaved shoulders, roads are flat or have a gentle incline, dangerous overtaking, driving against the flow of traffic, and happening in urban areas."
PAM clusters in this scenario: Cluster 6 (size = 125 accidents), Cluster 8 (size = 94 accidents), and Cluster 11 (size = 131 accidents).
Hierarchical clusters validating this scenario: Cluster 2 (size = 463 accidents).
List of uniting factors: junction control, road vertical characteristics, accident cause, rural/urban.
List of unique factors: collision type, light condition, traffic restriction, landmark, collision type code, collision description code.

FIGURE 6
Performance analysis of the proposed method with other methods (A-C). (A) Davies-Bouldin index-based performance analysis of the proposed algorithms with other clustering algorithms. (B) Silhouette coefficient-based performance analysis of the proposed clustering algorithms with other clustering algorithms. (C) Calinski-Harabasz index-based performance analysis of the proposed algorithms with other clustering algorithms.
Sample data for the road accident data management system (RADMS).
TABLE 1
In-depth analysis of Scenario-2 (central divider absent, footpath present, head-on collision) for no injury.