GitRanking: A Ranking of GitHub Topics for Software Classification using Active Sampling

GitHub is the world's largest host of source code, with more than 150M repositories. However, most of these repositories are not labeled, or are labeled inadequately, making it harder for users to find relevant projects. Various approaches to software application domain classification have been proposed over the past years. However, these approaches lack a well-defined taxonomy that is hierarchical, grounded in a knowledge base, and free of irrelevant terms. This work proposes GitRanking, a framework for creating a classification ranked into discrete levels based on how general or specific the terms' meaning is. We collected 121K topics from GitHub and considered the most frequent ones, covering $60\%$ of the overall topic frequency, for the ranking. GitRanking 1) uses active sampling to keep the number of required annotations minimal; and 2) links each topic to Wikidata, reducing ambiguity and improving the reusability of the taxonomy. Our results show that developers, when annotating their projects, avoid using terms with a high degree of specificity. This makes their projects harder for other users to find and discover. Furthermore, we show that GitRanking can effectively rank terms according to how general or specific their meaning is. This ranking is an essential asset for developers to build upon, allowing them to complement their annotations with more precise topics. Finally, we show that GitRanking is a dynamically extensible method: it can accept further terms to be ranked with a minimal number of annotations ($\sim$15). This paper is the first collective attempt to build a ground-up taxonomy of software domains.


INTRODUCTION
GitHub is the world's largest host of source code, with more than 150M repositories in 2021; moreover, the number of repositories increased by more than 60M in the previous year.1 However, these repositories are not easy to find: while GitHub allows developers to annotate their projects manually, and other users to search software via Topics,2 not all projects make use of this feature, or they use it inefficiently, annotating with just one or two topics. Additionally, developers are free to annotate a project with any string they want: this inevitably generates a very large number of specific, unrepresentative labels.

1 https://octoverse.github.com/
Various works in the literature have attempted to automatically classify the domains of software applications, many proposing their own datasets with custom taxonomies [26,30,37], and, more recently, using a subset of GitHub Topics [5,39]. However, these resources suffer from various recurring problems, the antipatterns of software classifications [22]. First, no current taxonomy explicitly defines a hierarchical relation among its labels, which is problematic when dealing with single-label annotation and 'IS-A' relationships among labels (mixed granularity issue). A second issue is the mix of different taxonomies in the same categorization (mixed taxonomies issue), for example, when labels from application domains (e.g., 'Security') are present alongside programming languages (e.g., 'Python'). This is an issue when performing single-task classification as opposed to multi-task classification [2], which would not be possible given the lack of separation between the taxonomies. Furthermore, these categorizations are not complete, as they do not cover the entire spectrum of software categories (e.g., having 'Compiler' but not 'Interpreter'), making it easier for a model to distinguish some classes. The incompleteness is aggravated by the fact that no work is grounded in a knowledge base (KB): this is highly problematic because it does not resolve the ambiguity of an arbitrarily defined (i.e., top-down) categorization, reducing its usability and the possibility of adding new terms to it (ambiguity issue). All of the issues above make these categorizations less valuable in a real-world scenario.
As an alternative to the pre-defined taxonomies presented by previous works in software application domain classification, the Natural Language Processing field offers works focusing on taxonomy construction from data [24,35]. While some solutions focus on creating a taxonomy for the Computer Science domain using papers from the bibliography service DBLP [25,36], our attempts at reproducing their results failed. Furthermore, current solutions are not deterministic, requiring multiple runs and a fortunate seed to obtain a good starting point for annotators to work on. These and other solutions also require a large amount of data to create the taxonomy [33]; however, this is not always available for GitHub Topics.
In this paper, to solve the issues outlined above, we present GitRanking, a framework for ranking software application domains. In contrast to previous work, we defined a pipeline for selecting the topics, using 121K GitHub Topics as our initial seed. GitRanking uses an active sampling method combined with a Bayesian inference algorithm to create the ranking of the topics. Furthermore, to reduce the intrinsic ambiguity of natural language and make the taxonomy more usable, each term is linked to its Wikidata entity. One key feature of GitRanking is the ability to easily expand the ranked taxonomy with a minimal number of examples ($\sim$15 for each new topic added).
Furthermore, GitRanking allowed us to extract insights regarding the usage of GitHub Topics by practitioners, and to answer the following research question: RQ: Are the topics used to annotate GitHub projects evenly distributed across the levels of a taxonomy?
The ambition of these results is to help developers annotate their projects better, making them easier to find and more discoverable. This improved discoverability will also help other developers, as it will be easier and faster to find the best library for a specific task, improving the reusability of software.
In summary, our contributions are:
- An online framework for creating better software categorizations and expanding them;
- A list of 301 application domains extracted from GitHub Topics and disambiguated by linking them to Wikidata;
- A ranking of the 301 topics into discrete levels based on their meaning;
- An answer to our RQ based on the ranking.
We made our code3 and data4 available. This paper is articulated as follows: in Section 2, we analyze past works in terms of existing taxonomies and the approaches used to extract them from data. In Section 3, we present the ASAP (Active SAmpling for Pairwise comparisons) algorithm, which we used to reduce the annotations required for the ranking, and the TrueSkill ranking system, which creates a ranking of the annotated topics. Section 4 describes the pipeline of GitRanking and its activities. Section 5 presents the results of the work performed by the annotators and the TrueSkill output; we discuss these findings in Section 6. We analyze the threats to validity that we encountered in Section 7. We present the conclusion and future works in Section 8.

RELATED WORK
In this section, we present the relevant related work. We address works on the software classification problem, in particular application domain classification, and works focusing on taxonomy construction or induction.

Software Classification Taxonomies
There have been various attempts in the literature at software classification, from application domains [5,15], to bugs [18], and vulnerabilities [21]. In this paper, we focus on works performing software application domain classification.
One of the initial works on software classification is MUDABlue [11]. The authors propose a dataset of 41 projects written in C and divided into six categories. They also present a model based on information retrieval techniques, specifically Latent Semantic Analysis (LSA), to classify software based on its source code identifiers.
Tian et al. proposed LACT [28]. As with MUDABlue, the authors propose both a new dataset and a new classification approach. The dataset consists of 43 examples divided into 6 SourceForge categories. The list of projects is available in their paper. Their classification model combines Latent Dirichlet Allocation (LDA), a generative probabilistic model that retrieves topics from textual datasets, with heuristics. They use the identifiers and comments in the source code as input to their model.
Again, in [15], the authors propose a new dataset using SourceForge as seed. The dataset consists of words extracted from API packages, classes, and method names using naming conventions. However, the dataset, containing 3,286 Java projects annotated into 22 categories, is no longer available. Following the example of [29], the authors use information gain to select the best attributes as input to different machine learning methods.
LeClair et al. [14] propose a dataset of C/C++ projects from the Debian package repository. The dataset consists of 9,804 software projects divided into 75 categories: many of these categories have only a few examples, and 19 are the same categories with different surface forms, more specifically 'contrib/X', where X is a category already present in the list. For the classification, they use a neural network approach: the project name, function names, and function content are fed to a C-LSTM [38], a combined convolutional and recurrent neural network model.
In [30], the authors propose an approach to generate tag clouds starting from bytecode, external dependencies of projects, and information extracted from Stack Overflow. Unfortunately, their dataset is not available. Sharma et al. [26] release a list of 10,000 examples annotated by their model into 22 categories, evaluated using 400 manually annotated projects. It is interesting to notice that half of the projects end up in the 'Other' category, which means they are not helpful when training a new model. For the classification, they use a combined solution of topic modeling and genetic algorithms called LDA-GA [19]. The authors apply LDA topic modeling to the README files and optimize the genetic algorithms' hyperparameters. While LDA is an unsupervised solution, humans are needed to annotate the topics from the identified keywords.
In ClassifyHub [27], the authors use the InformatiCup 2017 dataset,5 which contains 221 projects unevenly divided into seven categories. For the classification, they propose an ensemble of 8 naïve classifiers, each using different features (e.g., file extensions, README, GitHub metadata, and more).
In [37], the authors release two datasets spanning two domains: an artificial intelligence taxonomy with 1,600 examples and a bioinformatics one with 876 projects. The datasets have been annotated according to a hierarchical classification that is given as input, with keywords for each leaf node. Furthermore, they propose HiGitClass, an approach that models the co-occurrence of multimodal signals in a repository (e.g., user, repository name, README, and more) to perform the classification.
Focusing on unsupervised approaches, we find CLAN [16], which provides a way to detect similar apps based on the idea that similar apps share some semantic anchors. The authors also proposed a dataset (no longer available) in previous work. Given a set of applications, they create two term-document matrices: one for structural information using the packages and API calls, and the other for textual information using the classes and API calls. Both matrices are reduced using LSA, and then the similarity across all applications is computed. Lastly, the authors combine the similarities from the packages and classes by summing the entries. In [31], the authors propose CLANdroid, an adaptation of CLAN to the Android apps domain, and evaluate the solution on 14,450 Android apps. Unfortunately, their dataset is not available.
Another unsupervised approach was adopted by LASCAD [1]. Unlike other unsupervised methods, however, the authors proposed an annotated dataset consisting of 103 projects divided into six categories (from GitHub Collections) covering 16 programming languages (although many languages have only 1 example), and an unlabeled dataset which is not available. Their approach is a language-agnostic classification and similarity tool. As in LACT, the authors used LDA over the source code and further applied hierarchical clustering with cosine similarity to the output topic-term matrix of LDA to merge similar topics.
More recent works use GitHub as the source of their classification. Di Sipio et al. [5] released a dataset for multi-label classification annotated with 120 popular topics from GitHub. The dataset contains around 10,000 annotated projects in different programming languages. For the classification, their approach uses the content of the README files and source code, represented using TF-IDF, as input to a probabilistic model, a Multinomial Naïve Bayesian Network, to recommend new possible topics for a project.
Similarly to [5], Repologue [9] proposes a dataset based on popular GitHub Topics; however, the dataset is unavailable. For the classification, the authors also adopt a multimodal approach: they feed the concatenated project names, descriptions, READMEs, wiki pages, and file names to BERT [4], a neural language model, and pass the resulting dense vector representations (i.e., embeddings) to a fully connected neural network.
GHTRec [39] has been proposed to recommend personalized trending repositories, i.e., a list of most-starred repositories, relying on the BERT language model (LM) and GitHub Topics. Given a repository, the system predicts the list of topics using the preprocessed README content. Afterward, GHTRec infers the user's topic preferences from historical data, i.e., commits. The tool eventually suggests the most similar trending repositories by computing similarity on the topic vectors, i.e., cosine similarity and shared similarity between the developer and a trending repository. They use the dataset of [5].

Automatic Taxonomy Construction
Automatic taxonomy construction, or induction, is a challenging task in the field of natural language processing, as it requires models to understand the hypernymy relation. Hypernymy, or the 'IS-A' relation, is a lexical-semantic relation in natural languages that associates general terms with their instances or subtypes.
With the large amount of Web data available, many taxonomies are constructed from human-curated resources such as Wikidata. However, even these huge taxonomies may lack domain-specific knowledge. Therefore, many automatic approaches to construct domain-specific taxonomies have been proposed, from hypernymy discovery and lexical entailment approaches [35], to instance-based taxonomy methods [3,24], and clustering-based taxonomy methods [25,36].
An example of an approach focusing on the hypernymy discovery task is [35]; the authors propose a distributional approach that fixes some of the issues present in such methods, achieving performance comparable to simple, pattern-based methods.
While shifting a bit from hypernymy discovery methods, [24] and [3] make use of pattern matching to create their datasets. In [3], the authors use a pre-trained language model and distantly annotated data collected by scraping the web. They fine-tune BERT to learn a hypernymy relation between words. In [24], the authors use a dataset built with pattern matching to construct a noisy graph of hypernymy and train a Graph Neural Network [23] using a set of taxonomies for some known domains. The learned model is then used to generate a taxonomy for a new, unknown domain given a set of terms for that domain.
Examples of clustering-based approaches include TaxoGen [36], NetTaxo [25], and Corel [8]. These works are similar in nature; all focus on creating a taxonomy from DBLP's bibliography, making them relevant to our research. However, our attempts at reproducing their results failed,6,7 and their results are not publicly available, except for the small samples included in the papers. Their approaches are based on learning semantic vectors (embeddings) for the words of interest. They perform an iterative sequence of learning embeddings and clustering: for each cluster, the steps are repeated to create each time a better, more discriminative representation. However, this requires a large quantity of data, which is hard to collect [33]; moreover, for each newly added term, a new run of the algorithms is required, and these are heavily demanding in terms of computation and time.
A more comprehensive study of the taxonomy construction research area is presented in [33].

BACKGROUND
Modeling subjective characteristics of items, e.g., the quality of an image or user preferences, requires subjective assessment and preference aggregation techniques to combine the human annotations. Usually, these consist of either rating a set of items based on some criteria or creating a ranking of a subset of the overall items. While ranking is better suited for crowd-sourcing scenarios, as it is less complex for the annotators than rating [34], it requires the inference of latent scores representing the positions of the items in the rank, which involves comparing sampled pairs. The ranking task is defined as a comparison of $n$ items that are evaluated using subjective features without ground-truth scores. The most straightforward experimental protocol is to compare all pairs, referred to as pairwise comparison; however, this requires too many evaluations, more precisely $\binom{n}{2} = n(n-1)/2$. Nonetheless, active sampling can be used to select the most informative pairs to compare, reducing the total number of comparisons while maintaining good results.

Active Sampling for Pairwise Comparisons
Active SAmpling for Pairwise comparisons (ASAP) [17] is a state-of-the-art active sampling algorithm based on information gain that finds the best pairs to compare in ranking experiments.
Previous active sampling solutions reduce the computational complexity by taking a suboptimal approach: they only update the posterior distributions of the pairs selected for the subsequent comparison, which might not converge to the best optimum. Instead, ASAP reduces the overhead by using approximate message passing: it computes the information gain only for the most informative pairs while updating the posterior distribution of all pairs, making it both efficient and correct.
ASAP consists of two steps: (i) computing the posterior distribution of the score variables $s$ using the pairwise comparisons collected so far; (ii) using the posterior of $s$ to estimate the next best comparisons to be performed.
We use ASAP in this paper to support the work of the annotators, since the algorithm minimizes the number of comparisons needed to obtain a full ranking.
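ASAP's actual criterion is the expected information gain computed over the full posterior; as a rough illustration of the general idea of active pair selection, the toy heuristic below prefers pairs whose current score estimates are close together and still uncertain. This is a hypothetical scoring rule for intuition only, not the ASAP algorithm:

```python
import itertools

def select_next_pair(means, stds, compared):
    """Toy active-sampling heuristic (NOT the actual ASAP criterion):
    prefer pairs whose current score estimates are close together
    (hard to order) and still uncertain (high combined std)."""
    best_pair, best_score = None, float("-inf")
    for i, j in itertools.combinations(range(len(means)), 2):
        if (i, j) in compared:
            continue
        closeness = -abs(means[i] - means[j])   # closer means -> more informative
        uncertainty = stds[i] + stds[j]         # higher uncertainty -> more informative
        score = closeness + uncertainty
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair

# Items 0 and 2 have nearly equal means and high uncertainty,
# so the heuristic picks them first.
pair = select_next_pair([0.0, 5.0, 0.1], [1.0, 0.2, 1.0], compared=set())
print(pair)  # (0, 2)
```

The comparison already made is excluded, so repeated calls walk through the pairs in decreasing order of (heuristic) informativeness.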

Ranking Algorithm
ASAP uses TrueSkill [7] for the ranking of the annotated pairs. TrueSkill is a ranking system for calculating players' relative skills in zero-sum games. TrueSkill is similar to Elo [6], one of the first algorithms developed for ranking in two-player games. Elo models the probability of the possible game outcomes as a function of the two players' skills, each represented as a single scalar. However, unlike Elo, TrueSkill uses Bayesian inference to evaluate a player's skill. A player's skill is therefore defined by a normal distribution $\mathcal{N}(\mu, \sigma^2)$, where the mean $\mu$ is the perceived skill, and the variance $\sigma^2$ represents how uncertain the system is about the player's skill value. As such, $\mathcal{N}(x)$ can be interpreted as the probability that the player's "true" skill is $x$.
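For intuition on what TrueSkill generalizes, Elo's scalar update can be sketched in a few lines: the winner takes points in proportion to how unexpected the win was under the standard logistic expected-score formula (the K-factor of 32 is a conventional but arbitrary choice):

```python
def elo_update(r_winner, r_loser, k=32):
    """One Elo update after a game: each player's skill is a single
    scalar, and the winner gains points proportional to how
    unexpected the win was."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta

# An upset (lower-rated player wins) moves ratings more than an expected win.
print(elo_update(1400, 1600))  # upset: large rating change
print(elo_update(1600, 1400))  # expected result: small rating change
```

TrueSkill replaces the single scalar with the $(\mu, \sigma)$ pair above, so that an uncertain rating moves quickly at first and stabilizes as evidence accumulates.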
Given its nature as an online game ranking system, TrueSkill supports the addition of new players, or terms in our case, without needing to recompute pre-existing players' scores. Moreover, for pairwise comparisons, TrueSkill should be able to place a newly added element with around 12 comparisons.8 We validate this for our case in the next section.

PROPOSED APPROACH
GitRanking is our proposed approach for generating a hierarchical taxonomy of software application domains. It is a bottom-up ranking based on GitHub Topics and grounded in Wikidata. It aims to solve some of the issues present in current datasets for software classification, including mixed taxonomies, mixed granularity, and ambiguity.
In this section, we present the pipeline used to create the ranking of the GitHub Topics.The pipeline is visually represented in Figure 1, and its activities are described in more detail below.

Topic Collection -Scraping
We collected the GitHub Topics following the approach used in [10]. We scraped repositories containing (1) at least one GitHub Topic, (2) a README file, (3) a description, and (4) at least ten stars. In this way, we retrieved 135K projects, with a total of 121K different topics and a combined frequency of 1 million. We have made the list and metadata of the scraped projects available. Our data follows the format of GitHub's REST API.9
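The four inclusion criteria can be expressed as a simple predicate over repository metadata in the shape of GitHub's REST API JSON (the `topics`, `description`, and `stargazers_count` fields are real API fields; `has_readme` is our own illustrative flag, since the API reports READMEs via a separate endpoint):

```python
def keep_repository(repo: dict) -> bool:
    """Apply the four scraping criteria: >= 1 topic, a README,
    a non-empty description, and >= 10 stars."""
    return (
        len(repo.get("topics", [])) >= 1
        and repo.get("has_readme", False)      # our flag, not a GitHub API field
        and bool(repo.get("description"))
        and repo.get("stargazers_count", 0) >= 10
    )

repo = {
    "topics": ["deep-learning"],
    "has_readme": True,
    "description": "A toy project",
    "stargazers_count": 42,
}
print(keep_repository(repo))                             # True
print(keep_repository({**repo, "stargazers_count": 3}))  # False: too few stars
```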
The distribution of the frequency and usage of the topics is highly skewed, as depicted in Figure 2, with around 50% of projects annotated with only one topic. Fewer than 2,500 topics account for 50% of the usage distribution.

Topic Filtering
In this activity, we reduced the number of topics through filtering and manual annotation. Given the large variety of terms, an automatic approach based on the semantics of the topics would be preferable. However, given the absence of a precise context and the ambiguity of the task and terms, such an approach is not optimal and would produce poor results. Therefore, we opted for manually annotating a subset of the topics.
For the annotation, we selected the 3,000 most frequent topics. Their frequency covers 60% of the topics scraped in the previous step; each of these topics has close to 50 examples (with a minimum of 44), meaning that there is enough data to train a machine learning model.
With the help of three annotators, we assigned a binary label (0, 1) to each of the 3,000 topics. The annotators were instructed to positively label (i.e., 1) the GitHub Topics that can be considered general or specific application domains of software (e.g., 'deep learning', 'command line interface', etc.). Conversely, for programming languages, companies, technologies, and any other case, the annotators were instructed to assign a null label (i.e., 0).
As the final step of this activity, the selection of the resulting topics was carried out using majority voting (i.e., at least two annotators agreeing that a topic is an application domain); this was done to ensure higher quality in the creation of the initial taxonomy and to remove noise, while still allowing for good recall. This activity yielded an overall 368 topics that can be considered application domains, which were carried forward to the linking activity below.
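The majority-voting step can be sketched as a small function over the three annotators' binary labels:

```python
def majority_vote(labels_per_annotator):
    """Keep a topic (by index) if at least two of the three
    annotators labeled it as an application domain (label 1)."""
    return [i for i, votes in enumerate(zip(*labels_per_annotator))
            if sum(votes) >= 2]

# Three annotators, four topics: topics 0 and 3 get >= 2 positive votes.
labels = [
    [1, 0, 0, 1],  # annotator A
    [1, 1, 0, 1],  # annotator B
    [0, 0, 0, 1],  # annotator C
]
print(majority_vote(labels))  # [0, 3]
```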

Topic Linking
The topics resulting from the filtering and manual annotation are intrinsically ambiguous (e.g., 'rna-seq', or 'ci'); to help disambiguate these terms, we linked each topic to Wikidata [32], the Wikimedia Foundation's knowledge base. The linking is performed in a semi-automatic fashion, using the Wikidata reconciliation API and humans to check and fix any errors.
Wikidata offers a reconciliation API: a service that, given a text representing a name or label for an entity, and optionally additional information to narrow down and refine the search, returns a ranked list of potential entities matching the input text (e.g., for 'rna-seq' it returns 'RNA sequencing', with Wikidata ID Q2542347,10 and for 'ci' it returns the 'Continuous Integration' entity with ID Q965769). The reconciliation uses fuzzy matching to find the most likely entity in the knowledge base that matches the input string. Hence, the candidate text does not have to match an entity's name perfectly, meaning that we can go from ambiguous text names to precisely identified entities in a knowledge base.
The topics resulting from the previous activity were fed as input to the Wikidata API: to increase the retrieval precision, we exploit the github-topic Wikidata property, with ID P9100,11 which helps in the linking of terms that are already linked in Wikidata (e.g., the entity 'Convolutional Neural Network' has an entry with property P9100, where the value is 'convolutional-neural-network').
This reconciliation activity gives us a list of 10 candidates for each term. Each candidate has a list of types describing it (e.g., for 'Science', there are various candidates with the same name but different types; some include 'academic discipline', which links to the correct entity Q336, while another links to a 'television channel' with ID Q845056). To make the linking process more automated, we manually annotated the types that are highly irrelevant to the task (e.g., human, television channel, or any location) and excluded candidates belonging to these types. After filtering out the candidates of an irrelevant type, we link the term by picking the first candidate in the filtered list. Lastly, we check the correctness of the linking and fix improperly linked topics. This resulted in the correction of 25 topics out of the 368. With the topics disambiguated, we can use the unique Wikidata IDs to reduce duplicates, as some topics are just different surface forms, or aliases, of the same entity. The number of unique topics remaining is 301.
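The candidate-selection step above can be sketched as a pure function over the reconciliation API's ranked candidate list. The dictionary layout below is an assumption for illustration, not the exact API response format, and the blacklist holds only the example types mentioned above:

```python
# Manually blacklisted, task-irrelevant types (illustrative subset).
IRRELEVANT_TYPES = {"human", "television channel", "location"}

def link_topic(candidates):
    """Pick the first reconciliation candidate whose types do not
    include any blacklisted, task-irrelevant type."""
    for cand in candidates:  # candidates arrive ranked by match score
        if not IRRELEVANT_TYPES & set(cand["types"]):
            return cand["id"]
    return None  # no acceptable candidate: needs manual linking

# For 'science', the 'television channel' candidate is skipped
# and the 'academic discipline' one (Q336) is linked.
candidates = [
    {"id": "Q845056", "types": ["television channel"]},
    {"id": "Q336", "types": ["academic discipline"]},
]
print(link_topic(candidates))  # Q336
```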

Annotation
In this activity, the resulting topics (filtered by the annotators and linked via the Wikidata API) were presented in pairs in a web application for manual annotation. The 8 annotators working on this activity were presented with two topics from the list and were instructed to pick the more general term, considering their respective domains. A mock-up of the interface is illustrated in Figure 3. The terms were also linked to the URLs of their Wikidata pages, in case the annotators were not confident about a specific topic. In case of doubt, the annotators could 'skip' the pair.
A 'Tie' option was also available, in case the annotators believed that the two terms represent domains of the same level. This option was also used to collect data to validate our results: since ASAP does not support ties, we instructed the annotators to use it rarely. Instead, annotators were instructed to pick one of the options randomly, as having one term before the other in the continuous ranking does not affect the final discrete rank.

[Figure 3: mock-up of the annotation interface, showing two topics (e.g., 'Science' and 'Computer Vision'), the question "Which term is more general?", and the SKIP, SEND, and Tie buttons.]

The ASAP algorithm was used to assist the work of the annotators: its main advantage is reducing the number of pairwise annotations required while still achieving good performance. The ASAP paper [17] shows how reducing the number of comparisons affects performance on an example with 200 variables. In their case, reducing the number of comparisons by a factor of 3 shows a good balance in terms of performance, resulting in a total of $\frac{1}{3}\binom{n}{2}$ annotations needed. The algorithm identifies the topic pairs, using the previous annotations to find the best, most informative new pairs to add to the list of annotations. The annotators are then presented with a random pair from this pool. The ASAP algorithm is very memory-intensive; we used an AWS instance with 64 GB of RAM and a 16-core CPU for our experiments.
In terms of annotation effort, given the 301 terms we collected and an estimated average of 25 seconds per pair, we expected the overall time required to be $\frac{1}{3}\binom{301}{2} \times 25 / 3600 \approx 104$ hours. Given our pool of 8 annotators, this translates to 13 hours per annotator. However, we expected fewer samples to be required in our case, as the task is less challenging than modeling more abstract values like a player's skill in an online game.
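The back-of-the-envelope estimate above can be reproduced directly:

```python
from math import comb

n_topics = 301
seconds_per_pair = 25
reduction_factor = 3   # ASAP's factor-3 reduction in comparisons
n_annotators = 8

pairs_needed = comb(n_topics, 2) // reduction_factor
hours_total = pairs_needed * seconds_per_pair / 3600
print(pairs_needed)               # 15050 annotations
print(int(hours_total))           # 104 hours in total (truncated)
print(int(hours_total / n_annotators))  # 13 hours per annotator
```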

Ranking
The final ranking of the topics is computed by ASAP using TrueSkill. It uses the comparison matrix created from the annotations in the previous step, excluding the pairs marked with 'Tie'. The comparison matrix is a square matrix where every entry is the number of times the term in the corresponding row was selected over the term in the corresponding column. This results in a mean and a standard deviation value for each topic; by sorting by the mean, we obtain the ranking.
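Building the comparison matrix is straightforward; as a simplified stand-in for TrueSkill's inferred means (which the actual pipeline uses), the sketch below orders topics by their win rate:

```python
def comparison_matrix(annotations, n):
    """M[i][j] = number of times topic i was chosen as more
    general than topic j (ties are excluded upstream)."""
    m = [[0] * n for _ in range(n)]
    for winner, loser in annotations:
        m[winner][loser] += 1
    return m

def rank_by_winrate(matrix):
    """Simplified stand-in for TrueSkill: order topics by the
    fraction of comparisons they won."""
    def winrate(i):
        wins = sum(matrix[i])
        losses = sum(row[i] for row in matrix)
        total = wins + losses
        return wins / total if total else 0.0
    return sorted(range(len(matrix)), key=winrate, reverse=True)

# Topic 0 ('Science') wins every comparison; topic 2 loses every one.
pairs = [(0, 1), (0, 2), (1, 2), (0, 1)]
m = comparison_matrix(pairs, 3)
print(rank_by_winrate(m))  # [0, 1, 2]
```

Unlike this win-rate sketch, TrueSkill also yields a per-topic standard deviation, which the later clustering and extension steps rely on.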

Clustering
Lastly, the final step of our pipeline creates a discrete rank of the topics. Using the ranking produced by ASAP, we feed the topics' mean scores computed by TrueSkill as input to a clustering algorithm, KMeans. To find the optimal number of clusters, we used the Elbow method. The clustering is performed on the uni-dimensional data of the topics' mean scores.
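A minimal 1-D Lloyd's iteration illustrates this discretization step; the actual pipeline uses KMeans with the Elbow method to pick the number of clusters, whereas this sketch fixes k for brevity:

```python
def kmeans_1d(values, k, iters=50):
    """Minimal 1-D Lloyd's k-means over the topics' mean scores:
    assign each score to the nearest center, then move each center
    to the mean of its cluster."""
    centers = sorted(values)[:: max(1, len(values) // k)][:k]  # spread initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[idx].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Two clearly separated groups of mean scores -> two discrete levels.
values = [0.1, 0.2, 0.15, 5.0, 5.1, 4.9]
centers, clusters = kmeans_1d(values, k=2)
print([sorted(c) for c in clusters])  # [[0.1, 0.15, 0.2], [4.9, 5.0, 5.1]]
```

Each resulting cluster corresponds to one discrete level of the ranking.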
Having a discrete set of ranks, instead of a continuous one, opens many possibilities: performing analyses to study developer behaviour, and using the discrete values to train models that predict topics at specific levels, aiding the annotation of repositories.

Ranking New Topics
We evaluated the number of annotations required to rank newly added topics. The experiment uses the annotations collected from the experiments with the annotators. We simulate the addition of a new topic by removing its annotations and incrementally adding them back, one by one, checking for convergence. We measure the average number of annotations required to reach convergence by individually removing each of the 301 topics.
We compare three different strategies for simulating the newly inserted topic: random, order, and informed. The random strategy samples annotations of the topic randomly; order selects the pairs in the order they were annotated and suggested by ASAP; informed uses only the last 20 pairs suggested by ASAP, making it a more efficient, though not optimal, way to simulate the new annotations.
Convergence is defined as being in the proximity of the final position that the topic holds in the ranking computed with all the annotations at our disposal. We use a maximum difference of 3 positions for proximity, and convergence is reached when the proximity is held for 2 consecutive annotations.
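The convergence criterion can be stated as a small predicate over the topic's position history (function and variable names are ours, for illustration):

```python
def converged(position_history, final_position, window=2, tolerance=3):
    """Convergence: the topic stays within `tolerance` positions of
    its final rank for `window` consecutive annotations."""
    if len(position_history) < window:
        return False
    return all(abs(p - final_position) <= tolerance
               for p in position_history[-window:])

# Positions held after each added annotation; the final rank is 10.
history = [40, 25, 14, 12, 11]
print(converged(history, final_position=10))        # True: last two within 3
print(converged([40, 25, 14], final_position=10))   # False: still too far
```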

RESULTS
In this section, we present and discuss the results of our approach. We present the statistics of the filtering and annotation activities, followed by the ranking and samples from it. Lastly, we show the results of adding a new term to the ranking.

Filtering
Starting from the 3,000 topics, covering 60% of the total topic frequency distribution, the final list selected by the annotators contains 368 topics.
Table 1 shows the number of positively labelled topics from each annotator, with their positive rate. The three annotators had different ideas of what an application domain is: Annotator C was stricter, picking a minimal number of terms; A was more relaxed, picking many; and B was in between. Furthermore, we measured the inter-rater reliability using Krippendorff's alpha [12], a general, reliable measure of inter-rater reliability [13] suited for any number of annotators. For the 3,000 topics and our three annotators, we obtained an agreement score of 0.68. Measured per pair of annotators, the agreement is 0.79 for pair A-B, 0.68 for B-C, and 0.52 for A-C.
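Krippendorff's alpha for nominal labels can be computed from coincidence counts; a minimal stdlib implementation for complete data (no missing values), applicable to our three annotators' binary labels, is sketched below:

```python
from collections import Counter

def krippendorff_alpha_nominal(data):
    """Krippendorff's alpha for nominal data with no missing values.
    `data` is a list of raters, each a list of labels per item."""
    n_raters, n_items = len(data), len(data[0])
    units = [[rater[i] for rater in data] for i in range(n_items)]
    n = n_raters * n_items  # total pairable values

    # Observed disagreement: mismatching label pairs within each unit.
    d_o = 0.0
    for unit in units:
        counts = Counter(unit)
        pairs = sum(c1 * c2 for v1, c1 in counts.items()
                    for v2, c2 in counts.items() if v1 != v2)
        d_o += pairs / (n_raters - 1)
    d_o /= n

    # Expected disagreement from the overall label distribution.
    totals = Counter(v for unit in units for v in unit)
    d_e = sum(c1 * c2 for v1, c1 in totals.items()
              for v2, c2 in totals.items() if v1 != v2) / (n * (n - 1))
    return 1.0 - d_o / d_e if d_e else 1.0

# Perfect agreement among three raters yields alpha = 1.0.
ratings = [[1, 0, 1, 0], [1, 0, 1, 0], [1, 0, 1, 0]]
print(krippendorff_alpha_nominal(ratings))  # 1.0
```

Note that alpha can be negative when disagreement exceeds chance, which is why scores like our pairwise 0.52 are read against the 0-to-1 scale rather than as percentages.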
The low number of GitHub Topics that qualify as application domains suggests that using popularity as the seed for a taxonomy, as previous works do, results in low-quality labels.

Annotation
From the pairwise comparison annotation, we collected 5,281 annotated pairs from 8 annotators, including Professors, PostDocs, and PhD students in Computer Science with a mix of Software Engineering and Machine Learning backgrounds. The statistics about the annotators' contributions to the process are presented in Table 2.
Table 2: List of the annotators' IDs and their contribution to the total annotations with the number of ties assigned.
As we predicted, we were able to converge to a stable rank in fewer than the $\frac{1}{3}\binom{301}{2} = 15{,}050$ annotations that we would have expected based on the case study in the ASAP paper. From Figure 4, which shows the average change in position at incremental amounts of annotation, we can see that the average change in positions in the ranking converged at around $\frac{1}{9}\binom{301}{2} \approx 5{,}000$ annotations, after which the positions of the elements in the ranking no longer change.
The curve has a steep decrease in the first 1,000 comparisons, and later plateaus at an average of 10 positions changed by the terms in the ranking. After 5,000 annotations, it falls to an average close to 0.

ASAP reduces the amount of annotation required to reach convergence in the ranking by a factor of 9, making it an effective way to aggregate domain expertise in qualitative ranking tasks.
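The budget figures above follow from simple pair counting over the 301 selected topics:

```python
import math

# Pair-counting behind the annotation budgets: 301 ranked topics give
# C(301, 2) possible pairwise comparisons.
total_pairs = math.comb(301, 2)          # 45,150 possible pairs
expected_budget = total_pairs // 3       # 15,050, the ASAP-based estimate
observed_convergence = total_pairs / 9   # ~5,017, where the ranking stabilized
```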

Ranking
The final ranking is presented in Figure 5, which shows the topics' mean score computed by TrueSkill and their position in the final rank. We can notice that, while the extremes of the ranking are very well separated, the central area is less so. This is caused by the higher difficulty of comparing topics belonging to the middle area of a taxonomy.
A more qualitative view of the resulting ranking is presented as a sample of the topics at different levels in Table 3. The table shows a vertical view of topics that would belong to the same branch in a taxonomy; in particular, it presents the Artificial Intelligence / Computer Vision branch, from the topmost, most general term, through the middle terms, all the way down to the last terms in the branch. From Table 3, we notice that the top holds the terms we would expect, with 'Science' being the first. As we go down the ranking, the terms get more specific, first 'Computer Science' followed by 'Artificial Intelligence'. After these, we find a limit case with two terms, 'Computer Vision' and 'Machine Learning', whose order is debatable: some would reverse them, some consider them correct as they are, and others would place them at the same level. As we get closer to the end, we find more concrete tasks like 'Image Segmentation', and at the bottom we see methods like 'Convolutional Neural Network'.
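To illustrate how pairwise judgements become a scored ranking, the sketch below uses a simple Elo-style update as a self-contained stand-in for TrueSkill (the algorithm actually used in the paper, which additionally models score uncertainty); the topics and judgements are illustrative:

```python
# Illustrative Elo-style update (a stand-in for TrueSkill): each judgement
# "general beats specific" nudges both scores.
def elo_update(ratings, winner, loser, k=32.0):
    # Expected probability that `winner` outranks `loser` given current scores.
    expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
    ratings[winner] += k * (1.0 - expected)
    ratings[loser] -= k * (1.0 - expected)

# Hypothetical topics and annotator judgements (more general, more specific).
ratings = {"science": 1000.0, "computer-vision": 1000.0, "image-segmentation": 1000.0}
judgements = [
    ("science", "computer-vision"),
    ("science", "image-segmentation"),
    ("computer-vision", "image-segmentation"),
]
for general, specific in judgements:
    elo_update(ratings, general, specific)

# Sort by score, most general first.
ranking = sorted(ratings, key=ratings.get, reverse=True)
```

With consistent judgements, the scores recover the expected generality order.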
The final list can be checked in the data replication package (see Section 1).

Takeaway 3
TrueSkill's ranking captures the hierarchical relations among terms. However, there is still room for improvement at the middle levels, where the separation is not as strong.

Clustering
Using the Elbow method for KMeans, we found the optimal number of clusters for our ranking at $k = 8$. Figure 5 shows the distribution of the topic ranking and the cluster each topic belongs to. We evaluate our clustering using the 'Tie'-labelled pairs from the annotation phase, measuring how many of the tied pairs end up in the same cluster. Out of the 382 ties, 364 were unique, almost evenly distributed among the eight categories. The results show that 30% of the tied pairs belong to the same cluster; when loosening the constraint of equality to a distance of 1 cluster, we reach 100% accuracy. This result suggests that our method is effective overall; nevertheless, many cases are not placed at precisely the correct level. However, if we also take into account the results in Table 3, we can see that terms in the same vertical are correctly ordered. Across verticals there might be less order, which is in line with the objective of the ranking: creating a sorting of terms that belong to the same domain.
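The tie-based evaluation can be sketched as follows; the cluster assignments and tie pairs below are made-up examples, not the paper's data:

```python
# Fraction of 'Tie'-annotated pairs whose topics land within `max_distance`
# clusters of each other (0 = same cluster, 1 = same or adjacent cluster).
def tie_accuracy(tie_pairs, cluster_of, max_distance=0):
    hits = sum(1 for a, b in tie_pairs
               if abs(cluster_of[a] - cluster_of[b]) <= max_distance)
    return hits / len(tie_pairs)

# Hypothetical topic -> cluster-id mapping and tie pairs.
cluster_of = {"nlp": 3, "computer-vision": 3, "parser": 6, "compiler": 5}
ties = [("nlp", "computer-vision"), ("parser", "compiler")]
exact = tie_accuracy(ties, cluster_of)                   # same-cluster rate
loose = tie_accuracy(ties, cluster_of, max_distance=1)   # within one cluster
```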

Topic Ranking Distribution
After obtaining our ranking, we are able to answer our research question. RQ: Are the topics used to annotate GitHub projects evenly distributed across the levels of a taxonomy?
To answer this, we can use Figure 6, which shows the distribution of topics at the different levels of the ranking. The top bar chart shows that the topics are normally distributed across the levels, with the mean at level four, right in the middle. However, if we take the frequency of the topics into account (bottom plot), the mean moves towards a higher, more general level, level three.
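The shift between the two plots can be reproduced with a toy calculation: weighting each topic's level by its usage frequency pulls the mean toward the more general levels. The numbers below are illustrative only, not the paper's data:

```python
# Illustrative only: level of each topic and how often developers use it.
levels = [2, 3, 3, 4, 4, 4, 5, 5, 6]
freqs = [900, 700, 650, 200, 180, 150, 40, 30, 10]

# Unweighted mean treats every topic equally; the weighted mean counts each
# topic once per project using it, so popular (general) topics dominate.
unweighted_mean = sum(levels) / len(levels)
weighted_mean = sum(l * f for l, f in zip(levels, freqs)) / sum(freqs)
```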
This suggests a tendency of developers to use a general term that describes only the area of application, without more specific information. This negatively impacts the retrieval of projects and increases the time required to find the appropriate repository, as few are labelled with a more specific term.

Takeaway 4
There is a lack of specificity in the terms used by developers to annotate their projects. Future work in software classification needs to address this issue by suggesting topics at multiple levels.

Ranking New Topics
Using the different approaches defined in Section 4.7 to simulate annotations for a newly added topic ('random', 'order', and 'informed'), we measured the average number of annotations required to reach convergence.
The results are presented in Table 4. We can notice that, independently of the scenario, the number of annotations required is minimal, ranging from 22 in a non-optimized scenario down to around 15 when using a better approach (e.g., 'informed'). These are more than the ones suggested by TrueSkill; however, the difference is negligible considering that the selection of the pairs was not online, as it would be for new topics.
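An idealized 'informed' strategy can be viewed as a binary search over the existing ranking: with noise-free comparisons, placing a new topic among 301 ranked ones takes about $\lceil \log_2 302 \rceil = 9$ comparisons, in line with the roughly 15 observed with noisy human judgements. The sketch below assumes a perfect comparison oracle (a simple score comparison), which is a hypothetical simplification:

```python
# Idealized 'informed' insertion: binary-search the new topic's slot in an
# existing ranking, counting the pairwise comparisons used.
def insert_with_comparisons(ranked_scores, new_score):
    comparisons = 0
    lo, hi = 0, len(ranked_scores)
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1  # one human pairwise comparison in the real pipeline
        if new_score < ranked_scores[mid]:
            hi = mid
        else:
            lo = mid + 1
    return lo, comparisons

ranked = list(range(301))  # stand-in generality scores for 301 ranked topics
position, used = insert_with_comparisons(ranked, 150.5)
```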
This shows how easy it is to extend the classification to keep it up to date or adapt it to new subdomains. Furthermore, the required amount of annotation scales linearly with the number of new topics for batch additions, as the problem can be viewed as multiple single-term increments. Our pipeline is flexible and allows the insertion of new terms with minimal effort, making our taxonomy a good starting point to build upon.

DISCUSSION
In this section, we discuss what we consider the unique features of the framework we have presented, along with the repercussions of the choices made and the implications that should be expected.
Unique features. Our exploration of a bottom-up, data-driven taxonomy has shown that the labels used by developers in their everyday work can form a solid starting point for a taxonomy. Although this has been attempted in the past, we believe that this work adds at least two unique features to this quest. First, the approach we developed is based on a seed that was annotated by 8 experts and whose provenance is directly rooted in the development of thousands of active GitHub projects. The annotation part, albeit time-consuming and process-intensive, is a necessary factor for the quality of the seed: this activity has been mostly absent in all the past works that we analyzed for reference.
Extensibility. The second unique feature that this work offers to the research community is a flexible and dynamic approach to extending the taxonomy with further terms and labels. All existing taxonomies can be considered flexible, in the sense that they allow further terms to be included. The added value of GitRanking is that anyone proposing a new label can dynamically place it by means of further annotations. For example, if a researcher wanted to add a new label not previously present in our taxonomy, they would be expected to run some 15 pairwise comparisons; that would be necessary and sufficient to locate the new term in the correct place in our taxonomy. The ability to add new terms is crucial, as the field evolves and new terms might become popular quickly, or someone might want to adapt the taxonomy with terms of interest that are not currently present.
Initial seed. For our study, we decided to use the top 3,000 topics by frequency as a starting point. This sample covers 60% of all the labels present in the overall set of 135K projects, and it does not include less popular or underrepresented topics. One might argue that this is the reason for our results regarding the distribution of topics across levels: the less popular topics are also the more specific ones. However, we perform the ranking and clustering on the topics available; hence, the results would be even more skewed towards the higher levels if we also considered the less frequent topics. Furthermore, the further down the list we go, the more noise and duplicates we find, making the task more time-consuming for the annotators.
Practical applications of the taxonomy. Taxonomies and classifications have an inherent utility in organizing the knowledge base around a specific area of expertise. In our case, we believe that such a classification can be used by GitHub to guide developers in labelling their projects with at least one term from each level. This would be similar to describing an academic paper with the ACM Computing Classification System, which allows researchers to choose from high-level and lower-level topics to describe their work. Lastly, the taxonomy can be used to automatically suggest topics at all levels of the ranking to repositories on GitHub, improving retrievability.
Improvements. While our framework has shown its ability to create a good ranking of the selected terms (Figure 5), as mentioned in the Results section, the separation at the middle layers is not as strong as we would like. Collecting more pairwise comparison data for the middle area alone might not solve this, as the middle layers are also the hardest for humans to separate. One solution could be to perform linking among terms, producing a tree-like taxonomy; however, this is not trivial either and requires more research.

THREATS TO VALIDITY
We use the classification of Runeson et al. [20] to analyze the threats to the validity of our work. We present construct validity, external validity, and reliability. Internal validity was not considered, as we did not examine causal relations [20].

Construct Validity
The first construct threat to our study is the initial filtering of the topics, which is highly subjective. However, we reduced this threat by having three annotators and only choosing the topics agreed on by at least two of them. For the ranking, while the input is subjective, the algorithm combines the input from all annotators, reducing the weight of annotation errors.
Furthermore, our ranking is open to evolution: with a flexible pipeline, additions and changes can be made with minimal effort, and the process can be opened to the community.
Regarding the analysis of the ranking distribution and how developers label their projects, we mitigated this threat by using the validation data collected during the annotation phase.

External Validity
Our approach is independent of the domain, as it uses general methodologies from statistics and machine learning. Furthermore, since we operate on words, the approach is applicable to all domains, not just the ranking of software application domains.

Reliability
For the initial selection of projects, we collected a large number of different projects, resulting in a large pool of terms that constitutes a good sample of the population. Regarding the filtering, as discussed in the Construct Validity section, working with natural text is inherently subjective; we therefore focused on a robust filtering of the topics used in our study.

CONCLUSIONS AND FUTURE WORK
This paper presented GitRanking, a framework for creating a discrete ranking of software application domains. Our work aims at solving some of the common issues present in current datasets for software classification, including mixed taxonomies, mixed granularity, and ambiguity.
Using GitRanking, we analyzed the top 60% of a large sample of GitHub Topics and selected a list of 301 that we considered application domains. We then disambiguated each topic by linking it to the Wikidata knowledge base. Furthermore, aided by the ASAP active sampling algorithm, 8 annotators compared more than 5,000 topic pairs; finally, the TrueSkill algorithm used those annotated pairs to create a ranking of the selected application domains. GitRanking's pipeline allows the resolution of the aforementioned issues.
As a last contribution, we answered our research question (RQ): by clustering our ranking, we found that developers tend to assign high-level labels to their projects, making it harder to find specific projects. GitRanking proves to be a viable option for developers to annotate their projects with more specific terms.
We plan to improve on our work in different ways. First, we would like to increase the number of topics in the list and ranking. Furthermore, we want to create a hierarchical taxonomy by linking the terms in our ranking. Moreover, we are interested in creating mappings for the terms in the lower end of the distribution, as many are surface variants of topics already present in our taxonomy; this will allow for automatic, distant annotation of GitHub projects, which translates into the creation of a large-scale, multi-label dataset for software classification that can evolve. Lastly, we plan to train classification models that automatically recommend topics at specific levels, making it easier for developers to properly label their projects.

Figure 1 :
Figure 1: GitRanking's pipeline for creating the ranking of GitHub Topics.

Figure 2 :
Figure 2: Distribution (log scale) of the frequency of the scraped topics from GitHub. The blue dotted line represents the cut we picked for our initial filtering at the top 3,000 topics.

Figure 3 :
Figure 3: The user interface presented to the annotators. The skip button allows the annotators to obtain a new pair when they are not confident about the current one.

Figure 4 :
Figure 4: Average change in position of the terms every 200 new annotations.

Figure 5 :
Figure 5: Ranking of the GitHub Topics. The x-axis represents the position, while the y-axis is the mean score extracted by TrueSkill. The color represents the cluster that each topic is assigned to.

Figure 6 :
Figure 6: (Top) Distribution of the topics in the clusters. (Bottom) The number of projects (frequency of the topics) in each cluster. A low cluster number means more general, and a higher value means more specific.

Table 1 :
Number of positively labelled topics for each annotator and their positive rate.

Table 3 :
Rank of a subset of terms in the vertical of Computer Science, Machine Learning, and Computer Vision.

Table 4 :
Average number of annotations required to reach convergence when adding a new term in the ranking.