How are decisions made in open source software communities? — Uncovering rationale from python email repositories

Group decision‐making (GDM) processes shape the evolution of open source software (OSS) products, thus playing an important role in the governance of open source software communities. While these GDM processes have attracted the attention of researchers, the rationale behind decisions, that is, how decisions are made that enhance the OSS, have not received much attention. This work bridges this gap by extracting these rationales from a large open source repository comprising 1.55 million emails available in Python development archives. This work makes a methodological contribution by presenting a heuristics‐based rationale extraction system called Rationale Miner that employs information retrieval, natural language processing, and heuristics‐based techniques. Using these techniques, it extracts the rationale behind specific decisions (for example, whether a new module was added based on core developer consensus or a benevolent dictator's pronouncement). This work unearths 11 such rationales behind decisions in the Python community and thus makes a knowledge contribution. It also analyzes the prevalence of these rationales across all PEPs and three sub‐types of PEPs: Process, Informational, and Standard Track PEPs. The effectiveness of our contributions has been positively evaluated using quantitative and qualitative approaches (e.g., comparison against baselines for rationale identification showed up to 47% improvement in the most conservative case, and feedback from the Python steering committee showed the accurate identification of rationales respectively). The approach proposed in this work can be used and extended to discover the rationale behind decisions that remain hidden in communication repositories of other OSS projects, which will make the decision‐making (DM) process transparent to stakeholders and encourage decision‐makers to be more accountable.

The underlying governance mechanism of an Open Source Software (OSS) project is one of the most important aspects impacting its success [1]. A key artifact of governance is the process describing how a group makes (or should make) decisions. Of particular interest to academia and industry are the Group Decision-Making (GDM) processes practiced within these mostly volunteer-based communities, which contribute to their success and longevity. GDM processes involve "several stakeholders discussing the problem at hand, listing the alternatives through a process of brainstorming and arriving at a consensus that leads to the final set of decisions" [2]. These GDM processes can be complex, often involving several variations and nuances, and they lie hidden or buried within large communication archives and therefore may not be transparent to everyone [3]. Although a few other studies have focussed on highlighting these Decision-Making (DM) processes [4,5], the details about how these decisions are made remain opaque and hard to track.
According to the Merriam-Webster dictionary, rationale is "the explanation of controlling principles of opinion, belief, practice, or phenomena, or an underlying reason."* Prior well-known works on rationale in Software Engineering [6][7][8][9] have noted that the focus of investigation has been on both the decisions themselves (i.e., what the decision is) and on the justification behind decisions (i.e., why and how the decision was made). For example, voting (how) on an issue was undertaken to obtain consensus (why). In GDM, these three aspects (what, why, and how) are referred to as decision-outcome (e.g., a proposal is rejected), decision-goal (e.g., reach a decision such as accepting or rejecting a proposal), and decision-scheme (e.g., voting to establish consensus or the lack thereof), respectively.
Literature on GDM in the domain of OSS [10,11] has focussed more on what the decisions are [12]. Also, the goal of decision-making (the why) is often straightforward (e.g., to approve or reject a proposal). However, the decision-schemes associated with reaching the goal (the how) are more nuanced. For example, consensus among members for approving or rejecting a proposal can be based on full consensus, rough consensus, or even lazy consensus (these terms will be elaborated on later in this paper). The nature of the decision-schemes (the how) that are enacted during decision-making has not received much attention and forms the focus of this paper. In particular, prior work has not used a data-driven approach to examine the developer discussions carried out within mailing lists and uncover these decision-schemes for OSS design decisions, and this work bridges this research gap. Prior works have merely focused on identifying rationales (i.e., a specific type of decision-scheme) from bug reports, chat messages, and surveys of the Open Source Software Development (OSSD) community [13], asking individuals about how decisions are made, instead of extracting the actual rationales from developer email discussions during decision-making. Thus, the focus of this work is on extracting the rationales (i.e., the how) behind decisions, such as rough consensus within the community and the project leader's choice, from a large repository of decision-related communication archives.
One common source of information about decisions is the mailing lists used by OSS communities. One approach for highlighting the spectrum of rationales in OSS evolution is to manually analyze developers' email discussions. However, extracting these from the communication archives of OSS projects would be a mammoth task, considering the volume of email messages that would need to be manually analyzed. Also, the rationale behind design decisions in OSS projects can be spread across multiple messages over a longer time frame, sometimes spanning several mailing lists. Additionally, it is a challenge to extract the rationale behind design decisions since the rationale expressed in natural language may not be clearly stated in a message (i.e., it can be ambiguous or the context may not be explicit) and the rationale may need to be inferred by piecing several messages together, which will require both information retrieval and natural language processing (NLP) techniques. This work aims to bridge these gaps. We first outline what is currently known in the literature on how OSS decisions are made. Second, we extract the rationales on how decisions are made during the evolution of Python, an established OSS project, by performing a manual analysis of emails. Third, we propose an approach to extract rationales from developer email messages, to aid users in identifying the rationale behind decisions on Python Enhancement Proposals (PEPs). We operationalize this approach in the form of a software tool called Rationale Miner, which is a rationale extraction system based on heuristics. Finally, we evaluate the effectiveness of this automated tool in ranking candidate rationales. Also, we sought feedback from a Python Steering Committee member.†
The rest of this paper is organized as follows. In the next section, we provide the motivations behind this work and thereafter present the research questions. Section 3 presents the background on what is currently known in the literature on the rationales for decisions in OSSD communities. Section 4 presents the methodology used to extract the rationales from Python email archives. The results are presented in Section 5, followed by a discussion of the contributions in Section 6. Finally, our conclusions are presented in Section 7.

| Transparency of decision rationale
In organizations, processes are established to make decisions. Tasks within these processes often trigger state changes (e.g., the state of a loan application changes as it goes through the approval pipeline because of specific rationale). Similarly, in OSSD communities, a technical proposal to advance the project may go through certain states (e.g., Draft, Accepted, and Rejected states). For instance, a proposal in the Draft state could be Accepted, and the corresponding rationale might be the favourable feedback received from the community. One of the main challenges in extracting the rationale for decisions is that it lies hidden in the communication between developers (e.g., emails, discussion forums, websites, and social media). Unearthing the rationale behind decisions will make it transparent to both internal and external stakeholders. It will also facilitate a better understanding of the overall decision-making process. Researchers have indeed noted that a lack of transparency in decision-making processes is one of the key concerns in open source software development [3,15]. Transparency of rationale will also enable decision-makers to be more accountable for their actions [16]. Thus, the first motivation of this work is extracting hidden rationale and making this transparent to all stakeholders.

| Quantifiable decision rationale
The second motivation of this work stems from the lack of data-driven, bottom-up studies where the rationales for design decisions are mined from actual OSS community discussions during decision-making. Pioneering works typically employed qualitative approaches where OSS decision-making processes were extracted using a range of techniques based on (a) structured interviews [16], (b) analyzing OSS project websites [17][18][19], and (c) analyzing a selected number of proposals [20]. Researchers further built on these qualitative approaches to unearth different forms of rationale during OSS development. Kurtanovic et al. [8] focused on the rationale behind users' decisions on software applications, for example, about upgrading software applications. Through a qualitative analysis of online reviews, the authors investigated how users argue and justify their decisions. Al Safwan et al. [21] carried out interviews and surveys with software developers to study their perspective of rationale for code commits. Tao et al. [22] manually inspected the reasons, stored in Bugzilla, for patches being rejected. German [23] took active part in the GNOME community and, based on his experience, described how proposals are approved in the community.
Complementing these qualitative approaches, a few works have also undertaken quantitative approaches to automatically extract rationales from OSS discussion archives. Soares et al. [24] extracted association rules from pull request data in GitHub to find factors that influence the decision of rejecting contributions. Rogers et al. [25] built classification models for text mining of rationale within bug reports. Alkadhi et al. [13] manually analyzed how OSS developers discuss rationale in IRC channels and then deployed machine learning approaches for automatically detecting and classifying them. Kunaefi et al. [26] focused on reasons behind user app purchases. They manually annotated mobile app reviews and designed three classification problems to filter reviews containing both arguments and decisions from non-argumentative reviews. These qualitative and quantitative approaches have unearthed various facets of rationale associated with OSS development, and they are foundational to the understanding of OSS decision-making processes. However, the rationales behind OSS design decisions, that is, how communities decide on technical proposals that improve the OSS functionality, remain unexplored. This could be because this task requires manually analyzing a huge number of messages within discussions in developer mailing lists. The design of an automated mechanism to extract this knowledge from the vast developer discussions of OSS projects can ease this mammoth task.
Recently, researchers have made progress towards extracting this knowledge. Savarimuthu et al. [3,4] examined GitHub commit data and showed there were additional states in the Python OSS decision-making than those reported in literature. This study was extended in our previous work [5,27], where sub-states between the decision-making states were unearthed. First, a manual investigation of decision-making discussions was carried out for a subset of all OSS design proposals within the main Python development mailing list. Based on the extracted patterns of how decision-making sub-states were stated in developer mailing lists, we then proposed a tool to mine these fine-grained decision-making processes. However, that work did not focus on extracting rationales for decisions. This work thus aims to bridge this gap.

| Decision rationale in a mixed-model governance
Research has shown that OSS projects typically adopt one of three governance structures [28]: a well-structured hierarchical governance model with clear rules (e.g., Apache); a meritocracy-based model where the community as a whole makes decisions (e.g., Postgres); and a dictatorship model (e.g., Linux). However, there are certain OSS projects, such as Python, that employ a combination of two DM approaches, that is, meritocracy and (benevolent) dictatorship. While projects such as Python have thrived due to the contribution of voluntary developers, very little is known about the rationales behind decisions on proposals in this mixed-model governance approach. Thus, the third motivation of this study is to investigate the rationale for decision-making in a project setting that has elements of both meritocracy and dictatorship. This investigation can help answer further questions, such as: does the benevolent dictator influence all decisions, or do they only intervene in those decisions where a majority is not reached? Additionally, is it often the case that the dictator overrules a decision, even when a majority is reached?

| Automating decision rationale extraction
One approach to extract the different rationales is to manually examine documents relevant to decision-making. However, such approaches are resource and time intensive, and suffer from scalability issues. The field of NLP provides specific techniques that can be used to extract the rationale from OSS email repositories. Within NLP, extracting the rationale behind events falls under the sub-discipline of causal extraction. However, it is well accepted that causal relations are still difficult to capture perfectly [29]. This is because natural language is ambiguous at times, and the cause, the effect, and their relationship could be expressed in diverse ways within and across sentences, paragraphs, and even separate documents. Thus, the final motivation of this work is to investigate, design, and develop a suitable approach that can automatically extract rationale from decision-making archives. Subsequently, the effectiveness of the chosen approach is also evaluated.

| Research questions
Based on the motivations presented above, this research poses four research questions:

RQ 1 What is currently known in the literature about the rationales behind OSS decisions?
The question above aims to contextualize this work and synthesize key findings from prior work. The remaining questions build on the first question by focussing on rationale mining using the Python case study. The Python project was chosen for several reasons: (a) it is a large, well-established OSS project that utilizes a combination of meritocracy and dictatorship DM principles; (b) it is widely popular [30]; (c) it is known for adhering to good governance practices [31]; and (d) there have been prior works on Python's decision-making aspects [3][4][5] that can serve as comparators for our current work.

RQ 2 How are decisions about Python Enhancement Proposals (PEPs) made?
This question is divided into two parts:
• RQ 2a What are the different types of rationales for decisions on PEPs?
• RQ 2b What is the prevalence of rationale types across all PEPs and the three specific PEP types?

RQ 3 How can the rationale for PEP decisions be extracted automatically from Python email archives?

RQ 4 What is the effectiveness of our approach for extracting the PEP decision rationales?

These questions will be answered in the following sections.

| RQ 1: RATIONALES BEHIND OSS DECISIONS AS REPORTED IN THE LITERATURE
This work focuses on how decisions are made in OSSD. As indicated earlier, this corresponds to the decision-scheme (the method) that is used to make decisions. Based on key literature on decision-schemes [18,32,33], we have identified five main schemes, as summarized in Table 1. While all five schemes enable decisions to be made in a group setting, the first three involve the participation of several members in a community, and the last two delegate the responsibility of the final decision to select individuals.
The consensus scheme is a democratic approach where decisions are collectively made by the community. A good discussion of an issue happens in the mailing list and the decision naturally emerges from this discussion. At the end of the discussion, someone usually posts a message along the lines of "I assume we all agree that this bug needs to be fixed, and that this is the way to fix it" [32]. When there appears to be consensus for a decision, consensus approval is sought (see Table 2). Note that this process only applies for unanimous consensus. There are other forms of consensus such as lazy consensus and rough consensus (as shown in Table 2), which are arrived at using the voting scheme. When unanimous consensus cannot be achieved, formal voting on an issue is undertaken. Again, the results of the voting may be based on different criteria as shown in Table 2 (e.g., lazy majority or 2/3 majority). The polling scheme is used for more informal purposes, such as a quick assessment of whether there might be support for a new idea or proposal. If so, the proposal may be pursued. It should be noted that even though voting and polling are both decision-schemes, the former is formal and the latter is informal, and their associated operationalizations are different (e.g., the definition of a lazy consensus is different to that of a lazy majority). Conversely, some projects that have benevolent dictators delegate all final decisions to these members (project leader decision scheme). Some projects may also delegate decision-making to subject experts (expert decision scheme). We call these fine-grained decision objectives the rationale (see Table 2).
While prior works [18][32][33][34] have identified some of the decision-schemes (in Table 1) and rationale (in Table 2), these were identified by reading project web pages and blogs, and surveying and interviewing project members. However, there are two main issues with these extant works. First, while the methodology employed may help build a picture of how decisions are made, the prevalence of the use of such rationales is currently unknown. Second, there may be other types of rationales that have not been reported by prior work. Our work bridges these gaps by developing an approach that unearths decisions quantitatively after finding different types of rationales in communication archives of decision-making (e.g., email archives).
We now discuss prior research that is most relevant to our undertaking. Mrówka [18] outlined and analyzed the DM process in OSSD through a study of various online resources, including previous literature and project documentation. The work described the strengths and weaknesses of GDM models in OSSD communities and outlined how decisions were made in the Apache project by presenting a high-level process of consensus-based DM. The components of the DM process included: (i) a proposal phase where the general idea was proposed; (ii) a discussion phase where the idea (or a problem) was discussed by the participants in the project (e.g., offering solutions); and (iii) a voting phase where members voted on their preference for a solution if there was no clear consensus from the discussion phase. Then, depending upon the decision rule (e.g., a simple majority of at least 50% of the votes, or a supermajority of at least 75% of the votes), a decision was made. While this work outlines the DM processes, it does not present all possible rationales for decisions, and it does not quantify the prevalence of rationales (e.g., how often simple majority voting happens versus supermajority).
TABLE 1 OSS design GDM schemes

| DM scheme | Description | Source |
| --- | --- | --- |
| Consensus based on discussion | Decisions are based on the collective community view. There are various ways to establish this view, which are based on different circumstances and they are summarized in Table 2. Some sort of informal voting is generally used to source each member's preference. | Mrówka [18] and Krawczyk-Bryłka and Krawczyk [33] |
| Voting | Voting is a solution to reach a decision after consensus cannot be found in the community. Members are asked to formally convey their preferences. Since consensus was not attained before, some sort of majority is preferred. | Mrówka [18] and Fogel [32] |
| Polling | When it is difficult to ascertain members' preferences, one solution is to ask members to indicate their preferences through a poll. While this is supposed to be informal, members may choose to treat the result as binding. | Mrówka [18] and Fogel [32] |
| Project leader decision | All or most decisions are made by the project leader (mainly the project founder). Other community members may be invited to give feedback during the discussion stage. | Mrówka [18] |
| Expert decision | Decision responsibility is sometimes transferred to a community member who has greater knowledge of the subject matter. | Krawczyk-Bryłka and Krawczyk [33] |

TABLE 2 Rationales used in OSS design GDM to establish collective community views

| Rationale | Description | Source |
| --- | --- | --- |
| Unanimous consensus | Must comprise only +1 binding votes (a) (i.e., no −1 binding votes). | Mrówka [18] |
| Lazy consensus | An action with lazy consensus is implicitly allowed, unless a −1 binding vote is received. | Mrówka [18] |
| Rough consensus | Aims to make a sufficiently good decision based on extensive discussion, rather than waiting for everyone's preference. | Eseryel et al. [34] |
| Consensus approval | Requires at least three +1 binding votes and no −1 binding votes. | Mrówka [18] |
| Lazy majority | Requires more +1 binding votes than −1 binding votes. | Mrówka [18] |
| 2/3 majority | Some strategic actions require a 2/3 majority of community members; in addition, 2/3 of the binding votes cast must be +1. | Mrówka [18] |

(a) The outcome of a binding vote determines the decision.
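The vote-based rules in Table 2 can be read as simple predicates over a tally of binding votes. The following is a minimal sketch under stated assumptions, not part of Rationale Miner: the function name `satisfied_rules` and its parameters are hypothetical, the 2/3 majority rule is interpreted as requiring +1 votes from at least two thirds of eligible members as well as two thirds of the votes cast, and rough consensus is omitted because it rests on discussion rather than a count.

```python
from typing import List


def satisfied_rules(plus_one: int, minus_one: int, eligible_members: int) -> List[str]:
    """Return the vote-based Table 2 rules that a tally of binding votes satisfies.

    All names here are illustrative assumptions; the thresholds follow the
    descriptions in Table 2.
    """
    rules = []
    total_cast = plus_one + minus_one

    # Unanimous consensus: only +1 binding votes, i.e., no -1 binding votes.
    if plus_one > 0 and minus_one == 0:
        rules.append("unanimous consensus")

    # Lazy consensus: implicitly allowed unless a -1 binding vote is received.
    if minus_one == 0:
        rules.append("lazy consensus")

    # Consensus approval: at least three +1 binding votes and no -1 votes.
    if plus_one >= 3 and minus_one == 0:
        rules.append("consensus approval")

    # Lazy majority: more +1 binding votes than -1 binding votes.
    if plus_one > minus_one:
        rules.append("lazy majority")

    # 2/3 majority (assumed reading): +1 votes reach 2/3 of eligible members
    # and 2/3 of the binding votes cast. Integer arithmetic avoids rounding.
    if total_cast > 0 and 3 * plus_one >= 2 * eligible_members and 3 * plus_one >= 2 * total_cast:
        rules.append("2/3 majority")

    return rules
```

Under this reading, a tally can satisfy several rules at once, which is one reason the rationale actually invoked by a community has to be recovered from the discussion text rather than from vote counts alone.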
Fogel [32] outlined several important guidelines for running an OSSD community from a practitioner's perspective. The work included a detailed discussion of consensus decision-making where, among other aspects, details such as how and when to undertake polling and voting were presented. At a high level, the study presents an approach to how consensus-based decision-making should be undertaken. However, the work suffers from the same disadvantages as that of Mrówka [18], as discussed above.
Prior works [3][4][5] have focussed on the decision-making processes associated with the evolution of Python Enhancement Proposals (PEPs). All major proposals to enhance the Python language, or to document processes that Python developers should adhere to, are made using these PEPs. More specifically, community members use these PEPs to propose new ideas, features, or patches to enhance the Python language. The state of these proposals changes to another state (e.g., from draft to accepted) when a decision is made (usually by the Python core developers). The official Python documentation in PEP 1 [35] at the time showed eight (now nine) main decision-making states for the PEP process, as shown in Figure 1. The processes extracted by Savarimuthu et al. [3] showed 13 main states that a PEP goes through, instead of the eight previously identified in PEP 1, thus showing a disparity between the "as-is" and "should-be" processes. Keertipati et al. [4] further identified the decision-making processes for each of the three PEP types (Process, Informational, and Standards Track), and in our previous work [5], we expanded on this to identify the fine-grained decision-making processes in Python. That work revealed more sub-structures, such as voting and consensus formation phases in the decision-making process, and in doing so, substantiated that the process was richer and more complex.
However, in these prior works [3][4][5], the focus is on the discovery of additional decision-making processes, that is, states and sub-states. The main limitation of these works is that they did not focus on capturing the rationale behind state transitions (or decisions). That is to say, the question "what is the rationale for a transition between states?" is not answered. For instance, why did a PEP move from the draft to the rejected state while another moved from draft to accepted? Was it through consensus, BDFL decree (where a decision is made by the Benevolent Dictator For Life, or BDFL), or some other rationale?
As an example, PEP 289's acceptance rationale as stated by a developer was: "Based on the favorable feedback, Guido has accepted the PEP for Py2.4." [36] Similarly, the Python project leader (known as the Benevolent Dictator For Life or BDFL) stated the rationale while accepting PEP 285: "Despite the negative feedback, I've decided to accept the PEP." [37] Currently these rationale sentences are buried within Python's OSS mailing list archives, particularly the python-dev repository that consists of core developer discussions. While there has been prior work on mining Python decision-making processes from such sources [3][4][5], no prior work has investigated the extraction of rationale behind decisions that are made during Python's evolution. In the next section, we describe the methodology to answer RQs 2-4.
To properly unearth the spectrum of rationales used in OSSD decision-making, we investigate an OSSD community where a combined approach involving both meritocracy and dictatorship is employed to make technical decisions. Thus, we examined the Python community, and limited our study to the period in which the community practiced a combined meritocracy and dictatorship governance model. The BDFL resigned in July 2018. The new governance model, which is made up of a Steering Council, was decided on 17 December 2018. Our investigation considers all decision-making discussions up to the BDFL's resignation, thus capturing all instances of the decision-making carried out using a combined meritocracy and dictatorship approach.

| METHODOLOGY FOR RQS 2-4
Research questions 2-4 concern the Python case study, where the overarching goal is to extract the rationales for decisions made during Python's evolution that remain hidden in email archives. Figure 2 outlines the methodology we used to extract and present rationales (thus answering RQs 2-4). These steps are embodied in Rationale Miner, our proposed heuristics-based rationale extraction tool. There are nine steps in our approach, as outlined below.
FIGURE 1 Python Enhancement Proposal decision-making process [4,35]
Step 1: First, we downloaded all PEPs that were accepted or rejected. The PEP text includes its author, and the community member (BDFL Delegate) who decided on the PEP, among other details. These PEPs were obtained from Python's GitHub repository ‡ and were then stored in a MySQL database. We downloaded 268 PEPs from Python's inception (March 1995) to July 12, 2018. We chose this end date as this is when the BDFL resigned, marking the end of the benevolent dictatorship governance model.
Step 2: Next, we retrieved the 13 main states that each of the PEPs downloaded in step 1 transitioned through [4]. Python's GitHub repository contained this information for every PEP document. The key details that we extracted were (i) the PEP state transitions (e.g., from draft to accepted), and (ii) the date of this change. The Python state commit data stored in GitHub was extracted from March 1999 through to July 12, 2018, as this is when the BDFL resigned, marking the end of the benevolent dictatorship governance model. We only focused on accepted and rejected PEPs, and a total of 268 such PEPs were discussed in Python during this period.
Step 3: We then extracted all individual email messages from the Python email archives that fell within the date range, and stored them in the database. Six leading forums for Python developer discussions were used: python-dev, python-ideas, python-commits, python-checkins, python-list, and python-patches. Python-dev is the main forum used by core developers for PEP discussions. The resulting dataset contained 1,553,564 email messages. Where possible, we assigned each email message a PEP number. A script identified whenever PEP numbers (e.g., "PEP 13") or titles (e.g., "Python Language Governance") that could be mapped back to a PEP document were mentioned within a message. This number was stored alongside the message so that we could easily extract messages that were related specifically to PEP decision-making.
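The PEP-tagging script described in Step 3 can be illustrated with a small sketch. This is not the authors' script; it assumes a simple regular expression for mentions such as "PEP 13" or "PEP-0008" and a hypothetical `titles` dictionary mapping PEP titles to numbers.

```python
import re
from typing import Dict, Set

# Assumed pattern: "PEP", optional whitespace or hyphens, optional leading
# zeros, then a number of up to four digits.
PEP_NUMBER = re.compile(r"\bPEP[\s-]*0*(\d{1,4})\b", re.IGNORECASE)


def find_pep_numbers(text: str, titles: Dict[str, int]) -> Set[int]:
    """Return the PEP numbers mentioned in a message, by number or by title."""
    found = {int(match.group(1)) for match in PEP_NUMBER.finditer(text)}

    # Title matching is a plain case-insensitive substring test (an assumption).
    lowered = text.lower()
    for title, number in titles.items():
        if title.lower() in lowered:
            found.add(number)
    return found
```

A message may mention several PEPs, so the sketch returns a set; how the original script resolved such messages to a single stored number is not stated in the text.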
Step 4: We selected only those email messages from the above dataset that had a PEP number assigned and replies to these messages.The remaining emails were deemed not related to PEP decision-making.
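The selection in Step 4 can be sketched as a traversal of the reply graph. The text does not say how replies were identified, so this sketch assumes each message carries hypothetical `message_id`, `in_reply_to`, and `pep` fields (in the spirit of the standard email headers) and that replies to replies are also included.

```python
from collections import deque
from typing import Dict, List, Set


def select_pep_messages(messages: List[dict]) -> Set[str]:
    """Return IDs of messages tagged with a PEP number plus all their replies."""
    # Index children by parent ID so replies can be followed downwards.
    children: Dict[str, List[str]] = {}
    for msg in messages:
        parent = msg.get("in_reply_to")
        if parent:
            children.setdefault(parent, []).append(msg["message_id"])

    # Start from messages that already have a PEP number assigned.
    selected = {m["message_id"] for m in messages if m.get("pep") is not None}

    # Breadth-first walk: pull in replies, and replies to those replies.
    queue = deque(selected)
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in selected:
                selected.add(child)
                queue.append(child)
    return selected
```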
Step 5: The ground truth (the underlying rationale behind PEP decisions) was then retrieved from the email messages of each PEP. Section 4.1 explains our approach for compiling the ground truth. The results from the ground truth obtained were used to answer RQ 2. The ground truth was also used to evaluate the results obtained from the Rationale Miner tool, which automates the extraction of rationales.
Step 6: To extract sentences that contain the rationale, we identified a set of 13 heuristics. We assigned scores to each of the sentences based on how well they matched the relevant heuristics, and these scores were aggregated into a heuristic score for each sentence. Section 4.2 describes these heuristics and the relevant operational details to extract candidate rationale sentences.
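The idea of aggregating per-heuristic matches into one sentence score (Step 6) can be sketched as a weighted sum. The three heuristics and weights below are purely hypothetical placeholders and are not the 13 heuristics used by Rationale Miner, which are described in Section 4.2.

```python
import re
from typing import Callable, List, Tuple

# (name, weight, predicate) triples; all three entries are illustrative only.
Heuristic = Tuple[str, float, Callable[[str], bool]]

HEURISTICS: List[Heuristic] = [
    ("mentions_decision", 1.0,
     lambda s: bool(re.search(r"\b(accept|reject)\w*", s, re.IGNORECASE))),
    ("mentions_rationale_cue", 1.0,
     lambda s: bool(re.search(r"\b(consensus|feedback|vote)\w*", s, re.IGNORECASE))),
    ("mentions_pep", 0.5,
     lambda s: bool(re.search(r"\bPEP\b", s, re.IGNORECASE))),
]


def heuristic_score(sentence: str) -> float:
    """Sum the weights of every heuristic the sentence matches."""
    return sum(weight for _, weight, matches in HEURISTICS if matches(sentence))
```

With these placeholder heuristics, the PEP 289 sentence quoted earlier ("Based on the favorable feedback, Guido has accepted the PEP for Py2.4.") matches all three and so scores higher than a sentence with no decision or rationale cues.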
Step 7: We ranked candidate rationale sentences using two schemes: a Sentence-Based Scheme (SBS) and a Message-Based Scheme (MBS).
The SBS ranks candidate rationale sentences from the previous step, based on their heuristic score. The MBS groups the ranked candidate sentences at the message level and then ranks these messages. These ranks indicated the position of a sentence or a message enclosing the sentence, respectively, in the rank-ordered output. As we will discuss in Section 4.3, the top ranked sentences/messages are highly likely to contain the rationale. The rankings were then used to further optimize the heuristics based on the ground truth, as discussed in Section 4.4.
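The difference between the two ranking schemes can be sketched as follows. The text does not state how sentence scores are combined at the message level, so this sketch assumes a message takes the score of its best sentence; the type alias and function names are hypothetical.

```python
from typing import Dict, List, Tuple

# (message_id, sentence, heuristic_score); the layout is an assumption.
Candidate = Tuple[str, str, float]


def rank_sentences(candidates: List[Candidate]) -> List[Candidate]:
    """Sentence-Based Scheme: order candidate sentences by heuristic score."""
    return sorted(candidates, key=lambda c: c[2], reverse=True)


def rank_messages(candidates: List[Candidate]) -> List[Tuple[str, float]]:
    """Message-Based Scheme: group sentences by message, then order messages.

    Assumption: a message is scored by its highest-scoring sentence.
    """
    best: Dict[str, float] = {}
    for message_id, _, score in candidates:
        best[message_id] = max(score, best.get(message_id, float("-inf")))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

Other aggregations (for example, summing sentence scores per message) would fit the same interface; the maximum is used here only because it keeps a message's rank tied to its single strongest candidate sentence.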
Step 8: Next, we evaluated the performance of the SBS and MBS by comparing their rank-ordered results with (a) the ground truth (step 5 above) and (b) two baselines. This is discussed further in Section 4.5.
Step 9: In the final step, we use a graphical user interface (GUI) to present the rank-ordered results for both SBS and MBS. For examining and understanding the rationale behind a particular decision, these ranked results can be explored by a person (a user of the system). This is discussed further in Section 5.2. For evaluation, feedback on the tool was obtained from a former Python steering council member who was closely involved in decisions made in the Python project. This is discussed further in Section 6.
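One simple way to compare a rank-ordered output with the ground truth is to record the position at which the first ground-truth sentence appears. This is a sketch of that idea only; the evaluation measures actually used are described in Section 4.5, and the function names here are hypothetical.

```python
from typing import List, Optional, Set


def first_hit_rank(ranked: List[str], ground_truth: Set[str]) -> Optional[int]:
    """Return the 1-based rank of the first ground-truth sentence, if any."""
    for position, sentence in enumerate(ranked, start=1):
        if sentence in ground_truth:
            return position
    return None


def hit_within(ranked: List[str], ground_truth: Set[str], k: int) -> bool:
    """True when a ground-truth sentence appears in the top k results."""
    rank = first_hit_rank(ranked, ground_truth)
    return rank is not None and rank <= k
```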

| Extracting ground truth
We created a ground truth dataset comprising rationales for PEP decisions to support our goal of automatically extracting these sentences. To this end, we manually analyzed, in a detailed manner, 90,264 messages (spanning multiple mailing lists) relating to all 248 PEPs (152 accepted and 96 rejected PEPs).
To identify the rationale behind why each of these PEPs was accepted or rejected, the first author read the corresponding messages of these PEPs. To reduce the number of messages we had to analyze, we mainly focused on messages nearer to the decision date. A custom-built tool was used to retrieve and view all messages relating to a specific PEP in date order. A screenshot showing manual message analysis using the tool (along with other supplementary materials) can be found within a GitHub repository.§ While examining each PEP's messages, we focussed on their decision date to understand how the discussions unfolded towards the final stages and how the decisions were made. Once a decision rationale was found (e.g., consensus among developers) within a sentence, these rationale sentences, including the enclosing messages, were included as part of our ground truth data.
A total of 107 out of 152 accepted PEPs had a clearly articulated acceptance rationale. From 162 unique messages that discussed these PEPs, we curated 179 sentences containing rationale. We were also able to identify and extract more than one rationale sentence for certain PEPs. Similarly, 86 of the 96 rejected PEPs had a clearly articulated rejection rationale. From 97 unique messages of these PEPs, we curated 121 sentences containing rationale. This process resulted in a total of 300 rationale sentences from 193 PEPs, and this formed our ground truth dataset. The second author examined and verified all rationale sentences in the dataset (i.e., 100% consensus). These rationale sentences (including the example sentences in Table A2) are available within the aforementioned GitHub repository ¶ so that they can be studied by the research community.
Manually extracting rationales from large datasets is laborious and time-consuming. In our case, three months of full-time exploratory analysis was carried out by the first author. Most messages relating to the 248 PEPs required reading, with multiple readings required for some messages to fully understand the nuances. On average, developers' discussion for each PEP spanned 200 email messages. Automation of the process of rationale extraction is described shortly. In the next sub-section, we describe the patterns by which the rationale behind PEP decisions is stated.

| Heuristics-based approach to extract rationales
To address the complexities of identifying rationale sentences, we employed a heuristics-based approach. 9][40][41] For example, Kurtanović and Maalej 7 identify users' rationale for writing app reviews, while Williams et al 38 employed heuristics to extract arguments from software engineering practitioners' blogs.
The focus of these works, however, has not been on identifying and extracting the rationale behind how these decisions are made in the OSS development context using a data-driven approach. In our approach to develop the Rationale Miner tool, we used 13 heuristics. These heuristics indicated the rationale on how PEP decisions were made and were grouped into five categories. These heuristics were derived largely from the literature and supplemented by our manual analysis of PEP emails and rationale extraction. We created a scoring function based on these heuristics for both the Sentence-Based Scheme (SBS) and Message-Based Scheme (MBS). This function will be described in Section 4.3.
Category 1: Term patterns. In the first category, we consider whether a sentence consists of certain term patterns that indicate the presence of a rationale behind a PEP decision. For example: 42 "Accordingly, the PEP [Proposal Identifier] was rejected [State] due to the lack of an overwhelming majority [Rationale] for change." Here, the term pattern is made up of three term types (in square brackets) and the rationale for the decision (lack of majority) can be clearly inferred. Table A1 outlines the six different term types we have identified. The full list of rationale term types, their patterns, and their scores are included on the GitHub repository.¶ The term types Rationale Identifier, 12 Entity, 38 and Decision Term 7 were inspired by the literature on rationale identification. 4][5] Proposal Identifier and Reason are unique to PEPs and were identified from our manual analysis.
Table A2 shows the different term type patterns we looked for in each sentence. Each pattern has a score between 0 and 0.9 that indicates the relative likelihood of a matching sentence containing rationale. These scores were chosen based on our manual analysis, which considered the relative importance of these patterns. For example, the first pattern captures sentences that include the Proposal Identifier (PI), State (S), and Rationale (R) terms, and sentences matching this pattern are assigned a score of 0.9. If a sentence is not assigned any score (i.e., if the sentence matches no patterns), then that sentence is not considered for further analysis.
We found that these term patterns can exist in:
• the sentence currently being considered, represented by the heuristic Term Pattern in Current Sentence (TPCS);
• the current sentence and the remainder of the paragraph, represented by the heuristic Term Pattern in Paragraph containing the Sentence (TPPS); and
• the subject line of the email message containing the current sentence, represented by the heuristic Term Pattern in Message Subject (TPMS).
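The term-pattern lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Rationale Miner implementation: the term lists, the second pattern, and its 0.6 score are hypothetical placeholders (the actual lists and scores are in the GitHub repository), and taking the highest-scoring matched pattern is also an assumption. Only the 0.9 score for the PI + S + R pattern comes from the text.

```python
import re

# Hypothetical term lists for three term types; the real lists are
# published in the authors' repository and are not reproduced here.
TERM_TYPES = {
    "PI": [r"\bpep\b"],                        # Proposal Identifier
    "S": [r"\baccepted\b", r"\brejected\b"],   # State
    "R": [r"\bconsensus\b", r"\bmajority\b"],  # Rationale
}

# Each pattern is a set of term types that must all occur in the text.
# 0.9 for PI + S + R is from the paper; 0.6 is an assumed placeholder.
PATTERNS = [
    ({"PI", "S", "R"}, 0.9),
    ({"S", "R"}, 0.6),
]


def term_types_in(text: str) -> set[str]:
    """Return the term types that have at least one matching term in text."""
    lowered = text.lower()
    return {
        name
        for name, regexes in TERM_TYPES.items()
        if any(re.search(rx, lowered) for rx in regexes)
    }


def term_pattern_score(text: str) -> float:
    """Score of the best matched pattern, or 0.0 if no pattern matches."""
    found = term_types_in(text)
    return max(
        (score for required, score in PATTERNS if required <= found),
        default=0.0,
    )
```

The same function could be applied to the current sentence, its paragraph, or the message subject to obtain TPCS, TPPS, and TPMS respectively.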
Category 2: Proximity-based heuristics. We have also identified three proximity-based heuristics based on the dates and location of sentences in paragraphs.

Days From State Commit (DFSC):
We found that messages containing rationale tended to appear closer to the dates of making decisions.
Decisions are reflected as state changes in the PEP document committed to the version control system (e.g., a PEP going from draft to accepted). Thus, we assigned a higher score to PEP messages that are closer (in number of days) to the date of the PEP's acceptance or rejection (as recorded in the commit messages).
4][45][46][47] In our manual study of emails, we discovered that sentences in the first paragraph conveyed the main idea of the message, whereas sentences in the last paragraph reinforced what had already been stated. As a result, in our heuristics system, if a sentence appears in the first or final paragraph of a message, it receives a score of 0.9 for this heuristic; otherwise, it receives a score of 0.
Negation Terms (NT): A PEP generally goes through multiple rounds of deliberation. Negation words such as "may not" and "should not" are often used in the earlier phases of a PEP's discussion rather than in the final PEP decision, and these alter a sentence's meaning. As a result, if a sentence contains such terms, it receives a negation penalty of −0.8, lowering its overall score.
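The sentence-location and negation rules can be illustrated with a short sketch. The function names and the zero-based paragraph index are assumptions made for illustration; the negation list contains only the two examples quoted above, whereas the tool's full list is not given here.

```python
# Only the two example negation terms quoted in the text; the complete
# list used by the tool is not reproduced here.
NEGATION_TERMS = ("may not", "should not")


def location_score(paragraph_index: int, paragraph_count: int) -> float:
    """0.9 if the sentence sits in the first or last paragraph, else 0.0."""
    is_first = paragraph_index == 0
    is_last = paragraph_index == paragraph_count - 1
    return 0.9 if (is_first or is_last) else 0.0


def negation_penalty(sentence: str) -> float:
    """-0.8 if the sentence contains a negation term, else 0.0."""
    lowered = sentence.lower()
    return -0.8 if any(term in lowered for term in NEGATION_TERMS) else 0.0
```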
Category 3: Role-based heuristics. In an email message, a number of fields can be used to identify a rationale sentence. We have categorized them into three types, each of which is represented by a corresponding heuristic.
Message Type (MT): We found three types of email messages related to a PEP: the official PEPs, the state change commit messages of the PEPs, and all other PEP messages. The PEPs ¶ are updated after a decision has been made and subsequently placed on the Python website. We found that they occasionally contain the rationale behind PEP decisions. As previously mentioned, the state commit messages include the PEP state changes (e.g., draft to accepted). It is more likely that a sentence contains the rationale if it belongs to a PEP (document) or state commit message. Therefore, these sentences were assigned a score of 0.9; otherwise, they are assigned 0.
Author's Role (AR): Python members write many different types of email messages about the PEP during the different stages of its evolution. However, we found that only members from certain roles write messages containing decisions and their rationale. Therefore, for this heuristic, sentences from messages written by the BDFL or BDFL Delegate (who can differ for each PEP) were assigned a score of 0.9. A lower score of 0.6 was assigned to sentences from messages written by the PEP Author. It is only during a PEP's earlier stages of evolution that the PEP Editors are consulted. Thus, sentences from messages written by them are assigned a lower score of 0.5. Sentences in messages from a Core Developer were assigned a score of 0, since they too do not normally deal with decisions. While core developers can sometimes reflect on decisions and their rationale after they are made, there are too many of them involved in discussions, and considering all their statements would negatively impact rankings of rationale sentences (i.e., generate more false positives). Sentences from messages written by anyone else get a score of 0. The inspiration for this heuristic was prior works 41,43,47,48 that took into account the roles played by individuals (e.g., the author of a message).
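The Author's Role scores listed above amount to a lookup table. The sketch below assumes hypothetical role labels (`"bdfl"`, `"pep_author"`, and so on); how the tool actually resolves a message author to a role is not shown here.

```python
# Role labels are illustrative; the scores are those stated in the text.
ROLE_SCORES = {
    "bdfl": 0.9,
    "bdfl_delegate": 0.9,
    "pep_author": 0.6,
    "pep_editor": 0.5,
    "core_developer": 0.0,
}


def author_role_score(role: str) -> float:
    """Return the AR score for a role; any other author scores 0.0."""
    return ROLE_SCORES.get(role, 0.0)
```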

Specific Author Messages Containing Explicit Rationale (SAMCER):
This heuristic identifies messages that contain specific hints about rationale.
In the Python community, we found a pattern in which rationale sentences exist, and they include these three elements: the person (role) who wrote the email message, the combination of terms that were used by the person, and finally, the proximity of the email's posted date to the date of the conclusive state change (i.e., the date of acceptance or rejection).
1. PEP author's message containing rationale: The PEP author formally requests a review and pronouncement on the PEP via an email when they feel the PEP is nearing completion. Sometimes the community's consensus on the PEP is mentioned in this message. For instance: "As you said, consensus is reached, so just Guido's BDFL stamp of approval is all I can think of." 49 Here, the community had reached consensus on both PEP 386 and PEP 345, and only a pronouncement from the BDFL remained. The aforementioned community consensus was the rationale for the PEP's eventual approval.
2. BDFL or BDFL Delegate PEP review: The BDFL or delegate may also mention their rationale for future acceptance of a PEP in messages that precede messages stating formal PEP acceptance. For example, "Assuming no material objection [sic] appear to the new syntax and semantics, I can approve the PEP later this week." 50 This was a reference to PEP 448, which was accepted a few days later, the rationale being that the community had no further objections (i.e., lazy consensus).
3. BDFL or BDFL Delegate PEP acceptance or rejection: The PEP is formally accepted or rejected by the BDFL or delegate, but the rationale for approval is not clearly stated. For example, "Given the feedback so far, I am happy to pronounce PEP 393 as accepted." 51 This sentence implies that the PEP was accepted because of community consensus.
4. Community members reflecting on decisions: Core developers sometimes mention how the community received a PEP, either while summarizing the PEP or discussing the decision later on. For example: "Raymond suggested updating PEP 284 ('Integer for-loops'), but several people spoke up against it, including Guido, so the PEP was rejected". 52
A sentence from these four message types was assigned a score of 0.9, otherwise 0. The role of this heuristic is to boost the scores of sentences in emails where the 'role' of the email author is one of four roles. Here, apart from matching one of the four roles, two other attributes must match: DFSC (proximity to the committed state) and term pattern. Thus, these two heuristics play a supporting role for the author role. In other words, if the author's role does not match, the sentence is assigned a score of 0 even if the other two attributes are met.
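The conjunction of role, proximity, and term pattern behind SAMCER can be sketched as below. The role labels and the 14-day proximity window (`max_days`) are assumptions for illustration only; the text does not state the window the tool uses.

```python
from datetime import date

# Roles behind the four message types; labels are hypothetical.
QUALIFYING_ROLES = {"pep_author", "bdfl", "bdfl_delegate", "core_developer"}


def samcer_score(
    role: str,
    message_date: date,
    decision_date: date,
    has_term_pattern: bool,
    max_days: int = 14,  # assumed window, not taken from the paper
) -> float:
    """0.9 only when role, date proximity, and term pattern all match."""
    if role not in QUALIFYING_ROLES:
        # Without a qualifying role the other two attributes are ignored.
        return 0.0
    is_near = abs((message_date - decision_date).days) <= max_days
    return 0.9 if (is_near and has_term_pattern) else 0.0
```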
Category 4: Responses to certain specific messages. This category comprises two heuristics related to specific messages.
Response Messages to the Same State Change Message (RMSSCM): It is also likely that replies to messages with PEP state change information will contain the rationale for the subsequent decision. For example, to a message with the state change "I have accepted the PEP," the reply message contained the phrase "there were no outstanding issues," which encapsulates the rationale that the community had no remaining objections on the PEP. We were mindful that there were not many instances of these. In our previous work, 5 we used the Subject, Verb, Object (S, V, O) triples within a sentence to identify the decision-making states (e.g., accepted and rejected) and sub-states that occur between the main states (Figure 1). Therefore, using that approach, we assigned a higher score to messages with the same subject line as the message containing PEP state changes. This heuristic was inspired by a similar heuristic used by AbdelRahman. 53
Rationale Found Using Triple Extraction (RFUTE): From the sentences captured as part of ground truth, we found that some sentences contained noun-action-noun triples that can be used to establish them as containing rationale. Therefore, this heuristic represents rationale identified using clauses that appear in a sentence. For instance, if S, V, O triples are extracted from the ground truth rationale sentence "I think we've reached a consensus on those two PEPs," 49 one of the triples is ["we," "reached," and "consensus"]. This clearly implies a consensus. Thus, we again adopted the triple extraction approach that we utilized in our previous work 5 to identify and extract the Subject, Verb, Object (S, V, O) triples from sentences and matched these triples against the triples captured from ground truth rationale sentences. Sentences matching these criteria were assigned a score of 0.9, otherwise 0.
Since triple extraction is computationally expensive, # we only extracted triples from sentences (for matching against triples from ground truth sentences) which contained certain decision-specific terms, such as "consensus."
Category 5: Special Identifiers. We included two additional heuristics for special identifiers that exhibited the rationale behind PEP decisions. These were based on our analysis of PEP email messages and hence are a new contribution to the literature on rationale identification/extraction.

Decision Terms In Message (DTIM):
Sometimes there is a heading just before the paragraph that states the rationale for PEP acceptance or rejection. Message authors include these, and examples include "BDFL Pronouncement," "PEP Acceptance," and "PEP Rejection." These headers indicate that decisions are stated in the following paragraph, within which we found rationale. Therefore, any sentence from a message containing such terms was assigned a score of 0.9 for this heuristic, otherwise 0.
Decision Terms as Header of current Paragraph (DTHP): Based on the above-mentioned decision terms, a more focussed heuristic emphasizes the sentences in the immediately following paragraph, rather than all sentences in the message (which may be present in subsequent paragraphs).
As it is more likely that the rationale would be in the immediately following paragraph (i.e., the paragraph that follows the header), we assigned a score of 0.9 to all sentences in that paragraph.
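The Final Scoring function. The scores of all 13 heuristics were summed to compute the Final Score (FS) for each sentence:

$$FS = TPCS + TPPS + TPMS + DFSC + SLIM + NT + MT + AR + SAMCER + RMSSCM + RFUTE + DTIM + DTHP \tag{1}$$

Here, SLIM denotes the sentence-location heuristic and NT the (negative) negation-term score. Table 3 summarizes the heuristics used for rationale extraction in this work.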
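As a minimal sketch of this summation, assume each sentence carries a dictionary of per-heuristic scores keyed by the abbreviations used in this section (SLIM and NT are used here for the sentence-location and negation heuristics). The dictionary representation is an assumption for illustration, not the tool's data model.

```python
HEURISTICS = (
    "TPCS", "TPPS", "TPMS",   # Category 1: term patterns
    "DFSC", "SLIM", "NT",     # Category 2: proximity-based
    "MT", "AR", "SAMCER",     # Category 3: role-based
    "RMSSCM", "RFUTE",        # Category 4: responses to specific messages
    "DTIM", "DTHP",           # Category 5: special identifiers
)


def final_score(scores: dict[str, float]) -> float:
    """Sum the 13 heuristic scores; a missing heuristic contributes 0."""
    return sum(scores.get(name, 0.0) for name in HEURISTICS)
```

For example, a sentence with TPCS = 0.9 and AR = 0.9 that also triggers the negation penalty (NT = −0.8) would receive a final score of about 1.0.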

| Ranking candidate rationale sentences
The use of heuristic scoring enabled us to rank both candidate rationale sentences and candidate rationale messages. We refer to the former ranking scheme as the Sentence-Based Scheme (SBS) and the latter as the Message-Based Scheme (MBS). The former computes candidate sentences, while the latter suggests only distinct candidate messages which contain these sentences. Thus, the first approach is fine-grained (sentence-level) and the second one is coarse-grained (message-level).
In the SBS, we produce a descending rank-ordered list of candidate rationale sentences for a PEP based on the sentence scores of all sentences associated with a PEP. Sentences that appear at the top of the list for a given PEP are more likely to contain rationale. In the SBS, the rationale can be present in multiple candidate rationale sentences belonging to the same email message, appearing in different rank-ordered positions. However, a user may prefer to see results from the perspective of the entire message containing these rationale sentences. The MBS groups the SBS results at the message level. This grouping results in two advantages for a user. First, as entire messages are shown, it provides a richer context around the rationale in relation to the candidate rationale sentence. Second, instead of multiple rationale sentences from the same message being shown in different rank-ordered positions, users can view related rationale sentences in the context of the same message. The user's effort required to scroll through the list of rationale sentences will be reduced as a result.
Figure 4 shows how these two schemes are designed. As shown in the figure, in the SBS ranking, there are two sentences from the same message (message ID 178244) which have ranks 1 and 3. In the MBS ranking scheme, which ranks messages, the top-ranked result (i.e., the message with message ID 178244) containing these two sentences is presented as a single result record, which is the whole message. This grouping approach is applied to all sentences which are from the same message in the SBS. We believe that ranking candidate rationale sentences in their respective email messages will offer an overall solution that is more information-rich than an approach which presents the user with only sentences. Section 5.3 includes an evaluation of both ranking schemes.
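The relationship between the two schemes can be sketched as follows. The `ScoredSentence` type and the rule that a message takes the position of its best-ranked sentence are assumptions for illustration; the latter is consistent with the Figure 4 example, but the text does not spell out how messages are ordered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredSentence:
    message_id: int
    text: str
    score: float  # final heuristic score of the sentence


def rank_sentences(sentences: list[ScoredSentence]) -> list[ScoredSentence]:
    """SBS: sentences in descending order of score."""
    return sorted(sentences, key=lambda s: s.score, reverse=True)


def rank_messages(
    sentences: list[ScoredSentence],
) -> list[tuple[int, list[ScoredSentence]]]:
    """MBS: group the SBS output by message, ordered by best sentence."""
    grouped: dict[int, list[ScoredSentence]] = {}
    for sentence in rank_sentences(sentences):
        # dicts keep insertion order, so a message's position is fixed
        # by its highest-ranked sentence.
        grouped.setdefault(sentence.message_id, []).append(sentence)
    return list(grouped.items())
```

With sentences from message 178244 at SBS ranks 1 and 3, `rank_messages` returns that message once, as the first result, with both sentences attached.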

| Heuristic optimization based on ground truth
Two questions may arise regarding the heuristics that we identified in Section 4.2. First, are all 13 heuristics effective in identifying candidate rationale sentences? Second, are the scores assigned to each heuristic optimal? We therefore undertook a two-pronged optimization by (i) identifying the heuristics (variables) that strongly influenced the identification of rationale, and (ii) using a parameter sweeping approach to find the best values for these variables.
Using the initial values for the 13 heuristics as described in Section 4.2, we computed scores for every sentence across all emails. We then computed the total number of ground truth rationale sentences that appeared in the top 5 results using the SBS scheme. Having computed this baseline, we systematically removed one variable at a time from the FS (Equation (1)) and compared how that impacted the overall results. The goal of this approach was to identify variables that strongly influenced our results. In the results (Table 4), a positive value indicates that including that heuristic increases the number of rationale sentences captured, so including that heuristic will have a positive influence on the results (see the Influence column). Similarly, if the values are negative, the heuristic will have a negative influence (i.e., adding the heuristic results in the identification of fewer sentences). For example, including the SAMCER variable results in 4 fewer sentences (indicated by −4) being captured for accepted PEPs using the SBS scheme. A mixture of positive and negative values in a row indicates a mixed influence (sometimes positive and sometimes negative). These mixed influential heuristics either had a strong influence (high magnitude of influence either way, positive and negative) or a weak influence (low magnitude of influence either way). Some heuristics had no influence (see last three rows) at the top 5 ranking. Nevertheless, these heuristics have been included in our work, as they helped the identification of sentences at rank 10 and above.
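The leave-one-out comparison can be sketched as below for the sentences of a single PEP. The data layout (sentence identifiers paired with per-heuristic score dictionaries) is an assumption for illustration; the reported analysis aggregates such counts across all accepted or rejected PEPs.

```python
def hits_in_top_k(sentences, active, truth, k=5):
    """Count ground-truth sentences among the top-k ranked sentences.

    sentences: list of (sentence_id, {heuristic: score}) pairs
    active:    heuristics included in the final score
    truth:     set of ground-truth sentence ids
    """
    ranked = sorted(
        sentences,
        key=lambda item: sum(item[1].get(h, 0.0) for h in active),
        reverse=True,
    )
    return sum(1 for sentence_id, _ in ranked[:k] if sentence_id in truth)


def heuristic_influence(sentences, heuristics, truth, k=5):
    """Influence of each heuristic = hits with it minus hits without it."""
    baseline = hits_in_top_k(sentences, heuristics, truth, k)
    return {
        h: baseline
        - hits_in_top_k(sentences, [x for x in heuristics if x != h], truth, k)
        for h in heuristics
    }
```

A positive value means that removing the heuristic loses ground-truth sentences from the top k, and a negative value means that removing it gains them.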

| Identifying influential heuristics
The most positive influential heuristics were RFUTE and TPCS. When these variables were removed, the results suffered (e.g., up to 12 results for RFUTE did not show in the top five). The DFSC and DTIM heuristics had a weak positive influence on the results. SAMCER, MT, and TPMS had a strong but mixed influence across both types of PEPs and both ranking schemes, and thus were of particular interest during optimization. DTHP, NTP, TPPS, SLIM, AR, and RMSSCM had a weak mixed influence or had practically no effect on the top five ranks. Omitting these heuristics increased or decreased the top 5 rankings by at most one or two sentences. We did not remove these heuristics from the FS, since doing so reduced the number of sentences captured in the top 10 rankings by a small amount.

| Parameter sweeping
The best results can be obtained if we assign optimal scores to each variable. Using parameter sweeping, a technique commonly used in simulation systems, 55 we optimized the parameter values for the seven most influential of the 13 variables.
The seven variables considered were the ones that had a positive or strong mixed effect (i.e., the first seven heuristics from Table 4). To determine the ideal value for each variable, the values were altered in increments and decrements of 0.3. For each heuristic, we considered five values: −0.3, 0, +0.3, +0.6, and +0.9. To find the best configuration of values for the chosen heuristics, we changed the value of one heuristic at a time, while keeping the values of the others unchanged. Next, this changed value was retained and another heuristic's value was altered in the same way. This was done for all seven heuristics, resulting in all five alterations being tested with every other heuristic's five alterations. The result was a configuration, which was then evaluated against the ground truth baseline. Using this approach, we discovered that for both SBS and MBS, changing the values for most variables did not make any significant difference. Only the initial DFSC and TPMS heuristic values needed to be raised to 0.6 to obtain the best results. The results in Section 5 are based on the optimal values for all these heuristic variables.
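One reading of this procedure is a sequential, one-variable-at-a-time sweep, sketched below. The `evaluate` callback (for example, the number of ground-truth sentences in the top 5) and the dictionary of heuristic values are assumptions for illustration; an exhaustive grid over all combinations would be a different and more expensive variant.

```python
CANDIDATE_VALUES = (-0.3, 0.0, 0.3, 0.6, 0.9)


def sweep(values, evaluate, heuristics):
    """Sweep one heuristic at a time, keeping each improvement.

    values:     {heuristic: initial value}
    evaluate:   callable scoring a configuration (higher is better)
    heuristics: names of the heuristics to tune, in sweep order
    """
    best = dict(values)
    best_result = evaluate(best)
    for name in heuristics:
        for candidate in CANDIDATE_VALUES:
            trial = {**best, name: candidate}
            result = evaluate(trial)
            if result > best_result:
                # Retain the improved value before tuning the next one.
                best, best_result = trial, result
    return best, best_result
```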

| Evaluating the Heuristics-based approach
Three approaches were used to compare the results achieved using the heuristics-based approach to the ground truth. First, we compared the results from sentence and message ranking mechanisms against the ground truth. In particular, we checked whether a ground truth sentence had been correctly identified, and if so, its rank order (i.e., where in the rank-ordered list the sentence appears). Then, we collated the results using rank orders. Across all PEPs, we counted the number of sentences (and messages) that matched the ground truth for various ranks (e.g., ranks 1 to 100). For example, for rank 1, we counted the number of PEPs for which the ground truth sentences were identified as the top-ranked sentence in the SBS and MBS schemes, respectively. For rank 100, we counted the number of PEPs for which the ground truth sentences appeared in the top 100 rationale-containing sentences for a given PEP.
Second, we compared the ranking results for sentences and messages against their "ideal" ranking using normalized discounted cumulative gain (NDCG), a metric frequently employed in the information retrieval domain. 56 It has been used to test an email ranking system based on search query results, for example. 53 In our work, most PEPs have one or two rationale sentences, so NDCG penalizes the ranking quality if these sentences are not ranked in the top two positions. The following formula is used to calculate NDCG: 56

$$\mathrm{NDCG}_k = \frac{\mathrm{DCG}_k}{\mathrm{IDCG}_k}$$

where IDCG_k is the ideal DCG value (described shortly) of the ranking at position k, that is, the DCG value of the ground truth. NDCG is normalized to the interval [0, 1], with 1 indicating a perfect estimation of the ground truth. The DCG is the overall gain accumulated up to that rank. 56 It takes positional importance into account and discounts the graded relevance value of the retrieved documents by the position at which they appear. 57
Recall that Rationale Miner has two schemes, SBS and MBS, and ranks accepted and rejected PEPs separately. For each rank position k, we therefore need to calculate the average NDCG of the rationale sentences or messages that appear at rank k, across all accepted and rejected PEPs.
The average NDCG is calculated using the following formula:

$$\overline{\mathrm{NDCG}}_k = \frac{1}{P} \sum_{p=1}^{P} \mathrm{NDCG}_k(R_p)$$

where R_p is the set of all ranked rationale sentences (under SBS) or messages (under MBS) for PEP p, and P is the number of PEPs of a particular state (accepted or rejected) for which we have found rationale sentences. Thus, we compute four sets of NDCG results: SBS-accepted, SBS-rejected, MBS-accepted, and MBS-rejected. The results are presented in Section 5.3.
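The NDCG computation can be sketched as follows. The sketch assumes the common logarithmic discount, DCG_k = Σ rel_i / log2(i + 1), and binary relevance (1 if the item at rank i is a ground-truth rationale, else 0). These are standard choices, but the text does not confirm the exact DCG variant used.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the first k ranked items."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevances[:k], start=1)
    )


def ndcg(relevances, k):
    """DCG divided by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    # If no relevant item was retrieved at all, report 0.0.
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


def mean_ndcg(per_pep_relevances, k):
    """Average NDCG at rank k across the PEPs of one state."""
    return sum(ndcg(r, k) for r in per_pep_relevances) / len(per_pep_relevances)
```

Here each inner list holds the relevance labels of one PEP's ranked sentences (SBS) or messages (MBS), in rank order.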
Third, we compared the results obtained from Rationale Miner against the baselines reported in the literature. This comparison is useful to ascertain whether our approach does better than these baselines and, if so, to what extent. As the foundation of our baseline experiments, we chose the patterns used in rationale extraction in gray literature 54 and its objectives. 7 These works have used term categories similar to those used in our work:
1. Reasoning Markers 54 and Reason Clauses 7 that are similar to the Reason Identifiers in our study.
2. Named Entities are people and correspond to the Entities associated with decisions in our study.
3. Event (dates) have been represented in our work using proximity (the DFSC heuristic).
4. Verbs (also referred to as Decision Terms in Kurtanovic and Maalej 7 ) were represented using PEP state changes, that is, accepted or rejected.
Using these categories, we identified two baselines: one where the term pattern in TPCS was State (S) + Decision Terms (DT) + Reason Identifiers (RI), and one where the two term patterns in TPCS were Decision Terms (DT) + Entities (E) and State (S) + Decision Terms (DT) + Reason Identifiers (RI).
The results of these baseline comparisons against our approach are presented in Section 5.3.

| RESULTS
This section presents the results for research questions 2-4.

| RQ 2: Rationales for PEP decisions
This subsection outlines the rationale types and the prevalence of rationale types across all PEPs.

| Rationale types
RQ 2a asked, What are the different types of rationale for decisions on PEPs? Our analysis has uncovered 11 rationale types used in Python decision-making, which are detailed below. Raw data for these sentences can be viewed online.¶ Table 5 gives examples of rationale sentences for each of the 11 rationale types.
We define each of the rationales behind PEP decisions that we found as follows:
• Consensus is when the community unanimously agrees on a PEP's outcome (either to accept or reject a PEP). Conversely, no consensus is when the community cannot unanimously agree.
• Lazy consensus is when a lack of objections against the PEP is used as an indicator that all members agree with the decision.
• BDFL decree is when the BDFL intervenes during community discussions and makes a decision regardless of members' views, while the community is still deciding.
• Rough consensus considers some form of community majority when full consensus is not attainable.
• Little support is when PEPs are rejected because there is little or no "interest" or "support" from the community.
• Majority and no majority are employed when consensus is not possible. Majority is similar to rough consensus; the difference is that majority is mainly used during polling or formal voting.

| Rationale type | Rationale sentence | PEP |
| --- | --- | --- |
| Consensus | "The user community unanimously rejected this so I won't pursue this idea any further" 58 | 259 |
| No Consensus | "Although a number of people favored the proposal there were also some objections" 59 | 3128 |
| Lazy Consensus | "If anyone has objections to Michael Hudson's PEP 264: raise them now." 60 | 264 |
| Rough Consensus | "Several people agreed, and no one disagreed, so the PEP is now rejcted. [sic]" 52 | 265 |
| Little Support | "It has failed to generate sufficient community support in the six years since its proposal." 61 | 268 |
| A Majority | "Comments from the Community: The response has been mostly favorable". 62 | 279 |
| Inept PEPs | "This PEP is withdrawn by the author." 63 | 296 |
| BDFL Decree | "After a long discussion, I've decided to add a shortcut conditional expression to Python 2.5." 64 | 308 |
| BDFL Pronouncement after No Consensus | "There's no clear preference either way here so I'll break the tie by pronouncing false and true." 65 | 285 |

Note: Most of these statements were made by the BDFL directly or are references to the BDFL's views.
• BDFL pronouncement after no community consensus is when consensus is not reached in the community due to disagreement and consequently, the BDFL makes the decision.
• BDFL pronouncement overriding majority highlights an instance where the community had reached a majority community view but it was overruled by the BDFL.
• Inept PEPs are PEPs that have inherent problems that naturally lead them to be unsuccessful. Discussion on these PEPs is not contentious as it is obvious these PEPs have some shortcomings, and decisions are made without too much deliberation. This category includes eight specific rationales: problems with PEP, superseded, obsolete, lack of champion (i.e., proposal author abandoned discussions), requires significant changes, no implementers, PEP benefits are marginal, and withdrawn.
Note that we did not include a separate rationale called BDFL pronouncement because it is quite common that, based on the collective community view, the BDFL just rubber stamps the decision. In these instances, we classify it as a consensus-based decision rather than a separate criterion named BDFL pronouncement, since the real rationale is consensus.

| Prevalence of rationale types across all PEPs and the three specific PEP types
RQ 2b asked What is the prevalence of rationale types across all PEPs and the three specific PEP types?We begin by answering the first part of this question.
Prevalence of rationale types across all PEPs: Table A3 outlines the results of our manual analysis of rationale for PEP decisions as expressed in email messages.It highlights the decision-making schemes (column 1), the rationale (column 2) and the tally of PEPs that have employed each rationale (column 5).Each row in the table indicates that some number of PEPs of a particular type were either accepted or rejected (Decision) due to the stated Rationale.For example, the first row indicates that 41 Standards Track PEPs were accepted due to consensus.The full data underlying Table A3 can be found in the GitHub repository.¶ The worksheet "Rationale By PEP Numbers" lists PEPs in each decision-making scheme.Figure 5, a condensed version of Table A3, shows the prevalence of the 11 different decision rationale types for accepted and rejected PEPs.The most common rationale behind decisions were consensus (57), lazy consensus (44), inept PEPs (42), BDFL decree (18), and BDFL pronouncement after no consensus (15).Inept PEPs represents an aggregate of eight rationale in which there is a problem with the PEP, as previously described.
There are certain rationales for which it is clear what the eventual decision will be (e.g., accepted or rejected, but not both).For instance, little support, no majority, no consensus, and lack of champion almost always result in a PEP being rejected.However, we found that for some other rationales, the final decision was not always guaranteed.For example, even after lazy consensus or rough consensus, a PEP could still be rejected (possibly due to BDFL decree).We also observed the BDFL overrode the majority's opinion on one instance.This was for PEP 326 where it was rejected by the BDFL, despite it being preferred by the majority of the community, and he emphasized that "Python is not a democracy". 66The substantial F I G U R E 5 Rationale for accepted and rejected PEPs for all PEP types number of PEPs where the BDFL exercised his own preference (BDFL decree, BDFL pronouncement after no consensus, and BDFL pronouncement after a majority) implies that in the Python community, the BDFL is free to exercise his choice when necessary.However, most of the decisions taken by the BDFL are based entirely on a collective community view, such as consensus, lazy consensus, rough consensus, a majority, no majority, or little support.The Python project leader allows the community to come to a collective view on proposal outcomes, using any of these rationales, and then he (or sometimes his delegates) appear to mostly concur with the collective view.
Prevalence of Rationale Types Across Three PEP Types: The bar chart in Figure 6 depicts the rationale types used across each of the three different PEP types (Standards Track, Informational, and Process). The numbers for Process and Informational PEPs are smaller because, in general, the number of PEPs belonging to these two PEP types is small compared to the number of Standards Track PEPs. The full data underlying Figure 6 can be found in the worksheet titled "Rationale by PEP Type" in a spreadsheet within the GitHub repository.¶
The most frequently used rationale is consensus, closely followed by lazy consensus, inept PEPs, and BDFL pronouncement after no consensus.
When comparing the distribution of rationale across the three PEP types, it is evident that a wider variety of rationales is used for Standards Track PEPs than for Informational or Process PEPs. This is most likely because Standards Track PEPs are more contentious than the other two PEP types, and therefore the community as a whole requires more ways to reach a decision. The larger values for each rationale in Standards Track PEPs are primarily because these PEPs make up the largest proportion of PEPs in the Python community, and so are what the community mainly discusses. In general, decisions made via consensus and lazy consensus were the most common across all three PEP types.

| RQ 3: An approach for rationale extraction
RQ 3 asked, How can rationale for PEP decisions be extracted automatically from Python email archives? To answer this question, we developed the heuristics-based approach described in Section 4.2, and implemented it as the Rationale Miner tool. The SBS allows quick access directly to the candidate rationale sentences for each PEP, but may require a user to scroll through many results to find the actual rationale sentence. The MBS, on the other hand, may be more effective, as it displays the messages to which the SBS sentences belong, allowing a user to find the rationale sentence more quickly. The individual sentences containing rationale from SBS are underlined in the MBS (in blue) within the message panel (panel 3 in Figure 7). This saves a user from having to go through the entire message to locate the sentence, and should therefore make the MBS even faster than the SBS at finding the rationale sentence. Note that messages can sometimes be quite long, requiring users to scan through them to find the highlighted sentence.
The Rationale Miner is not always able to capture the actual rationale-containing sentence and message at the topmost ranks. Nevertheless, we found that the captured sentences and messages at these ranks are still useful in both schemes, as going through them enables the user to quickly grasp how discussion unfolded during the DM for a PEP. Moreover, the sub-states for each PEP (polling, voting, consensus), extracted in our previous work, 5 are presented to the user, as shown in panel 2. Upon selecting these sub-states, the respective messages containing them are automatically displayed in the message panel in the middle.
In any approach to exploring the rationale (sentences in the SBS, messages in the MBS) or the sub-states, a user can simply keep pressing the up and down arrow keys on the keyboard and the respective messages are shown (elaborated in the video). This makes it much faster to go through and locate the actual reasons, grasp how the discussion regarding the PEP decision unfolded, or view the DM sub-states in the PEP.

| RQ 4: Effectiveness of the approach
RQ 4 asked, What is the effectiveness of our approach for extracting PEP decision rationales? In this section, we discuss the results from three approaches employed to demonstrate the effectiveness of Rationale Miner.

| Evaluation approach 1: Comparing rank-ordered results
For evaluation, we first compared how the SBS and MBS ranked the sentences from the ground truth dataset for PEPs that were accepted or rejected. The number and percentage of ranked sentences at each position k are shown in Table 6. Within the SBS rankings, the topmost results (k = 1) capture 15.7% (accepted) and 29.8% (rejected) of the ground truth sentences. In this regard, the MBS performed even better: 20.2% and 47.1%, respectively. Focusing on the top 5 results, the SBS ranked 39.7% (accepted) and 48% (rejected) of ground truth sentences within the top 5 results. The MBS again improves these results with values of 61.5% and 74.4% respectively, highlighted in bold in Table 6. As expected, in both schemes the ground truth sentences captured within the top k increase with k. However, a large portion of ground truth sentences are ranked within the top 5, which is encouraging (highlighted in Table 6). In addition, for the MBS, the top 100 ranks capture most of the rationale sentences for both states.

FIGURE 7 The Rationale Miner GUI (for a larger version, view https://figshare.com/s/621bfd5c89826c9b3ba0)

Video: https://www.youtube.com/watch?v=nrB9Jk1OFXo (for simplicity, in the video, the Rationale Miner is referred to as "Reasons Miner").
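The top-k percentages of the kind reported in Table 6 can be computed with a few lines of Python. The sketch below is illustrative rather than the authors' implementation: the function name and the example ranks are hypothetical, and a rank of `None` stands for an unmatched ("No match") sentence.

```python
from typing import Optional, Sequence


def capture_rate(ranks: Sequence[Optional[int]], k: int) -> float:
    """Percentage of ground truth sentences ranked within the top k.

    `ranks` holds one entry per ground truth sentence: its 1-based rank in
    the system output, or None if the system did not return it at all.
    """
    if not ranks:
        return 0.0
    hits = sum(1 for rank in ranks if rank is not None and rank <= k)
    return 100 * hits / len(ranks)


# Hypothetical ranks for ten ground truth sentences (not the paper's data).
example_ranks = [1, 3, 4, 7, 12, 2, None, 40, 5, 1]

for k in (1, 5, 10):
    print(f"top {k}: {capture_rate(example_ranks, k):.1f}%")
```

Because the set of sentences within the top k can only grow as k grows, the capture rate is non-decreasing in k, which matches the trend described above.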
Comparing the results of SBS and MBS, we noted two aspects. First, more rationale sentences are captured by MBS than SBS at each rank k. The reason is that a ranked message in MBS can have more than one rationale sentence. For instance, if the SBS has ranked two sentences from the same message at ranks 3 and 8, the MBS will rank the message containing these sentences at rank 3. MBS thus has an inherent advantage over SBS. Second, there are fewer unmatched sentences in the MBS than the SBS. Unmatched sentences are those rationale sentences from the ground truth dataset that were not shown in the results of the heuristics-based system (see "No match" in Table 6). This is because these sentences did not have patterns matching any of the heuristics. Importantly, the MBS has fewer unmatched sentences because unmatched sentences in SBS may still appear in the MBS approach in the same message as sentences that are already matched.
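The relationship between the two schemes can be sketched as a simple aggregation step. The Python below assumes that a message inherits the best (lowest) rank among its sentences, as in the ranks 3 and 8 example; the exact aggregation used by the tool is not spelled out here, and the identifiers are hypothetical.

```python
from typing import Dict, List, Tuple


def message_ranking(sentence_ranking: List[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Collapse a ranked list of (sentence_id, message_id) pairs into messages.

    Each message keeps the best (lowest) 1-based rank of any of its sentences,
    and messages are returned ordered by that best rank.
    """
    best: Dict[str, int] = {}
    for rank, (_sentence_id, message_id) in enumerate(sentence_ranking, start=1):
        # The list is already ordered, so the first occurrence is the best rank.
        best.setdefault(message_id, rank)
    return sorted(best.items(), key=lambda item: item[1])


# Hypothetical SBS output: sentences s3 and s8 both come from message m3.
sbs = [
    ("s1", "m1"), ("s2", "m2"), ("s3", "m3"), ("s4", "m4"),
    ("s5", "m5"), ("s6", "m6"), ("s7", "m7"), ("s8", "m3"),
]
mbs = message_ranking(sbs)
print(mbs)
```

Here the sentence at rank 8 is surfaced together with the one at rank 3, which is the inherent advantage of the MBS noted above.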
TABLE 6 Heuristics-based approach versus ground truth

To formally evaluate our heuristics-based approach we used the NDCG metric, as discussed in Section 4.5. Under both SBS and MBS, we calculated the average NDCG for each rank k (based on Equation (3)) for accepted and rejected PEPs. We used different thresholds for this evaluation (k = 5, 10, 15, 30 and 50). Figure 8 summarizes the results.
The chart shows that the MBS outperforms the SBS for both accepted and rejected PEPs at virtually every rank k. This corresponds to the findings of the rank-ordering comparison presented in Section 5.3.1. For rejected PEPs, for example, the MBS scores are greater than the corresponding SBS scores at all ranks k. In particular, MBS for rejected PEPs outperforms SBS by 50.2% at k = 5, 46.8% at k = 10, 42.11% at k = 15, and 36.46% at k = 50.
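For readers unfamiliar with the metric, the sketch below computes NDCG at rank k using the common formulation, in which the gain at position i is discounted by log2(i + 1) and the result is normalized by the score of an ideal ordering. This is an assumption: Equation (3) is not reproduced in this section and may differ in detail. Relevance is taken to be binary (1 for a ground truth rationale sentence, 0 otherwise), and the example ranking is hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (1-based)."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG at k divided by the DCG of the ideal (descending) ordering."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical ranking: rationale sentences appear at positions 3 and 5.
ranking = [0, 0, 1, 0, 1]
print(round(ndcg_at_k(ranking, 5), 3))
```

A ranking that places every rationale sentence first scores 1.0, and the score falls as rationale sentences are pushed further down the list.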

| Evaluation approach 3: Baseline comparison
Having manually identified the 300 ground truth rationale-containing sentences,¶ we investigated which term types were included in these sentences. As shown in Table 7, we found that 83% of the ground truth rationale sentences contained Reason (R) terms. Also, there were several co-occurring term patterns in rationale sentences (see Table 8, with Proposal Identifier (PI) and Reason (R) occurring in 63% of these sentences). The first line of results in Table 7 reveals that 17% of sentences do not contain rationale terms. For those, we have to deduce the rationale behind PEP decisions implicitly, highlighting the necessity of a heuristics-based approach. Also, the presence of a rationale term alone cannot guarantee that the sentence contains rationale corresponding to a decision made. The results in Tables 7 and 8 together show the prevalence of the patterns we have identified, thus highlighting the importance of term patterns in rationale identification. The Rationale term type is the most prevalent in rationale sentences, as shown in Table 7. In terms of term patterns, the pattern consisting of Proposal Identifier together with Rationale is the most prevalent, as shown in Table 8.
We next ran two experiments using the two baseline configurations defined in Section 4.5 (B1 and B2 in Table 8). It can be inferred from Table 8 that Baseline 1 occurs in only 1% of the rationale sentences. Baseline 2 is better (13%), but both baseline results are worse than the worst-case results for SBS and MBS, i.e., the top 1 results that range from 15.7% to 47.1%, as shown in Table 6. The main limitation of both baselines is that they did not include the rationale terms (R) that are specific to PEPs. Not having these as part of any pattern (i.e., term combination) significantly impacts the baseline results. We indeed anticipated that our approach would rank more rationale sentences than Baseline 2, as our approach integrated a range of 9 additional patterns (Table A2) that were sourced from the literature and our manual data analysis. Embedding these patterns enabled us to capture more rationale sentences from the Python developer mailing list discussions.
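To illustrate what a term pattern check involves, the Python sketch below tests whether a sentence contains both a Proposal Identifier (PI) and a Reason (R) term. The term list and the regular expressions are hypothetical stand-ins chosen for illustration; the actual term lists used by Rationale Miner are not reproduced in this section.

```python
import re
from typing import Set

# Hypothetical Reason (R) terms, for illustration only.
REASON_TERMS = ("consensus", "majority", "pronouncement", "little support")
REASON_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in REASON_TERMS) + r")\b",
    re.IGNORECASE,
)
# Assumed form of a Proposal Identifier (PI): "PEP" followed by a number.
PROPOSAL_PATTERN = re.compile(r"\bPEP[\s-]?\d+\b", re.IGNORECASE)


def term_types(sentence: str) -> Set[str]:
    """Return the set of term types ("PI", "R") found in the sentence."""
    found = set()
    if PROPOSAL_PATTERN.search(sentence):
        found.add("PI")
    if REASON_PATTERN.search(sentence):
        found.add("R")
    return found


def matches_pi_r(sentence: str) -> bool:
    """True when the sentence contains both a PI term and an R term."""
    return {"PI", "R"} <= term_types(sentence)


print(matches_pi_r("PEP 308 was rejected because there was no clear majority."))
print(term_types("PEP 572 is scheduled for discussion next week."))
```

As noted above, a match on a term pattern only makes a sentence a candidate; it does not guarantee that the sentence states the rationale for a decision.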

| DISCUSSION
We reflect on our research questions and discuss the key contributions in this section.

| RQ 1: OSS decision rationale in the literature
We discussed the RQ 1 results in Section 3. We have outlined the various schemes and rationales behind OSS design decisions found in the literature (see Tables 1 and 2) and have highlighted that there is a lack of studies that identify and quantify the rationales for OSS decisions using a data-driven approach.

| RQ 2: Rationales behind PEP decisions
Our work identified a total of 11 rationale types. Mrówka 18 previously identified five rationales ("approval types") based on a qualitative analysis of the Apache project. We present six more here. Also, Mrówka's work qualitatively described the process model for decision-making (e.g., "consensus" exists), whereas our work quantifies the rationales for decisions (e.g., 57/248 were "consensus" decisions). Furthermore, Mrówka did not develop a practical tool.
It is generally perceived that the Python OSS community governance adopts a strict hierarchical and authoritarian decision-making structure.
However, our results show that the Python governance framework is not completely authoritarian and the community's view is largely adhered to. For most PEPs, the community view is acquired and represented via different mechanisms, and in this way the project leader transfers the burden of coming to a decision to the community. The most common form is through community consensus (including lazy consensus and rough consensus). This helps prevent situations where the decisions of the dictator constantly override the community's collective view.
There were occasions when the BDFL overrode the established community view, but this was uncommon, occurring in only two PEPs, representing less than one percent of all conclusive state changes we considered. The first one was PEP 326 (A Case for Top and Bottom Values), where he overruled the community majority. In his rejection message for the PEP, the BDFL emphasized that "Python is not a democracy". 66 Similarly, the BDFL overruled a community consensus on PEP 318 (Decorators for Functions and Methods).** He rejected the syntax chosen by the community, saying that he did not like it. In a similar vein, an official vote on PEP 308 (Conditional Expressions) could not find a community majority, and it was subsequently rejected by the community. However, two years later, PEP 308 was accepted by the BDFL. It is harder to say that the BDFL overruled a majority view here. Rather, it can be said that he resolved a long-standing issue by making a concrete decision. Overall, the BDFL uses this mixture of two approaches (consensus and authoritarian), which seems to work well for the success of decision-making and the Python project.
To evaluate the accuracy of our extracted rationales (11 of them shown in Table 5), we presented our findings to a prominent Python core developer (name withheld to preserve anonymity). This person has been a core developer of the Python community since 2004, has been identified as one of the top Python members in terms of PEP authorship or co-authorship, and is a proxy for the BDFL (i.e., BDFL delegate). 67 The member is also a former Python steering council member. He replied, "I think your list of reasons looks good", suggesting that the extracted rationales which are used in Python decision-making are good (i.e., reasonable).

| RQ 3: An approach for rationale extraction
Our study has produced a computational tool that can extract the rationale behind conclusive state changes (accepted or rejected) of technical proposals in OSS projects. Using the Rationale Miner, we have extracted and ranked the rationale behind decisions on proposals to enhance the Python language (PEPs) based on email discussions. In doing so, we identified the rationale behind these conclusive states for PEPs. To identify the rationales behind decisions on design proposals in OSS projects stored within email repositories, we have addressed two major challenges: the variety of ways rationale can be stated in a sentence (a challenge in causal extraction in general), and the fact that the state change and the rationale behind it are often reported in completely different email messages. In designing a solution to address these complexities, we proposed a heuristics-based Rationale Miner. The Rationale Miner includes two schemes for finding the rationale behind the conclusive states of a proposal: SBS and MBS, both of which offer distinct capabilities. This architecture enables Rationale Miner users to find the rationale behind decisions on accepted and rejected proposals and also offers tools for a deeper analysis of how the decisions on the proposal unfolded during community discussions.

** The state of the PEP was made final rather than accepted or rejected, a practice used by the Python project to save selected PEPs for future reference.
The SBS presents a user with candidate rationale sentences, providing a more direct and quick way of seeing just the main information on how the decision was made. The MBS, in contrast, displays entire messages, and in doing so, shows the same rationale sentences grouped at the message level, allowing a user to find the rationale sentence more quickly than with the SBS. Since entire messages are shown, this scheme also allows a user to understand the context of the decisions. The rationale sentences in these messages are underlined in the Rationale Miner GUI (Figure 7), and this practice of highlighting key information has been reported to reduce the cognitive load of users. 68 This also reduces the amount of time needed to go through the candidate rationale sentences in these messages to find the actual rationale sentence. Furthermore, the Rationale Miner GUI supplements the above-mentioned schemes (SBS and MBS) with additional decision-making information regarding each PEP, i.e., the main states and sub-states we extracted (shown in panel 2 in Figure 7). Selecting these displays the corresponding messages, and this facility helps a user to better grasp the context behind decisions.
The Rationale Miner tool will enable new users joining the project to quickly grasp how decisions are made in the project. The extracted rationales can also be a point of reference for how to make decisions (i.e., leaders can employ different types of rationales in decision-making based on the need). The tool helps users identify the rationale in the sentences faster. Thus, instead of reading through hundreds of messages, they find the rationale inside the top 15 results about two thirds of the time (as highlighted in Section 5.3.2).
This usefulness has been supported by the Python evaluator in the context of the Python project. We asked the same Python core developer (see the previous section) whether a tool that extracts how PEP decisions are made would be useful. In his response, the core developer highlighted the importance of such a tool: "Yes, I think the idea is valuable, as it should be possible for at least Steering Council members to compare their subjective impressions of outcomes and rationale with the output of the tool, and potentially use that comparison to help assess how well the PEP process is actually working."
There are other OSS projects that follow the Python model of formal proposals that are openly discussed and debated via email mailing lists, for example, the Java Community Process, which proposes enhancements to the Java language via JDK Enhancement Proposals (JEPs).† The Rationale Miner tool can be extended to mine rationales from such projects by reconfiguring it to cater for the conclusive decision-making states specific to the target project, which may be different from those used in Python. For instance, the conclusive main states for JEPs are Candidate (same as accepted) and Rejected.‡‡ Our approach can also be used to mine rationale in other projects that have PEP-like processes, such as the Bitcoin and Ethereum projects, which use Bitcoin Improvement Proposals (BIPs)§§ and Ethereum Improvement Proposals (EIPs),¶¶ respectively.
In essence, we have a tool that can be applied to other domains by changing certain parameters (e.g., states). In a typical organization, decisions are conveyed and recorded in email messages and can be extracted and ranked for decisions on matters urgent to the organization. Heuristics can be modified (added or removed) and the scores adjusted to extract the rationales behind decisions. All in all, we believe that the Rationale Miner provides a good foundation for future work due to its ability to be customized to mine rationale from other domains.

| RQ 4: Effectiveness of the approach
Comparing the results of the two schemes described in this work, the MBS outperforms the SBS. It correctly ranks 20% of ground truth rationale sentences belonging to accepted PEPs and 47% of sentences belonging to rejected PEPs at the top-ranked results (see Table 6). Focusing on the top 10 ranked results, MBS captures 74% (accepted) and 86% (rejected) of rationale sentences, and the top 15 increases this to 82% (accepted) and 91% (rejected). Not all rationale sentences are captured at the first rank. There are two main reasons behind this. First, there may be little or no evidence of a rationale in certain sentences. Consider the sentence "Augmented assignment is scheduled to go in soon (well, before 2.0b1 at least) and if you don't spot the loony now, we'll have to live with it forever:)." 69 This sentence implies that lazy consensus was used to make a decision on the PEP. To a human, this might be clear (especially if they are familiar with Monty Python tropes), but from the viewpoint of heuristics, there are no obvious patterns that can be identified here.
The second and most influential reason is that some rationale-containing messages appear a long time before or after the PEP's decisive state change (accepted or rejected) is committed. For instance, we found that sometimes a community member conveys the rationale in a message, but it is only much later that the formal acceptance or rejection takes place (sometimes six months or, in one extreme case, up to two years later). Also, sometimes committing the PEP decision (accepted or rejected) may be postponed by the core developer responsible for committing the PEP states, in order to batch commit several outstanding PEPs' state changes all at the same time. Since our work focuses on rationale sentences that are closer to the decision dates (within the DFSC heuristic) and ranks them higher, sentences within messages written much earlier or much later are given a lower score. Consequently, this impacted the performance of the system in some cases.

† https://openjdk.java.net/jeps/0
‡‡ https://openjdk.java.net/jeps/1
§§ https://bips.xyz/
¶¶ https://eips.ethereum.org/
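The effect of date proximity on ranking can be illustrated with a small Python sketch. It assumes only that the DFSC heuristic favors messages written close to the state-change date; the linear decay, the 30-day window, and the function name are hypothetical choices made for illustration and are not taken from the tool.

```python
from datetime import date


def proximity_score(
    message_date: date, state_change_date: date, window_days: int = 30
) -> float:
    """Score in [0, 1] that is highest on the state-change date.

    The score falls linearly with the number of days between the message and
    the state change (in either direction) and reaches 0.0 at `window_days`.
    """
    gap = abs((message_date - state_change_date).days)
    return max(0.0, 1.0 - gap / window_days)


decision = date(2005, 9, 30)  # hypothetical state-change date
print(proximity_score(date(2005, 9, 30), decision))  # same day
print(proximity_score(date(2005, 9, 15), decision))  # 15 days earlier
print(proximity_score(date(2003, 9, 30), decision))  # two years earlier
```

Under any scoring of this shape, a rationale stated months or years before the formal state change receives little or no proximity credit, which is why such sentences tend to be ranked lower.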
We also used the NDCG metric to evaluate our MBS results, as outlined in Section 5.3.2. For all ranks beyond the top 5, the average NDCG value was larger than 0.5 (see Figure 8). For example, for k = 15 (i.e., top 15 results), the average NDCG value was 0.6 for accepted PEPs and 0.73 for rejected PEPs.
MBS properly ranked rationale sentences 60% of the time for accepted PEPs and 73% of the time for rejected PEPs, according to this metric. If the exact order of the results is ignored, the match is considerably better within the top 15: 82% and 92%, respectively (refer to Table 6).
As a side note, we also developed machine learning models (classifiers) based on several algorithms (e.g., Naive Bayes and Random Forest) to classify whether a sentence contained rationale. The 13 heuristics were used as feature markers and the output label was whether or not a sentence contained rationale (classes were Yes or No). Separate models were trained for accepted and rejected PEPs. However, compared to the heuristics-based approach, the accuracy of these ML models was poor (less than 50% of sentences were labeled correctly). This may be largely because there were relatively few ground truth rationale sentences (300), meaning that the majority of sentences that included a term pattern (8497) were non-rationale sentences. Thus, our dataset was class imbalanced and the algorithms used were not able to cater for this issue. Also, deep learning techniques may not be suitable because of the small ground truth dataset. Given our goal to support users in rationale identification, we believe that the heuristics-based approach is still the better approach. Browsing through 10-15 messages to find rationale (in the worst-case scenario) versus reading over 100 messages on average is a significant reduction in both time and cognitive load for users.

| Demonstrating the practical utility of our approach
We use two examples to demonstrate the usage of our rationale extraction tool, both of which are also shown in the video mentioned in Section 5.2. First, we focus on the decision rationale for PEP 572 (Assignment Expressions), a very contentious PEP; the BDFL resigned soon after its acceptance. During our ground truth extraction we identified three rationale sentences for its acceptance. Of these, the most insightful one was: "It's really hard to choose between alternatives, but all things considered I have decided in favor of 'NAME := EXPR' instead," 70 which is an example of BDFL decree. The SBS approach ranked this sentence 10th, whereas MBS ranked the corresponding message 8th.
The second example focuses on the highly contentious PEP 308 (Conditional Expressions). The PEP's disputed nature is evident from the several decision-making phases it underwent, including a poll, a vote, and a complementary vote, after which it was rejected based on the voting outcome. However, the BDFL eventually accepted the PEP two years later. The best sentences conveying its acceptance decision rationale were in the PEP 308 summary message. 71 Two of the rationale sentences from this message were ranked 8th and 9th in the SBS, whereas MBS ranked the corresponding message 2nd.
During our evaluation we also found some surprising results. In comparison with our ground truth, better sentences were captured by the Rationale Miner tool in four instances. For instance, the following rationale sentence (manually labeled) for PEP 465 (A dedicated infix operator for matrix multiplication) was ranked 26th by SBS: "Because this way is compatible with the existing consensus, and because it gives us a consistent rule that all the built in numeric operators also apply in an elementwise manner to arrays; the reverse convention would lead to more special cases." A more insightful sentence, however, which provided more details of the "consensus", was ranked 7th by SBS: "the result is a strong consensus among both numpy developers and developers of downstream packages that numpy.matrix should essentially never be used because of the problems caused by having conflicting duck types for arrays." Both sentences were from the same message, 72 which MBS ranked 5th. While these surprises were observed in four instances, one could argue that these four cases were due to human error in the ground truth data (i.e., the human coder failed to correctly label a rationale sentence). This highlights that our system does a good job of correctly identifying rationale sentences. This, we feel, demonstrates the effectiveness of our approach.
The generalized version of the Rationale Miner framework is described next.

| The generic rationale miner framework for mining rationales from other OSSD repositories
This section describes the steps in the operationalization of the generic Rationale Miner framework for extracting DM processes from OSS projects.
The main stakeholders of the tool are newcomers to the Python OSS community of developers, users of Python wanting to understand open-source decision-making, and researchers interested in OSS decision-making, not only in Python but also in other similar OSS projects. While the framework has been designed and built for the Python project, extra functionality has been added to make it generic and extensible for extracting the rationale behind decisions from other similarly structured OSSD communities## that also use the concept of unique proposal identifiers to refer to proposals, for example, BIPs (short for Bitcoin Improvement Proposals) in the Bitcoin OSS community. Figure 9 presents the high-level architecture of the framework.

## The Rationale Miner tool is available at https://github.com/sharmapn/RationaleMiner. Please contact the first author for assistance with the set-up.
Step 1: Reading OSSD repositories: The first part of this step is defining parameters. The framework includes the following extensibility parameters in relation to the repository being analyzed: (i) the mailing list directories that store the raw email messages to be read by the program, (ii) the proposal identifier, for example, "JEP" for JDK Enhancement Proposal, and (iii) the URL of the database where the mailing list messages will be stored and on which all subsequent queries and analysis will be performed using the framework.
The second part of this step is the actual data extraction. Using the parameters defined above, the process reads the email messages from the mailing list archives, assigns a proposal number to each, and stores the messages in the database tables. The metadata about a proposal (e.g., proposal author) is read separately and is also stored in a database table.
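Step 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not the Rationale Miner's own code: the class, function, and table names are hypothetical, messages are passed in as plain strings rather than read from the mailing list directories, and an in-memory SQLite database stands in for the configured database.

```python
import re
import sqlite3
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class RepositoryConfig:
    """The three extensibility parameters listed for Step 1."""
    mailing_list_dirs: List[str]  # (i) directories holding raw email messages
    proposal_identifier: str      # (ii) e.g., "PEP", "JEP" or "BIP"
    database_url: str             # (iii) where the messages are stored


def proposal_number(text: str, identifier: str) -> Optional[int]:
    """Return the first proposal number mentioned in the text, if any."""
    pattern = rf"\b{re.escape(identifier)}[\s-]?(\d+)\b"
    match = re.search(pattern, text, re.IGNORECASE)
    return int(match.group(1)) if match else None


def store_messages(config: RepositoryConfig, messages: Iterable[str]) -> sqlite3.Connection:
    """Assign a proposal number to each message and store it in the database."""
    conn = sqlite3.connect(config.database_url)
    conn.execute("CREATE TABLE IF NOT EXISTS messages (proposal INTEGER, body TEXT)")
    for body in messages:
        number = proposal_number(body, config.proposal_identifier)
        conn.execute("INSERT INTO messages VALUES (?, ?)", (number, body))
    conn.commit()
    return conn


# Hypothetical configuration and messages for illustration.
config = RepositoryConfig(["archives/jdk-dev"], "JEP", ":memory:")
conn = store_messages(config, ["Re: JEP 286 looks ready", "General build question"])
print(conn.execute("SELECT proposal FROM messages ORDER BY rowid").fetchall())
```

Switching to another community with unique proposal identifiers would, in this sketch, amount to changing the three configuration values.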
Step 2: Automatic rationale extraction and data exploration: The stored data can be analyzed automatically to retrieve rationale related to the proposals as discussed in the paper. However, not all rationales may be identified due to differences in the application domain. The next step will help to bridge the gap.
Step 3: Semi-automated extraction of remaining rationales: To get a concrete overview of DM in an OSS project, we propose that a human analyzes a small portion (10%) of the dataset to identify the patterns of how rationale behind decisions is expressed in the particular OSS project. In particular, the Term types patterns in Sentence (TTPS) will need to be properly established. For instance, a special identifier may precede rationale. The GUI we developed to explore the messages for each proposal in the repository, as shown in Figure 7, can be used for this task.
Step 4: Remodelling heuristics: Based on these patterns, the heuristics in the framework may need to be remodeled. This includes changing the values in each of the 13 heuristics, as well as adding and removing heuristics. For OSS projects similar to Python, only the three term-type heuristics (TPCS, TPMS, and TPRP) will need modifications.
Step 5: Ranking candidate rationale sentences and messages: Once the heuristics are computed, a user can be presented with the ranked candidate rationale sentences and messages for a proposal via the SBS and MBS through the GUI shown in Figure 7.
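Steps 4 and 5 can be sketched together: heuristic scores are combined into a single value per candidate sentence, and candidates are sorted by that value. The Python below assumes a weighted sum, which is an illustrative choice rather than the tool's documented scoring rule; the weights, scores, and sentence identifiers are hypothetical, and only the heuristic names (TPCS, TPMS, TPRP, DFSC) come from the text.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical weights for four of the 13 heuristics.
DEFAULT_WEIGHTS: Dict[str, float] = {"TPCS": 1.0, "TPMS": 1.0, "TPRP": 1.0, "DFSC": 1.0}


def total_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of heuristic scores; heuristics without a weight count as 0."""
    return sum(weights.get(name, 0.0) * value for name, value in scores.items())


def rank_candidates(
    candidates: Dict[str, Dict[str, float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[Tuple[str, float]]:
    """Return (sentence_id, score) pairs ordered from highest to lowest score."""
    active = DEFAULT_WEIGHTS if weights is None else weights
    scored = [(sid, total_score(scores, active)) for sid, scores in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical heuristic scores for three candidate sentences.
candidates = {
    "s1": {"TPCS": 0.9, "DFSC": 0.5},
    "s2": {"TPMS": 0.2},
    "s3": {"TPRP": 0.7, "DFSC": 0.9},
}
ranked = rank_candidates(candidates)
print([sid for sid, _score in ranked])

# Remodelling (Step 4): removing a heuristic by leaving it out of the weights.
without_dfsc = {"TPCS": 1.0, "TPMS": 1.0, "TPRP": 1.0}
reranked = rank_candidates(candidates, without_dfsc)
print([sid for sid, _score in reranked])
```

In this sketch, remodelling for a new project touches only the weights and the functions that produce the per-heuristic scores, while the ranking step stays unchanged.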

FIGURE 9 The operationalization of the generic Rationale Miner framework
As future extensions, to extract rationale from other OSS projects, we will first consider projects that are structured similarly to the Python project, such as the OpenJDK community. Only minimal changes will need to be made to our framework to study the rationales in this community. Other OSS projects with a similar structure to the Python community that can be considered in the future include those using Bitcoin Improvement Proposals (BIPs) and Ethereum Improvement Proposals (EIPs).*** We also intend to unearth the rationales used in OSS projects that do not have a BDFL role or a formal decision-making process (as indicated by Scacchi et al. 74 ). To carry out a similar and comparable data-driven study, we intend to apply our Rationale Miner tool to developer discussions in these projects to extract the rationale behind OSS design decisions. We have earmarked these as part of our future work, where these aspects can be analyzed and compared.

*** The tool has already been used to analyze some BIP discussions. 73

In this work, we have laid down the foundation for a generic extensible framework for unearthing DM processes from OSSD email repositories. Through some extended work to generalize the tool, a robust framework can be made available.

| Implications
This section provides an overview of the knowledge contributions and the implications for research and practice.
Our findings make several knowledge contributions. First, by unearthing rationales using a manual approach, our work makes the decision-schemes transparent. This work identifies 11 different decision-schemes used in Python decision-making, adding six more to what was previously known (11 compared to Mrówka's 5). Second, it quantifies the uptake of these decision-schemes (i.e., frequency of use of these schemes) in the Python context. Third, it demonstrates that the decision-schemes used are nuanced (i.e., there are only subtle differences between some of the decision-schemes, e.g., consensus vs. majority vs. lazy consensus) and that these are employed for specific purposes within decision-making, thus highlighting the richness and complexity of decision-making schemes in OSS development. This information will be useful for both new and existing Python developers, project leaders, researchers, and Python users, as they can develop a better understanding of the nuanced schemes at work in decision-making. Fourth, by presenting a framework to extract rationale automatically by employing 13 heuristics, this work serves as an exemplar for extracting rationales from developer communication. The set of heuristics developed can be used as a template for other works to identify rationales in developer communication artifacts in other OSS and proprietary software development projects. Fifth, an approach for extracting decision-schemes from other OSS communities has been discussed.
Our work also lends support to the conceptual (theoretical) understanding of OSS projects that have elements of both meritocracy and dictatorship approaches. First, our work demonstrates that in an OSS project where contributions come from several core members, the project leader still has an important role in resolving conflicts and addressing situations where there is no clear consensus about decisions being made. While Python mostly operates as a meritocracy-based approach where decisions are mainly driven by discussions and consensus, there are instances where the leader intervenes and makes a decision considering his vision for the language, and thus may override a majority decision. Second, while the majority of the OSS enhancement decisions are based on the collective view of the community, these are seldom easily approved. Robust discussions take place, and decision-schemes (e.g., lazy consensus) are used for making decisions. Third, our work reveals that the robust decision-making culls inept PEPs. More than 40 inept PEPs that have been proposed have been weeded out as a part of the discussions.
Researchers can use the Rationale Miner to carry out comparative studies across communities and to study the nature of the decision rationales that are employed (similar to what we presented in Figure 5). Beyond identifying rationales, the Rationale Miner tool can be used for other purposes, such as conducting a detailed retrospective analysis to learn when, why, and how decisions were taken, as the specifics of these conversations are not contained within PEP documents. For example, why a PEP was declared to be inept can be investigated using this tool. Furthermore, using Rationale Miner on other OSSD repositories (e.g., Bitcoin Improvement Proposals) would allow us to gain a better understanding of the decision-schemes used in those projects, as well as to uncover parallels and differences in decision-schemes across these projects. Rationale Miner can also be employed in other domains such as tender processing. For example, a company soliciting tenders from different vendors may implement a process with a definite closing date for the tender. The rationale for applications meeting or not meeting specific criteria for further consideration may be captured in email messages. Such criteria can be captured using the tool (as meeting or not meeting them is similar to acceptance or rejection in PEPs). Similarly, the tool could be applied in the recruitment domain to find the rationale behind candidate selection or rejection.
A number of interesting research questions can be answered in the future, for instance: (a) what is the nature of arguments for and against acceptance (e.g., strong/weak, evidence-driven/anecdotal), (b) what signatures facilitate PEP acceptance, and (c) how can bad PEPs (e.g., inept PEPs) be avoided at the outset? Thus, our work will spur further investigations.
There are three main implications of our work beyond the Python community: (i) reuse of our methodological approach (i.e., the 13 heuristics) as a template for rationale extraction in other communities, (ii) discovery of more rationale types (11 compared to Mrówka's 5) that may be of interest to different OSS communities, and (iii) learning why PEPs fail, and using that information to avoid failures in other communities.

| Threats to validity
This section discusses the main threats to construct, internal, and external validity in our study, and how we have handled them. Threats to construct validity refer to the appropriateness of our evaluation measures. Based on prior research and our manual analysis of Python discussions, we used 13 different heuristics in our scoring construct. We acknowledge that we may have missed additional heuristics that might be useful in finding rationale sentences. We used the NDCG metric for the formal evaluation of rationale sentences. Since this metric has previously been used to evaluate the ranking of emails, the threats to the use of our scoring construct are minimal.
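For readers unfamiliar with NDCG, the following is a minimal sketch of how the metric scores a ranked list of candidate sentences. It assumes binary relevance labels (1 if a ranked sentence is a ground-truth rationale sentence, 0 otherwise), a linear gain, and a log2 position discount; these choices, the function names, and the example labels are illustrative assumptions and are not taken from the Rationale Miner implementation.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items of a ranked list.

    The item at 0-based position `rank` is discounted by log2(rank + 2),
    so the top-ranked item is not discounted at all.
    """
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical labels for the top 5 ranked sentences of one PEP
# (assumption: 1 = rationale sentence, 0 = not a rationale sentence).
ranked = [0, 1, 0, 0, 1]
perfect = [1, 1, 0, 0, 0]

if __name__ == "__main__":
    print(f"NDCG@5 of example ranking: {ndcg_at_k(ranked, 5):.3f}")
    print(f"NDCG@5 of ideal ranking:   {ndcg_at_k(perfect, 5):.3f}")
```

A ranking that places all rationale sentences first scores 1.0, and the score drops as rationale sentences appear lower in the list, which is why the metric suits an evaluation focused on the top few results.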
A threat to the internal validity of the study is posed by messages that cannot be assigned to a proposal. Capturing all messages for each PEP is a difficult task, particularly for messages that do not mention a specific PEP number or PEP title. However, we do not believe this to be an issue, as our investigation shows that once a PEP number is assigned, developers tend to include PEP numbers in the email messages in which decisions on proposals are stated, especially for acceptance and rejection. Moreover, messages with decisions are often quoted and referred to by developers in discussions after the decision, and if some of these messages had been missed, it would have been obvious in our ground truth exploration. Nevertheless, we concede that we may have missed some messages that contain a PEP-related discussion.
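The assignment problem described above can be illustrated with a small sketch that links messages to proposals by the PEP numbers they mention. The regular expression, the function names, and the sample messages are hypothetical and serve only to show why messages without an explicit PEP number are hard to assign; this is not the matching procedure used in our study.

```python
import re

# Assumed pattern: "PEP", optional whitespace or hyphens, then 1-4 digits.
PEP_REF = re.compile(r"\bPEP[\s-]*(\d{1,4})\b", re.IGNORECASE)


def pep_numbers(message_text):
    """Return the sorted, de-duplicated PEP numbers mentioned in a message."""
    return sorted({int(m.group(1)) for m in PEP_REF.finditer(message_text)})


def assign_messages(messages):
    """Group message ids by mentioned PEP number.

    Messages that mention no PEP number are collected under the key None;
    these are the messages at risk of being missed.
    """
    by_pep = {}
    for msg_id, text in messages.items():
        for number in pep_numbers(text) or [None]:
            by_pep.setdefault(number, []).append(msg_id)
    return by_pep


if __name__ == "__main__":
    # Hypothetical messages, invented for illustration.
    sample = {
        "m1": "I am accepting PEP 572 after the long discussion.",
        "m2": "I still dislike the proposed assignment expression syntax.",
    }
    print(assign_messages(sample))
```

In this sketch the second message discusses a proposal without naming it, so it ends up unassigned, which mirrors the limitation acknowledged in the paragraph above.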

KEYWORDS consensus, decisions, decision-making, open source software development (OSSD), PEPs, python

Figure 3
Figure 3 illustrates how the variables interplay and point towards messages containing the rationale. It shows the timeline of a typical PEP discussion and its DM, from the initial state change to the conclusive state change where the PEP is either Accepted or Rejected. It emphasizes the

Figure 3 in blue and the patterns (combination of terms) that indicate a rationale are shown in green. These four message types are elaborated below.

-based Response Messages to the Same State Change Message (RMSSCM), Refs. 5, 53; Rationale Found Using Triple Extraction (RFUTE); Special ids; Decision Terms In Message (DTIM); Decision Terms as Header of current Paragraph (DTHP); New

FIGURE 4 The MBS groups the results of SBS by showing only the messages of sentences ranked in the SBS results

Figure 7
Figure 7 shows the graphical user interface (GUI) of Rationale Miner. This tool is designed to help users grasp the rationale behind how PEP decisions are made. For each PEP, it allows users to explore the outcomes of both the sentence-based and message-based approaches for retrieving rationale. Search parameters such as the PEP number are entered in Panel 1 (see green numbering). Thereafter, a timeline of the states traversed by the selected PEP (e.g., draft or accepted) is shown in Panel 2. To obtain all the candidate sentences or candidate messages, the user then chooses either the sentence-based or the message-based approach, respectively, in Panel 4, together with whether they are interested in accepted or rejected PEPs. The results are shown in Panel 5 when the Extract Reasons button is pressed. Upon selecting a result row in Panel 5, the

Table 4
Table 4 shows the variation in rationale sentences captured in the top 5 rankings for accepted (Acc.) and rejected (Rej.) PEPs as we include each heuristic. The influence of the variables is divided into five groups, as shown in the table. If all the values for a heuristic are positive (e.g., see first row), this

7
Relative contribution of term types in the 300 ground truth rationale sentences
Relative contribution of term patterns in the 300 ground truth rationale sentences
B2: (State & decision terms & rationale identifier) or (decision terms & entity): 41 (13%)