Automated class diagram elicitation using intermediate use case template

Class diagrams, being more illustrative, provide an easier way of understanding software requirements than use case descriptions. Both manual and automated methods are used to extract class diagrams from requirements. The automated techniques employ extraction rules and natural language processing methods. Each use case description template introduces a small set of extraction rules. However, when all types of templates are considered, the number of rules becomes large and the procedure becomes tedious. Researchers have therefore restricted the class diagram extraction procedure to a few specific use case description templates. Such a restriction negatively affects software developers, who are limited to working with only those templates. The work proposed in this paper strives to remove this restriction by introducing an intermediate template. The traditional use case description templates are transformed into the intermediate template, and the rule extraction procedure is then applied to this intermediate template. This reduces the total number of extraction rules and hence brings down the extraction complexity. The class diagrams extracted from use case description templates of different domains using the proposed technique show higher accuracy in terms of completeness and correctness when compared with state-of-the-art approaches.


| INTRODUCTION
The functional requirements of software are generally documented in natural language (NL) following either an unstructured format or a semi-structured format. In an unstructured format, functional requirements are written in simple paragraphs, while in a semi-structured format they are documented using specific templates and keywords. A major drawback of using NL for documenting functional requirements is that such documents tend to become ambiguous and complex [1]. As a result, software developers tend to prefer a graphical representation of software requirements, such as unified modelling language (UML) class diagrams. These graphical notations avoid ambiguities in specifications, are easier to read, and thus facilitate discussion between software analysts and stakeholders.
Literature suggests that the extraction of class diagrams from the requirement text is done either manually or in an automated manner. The automated extraction utilizes natural language processing (NLP) techniques for analysis of the text, followed by the formulation of extraction rules based on the syntactic structure of the sentence. These techniques work well for unstructured formats; however, several difficulties arise when the same technique is applied to semi-structured text, such as the inaccurate association of methods with extracted classes. Consider, for example, a semi-structured requirement that contains a sentence such as 'System asks for the security name'. In this case, 'system' is extracted as a class according to an existing rule such as 'the subject is extracted as a class' [2]. However, words such as 'system' treat the entire software as a single unit. All other methods therefore get associated with this 'system' class, and software development becomes more complex. Another difficulty that usually arises when using NLP techniques for class diagram extraction from semi-structured text is that a large number of rules are generated. Use case descriptions have various templates, and each template has a different set of keywords. Each set of keywords generates some rules, and eventually a large number of rules need to be considered for the overall extraction procedure. To avoid working with a large set of rules, researchers (e.g. [3][4][5]) have proposed working with specific templates and have applied rule generation to these specific templates. However, this restricts developers to a certain number and type of templates. As a result, there is a need for techniques that can extract class diagrams from all kinds of available templates while considering only a limited set of rules.
In this paper, we propose an automated procedure for class diagram elicitation from use case templates. Our proposal currently supports 20 use case templates, which are obtained from industrial white papers (e.g. [6,7]), technical reports/research papers (e.g. [8,9]) and book chapters (e.g. [10,11]). Our contribution can be summarized as follows:
- To convert the use case template into an intermediate template and then to derive the extraction rules based on the keywords present in the intermediate template. The template generated by the conversion contains fewer keywords and, as a result, reduces the complexity of the class diagram extraction procedure.
- To formulate the extraction rules considering negative and passive sentences, which ultimately increases the accuracy of the extracted class diagram.
The results obtained on a number of use case templates using our approach show significant improvement compared to the state-of-the-art approaches to class diagram extraction, as discussed in the result section of the paper.
The rest of the paper has been organized as follows: Related work in this field is summarized under Section 2 whereas Section 3 consists of a comprehensive description of our proposed methodology which also includes discussion on the rules for extracting class diagram elements. Section 4 describes the evaluation of our proposed methodology and finally, in Section 5 we present the conclusion alongside some future work.

| RELATED WORKS
Extraction of conceptual models such as UML, entity relationship diagram (ERD), and data flow diagram (DFD) models from NL functional requirement has been well researched and is in practice for quite some time. We have presented a brief summary of some of the research work found in the literature. The summary starts with tools proposed for the extraction of UML and domain models followed by a survey of research work proposed for DFD and ER diagrams.
While considering the extraction of UML models from NL requirements, researchers have considered requirements available in both unstructured and semi-structured formats. Linguistic Assistance for Domain Analysis (LIDA) [12], GOOAL [13], and CM Builder [14] are tools that have been proposed for extraction of class diagrams from unstructured functional requirements, while AnModeler [3], UCDA [4], and aToucan [5] consider extraction of class diagrams from semi-structured functional requirements.
The LIDA [12] tool assists users in the analysis of software requirement specification (SRS) and extraction of class diagrams. However, the tool requires substantial user intervention in the extraction process. Another tool, GOOAL [13], has been developed by employing a semi-formal language, the 4W language, for the extraction of static models. The CM Builder [14] and a comprehensive model, as proposed in [15], have been developed for the generation of class diagrams by formulating rules using lexical processing and parsing of the sentences.
Literature suggests that different techniques such as SUGAR [16] and Object-oriented Analysis and Design Approach for Requirements Engineering (OOADA-RE) [17] have also been employed over the years for automatic extraction of use case diagrams/class diagrams. SUGAR can extract the use case diagram and class diagram from the functional requirement by formulating heuristic rules using NLP techniques whereas OOADA-RE [17] employs story cards as an intermediate level for the representation of requirement and then uses this intermediate level for automatic extraction of the class diagrams.
In addition to the above-described tools, researchers have also proposed techniques for automatic extraction of class diagrams from specific use case templates. These techniques have been implemented for semi-structured documents in tools such as AnModeler [3] and UCDA [4]. A domain model extractor [18] has been developed which overcomes the limitations of previously proposed works. This has been achieved by combining different model extraction rules that are often used in software engineering and information retrieval. In [19], the authors have proposed a Restricted Use Case Modelling (RUCM) technique which comprises a use case description template and 26 restriction rules for representing the requirements in RUCM. The template is further used to automatically generate class diagrams using extraction rules and NLP techniques by a tool called aToucan [5]. This tool forces restrictions on the text so that the extraction rules can be used.
Domain-specific languages (DSLs) facilitate automation as well as productivity and also help to establish communication standards, despite their restricted terminology. In addition to NL, DSLs have also been used for the specification of use cases. LUCAM [20] and LUCSED [21] are two popular DSLs, which are used for the specification of use cases and for automatic extraction of class diagrams.
Some tools such as CIRCE [22], ER generator [23], Association-Based Conceptual Systems (ABCS) [24], and MARITACA [25] have been proposed for the extraction of ERD and DFD models from NL functional requirements. CIRCE [22] parses sentences to generate tuple-encoded parse trees which are used to extract different models. The ER generator tool combines specific rules linked to the semantics of words with generic rules to extract entities and relationships [23]. The ABCS approach extracts the entities and attributes from the business description using extraction rules [24]. The tool MARITACA [25] is able to generate state machines from use case descriptions using NLP techniques and extraction rules. A Model-to-Model (M2M) transformation technique extracts structured business vocabularies and business rules from use case models using transformation rules [26].

Limitations of existing approaches
- The existing automated class diagram extraction procedures mentioned earlier are based on specific use case description templates. These procedures primarily use NLP-based extraction rules which are formulated using keywords. Thus, different templates result in different sets of extraction rules, which ultimately leads to a large number of extraction rules. Figuring out the notation of different templates is also an overhead for software analysts and developers. The existing work requires software analysts and developers to specify the software requirements in some standardized template format.
- The existing procedures are not capable of extracting class diagram elements from negative sentences. For example, the statement 'The item shall give status messages at regular time intervals not less than every 30 minutes.' is a negative requirement statement. It is ambiguous regarding the time interval between status messages. If the time interval mentioned in the statement is assumed to be at least 30 minutes, providing a new status message every 50 h may seem to be fine. However, the objective is to have an interval of no more than 30 minutes between two status messages [27]. Thus, it becomes necessary to convert such negative statements to their affirmative form in order to remove ambiguity. Apart from this, there is also a need for further improvement in the handling of passive sentences.

| PROPOSED METHODOLOGY
The two-phase methodology of our proposed extraction procedure is illustrated in Figure 1. The first phase is the conversion phase, where the use case description template is converted into an intermediate template using a set of predefined conversion rules. In the second phase, class diagram element identification rules are applied to the intermediate template generated in phase 1. The proposed methodology relies on the following assumptions:
- The sentence structure should be grammatically correct. This is an important constraint since NLP tools depend on the correct grammatical and syntactical structure of sentences. Grammatically incorrect sentences cause erroneous results during extraction.
- The proposed methodology works well with all simple and compound sentences and with all types of tenses. However, it does not support the use of anaphora, so human intervention is required to replace each anaphor with the corresponding subject. Authors are expected to avoid anaphora during documentation, since anaphora resolution leads to ambiguity in sentences. However, if an anaphor is present in a sentence, it should refer to the subject of the same sentence; otherwise its resolution would create ambiguity.
- The same term should be used throughout the text to represent an entity, to avoid ambiguity. Different words that indicate the same element may lead to misinterpretation and erroneous results.

| Phase 1: conversion of given use case template into intermediate template
In this phase, a use case template gets converted into its corresponding intermediate template on the application of certain conversion rules. Figure 2 illustrates a sample intermediate template.
To formulate the conversion procedure, we consider a set of use cases belonging to 20 use case templates obtained from various research articles (e.g. [5,8,9,28,29]) and industrial white papers (e.g. [6,7])

| An illustrative example
Figure 3 illustrates the intermediate template generated from the purchase mechanism use case [10] using rules RI-RXII discussed earlier.
Many conversion rules have been applied to the purchase mechanism use case template [10]. For example, rule RI is applied on line no. 1, where it replaces 'Use case' with 'Use case Name'. The 'Primary Actor' and 'Stakeholders and Interests' keywords denote the stakeholders; thus these keywords are merged under a single keyword, 'Stakeholders', according to rule RIII. In line nos. 3 and 5, rule RII has included 'Goal in Context' and 'Scope' within one single unit, 'Description'. 'Level' represents the level of the use case, such as user-goal, summary, or sub-function. In line no. 7, the summary level has been mentioned, but this information is unproductive in the class diagram context. So, this keyword has been removed according to rule RXII. 'Minimal Guarantees' is a set of affirmations that must be satisfied at the end of any path. Thus, these affirmations should be included in the 'Main Scenario' or 'Alternate Scenario'. Like in this use case in line no. 13
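The keyword conversions described in this example can be sketched as a simple lookup. The following is a minimal illustration, assuming that each template field is available as a (keyword, text) pair; only the rules RI, RII, RIII and RXII named above are covered, and the function name `to_intermediate` and the sample field values are hypothetical.

```python
# Hypothetical keyword tables covering only the rules named in the example.
RENAME = {"Use case": "Use case Name"}            # rule RI
MERGE = {
    "Goal in Context": "Description",             # rule RII
    "Scope": "Description",                       # rule RII
    "Primary Actor": "Stakeholders",              # rule RIII
    "Stakeholders and Interests": "Stakeholders", # rule RIII
}
DROP = {"Level"}                                  # rule RXII


def to_intermediate(fields):
    """Map (keyword, text) pairs of a source template to intermediate keywords.

    Fields whose keyword is dropped are skipped; fields mapped to the same
    intermediate keyword are collected in document order.
    """
    result = {}
    for keyword, text in fields:
        if keyword in DROP:
            continue
        target = RENAME.get(keyword, MERGE.get(keyword, keyword))
        result.setdefault(target, []).append(text)
    return result


# Illustrative field values (not taken from the use case in [10]).
sample = [
    ("Use case", "Buy something"),
    ("Goal in Context", "Requestor buys something"),
    ("Scope", "Company"),
    ("Level", "Summary"),
    ("Primary Actor", "Requestor"),
]
intermediate = to_intermediate(sample)
```

Keeping the rules as data rather than code means that supporting a further template only requires extending the three tables, which is the effect the intermediate template is meant to achieve.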

| Phase 2: extraction of class diagram from intermediate template
Once the intermediate template has been generated, the next step is to extract class diagrams from it. We have developed rules for this extraction procedure, and these rules have been implemented using the Stanford CoreNLP tool [30][31][32][33][34], which parses the sentences. This tool provides the universal dependencies for each sentence of the requirement text. Each universal dependency is a binary relation between a head and a dependent. The head and the dependent are words of the sentence, and the relation is labelled with the abbreviated name of the dependency.

| Paraphrasing of negative sentences
In this step, we have applied the paraphrasing rules to the negative sentences. The Stanford NLP tool has been used to identify the negative sentences. In addition, we have used WordNet 2.0 [35,36] to replace the negated word with its antonym. In the next step, we have checked whether the introduced antonym has any synonym present in the text. If such a synonym is present, the introduced antonym is replaced by that synonym. The developed rules (Rules 1-5) for paraphrasing are as follows:

Rule 1
If a sentence includes a negation word and its modified word is an adjective, then replace the adjective with its antonym and remove the negation word from the sentence.
FIGURE 2 Intermediate use case description template for text analysis
SHWETA ET AL.

For example Negative sentence: The course offerings not marked as 'enrolled in' are marked as selected in the schedule.
Paraphrased sentence: The course offerings unmarked as 'enrolled in' are marked as selected in the schedule.

Rule 2
If there is a negation word present in a sentence and its modified word is a verb then replace the verb with its antonym and eliminate negation word in the sentence.
For example Negative sentence: The professor is not eligible to teach any course offerings in the upcoming semester.
Paraphrased sentence: The professor is ineligible to teach any course offerings in the upcoming semester.

Rule 3
If a sentence contains a negation word and its modified word is 'has' or 'have', then replace the modified word with its antonym and remove the negation word from the sentence.
For example Negative sentence: Course offerings that do not have enough students are cancelled.
Paraphrased sentence: Course offerings that do miss enough students are cancelled.

Rule 4
If there is a sentence pattern such as 'if no A, then no B', then eliminate the negation in A and B and replace 'then' with 'then only' in B. Here, A and B denote sentences.
For example Negative sentence: If no alternates are available, then no substitution will be made.
Paraphrased sentence: If alternates are available, then only substitution will be made.

Rule 5
If a sentence includes a negation word and its modified word is a noun, then replace the verb with its antonym and remove the negation word from the sentence.
For example Negative sentence: No Course offerings are available.
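Two of the paraphrasing rules can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the implementation used in this work: a small hand-written antonym table stands in for WordNet, and the negated word is found by simple adjacency to 'not' rather than through the neg dependency produced by the parser. The names `ANTONYMS`, `paraphrase_negated_word` and `paraphrase_rule4` are hypothetical.

```python
import re

# Toy antonym table standing in for a WordNet lookup (assumption).
ANTONYMS = {"eligible": "ineligible", "marked": "unmarked"}

# Rule 4 pattern: 'If no A, then no B'.
RULE4 = re.compile(r"^If no (?P<a>.+?), then no (?P<b>.+)$", re.IGNORECASE)


def paraphrase_negated_word(sentence):
    """Replace 'not <word>' by the antonym of <word> when one is known."""
    tokens = sentence.split()
    out = []
    i = 0
    while i < len(tokens):
        nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else None
        if tokens[i].lower() == "not" and nxt in ANTONYMS:
            out.append(ANTONYMS[nxt])  # antonym replaces both tokens
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


def paraphrase_rule4(sentence):
    """Rewrite 'If no A, then no B' as 'If A, then only B' (Rule 4)."""
    match = RULE4.match(sentence)
    if match is None:
        return sentence  # pattern absent: leave the sentence unchanged
    return f"If {match.group('a')}, then only {match.group('b')}"


r2 = paraphrase_negated_word(
    "The professor is not eligible to teach any course offerings in the upcoming semester."
)
r4 = paraphrase_rule4("If no alternates are available, then no substitution will be made.")
```

With these inputs, `r2` and `r4` reproduce the paraphrased sentences given for Rule 2 and Rule 4. A sentence whose negated word has no entry in the antonym table is returned unchanged, which corresponds to the case handled later by the 'not_' + verb rules.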

| Multi-word term identification
During the pre-processing of sentences, we have identified the multi-word terms and replaced each compound word with a single word in the universal dependencies. According to Algorithm 1, we have used the dependencies compound(m, n) and amod(m, n), where compound(m, n) shows that 'm n' is a multi-word term and amod(m, n) signifies that 'n' is an adjective for 'm'. This algorithm covers all multi-word terms, whether they contain two words or more, because it keeps combining words until the compound word is complete. For example, we employed Algorithm 1 on the statement 'The requester updates the meeting date'. After identification of multi-word terms, the updated universal dependencies are: [det(requester-2, The-1), nsubj(updates-3, requester-2), root(ROOT-0, updates-3), det(meeting_date-6, the-4), compound(meeting_date-6, meeting-5), dobj(updates-3, meeting_date-6)]
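The merging step can be sketched as follows. This is a minimal illustration rather than a transcription of Algorithm 1: it assumes that each dependency is available as a (relation, (head word, head index), (dependent word, dependent index)) triple, that all modifiers of a term attach directly to its head word, and it merges only the compound relation by default. The function name `merge_multiword` and the parameter `merge_relations` are hypothetical.

```python
from collections import defaultdict


def merge_multiword(deps, merge_relations=("compound",)):
    """Rename the head of each multi-word term to 'mod1_mod2_head'.

    deps: list of (relation, (head_word, head_idx), (dep_word, dep_idx)).
    Returns a new list in which every occurrence of a merged head token
    carries the joined name; indices are left untouched.
    """
    modifiers = defaultdict(list)  # head index -> [(modifier index, word)]
    words = {}                     # token index -> surface word
    for rel, (h_word, h_idx), (d_word, d_idx) in deps:
        words[h_idx] = h_word
        words[d_idx] = d_word
        if rel in merge_relations:
            modifiers[h_idx].append((d_idx, d_word))

    merged = {}
    for h_idx, mods in modifiers.items():
        # Modifiers are joined in sentence order, followed by the head.
        parts = [word for _, word in sorted(mods)] + [words[h_idx]]
        merged[h_idx] = "_".join(parts)

    def rename(token):
        word, idx = token
        return (merged.get(idx, word), idx)

    return [(rel, rename(head), rename(dep)) for rel, head, dep in deps]


# Dependencies of 'The requester updates the meeting date' before merging.
deps = [
    ("det", ("requester", 2), ("The", 1)),
    ("nsubj", ("updates", 3), ("requester", 2)),
    ("root", ("ROOT", 0), ("updates", 3)),
    ("det", ("date", 6), ("the", 4)),
    ("compound", ("date", 6), ("meeting", 5)),
    ("dobj", ("updates", 3), ("date", 6)),
]
updated = merge_multiword(deps)
```

On this input, `updated` matches the dependency list shown above, with 'date' renamed to 'meeting_date' in the det, compound and dobj entries. Passing `merge_relations=("compound", "amod")` would also fold adjectival modifiers into the term.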

| Extraction rules for class diagram elements
After pre-processing of the requirement text, we have implemented keyword-based rules for identification of class diagram entities from the intermediate template. We have used the universal dependencies in the implementation of the extraction rules, and the corresponding implementation rules are defined below. We begin by presenting the syntax of the proposed intermediate template as a tuple UC:
These sets are used in the implementation form of the extraction rules along with the following dependencies, sets and functions:
- nsubj(n, a): nominal subject [37] shows that 'n' is a verb or the complement of the copular verb for subject 'a' of an active voice sentence.
- nsubjpass(n, a): passive nominal subject [37] shows that 'n' is a verb and 'a' is the subject of a passive sentence.
- dobj(n, a): direct object [37] shows the relationship between verb 'n' and object 'a' in a sentence.
- poss(n, a): possession modifier [37] holds the relation between noun phrase 'n' and possessive determiner 'a'.
- neg(n, a): negation [37] shows the relationship between negation word 'a' and the word 'n' that it modifies in a sentence.
shows that 'c' is an extracted attribute for class 'b', relation(a, n, b) shows that 'n' is extracted as a relation between classes 'a' and 'b', whereas methods(a, n) shows that 'n' is extracted as a method for class 'a'.

Class extraction rules
C1-Rule C1 is generated from the argument that a use case name generally consists of a noun phrase, and this noun phrase is extracted as a class. For example, in the use case name 'view Report Card', 'report card' is a noun phrase that is extracted as a class.
C1-∀n∀a(a ∈ N ∧ n ∈ NP ∧ n ∈ a ⊢ class(n))
C2-Stakeholders of the use case are extracted as classes.
C2-∀n(n ∈ S ⊢ class(n))
C3-If sentences belong to 'Description', 'Postcondition', 'Precondition', 'Main Flow', or 'Alternate Flow', then the subjects present in these sentences are extracted as classes. If the subject is 'user', 'organization' or 'system', then in place of the subject, the noun phrase included in the corresponding use case name is extracted as a class.
C4-If passive sentences belong to 'Main Flow', 'Alternate Flow', 'Description', 'Postcondition', or 'Precondition', then the object in the sentence is extracted as a class. If the word 'user', 'organization' or 'system' acts as the object in the sentence, then the noun phrase that belongs to the corresponding use case name is selected as a class. For example, in a use case named 'Student Registration', if the functional requirement 'The course will be selected by Student.' is present in the 'main flow' section, then 'student' is extracted as a class; if the functional requirement is 'The username will be asked by the system.', the object is 'system', and thus the noun phrase 'student' present in the use case name is extracted as a class.
C5-The noun phrase of a use case name present in the 'Generalization' field is considered as a class. For example, if a use case named 'pay by credit card' is present as a generalization use case, then the noun phrase 'credit card' is considered as a class.
C5-∀n∀a(a ∈ G ∧ n ∈ NP ∧ n ∈ a ⊢ class(n))
C6-The noun phrase of a use case name present in the 'Include' relationship is considered as a class. For example, if a use case named 'customer authentication' is present as an include use case, then the noun phrase 'customer' is considered as a class.
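Rules C3 and C4 can be sketched over dependency triples as follows. This is a minimal illustration under stated assumptions: dependencies are given as (relation, head, dependent) word triples, the agent of a passive sentence is assumed to be labelled `nmod:agent` (the text above does not name the dependency used for it), and the noun phrase of the use case name is supplied by the caller. The names `extract_classes`, `GENERIC_SUBJECTS` and `AGENT_RELATION` are hypothetical.

```python
GENERIC_SUBJECTS = {"user", "organization", "system"}
FLOW_SECTIONS = {
    "Description", "Precondition", "Postcondition", "Main Flow", "Alternate Flow",
}
AGENT_RELATION = "nmod:agent"  # assumed label for the 'by ...' agent of a passive


def extract_classes(section, deps, use_case_np=None):
    """Return the classes that rules C3/C4 would extract from one sentence.

    deps: list of (relation, head, dependent) word triples.
    use_case_np: noun phrase of the use case name, used when the candidate
    is one of the generic words 'user', 'organization' or 'system'.
    """
    if section not in FLOW_SECTIONS:
        return set()
    passive = any(rel == "nsubjpass" for rel, _, _ in deps)
    wanted = AGENT_RELATION if passive else "nsubj"  # C4 vs. C3
    classes = set()
    for rel, _head, dependent in deps:
        if rel != wanted:
            continue
        name = dependent.lower()
        if name in GENERIC_SUBJECTS and use_case_np:
            name = use_case_np  # substitute the use case noun phrase
        classes.add(name)
    return classes


# 'The course will be selected by Student.' in use case 'Student Registration'
c4_plain = extract_classes(
    "Main Flow",
    [("nsubjpass", "selected", "course"), ("nmod:agent", "selected", "Student")],
    use_case_np="student",
)
# 'The username will be asked by the system.' in the same use case
c4_generic = extract_classes(
    "Main Flow",
    [("nsubjpass", "asked", "username"), ("nmod:agent", "asked", "system")],
    use_case_np="student",
)
```

Both calls yield the single class 'student', matching the two C4 examples. Rules C1, C2, C5 and C6 operate on template fields rather than on sentences and are not covered by this sketch.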

Attributes extraction rules
We have applied extraction rules A1-A5 using the implementation rules mentioned in [38]. Further, we have devised some more attribute extraction rules, explained below: A6-If a verb such as 'enter', 'type', or 'input' is used with an object, then the object is an attribute. For example: in 'User enters username and password.', 'username' and 'password' are attributes for 'user'.
If an object is a multi-word term in the form of NP1_NP2, where NP denotes a noun phrase, then NP1 will be an attribute for NP2.
If the verb 'enter' is used with a preposition, then the object will be extracted as an attribute for the NP of the prepositional phrase. For example: in 'Customer enters the item id.', 'id' will be an attribute for 'item'.
If the object is not a multi-word term and is not used with a preposition, then the object will be an attribute for the subject. For example: in 'Passenger type the date for the reservation.', 'date' will be an attribute for the 'reservation' class.
For example: in 'Students give their enrolment number for registration.', 'enrolment number' is an attribute for the 'student' class.

R1-∀m ∀a
R2.1-If a negative sentence cannot be paraphrased into an affirmative sentence and the negation is associated with the verb, then 'not' + '_' + verb is extracted as a relation between the subject and the object. For example, if the requirement statement is 'Vendor does not deliver to the requestor.', then 'not_deliver' will be considered as a relation between 'Vendor' and 'requestor'.
R2.2-If the subject in a sentence is 'user', 'organization' or 'system', then the relationship is extracted for the corresponding object and the noun phrase present in the use case name. For example, if the statement 'System asked verification from the registrar.' is present under the use case 'Registration', then 'asked_verification' will be considered as a relation between classes 'registration' and 'registrar'.
R2.2.1-If a negative sentence cannot be paraphrased into an affirmative sentence and the negation is associated with the verb, then 'not' + '_' + verb is extracted as a relation. For example, if the requirement statement is 'System does not acknowledge the customer.' under 'Pay bill', then 'not_acknowledge' will be considered as the relation between classes 'bill' and 'customer'.
R2.2.1-∀n ∀a(nsubj(n, a) ∧ a ∈ {'user', 'organization',
R2.3-If the subject in a sentence is 'user', 'organization' or 'system', and the use case name does not contain any noun phrase, then the relationship is extracted for the subject and the object. For example, if the statement is 'System asked password from the student.' under the use case 'Login', then 'asked' will be extracted as a relation between 'system' and 'student' because there is no noun phrase present in the use case name.
R2.3-∀n∀a(nsubj(n, a) ∧ a ∈ {'user', 'organization', relation(a, n, b))
R2.3.1-If a negative sentence cannot be paraphrased into an affirmative sentence and the negation is associated with the verb, then 'not' + '_' + verb is extracted as a relation between the subject and the object. For example, if the statement is 'System could not complete the transaction.' under the use case 'failure', then 'not_complete' will be extracted as a relation between 'system' and 'transaction' because there is no noun phrase present in the use case name.
R3-'Include' shows the relationship between the noun phrase of the use case name and the noun phrase present in the 'include' field. For example, if a use case named 'customer authentication' is present as an include use case under the use case 'deposit cash', then 'include' is considered a relation between class 'customer' and class 'cash'.
R3-∀n∀a(Include(n) ∧ a ∈ n ∧ a ∈ NP ∧ c ∈ N ∧ b ∈ c ∧ b ∈ NP ⊢ relation(a, Include, b))
R4-'Generalization' shows the relationship between the noun phrase of the use case name and the noun phrase present in the 'Generalization' field. For example, if a use case named 'pay by credit card' is present as a generalization use case under the use case 'pay bill', then generalization will be considered as a relation between class 'credit card' and class 'bill'.
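The sentence-level relation rules R2.2.1 and R2.3.1 can be sketched over dependency triples. This is a minimal illustration, assuming dependencies are given as (relation, head, dependent) word triples, that the verb is already in its base form, and that the sentence could not be paraphrased beforehand; only direct objects (dobj) are considered, so prepositional objects such as in the R2.1 example are not handled. The function name `extract_relation` is hypothetical, and its return value follows the order of relation(a, n, b).

```python
GENERIC_SUBJECTS = {"user", "organization", "system"}


def extract_relation(deps, use_case_np=None):
    """Return (source class, relation name, target class) or None.

    deps: list of (relation, head, dependent) word triples of one sentence.
    use_case_np: noun phrase of the use case name, or None if it has none.
    """
    subject = verb = obj = None
    negated_verbs = set()
    for rel, head, dependent in deps:
        if rel == "nsubj":
            verb, subject = head, dependent.lower()
        elif rel == "neg":
            negated_verbs.add(head)
    for rel, head, dependent in deps:
        if rel == "dobj" and head == verb:
            obj = dependent.lower()
    if subject is None or obj is None:
        return None
    # Negation attached to the verb yields 'not' + '_' + verb.
    name = f"not_{verb}" if verb in negated_verbs else verb
    # A generic subject is replaced by the use case noun phrase when one exists.
    if subject in GENERIC_SUBJECTS and use_case_np:
        subject = use_case_np
    return (subject, name, obj)


# 'System does not acknowledge the customer.' under 'Pay bill' (R2.2.1)
r221 = extract_relation(
    [("nsubj", "acknowledge", "System"),
     ("neg", "acknowledge", "not"),
     ("dobj", "acknowledge", "customer")],
    use_case_np="bill",
)
# 'System could not complete the transaction.' under 'failure' (R2.3.1)
r231 = extract_relation(
    [("nsubj", "complete", "System"),
     ("neg", "complete", "not"),
     ("dobj", "complete", "transaction")],
)
```

Here `r221` is ('bill', 'not_acknowledge', 'customer') and `r231` is ('system', 'not_complete', 'transaction'), in line with the two examples in the rules.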
Methods extraction rules
M1-The sentences that make up the 'description', 'postcondition', 'precondition', 'main flow' or 'alternate flow' of a use case description represent methods of classes. If, in such a sentence, the subject is a class and the object is not extracted as a class, then the verb is extracted as a method for the subject.
methods(a, m))
M1.1-If a negative sentence cannot be paraphrased into an affirmative sentence and the negation is associated with the verb, then 'not' + '_' + verb is extracted as a method for the subject.
M1.2-If the subject is 'user', 'organization' or 'system', then the verb is extracted as a method for the noun phrase in the corresponding use case name. For example, if the statement is 'The system retrieves the list of course offerings.' under the use case 'Select Courses to Teach', then 'retrieve' will be extracted as a method for class 'course'.
M1.2-∀m∀a(nsubj(m, a) ∧ dobj(m, b) ∧ a ∈ {'user', 'system', 'organization'} ∧ c ∈ N ∧ n ∈ NP ∧ n ∈ c ⊢ methods(n, m))
M1.2.1-If a negative sentence cannot be paraphrased into an affirmative sentence and the negation is associated with the verb, then 'not' + '_' + verb is extracted as a method for the noun phrase. For example, if the statement is 'The system does not verify account information.' under the use case 'course registration', then 'not_verify' will be extracted as a method for class 'course'.

M1.3-If the subject is 'system', 'organization' or 'user' and there is no noun phrase present in the use case name, then the verb is extracted as a method for the subject. For example, if the statement is 'The system provides a new professor id.' under the use case 'Increment', then 'provide_id' will be extracted as a method for class 'System' as there is no noun phrase present in the use case name.
M1.3-∀a ∀n (nsubj(n, a) ∧ dobj(n, b) ∧ a ∈ {'user', methods(a, n))
M1.3.1-If a negative sentence cannot be paraphrased into an affirmative sentence and the negation is associated with the verb, then 'not' + '_' + verb is extracted as a method for the subject. For example, if the statement is 'The system does not authenticate employee.' under the use case 'login', then 'not_authenticate' will be extracted as a method for class 'system' because there is no noun phrase present in the use case name.

M1.3.1-∀a ∀n
Usually, a use case description includes sentences such as 'System will ask the user for a password.' In these types of sentences, words such as 'organization' and 'system' treat the entire software as a single unit. Thus, these words cannot be considered as a class for the class diagram, and the corresponding attributes, methods, and relations are attached to the class extracted from the corresponding use case name.
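The method rules M1 to M1.3.1 share one decision procedure, which can be sketched as follows. This is a minimal illustration under stated assumptions: dependencies are (relation, head, dependent) triples with the verb given as a lemma, the set of already extracted classes is passed in, and the condition 'the object is not extracted as a class' from M1 is applied to all sub-rules, which is one reading of the rules rather than something they state explicitly. The function name `extract_method` is hypothetical; the result follows the order of methods(a, n).

```python
GENERIC_SUBJECTS = {"user", "organization", "system"}


def extract_method(deps, classes, use_case_np=None):
    """Return (class, method name) for one sentence, or None.

    deps: list of (relation, head, dependent) triples; verbs are lemmas.
    classes: names already extracted as classes.
    use_case_np: noun phrase of the use case name, or None if it has none.
    """
    subject = verb = None
    objects = set()
    negated = False
    for rel, head, dependent in deps:
        if rel == "nsubj":
            verb, subject = head, dependent.lower()
    if subject is None:
        return None
    for rel, head, dependent in deps:
        if head != verb:
            continue
        if rel == "dobj":
            objects.add(dependent.lower())
        elif rel == "neg":
            negated = True
    if objects & set(classes):
        return None  # object is a class: left to the relation rules
    owner = subject
    if subject in GENERIC_SUBJECTS and use_case_np:
        owner = use_case_np            # M1.2 / M1.2.1
    name = f"not_{verb}" if negated else verb
    return (owner, name)


# 'The system does not verify account information.' under 'course registration'
m121 = extract_method(
    [("nsubj", "verify", "system"),
     ("neg", "verify", "not"),
     ("dobj", "verify", "account_information")],
    classes={"course"},
    use_case_np="course",
)
# 'The system retrieves the list of course offerings.' under 'Select Courses to Teach'
m12 = extract_method(
    [("nsubj", "retrieve", "system"), ("dobj", "retrieve", "list")],
    classes={"course"},
    use_case_np="course",
)
```

The two calls give ('course', 'not_verify') and ('course', 'retrieve'), as in the examples for M1.2.1 and M1.2. Composite method names such as 'provide_id' in the M1.3 example are outside the scope of this sketch.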

| Implementation of extraction rules on case study
Here, we have used the purchasing mechanism use case [10] to show the class diagram element extraction procedure. Table 1 shows the affirmative sentences corresponding to the negative sentences, and the extracted class diagram elements are described in Table 2. In certain situations, WordNet was unable to provide an antonym that would represent the negative form of a word. In those cases, we have introduced a 'not' before such words to convey the negative sense of the word. For example, in line no. 29 of the use case [10], WordNet was unable to provide an antonym of 'deliver', and thus we have replaced it with 'not_deliver'. It has been found that negative sentences play a significant role in the extraction of class diagram elements.
We have eliminated redundancy and ambiguity in the extraction procedure by applying lemmatization to all extracted class diagram elements. From line nos. 1-9, we have extracted only classes according to rules C1-C3. Then, we have eliminated the classes that belong to the stop word list. The stop word list includes words that are frequently used in the text but cannot be considered as a class. For example, the noun phrase 'something' cannot act as a class, but it has been extracted as a class according to rule C1; thus, we have to eliminate the word 'something' from the class list. Further, we have applied the rules for the extraction of attributes, methods, and relations. Then we have identified those classes that do not contribute to any other class diagram elements and eliminated these trivial classes from the final class list.

| RESULT AND EVALUATION
Our objective is to examine our extracted class diagram elements using three evaluation metrics: completeness, correctness, and redundancy. These three evaluation metrics are explained below along with the measures used for evaluation, which are derived from [3,5]. We begin by presenting the syntax of a class diagram as a tuple $CD: \langle C, R \rangle$ comprising two disjoint finite sets of classes and relations, where each relation $R_j$ is a tuple $R_j: \langle C_m, C_n, t \rangle$ with $C_m, C_n \in C$ and $t \in T$, and $T$ is the set of types of relation between classes $C_m$ and $C_n$, that is, $T: \{\text{Association}, \text{Directed Association}, \text{Generalization}, \text{Aggregation}\}$. Here, the reference class diagram is denoted by $CD_r: \langle C_r, R_r \rangle$ and a reference class by $C_i: \langle A_r, M_r \rangle$, whereas the extracted class diagram is denoted by $CD_e: \langle C_e, R_e \rangle$ and an extracted class by $C_i: \langle A_e, M_e \rangle$.

| Completeness of class diagram ($CM_{cd}$)
The completeness of the class diagram refers to the class diagram elements that have been extracted correctly over the total number of class diagram elements in the reference class diagram. We have evaluated completeness in terms of average class completeness ($CM_c$) and relation completeness ($CM_r$), that is, $\ldots \sum_{i=1} CM_{r_i} / |R_e|$, where $CM_c$ and $CM_r$ are explained below.

| Class completeness ($CM_c$)
It is defined as the number of correctly extracted classes over the number of classes ($C_i$) in the reference class diagram. Correctly extracted classes are defined as: As $C_i$ is a tuple, that is, $C_i: \langle A, M \rangle$, the correctly identified classes depend on the number of correctly extracted attributes ($N_{ac}$) and methods ($N_{mc}$). Thus, class completeness is calculated in terms of attribute completeness ($CM_a$) and method completeness ($CM_m$):

| Relation completeness ($CM_r$)
It is defined as the number of correctly extracted relations over the number of relations in the reference class diagram. Correctly extracted relations ($R_i$) are defined as follows. Since $R_i$ is a tuple, that is, $R_i: \langle C_m, C_n, T \rangle$, an extracted relation $R_i$ is equal to a reference relation $R_j: \langle C_m, C_n, T \rangle$ if the extracted $C_m$, $C_n$, and $T$ are equal to the reference $C_m$, $C_n$, and $T$, respectively.
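The relation matching rule and the resulting ratio can be sketched as follows. This is an illustrative reading of the definition above for a single use case, with relations represented as plain (C_m, C_n, T) tuples; the function name and the sample relations are hypothetical.

```python
def relation_completeness(extracted, reference):
    """Correctly extracted relations over the relations in the reference diagram.

    Both arguments are sets of (C_m, C_n, T) tuples. An extracted relation is
    correct only if all three components match a reference relation exactly.
    """
    if not reference:
        raise ValueError("reference diagram has no relations")
    correct = extracted & reference
    return len(correct) / len(reference)


# Illustrative data only.
reference = {
    ("Customer", "Order", "Association"),
    ("Order", "Item", "Aggregation"),
    ("Admin", "User", "Generalization"),
    ("User", "Account", "Association"),
}
extracted = {
    ("Customer", "Order", "Association"),
    ("Order", "Item", "Association"),      # wrong type, so not counted as correct
    ("Admin", "User", "Generalization"),
}

cm_r = relation_completeness(extracted, reference)
print(cm_r)  # 0.5
```

Here two of the four reference relations are recovered; the relation between 'Order' and 'Item' is not counted because its type differs from the reference.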

| Correctness of class diagram
The correctness of the class diagram is the number of class diagram elements that have been correctly extracted over the total number of extracted class diagram elements. We have evaluated correctness in terms of average class correctness ($CR_c$) and average relation correctness ($CR_r$); $CR_c$ and $CR_r$ are explained below.
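The difference between correctness and completeness lies only in the denominator: correctness divides by the number of extracted elements rather than the number of reference elements. A minimal sketch follows, under the simplifying assumption that elements are compared as a flat set of names (the per-class attribute and method checks are omitted); the function name and the data are hypothetical.

```python
def correctness(extracted, reference):
    """Correctly extracted elements over all extracted elements."""
    if not extracted:
        raise ValueError("no elements were extracted")
    return len(extracted & reference) / len(extracted)


# Illustrative data only.
extracted_classes = {"customer", "order", "item", "field"}
reference_classes = {"customer", "order", "item", "invoice", "payment"}

cr = correctness(extracted_classes, reference_classes)
print(cr)  # 0.75
```

Three of the four extracted classes appear in the reference diagram, so correctness is 0.75, while completeness for the same data would be 3/5 because the reference diagram contains five classes.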

| Redundancy in class diagram ($R_{cd}$)
The redundancy in the class diagram ($R_{cd}$) can be described as the incorrectly extracted class diagram elements relative to the reference class diagram. Redundancy is calculated in terms of class redundancy ($R_c$), attribute redundancy ($R_a$), method redundancy ($R_m$), and relation redundancy ($R_r$).
In addition, we have also presented results that measure the utility of our formulated extraction rules. Our proposed methodology supports all 20 use case templates, and a table showing comparative results is presented in Section 4.5. However, in Section 4.4, due to space constraints, we have used only three templates to show the reduction in the number of rules achieved by our proposed methodology; the same applies to the other templates.
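The four redundancy measures named in this section can be sketched per element type. The exact formula is not recoverable from the text, so the normalization used here (extracted elements absent from the reference diagram, divided by all extracted elements of that type) is an assumption, as are the function name and the sample data.

```python
def redundancy(extracted, reference):
    """Share of extracted elements that do not occur in the reference diagram.

    Assumption: normalized by the number of extracted elements of this type.
    """
    if not extracted:
        return 0.0
    return len(extracted - reference) / len(extracted)


# Illustrative data only; attributes and methods are (class, name) pairs.
extracted = {
    "classes": {"customer", "order", "item", "field"},
    "attributes": {("customer", "name"), ("order", "date")},
    "methods": {("customer", "place_order"), ("field", "display")},
    "relations": {("customer", "order", "Association")},
}
reference = {
    "classes": {"customer", "order", "item", "invoice"},
    "attributes": {("customer", "name"), ("order", "date"), ("item", "price")},
    "methods": {("customer", "place_order")},
    "relations": {
        ("customer", "order", "Association"),
        ("order", "item", "Aggregation"),
    },
}

scores = {kind: redundancy(extracted[kind], reference[kind]) for kind in extracted}
print(scores)
# {'classes': 0.25, 'attributes': 0.0, 'methods': 0.5, 'relations': 0.0}
```

In this example the extra class 'field' and its method 'display' are the only redundant elements, which yields a class redundancy of 0.25 and a method redundancy of 0.5.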

| Advantage of intermediate template
We have demonstrated our methodology on three different use case templates given in earlier research works [10,40,41]; these templates consist of different keywords and elements.
Thus, a different set of conversion rules is applied to each template, and together these form the complete set of conversion rules described in Section 3.1. Hence, these use case templates are a good choice for demonstrating the advantages of our methodology over earlier work.
When we considered a separate set of extraction rules for each of the three use case templates, we ended up formulating seventy-three extraction rules in total, which reduced to thirty-two after the removal of common rules.
In contrast, using the proposed intermediate template technique, we needed to formulate only twenty extraction rules for these templates, compared to thirty-two rules. This is an improvement of about 36% in terms of fewer rules to formulate. It suggests that the number of extraction rules will also be reduced for the other seventeen use case templates.
Besides this, we obtained results identical to those obtained when implementing individual extraction rules for each template, as shown in Table 3. The outcomes demonstrate that there is no difference in accuracy; moreover, the complexity of rule implementation has also decreased because of the smaller number of extraction rules.
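The removal of common rules across per-template rule sets amounts to taking a set union. The sketch below uses hypothetical template names and rule identifiers with small counts; it does not reproduce the actual rule sets or the figures reported above.

```python
# Hypothetical rule identifiers for three templates (illustrative only).
template_rules = {
    "template_A": {
        "noun_as_class",
        "verb_as_method",
        "actor_as_class",
        "has_a_as_aggregation",
    },
    "template_B": {"noun_as_class", "verb_as_method", "precondition_attribute"},
    "template_C": {"noun_as_class", "actor_as_class", "is_a_as_generalization"},
}

# Rules counted once per template in which they are formulated.
total_rules = sum(len(rules) for rules in template_rules.values())

# Rules that remain after the rules shared between templates are merged.
unique_rules = set().union(*template_rules.values())

print(total_rules)        # 10
print(len(unique_rules))  # 6
```

With an intermediate template, a single rule set is written once for that template instead of one set per source template, which is why the rule count can fall below even the merged total.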

| Comparison of newly formulated extraction rules with the existing approach
This section evaluates our proposed approach in comparison with three existing approaches: AnModeler [3], aToucan [5], and another approach [2] that is applicable to unstructured functional requirements. Like our proposed methodology, these approaches identify comparable class diagram elements from active and passive sentences without using any ontology or glossary. In contrast, these approaches do not handle negative sentences or the extraction of multi-word terms, which makes this comparison more valuable. For the evaluation of our methodology, we have taken 20 different use cases from various software engineering books (e.g. [10,11]) and research papers (e.g. [8,9]), which are based on 20 different use case description templates. The extracted class diagram elements are evaluated against reference diagrams created by human experts, namely persons with industry experience, postgraduate students, and Ph.D. students specializing in software engineering. They are well aware of UML models because software engineering and object-oriented programming have been part of their basic course structure. Moreover, the postgraduate and Ph.D. students had also been given extensive training on software modelling with UML through other courses. The participants constructed the class diagrams in collaboration with each other, and these diagrams have been used as reference diagrams for evaluation purposes. Additionally, the participants were unaware of the proposed methodology and the class diagram extraction rules; thus, the reference diagrams are unbiased towards the proposed methodology.
Since we have paraphrased negative sentences into affirmative sentences and have also identified multi-word terms, the results show a relevant difference. Table 4 shows that the proposed methodology achieves a significant improvement in the extraction of class diagram elements from different use case templates compared to existing methodologies. The proposed methodology has average class and relation completeness of 0.86 and 0.85, respectively, which leads to high class diagram completeness in comparison to other methodologies. Similarly, our methodology provides average class and relation correctness of 0.95 and 0.94, respectively; thus, the ratio of relevant extracted class diagram elements is higher than for other methodologies. The average redundancy values for classes, attributes, methods, and relations are 0.05, 0.00, 0.05, and 0.04, respectively, which are comparatively low for the proposed methodology. We have also used box plots to show the variability of the quality metrics of the class diagrams generated by the proposed methodology and the AnModeler tool. (Due to space constraints, we show the box plot comparison with only one tool, AnModeler, as it is the most recent approach.) In Figure 4, in box plots (a) and (b), the median lines of the proposed approach are much closer to 1 than those of the AnModeler tool, which shows that our approach is more accurate. In box plots (c) and (d), the median lines of our approach are much closer to zero, which signifies that our approach generates fewer redundant diagram elements. The comparatively small boxes show that the proposed approach has less variability (a higher level of similarity) in correctness and completeness than AnModeler.
The tool demonstrating our methodology is available at https://github.com/09shweta/semi-struct.git. This tool takes text files containing the whole use case description as input and provides the extracted class diagram elements as output, such as class names, the attributes and methods corresponding to each extracted class, and the relations between the classes. These extracted class diagram elements are stored in different output text files.

| Internal validity
The main threat to internal validity is whether the participants have sufficient knowledge, understanding, and experience in object-oriented programming and methodology. The participants include persons with industry experience, postgraduate students, and Ph.D. students, and these students have taken a course on object-oriented software engineering. They were also trained in object-oriented programming and modelling before they constructed the class diagrams. Moreover, the persons with industry experience also contributed jointly to the construction of the class diagrams.

| Construct validity
The threat related to construct validity concerns the measurement instruments. The approaches have been compared based on class diagrams generated from 20 use cases of different templates. Moreover, the class diagrams generated by the different approaches are compared using the same quality measures. Therefore, no bias is introduced into the results during the evaluation of these class diagrams.

| External validity
One of the threats to external validity is whether the class diagrams obtained from the participants generalize to software professionals. The postgraduate students are well trained in object-oriented programming and modelling, as object-oriented software engineering is part of their basic course structure. Moreover, we have trained them in class diagram modelling, and the obtained class diagrams are a joint effort of all the participants. Literature such as [52,53] suggests that students can be used in place of professionals, as there is no significant difference between the performance of trained software engineering students and professional developers. The persons with industry experience have more than 2 years of experience in programming and software development; thus, the obtained results generalize better to software professionals. The next threat to validity is the size of the use case document that can be supported by our proposed approach. In our study, we have also included three entire use case documents from industry, and the results obtained are very close to the reference diagrams.

| Discussion
We have also evaluated our methodology on use case description documents from different domains, such as the Personal Investment Management System (PIMS) [54], the 'web accessible alumni database' [55], and the course registration domain [56], as shown in Table 5. Here, we discuss one of these use case documents, PIMS, whose gold standard class diagram is shown in Figure 5. We analysed the extracted class diagram in Figure 6 and found that some extracted classes, such as 'field', 'ROI', 'installation', 'authorization', and 'error_message', do not belong to the gold standard class diagram. However, this can be advantageous for software developers, as a sufficient number of classes can reduce the complexity of software development. Some classes, such as 'error_message' and 'authorization', are not associated with any other class because they are used as the subject of a sentence with the pattern 'subject + verb'. Thus, the verb is extracted as a method of the subject according to the method selection rules. These types of extra class diagram elements can be reduced by formulating more rules that consider various types of sentence structure. Since any NLP-based procedure extracts only explicitly defined class diagram elements, some implicit information, such as the classes 'net loader', 'nameUI', and 'data repository' and the aggregation relations, cannot be extracted by the proposed methodology. Moreover, our goal is to reduce human effort and not to replace human decision making; thus, the extracted class diagram has to be reviewed by a software developer for further modifications.
The existing methodologies have three major drawbacks in the extraction process: (1) they are unable to extract class diagram elements from negative sentences; (2) they are not able to extract multi-word terms (AnModeler can extract multi-word terms, but with a limitation: it is not able to extract multi-word terms with more than two nouns); and (3) they work only for a specific use case template, which leads to lower correctness and completeness for other templates.

FIGURE 5 Gold standard class diagram of PIMS [54] for evaluation

FIGURE 6 Class diagram for PIMS
The proposed approach attains a higher score because (1) it works on negative and passive sentences, (2) it extracts multi-word terms formed from two or more noun phrases or adjective phrases, (3) it works on entire use case documents, and (4) it is suitable for 20 different types of use case template. However, our methodology also has some limitations: (1) it does not support anaphora resolution, (2) its accuracy is bounded by the NL parser used, that is, the Stanford NL Parser, (3) it can extract only explicit information from the functional requirements, and (4) it is unable to handle misspellings and abbreviations. Spelling correction can be performed by applying a third-party tool to the text before it is given as input to the proposed methodology.

| CONCLUSION
In this paper, we have presented a methodology for the transformation of any given use case template into an intermediate template that helps in the automatic extraction of class diagrams. The class diagrams automatically extracted from the intermediate template are identical to those extracted from the individual templates using different sets of rules. This indicates that no information loss occurs during the conversion process. The extraction of class diagrams from different use case templates requires a large number of extraction rules. However, with our proposed methodology, different sets of extraction rules for different use case templates are no longer needed. Instead, rules need to be formulated only for the generated intermediate template. Thus, the total number of extraction rules decreases, which ultimately reduces the complexity of the extraction procedure. Presently, our methodology supports 20 use case templates. However, it can be extended to other templates as well.
Moreover, our class diagram extraction procedure considers negative sentences and multi-word terms within the templates during the formulation of extraction rules. This leads to more accurate results compared to state-of-the-art work.
Even though we could automate the class diagram extraction procedure, we could not eliminate human intervention entirely. For example, when paraphrasing negative sentences into affirmative sentences, some sentences can be paraphrased only by humans. Thus, even in our proposed approach, human intervention is needed to further correct the automatically extracted class diagram.