Web of cybersecurity: Linking, locating, and discovering structured cybersecurity information

Cybersecurity is one of the main concerns of many organizations today, and timely access to cybersecurity information is crucial to maintaining cybersecurity. Various repositories of cybersecurity-related information are publicly available on the Internet. However, users are unaware of many of them, and it is impractical for them to keep track of all of them. Cybersecurity information stored in these repositories must be locatable and accessible by the parties who need it. To address this issue, this paper proposes a mechanism for linking, locating, and discovering various cybersecurity information to improve its accessibility in a timely manner. This mechanism allows us to locate various cybersecurity information with different schemata by generating metadata with which a list of cybersecurity information is managed. The information structure incorporated in this mechanism is unique, and it makes our mechanism flexible and extensible. The structure consists of categories and formats that are linked to each other. The mechanism can propagate information updates to minimize the risk of obsolete information. This paper also introduces a prototype of the mechanism to demonstrate its feasibility, and it analyzes the mechanism's extensibility, scalability, and information credibility. Through this study, we aim to improve the accessibility of cybersecurity information on the Internet and facilitate information sharing beyond organizational borders, with the eventual goal of creating a web of cybersecurity.


INTRODUCTION
Cybersecurity has never been a more important issue than it is now, as our society has become dependent on the Internet for providing services in virtually every area of life. It is often said that information sharing is key to managing cyber risks. Cybersecurity information obtained from experience and/or derived from analyses ought to be shared across organizations and beyond national boundaries. Such sharing can avoid situations where, for instance, countermeasures against a certain software vulnerability are available, yet an organization remains vulnerable to cyber attacks.
To facilitate the sharing of such information, private companies, public organizations, and national entities started to publish vulnerability repositories in the late nineties. For example, MITRE 1 introduced the Common Vulnerabilities and Exposures (CVE) initiative and launched a repository in 1999. In 2005, the National Institute of Standards and Technology set up the National Vulnerability Database (NVD). 2 This repository contains annotated versions of CVE entries, such as fix information, severity scores, and impact ratings. Besides these repositories, various others have been made available by for-profit and non-profit entities. Many more repositories may emerge in the future.
Measures to deal with issues affecting cybersecurity are often available through these repositories, and individuals and organizations, upon discovering such measures, implement them to make their software and systems more secure and robust against cyber attacks. The people involved in software security, for instance, should be aware of all available cybersecurity information resources and take steps to incorporate the measures found in them into the countermeasures they develop. However, it is impractical for users to keep track of all of the repositories, and some users are not even aware that these repositories exist. With the growth in cybersecurity-related information, it has become necessary to create a mechanism that automatically identifies, discovers, and locates cybersecurity information from online information repositories distributed across the Internet.
One may argue that a single trusted repository would suffice, but it is not always the best solution, because none of the existing repositories contains all types of cybersecurity information. Each of the existing repositories is designed to accumulate certain types of information using differing schemata. Moreover, different repositories may provide the same type of information (e.g., the same vulnerability information), but the content (e.g., the severity score of the vulnerability) may differ. It is thus beneficial to devise a means of efficiently accessing needed and related information across these repositories.

Contributions
In this paper, we propose a new way of identifying, discovering, and locating structured cybersecurity information and define a mechanism that allows us to share and retrieve cybersecurity information distributed in various online repositories. We deal with Extensible Markup Language (XML)-based information * and describe the architecture, protocol, and information structure of our mechanism. The architecture consists of several roles communicating with each other to publish or retrieve information from different repositories by following defined protocols. It provides a means to update obsolete information so that security risks can be minimized. The information structure of this mechanism is unique and makes the mechanism flexible and extensible. This structure consists of categories and formats that are linked to each other. It uses the reference ontology from Takahashi and Kadobayashi 3 for the categories and industry specifications † defining XML schemata for the formats. It makes the mechanism flexible enough to handle schema differences that exist among various information sources and extensible enough to cope with future specifications that define new information schemata. We implemented a prototype of the mechanism in which users can register XML-based security information and search for relevant information by using a web interface. Our analysis shows that the mechanism is feasible, extensible, and scalable.
We believe our mechanism will eventually realize a web of cybersecurity and produce an Internet-scale knowledge base. Various repositories from all over the world can be linked and integrated through this mechanism. We hope this mechanism will facilitate the exchange and circulation of cybersecurity information beyond organizational borders and contribute to global cybersecurity. ‡

Organization of this paper
The rest of the paper is organized as follows: Section 2 describes work related to the proposed mechanism, and Section 3 depicts the design principles. Section 4 describes the architecture of the mechanism by defining the necessary roles, and Section 5 defines the structure of the protocol messages. Section 6 describes the protocol procedures, and Section 7 introduces the information structure that identifies information over networks. Section 8 introduces the prototype of the mechanism, and Section 9 evaluates and analyzes the mechanism. Section 10 concludes this paper.

* Non-XML information is outside the scope of this paper, but arbitrary HTML information can be easily handled by the proposed mechanism.
† The industry specifications referred to in this paper include specifications defined by industry organizations and standards developing organizations.
‡ This paper is an extended version of work published in Takahashi and Kadobayashi. 4

RELATED WORK
This section describes research that is related to the proposed mechanism. It outlines research on information schemata, online repositories, and cybersecurity information ontologies.

Information schemata
The huge volume of available information makes it difficult to share and retrieve data over the network efficiently unless the cybersecurity information is machine-readable. For this reason, various information structures have been specified. For example, CVE 5 provides identifiers (CVE-IDs) and an XML schema to identify and describe vulnerability information. § The Incident Object Description Exchange Format (IODEF) 6 defines a data model for reporting incident information and provides an XML schema, while the Extensible Configuration Checklist Description Format 7 defines an XML schema for security configuration checklists. There are many other information structures, including the Asset Reporting Format, 8 Common Attack Pattern Enumeration and Classification, 9 Common Configuration Enumeration (CCE), 10 Common Configuration Scoring System, 11 Common Event Expression, 12 Common Platform Enumeration (CPE), 13 Common Vulnerability Reporting Framework (CVRF), 14 Common Vulnerability Scoring System (CVSS), 15 Common Weakness Enumeration, 16 Common Weakness Scoring System, 17 Cyber Observable Expression, 18 Malware Attribute Enumeration and Characterization, 19 Malware Metadata Exchange Format, 20 Open Checklist Interactive Language, 21 Open Vulnerability and Assessment Language (OVAL), 22 Structured Threat Information Exchange, 23 Software Identification, 24 Web Services Agreement Specification, 25 and Extensible Access Control Markup Language. 26 These are useful for accumulating cybersecurity-related information and building repositories.
To cope with cybersecurity issues, we often need several different types of information, such as vulnerability, asset, and malware information. These different types of information do not typically reside in a single repository following a single schema. One may build an all-encompassing schema that incorporates all types of cybersecurity information to store this information. However, in practice, doing so is inefficient, because different use cases need different types of information and prefer different schemata. Even if we were to build such a schema, it would need to be updated as new security techniques and operations emerge, and this may render the schema useless. Thus, instead of designing a comprehensive schema, it is better to design a flexible and extensible structure that can incorporate various cybersecurity information and cope with future schema specifications. This would ensure wider usability.
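To make the schema gap concrete, the sketch below describes the same vulnerability under two hypothetical schemata; a consumer needs schema-specific logic for each, and the two sources may even disagree on the content. The element names and scores here are purely illustrative and are not taken from any real specification.

```python
import xml.etree.ElementTree as ET

# The same vulnerability expressed under two hypothetical schemata.
DOC_A = "<entry id='CVE-2014-0001'><severity>7.5</severity></entry>"
DOC_B = "<vulnerability><cve>CVE-2014-0001</cve><cvss score='6.8'/></vulnerability>"

def severity(doc):
    """Schema-specific extraction: each schema stores the identifier and
    the score in different places, so each needs its own handling."""
    root = ET.fromstring(doc)
    if root.tag == "entry":
        return root.get("id"), float(root.findtext("severity"))
    return root.findtext("cve"), float(root.find("cvss").get("score"))

print(severity(DOC_A))  # ('CVE-2014-0001', 7.5)
print(severity(DOC_B))  # ('CVE-2014-0001', 6.8)
```

A universal schema would eliminate this per-schema logic, but as argued above, it would be impractical to maintain; the mechanism proposed in this paper instead preserves each schema and bridges them at the metadata level.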

Repositories and databases
Once we have a consistent data structure, we can accumulate security information and build repositories and databases. Various vulnerability repositories have been proposed along these lines. Sufatrio et al. 27 proposed a machine-readable integrated vulnerability database. The database was created by integrating several public vulnerability databases that contain human-readable information. Each entry of the database contains machine-readable fields for 3 types of information, i.e., system components, environment factors, and consequences of the vulnerability, in addition to the URLs of the public repositories that contain the original information in human-readable form. The authors demonstrated a prototype vulnerability scanner that uses the database. Khadilkar et al. 28 studied semantic modeling for building databases. They argued that relational models pose certain limitations on the automation of vulnerability management, namely, a lack of interoperability, and proposed a semantic model that supports inference. They also introduced an ontology that models the product information contained in NVD and proposed storing the NVD information in a Resource Description Framework (RDF) 29 ¶ triple store and creating a semantic web application that can search for relevant information. Gu et al. 30 proposed a relational database-based design and built a vulnerability database with an emphasis on accuracy and integrity. They used the CVE, X-Force, and NVD databases for the source information and Microsoft SQL Server to store the vulnerability information. CVE-IDs were used to relate the vulnerabilities described in different sources. They advocated a modular design, i.e., providing modules for downloading vulnerability information from those repositories and storing it locally, and devising a module for publishing information on the Internet through a website and a web service.
The above research provides us with a means to build a vulnerability database with a single data model. It thus lays the foundation for our research, which links various information with various schemata. On the other hand, various cybersecurity repositories are available online, as mentioned in Section 1. NVD is an online vulnerability repository, in which each item provides vulnerability information with a unique CVE-ID. The Red Hat repository 31 provides vulnerability information and security checklists in the form of CVRF and OVAL. MITRE also provides these in the form of CVE and OVAL. # The items in these repositories are written in English, but other repositories describe their items in local languages. For instance, the Japan Vulnerability Notes (JVN) repository 33 publishes vulnerability information in Japanese by using its own RDF schema. Likewise, the China National Vulnerability Database 34 and the China National Vulnerability Database of Information Security 35 provide vulnerability information in Chinese.

§ CVE was originally designed to build a vulnerability dictionary by defining unique vulnerability identifiers. The XML schema used for describing the vulnerability of each CVE-ID is prepared as an additional feature.
¶ RDF is a syntactic and semantic language for representing information on available resources and making them searchable over the Internet.
Many more repositories containing cybersecurity-related information will emerge in the future, and they may use different schemata to describe the information. The resulting schema gap would make it difficult to locate information distributed in various repositories. Moreover, it would be difficult for us to access a large number of repositories. Thus, it is desirable to have a single point of access where one may locate needed information in all these repositories.

Information discovery mechanisms
There are various information discovery mechanisms. For ease of discussion, they are classified into 3 approaches: directory-based, crawler-driven, and direct posting.
The directory-based approach uses directories that enable easy location and quick discovery of targeted information. These directories are often managed manually, and information needs to be registered manually. For instance, web search engines in their early stage of development used this approach and manually registered web pages in their own directories, 36 whereas some domain-specific search engines still use it. A hierarchical mechanism taking this approach has also been proposed. 37 It provides for the registration of object identifier 38 arcs, which enable coherent, unique, and global identification of cybersecurity information, as well as of the organizations exchanging that information and their associated policies. This approach allows more control over the information sources than the other approaches do; for instance, it can exclude information sources that spread false security information or that are untrustworthy. However, the resources and costs involved in maintaining such directories can be prohibitive for those with limited resources, and thus this approach is not suitable for applications that require scalability.
The crawler-driven approach crawls the entire Internet. It requires minimal resources and costs for making information available, and those providing and seeking information need not know of each other's existence beforehand. This approach scales well and is thus well suited to the Internet. Various web search mechanisms take this approach by using an indexer and a query processor in addition to a crawler. 36 They were originally designed for locating text information on the web and ignored its semantics, but some recent research has tried to improve search quality by taking the semantics of information into account. Such work often uses ontologies; a semantic focused crawler traverses the Web and determines the affinity between a Web document and an ontology. 39,40

The above 2 approaches require intermediate entities that hold information on the information sources. In contrast, the direct posting approach shares information without such entities. For instance, ROLIE 41 publishes, shares, and exchanges security information as Web-addressable resources by using the Atom Publishing Protocol and the Atom Syndication Format. Another such mechanism is XMPP-based information publishing. 42 These mechanisms do not require intermediate entities for discovering information, but information consumers need to know the location of the information feeds. Moreover, they are not designed for locating information and thus require wrapping mechanisms that facilitate information location. They are suitable for one-to-many communication, where subscribers of particular information sources receive updates from them.
When handling security information, the discovery mechanism should be carefully designed, since the delivery of wrong or obsolete security information may lead to severe security incidents, and many current web search algorithms that take the crawler-driven approach are known to be vulnerable to information spam. The repositories mentioned in Section 2.2 distribute XML-based structured cybersecurity information, and we wish to link such structured and managed repositories instead of crawling the entire Internet.
The approaches mentioned above are not necessarily the most suitable for our purpose. The directory-based approach falls short on scalability: though the number of information sources we deal with is rather limited, they publish large amounts of information, and it is impractical to manually register and assign identifiers to each information item. The crawler-driven approach is not the best solution either; a crawler is unnecessary and inefficient since the location of the information is known and unchanged, and we do not need to crawl other web pages. The direct posting approach is efficient but is not suitable for locating information across multiple repositories.
These approaches are nonetheless the foundations for dealing with such information. We prefer to make use of the structure and semantics of each tag of the structured cybersecurity information. Different XML schemata exist for security information and are defined for different use cases, so the data structure should be preserved and used for search. We intend to design a discovery mechanism specific to the security information domain, which deals only with XML information of different schemata from trusted repositories. We wish to register trusted sources and to locate required information across these sources in a scalable manner.

Cybersecurity information ontology
Our mechanism links various cybersecurity information by using an extensible data model, which is described in Section 7. The data model separates information categories and formats. The categories should not be required to change in the near future and should be agreed upon by the parties involved. Accordingly, the categorization needs to be abstract, yet built in close cooperation with the consumers who actually use the information. An ontology, which is an abstract and simplified view of the world that we wish to represent for particular purposes, is useful here. 43,44 A reference ontology for cybersecurity operational information is presented in Takahashi and Kadobayashi, 3 which appears in ITU-T Recommendation X.1500. 45 It was built in close collaboration with organizations that actually consume the information. It extracts concepts surrounding cybersecurity information and defines various categories and types of information needed for cybersecurity operations. Table 1 summarizes the categories. Note that our mechanism uses these categories for its data model.
TABLE 1 Categories of cybersecurity operational information

Category: Type of information in the category
User Resource DB: Information on IT assets and the other information needed to maintain an organization's security, prepared by the organization.
Provider Resource DB: Information on IT assets and the other information needed to maintain an organization's security, provided by external service providers.
Incident DB: Information on incidents of an organization. It is generated from analyses of assorted logs stored inside the User Resource Database.
Warning DB: Information on cybersecurity warnings. It is generated from analyses of incidents that occurred in different organizations.
Cyber Risk KB: Knowledge on cybersecurity risks.
Vulnerability KB: Knowledge on known vulnerabilities, including naming, taxonomy, and enumeration of known software and system vulnerabilities.
Threat KB: Knowledge on known cybersecurity threats.
Countermeasure KB: Knowledge on countermeasures against cyber risks.
Assessment KB: Knowledge on security assessments, e.g., the rules, criteria, checklists, and best practices for assessing an organization's security level.
Protection KB: Knowledge on techniques to detect and protect against security threats.
Product & Service KB: Knowledge on products and services.
Version KB: Knowledge on versions of products and services, including their naming and enumerations.
Configuration KB: Knowledge on the configurations of products and services.
Abbreviations: DB, database; IT, information technology; KB, knowledge base.

More work has been done on ontologies besides what is described above. Fenz et al. 46 proposed a security ontology for organizing knowledge on information security-related concepts with a focus on information security risk management. Wang et al. 47,48 introduced a vulnerability ontology designed for vulnerability analysis and management that captures the relationships among information technology (IT) products, vulnerabilities, attackers, security metrics, countermeasures, and other relevant concepts. Tsoumas et al. 49 built an ontology of security management within an organization, with a focus on risk assessment. Parkin et al. 50 proposed an information security ontology incorporating human-behavioral implications. Masoumzadeh and Joshi 51 introduced an ontology that captures privacy-sensitive information in social networking systems. Obrst et al. 52 developed an ontology of the cybersecurity domain that focuses on malware. Although there have been various other ontology studies, 53 their usability for our purposes is rather limited because they are designed for different purposes.

Knowledge extraction
While the core of our mechanism is linking different pieces of structured information and making them discoverable, the Internet provides a lot of information in unstructured form. Knowledge extraction techniques can generate structured information from such unstructured information. Using such techniques, our mechanism could deal with unstructured, non-XML information. Various such techniques have been proposed. More et al. 54 developed a scheme that extracts threat and attack-related information from web texts. It integrates the extracted information with IDS/IPS sensor information and builds a semantically rich knowledge base to detect cyber threats and vulnerabilities. Joshi et al. 55 extracted cybersecurity-related information from web texts, integrated the information with NVD, and generated an RDF-based knowledge base. Benjamin et al. 56 explored techniques to extract threat and vulnerability information from hacker forums, hacker Internet relay chat, and carding shops. Jones et al. 57 proposed a bootstrapping algorithm for extracting security entities and their relationships from text, closely following developments in semi-supervised natural language processing, to assist security analysts in obtaining information pertaining to their networks, such as new vulnerabilities, exploits, or patches. Such extracted information can be structured by preparing an appropriate data structure.
This paper focuses on how our mechanism links various structured information without losing its extensibility, but the mechanism should be able to cope with these techniques in the future.

Security automation using repositories
Cybersecurity-related repositories are useful for security operations. Martin 58 elaborated on the use of accumulated security data conforming to security standards, e.g., CCE, CPE, CVE, and OVAL. He claimed that accumulated security data contributes to the facilitation and automation of security operations inside organizations. For instance, he showed how to assess a system and the threats to it by using such security data. The prototype vulnerability scanner described in Sufatrio et al. 27 is another example of security automation using repositories. The scanner, which was built to demonstrate the usability of the security repository described in that paper, finds information in the repository that identifies vulnerabilities inside a system. The repository contains purely machine-readable data, so the scanner can process the data efficiently and automatically. Takahashi et al. 59 introduced a scheme to automate vulnerability management of IT assets in an administrative network by using human-readable vulnerability notes, such as the ones from the NVD and JVN repositories. It monitors IT assets and determines their identifiers by using natural language processing techniques, then looks up the vulnerability notes using the identifiers. If a related vulnerability note is identified, an alert message is sent to the administrator.
The studies described above illustrate the importance of accumulating cybersecurity-related information and providing a means of discovering such information.

DESIGN PRINCIPLES
This section describes the design principles by elaborating the problem statement and the approaches to solving the problem.

Problem statement
Our aim is to link XML-based cybersecurity information and make it discoverable over the Internet. The following issues need to be resolved to achieve this objective.
a) Schema gap: Efficient information retrieval and operation automation require cybersecurity information to be machine-readable. There are various ways of describing this information, as described in Section 2.1. They define cybersecurity information structures in XML and enable the exchange of this information. Nevertheless, these structures, i.e., schemata, differ from one another because different use cases of cybersecurity information prefer different information structures. Techniques for retrieving XML documents 60-62 are difficult to apply here, as they are designed for cases in which the structure of documents is fixed. To retrieve cybersecurity information, the gap between these schemata needs to be bridged.
b) Future extensibility: Specifications developed in the future may define new schemata, so we need a mechanism that can incorporate them. The mechanism must be extensible enough to allow us to use new schemata without sacrificing the usability of cybersecurity information.
c) Information update: The proposed mechanism handles cybersecurity information, which should be treated differently from other types of information. Up-to-date security information is crucial if an organization is to cope with security threats, while obsolete security information may itself pose significant security risks. Therefore, updates of cybersecurity information need to be instantly propagated to anyone who holds obsolete information.

Our approach
To cope with the above issues, we propose a mechanism of linking, locating, and discovering structured cybersecurity information online. Here are the design principles of the mechanism.
a) Provide a means to share information
• Provide a means to register information: The mechanism allows users to register and publish their information.
• Provide a means to locate and identify needed information: The mechanism allows users to locate information and obtain the Uniform Resource Identifier (URI) of the information so that they can identify and access it.
• Provide a means to update obsolete information: The mechanism has a means to share information updates with entities that have obsolete information. This feature minimizes the security risk to users who possess obsolete information.

b) Support assorted types of information
• Build a flexible and extensible information structure: The mechanism has a flexible and extensible information structure that can incorporate assorted schemata. We could build a universal schema for all types of cybersecurity information to address the schema gap and future extensibility issues, but doing so is impractical, as mentioned in Section 2.1. We thus took the approach of developing a flexible and extensible information structure instead.
• Accumulate metadata of cybersecurity information: The mechanism accumulates not raw information but its metadata, which is necessary for future retrieval. The metadata is in RDF and is generated by applying Extensible Stylesheet Language Transformations (XSLT) to the raw information, which is written in XML.
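As a rough illustration of this metadata step, the sketch below reduces a small XML entry to RDF-style triples using only Python's standard library; the element names, namespace, and predicate vocabulary are hypothetical, and a real Registry would apply an XSLT stylesheet to produce proper RDF instead.

```python
import xml.etree.ElementTree as ET

# A hypothetical XML vulnerability entry, standing in for one
# item retrieved from an InfoSource.
ENTRY = """
<vuln:entry xmlns:vuln="http://example.org/vuln" id="CVE-2014-0001">
  <vuln:summary>Buffer overflow in example daemon</vuln:summary>
  <vuln:score>7.5</vuln:score>
</vuln:entry>
"""

NS = {"vuln": "http://example.org/vuln"}

def extract_metadata(xml_text, source_url):
    """Reduce a raw XML entry to (subject, predicate, object) triples;
    only the fields needed for later lookup are kept, not the raw data."""
    root = ET.fromstring(xml_text)
    subject = root.get("id")
    return [
        (subject, "dc:source", source_url),
        (subject, "dc:title", root.findtext("vuln:summary", namespaces=NS)),
        (subject, "ex:score", root.findtext("vuln:score", namespaces=NS)),
    ]

for triple in extract_metadata(ENTRY, "https://repo.example.org/CVE-2014-0001"):
    print(triple)
```

Keeping only such triples, rather than the raw entries, is what lets a Registry index information from repositories with very different schemata.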

c) Consider deployment
• Avoid changes to existing online information: The mechanism does not require any modification to existing cybersecurity information that is available online. In this way, deployment of the mechanism on the Internet is facilitated. Though a new protocol is needed to identify and locate information properly, the information stored inside the information sources must remain free from any modification so that the mechanism can be deployed.
• Cope with assorted security techniques: The mechanism needs to incorporate assorted security techniques to maintain its security, but this falls outside the scope of this paper.

ARCHITECTURE OVERVIEW
Our mechanism defines 4 roles, i.e., Information Source (InfoSource), Registry, Discovery Client (D-Client), and Discovery Server (D-Server). Figure 1 describes their relationships. Note that several roles can be implemented inside a single entity. Each of the roles is defined below:

FIGURE 1 Roles. D-Client, discovery client; D-Server, discovery server; InfoSource, information source

FIGURE 2 Illustrative topology. D-Client, discovery client; InfoSource, information source; URL, uniform resource locator

a) InfoSource: It possesses cybersecurity information in XML and provides the information upon request. It makes the information discoverable over networks by registering it in one or more Registries. Note that InfoSources in this paper handle only information in XML, but the mechanism we propose should be extended to cope with non-XML information in the future.
b) Registry: It serves as an interface for InfoSources and accepts their registrations. It accesses the InfoSources that registered their cybersecurity information and generates metadata from the information it obtains, which can be used to locate and identify the information. Note that a Registry can be implemented by multiple entities for load-balancing purposes. Because information is distributed under different administrations, a large number of unique Registries can exist, each of which is connected to different InfoSources.
c) D-Client: It locates cybersecurity information by communicating with a D-Server, which replies with the Uniform Resource Locators (URLs) of the information. Upon receiving the URLs, it accesses them to retrieve the information. Note that it can communicate with any D-Server, depending on its needs and preferences.
d) D-Server: It serves as an interface for D-Clients. It communicates with Registries to find the InfoSources that contain the information a D-Client is looking for, on behalf of that D-Client. A D-Server communicates with one or more Registries, aggregates the information they send in reply, and delivers the aggregated information to the D-Client. Each D-Server can reach different InfoSources through different Registries and thus has a different search range from the others. Figure 2 is an illustrative deployment of the roles.
The mechanism provides a means for a D-Client to locate cybersecurity information distributed over multiple InfoSources. The D-Client issues a query to a D-Server, which then locates InfoSources through the Registries and replies with a list of URLs of related InfoSources; the D-Client uses these URLs to locate the needed information in the InfoSources.
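The discovery flow above can be sketched as follows; the class and method names are our own illustration of the roles, not part of the protocol, and a real deployment would exchange the protocol messages of Section 5 over the network rather than call methods directly.

```python
class Registry:
    """Holds metadata mapping search keys to InfoSource URLs."""
    def __init__(self):
        self.index = {}  # keyword -> set of InfoSource URLs

    def register(self, keyword, url):
        self.index.setdefault(keyword, set()).add(url)

    def lookup(self, keyword):
        return self.index.get(keyword, set())

class DServer:
    """Aggregates lookup results from one or more Registries."""
    def __init__(self, registries):
        self.registries = registries

    def query(self, keyword):
        urls = set()
        for registry in self.registries:
            urls |= registry.lookup(keyword)
        return sorted(urls)  # list of URLs returned to the D-Client

# Two Registries connected to different InfoSources; the D-Client
# would then fetch each returned URL directly from the InfoSource.
r1, r2 = Registry(), Registry()
r1.register("CVE-2014-0001", "https://repoA.example.org/e1")
r2.register("CVE-2014-0001", "https://repoB.example.org/e9")
server = DServer([r1, r2])
print(server.query("CVE-2014-0001"))
```

Because each D-Server is connected to its own set of Registries, two D-Servers given the same query may return different URL lists, which is the "different search range" noted above.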

PROTOCOL MESSAGES
Our mechanism defines 5 types of messages: registration, notification, query, result, and membership, which are used within the procedures described in Section 6. All of the protocol messages have 2 mandatory fields and 2 optional fields. The type and content fields are mandatory, while the protocol and credential fields are optional. The messages are described in RBNF, 63 as follows.
The type field identifies 1 of the 5 message types. The credential field provides the sender's credential information, such as the signature of the sender. The protocol field identifies the protocol the message sender supports, such as HTTP, HTTPS, SOAP, 64 BEEP, 65 and WebSocket. 66  e) Membership message: This message is mainly used by a Registry to join or leave a D-Server. This message consists of purpose and URL fields. The purpose field identifies the purpose of the message, and its value must be either "JOIN," "LEAVE," "QUERY," or "STATUS." The URL field identifies the URL of the sender of the message, i.e., the Registry. When the value of the purpose field is set to STATUS, the message contains an additional field, called the spec field, which describes the specifications of the server, such as the categories and formats the server supports.
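The field rules above (type and content mandatory; protocol and credential optional; a membership purpose restricted to the four listed values, with STATUS carrying an extra spec field) can be sketched as a simple validator. The dictionary layout is an assumption for illustration; the paper specifies the messages in RBNF, not in any particular encoding.

```python
# Sketch of message validation following the field rules above:
# "type" and "content" are mandatory, "protocol" and "credential"
# are optional. The dict layout is illustrative, not normative.

MESSAGE_TYPES = {"registration", "notification", "query", "result", "membership"}
PURPOSES = {"JOIN", "LEAVE", "QUERY", "STATUS"}

def validate_message(msg):
    if msg.get("type") not in MESSAGE_TYPES:
        raise ValueError("unknown or missing message type")
    if "content" not in msg:
        raise ValueError("content field is mandatory")
    if msg["type"] == "membership":
        content = msg["content"]
        if content.get("purpose") not in PURPOSES:
            raise ValueError("membership purpose must be JOIN/LEAVE/QUERY/STATUS")
        # A STATUS reply additionally carries the server's spec.
        if content["purpose"] == "STATUS" and "spec" not in content:
            raise ValueError("STATUS membership message requires a spec field")
    return True

ok = validate_message({
    "type": "membership",
    "protocol": "https",           # optional field
    "content": {"purpose": "JOIN", "url": "https://registry.example/"},
})
print(ok)
```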

PROTOCOL PROCEDURES
This section introduces the protocol procedures of the proposed mechanism. They use the messages defined in Section 5.

Information publishing
An information publishing procedure defines how an InfoSource publishes or updates its information. Our mechanism provides 2 such procedures, i.e., push and pull type. Neither procedure requires modifications to the information of the InfoSources. Nevertheless, the push-type procedure requires an InfoSource to send a protocol-specific message for the sake of efficiency. On the other hand, the pull-type procedure does not require InfoSources to have any protocol-specific features, but it requires extra message exchanges.
a) Push-type information publishing: Here, the information is published as described in Figure 3. The procedure begins with a Registration message sent from an InfoSource to a Registry. The message includes URL_InfoSource and the permitted communication protocols. When the Registry receives the message, it selects one of the protocols, accesses the URL by using that protocol, and receives the InfoSource's information, which is written in XML. ‖ The Registry then applies XSLT to the information to generate its metadata in RDF, stores the metadata in its repository, and sends a Notification message with URL_InfoSource to the D-Servers. Upon receiving the message, the D-Servers send a Result message to the Registry to confirm that the message was received correctly, or they may send a Query message to receive an up-to-date Result message, though both procedures are optional and are omitted from the figure. Upon receiving the Notification message, the D-Servers may send a Notification message to D-Clients that hold obsolete information.

FIGURE 3 Information publishing. D-Client, discovery client; D-Server, discovery server; InfoSource, information source
In this way, the D-Clients can instantly receive information updates by accessing URL_InfoSource, ** which minimizes the risk due to obsolete information.
b) Pull-type information publishing: For better deployability, protocol-specific features should be avoided, since not all InfoSources can support them. The pull-type procedure is designed for those InfoSources that do not support the push-type procedure. It introduces an intermediate entity, which we call a WrapSource. A WrapSource is stationed between a Registry and an InfoSource and performs the registration procedure on behalf of the InfoSource. The WrapSource checks whether any update is available by periodically polling and accessing the InfoSource. If an update is detected, the WrapSource sends a Registration message to the Registry with URL_InfoSource. The procedure after the Registry receives the message is the same as that for push-type information publishing. The pull-type procedure does not impose any additional requirements on InfoSources. However, a small amount of network and computing resources may be wasted, and the information updates may be delayed, the maximum delay being equal to the interval of the periodic polling.
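The pull-type procedure can be sketched as follows. The change-detection strategy (hashing the fetched document) and all names are assumptions for illustration; the paper only requires that the WrapSource poll the InfoSource and register on its behalf when an update appears.

```python
import hashlib

# Sketch of the pull-type procedure: a WrapSource polls an InfoSource,
# detects updates by hashing the fetched document, and issues a
# Registration message on behalf of the InfoSource. fetch/register
# are injected callables, so the names here are illustrative.

class WrapSource:
    def __init__(self, infosource_url, fetch, register):
        self.infosource_url = infosource_url
        self.fetch = fetch          # callable returning the XML bytes
        self.register = register    # callable sending the registration message
        self._last_digest = None

    def poll_once(self):
        digest = hashlib.sha256(self.fetch(self.infosource_url)).hexdigest()
        if digest != self._last_digest:
            self._last_digest = digest
            self.register({"type": "registration",
                           "content": {"url": self.infosource_url,
                                       "protocols": ["https"]}})
            return True   # update detected and registered
        return False      # nothing changed

sent = []
ws = WrapSource("https://nvd.example/feed.xml",
                fetch=lambda url: b"<entry>CVE-2024-0001</entry>",
                register=sent.append)
print(ws.poll_once(), ws.poll_once())  # first poll registers, second sees no change
```

Calling `poll_once` on a timer gives exactly the bounded delay noted above: an update is never missed, but it may be reported up to one polling interval late.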

Server registration and cancellation
The server registration and cancellation procedure enables a Registry to use a D-Server. The procedure begins with a Membership message sent from the Registry to the D-Server. The purpose field of the message is set to "JOIN," and the message contains URL_Registry. The D-Server replies with a Membership message whose purpose field is set to "STATUS." The reply confirms whether the registration was successful and contains the categories and formats the D-Server uses. Note that this procedure allows a D-Server to use any categories, since it can tell Registries which categories it uses by means of the Membership message, though the categories and formats described in Section 7 are currently used in our mechanism. A Registry may send a Membership message to a D-Server when it wants to be deregistered from the D-Server. The message contains URL_Registry, and its purpose field is set to "LEAVE." Otherwise, the Registry may simply let its registration inside the D-Server time out.
Optionally, a Membership message can also be used to query the categories and formats a D-Server uses. In this case, its purpose field is set to "QUERY".
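A D-Server's side of this procedure can be sketched as follows. The timestamp-based timeout, the class name, and the example categories and formats are assumptions; the paper only specifies the JOIN/LEAVE/QUERY/STATUS semantics and the timeout behavior.

```python
import time

# Sketch of a D-Server's membership handling: JOIN records the Registry's
# URL (with a timestamp for timeout-based expiry), LEAVE removes it, and
# both JOIN and QUERY are answered with a STATUS message carrying the
# categories and formats the server uses. Names are illustrative.

class DServerMembership:
    def __init__(self, categories, formats, timeout=3600.0):
        self.categories, self.formats = categories, formats
        self.timeout = timeout
        self.registries = {}   # registry URL -> time of last JOIN

    def handle(self, msg):
        purpose, url = msg["purpose"], msg.get("url")
        if purpose == "JOIN":
            self.registries[url] = time.monotonic()
        elif purpose == "LEAVE":
            self.registries.pop(url, None)
        # JOIN and QUERY are both answered with a STATUS message.
        if purpose in ("JOIN", "QUERY"):
            return {"purpose": "STATUS",
                    "spec": {"categories": self.categories,
                             "formats": self.formats}}
        return None

    def expire(self):
        # Drop Registries whose registration has timed out.
        now = time.monotonic()
        stale = [u for u, t in self.registries.items() if now - t > self.timeout]
        for u in stale:
            del self.registries[u]

srv = DServerMembership(["vulnerability"], ["CVE", "IODEF"])
status = srv.handle({"purpose": "JOIN", "url": "https://registry.example/"})
print(status["spec"]["formats"])
```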

Information retrieval
The information retrieval procedure is for a D-Client to retrieve information distributed across different InfoSources. As shown in Figure 4, it begins with a Query message sent from a D-Client to a D-Server. Upon receiving the message, the D-Server forwards it to the Registries that have registered with it. Each of the Registries then searches its RDF-based internal repository by using SPARQL and produces a list of candidate InfoSources. Note that each entry of the list contains scores that represent the degree of relevance of its information to the query, though the scoring algorithm is outside the scope of this paper.

‖ Note that this 3-way messaging could be substituted with a single message if the first message, i.e., the Registration message, contained the actual XML. Nevertheless, we chose this 3-way communication, since the Registry can control when and how often it accesses the InfoSources, though immediate access is often preferred when dealing with security-related information.
** A D-Server may know which D-Clients might hold obsolete information by recording which of the registered D-Clients have exchanged which information.

FIGURE 4 Information retrieval. D-Client, discovery client; D-Server, discovery server; InfoSource, information source
Each of the Registries sends replies to the D-Server in the form of a Result message, which contains the list. The D-Server aggregates the Result messages, generates an aggregated list of candidate InfoSources, and sends a Result message with the list in it back to the D-Client. Upon receiving the message, the D-Client selects an InfoSource from the list and accesses its URL by using one of the permitted protocols to receive its XML-based cybersecurity information.
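The D-Server's aggregation step can be sketched as follows. The Result layout and the keep-the-best-score merge policy are assumptions for illustration; the paper leaves the scoring algorithm and the aggregation details open.

```python
# Sketch of the D-Server's aggregation step: each Registry's Result
# message carries scored candidate InfoSources; the D-Server merges
# them into a single list ordered by relevance score. The Result
# layout shown here is illustrative.

def aggregate_results(result_messages):
    merged = {}
    for result in result_messages:
        for entry in result["content"]:
            url, score = entry["url"], entry["score"]
            # Keep the best score reported for each InfoSource URL.
            if url not in merged or score > merged[url]:
                merged[url] = score
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

ranked = aggregate_results([
    {"content": [{"url": "https://nvd.example/a.xml", "score": 0.8}]},
    {"content": [{"url": "https://jvn.example/b.xml", "score": 0.9},
                 {"url": "https://nvd.example/a.xml", "score": 0.6}]},
])
print(ranked[0][0])   # the highest-scoring candidate comes first
```

The D-Client then picks an InfoSource from the ranked list and fetches the XML directly, as described above.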

Information update
Our mechanism can provide information updates from InfoSources to D-Clients in real time. When an information update from an InfoSource is published as described in Section 6.1, the D-Server may push the update to the D-Client immediately by using a Notification message, as illustrated in Figure 3. In this way, the information is updated in real time. Nevertheless, it is not always possible for a D-Server to push updates to a D-Client. The mechanism thus allows the D-Client to poll the D-Server periodically to check for any updates. Though such an update is not in real time, the delay is bounded by the polling interval. Note that the WrapSource introduced in Section 6.1 polls the InfoSources periodically; thus, if the updates are conducted through a WrapSource, its polling interval adds a further delay to the update.

INFORMATION STRUCTURE
Each Registry has an RDF-based internal repository to maintain the metadata of the InfoSources and to facilitate future retrieval requests. This section describes the flexible and extensible information structure of the metadata.
The information structure of the metadata manages the category and format of the information separately, and the category and format are linked with each other to incorporate different schemata and maintain extensibility. We used the information types defined by the cybersecurity information ontology 3 for the category, while we used industry specifications defining XML schemata for the format. At present, the categories incorporate the formats shown in Table 2. Figure 5 depicts the information structure, which is encoded as a series of RDF triples. The information structure has the 7 categories and 6 subcategories listed in Table 1 and defines them as its RDF classes. Likewise, it defines classes for each industry specification such as CEE and IODEF and relates them to one of the category classes. Each InfoSource URL is associated with the industry specification the InfoSource uses, and it is followed by a timestamp and metadata. The timestamp records the last time the Registry checked the information's existence, while the metadata is generated by running a predefined XSLT script on the information. The metadata is generated according to the procedures described in Section 6.1.
Note that the metadata may have more types of information, but we kept the metadata presented here to a minimum for simplicity. Depending on the situation, the metadata can be easily expanded.
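The category-format-instance structure described above can be sketched as a small triple store. The predicate names (`subClassOf`, `usesFormat`, `lastChecked`, `hasMetadata`) are placeholders for illustration, not the paper's actual RDF vocabulary; a real Registry would hold these triples in an RDF store and query them with SPARQL.

```python
# Sketch of the metadata structure as RDF-style triples: a format class
# is linked to a category class, and each InfoSource URL is linked to
# its format, timestamp, and extracted metadata. Predicate and class
# names are illustrative, not the paper's actual vocabulary.

triples = set()

def add_infosource(url, fmt, category, timestamp, metadata):
    triples.add((fmt, "subClassOf", category))        # format -> category link
    triples.add((url, "usesFormat", fmt))             # InfoSource -> format
    triples.add((url, "lastChecked", timestamp))      # existence-check time
    triples.add((url, "hasMetadata", metadata))       # XSLT-extracted metadata

def sources_in_category(category):
    # Follow the category -> format -> instance links.
    formats = {s for s, p, o in triples
               if p == "subClassOf" and o == category}
    return sorted(s for s, p, o in triples
                  if p == "usesFormat" and o in formats)

add_infosource("https://nvd.example/a.xml", "CVE", "vulnerability",
               "2024-01-01T00:00:00Z", "buffer overflow in ...")
add_infosource("https://jvn.example/b.xml", "IODEF", "incident",
               "2024-01-02T00:00:00Z", "phishing incident ...")
print(sources_in_category("vulnerability"))
```

Because the category and format layers are separate, adding a new format is just one more `subClassOf` triple, which is the extensibility property analyzed in Section 9.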

PROTOTYPE
This section describes a prototype to demonstrate the feasibility and usability of the proposed mechanism. The language used for the implementation was Java, and JRE versions 1.6 and 1.7 are supported. Jena SDB 68 is used to save RDF data into a MySQL database, and SPARQL is used to query the data. For the test environment, we used a Linux machine with the specifications listed in Table 3 and prepared virtual machines with a CPU (1 core) and 1 GB of memory for each of the D-Servers, Registries, and InfoSources. Some of the details were left undefined in the earlier sections, but they need to be defined for the prototype. Here are the implementation-specific features.
a) When an InfoSource publishes an XML document containing cybersecurity information, the related Registries generate RDF-based metadata on the InfoSource by applying XSLT to the document. Although a sophisticated metadata extraction mechanism can be implemented if needed, the prototype takes a simple approach.
b) When scoring the candidate InfoSources, we first count the number of keywords available in a tag and divide the number by the total number of words in the tag. We then assign the highest rank to the entry that has the highest resultant value. If entries with the same value are found, the one with the older registration date gets the higher rank.
c) The prototype can access the NVD and JVN repositories and update itself automatically at a regular interval that can be adjusted as needed. Note that the repositories provide a file that has multiple entries containing vulnerability information. The prototype automatically parses the information and separates these entries so that it can handle the individual pieces of information separately.
d) A Jetty certificate 69 is used to certify the InfoSource.
e) HTTPS is used to exchange information.
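The scoring rule in item b) above can be sketched as follows. The prototype is written in Java; Python is used here only for brevity, and the entry layout is an assumption made for the sketch.

```python
# Sketch of the prototype's scoring rule from item b): the score of an
# entry is the number of keyword occurrences in a tag divided by the
# total number of words in that tag; ties are broken in favor of the
# older registration date. The entry layout is illustrative.

def score(tag_text, keywords):
    words = tag_text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in keywords)
    return hits / len(words)

def rank(entries, keywords):
    # entries: (tag_text, registration_date); sorting by negated score
    # puts high scores first, and the date string (ISO format) breaks
    # ties so that the older registration ranks higher.
    return sorted(entries,
                  key=lambda e: (-score(e[0], keywords), e[1]))

entries = [("buffer overflow in parser", "2015-06-01"),
           ("overflow", "2016-01-01"),
           ("overflow", "2014-03-01")]
ranked = rank(entries, {"overflow"})
print(ranked[0])
```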
The prototype runs a servlet on a Registry, which provides 4 types of interfaces: keyword search, advanced search, category browsing, and news update. D-Clients, i.e., users, can access the servlet via browsers. A snapshot of these interfaces is shown in Figure 6.
a) Keyword search: This interface provides a means to locate information related to a simple keyword that users specify. It is shown at the top of Figure 6. Users input keywords in the box and run a search by clicking the "Send Query" button. This search has 2 modes, i.e., perfect and partial match. The perfect match mode returns entries that exactly match the keyword the users input, while the partial match mode returns entries that contain any part of the keyword. When using the partial match mode, users may also use regular expressions.
b) Advanced search: This interface provides a means to run a more sophisticated search than the keyword search above. Users may specify the XML tag indicating where they wish to search. The area below the simple keyword search is used to run this detailed search. Users may look up the tags; the servlet provides the available tags, and the user simply chooses one of them. The user inputs the keywords and runs the search by clicking the "Send Query" button. As illustrated in the figure, multiple lines of conditions can be specified to create a more targeted search. As with the keyword search, this advanced search has both perfect and partial match modes.
c) Category browsing: This interface lets users browse a list of information within a specific category of their choice. It is shown on the left side of Figure 6. The user clicks on a category or format to see the list of available information.
d) News update: This interface provides a list of recently registered information entries in chronological order. It is shown on the right side of Figure 6. Users can specify which information they are interested in and, if needed, filter the list of information displayed.

FIGURE 6 User interface for retrieving information
Once the users receive a list of information entries through one of the interfaces, they can see the actual content of the entries by clicking on them. Note that the 4 interfaces are the same internally; all of them run the same tag-based search on the SPARQL engine.
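The common tag-based search behind all 4 interfaces can be sketched as a SPARQL query builder, with the perfect/partial modes mapped to an exact match versus a regex FILTER. The vocabulary URI (`urn:example:`) and the query shape are placeholders assumed for the sketch, not the prototype's actual schema.

```python
# Sketch of the common tag-based search behind all four interfaces:
# each one reduces to a SPARQL query over the metadata, with the
# perfect/partial modes mapped to an exact match versus a regex
# FILTER. The vocabulary URI is an illustrative placeholder.

def build_query(tag, keyword, partial=False):
    if partial:
        # Partial match: case-insensitive substring/regex match.
        condition = f'FILTER regex(?value, "{keyword}", "i")'
    else:
        # Perfect match: the tag value must equal the keyword exactly.
        condition = f'FILTER (?value = "{keyword}")'
    return (
        "SELECT ?source WHERE {\n"
        f"  ?source <urn:example:{tag}> ?value .\n"
        f"  {condition}\n"
        "}"
    )

print(build_query("summary", "overflow", partial=True))
```

This also explains the performance gap measured in Section 9.2: the exact-match FILTER can use the store's indices, while the regex FILTER must scan candidate values.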
The prototype links and locates the online cybersecurity repositories to discover cybersecurity information users need. It thus demonstrates the feasibility of our mechanism.

DISCUSSION AND ANALYSIS
We evaluated the proposed mechanism in terms of its extensibility and scalability as well as the credibility of the information it provides. We also investigated operational issues.

Extensibility of the mechanism
This section analyzes the extensibility of the mechanism by discussing how it can be made to cope with future schemata and unstructured data.

Coping with future schemata
The information structure introduced in Section 7 is unique in that it separately defines categories and formats and then links them together. This structure makes our mechanism extensible into the future. By separating the information structure into 2 parts, we can keep the category part static while allowing changes to the format part. We used the categories defined in the reference ontology, 3 since they are at an abstract level rather than a concrete or schema level. Likewise, we have used industry specifications for the formats so that we can add new schemata in the future. If the information structures defined by existing specifications become obsolete, all we have to do is to build a new specification and associate it with one of the categories.
When a new schema is added, we need to link the schema with one of the categories. More precisely, when we add a new type of cybersecurity data to our system, we extend the RDF schema of the metadata by adding a new format to one appropriate category, as shown in Figure 5, and register the schema of the data to our system. Note that we need to select one appropriate category, but we do not need to modify any of the existing categories. As with the other formats, this new format is linked with instances, i.e., URIs of data instances. We also need to prepare appropriate XSLT scripts to extract metadata from the InfoSources. Since the information structure is represented in RDF, changes can be performed and propagated by simply adding or removing several RDF triples. Therefore, we conclude that the proposed mechanism is extensible.
The categories are designed to be static, and we do not see the need to change them at this stage because the underlying ontology was designed on the basis of a year-long analysis of the current operations of major international security operation centers 3 and because the categories are abstract enough to absorb minor differences. Nevertheless, if a D-Server prefers its own categories, it can use them. During the server registration procedure described in Section 6.2, the D-Server may specify an arbitrary set of categories. In this case, we advise the D-Server not to change its categories too frequently, to avoid confusion among D-Clients and InfoSources. When using different categories, the namespace of the schema defining the information structure should be changed so that the changes can be managed.

Coping with unstructured data
The extensible information structure allows us to adopt a new schema that may emerge in the future. This allows us to collect more information. Nevertheless, the amount of structured information will remain small compared with the amount of unstructured information. To access the great variety of information on the Internet, the proposed mechanism should be able to cope with unstructured information as well.

FIGURE 7 Registration performance
For instance, knowledge can be extracted from the unstructured online information and/or from the description fields of structured information. Some techniques for that purpose are mentioned in Section 2.5. We can use these techniques to generate structured information from unstructured information. These techniques increase the amount of information that we can cover and thus enhance the usability of the mechanism.

Scalability of the mechanism
Here, we evaluate the scalability of the proposed mechanism by analyzing the performance of the prototype. We measured the times for registering and locating information, using the dataset and machines described in Section 8. For simplicity, this measurement used only 1 Registry and 1 InfoSource in a closed network. Figure 7 plots the time needed to register an information entry. We selected an entry of NVD and registered it repeatedly. The horizontal axis describes the total number of entries accumulated in the Registry, while the vertical axis describes the time required for registering the entry in the Registry. The registration was done using the push-type information publishing procedure described in Section 6.1. The measurement began when the InfoSource started to generate the Registration message to send to the Registry, and it ended when the D-Server received the Notification message from the Registry.

Processing time of registration
Though we registered the same entry repeatedly, the time needed to register each instance fluctuated due to environmental factors. Nevertheless, the registration time was roughly between 100 and 300 milliseconds. In some cases, the registration took around 20,000 milliseconds, but we believe that this was caused by the commit procedure of MySQL, and we thus treated these cases as outliers. As the figure demonstrates, registration takes a more or less constant amount of time regardless of the number of entries accumulated in the Registry. Therefore, we conclude that the registration process maintains scalability.

Processing time of retrieval
As mentioned in Section 8, the prototype provides 4 search interfaces, but all of them run the same tag-based searches internally. Therefore, we chose the simplest search, i.e., the keyword search described in Section 8, as our evaluation target. The search has 2 modes, i.e., perfect and partial match, and the performance depends on which mode is used. Figure 8 shows the prototype's performance for a perfect match retrieval, while Figure 9 shows the performance for a partial match retrieval. In both figures, the horizontal axis is the total number of entries accumulated inside the Registry, while the vertical axis is the time required to retrieve an entry. The retrieval followed the procedure defined in Section 6.3 for each mode. The measurement began when the D-Client started to generate a Query message to send to the D-Server, and it ended when the D-Client received a Result message from the D-Server. Figure 8 shows that the retrieval takes a more or less constant amount of time. This is because the Registry internally assigns indices to each of its entries. Therefore, we conclude that the perfect match mode maintains scalability. In contrast, Figure 9 shows that the time needed for the partial match mode is proportional to the size of the Registry. This mode is significantly slower than the perfect match mode, since it needs to iterate operations similar to the perfect match operations. Though its performance is inferior to that of the perfect match mode, the partial match mode also maintains scalability.

FIGURE 9 Partial match search performance

Information credibility
Taking actions based on malicious or wrong information can cause serious problems; thus, the credibility of the information is a critical concern. The following issues need to be considered to guarantee information credibility.

a) Authentication of information provider:
The entities that provide the information need to be identified and authenticated. We should make sure that the information providers are not impersonated in some way, because malicious parties often disguise themselves. Here, we need to authenticate the information provider by using appropriate measures, such as certificates. Once we identify and authenticate information providers, the information they provide could be filtered on the basis of their reputations.
b) Integrity of information: It is necessary to verify that particular information existed at a certain time and that it was not modified thereafter. One approach is to use certificates and digital signatures. We can add a digital signature to a message to prove the authenticity of the message sender. The message has an optional field, called credential, as defined in Section 5, and that field could be used for a digital signature. We could also use communication protocols that employ digital signatures to verify the authenticity of the parties. The prototype uses Jetty certificates, but further measures to reinforce credibility must be studied.
c) Accuracy of the content of the information: The content of the information that is shared with the public should also be verified. We need to make sure that the information provider maintains a high level of credibility in the public eye and that the information it provides also has a high level of credibility. Reputation management can address this issue.
d) Information duplication and contradiction: When locating information, D-Servers and Registries may receive duplicated answers from multiple InfoSources. They may also receive contradictory answers from multiple InfoSources. They need to deal with such duplications and contradictions because D-Clients may be confused by inconsistent information. Such duplications and contradictions could also be used, for instance, as a hint for evaluating the credibility of the information content.
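The use of the optional credential field for integrity can be sketched as follows. A shared-key HMAC is shown only for brevity and is an assumption of this sketch; the setting described above would rather use certificates and public-key signatures, as the prototype does with Jetty certificates.

```python
import hashlib
import hmac
import json

# Sketch of using the optional credential field for integrity: the
# sender signs the serialized content and the receiver verifies it.
# A shared-key HMAC stands in for the certificate-based signatures
# discussed in the text.

def sign(msg, key):
    payload = json.dumps(msg["content"], sort_keys=True).encode()
    msg["credential"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return msg

def verify(msg, key):
    payload = json.dumps(msg["content"], sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(msg.get("credential", ""), expected)

key = b"shared-secret"
msg = sign({"type": "result",
            "content": {"urls": ["https://nvd.example/a.xml"]}}, key)
print(verify(msg, key))
msg["content"]["urls"].append("https://evil.example/")   # tampering
print(verify(msg, key))
```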

Operational considerations
When deploying a system that conforms with the proposed mechanism, we need to consider performance, privacy, and security issues. This section considers the issues related to these matters.

Load-balancing of Registries
A Registry could be implemented on multiple computers to spread out the workload, as mentioned in Section 4. Although the communication protocol between such computers is not defined in this paper, an arbitrary protocol could be used. One simple scheme is to let a downstream computer that is closer to the InfoSource behave like an InfoSource, so that its upstream computer, which is closer to the D-Server, can communicate with it as if it were an InfoSource.

Privacy-oriented Registry setup
As mentioned in Section 4, the relation between D-Servers and Registries is multipoint to multipoint. This feature allows us to establish domain-specific Registries to handle sensitive information that must not be disclosed outside the domain. Figure 10 illustrates such a deployment of a D-Server and Registries. The D-Server is connected to 1 internal Registry and 2 external Registries. The internal Registry handles sensitive information that must not be disclosed outside the intranet, while the external Registries handle public information. The D-Client in the figure can access the InfoSources administered by the external Registries as well as those administered by the internal Registry, while nobody outside the intranet can access the InfoSources administered by the internal Registry so long as the D-Server is not accessible from outside the intranet.
Organizations often have both sensitive and nonsensitive information. As a result, they may have 2 separate Registries: one for internal use and another for the public domain. While various security mechanisms, including access control at a D-Server, may support the confidentiality of sensitive information, this type of secure network design can be considered to bolster the confidentiality of this information.

Adjustment for each InfoSource
The pull-type information publishing scheme introduced in Section 6.1 allows us to cope with InfoSources that know nothing about our mechanism. Nevertheless, the pull-type scheme alone is sometimes insufficient for efficiently handling the information that InfoSources currently provide. Indeed, the prototype needed to parse information updates from both NVD and JVN, as mentioned in Section 8, even though it used the pull-type scheme.
The NVD provides an XML file that lists all the vulnerability notes. The prototype would work efficiently with the pull-type scheme if a new file containing only a new vulnerability note were generated. However, when a new vulnerability note is found, the XML file is modified without producing any new file. Therefore, our prototype detects updates by checking for changes to the XML file and, upon detecting an update, parses the XML and extracts only the new vulnerability note from the file to run the procedure described in Section 6.1. The same goes for JVN; we have a separate parser program that extracts the new vulnerability note from an XML file containing all the vulnerability notes. The prototype could take a different implementation approach. These repositories provide each vulnerability note in a separate HTML page. Therefore, we could use the vulnerability notes in HTML instead of those in XML and generate metadata by converting the HTML pages. In either case, we still need to adjust the prototype to efficiently handle existing InfoSources.

FIGURE 10 Illustrative topology handling private and public information. D-Client, discovery client; D-Server, discovery server; InfoSource, information source
Such an adjustment is inevitable because these sources were not designed for our mechanism. Nevertheless, preparing a few scripts is usually enough to accommodate a newly found source.
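The per-source adjustment described above can be sketched as a diff over the feed's entry identifiers. The feed layout (`<feed><entry id="..."/></feed>`) is a simplified assumption; the actual NVD and JVN feeds have richer schemata, but the extract-only-what-is-new logic is the same.

```python
import xml.etree.ElementTree as ET

# Sketch of the adjustment described above: the repository republishes
# one XML file containing every vulnerability note, so the parser has
# to diff the file against the previously seen note identifiers and
# register only the new entries. The feed layout is illustrative.

def new_entries(feed_xml, seen_ids):
    root = ET.fromstring(feed_xml)
    fresh = []
    for entry in root.findall("entry"):
        entry_id = entry.get("id")
        if entry_id not in seen_ids:
            seen_ids.add(entry_id)
            fresh.append(entry_id)
    return fresh

seen = set()
feed_v1 = '<feed><entry id="CVE-2024-0001"/></feed>'
feed_v2 = ('<feed><entry id="CVE-2024-0001"/>'
           '<entry id="CVE-2024-0002"/></feed>')
print(new_entries(feed_v1, seen))   # first poll: one new note
print(new_entries(feed_v2, seen))   # second poll: only the added note
```

Each returned identifier would then be registered individually through the procedure of Section 6.1, so the Registry sees one note per registration rather than the whole feed.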

Further development of the prototype
Although the mechanism described in this paper focuses on linking information, it must also cope with malicious attacks. For instance, it should be able to cope with impersonation, man-in-the-middle, and denial of service attacks that can severely affect security. For this reason, the prototype needs to be developed further before it can be deployed on the Internet. Though this is outside the scope of this paper, we should note that the prototype needs to use appropriate security techniques, and its security level needs to be reinforced before it is deployed on the Internet. For instance, the sequences defined in Section 6 are designed only for exchanging cybersecurity information, but they do not describe any mechanism for securing the exchange. Various security techniques, including authentication and encryption, can be applied to the sequences. Likewise, various negotiation processes can be added to the sequences to enrich their functions and capabilities.

Usefulness of the mechanism and its applicability to other contexts
The proposed mechanism allows us to locate cybersecurity-related information across various repositories. It makes it easy for information consumers to locate and access needed information instantly. For instance, when system administrators look for vulnerability information on all the software in their systems, they may use the mechanism to locate related information across various repositories by specifying the software names or CPE-IDs. They can then instantly receive a list of vulnerability information related to their systems from different repositories. Different repositories may provide different types of vulnerability information, and they sometimes publish information on the same vulnerability (with the same CPE-ID) with slightly different contents. For instance, the severity score, typically represented by CVSS, may differ among repositories; indeed, the CVSS score of a certain vulnerability note in JVN is not necessarily the same as the score in NVD. Moreover, some organizations may first publish vulnerability information without details and revise the information later. By receiving the list of information and looking at the item that has the latest timestamp, system administrators can obtain the latest information.
The same could be attempted by customizing an ordinary search engine, which is designed for general purposes and retrieves unstructured information. Such an engine may return a list of information sources, but some of them may be unrelated to vulnerability information or may be unreliable. By using our mechanism, we can be more specific and thus obtain more accurate and usable information.
The proposed mechanism is designed for cybersecurity-related information. Indeed, cybersecurity is an appropriate context for applying the mechanism because cybersecurity operations require many different kinds of information, and online repositories providing cybersecurity-related information already exist. Nevertheless, the mechanism is applicable to other, non-cybersecurity contexts as well, as long as the information is structured and represented in XML. When applying the mechanism, we should list use cases and define the scope of the information we deal with. Otherwise, we may end up linking unnecessary types of information, which could decrease the usability and performance of the mechanism. Especially in the context of cybersecurity, unnecessary information may confuse users and hinder them from taking appropriate and prompt actions.

CONCLUSION
The proposed mechanism links, locates, and discovers structured cybersecurity information over the Internet. Its unique information structure separately defines categories and formats and links them with each other. We used the categories defined by the reference ontology 3 and industry specifications for the formats of the information structure. The prototype implementation demonstrated the feasibility of the mechanism, and our analysis showed that the mechanism is flexible, extensible, and scalable. The mechanism makes cybersecurity information discoverable and enables better use of the information. Various repositories from all over the world can be linked and integrated through it. This will create an Internet-scale cybersecurity knowledge base and realize a web of cybersecurity.
The mechanism can be used for various purposes, including simple information retrieval. As mentioned in Section 2.6, repositories are useful for security operations. The mechanism allows automated security techniques to access assorted cybersecurity-related information and repositories all over the Internet. We hope that this study will contribute to the advancement of security automation techniques.

ACKNOWLEDGMENT
This work was supported by a grant from the Japan Society for the Promotion of Science (JSPS KAKENHI grant number 17K12699).