A data model for enabling deep learning practices on discovery services of cyber‐physical systems

The W3C Web of Things (WoT) is a leading technology that facilitates dynamic information management in the Internet of Things (IoT). In most IoT scenarios, devices and their associated information change continuously, generating a large amount of data. Hence, to correctly use the information and the data generated by different devices, a new perspective of managing and ensuring data quality is recommended. Applying Data Science techniques to create the data model can help to manage and ensure data quality by creating a common schema that can be reused in future projects, as well as producing recommendations to facilitate Service Discovery. In addition, due to the dynamic devices that change over time or under specific circumstances, the data model created must be sufficiently abstract to add new instances and to support new requirements that devices should incorporate. The use of models helps to raise the abstraction level, adapting it to the continuous changes of devices by defining instances associated with the data model. This paper proposes two data models: one for Cyber‐Physical Systems (CPS) to define device information fetched by a Discovery Service, and another for applying Deep Learning in natural language problems through a Transformer approach. The latter matches user queries in natural language sentences with WoT devices or services. These data models expand the Thing Description model to help find similar CPSs by giving a confidence level to each CPS based on features such as security and the number of times the device was accessed. The results show how the proposed models support the search process of CPSs in syntactic and natural language searches. Furthermore, the four levels of the FAIR principles are validated for the proposed data models, thus ensuring the data's transparency, reproducibility, and reusability.


INTRODUCTION
Information and services available through the network are ever increasing. 1 Previously, information and services were predominantly published by people. As a result, the volume of published information was bounded by human limits: a person cannot process and publish information at the rate a machine can. As technology has advanced, machines have taken over the publication of information and services, so the capacity to deliver information has increased and is no longer bound by those human constraints. 2 However, end users cannot manage such a large amount of data on their own and can therefore be supported by discovery techniques. Given the increased volume of information available, it is necessary to define and structure the information to ensure data quality and to ensure that the data can be understood and used in the best possible way.
Among the entities that publish information are Cyber-Physical Systems (CPSs).* These devices are becoming available in areas such as Smart Homes and Smart Buildings, among other recent areas, especially in the field of Smart Cities. 3,4 However, these devices change continuously. Examples include managing the position of a smart scooter that is in movement, or managing the information of sensors that are replaced, or updated through software by the manufacturer, every few months. Handling such changes requires a data model abstract enough to be compatible with all the devices. Applying Data Science techniques to create the data model of CPSs ensures data quality and helps to manage data. 5 This paper focuses on solving the need for a common data schema to define CPS information that is compatible with current and future devices. To do so, we propose a data model to define a data schema for CPSs from the perspective of a Discovery Service. 6 The proposed data model represents CPSs as a combination of information related to the services that the device provides and information related to the quality and security of the device. Using Model-Driven Engineering (MDE) techniques to represent the information of CPSs helps in the search and recommendation processes of CPSs. Furthermore, the proposed data model is developed to be re-used in future developments of CPSs, regardless of the type of device or the manufacturer, among other possible examples.
In addition, an extra data model is proposed for applying Deep Learning (DL) in natural language problems through a Transformer approach. 7 The latter matches devices and services with a user query in natural language, returning a suitable list of devices that match the query, serving as a matching mechanism or Recommender System. Proposing a data model for the DL process reduces the workload in the preprocessing step for getting the data and applying feature engineering techniques. The input data from the dataset is transformed to follow the proposed schema, represented using MDE techniques for recommending CPSs, thus facilitating the re-use of the schema when training DL models with algorithms other than the one used in this paper.
Due to the dynamic nature of Internet of Things (IoT) devices, their information is continuously changing. 8,9 In addition, devices from different providers communicate in different ways. 10 An abstract data schema that supports past, present and future devices is recommended to collect and store the information from the dynamic devices, ensuring data quality and helping to manage data from any device. Furthermore, the use of models facilitates the addition of new instances and helps to explain the inner workings of the data schema and to define, verify and validate it.
For the proposed data model, the Thing Description (TD) model from the Web of Things (WoT) is used as the core of the data model. 11 The TD is a document that defines the features and properties of the device, providing more information than just the device itself. The proposed data model expands the TD with data related to information such as data quality. Furthermore, for the definition of the data model, the FAIR principles are followed, ensuring that the data schema created from the data model is Findable, Accessible, Interoperable and Reusable. 12 Regarding the second data model proposed, the goal is to define a data model that can solve natural language queries for any device. The defined data model is based on the first data model proposed, thus re-using some of the information related to the quality of the device. The reason for using Artificial Intelligence (AI) to solve the natural language problem is that it helps to solve problems that people cannot. In addition, using AI techniques improves the search process of a WoT Service Discovery, giving support to syntactic and semantic search. The data model is designed to be used (and work properly) with any Deep Learning algorithm, but for the analysis, the Transformer algorithm is used, 13 extending a previous work where Transformer was used to design a recommender system for CPSs. 14 Transformer is a novel solution that delivers better results on the problems it is applied to, mainly those of natural language. As such, Transformer is used as the algorithm to analyze the approach of matching user queries in natural language with WoT devices and services.
The following research questions are addressed to identify the objectives as well as to approach the aforementioned facts.
1. RQ1: Is it possible to ensure quality and interoperability by defining an information data model related to devices and services in a WoT Service Discovery?
2. RQ2: Does it make sense to define an additional data model to facilitate the discovery of devices and services in CPSs using natural language through a Deep Learning model such as a Transformer working as a matching mechanism or Recommender System?
To answer them, we propose a Domain-Specific Language (DSL) to represent any CPS using the information related to its security and quality (RQ1). Furthermore, the metamodel describing this DSL is used in an example scenario to analyze how the proposal affects the search for CPSs. In addition, another DSL is proposed to train Deep Learning models for recommending CPSs using natural language sentences (RQ2). This DSL is based on the previous one so that the recommender's result is supported by the device's security and quality information.
This article is structured as follows. Section 2 offers an overview of related projects on metamodels for describing Cyber-Physical Systems, reviewing the literature and introducing the background of the problem. Section 3 presents a general overview of the proposed models and briefly describes the classes and attributes of the model. Section 4 provides an in-depth explanation of the proposed models' classes and attributes. Section 5 describes the analysis of the proposed model for storing device information and the analysis of the data model for applying Deep Learning, and validates that both data models follow the FAIR principles. Finally, the conclusions and future work are explained in Section 6.

LITERATURE REVIEW
This section offers an overview of papers related to our proposal. Furthermore, a background of the related work is outlined to describe some of the terms used in the paper.

Related work
In the literature, metamodels for CPSs are focused on describing the communication or networking of devices to solve the interoperability problem due to the heterogeneity of devices. Protocols, interactions, usage, and exposure of devices are defined in metamodels using Model-Driven Engineering (MDE) techniques. 15 Furthermore, in the literature, metamodels also aim to describe data generation of devices to establish a common way of storing and using the information of devices.
In addition, MDE approaches may be supported by code generation techniques and Artificial Intelligence (AI) oriented solutions. However, the literature does not present metamodels that describe the information needed to find devices, that is, to facilitate the discovery or selection of devices among a large number of similar devices. In our approach, we focus on defining a data model to describe CPSs, with the main idea of making it a helpful description when searching for devices, thus advancing the definition of models that represent CPS data to help in the discovery process of CPSs. Furthermore, the proposal is supported by a data model which applies a Deep Learning algorithm in our case study through a Transformer approach.
In Reference 16, as the existing MDE proposals are for modeling the internal behavior of things, that is, how devices manage data generation, the authors propose an MDE approach for modeling the networking, thus solving the existing interoperability problem regarding IoT communications due to the heterogeneity of devices. The proposal includes the generation of network artifacts using ThingML. This proposal is similar to our approach because both propose data models to solve a problem other than defining the internal behavior of CPSs. However, in our approach, the interoperability problem of devices is solved by using and extending the Thing Description (TD). The TD data model, among other features, defines classes to describe device communication. Another approach focused on the communication and interaction between CPSs is the metamodel proposed in Reference 17. There, the authors propose a metamodel focused on defining the communication and behavior of CPSs to allow the internet communication of the described CPSs. Finally, the metamodel is mapped into the Industry Foundation Classes (IFC) schema and validated.
On the other hand, an approach that focuses on improving the interoperability of devices is presented in Reference 18. The authors propose an architecture for the Industrial Internet of Things (IIoT). A metamodel is defined for this architecture to integrate IoT devices into networks and social networks. The architecture has five layers: (a) a sensing layer to define the physical devices; (b) a database layer to define the physical and virtual databases for storing the information of the devices; (c) a network layer to describe the connection and communication protocols between devices and users; (d) a data response layer to define the automatic response of devices to users; and (e) a user layer to describe the API for the use of devices by external entities. The metamodel defines devices and how they are consumed and exposed through the network, and defines rules to control and automate the features of the actuators and sensors. This approach focuses on defining the physical devices and the interaction between devices and external entities. Our solution extends the Thing Description (TD), which already solves the interaction problem because external entities interact with devices using the information provided in the TD metamodel. Accordingly, our approach focuses on searching for devices so that users can interact with them through voice using natural language.
SoAML4IoT is another approach where communication is defined in the metamodel and, regarding the metamodel definition, it is the most similar paper to our approach. In SoAML4IoT, 19 the authors propose a Domain-Specific Language (DSL) based on Service-Oriented Architecture (SOA) to address the interoperability problem of IoT devices. The proposed metamodel focuses on defining services in such a way that IoT devices are exposed and found by those that need to consume them. In addition, communication, location, and information about the data managed by devices are described in the metamodel. In our approach, devices are also defined as services. The TD defines the basic information of devices and the services that the device exposes (actions and properties) and consumes (events). The main difference with the data model proposed in our approach is that we focus on data that is valuable for searching for devices, in addition to describing devices as services.
Another approach is Reference 20, where the authors propose a metamodel approach for developing IoT systems. In this proposal, the authors extend three existing metamodels: a high-level one, ELDA, and JACOSO. The high-level metamodel defines information related to quality, location, and available services. In contrast, the ELDA and JACOSO metamodels represent information related to events, tasks, and communication.
Other approaches regarding the use of metamodels for representing the behavior of CPSs are References 21 and 22. In Reference 21, the authors propose a decision-making system for the Circular economy business model for tracking and monitoring IoT products. To develop the decision-making system, the authors propose an ontology model where the product is classified according to its UseCycle and LifeCycle, creating rules to decide when the product can be reused or must be changed. In Reference 22, the authors propose an extension of the Business Process Model and Notation (BPMN) represented by a metamodel. The extension aims to support IoT systems by representing them in the BPMN model. The extension includes information related to physical and virtual entities, information related to the computer resource, and information related to the quality of the devices, such as security and data quality. Finally, the authors propose a fog/cloud federation architecture using the BPMN extension and validate the proposal in two scenarios, a smart autistic child scenario and a coronavirus disease scenario. Similar to our approach, the authors model the quality of service of the CPSs, defining the security and the data quality among other features. However, the model focuses on increasing the interoperability between CPSs, modeling other features such as hardware information. In our approach, the model is defined to facilitate searching for devices by differentiating similar devices with metrics that can be helpful in the search process.
As shown, current research focuses on defining the quality of the data managed by the CPSs and on solving the interoperability problem between CPSs. The aim is to extract and use the data managed by devices from different manufacturers. In our proposal, we focus on searching for devices instead of searching for information on devices. We aim to return a device that the user will use to get information such as the temperature, and not a value from the device (i.e., its temperature). Therefore, the novelty of our approach, compared to the related work, is the use of models to define an abstract data schema that gives extra information in the search process of CPSs, based on information such as the security and search quality of the device.
Related to finding CPSs, current research tries to manage the high volume of data produced by the CPSs. For instance, in Reference 23, the authors propose a model to improve the performance of collecting, analyzing, and using the data produced by IoT devices. Our proposal advances research in the data space domain by facilitating the search for similar devices and the processing of large numbers of devices. The information of CPSs is extended with contextual information (the location) and quality information that can help in the recommendation of CPSs. In addition, as our approach works with the device's information instead of the values that the device produces, the amount of data managed by our system is smaller, which helps the performance of the search process. Other approaches advance research in the field of data space in terms of the communication of data from different providers or places and how the data is collected, stored, processed, and consumed. In Reference 24, the authors propose an architecture that, following the FAIR principles, processes and analyzes data from different IoT sources to create a system capable of searching the energy information of Italy. Another paper that focuses on data space is Reference 25, where the authors propose an architecture, following the FAIR principles, for integrating the data from different providers by applying techniques such as machine learning and a federation approach. With this approach, data providers can access the data from other providers and create a collaborative and shared environment. Our approach, compared to these two works on data space, advances the data space domain by proposing a common data model that can be used to compare similar data and data from different sources. The CPSs get a score based on information that can help the search process. Furthermore, that information can be extended to compare the quality of the data from different providers when delegating queries to other discovery services.
Regarding Artificial Intelligence (AI) techniques in modeling CPSs, in Reference 26, the authors propose a Model-Driven solution for CPSs with a Machine Learning (ML) approach. The proposed data model for CPSs is focused on describing the communication protocols and information managed by devices. In addition, ML algorithms are defined to create models for different things. Finally, the proposal uses a code generator in ThingML for the data model. This code generator creates an API capable of instantiating machine learning models for prediction, classification, and clustering, among others. However, the Transformer algorithm is not supported directly by the code generator. The main difference with our approach is that in Reference 26, the authors focus on generating machine learning models through an MDE approach. The main proposal of our approach involves the definition of a data model for storing CPS information in the repository of a Discovery Service, extended with a data model for supporting the recommendation of CPSs using Deep Learning. Furthermore, our approach, instead of generating code, focuses on defining a data model which solves natural language problems through a Transformer approach. On top of that, our approach describes the actions and events of the CPSs through the extension of the TD.
Lastly, for papers using the Thing Description to propose an MDE approach, in Reference 27, the authors propose an approach for generating Web of Things (WoT) compatible devices using MDE techniques. The authors define a metamodel based on the TD from October 2018. WoT device code is also generated using the proposed metamodel, Xtext, and Xtend. The work is then extended with concrete and abstract syntax to perform a Model-to-Model (M2M) transformation for automatically generating WoT Servients. 28 Other papers that propose a metamodel for the Web of Things are Reference 29, which explores a metamodel focused on representing devices as virtual entities that offer services, and Reference 30, which proposes a metamodel for context information.
The comparison between our proposal and the related work is shown in Table 1. Five aspects have been considered, all of which are included in our approach, whereas the existing works either do not cover them or address them only partially: a. Interoperability of data model about IoT communications; b. Code generation from an MDE approach; c. FAIR principles applied in data records; d. Search information for the data model or improve the search process; and e. Artificial Intelligence techniques.
As shown, our approach combines the interoperability of TD with the search information proposed in the data model and the AI data model for applying Deep Learning. The proposal of Reference 28 extends the TD but focuses on automatically generating WoT Servients; unlike us, the data model is not their main proposal. Regarding the Deep Learning (DL) approach, in Reference 26, the authors combine AI and data models. However, the AI approach in Reference 26 is used to generate machine learning models automatically. In contrast, our proposal focuses on the data model used for the Transformer algorithm to support DL models for recommending CPSs. Furthermore, the Transformer approach is used to solve natural language problems for CPSs, where a user sends a query, and the Transformer has to match the user query with devices or services, focusing on the search information in the data model.

Background
The aim of the data model is to define, in an abstract way, the information that describes a device, facilitating communication between devices from different providers and the discovery of devices. For the definition of the data model, the Thing Description (TD) of the Web of Things (WoT) is used as a basis for the proposed data model. In addition, the FAIR Guiding Principles are followed to ensure the data model is Findable, Accessible, Interoperable and Reusable. Finally, Artificial Intelligence techniques are used to create a Deep Learning model that matches natural language queries with Cyber-Physical Systems using the defined data model.

Thing description
The TD document is a JSON-LD template that describes the basic information about the device. 31 The TD was created with the idea of facilitating communication between devices using WoT technology. The Web of Things is an initiative that was created in 2010 to adapt application layer technology to IoT. 32 With WoT, devices are described through the web in a common way, improving interoperability across heterogeneous devices. The TD is modeled, structured and based on the TD Information Model, 33 a model that defines the semantics of the vocabularies of the TD by representing them as classes and associations between classes. The TD Information Model is divided into four blocks:
1. TD core vocabulary: Contains the basic information about the device and describes how to interact with the thing by defining its properties, actions and events.
2. Data schema vocabulary: Defines the type of data used by the TD core vocabulary.
3. WoT security vocabulary: Contains information about the security mechanisms used by the device.
4. Hypermedia controls vocabulary: Defines the hypermedia controls used by the TD core vocabulary for interacting with the thing.
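As an illustration of how the four vocabulary blocks appear together in one document, the following minimal sketch builds a Thing Description as a Python dictionary and lists its interaction affordances. The lamp, its identifier, the URLs and the helper `affordance_names` are hypothetical and are not taken from the paper; only the top-level keys follow the TD vocabulary.

```python
import json

# Hypothetical Thing Description for an illustrative lamp. Every identifier,
# URL and value below is an assumption made for this sketch.
thing_description = {
    "@context": "https://www.w3.org/2019/wot/td/v1",
    # TD core vocabulary: basic information and interaction affordances
    "id": "urn:dev:ops:example-lamp-0001",
    "title": "ExampleLamp",
    # WoT security vocabulary: security mechanisms used by the device
    "securityDefinitions": {"basic_sc": {"scheme": "basic", "in": "header"}},
    "security": ["basic_sc"],
    "properties": {
        "status": {
            # Data schema vocabulary: type of the data exchanged
            "type": "string",
            # Hypermedia controls vocabulary: how to reach the affordance
            "forms": [{"href": "https://lamp.example.com/status"}],
        }
    },
    "actions": {
        "toggle": {"forms": [{"href": "https://lamp.example.com/toggle"}]}
    },
    "events": {
        "overheating": {
            "data": {"type": "string"},
            "forms": [{"href": "https://lamp.example.com/oh"}],
        }
    },
}


def affordance_names(td):
    """Return the interaction affordance names grouped by kind."""
    kinds = ("properties", "actions", "events")
    return {kind: sorted(td.get(kind, {})) for kind in kinds}


print(json.dumps(affordance_names(thing_description), indent=2))
```

A Discovery Service that stores documents of this shape can answer syntactic queries by inspecting these keys, which is the starting point that the proposed data model extends.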

FAIR principles
The FAIR principles (Findable, Accessible, Interoperable and Reusable) guide the definition of the data model to ensure the data's transparency, reproducibility and reusability. FAIR principles are guidelines used to make the data findable and reusable by machines and individuals. 12 To apply the FAIR principles in a data schema, the data model used by the data schema must comply with the four levels of the FAIR principles. For each level, a set of requirements is established through a questionnaire. 34,35 The four levels of the FAIR principles are as follows:
1. Findable: Findable is the first level of the FAIR principles. Before the available data can be used, it must be discoverable automatically. To make the data findable, it must use machine-readable metadata.
2. Accessible: Accessible is the second level of the FAIR principles. After the data is discovered, it must be accessible, through different protocols, for all the entities that want to access the data. Furthermore, the data must have authentication and authorization to ensure data security.
3. Interoperable: Interoperable is the third level of the FAIR principles. Once the data is accessible, it must use a common vocabulary understood by other systems.
4. Reusable: The last level of the FAIR principles is reusability. The data and metadata must use licenses and schemas that ensure data replication in different settings.

AI techniques
The selection of an Artificial Intelligence (AI) model is necessary to define the data model for the Deep Learning approach.
The AI model used is Transformer, a novel sequence transduction model based on multi-head self-attention. 13 The reason for using Transformer and not any other AI model is that Transformer is a novel approach that delivers better results in Deep Learning approaches, especially when solving natural language problems, as it considers all the possible relations between words of a sequence, paying attention to them in parallel. In translation tasks, it performs better than convolution and recurrent-based solutions. 36 As the input of the Transformer algorithm is an embedding, the training process requires an embedding layer before applying the Transformer algorithm. The embedding layer compacts the vectors that represent the user's sentences, transforming the high-dimensional vectors into a low-dimensional space. Therefore, the sentences have to be transformed into vectors before being used in the training process.
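For readers unfamiliar with the mechanism, multi-head self-attention is commonly formulated as follows. The symbols are the standard notation from the Transformer literature and are not defined in this paper: $Q$, $K$ and $V$ are the query, key and value matrices derived from the embedded sentence, $d_k$ is the key dimension, $h$ is the number of heads, and the $W$ matrices are learned projections.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\mathrm{head}_i = \mathrm{Attention}\!\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right),
\qquad i = 1, \dots, h

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}
```

Because the product $Q K^{\top}$ scores every pair of positions in the sequence at once, each word can attend to all other words in parallel, which is the property the paragraph above refers to.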
The technique used to transform categorical data into numerical data to make the data usable by machine learning algorithms, in this case, transforming words into vectors, is called encoding. The AI model used in this paper for analyzing the proposed data model uses hash encoding to transform the words from each sentence into a numerical value based on a defined size of the vocabulary. Finally, after the sentences are transformed into vectors and before applying the embedding layer, the vectors have to be normalized by using padding techniques, transforming all the vectors into vectors of the same size. Further information about the main features needed to apply this kind of algorithm is explained in a previous work, from which the AI model used for analyzing the proposed data model in this paper is extracted. 14
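The encoding and padding steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the vocabulary size, the padded length, the choice of MD5 as a stable hash, and the function names `hash_encode` and `pad` are all hypothetical.

```python
import hashlib

VOCAB_SIZE = 1000  # assumed size of the vocabulary
MAX_LEN = 8        # assumed length every sentence vector is padded to


def hash_encode(sentence, vocab_size=VOCAB_SIZE):
    """Map each word to an integer in [1, vocab_size - 1]; 0 is kept for padding."""
    ids = []
    for word in sentence.lower().split():
        # MD5 is used only because it is stable across runs (an assumption).
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        ids.append(int(digest, 16) % (vocab_size - 1) + 1)
    return ids


def pad(ids, max_len=MAX_LEN, value=0):
    """Truncate or right-pad a vector so that all vectors share one length."""
    return ids[:max_len] + [value] * max(0, max_len - len(ids))


queries = ["I want to see all the available cameras", "Turn on the lamp"]
vectors = [pad(hash_encode(query)) for query in queries]
for query, vector in zip(queries, vectors):
    print(query, "->", vector)
```

Hash encoding needs no stored dictionary, at the cost of possible collisions when two words map to the same index; a larger vocabulary size makes collisions less likely.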

DATA MODELS
Cyber-Physical Systems using Web of Things (WoT) technology have a data model for defining the data schema of devices. This data model, called the Thing Description Information Model, defines the basic information of the device, the available interactions and the security configuration. This data model is helpful for a Discovery Service as it allows it to store devices in a repository using a standard data schema. The information stored in the repository is returned to entities searching for a specific device set. However, regarding information related to the search process, the location and the availability of devices are missing. When searching for devices using the Thing Description as the data schema, a query is sent to the Discovery Service using the information in the Thing Description, and all the devices that match the query are returned. For instance, if we have a set of CPSs deployed in a Smart Home and want to discover the movement sensors, the discovery service will return the Thing Description of all the movement sensors deployed in the house. However, in a real scenario, it is common for a device to be broken or to run out of battery. Furthermore, the user may want a device deployed in a specific location, for instance, the camera in the child's room, and not all the available cameras. In addition, some devices with the same functionality may be better than others, for instance, two coffee machines where one is older and everyone uses the newest one. When we say that one device is better than another, we mean that for most external users, the first device is more useful or better suits their request. Other information that can be used to compare two devices includes security and performance. To address these problems, the current data model for Cyber-Physical Systems requires more information related to recommending and searching among a set of devices.
As information such as location, availability and information related to the search process is important for searching CPSs, we propose a data model based on the Thing Description of the WoT for storing information fetched by a Discovery Service. With the proposed data model, we ensure that past, current and future CPSs can communicate with each other and be discovered by external entities. Furthermore, the search process for these devices is improved by including information in the data model related to the quality of the devices.
Figure 1 represents the metamodel used for the proposed data model, which uses the Thing class and the Security class provided by the Thing Description Information Model. The Thing class is extended with the location of the device, its availability and the associated confidence level. Apart from that, the Security class is used together with the information related to device quality to calculate the confidence level associated with the device.
The proposed data model extends the Thing Description model by making it easy for a Discovery Service to find devices. We use the location for finding devices according to their position, and the Time-To-Live (TTL) to check the availability of the device, complying with one of the sections of the Accessibility level of the FAIR principles: subsection A2, discussed in Section 5.3 (i.e., "Which metadata longevity plan do you use?"). If a device is unreachable, the proposed data model returns the metadata associated with the device. The device is classified as reachable, unreachable or unknown in the metadata returned. In-depth details of this classification can be found in Section 4. Finally, we use a confidence level to return a list that best suits the user requirements, using quality information to compare similar devices. Using a confidence level, the Discovery Service can return a list of recommended devices, where similar devices that match the user query are sorted based on their security and quality.
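The following sketch illustrates the idea of a Thing extended with location, TTL status and confidence level, and a discovery query that sorts matching devices by confidence. It is a simplified illustration, not the metamodel itself: the class `DiscoveredThing`, the function `discover`, the keyword matching and all sample values are hypothetical, and the confidence level is assumed to be precomputed since its calculation is described later in the paper.

```python
from dataclasses import dataclass


@dataclass
class DiscoveredThing:
    id: str
    title: str
    location: str      # where the device is deployed
    ttl_status: str    # "reachable", "unreachable" or "unknown"
    confidence: float  # assumed precomputed from security and quality data


def discover(repository, keyword, location=None):
    """Return matching things sorted by confidence, highest first.

    Unreachable devices are kept in the result, since their metadata
    is still returned together with the TTL status.
    """
    matches = [
        thing for thing in repository
        if keyword.lower() in thing.title.lower()
        and (location is None or thing.location == location)
    ]
    return sorted(matches, key=lambda thing: thing.confidence, reverse=True)


repository = [
    DiscoveredThing("urn:example:cam-1", "Camera", "Child's room", "reachable", 0.9),
    DiscoveredThing("urn:example:cam-2", "Camera", "Kitchen", "unreachable", 0.4),
    DiscoveredThing("urn:example:cam-3", "Camera", "Child's room", "unknown", 0.6),
    DiscoveredThing("urn:example:lamp-1", "Lamp", "Kitchen", "reachable", 0.8),
]

all_cameras = discover(repository, "camera")
child_room = discover(repository, "camera", location="Child's room")
for thing in child_room:
    print(thing.id, thing.ttl_status, thing.confidence)
```

Restricting the query by location narrows the three cameras to the two deployed in the child's room, and the confidence level decides which of them is listed first.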
The proposed data model allows CPSs to be searched using queries sent by the user. The Discovery Service searches for the CPSs, looking at the information stored in the repository following the data model. However, the system cannot understand natural language sentences. To be able to use natural language, a data model based on our approach is presented to apply a Deep Learning technique, for example, a Transformer that takes natural language sentences as input to make queries.

FIGURE 1 Metamodel for describing both DSLs (Transformer and CPS).
In Figure 1, the metamodel used for the proposed data model to apply Deep Learning is also represented, where a subset of the CPS data model is used to train the Transformer. As some datasets do not represent natural language sentences, we propose a data model that ensures that the Deep Learning model has enough information to be trained to recommend CPSs, containing information to build natural language sentences. It is represented as another data model because Deep Learning services require a more reduced set of information than the one used for representing CPSs. For instance, the quality and security of the device are represented in a single attribute, the Confidence rating.
Regarding the rest of the classes and attributes from the data model for applying Deep Learning, the attribute title is used from the Location class to be able to recommend CPSs according to the place they are deployed; attribute ttlStatus from the TimeToLive class is used to differentiate between available and unavailable CPSs.For instance, a sentence for Deep Learning services using ttlStatus would be "I want to see all the available cameras".The id attribute from the Thing class is used to define the device that is involved in the operation, the type of operation is defined by the InteractionAffordance (property, action or event), and the operation is defined by the DataSchema.Finally, as the Transformer approach matches natural language sentences with CPSs, it cannot rank the result list using device quality and confidential information.The rating attribute from the Confidence class is therefore used by the Deep Learning model as a value to support the decision of the Deep Learning model.For instance, if the user wants a washing machine and the Deep Learning model returns two washing machines, one disconnected and another connected, the confidence rating will be able to recommend the connected washing machine over the disconnected one.In-depth details can be found in Section 4.
Both proposed models follow the FAIR Principles to ensure the Findability, Accessibility, Interoperability and Reusability of CPSs.In-depth details are described in Section 5.

DEFINING THE DATA MODELS
This section describes both data models in detail. Each of the proposed classes of the data model is defined, and the utility of using each of the classes is explained. Furthermore, in the analysis section, an example of each class is presented, explaining the usage of the proposed classes in real situations.

Location
IoT devices are dynamic devices, that is, the location and information of the device may change over time. Furthermore, two identical devices may be stored in a repository, with the only difference that one device is deployed in a building in London and the other is deployed on New York's streets. Thus, the location must distinguish devices with the same functionality from the same provider but deployed in different places.
The proposed Location class uses a title or name to identify the device's location. In addition, geolocation information can be used to locate the device's position accurately. The name of the device's location is called the title to follow the nomenclature of the Thing Description. Some devices do not have geolocation information, meaning geolocation is not required when using location information. For instance, suppose we have an ESP32 with a temperature and humidity sensor deployed in a building, and we need to find the temperature sensor in the meeting room on the second floor. Despite not having its geolocation, using the proposed class, the device can be found by its title attribute, where the title could be "Meeting room on the second floor". Therefore, this class supports devices with and without geolocation information to find devices deployed in different locations.
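The meeting-room example can be illustrated with a minimal Python sketch. The class layout, the field names (`latitude`, `longitude`, `height`), the device ids and the helper `find_by_title` are assumptions made for illustration; only the title attribute and the optional geodetic geolocation come from the description above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Geolocation:
    """Geodetic coordinates of a device (hypothetical field names)."""
    latitude: float
    longitude: float
    height: float = 0.0


@dataclass
class Location:
    """Location of a device: a mandatory title and an optional geolocation."""
    title: str
    geolocation: Optional[Geolocation] = None


def find_by_title(devices: Dict[str, Location], title: str) -> List[str]:
    """Return the ids of devices whose location title matches, ignoring case."""
    wanted = title.strip().lower()
    return [dev_id for dev_id, loc in devices.items()
            if loc.title.strip().lower() == wanted]


# Illustrative repository: one device without geolocation, one with it.
devices = {
    "esp32-temp": Location("Meeting room on the second floor"),
    "street-cam": Location("Main street", Geolocation(51.5074, -0.1278, 11.0)),
}
print(find_by_title(devices, "meeting room on the second floor"))
```

The device without coordinates is still found through its title, which is the behaviour the Location class is meant to support.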
For representing the device's geolocation, we decided to use Geodetic coordinates (latitude, longitude and height) over Cartesian coordinates as they are more commonly used for finding devices in the world. 37

TimeToLive
For IoT devices, it is common not to have a continuous connection. 38 As explained in the previous subsection, IoT devices are dynamic, meaning they can move between different locations. Hence the connection can be lost. Furthermore, some devices, called sleeping devices, deactivate features to reduce energy consumption and limit incoming connections. In our data model, we propose using a class to represent the connection status of the devices.
The proposed TimeToLive class uses an attribute, lastTimeChecked, to record the last time the device connection was checked. This attribute allows the Discovery Service to establish a timer to check the availability status of the device. Another attribute implemented in the TimeToLive class is lastTimeAvailable, which represents the last time the device was available. This attribute helps differentiate between an unused device and a device that temporarily loses its connection for maintenance, to reduce energy consumption, or for any other temporary reason. However, to measure the quality of devices to improve the search process, other attributes related to the lastTimeAvailable attribute are proposed in the SearchQuality class. TimeToLive is not included in the Quality class because the current availability status of a device is not related to quality. For instance, a smart bike or a smart scooter traveling in a city may temporarily lose connection. That temporary loss of connection can modify the response of the query sent to the Discovery Service, but the quality of the device will remain at the same value. However, a situation where a device loses its connection for an extended period affects the quality of the device through a set of attributes related to the search quality. For instance, losing the connection for a long period will reduce the number of times the device is accessed, lowering the device's position in the returned list when recommended to the user. In-depth details about the SearchQuality class are described in the following subsection.
Finally, the ttlStatus defines the current connection status of the device. The different types of connection statuses are:
1. Reachable: Defines a status where external entities can connect to the device and receive its information.
2. Unreachable: Represents a status where external entities can't start a connection with the device.
3. Unknown: Represents a status where external entities can connect to the device, but can't communicate with it. For instance, a device that accepts connections but doesn't return its information.
For the implementation of TimeToLive, the Discovery Service checks the status of every CPS by sending a request. The interval of the timer that triggers this request can be configured. Furthermore, a maximum number of tries can be configured to avoid indefinitely sending requests to a lost device.
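The status check with a bounded number of tries can be sketched as follows. This is a minimal illustration, not the Discovery Service implementation: the `probe` callable, its three return values and the default of three tries are assumptions, and the configurable timer that would call `check_device` periodically is omitted.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class TimeToLive:
    ttl_status: str = "unknown"
    last_time_checked: Optional[datetime] = None
    last_time_available: Optional[datetime] = None


def check_device(ttl: TimeToLive,
                 probe: Callable[[], Optional[bool]],
                 max_tries: int = 3) -> TimeToLive:
    """Probe a device up to max_tries times and update its TimeToLive.

    probe() is a hypothetical callable returning:
      True  -> connected and the device returned its information
      False -> connected, but no information came back
      None  -> the connection could not be started
    """
    result = None
    for _ in range(max_tries):
        result = probe()
        if result is not None:  # a connection was established, stop retrying
            break
    now = datetime.now(timezone.utc)
    ttl.last_time_checked = now
    if result is True:
        ttl.ttl_status = "reachable"
        ttl.last_time_available = now
    elif result is False:
        ttl.ttl_status = "unknown"
    else:
        ttl.ttl_status = "unreachable"
    return ttl


# Usage: a device that never accepts a connection ends up as unreachable.
print(check_device(TimeToLive(), lambda: None, max_tries=2).ttl_status)
```

Note that lastTimeAvailable is only updated when the device actually returns its information, so an unreachable device keeps the date of its last successful check.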

Confidence
When searching, we, as humans, use a set of information to rank the list of products that the system returns. The information used to rank the product list differs depending on what we are searching for. However, there is a set of information used for every search: the product's confidence. This confidence is used to differentiate two products with the same characteristics, even if we are not explicitly searching for it. For instance, suppose a search for a camera returns two cameras and two movement sensors, where one of the cameras allows five concurrent connections and the other one only allows one concurrent connection. One of the movement sensors is deployed without security systems, while the other has a username and password. When we check the information, we rank the cameras and movement sensors. The camera with more than one concurrent connection will be our first option. Between the movement sensors, the one with security systems will be the first choice and the third option among all the returned devices. Therefore, we rank every result returned using the search query, that is, the information we are looking for, and the confidence we have in each product.
In the Confidence class, we formalize the confidence we apply when searching for products, but in the specific case of searching for CPSs. With this formalization, machines can differentiate between devices with the same characteristics using the confidence that we, as humans, use when searching for products. This formalization is required for large-scale IoT environments because a machine is needed to manage large amounts of data, and we cannot review each of the available devices to rank them in our search. In the Confidence class, rating is used to provide a confidence value to the device. In addition, Confidence is composed of Quality and Security. Security refers to the Security class of the Thing Description, a class that defines the possible security configurations that a device may have.
The rating used by the Confidence class is calculated using Equation (1). Equation (1) gives the individual confidence of a device, $C_i$. In the equation, $q_i$ refers to the quality rating of the device $i$, and $s_n$ refers to the rating of the security configuration used by the device. Table 2 shows the rating of each of the security configurations, where no configuration has the lowest rating, 0, and PSK, OAuth2, Bearer, and API have the highest rating, 100.
The rating value is calculated with a maximum value of 100, with 25% of the weight for the quality of the device and 75% of the weight for the security of the device. For instance, a device with a quality rating of 100 and no security systems will have a confidence rating of 25.
Another equation used is shown in (2). This equation is used for calculating the global confidence of the Discovery Service, $C_g$. Global confidence may be used in the future with the security configuration of the Discovery Service to search for Discovery Services in a Federation of Discovery Services.
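The weighting just described can be written as a short Python sketch. The function name and the input validation are assumptions for illustration; the 25% and 75% weights and the 0 to 100 scale are the ones stated in the text.

```python
def confidence_rating(quality: float, security: float) -> float:
    """Confidence on a 0-100 scale: 25% quality weight, 75% security weight."""
    if not (0 <= quality <= 100 and 0 <= security <= 100):
        raise ValueError("ratings must be in [0, 100]")
    return 0.25 * quality + 0.75 * security


# The example from the text: quality 100 and no security configuration.
print(confidence_rating(100, 0))  # 25.0
```

Because security carries three quarters of the weight, a device with the best possible quality but no security configuration can never exceed a rating of 25.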
Like the Confidence class, the Quality class has a rating for giving a quality value to the device. Quality is composed of DeviceQuality and SearchQuality. The first one, DeviceQuality, defines attributes related to the quality of the device. Within it, concurrentConnections represents the connections that can be opened simultaneously by the device and responseTime defines how long it takes for the device to answer the requests. These attributes help to define the quality of the device and to differentiate between devices with the same functionalities. For instance, a weather station with a responseTime of 10 ms has higher quality than a weather station with a responseTime of 200 ms.
For the implementation of responseTime and concurrentConnections, the Discovery Service connects to the devices to complete these attributes. responseTime and concurrentConnections are obtained when the Discovery Service checks the device availability to complete the ttlStatus attribute. For the calculation of concurrentConnections, the Discovery Service opens concurrent connections to the device until a connection fails or the number of concurrent connections reaches 25. However, the calculation of concurrentConnections entails two problems: (a) an external entity may use the device while the Discovery Service is trying to test the capabilities of the device, and (b) the operation of opening concurrent connections may drain too much energy from the device. To solve the last issue, a method independent from the ttlStatus check can be implemented to check concurrentConnections less often. To solve the first issue, the highest number obtained when measuring concurrentConnections can be stored instead of replacing the value every time concurrentConnections is measured.
The SearchQuality class defines attributes related to the quality of searching for a specific device. The attributes included in this class are: (a) lastTimeAccessed for storing the last time the device information was accessed using the Discovery Service; (b) indivAccess to store the number of times the device information was returned individually by the Discovery Service, that is, no more devices were returned in the search process; (c) groupAccess for storing the number of times the device information was returned by the Discovery Service in a query that contained other devices; and (d) accessedTimes to store the number of times the device information was accessed using the Discovery Service.
To determine the quality of the device, we look at the device's capabilities and the usefulness or popularity of the device, that is, the number of times the device is requested. As the last step, this information is represented as a number that defines overall quality, combined with the security information of the device to define its confidence rating.
When external entities perform queries, the attributes of SearchQuality are obtained. The attribute lastTimeAccessed is a date that changes to the current date each time the device is returned as a result of a query.
The attribute accessedTimes is updated in the same way as lastTimeAccessed, with the difference that instead of saving a date, an integer is increased by one when the device is returned as a result of a query. The attribute indivAccess only increases when the device is the single result in a query, and the attribute groupAccess increases when the device is returned with other devices.
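The update rules for the four SearchQuality attributes can be sketched in Python as follows. The class layout, the snake_case names and the function `record_query_result` are assumptions for illustration; the update rules themselves follow the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class SearchQuality:
    last_time_accessed: Optional[datetime] = None
    accessed_times: int = 0
    indiv_access: int = 0
    group_access: int = 0


def record_query_result(stats: Dict[str, SearchQuality],
                        returned_ids: List[str],
                        now: Optional[datetime] = None) -> None:
    """Update the counters of every device returned by one query."""
    now = now or datetime.now(timezone.utc)
    single = len(returned_ids) == 1
    for dev_id in returned_ids:
        sq = stats.setdefault(dev_id, SearchQuality())
        sq.last_time_accessed = now   # date of the latest query result
        sq.accessed_times += 1        # every returned result counts
        if single:
            sq.indiv_access += 1      # device was the only result
        else:
            sq.group_access += 1      # device was returned with others


# Usage with hypothetical device ids.
stats: Dict[str, SearchQuality] = {}
record_query_result(stats, ["light-kitchen"])
record_query_result(stats, ["light-kitchen", "light-bathroom"])
print(stats["light-kitchen"])
```

With these rules, accessedTimes is always the sum of indivAccess and groupAccess for a given device.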
Finally, the rating attribute of the Quality class is calculated following Equation (3), where $sq_i$ is the search quality of the device $i$ and $dq_i$ is the device quality of the device $i$. The values of 0.6 and 0.4 are the weights of the search quality and the device quality, respectively. As quality is used for searching for devices, search quality has more weight in the search process than device quality.
To obtain the qualities that make up the overall quality of the device, Equations (4) and (14) are used. Equation (4) calculates a value for rating the SearchQuality class to be used in (3).
In (4), the weights are distributed, placing more importance on individual access and the last time the device was accessed. The number of times the device is accessed and the number of times it is returned in a list with other devices receive a lower weight because they are already partly represented by the attribute indivAccess.
The lastTimeAccessed attribute is represented as a range of values, where the highest value is for devices that were accessed in the last 24 h and the lowest value is for devices that were accessed more than 10 days ago. Table 3 shows the rating values for each range of days.
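As a minimal sketch of this weighting, the following Python function combines both ratings. It assumes that the weight 0.6 applies to the search quality and 0.4 to the device quality, in line with the statement that search quality carries more weight; the function name is hypothetical.

```python
def quality_rating(search_quality: float, device_quality: float) -> float:
    """Overall quality on a 0-100 scale (assumed: 0.6 search, 0.4 device)."""
    if not (0 <= search_quality <= 100 and 0 <= device_quality <= 100):
        raise ValueError("ratings must be in [0, 100]")
    return 0.6 * search_quality + 0.4 * device_quality


# A frequently requested device with modest hardware still rates well.
print(quality_rating(search_quality=100, device_quality=50))
```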
To represent the attribute accessedTimes, $\gamma$ (5) is used as the quartile of the total sum of the accessedTimes of every device, divided by the total number of devices ($K$) (7). In mathematical terms, as we are dividing the data into five ranges, the quartile that we define is not a mathematical quartile, so it will be represented by $\alpha$ (8), where $\alpha$ is the union of every possible range (9). In Equation (9), $n$ represents the calculated range, where $n$ has a maximum value of 4, and $m$ represents the maximum value that our range of values has. For instance, if we are calculating $\alpha_1$ for $\gamma$ and we have an accessMean value of 80 ($m = 80$) (6), $\alpha_1$ will return values between 20 and 40, [20, 40).

$$\gamma \in \alpha \qquad (5)$$
With Equation (7) we represent the mean number of times each device is accessed. Therefore, the highest range contains the devices whose number of accesses is higher than or equal to the mean number of times each device is accessed. For these ranges, we give more weight to devices that are accessed more times than expected (≥ 100%) and less weight to devices that are accessed fewer times than expected ([75%, 100%)).
The attribute indivAccess is represented using $\sigma$, where $\sigma$ is calculated as the $\alpha$ of 75% of the accessedTimes (10). Therefore, in this case, the variable $m$ is 75% of the accessedTimes instead of the mean value of the accessedTimes (11). The groupAccess attribute is calculated as the $\alpha$ of 25% of the accessedTimes (12). As with $\sigma$, the value of $m$ is modified to be 25% of the accessedTimes instead of the mean value of the accessedTimes (13). With these equations, we are defining the maximum value of indivAccess as 75% of the total number of times the device is accessed, and groupAccess as 25% of the same, thus giving more weight to individual accesses than to group accesses.
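The range construction can be sketched in Python under stated assumptions: each range is taken to span one quarter of m, so that range 1 for m = 80 is [20, 40) as in the example, and the last range (n = 4) is taken to be open-ended, matching the "≥ 100%" case. The function names and the handling of m ≤ 0 are illustrative choices, not part of the original equations.

```python
import math
from typing import List, Tuple


def access_mean(accessed_times: List[int]) -> float:
    """Mean number of accesses per device (sum divided by K devices)."""
    return sum(accessed_times) / len(accessed_times)


def alpha_range(n: int, m: float) -> Tuple[float, float]:
    """Bounds [low, high) of range n (0 to 4) for a maximum value m."""
    if not 0 <= n <= 4:
        raise ValueError("n must be between 0 and 4")
    low = n * m / 4
    high = math.inf if n == 4 else (n + 1) * m / 4
    return low, high


def alpha_index(value: float, m: float) -> int:
    """Index n of the range that contains value."""
    if m <= 0:
        return 0  # assumption: without access data, use the lowest range
    return max(0, min(4, int(value // (m / 4))))


# The example from the text: m = 80 and n = 1 gives [20, 40).
print(alpha_range(1, 80))
# For indivAccess, m would be 75% of accessedTimes instead of the mean.
print(alpha_index(value=9, m=0.75 * 12))
```

The rating assigned to each range is given in Table 4 and is deliberately not hard-coded here.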

$$\sigma \in \alpha \qquad (10)$$
Each subset of $\alpha$ used for calculating the different ratings has a value assigned, from the highest range, with a maximum value of 100 ($\alpha_4$), to the lowest range, with a minimum value of 0 ($\alpha_0$). These values are represented in Table 4. Equation (14) calculates the value for rating DeviceQuality to be used in (3). As the response time of a device is more important than allowing concurrent connections, the attribute responseTime is given more weight than concurrentConnections.
The attribute responseTime is represented as $rt^{+}_i$, where $rt$ is the response time of the device $i$. Equation (15) represents the calculation of $rt^{+}_i$. In Equation (15) a rating is calculated by measuring the response time of the device in milliseconds, considering 5000 ms as the highest time we allow to give the device a rating above 0. For instance, a device with a response time of 78 ms will return a rating of 98, while a device with a response time of 7000 ms will return a rating of 0.
Finally, the attribute concurrentConnections is defined as $con_n$, where $n$ represents the range of concurrent connections the device supports. Table 5 describes each defined range, where depending on the number of concurrent connections the device supports, $con_n$ gets a certain rating value.
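A response-time rating consistent with both examples can be sketched in Python. The linear form and the rounding to the nearest integer are assumptions: they reproduce the ratings of 98 for 78 ms and 0 for 7000 ms, but the exact expression of Equation (15) is not reproduced here.

```python
def response_time_rating(rt_ms: float, max_ms: float = 5000.0) -> int:
    """Rating from 0 to 100 that decreases linearly with the response time.

    Assumed form: 100 * (1 - rt / max_ms), clamped at 0 and rounded.
    """
    if rt_ms < 0:
        raise ValueError("response time cannot be negative")
    return max(0, round(100 * (1 - rt_ms / max_ms)))


print(response_time_rating(78))    # 98
print(response_time_rating(7000))  # 0
```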

Transformer
As explained in the previous section, the data model proposed for the Transformer approach aims to support natural language queries in the search process. The proposed data model is used to improve a Deep Learning model from a previous work, where a Transformer was used to recommend CPSs using natural language queries. 14 For this to happen, the classes Location, TimeToLive, DataSchema and InteractionAffordance are used to build natural language sentences for datasets that don't contain sentences in the form of natural language. Furthermore, Confidence is used to support the Transformer.
When creating a Deep Learning model, it is essential to preprocess the information correctly for producing the training dataset. Using a data model to preprocess the information helps automate and simplify the preprocessing step. Furthermore, if the data model is prepared for training Deep Learning models to solve a specific problem, it can also help create Deep Learning models with good results. In this proposal, the data model is prepared for training Deep Learning models for recommending Web of Things devices or services, matching them with user queries in natural language sentences.
The Deep Learning model used for analyzing the proposal of this paper was trained using a set of observations from a Smart Home scenario. 14 As not all the devices from the dataset have a natural language sentence associated, the attributes from the proposed data model are merged to build a sentence that can help in the training process (16). The sentence represents the place where the device is deployed (title from Location), whether the device has to be reachable (ttlStatus from TimeToLive), the operation that we want to perform (DataSchema) and the kind of operation that we want (InteractionAffordance). The class TimeToLive is only included in the sentence when the CPS is reachable, as this class is only useful for finding available devices. When using TimeToLive as a filter, the Confidence class, which includes attributes that represent the class TimeToLive, is used to support the decision of the Deep Learning model. An example of a sentence is defined in (17).
I need available action light on in kitchen.
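The sentence construction can be sketched in Python. The template is inferred from the example sentence above (prefix "I need", the word "available" only for reachable devices, then the kind of operation, the operation and the location title); the function name and parameter names are hypothetical.

```python
def build_sentence(location_title: str, ttl_status: str,
                   affordance: str, operation: str) -> str:
    """Merge data model attributes into a training sentence."""
    parts = ["I need"]
    if ttl_status == "reachable":      # TimeToLive only appears if reachable
        parts.append("available")
    parts += [affordance, operation, "in", location_title]
    return " ".join(parts) + "."


print(build_sentence("kitchen", "reachable", "action", "light on"))
# I need available action light on in kitchen.
```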
After the sentences are generated, they are transformed into vectors using hash encoding and normalized using padding techniques so that all the vectors have the same size. The created vectors are used for training the Deep Learning model by applying an embedding layer and the Transformer algorithm, using Softmax as the activation function to represent each possible CPS that can match the user's sentence.
Finally, the Confidence class is used as a support system for the Deep Learning model, which is able to recommend a set of devices from a natural language sentence. However, it is not able to distinguish devices using quality and security configurations. The Confidence class improves the Deep Learning model result by providing knowledge about the quality and security of each device. Consequently, with the use of both data models and Deep Learning through a Transformer approach, the system is able to match natural language sentences with CPSs and recommend them using the confidence information of each CPS.
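The encoding and padding steps can be illustrated with a small, dependency-free Python sketch. It is not the preprocessing code used for the model: the use of MD5 as the hash function, the bucket count of 1000 and the reservation of index 0 for padding are assumptions chosen to keep the example deterministic.

```python
import hashlib
from typing import List, Optional


def hash_encode(sentence: str, num_buckets: int = 1000) -> List[int]:
    """Map each token to an integer in [1, num_buckets - 1] by hashing."""
    tokens = sentence.lower().rstrip(".").split()
    ids = []
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).hexdigest()
        ids.append(int(digest, 16) % (num_buckets - 1) + 1)  # 0 is padding
    return ids


def pad(seqs: List[List[int]], length: Optional[int] = None) -> List[List[int]]:
    """Right-pad (or truncate) every sequence to the same length with 0."""
    length = length or max(len(s) for s in seqs)
    return [s[:length] + [0] * (length - len(s)) for s in seqs]


vectors = pad([
    hash_encode("I need available action light on in kitchen."),
    hash_encode("I need action washing in kitchen."),
])
print([len(v) for v in vectors])
```

The padded vectors would then be fed to the embedding layer; different tokens may collide in the same bucket, which is an inherent trade-off of hash encoding.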

ANALYSIS OF THE RESULTS
After defining both data models and describing each of the elements that make up the proposed data models, both data models are analyzed using two analysis scenarios. The first one focuses on describing how a Discovery Service adapts the existing devices using the Thing Description to our proposal and how that information is stored in a repository. The second one then describes how a Transformer approach for matching user queries with Web of Things devices (or services) uses the proposed data model to return a list of recommended devices.

Data model for discovery services
Figure 2 shows the analysis scenario for the data model proposed to store CPSs information fetched by a Discovery Service.
In the analysis scenario, the Discovery Service is deployed in a Smart Home. The CPS devices in the Smart Home are (a) a washing machine in the kitchen, (b) one light in the parents' bedroom, in the children's bedroom, in the kitchen and in the bathroom; (c) movement sensors in the same rooms as the lights deployed, and (d) a lock on the entrance door. For the analysis scenario, the CPSs deployed require a Thing Description. Using the Thing Description, the Discovery Service will extend it with the proposed data model. For devices that don't have a Thing Description, two solutions are used: (a) using middleware to adapt the device to the Web of Things, thus generating a TD; 39 or (b) automatically creating the TD of devices that use the MQTT communication protocol. 40 After each device has a Thing Description attached, the TDs are stored in the repository. Listing 1 shows an example of a Thing Description for one of the deployed lights. Finally, the stored TD is extended by the Discovery Service using the proposed data model.
To analyze the data model, each deployed CPS device has a different configuration, hence having a different confidence level. Table 6 shows the configuration of each deployed CPS device in the Smart Home and the confidence and quality ratings given by the Discovery Service following Equations (1) and (3).
After each TD has been extended by the Discovery Service, the Discovery Service is able to use the proposed data model for performing the user queries. For instance, a query searching for lights would return in the first position, despite being unreachable, the light deployed in the parents' bedroom due to having a better confidence level. In order to return only CPSs that are reachable, users can perform queries with conditions that exclude unreachable devices.
In the next subsection, the analysis scenario for the Transformer approach is presented, in which a recommender system uses the proposed data model to recommend a list of CPS devices. In this list, values such as the CPSs' connection status are used. Accordingly, the list of deployed lights may have a different order than that returned by the Discovery Service.
FIGURE 2 Analysis scenario in a smart home. CPSs with 10 devices deployed and a discovery service to register the deployed CPSs. Kitchen (light, movement sensor and washing machine). Bathroom (light and movement sensor). Hall (movement sensor). Parents' bedroom (light and movement sensor). Children's bedroom (light and movement sensor). Office (discovery service).
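The ordering behaviour described here can be sketched in Python. The dictionary keys, the device ids and the confidence values are hypothetical and are not taken from Table 6; the sketch only shows sorting by confidence and the optional condition that excludes devices that are not reachable.

```python
from typing import Dict, List


def rank_devices(devices: List[Dict], only_reachable: bool = False) -> List[Dict]:
    """Sort matching devices by confidence, optionally keeping reachable ones."""
    if only_reachable:
        devices = [d for d in devices if d["ttlStatus"] == "reachable"]
    return sorted(devices, key=lambda d: d["confidence"], reverse=True)


# Illustrative values only, not the ratings reported in the paper.
lights = [
    {"id": "light-kitchen", "ttlStatus": "reachable", "confidence": 70},
    {"id": "light-parents-bedroom", "ttlStatus": "unreachable", "confidence": 85},
    {"id": "light-bathroom", "ttlStatus": "reachable", "confidence": 40},
]
print([d["id"] for d in rank_devices(lights)])
print([d["id"] for d in rank_devices(lights, only_reachable=True)])
```

Without the condition, the unreachable light with the highest confidence comes first; with it, only the reachable lights are returned, still ordered by confidence.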

Data model for transformer
For the Transformer approach, the same scenario as the previous subsection is used. The Discovery Service searches for CPS devices through user queries, and together with Deep Learning, it is able to recommend devices through a Transformer approach.
Figure 3 shows the interaction between the Discovery Service and the service that creates and uses the proposed Deep Learning model using the Transformer algorithm. The creation of the Deep Learning model needs a dataset with the data history of the CPSs. The dataset used must follow the proposed data model. Our analysis scenario has a repository where the Discovery Service stores all the accesses to the CPSs using the proposed data model. This dataset is used by the Deep Learning service to create the model using the Transformer algorithm. After the Deep Learning service gets the dataset, it is analyzed using an external service. In our analysis scenario, the external service is deployed in the Discovery Service. The analysis of the dataset checks that the dataset sent to the Deep Learning service uses the proposed data model. In our example, the Discovery Service sends the dataset, ensuring it uses the data model. However, the dataset may be sent by another entity. Table 7 shows an example of a dataset sent to the Deep Learning service, where the classes that comprise the data model for the Transformer approach are represented. The Device column defines the id of the device, and the Service column represents the service executed in the operation. Furthermore, the Sentence column is empty for some devices, thus forcing the Deep Learning service to fill those empty cells. For instance, the second row of Table 7, following Equation (16), will be completed as "I need available action washing in kitchen". With this solution, we ensure that datasets not containing natural language sentences can be used by the Deep Learning model.
After the dataset is analyzed, the Deep Learning model using the Transformer algorithm is created. To be able to use the created Deep Learning model, an external entity has to send a query to the Deep Learning service, which, using the created model, returns a list of recommended CPSs where the first option in the returned list is the device with the highest confidence level, that is, the device that best suits the user's request. With both analysis scenarios, the novelty of the solution has been demonstrated. CPSs from different providers are supported by using the proposed data models. Furthermore, the CPSs search process is improved and even supports natural language queries by a recommender system using a Transformer.

FAIR principles
For the definition of the data model, the FAIR principles are followed to make the data Findable, Accessible, Interoperable and Reusable. This section justifies how the proposal complies with the levels of the FAIR principles. 34,35 Although the proposal tries to follow the four levels of the FAIR principles, we cannot ensure complete compliance. However, validating our data model using the FAIR principles helps us find the limits of our proposal and propose future work that would help us improve the data model, thus making the proposal more findable, accessible, interoperable and reusable. In our data model, metadata and dataset are represented similarly. Therefore, questions regarding metadata and dataset are answered as the same question, for example, the question F1.
In the following, we comply with most of the questions from the four levels, even with questions not related to the data model. For the questions we don't comply with, all related to the Discovery Service, we intend to improve the Discovery Service to comply with all the questions from the four levels of the FAIR principles.

F1. What globally unique, persistent, resolvable identifiers do you use for metadata records or datasets?
The attribute id from the class Thing of the Thing Description is used to identify the metadata. The identifier is represented in the form of a URI following the RFC 3986 standard. 41

F2. Which metadata schemas do you use for findability?
The data model proposed aims to improve the findability of CPSs by including information regarding the location of the deployed device, its connection status and information about the quality of the device.

F3. What is the technology that links the persistent identifiers of your data to the metadata description?
To store the metadata, JSON-LD is used, whereas namespaces are used for integrating or linking the identifiers with the description.

F4. In which search engines are your metadata records or datasets indexed?
Although this question is out of the scope of our proposal, as we propose a data model and not the Discovery Service, we are working on a solution that complies with this question. Currently, the metadata is not indexed in a search engine but in a Discovery Service. However, in future work, we intend to improve the Discovery Service to support a Federation of Discovery Services, thus linking our Discovery Service with other Discovery Services or vice-versa.

A1.1. Which standardized communication protocol do you use for metadata records or datasets?
To discover the CPSs stored in the repository, a RESTful API is used with HTTP communication protocol.

A1.2. Which authentication & authorization techniques do you use for metadata records or datasets?
Although this question is out of the scope of this work, as we propose a data model and not the Discovery Service, we are working on a solution that complies with this question. Currently, the Discovery Service used does not support authentication or authorization. However, in future work, we intend to improve the Discovery Service to support authentication and authorization to limit access to some of the stored information.

A2. Which metadata longevity plan do you use?
For the metadata records, the dates of creation and last update are stored using the Thing class of the Thing Description. Furthermore, with our proposal, the status of the CPSs is stored, allowing access to the metadata of the CPSs even without a connection. Deletion of the metadata is not currently carried out. However, using the quality information described in the metamodel, data that is not used is recommended less often in the search process, thus reducing the relevance of disconnected CPSs.

I1. Which knowledge representation languages (allowing machine interoperation) do you use for metadata records or datasets?
To represent the information stored in the repository, JSON-LD is used.

I2. Which structured vocabularies do you use to annotate your metadata records or datasets?
The vocabularies of the CPSs ontologies are used in the metadata. Furthermore, in future work, we intend to propose a common ontology that supports all the available CPSs ontologies.

I3. Which models and schema(s) do you use for your metadata records or datasets?
We use the proposed model in this paper and the Thing Description schema for the metadata records and datasets.

R1.1. Which usage license do you use for your metadata records or datasets?
As the stored data is open data, that is, anyone can use it and modify a copy of it, we use the MIT license for the proposed data model.

R1.2. Which metadata schemas do you use for describing the provenance of your metadata records or datasets?
For the description of the provenance or maintainer of the metadata, the attribute support of the TD is used.

Threats to validity
To validate our proposal, we address the four main validity threats discussed in the literature: 42 Conclusion, Internal, Construct, and External validity. This helps to verify that the objectives are fulfilled and to identify the study's limitations.

Conclusion validity.
Did the introduced treatment/change have a statistically significant effect on the outcome we measure?
Yes, the results obtained using the proposed DSL differ from those obtained by not using the DSL. The order of the returned list of CPSs is modified to give more importance to devices with higher quality and security, represented by the confidence level (1). However, there is a lack of comparison between the current system without the DSLs and the proposed system. This comparison may be made using metrics from the literature to evaluate the difference between both approaches.

Internal validity.
Did the introduced treatment/change cause an effect on the outcome? Can other factors also have had an effect?
The outcome is altered as we use the proposed models to support recommendation systems using Deep Learning. In addition to the result of the Deep Learning model, the confidence level is used to sort the returned CPSs. However, using
In addition, a data model based on the previous one is proposed for matching Cyber-Physical Systems and user queries in natural language through a Transformer approach that serves as a matching mechanism (or Recommender System). This data model supports the primary data model by searching for CPS devices using the Transformer model outcomes. Furthermore, the result of the model is adjusted to the device confidence level, improving the usefulness and effectiveness of the recommendations.
To analyze the first data model, a Smart Home scenario was established. A Discovery Service fetched the deployed CPSs and completed the information following the proposed data model. Once the information was completed, the confidence ratings of the CPSs were compared to analyze the utility of the data model when representing and searching for CPSs.
After the first data model was analyzed, the same scenario was used to analyze the second data model. For this analysis, the Deep Learning service used the dataset from the Smart Home scenario to create a Deep Learning model. The dataset was completed using the proposed data model for the Deep Learning approach. Finally, the Deep Learning model was created, and the resulting list of recommended CPSs was adjusted using the confidence rating of each CPS.
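The adjustment step can be sketched as follows. This is an illustrative sketch only: the class and function names, the device identifiers, and the numeric values are hypothetical, and breaking ties between equal confidence ratings by the model score is an assumption rather than the procedure used in the analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    device_id: str      # hypothetical device identifier
    model_score: float  # hypothetical matching score from the Deep Learning model
    confidence: float   # confidence rating of the CPS from the data model


def rank_by_confidence(recommendations):
    """Sort recommended CPSs by confidence rating, highest first.

    Assumption: ties in confidence are broken by the model score.
    """
    return sorted(
        recommendations,
        key=lambda rec: (rec.confidence, rec.model_score),
        reverse=True,
    )


# Made-up candidates for a movement-sensor query.
candidates = [
    Recommendation("movement-sensor-hall", model_score=0.91, confidence=3.0),
    Recommendation("movement-sensor-garage", model_score=0.88, confidence=4.5),
    Recommendation("movement-sensor-attic", model_score=0.93, confidence=3.0),
]

ranked = rank_by_confidence(candidates)
for rec in ranked:
    print(rec.device_id, rec.confidence, rec.model_score)
```

With these made-up values, the garage sensor is listed first because it has the highest confidence rating, even though its model score is the lowest of the three.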
Thus, the posed research questions can be answered positively. Regarding the first one, a data model is defined, built, and analyzed, ensuring the interoperability and quality of the WoT Discovery Service. Regarding the second, a data model to support Deep Learning-trained models has also been defined, allowing the deployment of a Deep Learning model that processes natural language to facilitate the matching between queries and devices and services.
Regarding the FAIR principles, in the analysis section, the four levels were validated by answering the questionnaire built for the FAIR implementation. All the questions related to the data model have been answered affirmatively. Furthermore, most questions not related to the data model have been answered affirmatively. For the questions that were not answered affirmatively, we intend to comply with them in future implementations by improving the Discovery Service used in the analysis scenario.
In future work, the Deep Learning model using the Transformer approach could be improved by proposing a fully functional and fully described Deep Learning model. In addition, as the proposed data model for applying Deep Learning in natural language problems has been shown to support the search process, the data model using the Transformer algorithm could be compared with the same data model used with other algorithms. For the validation, both data models could be deployed in a large-scale IoT scenario to validate the proposal in scenarios with a large number of CPSs. Furthermore, we intend to improve the proposed data model with an automatic generation of Thing Descriptions using the defined data model. The generated Thing Descriptions would be able to simulate real devices to validate real scenarios before deploying the CPSs. Finally, we are researching a federation approach of discovery services, where the queries can be delegated to discovery services located in different places. However, we need a way of comparing the discovery services. For this reason, we intend to use the proposed data model to represent the CPSs that populate the discovery services, adapting the CPSs that use the Thing Description to the new model. In addition, we intend to extend the data model to represent the quality or confidence of each discovery service, thus being able to compare the discovery services when delegating the queries.

FIGURE 3 Data model analysis for the deep learning approach.

TABLE 7 Analysis scenario example dataset (DV: Device; LC: Location; TS: ttlStatus; CF: Confidence; IA: Interaction Affordance; SV: Service; ST: Sentence).

Ratings for each security configuration.

TABLE 2

Table 4, using the proposed , devices are ranked into five categories, where devices in 4 are devices with the number of times accessed

Ratings for the last day the device was accessed.

Ratings for each subset of .

TABLE 3 Ratings for concurrent connections.

model, returns a list of recommended devices sorted by the confidence level of the model. For instance, in the analysis scenario, the sentence Did the movement sensor read any movement?, returns all the available movement sensors, created