Predictive control based on occupant behavior prediction for domestic hot water system using data mining algorithm

Domestic hot water systems are a primary source of energy consumption in buildings, and present a promising future in terms of energy saving. Predictive control methods have been used to reduce energy consumption during an operation period. However, current methods lack consideration of occupant behavior, which significantly influences the prediction results. In this study, a data‐based predictive method is proposed to predict the shower behavior of occupants. A dataset was collected from seven occupants and was trained using support vector machine (SVM) to learn their showering habits. These results were used to predict the hot water demand. A comparative analysis shows that, in this way, the prediction results are more reliable under predictive control than predicting the hot water demand directly using SVM. A simulation of a domestic hot water system was then conducted. The results indicate that the proposed approach can reduce the heat loss from a hot water storage tank (ST) by up to 33% compared to a traditional control method, while maintaining a high assurance of the hot water supply (an insufficient hot water supply of less than 1%) with little change to the original system.


| INTRODUCTION
Hot water systems are the primary source of building energy consumption in both developed and developing countries. 1 In China, hot water accounts for approximately 23.4% of total domestic energy consumption (excluding heating energy consumption) in residential areas. 2 In Brazil, electric hot water for showers accounts for 24% of a typical household's energy consumption. 3 In other countries, such as the US and various nations in the European Union, 4 domestic hot water consumption accounts for approximately 14%-25% of the total energy consumption in residential buildings. 5 Because the energy consumption of a hot water system is relatively large, special attention should be paid to reducing building energy consumption.
With the aim of saving energy in hot water systems, many previous studies have been conducted. In general, there are two ways to reduce energy consumption: (a) improving the efficiency of the primary equipment, 6 and (b) optimizing the control method to balance the supply and demand more effectively, 7 thereby reducing the energy consumption during an operation period. Many studies have been conducted in the former area using higher-efficiency boilers, 8 heat pumps, 9 and solar heaters. 10 In recent studies, combined heating equipment such as the solar assisted heat pump (SAHP) 11 has been proposed to improve the equipment efficiency further. 12 Although an increased efficiency of the primary equipment can decrease the amount of energy consumption in a hot water system, this decrease often reaches a bottleneck in practice once most of the equipment has been updated to achieve better efficiency. 13 Most people tend to neglect the energy saving potential during an operation period, despite the fact that a significant amount of energy can still be saved. A study from the Netherlands revealed that if the occupants do not conduct their tasks in a highly efficient way, that is, one that supports the intended design, even a high-performance system can consume much more energy than expected. 14 A properly designed control method is considered an effective means of energy saving during an operation period. Several control methods have been proposed for hot water systems, aiming at minimizing the energy consumption or energy cost. Among the reported studies, predictive control 15 has stood out as an effective method for a hot water system to achieve good energy and cost saving results. Michael et al 7 used a model predictive control method of a price-based demand response for heat pump utilization. The costs were reduced by approximately 5%, whereas the morning peak of district heating and evening peak of the electric grid were relatively reduced. Rasmus et al 16 used economic model predictive control (EMPC) and obtained electricity cost savings of up to 25%-30%.
A proper prediction of the energy load is an important task for predictive control, and it is a formidable one 17 because the future is unknown and uncontrollable. With the development of data science, 18 data-driven prediction methods for building energy consumption have become a research hotspot in recent years. 19 Compared with traditional predictive control methods, a data-driven method is more precise and achieves a better match in individual situations. 19 Catalina et al 20 developed a regression model to predict monthly heating demand for a single-family residential sector. Raza et al 21 reported the superiority of support vector machines (SVMs) over other methodologies such as artificial neural networks (ANNs). Comprehensive reviews introducing data mining methodologies and practical applications regarding energy consumption prediction in buildings can be found in the literature. 22 Occupant behavior has a significant impact on the results of future energy demand prediction and has become an important aspect in reducing thermal losses from energy inefficient buildings. 23 Kazmi et al 13 further indicated that obtaining a better understanding of occupant behavior is increasingly important as a necessary first step. However, for most data-driven predictive methods, a prediction is usually conducted on an aggregated energy consumption. Extracting personal behavior from the aggregated data, then analyzing it and putting it into practical use, gives a model certain advantages. Kazmi et al 13 proposed an occupant data-driven optimization model to operate an air-source heat pump at the highest possible efficiency, and the results indicate a 20% reduction in energy consumption using a computationally inexpensive heuristic approach. Aki et al 19 used support vector regression (SVR) to predict the hot water demand in a bathtub as well as many other end users and proposed a management system in a practical environment.
In the case of a domestic hot water (DHW) system, better control of the remaining hot water in a hot water storage tank can help reduce heat loss, which accounts for up to 20% of the total energy consumption in a DHW system, and therefore reduce the amount of energy consumption. This is practically useful because it applies not only to new devices but also to the millions of devices that have already been installed in households. 23 A better balance between supply and demand will help decrease the remaining hot water in the tank and therefore realize greater energy saving. 24 This study aims to reduce the energy consumption of a DHW system during an operation period using a predictive control method and to achieve a better balance between supply and demand. This includes two main parts: proposing a predictive control method and calculating the amount of energy saving. In the first part, a predictive control method is proposed based on a novel data-driven algorithm for predicting the DHW demand. This algorithm uses SVM to learn each occupant's showering behavior as its primary reference. Future hot water demand is calculated by summing each occupant's individual prediction results, which differs from other prediction methods that directly predict the aggregated energy consumption or demand. In the second part, an energy consumption model of a DHW system is built to estimate the energy-saving effect by adopting the proposed predictive control method.
Moreover, this research is grounded on real data. Detailed user behavior data are often difficult to collect, thus making a behavior analysis and real-data-based prediction difficult to conduct. This paper fills the gap regarding the use of occupant behavior characteristics to predict future energy demand. Therefore, results from the simulation are more convincing and more suitable for a practical application. The novel contributions of this paper lie not only in a data-driven algorithm for hot water demand prediction, but also in a complete predictive control method designed for a DHW system. This includes an algorithm for using the occupant behaviors to calculate the future hot water demand and a hot water supply strategy for the entire system. Such a control method can be easily applied in a similar hot water system with little equipment retrofit, little effect on user comfort, and considerable energy savings.
The main points are as follows: 1. A data-based predictive method for the shower behavior of occupants is proposed. 2. Hot water demand is predicted with high supply assurance. 3. The feasibility of the proposed predictive control method is proved. 4. The proposed control method saves more energy than the traditional one.
This paper is divided into three main sections: methodology, evaluation, and results and analysis. Among them, the methodology section describes the method for predicting the shower habits of each user, a way to calculate domestic hot water demand using the prediction results, and a related hot water supply strategy. In the evaluation section, a model of a domestic hot water system is described, and an evaluation of the proposed predictive control method is conducted through a simulation. The results and analysis section describes the prediction and simulation results and provides a comparative analysis with other control strategies.

| METHODOLOGY
The predictive control method for a hot water system is developed in the following steps: 1) Establish a predictive model to learn and predict user shower habits hourly. 2) Calculate future hot water demand using the prediction results of user shower habits. 3) Propose a domestic hot water supply strategy based on future hot water demand.
The first part is the core of the entire predictive control method. The problem can be defined as judging whether a user will take a shower in the next hour and can be modeled as a binary classification problem.

| Data description
Because water control units (WCUs), as shown in Figure 1, are widely used in many well-controlled DHW systems, particularly in dormitories and public bathrooms for individual charge, they can be used to collect hot water consumption data. An integrated circuit (IC) card is used to control the switch of the showerhead. WCUs will automatically record the IC number, start time (including the month, day, hour, minute, and second), end time, and other relevant information. The start time is defined as the shower time, whereas the end time is used to calculate the duration of a shower. Water consumption per shower is calculated based on the duration of the shower, at a water flow speed of 500 L/hour (8.33 L/minute). 25 Seven occupants living in Shanghai, China (31°01′N 121°25′E) volunteered to provide shower records for this research. General information of each occupant is listed in Table 1. A total of 1250 shower records were collected from February 1 to November 30, 2017. Each record includes 10 labels: the IC number, shower time (including the day and hour), hot water consumption, temperature, humidity, wind speed, solar radiation, weekday, and time interval between two showers. The IC number is used to identify the occupant. The amount of hot water consumption is used to generate the predicted water demand. The other eight labels are used as elements to predict the shower habits of occupants.
The raw dataset was cleaned before training. Low-quality data, in which the water consumption is less than 10 L or larger than 300 L, were removed. Because the tourist season is in August, all shower records from this month were also removed to keep the records consistent. The cleaned dataset was then divided into a training set, validation set, and test set, as shown in Table 2. The shower habits of the seven users were analyzed to find the main influencing factors. Fuentes et al reported that daily and seasonal variables (ie, shower time/hour and month) are the most significant factors. 25 In addition, the time interval between showers is also considered in this study. Further analyses of the shower habits of all users along with these parameters were conducted, the results of which are shown in Figure 2.
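The volume calculation and the cleaning rules described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout (dictionaries with `ic`, `start`, `end`) and the function names are assumptions, while the 500 L/hour flow rate, the 10 L and 300 L limits, and the removal of August come from the text.

```python
from datetime import datetime

FLOW_L_PER_HOUR = 500.0  # flow rate stated in the text (8.33 L/minute)


def shower_volume_l(start, end):
    """Water used in one shower, from its duration at the fixed flow rate."""
    hours = (end - start).total_seconds() / 3600.0
    return hours * FLOW_L_PER_HOUR


def clean_records(records, min_l=10.0, max_l=300.0, excluded_month=8):
    """Drop August records and records outside the [min_l, max_l] volume range."""
    kept = []
    for rec in records:
        if rec["start"].month == excluded_month:
            continue
        volume = shower_volume_l(rec["start"], rec["end"])
        if volume < min_l or volume > max_l:
            continue
        kept.append({**rec, "volume_l": volume})
    return kept


# Hypothetical records, for illustration only.
records = [
    {"ic": "A", "start": datetime(2017, 3, 1, 21, 0), "end": datetime(2017, 3, 1, 21, 12)},
    {"ic": "A", "start": datetime(2017, 3, 2, 21, 0), "end": datetime(2017, 3, 2, 21, 1)},
    {"ic": "B", "start": datetime(2017, 8, 5, 20, 0), "end": datetime(2017, 8, 5, 20, 10)},
]
cleaned = clean_records(records)
```

In this example only the first record survives: the second is a one-minute shower (about 8 L) and the third falls in August.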
From Figure 2A, it can be observed that the fraction of shower incidents is related to the time interval for most users. Fraction peaks occur at time intervals of between 20 and 30 hours, which means that these users take a shower every day (24 hours). Users 5 and 7 have patterns similar to each other but different from the others. Instead of a single peak, they have several small peaks at time intervals of 1, 2, 3, or 4 days (24, 48, 72, or 96 hours). This shows that these users take a shower every several days, but not at a fixed interval.
The relation between the fraction of incidents and the shower time (hour) was investigated, and the results are shown in Figure 2B. It was observed that most users tend to take a shower after 18:00, whereas some tend to take a shower both in the morning and in the evening. Similar shower habits have also been reported by Fuentes et al. 25 The relationship between hot water consumption and the seasonal variable was also investigated. Hot water consumption is strongly related to the month, mainly because of the seasonal changes in the environmental temperature. 25 Many studies have reported a trend in which hot water consumption increases as the environmental temperature decreases. 26 Although the value varies with region and climate conditions, a similar trend is shown. George et al 27 conducted a high-resolution measurement in 119 houses in Canada and found a +6.8% variation of hot water consumption in cold periods and a −0.96% variation in warm periods. Ahmed et al conducted onsite data measurements in 185 apartments in Finland and found a +15.3% variation in cold periods and −17.4% in warm periods. 28 The increase in hot water consumption in cold months is mainly due to the increase in hot water consumption per shower. 29 This point is also supported by the data collected in this study. The relationship between the fraction of shower incidents and month is shown in Figure 2C. It can be seen that the number of shower incidents does not change significantly by month for most users, whereas some users (users 4 and 5) have more shower incidents in warm months (May to October) than in cold months (November to April of the next year). The average hot water consumption of each month with respect to the average hot water consumption of the entire year is shown in Figure 2D.
It can be observed that hot water consumption per shower changes with the season, and in warm months, all users tend to use less water per shower, whereas in cold months the opposite trend occurs. This explains why hot water consumption is higher in winter than in summer. Detailed calculations were also conducted, the results of which are listed in Table 3, which will be used in calculating the water demand as a seasonal coefficient.
Based on the information above, the time interval and shower time are treated as the main factors in predicting a user's shower habits, whereas the month is treated as a minor factor with a small weight during this process. However, the month is treated as a significant factor when calculating domestic hot water demand, as described in Section 2.5.

| Mathematical methodology
The prediction of a user's future shower status can be simplified as a binary classification problem. Hourly data of a user is labeled as 1 when the user has a shower, and 0 when the user does not. The SVM algorithm is used in this work to learn the shower habits of users. SVM is a well-developed data mining technology derived from the statistical learning theory introduced by Vapnik. 30 The SVM model is based on a series of kernel functions, which implicitly map the original, low-dimensional data to a higher-dimensional feature space. 31 It has recently been used successfully in various areas, including complex problems of pattern recognition, classification, regression analysis, and forecasting, and shows better performance than other data mining technologies such as neural networks 32 and statistical data processing methods. 33 A brief introduction to the principle of SVM is provided below. Consider a given set of instance-label pairs $(x_i, y_i)$, $i = 1, \dots, l$, where $x_i \in \mathbb{R}^n$ and $y \in \{1, -1\}^l$. The training vectors $x_i$ are mapped into a higher-dimensional space by the function $\varphi(x)$. To maximize the margin between hyperplanes, the optimization problem of the SVM can be defined as follows:

$$\min_{w,\,b,\,\xi} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i \left( w^{T} \varphi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0,$$

where w and b are the weight and constant term, respectively, and can be used to describe the relationship between x and y. Here, ξ is called a slack variable, and C > 0 is the penalty parameter. For nonlinear classification problems, the vectors $x_i$ can be handled by a nonlinear classifier using the kernel trick:

$$K(x_i, x_j) = \varphi(x_i)^{T} \varphi(x_j),$$

where K is called a kernel function. The four popular kernel functions are linear, polynomial, radial basis function (RBF), and sigmoid functions. Among them, the RBF function shows a better performance in many nonlinear cases because of its computational effectiveness, convenience, reliability, ease of adaptation to optimize other adaptive techniques, and adaptability in handling complex parameters. 32 The definition of the RBF kernel function is as follows:

$$K(x_i, x_j) = \exp\left( -\gamma \lVert x_i - x_j \rVert^{2} \right),$$

where γ > 0 is called a kernel parameter.
The dataset may still not be linearly separable in the converted space, that is, a linear decision boundary may not be able to divide all samples. An SVM classifier will not converge under this circumstance. In this case, the kernel function, slack variable ξ, and penalty parameter C should be adjusted to change the soft margin of the decision boundary. 30 In this study, the prediction is conducted hourly. At hour t, if there is a shower record for user u, the shower status ψ u,t equals 1. Otherwise, ψ u,t equals 0. The prediction output is the predicted shower status ψ′ u,t for user u at time t, that is, 0 or 1.
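The hourly labeling of the shower status ψ u,t can be sketched as below. This is an illustrative sketch only; the function name and the representation of shower records as a list of start timestamps are assumptions made for the example.

```python
from datetime import datetime, timedelta


def hourly_shower_status(shower_starts, first_hour, n_hours):
    """Return one 0/1 label per hour: 1 if a shower started in that hour."""
    shower_hours = {
        ts.replace(minute=0, second=0, microsecond=0) for ts in shower_starts
    }
    return [
        1 if first_hour + timedelta(hours=k) in shower_hours else 0
        for k in range(n_hours)
    ]


# Hypothetical shower start times for one user over two days.
starts = [datetime(2017, 9, 1, 21, 17, 5), datetime(2017, 9, 2, 7, 40, 0)]
labels = hourly_shower_status(starts, datetime(2017, 9, 1, 0, 0), 48)
```

With roughly one or two showers per day, almost all of the 24 hourly labels in a day are zero, which is the class imbalance discussed in the evaluation criteria.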

| Evaluation criteria
The root mean square error (RMSE) is used to evaluate the prediction of the future hot water demand. For a binary classification problem, however, criteria that are widely adopted in building energy prediction, such as the coefficient of variation (CV), mean absolute percentage error (MAPE), and RMSE, 22 are not suitable. The accuracy of the binary classification is defined as follows:

$$a = \frac{N_{true}}{N_{true} + N_{false}},$$

where a is the accuracy, N true is the number of data in which the prediction matches the experimental result, and N false is the number of data in which the prediction does not match.
However, because users usually take a shower 1.7 times a day, 25 the shower status is zero during most hours, and the occupant behavior dataset is imbalanced. This may cause a problem because the classifier will overfit the zero data while underfitting the data with a value of 1. This undermines the utility of the prediction because the shower time is the focus of this work and is important for predicting the hot water demand. 34 Compared with accuracy, the confusion matrix (CM, shown in Table 4) is a more intuitive description of the training result for a binary classification problem. The confusion matrix consists of four types of classification results: true positive (TP), true negative (TN), false negative (FN), and false positive (FP). In this problem, FN indicates that the model failed to predict an occupant's shower incident, which may leave the occupant's hot water demand unmet. FP indicates that the model predicts that the occupant will take a shower when in fact the occupant does not, which causes more energy waste in the ST. Both types of false classification should be avoided as much as possible, but the effect of FN is more serious than that of FP.
The relation between accuracy and the CM is as follows:

$$a = \frac{TP + TN}{TP + TN + FP + FN}.$$

It is worth noting that occupant comfort is regarded as being of most importance, and the new control method should try not to disturb the user's shower habits. Under this consideration, recall (r) is used as the other important evaluation criterion along with accuracy. Recall is a commonly used criterion for imbalanced binary classification problems, showing the portion of TP records among all actual positive incidents. It is defined as follows:

$$r = \frac{TP}{TP + FN}.$$

The evaluation criterion during the training process is defined as the weighted average of the recall and accuracy, with weights of 0.7 and 0.3, respectively. Recall is also used in the predictive control method to check when the model deviates from the real data, as described in Section 3.2.
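The effect of weighting recall more heavily than accuracy can be seen in a short sketch. The function names and the toy label vectors are assumptions for illustration; the 0.7 and 0.3 weights are those stated in the text.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for 0/1 labels."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 0:
            tn += 1
        elif a == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def weighted_score(actual, predicted, w_recall=0.7, w_accuracy=0.3):
    """Return (accuracy, recall, weighted criterion)."""
    tp, tn, fp, fn = confusion_counts(actual, predicted)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall, w_recall * recall + w_accuracy * accuracy


# Toy data: 24 hours with 4 shower hours (hypothetical).
actual = [0] * 20 + [1] * 4
always_zero = [0] * 24                      # never predicts a shower
eager = [0] * 16 + [1] * 4 + [1, 1, 1, 0]   # catches 3 of 4 showers, 4 false alarms

acc0, rec0, score0 = weighted_score(actual, always_zero)
acc1, rec1, score1 = weighted_score(actual, eager)
```

The classifier that never predicts a shower has the higher accuracy (about 0.83 against 0.79) but zero recall, so its weighted criterion is far lower than that of the classifier that catches most showers.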

| Training process
To reduce the effects of the imbalanced dataset, random undersampling is used in building the training set of the predictive model, which means that only some of the records from the majority class contribute to building the model. Sampling ratios from 1:1 to 1:10 were tested, with the ratio treated as one of the SVM parameters during optimization. 34 The optimized result is 1:1.
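Random undersampling at a given ratio can be sketched as follows. This is a generic illustration rather than the procedure used in the study: the function name, the fixed seed, and the toy labels are assumptions, and `ratio` stands for the number of majority-class (no-shower) samples kept per minority-class (shower) sample.

```python
import random


def undersample(samples, labels, ratio=1, seed=0):
    """Keep all positive samples and at most ratio * n_pos random negatives."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), ratio * len(pos))
    kept = sorted(pos + rng.sample(neg, n_keep))
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Toy data: 10 days of hourly labels with one shower per day at 21:00.
hours = list(range(240))
labels = [1 if h % 24 == 21 else 0 for h in hours]

x_1to1, y_1to1 = undersample(hours, labels, ratio=1)
x_1to10, y_1to10 = undersample(hours, labels, ratio=10)
```

At 1:1 the training set shrinks from 240 hourly samples to 20, with equal numbers of shower and no-shower hours.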
Predictive models for each occupant are built individually. There are three parameters used in the SVM model: the RBF parameter γ, penalty parameter C, and termination criterion tolerance ε. Here, ε is set to 0.001 while the SVM model is tested using a cross validation to find the best combination of the other two parameters, γ and C. The optimal parameters are selected and listed in Table 5. The optimal training results of each occupant are shown in Table 6. The training results show that the combination of accuracy and recall is a better evaluation criterion than using them individually.
Further experiments were carried out to find the minimum dataset size for prediction. Prediction models were built from datasets of different sizes, as shown in Figure 3, and the result shows that the required dataset size for prediction is not the same among different people. For users 2, 3, 4, and 6, the algorithm needs only 10 days of data collection (approximately 7-10 shower incidents) to obtain a prediction model that meets the required criteria, whereas 90 days of data collection (approximately 30-50 shower incidents) are needed for users 1, 5, and 7. This result is useful when new users arrive. Here, the system needs to learn the new shower habit first, which requires time to collect the user's data and run a test. The duration of this process depends on the user, but under favorable circumstances 10 days is sufficient to generate the new user's prediction model. For some users, it may take up to 90 days or even longer. During this period, the system will continuously prepare adequate hot water for the new user.

| Calculation of domestic hot water demand
The future domestic hot water demand can be calculated from the average water consumption V u and standard deviation σ u together with the prediction model of the user's shower behavior using Equation (10), where ψ′ u,t+1 is the predicted shower status, V u is the average hot water consumption, σ u is the standard deviation, ς u is the seasonal coefficient, and ϱ u is the adjustment coefficient.
The values of V u and σ u for each user are listed in Table 7. A variation analysis is conducted for each user in both cold (November to February) and warm (May to October) seasons. The results are also shown in Table 7. Here, ϱ u reflects the variation of each user, and ranges from 0.5 to 1.2. This value is obtained by optimizing the result during the validation process to satisfy the hot water demand by assuring that the remaining hot water in the ST is always above zero to obtain 100% hot water supply assurance, as shown in Figure 4. The optimal values for ϱ u are listed in Table 7.
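To make the roles of these quantities concrete, the sketch below sums a per-user demand over the users predicted to shower in the next hour. The functional form used here, seasonal coefficient times (mean plus adjustment coefficient times standard deviation), is an assumption chosen only for illustration and should not be read as the paper's equation; the function name and all numeric values are hypothetical as well.

```python
def predicted_demand_l(user_params, predicted_status):
    """Sum an assumed per-user demand over users predicted to shower.

    ASSUMED form (not taken from the paper):
        demand_u = season_coeff * (mean_l + adjust * std_l)
    """
    total = 0.0
    for user_id, p in user_params.items():
        if predicted_status.get(user_id, 0) == 1:
            total += p["season_coeff"] * (p["mean_l"] + p["adjust"] * p["std_l"])
    return total


# Hypothetical parameters for two users (not values from Table 7).
user_params = {
    "u1": {"mean_l": 40.0, "std_l": 10.0, "season_coeff": 1.1, "adjust": 1.0},
    "u2": {"mean_l": 60.0, "std_l": 20.0, "season_coeff": 0.9, "adjust": 0.5},
}

demand_one = predicted_demand_l(user_params, {"u1": 1, "u2": 0})
demand_both = predicted_demand_l(user_params, {"u1": 1, "u2": 1})
```

Under this assumed form, a larger adjustment coefficient prepares more water per predicted shower, which is consistent with the text's description of tuning ϱ u until the remaining hot water in the ST never drops below zero.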

| Hot water supply strategy
The prediction model of hot water demand can be used to control a DHW system more effectively. The scheme of this process is shown in Figure 5. When ψ′ u,t turns from zero to 1, the DHW system produces hot water to prepare a line for user u in the hot water storage tank (ST, see Figure 5A at time 1). When user u uses hot water (see Figure 5A at time 3), the DHW system stops producing hot water for the user's share. The hot water remaining in the ST becomes part of the next-hour hot water preparation. When another user uses hot water, this margin will be offset (see Figure 5B at time 2) by the system. This process ends when user u consumes hot water or ψ u,t turns from 1 to zero, and the remaining hot water in the ST is released.

| EVALUATION
The proposed predictive control method was simulated based on a dormitory DHW system located in Shanghai Jiao Tong University. Using the parameters of the DHW system model for seven users, the control method was applied with a time step of 1 hour from September 1 to November 30, 2017.

| Description of DHW system
The hot water system designed to evaluate the proposed predictive control method is shown in Figure 6. The design of the system conforms to GB 50015-2003 (2009) and satisfies the maximum capacity requirement (105 L for each user per day) as presented by Fuentes et al. 25 The parameters of the DHW system are shown in Table 8. A water boiler and hot water storage tank (ST) are chosen as the two main pieces of equipment in the proposed DHW system. The incoming water is heated to the water supply temperature by the water boiler and stored in the ST. The ST consumes energy to keep the water at the water supply temperature. A detailed design of the ST is used for simulation accuracy, and the geometric dimensions of the ST are shown in Table 9. The values of the heat conductivity coefficient of each layer and the outside convection coefficient were adopted from a study by Kiyan et al. 35 With all original sensors, such as the liquid level meter and temperature sensors, connected to the controller, a water control unit (Figure 1) is installed at each showerhead in the shower room. Therefore, the shower data of the occupants can be collected by the controller and used to predict the water demand for the control method.

| DHW system
The predictive control method contains an initialization mode and a system running mode, as illustrated in Figure 7. In the initialization mode, the controller builds the prediction model from the occupants' historical behavior dataset using the methods presented in Sections 2.2 and 2.4. In the system running mode, the hot water system predicts the occupants' shower status in the next hour using the method presented in Section 2.5. The amounts of hot water produced by the water boiler and stored in the ST are controlled by the strategy proposed in Section 2.6. A fault tolerance mechanism is adopted in both the initialization and system running periods. When an occupant's behavior changes dramatically and the criteria of the built model cannot be guaranteed, this mechanism rejects the prediction. The acceptable criteria are shown in Table 10.

| Heat loss model
Heat loss from the ST is the main source of heat loss in the DHW system. To reveal the relationship between heat loss and water level in the ST, a more detailed heat loss model was built in this paper. The calculation of the heat loss in the ST was conducted under the following assumptions: 1. Hot water in the ST is kept at 60°C, uniformly and constantly. 2. The physical properties of air and water in the ST stay constant. 3. The other parts of the DHW system are well insulated.
Heat loss can be calculated from four parts, as shown in Figure 8. Φ t,bottom is the heat loss from the bottom. Φ t,lowerside is the heat loss from the lower side, where heat transfers from the hot water through the heat-preservation wall to the environment. Φ t,upperside is the heat loss from the upper side, where heat transfers from the hot air in the tank through the heat-preservation wall to the environment. Φ t,top is the heat loss from the top. The total heat loss from the tank is the sum of the four parts:

$$\Phi_{tank} = \Phi_{t,top} + \Phi_{t,upperside} + \Phi_{t,lowerside} + \Phi_{t,bottom} \tag{11}$$

The four types of heat loss are calculated individually. Φ t,bottom and Φ t,lowerside can be calculated using Fourier's Law. 36 The air in the upper part is heated by the hot water below, which can be simplified as finite-space natural convection heat transfer. The convection in the interlayer can be determined by the Grashof number (Gr). For a horizontal interlayer (bottom heated), natural convection appears when Gr > 2430 and becomes more intensive as Gr increases. The horizontal interlayer finite-space natural convection heat transfer can be calculated empirically, 36 and the convective heat transfer coefficient can be calculated from the Nusselt number (Nu). With the water temperature T w and the temperature at the inside top of the tank T t,top, the heat transfer from the water to the top of the tank can be obtained, and the heat loss from the top of the tank can be developed accordingly. By combining Equations (22)-(24), the temperature at the inside top of the tank T t,top can be solved, after which the heat loss from the top of the tank can be calculated. The wall temperature T t,upperside,δ of an infinitesimal unit of height δ at the inside of the tank is then calculated, from which the heat loss from the wall of this infinitesimal unit and, in turn, the heat loss from the upper side of the tank are obtained.
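As a rough illustration of the Fourier's Law step, the sketch below computes steady-state heat flow through a plane multilayer wall with convection on the outside, which is the textbook series-resistance form. It is a simplified sketch under stated assumptions: a flat wall (such as a tank bottom), an inner surface at the water temperature, and hypothetical layer thicknesses, conductivities, and convection coefficient (not the values of Table 9 or of Kiyan et al). A cylindrical side wall would need the logarithmic resistance form instead.

```python
def flat_wall_heat_loss_w(area_m2, t_inside_c, t_ambient_c, layers, h_outside):
    """Steady heat flow (W) through a plane multilayer wall.

    layers: iterable of (thickness_m, conductivity_W_per_mK) tuples.
    h_outside: outside convection coefficient in W/(m2.K).
    """
    # Series thermal resistance per unit area, in m2.K/W.
    resistance = sum(d / k for d, k in layers) + 1.0 / h_outside
    return area_m2 * (t_inside_c - t_ambient_c) / resistance


# Hypothetical wall: 2 mm steel plus 50 mm insulation, 1 m2, 60 C inside, 20 C outside.
thin = [(0.002, 50.0), (0.05, 0.04)]
thick = [(0.002, 50.0), (0.10, 0.04)]
loss_thin = flat_wall_heat_loss_w(1.0, 60.0, 20.0, thin, h_outside=10.0)
loss_thick = flat_wall_heat_loss_w(1.0, 60.0, 20.0, thick, h_outside=10.0)
```

Because the loss scales with the wetted area, a lower water level in the ST reduces the area over which the hot water conducts heat to the environment, which is the link between water level and heat loss that the model is built to capture.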

| Electric heating device
The electric boiler heats water to the supply temperature of 60°C, and the electric heater in the ST keeps the water in the ST at 60°C. The efficiencies of the direct-flow electric water boiler and the electric heater are both 95%. The temperature of the incoming water from the municipal water pipe is taken as the environmental temperature.
The heat required to heat the water in the water boiler from t to t 0 and the corresponding energy consumption of the water boiler are calculated from V i,t, the volume of incoming water at time t; ρ w, the density of water; and c w, the specific heat capacity of water. The parameters of air and water properties are shown in Table 11.
Energy consumption from t to t 0 in the ST is as follows: Total energy consumption of the hot water system is as follows:
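The boiler energy for one time step can be sketched from the standard sensible-heat relation, heat = ρ w · c w · V · ΔT, divided by the heater efficiency. The 60°C supply temperature and 95% efficiency come from the text; the water density and specific heat used below are common textbook values assumed for the example (Table 11 is not reproduced here), and the function name is hypothetical.

```python
def boiler_energy_kwh(volume_l, t_inlet_c, t_supply_c=60.0, efficiency=0.95,
                      density_kg_per_l=1.0, cp_kj_per_kg_k=4.18):
    """Electric energy (kWh) to heat volume_l of water from inlet to supply temperature."""
    # Sensible heat in kJ: mass * specific heat * temperature rise.
    heat_kj = volume_l * density_kg_per_l * cp_kj_per_kg_k * (t_supply_c - t_inlet_c)
    # Divide by efficiency to get electric input, then convert kJ to kWh.
    return heat_kj / efficiency / 3600.0


# Example: 100 L of incoming water at 20 C (hypothetical inlet temperature).
energy = boiler_energy_kwh(100.0, 20.0)
```

For 100 L heated from 20°C to 60°C this gives a little under 4.9 kWh of electricity.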

| Simulation process
Many software tools (Matlab, Simulink, C++, Trnsys, EnergyPlus, etc.) have been used to simulate DHW systems. 35 Among them, Matlab offers convenient coding and a variety of built-in functions. 37 LIBSVM 3.22 was used to build the predictive model, and the model of the DHW system was programmed in Matlab R2017b. The flowchart of the simulation process is shown in Figure 9. The simulation was conducted with a time step of one hour from September 1 to November 30 (2184 hours in total). When t reached 2185, the simulation ended, and all energy consumption values were summed to obtain the total energy consumption during the simulation.
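The simulation horizon and the accumulation of hourly energy can be checked with a short sketch. Python is used here only for illustration (the study itself used Matlab), and `step_energy_kwh` is a hypothetical placeholder for the per-hour energy model.

```python
from datetime import datetime


def simulation_hours(start, end_exclusive):
    """Number of one-hour time steps between two timestamps."""
    return int((end_exclusive - start).total_seconds() // 3600)


def simulate(step_energy_kwh, n_steps):
    """Run t = 1..n_steps and sum the energy used in each step."""
    total = 0.0
    for t in range(1, n_steps + 1):  # the loop stops once t would reach n_steps + 1
        total += step_energy_kwh(t)
    return total


# September 1 to November 30, 2017, inclusive: 30 + 31 + 30 = 91 days.
n_hours = simulation_hours(datetime(2017, 9, 1), datetime(2017, 12, 1))

# Placeholder model: a constant 0.5 kWh per hour (hypothetical).
total_kwh = simulate(lambda t: 0.5, n_hours)
```

The count of 91 days times 24 hours reproduces the 2184 steps stated in the text.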
The volume of income hot water can be calculated as: where the volume of remaining water in the ST can be calculated as:

| Prediction performance
The dynamic recall change of each user during the simulation is shown in Figure 10. Recall was calculated using the latest four weeks of data, with a minimum acceptable value of 70%. The fluctuation for some users, such as users 5 and 7, was severe. Another main reason for the users' sudden shower incidents not being predicted by the algorithm was the limited number of shower incidents. Compared with other users, users 5 and 7 had fewer shower incidents. Because the total number was small, a single incorrect prediction would change the recall by up to 10%-15%. For other users with more shower incidents, the effect was smaller. The hot water demand was mostly guaranteed because the prediction achieved a high recall. The overall recall and accuracy of each user are listed in Table 12. Recall and accuracy traded off against each other, that is, it was difficult to achieve the best recall and the best accuracy at the same time.
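The four-week recall check can be sketched as below. The window length and the 70% threshold come from the text; the function names, the handling of a window with no showers, and the toy data are assumptions for illustration.

```python
def recall_last_window(actual, predicted, window_hours=4 * 7 * 24):
    """Recall over the most recent window of hourly 0/1 labels, or None if no showers."""
    recent = list(zip(actual, predicted))[-window_hours:]
    tp = sum(1 for a, p in recent if a == 1 and p == 1)
    fn = sum(1 for a, p in recent if a == 1 and p == 0)
    if tp + fn == 0:
        return None  # assumption: no actual showers means nothing to judge
    return tp / (tp + fn)


def prediction_accepted(actual, predicted, min_recall=0.70):
    """Accept the model while its recent recall stays at or above the minimum."""
    r = recall_last_window(actual, predicted)
    return r is None or r >= min_recall


# Toy user who showers every fourth day at 21:00: 7 showers in four weeks.
actual = [1 if h % 96 == 21 else 0 for h in range(672)]
one_miss = list(actual)
one_miss[21] = 0
three_misses = list(actual)
for h in (21, 117, 213):
    three_misses[h] = 0

r_one = recall_last_window(actual, one_miss)
r_three = recall_last_window(actual, three_misses)
```

With only seven showers in the window, a single missed shower lowers recall by about 14 percentage points, which matches the sensitivity described for users with few shower incidents.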
Compared to the training results in the model validation process mentioned in Section 2, the weighted average change in the test process varied depending on the users. Users 1, 2, and 3 were relatively stable. The changes in users 4, 5, 6, and 7 were relatively large, but still met the requirement of the prediction and did not influence the DHW system during the simulation.

| Comparative prediction performance
The future hot water demand was calculated from the predicted shower status of each occupant at time t+1 using Equation (10) in Section 2.5. The predicted and actual results are shown in Figure 11.
The traditional method predicts the aggregated hot water demand directly. In this study, this was also achieved using SVM (37) for regression. All parameters in the SVM were cross-validated to find the best combination. The RMSE was used as the evaluation criterion for the aggregated prediction, and the main parameters for the SVM regression are listed in Table 13. The prediction results were restricted to be larger than zero. The aggregated prediction results and the actual results are shown in Figure 12.
In Figure 11, the results of the separate prediction are larger than the actual results in most cases, whereas in Figure 12, the results of the aggregated prediction remain at the same level as the actual results. The RMSE was used to evaluate the prediction precision. The RMSE was 77.63 for the separate prediction and 58.65 for the aggregated prediction.
However, whenever the prediction result was less than the actual result, the hot water supply would fail, lowering the hot water supply assurance (HWSA). The relationship between the RMSE and HWSA was similar to the relationship between the accuracy and recall described in Section 2.3: the best performance on both criteria could not be achieved at the same time. To analyze the prediction results of the two methods, an error distribution analysis was conducted, the results of which are shown in Figure 13.
During the test process, users actually consumed hot water in 403 of the 2184 time steps. With the separate prediction, only four hot water failures (0.99%) occurred, and the HWSA was 99.01%. With the aggregated prediction, however, there were 313 failures (77.67%), and the HWSA was 22.33%. In addition, there were 112 time steps (27.79%) in which the hot water shortage was larger than 100 L. This would result in user complaints because the shortage would be so large that more than one user would be affected.
The differences in the RMSE and error distribution were due to the different aims of the two prediction methods during the training process. The separate prediction used the weighted average of the accuracy and recall as the performance evaluation (see Section 2.3), in which the recall could improve the HWSA by predicting all possible shower incidents. In comparison, the aggregated prediction used only the RMSE (which had the same meaning as the accuracy) as the performance evaluation. The model was trained to distribute the error on both sides of zero so that the RMSE was minimized. However, this result was not very useful or applicable in a predictive control strategy because of the low HWSA. In addition, in the calculation of the domestic hot water demand, the introduction of the standard deviation, seasonal coefficient, and adjustment coefficient could improve the HWSA. These coefficients increased the predicted hot water demand to a certain degree, which resulted in a higher RMSE.
In conclusion, the separate prediction method proposed in this paper proved to be a feasible way to predict future hot water demand, with an RMSE performance similar to that of the aggregated prediction but a much better HWSA; it was thus more suitable for predictive control in a domestic hot water system. The comparison between these two prediction methods also showed that taking only the RMSE as the evaluation criterion may lead to "fake low-RMSE" results, which are not practically useful.
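To make the two criteria concrete, the sketch below computes the RMSE and the HWSA for a toy demand series. It assumes, as a reading of the text, that a failure is a time step with actual consumption in which the prediction is below the actual demand. The toy values are invented for illustration; only the counts 4, 313, 112, and 403 come from the text.

```python
import math

def rmse(actual, predicted):
    """Root mean square error over all time steps."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)

def hwsa(actual, predicted):
    """Share of demand time steps in which the prediction covers the demand."""
    demand_steps = [(a, p) for a, p in zip(actual, predicted) if a > 0]
    if not demand_steps:
        return None
    failures = sum(1 for a, p in demand_steps if p < a)
    return 1.0 - failures / len(demand_steps)

# Toy hourly demand in litres (invented values).
actual = [0, 100, 0, 120]
over_predicting = [20, 130, 20, 150]   # errs on the safe side
centred = [0, 90, 0, 110]              # small errors, but always short

rmse_over, hwsa_over = rmse(actual, over_predicting), hwsa(actual, over_predicting)
rmse_centred, hwsa_centred = rmse(actual, centred), hwsa(actual, centred)

# Percentages reported in the text, recomputed from the stated counts.
separate_hwsa_pct = round(100 * (1 - 4 / 403), 2)
aggregated_failure_pct = round(100 * 313 / 403, 2)
large_shortage_pct = round(100 * 112 / 403, 2)
```

In this toy case the centred predictor has the lower RMSE but fails in every demand step, while the over-predicting one has the higher RMSE and never fails, which mirrors the trade-off reported for the two methods.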

| Hot water supply assurance
The proposed hot water system was designed for a 24-hour hot water supply, and thus the HWSA was an important measure of occupant comfort. The amount of remaining hot water was used to evaluate the HWSA during the simulation process. The remaining hot water changed as the simulation proceeded, as shown in Figure 14. The amount of hot water in the ST was well controlled by the new control method, which kept the maximum amount of remaining hot water below 200 L. This would lead to less heat loss from the ST. As for the HWSA, there were only four time steps in which the remaining hot water was less than zero, meaning that the hot water supply was insufficient for the occupant demand. During the simulation, the total number of shower incidents was 441. The proportion of insufficient hot water supply incidents was less than 1%. All insufficient hot water supply incidents and the corresponding shortage values are listed in Table 14.
The influence of an insufficient hot water supply could be reduced by informing the user ahead of time and giving feedback to the control system. For example, if the remaining hot water in the ST was below 100 L and the predicted consumption per shower for a user was 120 L, the system would alert the user in advance and let the user choose whether to continue to shower or give the system some time to prepare more hot water, in order to reduce user dissatisfaction.
FIGURE 15 Remaining hot water in ST using traditional control method
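The alert rule in the example above can be written as a simple comparison. This is a minimal sketch under the assumption that the alert triggers whenever the remaining volume cannot cover the predicted shower; the function names are hypothetical and not part of the system described in the paper.

```python
def shortfall_l(remaining_l, predicted_shower_l):
    """Litres missing for the predicted shower (0 if the tank has enough)."""
    return max(0.0, predicted_shower_l - remaining_l)

def should_alert(remaining_l, predicted_shower_l):
    """Alert the user when the tank cannot cover the predicted shower."""
    return shortfall_l(remaining_l, predicted_shower_l) > 0

# Values in the spirit of the example: under 100 L left, 120 L predicted.
alert = should_alert(remaining_l=95.0, predicted_shower_l=120.0)
missing = shortfall_l(remaining_l=95.0, predicted_shower_l=120.0)
```

Reporting the shortfall alongside the alert would let the user judge whether a shorter shower is feasible or whether waiting for the boiler is preferable.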

| Comparative energy consumption
A traditional control method for a centralized hot water supply system with a storage tank fills the storage tank every day at night so that hot water is ready for users to consume during the daytime. It is widely used in similar hot water systems and was therefore selected for comparison. Using the traditional control method, the same system was set to fill the ST to its maximum between 2:00 and 5:00, when most of the occupants do not shower. The incoming hot water, remaining hot water, and energy consumption were calculated. The remaining hot water in the ST under the traditional control method is shown in Figure 15. This control method met the occupants' hot water demand well, with a 100% HWSA. However, under the traditional control method, a large quantity of hot water was stored in the ST to be consumed later by the occupants, which caused more heat loss in the ST. The heat loss and energy consumption of the two methods were calculated and are compared in Table 15.
The results showed that, by adopting the new predictive control method based on occupant behavior prediction, heat loss from the ST was reduced by up to 33.12%. Heat loss in the ST accounted for 16.03% of the total energy consumption of the whole system under the traditional control method. This share fell to 11.32% under the new predictive control method, and the total energy consumption was reduced by 5.31%.
To show the effect of the new control method more intuitively, the electricity consumption of both scenarios was also calculated using Equation (40). Over the three-month simulation period, the total energy consumption of the seven users was reduced by 154 kWh, from 2898 to 2744 kWh. This shows that the new control method could save up to 7.33 kWh of electricity per month per user compared with the traditional control method in the proposed simulation.
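The reported savings can be cross-checked with a few lines of arithmetic. The sketch below uses only figures quoted in the text; applying the heat-loss shares to the 2898 and 2744 kWh totals is an assumption made here for the cross-check, so small rounding differences from Table 15 are to be expected.

```python
traditional_kwh = 2898.0   # three months, seven users, traditional control
predictive_kwh = 2744.0    # three months, seven users, predictive control
n_users, n_months = 7, 3

saved_kwh = traditional_kwh - predictive_kwh             # 154 kWh
relative_saving = saved_kwh / traditional_kwh            # about 5.31%
saved_per_user_month = saved_kwh / (n_users * n_months)  # about 7.33 kWh

# Assumed cross-check: heat-loss shares applied to the totals above.
loss_traditional_kwh = 0.1603 * traditional_kwh
loss_predictive_kwh = 0.1132 * predictive_kwh
heat_loss_reduction = 1.0 - loss_predictive_kwh / loss_traditional_kwh  # about 33%
```

Under that assumption the heat-loss reduction comes out at roughly 33%, in line with the reported 33.12%.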
The cost of the implementation of such a new control method was low as the main change was the installation of a WCU for shower data collection. The cost of one WCU for one shower tap was approximately 25 to 30 US dollars including implementation. 38 For one apartment, 2 or 3 people shared a shower tap. The electricity price in Shanghai is 0.091 US dollars/kWh. The payback period of this new control method is therefore 1.04-1.87 years.
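The payback range follows from the figures above if the cost of one WCU is divided by the annual electricity savings of the people sharing one shower tap. The sketch below is a simple-payback calculation under that assumption (no discounting or maintenance costs); the function name is illustrative.

```python
def payback_years(wcu_cost_usd, users_per_tap,
                  kwh_saved_per_user_month=7.33,
                  price_usd_per_kwh=0.091):
    """Simple payback: device cost divided by annual savings per tap."""
    annual_saving_usd = (users_per_tap * kwh_saved_per_user_month
                         * 12 * price_usd_per_kwh)
    return wcu_cost_usd / annual_saving_usd

# Best case: cheapest WCU shared by three people.
best = payback_years(wcu_cost_usd=25.0, users_per_tap=3)
# Worst case: most expensive WCU shared by two people.
worst = payback_years(wcu_cost_usd=30.0, users_per_tap=2)
```

Rounded to two decimals, the two cases give 1.04 and 1.87 years, matching the stated range.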

| CONCLUSION
In the present study, a novel predictive control method based on occupant behavior prediction using a data mining algorithm was proposed for a domestic hot water system. This method generates the future hot water demand from a prediction of each occupant's shower behavior. A control strategy for a domestic hot water system based on the prediction results was developed and simulated in software using shower data from real occupants. A heat transfer model of the hot water storage tank was built to show the relationship between the heat loss and the remaining hot water in the tank. Finally, a comparative analysis of the prediction results and the energy consumption was carried out. The main conclusions from this paper are as follows: