Accuracy evaluation of two markerless motion capture systems for measurement of upper extremities: Kinect V2 and Captiv

Motion capturing is a promising method to assess working postures and human movements and, therewith, the risk of musculoskeletal injuries that could occur while performing manual tasks in industrial settings. To obtain a reliable risk assessment, the motion capture system used has to accurately measure body postures adopted by the worker during the task. This study evaluates the accuracy of two different motion capture systems, namely the Microsoft Kinect V2 and the Captiv system, in measuring joint angles of the upper extremities. For this purpose, an experimental study was conducted involving 12 subjects performing preset static postures and basic movements, including elbow flexion, shoulder flexion, and shoulder abduction. In addition, to examine whether self-occlusion or occlusion of body parts by work equipment has an impact on the accuracy of the Kinect V2, the subjects handled boxes during some of the tests. As a gold standard, a goniometer was used for static postures and an angle scale for dynamic exercises. The Captiv system shows high correlation coefficients (r > .93) and small mean absolute errors (<5°) for all movements except for elbow flexion. The Kinect V2 also delivers sufficiently accurate results for joint angles captured without occlusion, but its accuracy decreases significantly when occlusion occurs.

Recently, motion capture technologies have been used more frequently for the ergonomic assessment of physical workloads at industrial workplaces. These systems allow the automatic recording of postures and movements during manual working tasks. The collected data can be digitized, graphically visualized, and analyzed using statistical computing methods. Although marker-based systems are considered to be the gold standard in laboratory environments (Lai et al., 2013), markerless systems are much better suited for practitioners because they can be used with less elaborate preparation and are significantly cheaper (e.g., Corazza, Mündermann, Gambaretto, Ferrigno, & Andriacchi, 2010). This study evaluates the suitability of two markerless motion capture systems for an ergonomic assessment of working postures and movements of upper extremities: the optical sensor Microsoft Kinect V2 and the body-mounted inertial sensors of the Captiv L7000 system provided by TEA (http://teaergo.com).
This study presents an important first step in understanding the applicability of these technologies for upper limb movements and provides insights into the accuracy of the Captiv system since there is little evidence on the system's reliability in the literature so far.
In the past, the Kinect system has frequently been studied for its accuracy in the assessment of postures and specific movements, usually in laboratories. In the majority of studies, the parameters measured by the Kinect were compared with the data of a gold standard. Most often, marker-based optical systems were used as a gold standard (e.g., Bonnechère et al., 2014; Clark et al., 2015; Dutta, 2012; Kuster, Heinlein, Bauer, & Graf, 2016; Plantard, Muller et al., 2017). Body-mounted sensors (e.g., Huber, Seitz, Leeser, & Sternad, 2015; Romero et al., 2017), goniometers (e.g., Hawi et al., 2014; Lee et al., 2015), and orthogonal reference photographs (e.g., Matsen, Lauder, Rector, Keeling, & Cherones, 2016) have been used as gold standards for postures and motions as well. Building on these findings, the Kinect has already been used for the automation of ergonomic assessments by calculating joint angles, for example using the Ovako Working Posture Assessment (Diego-Mas & Alcaide-Marzal, 2014) or the Rapid Upper Limb Assessment (RULA; Manghisi et al., 2017). Compared with expert ratings, the scores these tools derive from Kinect data are sufficiently accurate, with over 70% of RULA scores for the upper body assessed correctly, even under suboptimal conditions in real working environments (Plantard, Shum, Le Pierres, & Multon, 2017). Moreover, some researchers used raw data from multiple Kinect systems to capture and consolidate movements from different perspectives. Most of them focused on the analysis of human gait (Dolatabadi, Taati, & Mihailidis, 2016; Geerse, Coolen, & Roerdink, 2015; Müller, Ilg, Giese, & Ludolph, 2017). Chen, Lee, and Lin (2015), for example, developed an approach with multiple Kinect V1 sensors to evaluate the accuracy in range-of-motion experiments of the upper extremities.
However, researchers agree that the newer generation (Kinect V2) is more accurate than the Kinect V1 for motion capture of upper extremities (Mishra, Skubic, & Abbott, 2015; Wang, Kurillo, Ofli, & Bajcsy, 2015). Therefore, it is recommended to use the current Kinect V2 model for further studies.
One major disadvantage of the Kinect V2 that has been mentioned in the literature is the capturing of work processes in environments where occlusion occurs. Occlusion, in this context, could either be caused by an object (e.g., a box) handled by the subject or by the subject itself, since the field of view of the camera is blocked. This may reduce the accuracy of the optical system (i.e., the Kinect V2).
This is especially important for practical cases in real work settings, as such occlusion situations often occur (Dzeng, Hsueh, & Ho, 2017; Kuster et al., 2016; Müller et al., 2017). Although it has been indicated that the accuracy of the Kinect V2 is sufficient to calculate RULA values even in situations where occlusion occurs, an accuracy evaluation of joint angle measurement in authentic working environments using multiple Kinect V2 sensors is still lacking. To the best of the authors' knowledge, prior research has so far not used multiple Kinect V2 systems to capture human motions from different perspectives to reduce the influence of occlusion of upper extremities and to improve the accuracy of the system in real working tasks.
In contrast to the Kinect V1 and V2, the Captiv system has not yet been analyzed for its accuracy at all. So far, this system has only been used to investigate other topics and thus some degree of system accuracy has been assumed. Bartnicka, Ziętkiewicz, and Kowalski (2015), for example, used Captiv to capture body postures of surgeons during surgeries. Furthermore, Bartnicka et al. (2017) examined the motions of orthopedists and medical staff to improve ergonomic working conditions. Taber, Sweeney, Bishop, and Boute (2017) used Captiv to measure the movements of limb, torso, and head to jettison a helicopter push-out window. In another study, Vignais, Bernard, Touvenot, and Sagot (2017) established a continuous RULA computation by supplying a biomechanical model with motion data from Captiv. Furthermore, Captiv is based on inertial sensors, and hence occlusion does not influence the accuracy of the system.
The aim of this study is to quantitatively evaluate and compare the accuracy of a motion capture system consisting of two Kinect V2 sensors and the Captiv system using a subject study including a defined working task (the lifting of boxes). Here, the measured joint angles of the upper extremities are compared with a gold standard for static postures (a goniometer, e.g., Hawi et al., 2014) and movements (an angle scale).
The following research questions are addressed in the work at hand.
(1) How accurate are the Captiv system and a Kinect V2 system with two sensors in capturing basic postures and movements, both with respect to a gold standard and to one another?
(2) How is the accuracy of the two-sensor Kinect V2 system influ-
The second motion capture system used in this study is the Captiv L7000 system provided by the TEA group (see Figure 1). These two systems were chosen for the purpose of our study because, on the one hand, Kinect is a quite inexpensive and widely available system that has often been evaluated in the literature as having sufficient performance. Therefore, it could be especially suitable for companies that do not have the financial capacity for high-priced systems. In addition, unlike many alternative motion capture techniques, the operator is unaffected by markers or cables in his/her work. Captiv, on the other hand, is evaluated because it uses a different approach to capture motion than the Kinect and since it has not been validated yet. In addition, compared with optical motion capture systems, inertial measurement unit-based systems are more suitable for field studies due to their quick setup, which makes the Captiv system attractive for practical analyses.

This system uses inertial sensors with integrated magnetometers
Both systems are thus well-established complete solutions consisting of hardware and software. Our focus is not on improving the algorithms used by the systems. Instead, the whole system is understood as an integrated measurement tool whose output we analyze. By comparing both systems, we can evaluate two fundamentally different motion capture technologies to provide recommendations for action regarding which system is suited for capturing postures and motion for industrial working tasks.

| Subjects
Twelve young and apparently healthy individuals (age: 23.8 ± 2.6 years, height: 177.3 ± 9.4 cm, weight: 70.9 ± 12.3 kg, seven males and five females) participated in the study, a sample size comparable to that of similar studies (e.g., Huber et al., 2015; Plantard, Muller et al., 2017). All subjects read an ethics statement and gave written informed consent before the experiments. The subjects wore tight-fitting clothes that allowed an appropriate placement of inertial sensors.

| Experimental procedure
The experiment was set up by placing two Kinect V2 sensors 2.5 and 2.7 m, respectively, in front of the subject and 1.5 m above the ground; see Figure 2. One of the sensors was facing the subject, as this has often been described as an optimal sensor setup (e.g., Xu, Robertson, Chen, Lin, & McGorry, 2017). In preliminary tests, the second sensor was placed at an angle of 80° to the first sensor (iPi Soft, 2019). This setup was also used in previous studies (e.g., Skals et al., 2017; Stone & Skubic, 2011) and found to be particularly robust before our tests, even for complex movements, and therefore adopted for the experiments. For all experiments in this study, attention was paid to the best possible exclusion of all potential influences, such as ferromagnetic metals, electromagnetic radiation, or inaccurate alignment of the sensors.
To assess the accuracy of the two systems, static postures and basic movements (with and without holding a box in both hands) of 12 volunteers were recorded. The focus was on the joint angles of upper extremities: (a) shoulder abduction/adduction, (b) shoulder flexion/extension, and (c) flexion/extension of the elbow, since shoulder and elbow are commonly associated with work-related MSDs within the upper limbs (Aptel, Aublet-Cuvelier, & Cnockaert, 2002; Buckle & Devereux, 2002). During the experiments, the joint angles were measured simultaneously with both systems, and the raw data of both systems were synchronized temporally using a light signal. The procedure was divided into three phases as described in Table 1.
FIGURE 1 Kinect V2 (left), Captiv sensor (middle), and Captiv recording device (right)
STEINEBACH ET AL.
Phase 1 includes four preset, static postures related to upper extremities (see Figure 3). Subsequently, joint angles were measured on the subject using the motion capture systems and a goniometer.
An optical marker-based system that has often been used as a gold standard could not be employed because of resource limitations; a goniometer, which is a standard tool for determining joint angles, was used instead (Milanese et al., 2014). It has high validity (Rothstein, Miller, & Roettger, 1983) as well as intra- and interpersonal reliability (Brosseau et al., 2001). Thus, goniometers have already been used successfully as a gold standard in similar studies for simple movements, for example, Lee et al. (2015) and Hawi et al. (2014). The chosen landmarks for flexion/extension of the elbow, for example, are the lateral epicondyle, the tip of the acromion, and midline of the wrist (Chapleau, Canet, Petit, Laflamme, & Rouleau, 2011).
In Phases 2 and 3, the accuracy of both systems was evaluated for basic movements of the upper extremities. To this end, abduction and flexion of the shoulder, as well as flexion of the elbow, were carried out on the right and left side. The subjects stood straight, without moving the legs or bending the back during the tests. The movements in Phase 2 were performed without the handling of objects. In Phase 3, a box (dimensions: 35 cm × 25 cm × 25 cm) was held using both hands during the movements. In this case, the view of one Kinect sensor on one of the arms was occluded. Hence, it was possible to determine the impact of box handling on the angle measurement accuracy of the Kinect V2.
An angle scale (dimensions 2 m × 1 m) was installed as a gold standard for the movements, because the angles cannot be recorded continuously with the goniometer. The upper extremities were rotated around the origin of the angle scale to allow accurate reading of angles. For this purpose, the scale was mounted on the ceiling with ropes to permit adjusting its height. The origin of the angle scale has to be exactly at the height of the examined joint. A video camera was used to determine the angles from the scale for the evaluation of the movements. The camera was aligned at this height with the help of a laser to prevent perspective distortion. A similar procedure was utilized by Matsen et al. (2016), who measured the actual joint angles on photographs.

| Raw data processing
First, the motion data of both systems were exported. The Captiv L7000 software provides a function to automatically calculate joint angles (abduction/flexion of the shoulder and flexion of the elbow), which could directly be used for evaluation. The frequency of the dataset was down-sampled from 32 to 30 Hz to be in accordance with the Kinect data.
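The source does not state which resampling method was applied to bring the 32 Hz Captiv data to 30 Hz. As a minimal sketch, assuming simple linear interpolation between neighboring samples, the step could look as follows; the function name `resample_linear` and the ramp signal are hypothetical and serve only as illustration.

```python
def resample_linear(samples, src_hz, dst_hz):
    """Resample a uniformly sampled signal by linear interpolation.

    Assumption: linear interpolation is used for illustration only;
    the method actually applied in the study is not stated.
    """
    if len(samples) < 2:
        return list(samples)
    duration = (len(samples) - 1) / src_hz      # time of the last sample in seconds
    n_out = int(duration * dst_hz + 1e-9) + 1   # output samples that fit into that span
    out = []
    for k in range(n_out):
        pos = k / dst_hz * src_hz               # fractional index into the source signal
        i = min(int(pos), len(samples) - 2)     # left neighbor, kept inside the list
        frac = pos - i
        out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
    return out


# Hypothetical test signal: one second of an angle rising linearly from 0 to 90 degrees.
angles_32hz = [90.0 * t / 32 for t in range(33)]
angles_30hz = resample_linear(angles_32hz, 32, 30)
print(len(angles_32hz), len(angles_30hz))
```

For movements as slow as those recorded here, linear interpolation is a plausible choice, but a filter-based resampler may be preferable when the signal contains faster components.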
In the iPi Soft software for the Kinect V2, in contrast, only the Cartesian coordinates of the joints can be exported.
To calculate the joint angles φ from the coordinates, vectors $\vec{v}_i$ and anatomical planes were computed by linear algebra. The mathematical models were based on those of Lee et al. (2015) and Diego-Mas and Alcaide-Marzal (2014). The angle of the elbow, for example, was calculated using the vectors of the upper arm and forearm.
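The elbow calculation described above can be sketched as the angle between two vectors built from three exported joint positions. The function names, the coordinate values, and the convention that a fully extended arm corresponds to 0° are assumptions made for illustration; the exact model of the cited works is not reproduced here.

```python
import math


def vector(p_from, p_to):
    """Vector pointing from one 3D joint position to another."""
    return tuple(b - a for a, b in zip(p_from, p_to))


def angle_between(u, v):
    """Angle between two 3D vectors in degrees, via the dot product."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    cos_phi = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_phi))


def elbow_flexion(shoulder, elbow, wrist):
    """Elbow flexion in degrees (assumed convention: 0 = arm fully extended)."""
    upper_arm = vector(shoulder, elbow)  # shoulder -> elbow
    forearm = vector(elbow, wrist)       # elbow -> wrist
    return angle_between(upper_arm, forearm)


# Hypothetical joint positions in meters (not taken from the study).
shoulder = (0.0, 1.4, 0.0)
elbow = (0.0, 1.1, 0.0)
wrist_bent = (0.25, 1.1, 0.0)     # forearm horizontal
wrist_straight = (0.0, 0.8, 0.0)  # forearm in line with the upper arm

print(elbow_flexion(shoulder, elbow, wrist_bent))
print(elbow_flexion(shoulder, elbow, wrist_straight))
```

Because only the angle between the two segments is used, this calculation does not depend on the orientation of the sensor coordinate system, which is not the case for the plane-based shoulder angles.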
Anatomical planes were determined to calculate shoulder angles.
For the frontal plane, the connecting vector of the right and left shoulder and the trunk vector, which consists of the connection between the spine base and neck, were selected. Based on this, the sagittal plane was defined. The first support vector was also

| Statistical analysis
To determine the deviation of the two motion capture systems from the gold standard, the mean absolute error (MAE) was calculated. We followed the recommendation of Willmott and Matsuura (2005) by choosing the MAE instead of the root mean square error (RMSE) as a measure of deviation, because the RMSE gives a relatively high weight to large errors and is therefore hard to interpret. However, the RMSE is also reported below to facilitate comparing our results to those of earlier studies that relied on the RMSE. To analyze the differences between the two systems, the Wilcoxon signed-rank test was performed (α = .05) because the corresponding data were not normally distributed.
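The difference between the two deviation measures can be illustrated with a short sketch using the standard definitions of MAE and RMSE. The angle values below are invented for illustration and are not data from this study; the function names are likewise hypothetical.

```python
import math


def mean_absolute_error(measured, reference):
    """Average magnitude of the deviations from the reference."""
    return sum(abs(m - r) for m, r in zip(measured, reference)) / len(reference)


def root_mean_square_error(measured, reference):
    """Square root of the average squared deviation from the reference."""
    return math.sqrt(sum((m - r) ** 2 for m, r in zip(measured, reference)) / len(reference))


# Invented example angles in degrees (not study data).
reference = [0.0, 30.0, 60.0, 90.0]
small_errors = [2.0, 32.0, 62.0, 92.0]  # every reading is off by 2 degrees
one_outlier = [0.0, 30.0, 60.0, 98.0]   # a single reading is off by 8 degrees

for name, series in (("small errors", small_errors), ("one outlier", one_outlier)):
    mae = mean_absolute_error(series, reference)
    rmse = root_mean_square_error(series, reference)
    print(f"{name}: MAE = {mae:.1f}, RMSE = {rmse:.1f}")
```

Both series have the same MAE, whereas the RMSE of the series with the single large deviation is twice as high, which shows the stronger weighting of large errors.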
In addition, correlation coefficients between the two systems and the gold standard were calculated by comparing the maximum angle, minimum angle and the midpoint angle

| RESULTS
This section presents the descriptive results of the subject study concerning the accuracy of both systems for each phase.

| Results for Phase 1
The MAEs of the systems' measurements compared with the goniometer are between 2.5° and 12.9° for the Kinect V2 and between 1.9° and 14.9° for the Captiv for static postures (see Table 2). Especially for the elbow, the Captiv system shows high errors, while the angles of the shoulder were captured very accurately in all postures with deviations <4°.
In general, the correlation coefficients for both systems are very high and significant, except for the Kinect's measurements of abduction (Posture 1) and flexion (Posture 3) and for Captiv's measurements of the elbow angle.

| Results for Phases 2 and 3
After manually evaluating the joint angles during movements using the angle scale, the MAE, the RMSE and the correlation coefficients were calculated again. For Phase 2 (without box) and Phase 3 (with box), these values are summarized in Tables 3 and 4, respectively. For flexion of the elbow, the Kinect system also has higher MAEs than Captiv, especially for the detection of the left arm. This is also reflected in the correlations, which are lower for Kinect (r < .72) than for Captiv (r > .93).
Finally, it can be observed that the measurement deviations of
Looking at the corresponding Bland-Altman plots (Figure 4), the mean difference is close to zero for Captiv in many cases. Furthermore, the two dashed lines (95% limit of agreement) are closer to each other for the Captiv measurement. The narrower the limit of the agreement is, the more practical is the use of this measuring method (Dolatabadi et al., 2016). This error may have been caused by the initialization (manual alignment of the sensors to minimize the impact of environmental influences) of the Captiv system, which only had a "moderate" quality according to the software, although the initialization was carried out exactly as recommended by the manufacturer. A moderate initialization quality may not be sufficient for the highest accuracy requirements and has to be taken into account when interpreting the results.
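The quantities shown in a Bland-Altman plot, as referred to above, can be sketched in a few lines. The sketch assumes the common definition of the 95% limits of agreement as the mean difference ± 1.96 standard deviations and takes differences as system value minus gold standard; both are assumptions, since the source does not spell out the computation. The angle values are invented for illustration.

```python
import statistics


def bland_altman(measured, reference):
    """Return (bias, lower limit, upper limit) of agreement.

    Assumptions: difference = measured - reference; limits of agreement
    = bias +/- 1.96 * sample standard deviation of the differences.
    """
    diffs = [m - r for m, r in zip(measured, reference)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Invented example angles in degrees (not study data).
gold_standard = [30.0, 60.0, 90.0, 120.0]
system = [31.0, 58.0, 92.0, 119.0]

bias, lower, upper = bland_altman(system, gold_standard)
print(f"bias = {bias:.2f}, limits of agreement = [{lower:.2f}, {upper:.2f}]")
```

A bias near zero indicates no systematic over- or underestimation, and a narrow interval between the two limits corresponds to the closely spaced lines described for the Captiv measurements.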
In sum, the Kinect V2 is able to deliver results that are almost as accurate as those obtained by Captiv L7000 in very simple, static situations without occlusion of body parts.

| Basic movements
This subsection evaluates dynamic experiments. The following types of occlusion are of particular interest.
(1) Self-occlusion (occlusion of the left arm by the right arm).
(2) Occlusion by tools/equipment (body parts covered by the box held between the hands).
As mentioned in the previous section, the mean absolute deviations of both systems for the abduction of the shoulders in Phase 2 with MAE < 4.4° are very low and the correlation coefficients with r > .9 are high. For these movements without occlusion, both systems can be recommended for use.
In Phase 3, MAEs were higher for the Kinect system. Bland-Altman plots in Figure 6 graphically illustrate these errors. The dotted limit-of-agreement lines in the right plots (no self-occlusion) are much closer to each other than in the left plots. Self-occlusion also becomes evident from the high number of measurements of the left shoulder located in the negative range of the differences (y-axis) of the left diagrams. These high negative differences occurred when the Kinect was unable to detect the joints of the occluded arm. The corresponding iPi Soft visualizations show that, in these cases, the arm points down parallel to the upper body, even though the arm is actually flexed at 90°. In addition, the influence of the occlusion by the box can also be recognized for shoulder flexion. The MAEs in the third phase are again higher than in the second phase.
FIGURE 4 Bland-Altman plot, Phase 1: Shoulder abduction
The detection of joint angles during elbow flexion is not completely satisfactory for either system. The problem of the initialization of Captiv, which has already been described in the static phase, is also recognizable during movements. Mean absolute errors between 6.9° and 10.7° could be insufficient for specific clinical studies or ergonomic analyses.
Whether a system can still be used in investigations depends ultimately on the context of the analysis and on whether or not the conclusion of an analysis changes due to the measurement error (Myles & Cui, 2007).
The detection of the joint angles of the elbow by the Kinect V2 sensor has even higher errors than the measurement with Captiv. In contrast to the relatively accurate results of the static motion capture of the elbow flexion, the detection of dynamic movements by the Kinect sensor failed for some subjects. Again, the self-occlusion of body parts seems to be problematic. The mean absolute error of the left elbow in
The results of this study are subject to limitations. First, it has to be taken into account that the two motion capture systems were only evaluated in this study with respect to three different joint angles.
Furthermore, the Kinect system can face further challenges in a real work environment. Daylight, moving objects in the background, suboptimal sensor placement and larger areas of the subject's movements may further reduce the accuracy of the motion capturing.
In addition, the experiments were carried out only with relatively young participants (20-27 years), which does not necessarily correspond to the average age of persons performing manual operations, for example, in logistics.

| CONCLUSION
The aim of this study was to quantitatively compare the accuracy of the Kinect V2 and the Captiv L7000 motion capture systems in a subject experiment study. The joint angles of the upper extremities measured by both systems were compared with a reference system for static postures and movements. A central objective of this experimental study was to investigate if a second Kinect V2 sensor, which detects the motions from another angle, is able to compensate for the problem of occlusion that results, for example, from the handling of a box. This disadvantage has been described earlier in the literature. The results of the study showed that the angles of the upper extremities have significantly higher measurement errors when occluded by either a work item or the subjects themselves. This especially applies to the flexion of the shoulders and elbows.
FIGURE 5 Bland-Altman plot, Phase 1: Shoulder flexion
Occlusion, therefore, remains a major disadvantage even when using two Kinect sensors. For complex movements, especially in industrial work environments with occlusions, the Captiv system is therefore preferable for ergonomic analyses in terms of accuracy in the majority of cases.
We conclude that the Kinect V2 sensor and the Captiv L7000 system are both able to obtain accurate results in situations without occlusion for joint angles of upper extremities. However, the flexion angle of the shoulder is underestimated during measurement with the Kinect V2. In addition, Captiv is less accurate in determining elbow angles; in the case of this study possibly due to an insufficient initialization quality. For many clinical studies, these characteristics are sufficient as long as there are no occluded extremities while capturing with Kinect. In the next step, motion data captured by Captiv in a real manufacturing environment could be integrated in ergonomic risk assessment, for example, RULA or the European Assembly Worksheet.
Moreover, it may be worthwhile to test the Captiv system in further studies for its accuracy in more complex movement sequences to further demonstrate its applicability to real working environments. In addition, it would be desirable to assess the effect of initialization quality on the accuracy of the Captiv system. Also, capturing with more than two Kinect sensors or processing the raw data with software other than iPi Soft could be evaluated in an extension of this study to analyze whether this leads to higher accuracy.
FIGURE 6 Bland-Altman plot, Phases 2 and 3: Shoulder flexion
FIGURE 7 Error pattern for movements with box