RumexWeeds: A grassland dataset for agricultural robotics

Computer vision can lead toward more sustainable agricultural production by enabling robotic precision agriculture. Vision-equipped robots are being deployed in the fields to take care of crops and control weeds. However, publicly available agricultural datasets containing both image data and data from navigational robot sensors are scarce. Our real-world dataset RumexWeeds targets the detection of the grassland weeds Rumex obtusifolius L. and Rumex crispus L. RumexWeeds includes whole image sequences instead of individual static images, which is rare for computer vision image datasets, yet crucial for robotic applications. It allows for more robust object detection, incorporating temporal aspects and considering different viewpoints of the same object. Furthermore, RumexWeeds includes data from additional navigational robot sensors (GNSS, IMU and odometry), which can increase robustness when fed to detection models in addition to the images. In total the dataset includes 5510 images with 15,519 manual bounding box annotations collected at three different farms on four different days in summer and autumn 2021. Additionally, RumexWeeds includes a subset of 340 ground truth pixel-wise annotations. The dataset is publicly available at https://dtu-pas.github.io/RumexWeeds/. In this paper we also use RumexWeeds to provide baseline weed detection results with a state-of-the-art object detector, thereby elucidating interesting characteristics of the dataset.

grazing animals, and if consumed in large doses it can be poisonous to the animals, due to the large proportion of nitrate and oxalate in the plant (Hejduk, 2004). Rumex also shares soil resources and therefore competes with sown and other more nutritious pasture species (Zaller, 2004). If Rumex is not controlled, it causes significant yield losses of 10%-40% (van Evert et al., 2011).
We aim to close that gap by making a new 30.3 GB dataset, RumexWeeds, publicly available, enabling comparability of future approaches. RumexWeeds consists of image sequences with a total of 5510 images at 2.3 MP resolution and 15,519 manual bounding box annotations, collected at three different farms on four different days in summer and autumn 2021. Additionally, navigational robot sensor data from GNSS, IMU and odometry are recorded. The dataset is publicly available at https://dtu-pas.github.io/RumexWeeds/. In this paper, apart from introducing the RumexWeeds dataset, we also present detection as well as segmentation prediction results on it, which serve as a baseline for the research community.

| RELATED WORK
In the domain of grassland agriculture, only three relevant and publicly available datasets were identified.
However, the dataset consists only of images taken within one field on a single day and under controlled lighting conditions. The variety of the image data is therefore low. The GrassClover dataset (Skovsen et al., 2019) targets the segmentation of clover and grass to estimate the biomass composition on grassland fields. The images are generated synthetically using image composition, that is, no extensive manual pixel-wise annotations were required. Therefore, the images are not fully representative of real weeds. The DeepWeeds dataset (Olsen et al., 2019) is a large image classification dataset including eight common Australian weeds in their natural environment.
The generation of a purely image-based dataset, such as the ones above, requires significantly less effort than the generation of a robotics dataset, where not only images, but also data from additional robot sensors, such as GPS, IMU, odometry or LIDAR, are collected. A robotics dataset must be collected with an actual robot platform and the data management complexity is notably higher. Obviously, these datasets are richer in information and allow one to relate datapoints from different sensors, resulting in different possible applications, such as visual SLAM and object mapping. In the following, we summarize three relevant robotics datasets within agricultural field robotics, even though none of them targets grasslands.
The SugarBeet dataset (Chebrolu et al., 2017) was the first agricultural field robotics dataset collected with the BoniRob platform (Chebrolu et al., 2017). Data was recorded over 3 months on a sugar beet farm, including a high number of different sensors: RGB camera, stereo camera, laser scanner, GPS and odometry. Furthermore, it provides pixel-wise ground truth annotations for a subset of 400 images. For the Brassica dataset (Bender et al., 2020), weekly sequences of cauliflower and broccoli were collected to capture the growth cycle. Their robot platform Ladybird is equipped with a GPS, an IMU, a forward-pointing line-scan hyperspectral camera, a downward-pointing stereo camera under controlled lighting conditions as well as a downward-pointing thermal camera. For a selected subset of the acquired images, ground-truth annotations on pixel level as well as bounding-box level are provided. Finally, the Rosario dataset (Pire et al., 2019) is a multisensor dataset (stereo camera, IMU, GPS, odometry) targeting specifically the application of visual SLAM. The stereo camera points toward the forward-driving direction and no image annotations are provided.
So far, agricultural robots are predominantly applied to crop cultivation fields, where plants are clearly separable from soil by means of simple color thresholding or by considering near-infrared (NIR) imagery that has higher reflectivity on chlorophyll content. With RumexWeeds, we provide the first robotics dataset for grasslands. RumexWeeds targets Rumex, the most problematic weed in dairy farming, which is expected to be more challenging as foreground vegetation (weeds) and background (grass) share similar color and chlorophyll content.

| THE AGRICULTURAL ROBOT PLATFORM
The Clearpath Husky robot serves as the base platform, equipped with four relevant sensors (Figure 1a). These sensors as well as the computing unit are specified in the following.

| RGB camera
The Basler ace 2 camera has a global shutter, an image resolution of 1920 px × 1200 px and a pixel size of 3.45 μm × 3.45 μm, and is connected via USB 3.0. The camera is equipped with a Basler lens with a focal length of 4 mm, resulting in a FOV of
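The published FOV value is cut off above. As a hedged back-of-envelope sketch, the angle of view can be approximated from the stated focal length and sensor geometry; this is only a pinhole approximation that ignores lens distortion, so the datasheet values should be preferred.

```python
import math

# Pinhole approximation of the angle of view from the parameters stated above.
focal_length_mm = 4.0
pixel_size_mm = 0.00345           # 3.45 um
width_px, height_px = 1920, 1200

sensor_w_mm = width_px * pixel_size_mm    # ~6.62 mm
sensor_h_mm = height_px * pixel_size_mm   # ~4.14 mm

fov_h = 2 * math.degrees(math.atan(sensor_w_mm / (2 * focal_length_mm)))  # ~79 deg
fov_v = 2 * math.degrees(math.atan(sensor_h_mm / (2 * focal_length_mm)))  # ~55 deg
print(round(fov_h, 1), round(fov_v, 1))
```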

| GNSS sensor
We mounted a Navilock NL-8022MU USB 2.0 multi-GNSS receiver on the top of the camera beam to receive the best possible signal.
The low-cost GNSS sensor is based on the u-blox 8 chipset and has a positioning accuracy of 2.5 m circular error probable (CEP). Therefore, the sensor provides a rough absolute robot position, but is not by itself sufficient for precise navigation or localization tasks.

| IMU sensor
The Xsens MTI-630 sensor provides ±0.2° accuracy for roll/pitch and ±1° accuracy for yaw. It is mounted inside the black cube structure on board the Husky platform (see again Figure 1a).

| Odometry sensor
The odometry is measured with the wheel encoders that come by default with the Husky robot platform: 1024 pulses per revolution (PPR) optical incremental quadrature encoders mounted pre-gearbox, which result in approximately 72,000 counts per meter.

F I G U R E 1 The agricultural robot platform. (a) We build upon the basic configuration of the Clearpath Husky robot with odometry sensor by adding an RGB camera, a low-cost GNSS sensor, an IMU sensor and an Nvidia Jetson Xavier NX. (b) The RGB camera points toward the ground at an angle of 75° and a height of H = 1000 mm, resulting in a trapezoidal field coverage. Due to the mounting offset of o = 565 mm, the image plane does not include any robot parts. (a) Robot in the real world. (b) Camera system setup.

| Computer setup
The robot is equipped with two separate computing units: (1)

| DATA ACQUISITION
On grasslands, a weeding robot is expected to be deployed to the fields approximately 3-4 weeks after the grass has been cut. At this point in time, the Rumex weeds have re-grown to such an extent that they are clearly visible and stand out from the grass; however, they have not yet reached a plant size that is difficult to handle autonomously. Therefore, the data acquisition for the RumexWeeds dataset has always been performed at a weed regrowth of approximately 3-4 weeks. Note that we can still observe weed size variances within the data, because the growing speed also depends on the prevailing weather conditions during the growing period, as well as on soil and other conditions. The species composition of grassland can vary strongly from field to field; therefore, we considered three different dairy farms in the Copenhagen hinterland in Denmark, as shown in Figure 2: (1) Lundholm, (2) Hegnstrup, and (3) Stengard. Where available, even different fields within the same farm were considered, which are marked in Figure 2. Furthermore, data was acquired on four different days during summer and autumn of 2021. The exact dates are specified in Table 1. We captured data during all relevant weather conditions: clear/sunny, partly cloudy/partly sunny, mostly cloudy and right after rain, leading to data acquisition on wet ground.
During data collection, the robot was manually controlled, keeping an average speed of 0.4 m/s. We did not cover the whole fields, but drove the robot criss-cross through the field, focusing on relevant and interesting parts of the field including either Rumex weed plants or other plants that can easily be confused with Rumex (called negative plants in the following). The data was collected with the rosbag tool, which saves the sensor data at their publishing frequencies: 5 FPS for the RGB camera, 20 Hz for the IMU and 10 Hz for the GNSS and the odometry.

| Bounding box annotation
For all labeled image sequences, bounding box annotations were generated manually for the grassland weed Rumex. Each bounding box includes the whole plant with all attached leaves. If the plant consists of only one leaf, the single leaf is enclosed by a bounding box. On very dense weed images, it can be difficult to identify all plants with high certainty. Here, noisy labels can be expected.
Furthermore, we differentiate between the two relevant subspecies of Rumex, namely R. obtusifolius L. and R. crispus L., and assign the relevant class to each bounding box. Both subspecies are equally undesired on dairy grassland fields and therefore, for most applications, it is reasonable to treat both species as one class. The assignment to one of the above-mentioned classes was made purely based on the visual appearance in the images; no real-world plant phenotype such as root size, flowering branches or seed weight was considered. Since the majority of plants are captured at a relatively early growth stage with no flowers present, the only way to visually differentiate between the two subspecies is by considering the leaf and stem characteristics, as described by Cavers and Harper (1964) and summarized in Table 2. During the annotation process, we found the leaves' shape to be the most important characteristic for our class assignment decisions.
We developed a custom annotation tool to take advantage of the image sequences as well as the navigational robot data. Initial bounding box annotations are tracked using odometry measurements as well as the visual CSRT tracker (Lukežič et al., 2018). If both trackers agree, the tracked bounding box is proposed to the annotator, who can accept, modify or delete the proposal (a minimal sketch of this proposal step is given at the end of this subsection). Table 1 summarizes the datapoints that have been collected and annotated on four different days and at three different farms. It includes the number of foreground images (those including one or more Rumex objects), the number of pure background images and the total number of bounding box annotations for both classes (R. obtusifolius L., R. crispus L.). Since bounding box annotations are performed for the whole dataset, we provide additional characteristics, namely the total proportion of positive pixels versus all image pixels, as well as the average bounding box size as a percentage of the whole image size. Although we have a significantly larger number of foreground images, that is, images containing at least one object, the Rumex plant is still highly underrepresented when considering the positive pixel proportion for each data collection session. A more detailed distribution of object size as a percentage of the image size is given in Figure 3. In Figure 4, the object distribution over the image plane is shown for all labeled foreground images. It can be clearly seen that weed plants are more frequently present in the center of the image. This can be explained by the fact that, during data acquisition, the robot was teleoperated manually with a joystick, aiming to capture as many weed plants as possible. The operators unconsciously tended to drive centered over the weeds.
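The following is a minimal, hedged sketch of the proposal step described above, written in Python with OpenCV's CSRT tracker (requires opencv-contrib-python). The agreement threshold, the pixels-per-meter scale and the odometry-based box prediction are simplified placeholders introduced here for illustration; the actual tool would use the camera geometry described in Section 3.

```python
import cv2

IOU_AGREEMENT_THRESH = 0.5  # placeholder value, not taken from the paper

def iou_xywh(a, b):
    """IoU of two boxes in (x, y, w, h) format."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def predict_box_from_odometry(box, forward_motion_m, pixels_per_meter=500.0):
    """Crude odometry-based prediction: shift the box by the robot's forward
    motion converted to pixels. pixels_per_meter stands in for the
    viewpoint-dependent ground-to-image scale of the real camera setup."""
    x, y, w, h = box
    return (x, y + forward_motion_m * pixels_per_meter, w, h)

def propose_boxes(prev_frame, next_frame, prev_boxes, forward_motion_m):
    """For each annotated box, run a CSRT tracker and an odometry-based shift;
    only if both agree (IoU above the threshold) is the tracked box proposed
    to the annotator, who may accept, modify or delete it."""
    proposals = []
    for box in prev_boxes:
        tracker = cv2.TrackerCSRT_create()
        tracker.init(prev_frame, tuple(int(v) for v in box))
        ok, tracked_box = tracker.update(next_frame)

        odom_box = predict_box_from_odometry(box, forward_motion_m)
        if ok and iou_xywh(tracked_box, odom_box) >= IOU_AGREEMENT_THRESH:
            proposals.append(tracked_box)
    return proposals
```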

| Pixel-wise annotation
Additionally, we used CVAT (Sekachev et al., 2020) to provide a small number of carefully hand-annotated ground truth masks for a random subset of 20 images per location and day, resulting in 100 images and 340 segmented bounding box crops in total. Figure 5 shows sample images of RumexWeeds with bounding box annotations and the corresponding manually segmented bounding box crops.

| DATA SET STRUCTURE
From the data acquisition process, we get raw rosbag files as stated in Section 4. In a postprocessing step, we synchronize the recorded datapoints and extract the relevant parts into more general data formats. The overall structure of the extracted and cleaned dataset is illustrated in Figure 6. RumexWeeds contains five subfolders, one for each data collection specified in Table 1. Each data collection contains an arbitrary number of sequences, while each sequence consists of a number of consecutive images at 5 FPS. The corresponding datapoints of IMU, GNSS and odometry are saved as dictionaries in the seq<seq_id>/*.json files. We simply save the corresponding ROS message as a dictionary, which is sensor_msgs/NavSatFix.msg for the GNSS, sensor_msgs/Imu.msg for the IMU and nav_msgs/Odometry.msg for the odometry data. The images and the navigational data are linked via the image file name, which serves as key in the json dictionary. In other words, all sensor points with the same key are time-synchronized. The file seq<seq_id>/rgb.json simply contains the timestamp in the standard timestamp format and is therefore redundant with the timestamp information embedded in the image name. The seq<seq_id>/annotations.xml contains all ground truth (i.e., performed by a human) bounding box and segmentation annotations for the corresponding image sequence.
The format of the annotations follows the conventions of the CVAT annotation tool version 1.1 (Sekachev et al., 2020). Finally, the metadata.json includes information concerning all sequences within one data collection <collection_id>. It contains the intrinsic camera coefficients as well as the transforms from base_link to the three sensors (base_link → camera_link, base_link → imu_link, base_link → base_gnss), as well as the transform base_footprint → base_link. The base_footprint frame is positioned on the ground, while the base_link frame lies within the robot, thereby giving insight into the robot's height above ground. The folder dataset_splits contains the dataset splits needed to compare one's results to our baseline results in Sections 7 and 8.

F I G U R E 2 During data acquisition, three different farms (Lundholm, Hegnstrup, and Stengard) in the hinterland of Copenhagen in Denmark were visited to capture a variety of grassland compositions. Images were collected approximately 3-4 weeks after the grass had been cut.
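As an illustration of how the images and the navigational data described above fit together, the following is a minimal Python sketch that loads one sequence and looks up the sensor datapoints sharing each image's file name as key. The JSON file names (gnss.json, imu.json, odom.json), the image subfolder and the file extension are assumptions made for illustration and should be checked against the released dataset.

```python
import json
from pathlib import Path

def load_sequence(collection_dir, seq_id):
    """Load image paths and their time-synchronized sensor datapoints for one sequence.

    Assumes the folder layout described above: each sequence folder contains the
    images plus per-sensor JSON dictionaries keyed by the image file name.
    """
    seq_dir = Path(collection_dir) / f"seq{seq_id}"
    sensors = {}
    for name in ("gnss", "imu", "odom"):   # file names assumed, not confirmed
        path = seq_dir / f"{name}.json"
        if path.exists():
            with open(path) as f:
                sensors[name] = json.load(f)   # dict: image file name -> ROS message as dict
        else:
            sensors[name] = {}

    samples = []
    for img_path in sorted(seq_dir.glob("imgs/*.png")):   # subfolder/extension assumed
        key = img_path.name
        samples.append({
            "image": img_path,
            # All sensor points sharing the image file name as key are time-synchronized.
            "gnss": sensors["gnss"].get(key),
            "imu": sensors["imu"].get(key),
            "odom": sensors["odom"].get(key),
        })
    return samples

# Example usage (hypothetical paths):
# samples = load_sequence("RumexWeeds/20210806_hegnstrup", "42")
# print(len(samples), samples[0]["gnss"])
```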
One might wonder how the image sequences have been extracted.
To increase the proportion of positive objects within the data, we focused on extracting sequences containing positive images with Rumex objects. A sequence starts when a Rumex weed enters the image frame. In Figure 7, we show the number of images as well as the number of objects for each extracted sequence. It can be assumed that sequences with a significantly higher number of objects than images indicate a high weed density.

T A B L E 1 Summary of annotated datapoints from the five different data collection sessions on four different days and three different farms. Note: It includes the number of foreground (FG) images (images including one or more Rumex objects), the number of pure background (BG) images and the number of sequences. Furthermore, the number of objects for both classes (Rumex obtusifolius L., Rumex crispus L.) and for both annotation types (bounding box, pixel-wise) are listed. Note that the bounding box annotations are performed for the whole dataset; therefore, additional characteristics are given, namely the total proportion of positive pixels versus all image pixels (pos. pixel proportion) and the average bounding box size as a percentage of the whole image size (avg. object size).

T A B L E 2 Weed phenotype of Rumex obtusifolius L. versus Rumex crispus L. from Cavers and Harper (1964).

F I G U R E 4 Object distribution within the image plane for all foreground images. The weed object frequency is significantly higher in the image center due to human bias while teleoperating the robot.

| RUMEX DETECTION BASELINE
In this section, we present baseline results on Rumex detection, considering our RumexWeeds dataset. Our targeted application is a Rumex weeding robot, which requires a rough Rumex position to successfully approach and remove the weed. We treat both species R. obtusifolius L. and R. crispus L. as one class, because they are equally undesired and are removed in the same way by the robot.
F I G U R E 7 The number of images as well as the number of objects is shown for each sequence within the five different data collection sessions. Sequences with a significantly higher number of objects than images indicate a high weed density.

Missing detections are not so problematic, as long as the majority of weeds is removed successfully. It can be assumed that a missed weed will be removed in a later session, when the Rumex features have developed further, that is, the plant has grown bigger.
To get more in-depth information about the model's performance characteristics, we additionally analyze the sources of error according to Binch et al. (2018). The following errors are possible for our one-class detection problem (a minimal sketch of this error assignment is given after the list):
• Localization error (Loc): The prediction can be assigned to a ground-truth bounding box, but the IoU falls below IoU_thresh.
• Duplicate detection error (Dupl): The prediction can be assigned to a ground-truth bounding box with an IoU ≥ IoU_thresh, but there is already another higher-scoring prediction that was assigned to the same ground-truth bounding box.
• Background error (Bkg): A prediction that does not overlap with any ground-truth bounding box.
• Missed ground-truth error (Miss): A ground-truth bounding box to which no prediction is assigned.
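As a sketch of how these categories can be assigned, the snippet below classifies each prediction of a one-class detector into TP, Loc, Dupl or Bkg and counts the missed ground truths. The exact matching protocol of Binch et al. (2018) may differ in its details; this is only an illustration under the definitions listed above.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def categorize_predictions(pred_boxes, pred_scores, gt_boxes, iou_thresh=0.5):
    """Assign each prediction to TP, Loc, Dupl or Bkg and count missed ground truths."""
    order = np.argsort(-np.asarray(pred_scores))  # process high-confidence predictions first
    matched_gt = set()
    categories = [None] * len(pred_boxes)

    for idx in order:
        ious = [box_iou(pred_boxes[idx], g) for g in gt_boxes]
        best = int(np.argmax(ious)) if ious else -1
        best_iou = ious[best] if ious else 0.0

        if best_iou >= iou_thresh:
            if best in matched_gt:
                categories[idx] = "Dupl"  # duplicate detection of an already matched object
            else:
                matched_gt.add(best)
                categories[idx] = "TP"    # correct detection
        elif best_iou > 0.0:
            categories[idx] = "Loc"       # overlaps a ground truth, but below the IoU threshold
        else:
            categories[idx] = "Bkg"       # no overlap with any ground truth

    missed = len(gt_boxes) - len(matched_gt)  # Miss: ground truths without any assigned prediction
    return categories, missed
```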

| Impact of detection confidence score
Each object prediction comes with a confidence score, which represents the network's confidence in the correctness of the corresponding bounding box prediction. For real-world applications the confidence threshold is of great importance: only predictions above a certain confidence threshold are considered, in order to provide reliable predictions with a low number of false positive predictions (FP). However, benchmarked detection models typically use rather low, unrealistic confidence thresholds; for example, the default confidence threshold of YOLOX is set to 0.01.
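A minimal sketch of this filtering step is given below, assuming detections that have already passed non-maximum suppression; the threshold value is only a placeholder, and Table 4 shows how the error composition shifts as it is raised.

```python
import numpy as np

def filter_by_confidence(boxes, scores, conf_thresh=0.5):
    """Keep only detections whose confidence score reaches the threshold.

    boxes:  (N, 4) array of [x1, y1, x2, y2] predictions
    scores: (N,) array of confidence scores in [0, 1]
    """
    boxes = np.asarray(boxes)
    scores = np.asarray(scores)
    keep = scores >= conf_thresh
    return boxes[keep], scores[keep]
```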
For our Rumex detection application, it is most important to reduce the number of background predictions in order to minimize the resources spent during the weeding process. In return, some missed or less accurately localized predictions can be accepted. In Table 4, the performance of our YOLOX-tiny model from the previous experiment is listed for different confidence thresholds. We report the overall mAP50:95/mAP50 as well as the different sources of error: localization error (Loc), duplicate detection error (Dupl), background error (Bkg) and missed ground truth error (Miss). The results confirm that the overall mAP decreases with increasing confidence threshold.
On the other hand, predictions become more reliable, because the share of background and localization errors decreases significantly, while the share of missed ground truth errors increases.

| Impact of different image views
As discussed, RumexWeeds includes whole image sequences instead of single images of the targeted object, which is particularly rare for image datasets. The availability of image sequences has two main advantages: • It allows one to take into consideration temporal image data, profiting from historical information. Since Rumex weeds tend to appear in clusters on the field, this is extremely valuable information. The detection of a plant such as Rumex should become more likely with the increasing density of Rumex weeds.
We will not investigate this further within this paper, because it would go beyond its scope.
• Each object will appear several times from different viewpoints throughout the image plane. Since our camera is tilted by 75°, it is expected that plants within the lower half of the image provide better viewpoints with more relevant features. However, it is desirable to detect the weed as soon as possible to ensure forward-looking navigation.
Considering the vertical field coverage of h = 1130 mm, an average robot speed of 0.4 m/s and an image acquisition frame rate of 5 FPS, we can assume that a plant is observed from approximately (1.13 m / 0.4 m/s) × 5 FPS ≈ 14.1 different viewpoints. In Table 5

| Impact of object size
It is generally known that object detection of small objects is less robust, because of the lack of relevant features. Furthermore, for RumexWeeds, the label noise of small Rumex annotations is expected to be higher, because even for a human it is challenging to distinguish small Rumex plants from small negative plants. In Table 6, we evaluate our YOLOX-tiny model for different object sizes. During evaluation, only ground truth annotations as well as predictions that are above the object size threshold are considered. Again, the object size threshold is listed as a percentage of the input image size. The results in Table 6 confirm the expectation that the detection performance increases with increasing object size threshold. However, for larger object size thresholds (>4%), the performance decreases slightly. A possible explanation for this decrease is that the dataset contains relatively few very large objects, with an overall average object size and standard deviation of 2.48 ± 3.37%. Table 7 includes the results for our day- and farm-specific evaluation protocol to show how well the model generalizes over unseen data.
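A minimal sketch of the size-based filtering used in this evaluation is given below; the box format and the default image resolution are assumptions, and the thresholds mirror the percentages reported in Table 6.

```python
def filter_by_relative_size(boxes, image_width=1920, image_height=1200, min_size_pct=1.0):
    """Keep boxes whose area, as a percentage of the image area, reaches the threshold.

    boxes: iterable of [x1, y1, x2, y2] in pixels; applied to both ground truth
    and predictions before computing the metrics.
    """
    image_area = image_width * image_height
    kept = []
    for x1, y1, x2, y2 in boxes:
        area_pct = 100.0 * (x2 - x1) * (y2 - y1) / image_area
        if area_pct >= min_size_pct:
            kept.append([x1, y1, x2, y2])
    return kept

# Example usage: filter_by_relative_size(gt_boxes, min_size_pct=2.0)
```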

| Model performance on unseen days and farms
Again, YOLOX-tiny is trained on the corresponding training set. We take the best model checkpoint according to the validation set to finally assess the model performance on the test set, which includes unseen data from either a different data collection session, that is, a new day, or even from a completely new farm.
In the following, we give a possible explanation for the performance increase/decrease for the different experiments in Table 7, compared to the performance on the random split in experiment 0.
• Experiment 1: Decreased performance, because one specific negative plant looks very similar to Rumex and occurs very frequently in 20210807_lundholm, but not in the other data

T A B L E 4 YOLOX-tiny is evaluated for different confidence thresholds. The overall mAP decreases with increasing confidence threshold, but the remaining predictions become more reliable: we can observe a decrease of localization and background errors at the cost of an increase of missed predictions.

as well as the Lovász loss (Berman & Blaschko, 2017). Furthermore, we apply cosine learning rate scheduling (Loshchilov & Hutter, 2017). Standard image augmentations such as random flipping, random rotation and random scaling are applied.

| Model evaluation
Our pixel-wise manually annotated bounding box crops are split into a training and a test set (proportion 8:2), where the test set is only considered during the assessment of the final model. Since we have limited data, we use k-fold cross-validation (with k = 5) on the training set to select the best model among all candidates. As performance measure, we consider the mean Intersection over Union (mIoU), also called the Jaccard index (Jaccard, 1912). For the background as well as the foreground class, a separate IoU is computed by dividing the number of pixels in the intersection of prediction and ground truth by the number of pixels in their union. Finally, the mIoU is the average over both classes. On the test set, the final model (ID 7) achieves a slightly better mIoU score of 0.776 compared to the training mIoU of 0.770.
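A minimal sketch of this metric for a two-class (background/Rumex) mask is given below, assuming integer-encoded class masks; it is not the exact evaluation code used for the reported numbers.

```python
import numpy as np

def mean_iou(pred_mask, gt_mask, num_classes=2):
    """Compute the mIoU between a predicted and a ground-truth segmentation mask.

    pred_mask, gt_mask: integer arrays of the same shape with class indices
    (here 0 = background/grass, 1 = Rumex). The per-class IoU is the number of
    pixels in the intersection divided by the number of pixels in the union;
    the mIoU averages over the classes.
    """
    ious = []
    for c in range(num_classes):
        pred_c = pred_mask == c
        gt_c = gt_mask == c
        intersection = np.logical_and(pred_c, gt_c).sum()
        union = np.logical_or(pred_c, gt_c).sum()
        if union > 0:
            ious.append(intersection / union)
    return float(np.mean(ious))
```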

| Experimental results
In Figure 8, we show some qualitative results on the test set, with ground truth masks in the top row and model predictions in the bottom row. The majority of segmentation masks look very satisfying and are definitely precise enough to determine an approximate plant size as well as the root center. As soon as the plant leaves become relatively thin and resemble grass structures, the predictions are less reliable, which mainly occurs for the species R. crispus L.
T A B L E 6 YOLOX-tiny performance is evaluated for different object size thresholds as a percentage of the input image. The detection performance increases significantly when discarding smaller objects during evaluation. However, for larger object size thresholds (>4%), the performance starts dropping slightly.

(Redmon et al., 2016) or SSD (Liu et al., 2016) that take individual images as input. Temporal image information as well as the additional navigational data can be considered in the detection model to potentially increase the detection robustness and efficiency.
Making the first step toward actively using RumexWeeds for weed detection, we also present detection as well as segmentation prediction results using state-of-the-art approaches. Our results can not only serve as a baseline for the research community, but also elucidate interesting characteristics of the dataset and the considered weed detection task.
T A B L E 7 YOLOX-tiny is evaluated for unseen data collection sessions (i.e., days) as well as for unseen farms.

DATA AVAILABILITY STATEMENT
The accompanying data is accessible through our project webpage: https://dtu-pas.github.io/RumexWeeds/. The link is also provided in the main document (abstract and introduction).

T A B L E 8 Tuning of the convolutional segmentation network U-Net (Ronneberger et al., 2015). Note: We can observe strong impacts on the model performance from increased input image resolution and network depth. Finally, the model with ID 4 performs best, and we increase the number of training epochs until the performance saturates (IDs 6, 7).
F I G U R E 8 Visual comparison of the ground truth masks (a) to the predictions of model 7 (b) on the test set. The majority of predictions look satisfying. Lower quality predictions can be observed when the plant leaves are very thin and resemble grass structures. Note the color encoding of the two classes: yellow for Rumex obtusifolius L. and orange for Rumex crispus L. (a) Ground truth masks and (b) Rumex crop predictions.