Coconut trees detection and segmentation in aerial imagery using mask region-based convolution neural network

Food resources face severe damage in catastrophes such as earthquakes, cyclones, and tsunamis. In such scenarios, speedy assessment of food resources on agricultural land is critical, as it supports aid activity in the disaster-hit areas. In this article, a deep learning approach is presented for the detection and segmentation of coconut trees in aerial imagery provided through the AI competition organized by the World Bank in collaboration with OpenAerialMap and WeRobotics. A Mask Region-based Convolutional Neural Network (Mask R-CNN) approach was used for the identification and segmentation of coconut trees. For the segmentation task, Mask R-CNN models with ResNet50 and ResNet101 based architectures were used. Several experiments with different configuration parameters were performed, and the best configuration for the detection of coconut trees with more than 90% confidence factor is reported. For evaluation, the Microsoft COCO dataset evaluation metric, namely mean average precision (mAP), was used. An overall 91% mean average precision for coconut tree detection was achieved.


I. INTRODUCTION
Natural disasters, such as those affecting the Kingdom of Tonga (South Pacific), are an unfortunate global reality. Their consequences can be damaging for the population of the South Pacific, who depend heavily on local agriculture as a primary food source [1]. As per the 2015 statement of the Secretary-General of the UN on the "Implementation of the International Strategy for Disaster Reduction", losses of approximately 1.5 trillion USD have been incurred as a direct consequence of natural catastrophes around the world¹. The rate of recurrence as well as the severity of these disasters are increasing. Hence, there is a great demand to reinforce food security mechanisms and make appropriate assessments of the damages caused [2].
When cyclones strike, recognizing the area of damage is crucial for effective humanitarian response and securing undamaged food sources like coconut trees. The World Bank seeks qualified teams to develop machine learning based methods to automate the assessment of aerial imagery and to classify and locate the standing trees such as coconut trees within the aerial snapshot [3]. Manual aerial image classification is a resource as well as skill-intensive task and requires a lot of time. More importantly, manual aerial image classification is not typically risk-free in disaster-hit regions.
*Correspondence to: hazratali@cuiatd.edu.pk. This work is published in IET Computer Vision. doi: 10.1049/cvi2.12028
¹ https://www.unisdr.org/
OpenAerialMap, the World Bank, and WeRobotics have collaboratively launched an open machine learning challenge to speed up the classification and analysis of high-resolution aerial imagery before and after humanitarian disasters [4]. The idea is to explore and develop machine learning solutions for the classification of various features of interest in aerial imagery obtained through UAVs. The features thus obtained can then be utilized for object detection and classification to help in the assessment of damages caused. One of the tasks in the challenge is to build a model for coconut tree detection. In this task, we are given a high-spatial-resolution image (about 8 cm/pixel), which covers a 50 km² Area of Interest (AOI) of the Kingdom of Tonga in the South Pacific region. The imagery was taken in October 2017, which is quite recent. Along with the aerial image, we are given shape (.shp) files to recognize the geometric locations and classes of the targets (objects of interest) such as roads and trees. Data for relevant features have been labeled by a group of volunteers from the Humanitarian OpenStreetMap (OSM) community².
Object detection in aerial imagery is an interesting task and has attracted the computer vision and machine learning research community [5], [6], [7]. Typically, these approaches use Convolutional Neural Networks (CNN) for object detection. The prevailing work has mostly been done on road detection and vehicle detection [5], [6], [7]. In this work, we choose to develop a framework to detect and locate coconut trees. More specifically, it addresses the task of coconut tree classification and localization. With the help of experimental results, we demonstrate how Mask R-CNN can be used to detect coconut trees within the images. This is challenging, as some of these images include mislabeled and missing ground truth entries. Besides, different objects that share generic shapes are difficult to differentiate in the aerial image. Finally, there are many small objects in densely concentrated regions of the aerial image.
The contributions of this paper are as follows:
• We present a framework for automatic detection and localization of coconut trees within given aerial imagery. The framework is able to detect each individual coconut tree with a high confidence factor and provide a segmented mask.
• The proposed framework for coconut tree detection provides a baseline approach, which can be easily extended to detect and identify other types of trees.
• Agriculture resource management is a labor-intensive and high-risk job. The proposed approach provides a low-cost solution for agriculture resource management and measurement of the impact of disasters on natural food resources while reducing the risk factors for human operators.
The rest of the paper is organized as follows: In Section II, we provide a brief overview of deep learning techniques for object detection. In Section III, we provide a detailed description of the dataset, methods, and training mechanism used in this work. We present the results on coconut tree prediction with the help of figures in Section IV, and also provide a discussion on the model and results obtained. Finally, we conclude our work in Section V.

II. RELATED WORK
Over the past decades, satellite imagery has often been used in a diverse range of applications, from forestry [8] to agriculture [9], [10], target detection [11], and regional planning to warfare [12]. Satellite imagery has also been broadly employed to monitor natural disasters and various other adverse incidents to investigate their impact on the environment.
Deep learning extends traditional machine learning by adding more "depth" to the model and transforming the information through several layers and non-linear functions. This provides hierarchical data representation through many levels of abstraction [13]. Deep learning extracts useful features from raw data, with features from high levels of the hierarchy shaped through a combination of low-level features [14]. The massive parallelization possible in deep learning models enables us to develop highly complex models that learn complex features and perform extremely well on many AI tasks [15]. So, deep learning models can enhance classification efficiency or minimize error in regression problems, provided that adequately large data is available for a specific domain task.
The large capacity of the models and their highly hierarchical structure perform very well, particularly on prediction and classification tasks, being adaptable and flexible for a broad range of highly complex problems [15]. Although deep learning has gained fame in various applications dealing with raster-based information (such as pictures and video), it can be applied to an array of different types of information, e.g. speech, audio, and natural language, and other data types like population information [16], continuous data such as weather data [17], and soil chemistry [17]. A key benefit of deep learning in the processing of images is the reduced need for feature engineering. In the past, conventional methods for image classification tasks were typically based on hand-engineered features. However, feature engineering is a time-consuming, costly, and complex process that needs to be repeated whenever the dataset or the problem changes. Thus, feature engineering involves a costly effort, which depends on the expert's ability and may not generalize well [18]. In contrast, a deep learning model does not rely on feature engineering and rather learns features through representation learning.
The region-based convolutional neural network (R-CNN) has proved to be very successful for segmentation tasks [19], [20]. In R-CNN, a selective search technique is applied to detect region proposals within the input image. The features extracted from each region proposal form a feature vector that is given to multiple classifiers to produce a distribution over class variables, and also to a regression model to refine the bounding boxes of the proposals. Fast R-CNN and Faster R-CNN, proposed in [21] and [22] respectively, accelerate the detection procedure by first applying a deep CNN to the input image and extracting a feature map, and by replacing selective search with a Region Proposal Network (RPN) to create region proposals, predicting bounding boxes and classes of the objects. The extension of Faster R-CNN to Mask R-CNN [23] adds a branch, parallel to object detection, to predict object masks with very small overhead. Mask R-CNN outperformed top models in the 2017 COCO competition in segmentation, object detection, and bounding-box detection.
Hence, we prefer Mask R-CNN as our model. Recently reported attempts on object detection and image detection include [24] and [25]. The work in [24] uses the U-Net architecture with a ResNet decoder, and the work in [25] produces geometry-preserving masks for a better fit on the object's boundaries. The selection of Mask R-CNN in our work was based upon two considerations: 1) the model outperformed top models on the most recent COCO competition for object detection, and 2) the underlying software tools and TensorFlow libraries make the import of the pre-trained weights relatively easy, making the implementation more convenient.

III. PROPOSED METHODOLOGY
The overall pipeline is summarized in Figure 1. Briefly, we pre-process the data and divide it into training and test sets. Since we utilize pre-trained weights, it is not necessary to train the whole neural network. We train the final layers (bounding box heads/classification) and select the configuration settings with the minimum validation error. Finally, we evaluate the overall performance using the previously unseen test data. In the following sub-sections, we discuss these phases in more detail. To generalize our model, we have taken the following steps:
• We use an adaptive optimizer named SWATS, an approach that switches the optimization from Adam to stochastic gradient descent and thus achieves better generalization [26].
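The idea of handing optimization over from Adam to stochastic gradient descent can be illustrated with a minimal sketch. It is a deliberate simplification and not the SWATS algorithm of [26]: the switch point here is a fixed step chosen by hand (an assumption), whereas SWATS determines the switch point and the SGD learning rate automatically during training. The function name, the one-dimensional objective, and all hyperparameter values are hypothetical.

```python
import math


def minimize_with_switch(grad, x0, steps=200, switch_step=100,
                         adam_lr=0.1, sgd_lr=0.1,
                         beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a 1-D function: Adam updates first, plain gradient descent after.

    The fixed `switch_step` is an illustrative assumption; it is not the
    switching criterion used by SWATS.
    """
    x = x0
    m = 0.0  # first-moment (mean) estimate used by Adam
    v = 0.0  # second-moment (uncentered variance) estimate used by Adam
    for t in range(1, steps + 1):
        g = grad(x)
        if t <= switch_step:
            # Adam phase: bias-corrected moment estimates scale the step.
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g * g
            m_hat = m / (1 - beta1 ** t)
            v_hat = v / (1 - beta2 ** t)
            x -= adam_lr * m_hat / (math.sqrt(v_hat) + eps)
        else:
            # Gradient descent phase: step proportional to the raw gradient.
            x -= sgd_lr * g
    return x


# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3).
x_final = minimize_with_switch(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

In this toy setting the adaptive phase brings the iterate close to the minimum, and the plain gradient steps then contract the remaining error geometrically.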

A. Data Processing
As discussed in the introduction, we are provided with one single high-resolution aerial snapshot and a shape file (a GIS file that stores geometric locations as well as attributes of geometric features, such as points, lines, and polygons). Our objective is to combine and convert both data sources into training, validation, and test sets appropriate for our object detection model. GDAL is an open-source library for raster and vector geospatial data types³. Fiona is a very popular Python library for writing and reading geospatial records⁴. We use Fiona to read the shape file into a JSON file format, in which the geographical data is readily available for data processing. After that, we extract the positions of the items of interest, such as coconut trees, by looking at the tags provided for every GeoJSON object in the shape file. The positions are given in the latitude-longitude coordinate system. To be able to map these locations onto the very high-resolution aerial image, we convert the actual image into the latitude-longitude coordinate system by using GDAL tools. We then map the latitude and longitude coordinates extracted from the shape file to image pixels by using the geographical metadata in the high-resolution aerial image. At the end of the procedure, we have an image combined with the pixels where the objects of interest (coconut trees) are located.
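The mapping from latitude-longitude coordinates to image pixels can be sketched as follows. This is a minimal illustration, assuming that the geographical metadata of the image is available as a six-value affine geotransform in the order used by GDAL; the function name and the numeric values in the usage example are hypothetical and are not taken from the dataset.

```python
import math


def geo_to_pixel(lon, lat, geotransform):
    """Map a (lon, lat) coordinate to integer (column, row) pixel indices.

    `geotransform` follows the GDAL convention:
        x_geo = gt[0] + column * gt[1] + row * gt[2]
        y_geo = gt[3] + column * gt[4] + row * gt[5]
    """
    x_origin, a, b, y_origin, d, e = geotransform
    det = a * e - b * d
    if det == 0:
        raise ValueError("geotransform is not invertible")
    dx = lon - x_origin
    dy = lat - y_origin
    # Invert the 2 x 2 affine part to recover fractional pixel coordinates.
    column = (e * dx - b * dy) / det
    row = (a * dy - d * dx) / det
    return math.floor(column), math.floor(row)


# Hypothetical north-up image: 0.5 units per pixel, top-left corner at (100, 200).
example_transform = (100.0, 0.5, 0.0, 200.0, 0.0, -0.5)
column, row = geo_to_pixel(102.6, 198.9, example_transform)
```

For a north-up image the two rotation terms are zero and the inversion reduces to a subtraction and a division per axis; the general form is kept so that rotated rasters are also handled.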
The input to the Mask R-CNN framework is the set of annotated training image tiles of size 1000 × 1000 pixels. We subdivide the actual image into patches of dimensions 1000 × 1000. We take 70 such tiles and manually annotate every single tile by locating the coconut trees and then drawing polygons around them. This is a time-consuming procedure, but we believe a correctly annotated dataset is vital for training a model with high prediction accuracy. We use the VGG Image Annotator⁵ to annotate the 70 image tiles. Every annotation is stored in JSON file format and keeps the positions of all polygons along with their tags. There are approximately 40 to 60 objects in each tile; therefore, we conclude that the training dataset is substantial. An example of an image tile annotated with the VGG Image Annotator is displayed in Figure 2.
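The subdivision of the large image into 1000 × 1000 patches can be sketched as below. The sketch only computes the pixel windows of the tiles; the function name is hypothetical, and discarding incomplete tiles at the right and bottom borders is an assumption, since the handling of border regions is not described here.

```python
def tile_windows(width, height, tile_size=1000):
    """Return (left, top, right, bottom) windows of full, non-overlapping tiles.

    Partial tiles at the right and bottom edges are dropped (an assumption).
    """
    windows = []
    for top in range(0, height - tile_size + 1, tile_size):
        for left in range(0, width - tile_size + 1, tile_size):
            windows.append((left, top, left + tile_size, top + tile_size))
    return windows


# Hypothetical image of 2500 x 3000 pixels: two tile columns and three tile rows.
windows = tile_windows(2500, 3000)
```

Each window can then be used to crop the raster and to shift the pixel coordinates of the trees into the local coordinate frame of the tile.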
It is important to note that we found many discrepancies during the annotation. For example, quite a few coconut trees were mislabeled (such as coconut trees labeled as bananas, and dark areas and shadows labeled as trees), and several trees were not marked at all.

B. Deep Learning Architecture and Training
As discussed in the model selection phase, our approach is based on the Mask R-CNN implementation. ResNet101, a deep residual neural network with 101 layers, is the backbone architecture that extracts feature maps from the input image [27]. Residual networks enable us to efficiently train deep neural networks by introducing skip connections, in which the output of an earlier layer is added to the output of a deeper layer. The backbone takes an image of 1000 × 1000 × 3 and outputs a feature map of dimension 32 × 32 × 2048. These features are passed to an RPN for training regression/classification of object classes and generation of bounding boxes. We initiated the training procedure by downloading a model pre-trained on the Microsoft COCO dataset, one of the most widely used datasets for object detection and segmentation. In principle, we do not change the settings of the earlier layers but modify a few RPN parameters, as we aim to train the final layers of the model (referred to as the regression and classification heads). To accelerate the training process while still aiming for high accuracy, we set the minimum region proposal confidence to 0.9, which means only regions with more than 90% confidence of potentially containing trees are considered. The confidence score of 0.9 was selected after an empirical evaluation of different possible values. Selecting a value less than 0.9 causes the classifier to incorrectly detect shadows and other tree-like objects in the image as coconut trees (more false positives), as shown in Figure 6. On the other hand, a confidence value greater than 0.9 results in missing some of the coconut trees, such as trees behind light clouds (more false negatives), eventually resulting in a lower detection accuracy, as shown in Figure 7.
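The effect of the confidence threshold can be illustrated with a short sketch that filters a list of candidate detections by score. The dictionary layout, the function name, and the scores in the usage example are hypothetical, and keeping detections whose score is equal to the threshold is an assumption.

```python
def filter_detections(detections, min_confidence=0.9):
    """Keep only detections whose score is at or above `min_confidence`."""
    return [d for d in detections if d["score"] >= min_confidence]


# Hypothetical candidates: a clear tree, a shadow, and a partly occluded tree.
candidates = [
    {"label": "coconut_tree", "score": 0.97},
    {"label": "coconut_tree", "score": 0.55},
    {"label": "coconut_tree", "score": 0.92},
]

kept_default = filter_detections(candidates)                       # threshold 0.9
kept_strict = filter_detections(candidates, min_confidence=0.95)   # stricter
kept_loose = filter_detections(candidates, min_confidence=0.5)     # more permissive
```

Lowering the threshold admits the low-scoring candidate (a potential false positive), while raising it discards the candidate scored 0.92 (a potential false negative), which mirrors the trade-off described above.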
Furthermore, since coconut trees in RGB aerial images are expected to have approximately similar aspect ratios and sizes, the anchor scales are set between 10 and 130. The learning rate is set to 0.001 with a weight decay of 0.0001. We divided the available data (70 tiles) into training/validation/test sets of 50/10/10 image tiles and performed several experiments by changing the number of steps, the number of epochs, and the maximum number of Regions of Interest (ROIs).
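The settings stated above and the 50/10/10 split can be collected in a small sketch. The class and field names are hypothetical and do not correspond to the configuration names of any particular Mask R-CNN implementation; the individual anchor scales are not listed in the text, so only their stated range is stored. Shuffling the tiles at random with a fixed seed is an assumption about how the split was made.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Values stated in the text; field names are illustrative only."""
    learning_rate: float = 0.001
    weight_decay: float = 0.0001
    min_confidence: float = 0.9
    anchor_scale_range: tuple = (10, 130)  # range only; exact scales not given


def split_tiles(tile_ids, n_train=50, n_val=10, n_test=10, seed=0):
    """Shuffle tile identifiers and split them into train/validation/test lists."""
    tile_ids = list(tile_ids)
    if len(tile_ids) != n_train + n_val + n_test:
        raise ValueError("number of tiles does not match the requested split")
    random.Random(seed).shuffle(tile_ids)  # random split is an assumption
    train = tile_ids[:n_train]
    val = tile_ids[n_train:n_train + n_val]
    test = tile_ids[n_train + n_val:]
    return train, val, test


config = TrainingConfig()
train_ids, val_ids, test_ids = split_tiles(range(70))
```

Fixing the seed keeps the test tiles identical across experiments, so that configurations are compared on the same unseen data.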
According to the validation scores of multiple experiments that we executed, we select the
[Figure: Bounding box predictions using ResNet101. All coconut trees prediction confidences are above 95%. The algorithm has missed out some coconut trees as threshold confidence is set high.]

IV. RESULTS AND DISCUSSIONS
We apply the weights of our model trained for 21 epochs to detect coconut trees on the test set, which consists of 10 images⁶. After performing several experiments, the best configuration settings are shown in Table I. Our prediction results on some of the images are shown in Fig. 5 and Fig. 8. The segmentation results are shown in Fig. 9. The overall training time for batch sizes of 5 and 10 was less than for a batch size of 1, but the training and validation losses for batch sizes of 5 and 10 were greater than for a batch size of 1. So, we select batch size = 1 for this experiment, as shown in Fig. 10 and Fig. 11. A comparison of training and validation loss for the different choices of batch size is shown in Table III. An optimal learning rate of 0.001 was selected (to avoid converging to local minima) after trying out different values among [0.0001, 0.0003, 0.001, 0.003, 0.01].

A. Configuration Settings
After performing several experiments, the best configuration settings which we set for our project are shown in Table I.

B. Discussion
We find that all coconut trees are detected with a considerably high confidence factor (> 90%). We achieved 96% classification accuracy (CA) with ResNet50 and 98% CA with ResNet101. For a more formal evaluation, we have selected the mAP metric, a commonly used metric for performance evaluation of object detection. The mAP is the mean of the average precision, in which not only the number of correct predictions but also their order in the confidence ranking is evaluated. The highest mAP value achieved was 0.88 for ResNet50 and 0.91 for ResNet101. Detection results are visualized in Figure 5 and Figure 8. The corresponding mAP curves are shown in Figure 12 and Figure 13. The F1 score is 0.89 for ResNet50 and 0.92 for ResNet101. The evaluation metrics are shown in Table II. The processing time of our approach is one minute for
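How the order of the predictions enters the average precision can be shown with a small sketch. It computes an all-point interpolated average precision for a single class at a single matching threshold, together with the F1 score. This is an illustration only: it is not the exact COCO evaluation protocol, and the function names and the toy inputs are hypothetical. The mAP is then the mean of such average precision values over classes and, in the COCO protocol, over several overlap thresholds.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """All-point interpolated AP for one class at one matching threshold."""
    if num_ground_truth == 0 or not scores:
        return 0.0
    # Rank the predictions from the highest to the lowest confidence.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # Precision envelope: each value becomes the maximum precision to its right.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Area under the precision-recall curve.
    ap, previous_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Same three predictions (two correct, one wrong), ranked in two different ways.
ap_wrong_ranked_second = average_precision([0.9, 0.8, 0.7], [True, False, True], 2)
ap_wrong_ranked_first = average_precision([0.7, 0.9, 0.8], [True, False, True], 2)
```

Both rankings contain the same number of correct predictions, but the ranking that places the wrong prediction first yields a lower average precision, which is the sense in which the order of the predictions is evaluated.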

C. Comparison with other techniques
Some excellent works reported on tree detection from high-resolution aerial imagery involve different datasets, preprocessing methods, models, parameters, and metrics. We do not make a direct comparison, as the datasets used or the tasks performed in these approaches are different. However, it is still useful to provide a summary of the results of these approaches [35]. A summary of the data, model, score, and performance is reported in Table IV. As discussed earlier, the performance score of an algorithm differs depending upon the task. So, we have compared our results with those that have used a similar performance metric. Milioto et al. [32] have reported an accuracy of 84.62% on a classification task while using a hybrid of PCA, logistic regression, and autoencoder. Luus et al. [29] have reported an accuracy of 93.48% on a classification task while using CNNs. Sorensen et al. [33] have reported an accuracy of 97% on a classification task while using DenseNet. Saldana et al. [34] have reported an 80% localization accuracy, 97.5% classification accuracy, and an F1 score of 0.89 on a localization and segmentation task while using an adapted version of YOLO and SegNet. In our work, we achieved 96% classification accuracy with ResNet50 and 98% classification accuracy with ResNet101. The F1 score is 89% with ResNet50 and 92% with ResNet101. It is worth mentioning that all of the aforementioned experiments except Saldana et al. [34] dealt only with the classification task. We propose an approach that not only performs classification but also locates and segments the coconut trees. We evaluate an additional performance metric, the mean average precision (mAP). We achieved 88% mAP using ResNet50 and 91% mAP with the ResNet101 backbone architecture. This metric shows how accurately the model locates and classifies the coconut trees.

V. CONCLUSION AND FUTURE DIRECTION
In this paper, we have presented an approach for coconut tree detection and segmentation in aerial imagery of the Kingdom of Tonga (South Pacific Islands). We have reported a Mask R-CNN based model using ResNet50 and ResNet101 backbone architectures. The model is trained on data that we processed and prepared from a single high-resolution aerial image along with the shape file. Experimental results have shown that our model is able to predict coconut trees with high accuracy (91% mean average precision). Our model can be readily extended to classify and locate other kinds of food trees as well. A comparative setup showed that we get better accuracy with the ResNet101 architecture than with a ResNet50 based model. Moreover, the model carries the benefits of Faster R-CNN, which is faster than the conventional R-CNN and more accurate than a plain CNN. The work carries significance in food resource assessment, humanitarian aid services, and damage analysis in disaster-hit areas, using high-resolution aerial imagery.
This research work is one of the attempts to classify and locate coconut trees based on a remotely sensed aerial imagery dataset. There is much more potential for future studies in this area. One task of particular significance is to obtain a cleaner dataset and develop methods for better annotations, as these will improve the model training. Our future work includes model development to detect other types of food trees (mango, banana, papaya), as well as road conditions and their types.