Volume 9, Issue 3
ADVANCED REVIEW

Hyperparameters and tuning strategies for random forest

Philipp Probst

Corresponding Author

E-mail address: probst@ibe.med.uni-muenchen.de

Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig‐Maximilians‐Universität München, Munich, Germany

Correspondence

Philipp Probst, Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig‐Maximilians‐Universität München, Marchioninistr. 15, 81377 Munich, Germany.

Email: probst@ibe.med.uni-muenchen.de

Marvin N. Wright

Leibniz Institute for Prevention Research and Epidemiology – BIPS, Bremen, Germany

Anne‐Laure Boulesteix

Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig‐Maximilians‐Universität München, Munich, Germany

First published: 28 January 2019
Citations: 52

Funding information: Deutsche Forschungsgemeinschaft, Grant/Award Numbers: BO3139/2‐3, BO3139/6‐1; Bundesministerium für Bildung und Forschung, Grant/Award Number: 01IS18036A

Abstract

The random forest (RF) algorithm has several hyperparameters that have to be set by the user, for example, the number of observations drawn randomly for each tree and whether they are drawn with or without replacement, the number of variables drawn randomly for each split, the splitting rule, the minimum number of samples that a node must contain, and the number of trees. In this paper, we first provide a literature review on the parameters' influence on the prediction performance and on variable importance measures. It is well known that in most cases RF works reasonably well with the default values of the hyperparameters specified in software packages. Nevertheless, tuning the hyperparameters can improve the performance of RF. In the second part of this paper, after presenting a brief overview of tuning strategies, we demonstrate the application of one of the most established tuning strategies, model‐based optimization (MBO). To make it easier to use, we provide the tuneRanger R package that tunes RF with MBO automatically. In a benchmark study on several datasets, we compare the prediction performance and runtime of tuneRanger with other tuning implementations in R and RF with default hyperparameters.
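
As a rough illustration of the first two hyperparameters listed above, the following Python sketch shows the random draws they control: which observations a tree sees, and which variables are candidates at a split. It is a minimal sketch under stated assumptions, not the paper's or any package's implementation. The function names `draw_tree_sample` and `draw_split_candidates` are hypothetical, and the argument names `sample_fraction`, `replace`, and `mtry` are chosen here only to mirror common RF terminology.

```python
import random


def draw_tree_sample(n_obs, sample_fraction, replace, rng):
    """Return the indices of the observations used to grow one tree.

    Hypothetical helper: sample_fraction sets how many observations are
    drawn, replace sets whether they are drawn with replacement.
    """
    size = max(1, round(sample_fraction * n_obs))
    if replace:
        # Drawing with replacement: an index may appear more than once.
        return [rng.randrange(n_obs) for _ in range(size)]
    # Drawing without replacement: every index appears at most once.
    return rng.sample(range(n_obs), size)


def draw_split_candidates(n_vars, mtry, rng):
    """Return the indices of the mtry variables considered at one split."""
    return rng.sample(range(n_vars), mtry)


rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Full-size draw with replacement versus a smaller draw without replacement.
bootstrap = draw_tree_sample(100, 1.0, True, rng)
subsample = draw_tree_sample(100, 0.632, False, rng)

# Three of ten variables are candidates at this split.
candidates = draw_split_candidates(10, 3, rng)

print(len(bootstrap), len(set(bootstrap)))
print(len(subsample), len(set(subsample)))
print(sorted(candidates))
```

A tuning strategy, in this picture, searches over values such as `sample_fraction`, `replace`, and `mtry` (together with the minimum node size and the splitting rule) and keeps the configuration with the best estimated prediction performance.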

This article is categorized under:

  • Algorithmic Development > Biological Data Mining
  • Algorithmic Development > Statistics
  • Algorithmic Development > Hierarchies and Trees
  • Technologies > Machine Learning

Abstract

Random forest has several hyperparameters that have to be set by the user. In this paper, we provide a literature review on the parameters' influence on the prediction performance and on variable importance measures. Moreover, we compare different tuning strategies and algorithms in R. The trees for this picture are taken from the website https://plants.swtexture.com, the water can is from Jstein and the chainsaw from Henry Mühlpfordt, both from the website https://commons.wikimedia.org.

