Hyperparameters and tuning strategies for random forest
Funding information: Deutsche Forschungsgemeinschaft, Grant/Award Numbers: BO3139/2‐3, BO3139/6‐1; Bundesministerium für Bildung und Forschung, Grant/Award Number: 01IS18036A
Abstract
The random forest (RF) algorithm has several hyperparameters that have to be set by the user, for example, the number of observations drawn randomly for each tree and whether they are drawn with or without replacement, the number of variables drawn randomly for each split, the splitting rule, the minimum number of samples that a node must contain, and the number of trees. In this paper, we first provide a literature review on the parameters' influence on the prediction performance and on variable importance measures. It is well known that in most cases RF works reasonably well with the default values of the hyperparameters specified in software packages. Nevertheless, tuning the hyperparameters can improve the performance of RF. In the second part of this paper, after presenting a brief overview of tuning strategies, we demonstrate the application of one of the most established tuning strategies, model‐based optimization (MBO). To make it easier to use, we provide the tuneRanger R package that tunes RF with MBO automatically. In a benchmark study on several datasets, we compare the prediction performance and runtime of tuneRanger with those of other tuning implementations in R and of RF with default hyperparameters.
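As a rough illustration of the hyperparameters listed above, the following Python sketch collects them in one configuration object and tunes them with plain random search. This is not the tuneRanger implementation and not MBO; random search is swapped in because it is the simplest tuning strategy to show. The class `RFConfig`, the value ranges, the split-rule names, and the `toy_error` function are all assumptions made for the example; in practice the error would come from an out-of-bag or cross-validated estimate of a fitted forest.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RFConfig:
    """One random forest configuration; field names are illustrative."""
    sample_fraction: float  # share of observations drawn for each tree
    replace: bool           # draw observations with or without replacement
    mtry: int               # number of variables drawn for each split
    split_rule: str         # splitting rule used at each node
    min_node_size: int      # minimum number of samples a node must contain
    num_trees: int          # number of trees in the forest


def sample_config(n_features: int, rng: random.Random) -> RFConfig:
    """Draw one configuration uniformly from an assumed search space."""
    return RFConfig(
        sample_fraction=rng.uniform(0.2, 1.0),
        replace=rng.choice([True, False]),
        mtry=rng.randint(1, n_features),
        split_rule=rng.choice(["gini", "extratrees"]),
        min_node_size=rng.randint(1, 20),
        num_trees=rng.choice([250, 500, 1000]),
    )


def random_search(evaluate, n_features, n_iter=20, seed=0):
    """Return the sampled configuration with the lowest evaluated error."""
    rng = random.Random(seed)
    candidates = [sample_config(n_features, rng) for _ in range(n_iter)]
    return min(candidates, key=evaluate)


def toy_error(cfg: RFConfig) -> float:
    """Hypothetical stand-in for a cross-validated or out-of-bag error."""
    return abs(cfg.mtry - 3) + abs(cfg.min_node_size - 5) / 10


best = random_search(toy_error, n_features=10)
print(best)
```

A model-based strategy such as MBO would replace the independent draws in `random_search` with proposals guided by a surrogate model fitted to the configurations evaluated so far.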
This article is categorized under:
- Algorithmic Development > Biological Data Mining
- Algorithmic Development > Statistics
- Algorithmic Development > Hierarchies and Trees
- Technologies > Machine Learning
Abstract
Random forest has several hyperparameters that have to be set by the user. In this paper, we provide a literature review on the parameters' influence on the prediction performance and on variable importance measures. Moreover, we compare different tuning strategies and algorithms in R. The trees for this picture are taken from the website https://plants.swtexture.com, the water can is from Jstein and the chainsaw from Henry Mühlpfordt, both from the website https://commons.wikimedia.org.





