Breast Cancer Resistance Protein (BCRP, gene ABCG2) is an efflux transporter of the ATP-binding cassette (ABC) transporter family. It is known to be responsible for the multidrug resistance observed in some cancers and is also involved in drug-drug interactions in the liver. Predicting and assessing BCRP inhibition is therefore of great interest in the drug development process. This paper presents the largest open dataset currently available for BCRP inhibition, along with the methodology used to compile it. The dataset contains 978 unique compounds with corresponding bioactivities, extracted from 47 studies. The presence of duplicates allowed us to set thresholds on reported activities and obtain a labelled dataset suitable for training classification models. Exploratory data analysis and predictive modelling led to the identification of substructures important for inhibition. We find that the substructures that characterize inhibitors are in line with known structure-activity relationships (SAR) of BCRP inhibitors, while the substructures that characterize non-inhibitors are novel.
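
The labelling step mentioned above, turning reported activities into classes by thresholding, can be illustrated with a minimal sketch. The abstract does not state the actual thresholds or how duplicates were aggregated, so everything specific here is an assumption made for illustration: the use of IC50 values in µM, the median as the aggregate over duplicates, the two cut-off values, and all names (`label_compounds`, `ACTIVE_MAX_UM`, `INACTIVE_MIN_UM`, the compound identifiers).

```python
from collections import defaultdict
from statistics import median

# Hypothetical cut-offs in micromolar. These are placeholders for
# illustration only, not the thresholds used in the paper.
ACTIVE_MAX_UM = 10.0    # at or below this: labelled "inhibitor"
INACTIVE_MIN_UM = 50.0  # at or above this: labelled "non-inhibitor"


def label_compounds(records, active_max=ACTIVE_MAX_UM, inactive_min=INACTIVE_MIN_UM):
    """Assign a class label to each compound from its reported activities.

    records: iterable of (compound_id, ic50_um) pairs; a compound may
    appear several times if it was measured in several studies.
    Returns a dict mapping compound_id to a label. Compounds whose
    aggregated activity falls between the two cut-offs are left out.
    """
    # Group all reported values by compound.
    by_compound = defaultdict(list)
    for compound_id, ic50_um in records:
        by_compound[compound_id].append(ic50_um)

    labels = {}
    for compound_id, values in by_compound.items():
        # Assumption: duplicates are summarised by their median.
        aggregated = median(values)
        if aggregated <= active_max:
            labels[compound_id] = "inhibitor"
        elif aggregated >= inactive_min:
            labels[compound_id] = "non-inhibitor"
        # Otherwise the compound is ambiguous and stays unlabelled.
    return labels


# Made-up example data, not taken from the dataset.
example = [
    ("cpd_A", 0.5), ("cpd_A", 1.5),    # duplicate measurements
    ("cpd_B", 80.0),
    ("cpd_C", 20.0), ("cpd_C", 30.0),  # falls between the cut-offs
]
labels = label_compounds(example)
```

Leaving a gap between the two cut-offs is one possible design choice: it keeps borderline compounds out of both classes, which may reduce label noise at the cost of a smaller training set.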