We present a sample of 2865 emission-line galaxies with strong nebular He II λ4686 emission in the Sloan Digital Sky Survey Data Release 7 and use this sample to investigate the origin of this line in star-forming galaxies. We show that star-forming galaxies and galaxies dominated by an active galactic nucleus form clearly separated branches in the He II λ4686/Hβ versus [N II] λ6584/Hα diagnostic diagram, and we derive an empirical classification scheme that separates the two classes. We also present an analysis of the physical properties of 189 star-forming galaxies with strong He II λ4686 emission. These star-forming galaxies provide constraints on the hard ionizing continuum of massive stars. To make a quantitative comparison with observations, we use photoionization models and examine how different stellar population models affect the predicted He II λ4686 emission. We confirm previous findings that the models can predict He II λ4686 emission only for instantaneous bursts of 20 per cent solar metallicity or higher, and only for ages of ∼4–5 Myr, the period when the extreme-ultraviolet continuum is dominated by emission from Wolf–Rayet stars. We find, however, that 83 of the star-forming galaxies (40 per cent) in our sample do not have Wolf–Rayet features in their spectra despite showing strong nebular He II λ4686 emission. We discuss possible reasons for this and possible mechanisms for the He II λ4686 emission in these galaxies.