The evolution of the polycystic ovary syndrome (PCOS) has followed a typical career in medical folklore. Initially there were the case reports, then a brief case series and after an appropriate gestation came the recommendations, guidelines and consensus statements. However, as the spectrum and the implications of the syndrome have become better understood, so the poorly defined margins of the syndrome have led to a degree of unravelling of the consensus, not to mention confusion amongst gynaecologists and endocrinologists.1
The original case series of Stein and Leventhal described seven women with two apparent phenotypes, all of whom had enlarged polycystic ovaries and amenorrhoea.2 Over the years, this has been distilled down to a definition that encompasses chronic anovulation and hyperandrogenism after all other causes of irregular menses have been excluded. This definition reached a consensus supported by the National Institutes of Health (NIH) in 1990, even though in Europe and Australasia ovarian imaging had become an important component of the diagnosis, largely because ovarian morphology was part of the original disease description. Despite a high degree of concordance between the NIH consensus and the European addition of ovarian imaging, it became necessary to seek transatlantic harmony through a new consensus meeting held under the auspices of the European Society of Human Reproduction and Embryology and the American Society for Reproductive Medicine (ESHRE/ASRM), which resulted in what are known as the Rotterdam criteria.3
The proposal agreed in Rotterdam was that PCOS should be diagnosed on the basis of the presence of two out of the following three features: oligo- or anovulation, clinical and/or biochemical hyperandrogenism and polycystic ovaries. This created the possibility of several phenotypes and the recognition that PCOS is a syndrome with heterogeneity in its constellation of features. It has led to some disagreement4,5 and to a series of studies of the phenotypes that have been added to the syndrome. These additions have increased the patient base by approximately 20%6 compared with the NIH consensus, which is more in line with European prevalence data.7 Indeed, there has been a subsequent position statement from the Androgen Excess Society trying to bridge the NIH and Rotterdam positions.8 It is generally recognized that the Rotterdam consensus criteria now include women with PCOS who have milder disease and are also less likely to be overweight6 (see Table 1).
|NIH (1990)||To include all of the following:|
|1: Hyperandrogenism and/or hyperandrogenaemia|
|2: Chronic anovulation|
|3: Exclusion of related disorders|
|ESHRE/ASRM (Rotterdam 2003)||To include two of the following, including the exclusion of related disorders:|
|1: Oligo- or anovulation|
|2: Clinical and/or biochemical signs of hyperandrogenism|
|3: Polycystic ovaries|
|Androgen Excess Society (2006)||To include all of the following:|
|1: Hirsutism and/or hyperandrogenaemia|
|2: Oligo-anovulation and/or polycystic ovaries|
|3: Exclusion of androgen excess or related disorders|
We feel that a major aspect of PCOS that has not been given adequate emphasis is in the metrics of the diagnostic characteristics. Although these are well known, are they sufficiently defined so as to be the anchors for diagnosis? Are they repeatable, robust and sufficiently reliable to allow data from one study to be compared with another? We believe they are not.
Acne, hirsutism and androgenic alopecia are due to androgenic stimulation of the pilosebaceous unit and all occur in women with PCOS. The exact prevalence of these conditions in the wider population is not precisely known. Acne occurs in almost all teenagers and a degree of physiological acne occurs in 54% of women over 25 years of age with 3% showing clinical acne.9 A study of pregnant women reported that 26% complained of acne prior to pregnancy10 and a study of PCOS reported that 58% of their control women had acne.11 These proportions of unselected populations with acne make the link with PCOS unconvincing until sufficiently powered studies or studies including appropriate acne grading are performed.
The position for hirsutism is complicated by the fact that the standard scoring system designed by Ferriman and Gallwey in London defined abnormal hair growth in mathematical terms, that is, more than two standard deviations above the mean, or approximately 1·2% of the population.12 DeUgarte et al. confirmed this finding but clarified that many women with lesser hair scores were using hair removal therapies, indicating a mismatch between patient and physician scores.13 The UK Institute of Electrology reports that 80% of women have some sort of unwanted hair.14
The diagnosis of hirsutism can be very subjective: should it be defined by the patient or by her physician?15 Is a woman with facial hair but no body and limb hair more or less hirsute than a woman with a hairless face but hairy trunk and limbs? A systematic study of hair growth in Scandinavian women showed that there was an 18% overlap in hair growth scores between women complaining of hirsutism and a similar sized control group.16 The same situation occurs with acne, and women who are obese are more likely to find their hair and acne troublesome.13,17
There are several scoring systems for hirsutism but the vast majority of investigators now use the system devised by Ferriman and Gallwey in which the body is divided into 11 zones, each of which is scored 0–4 on a subjective basis. Nine of the zones are used to compute a hirsutism score. There is no flexibility within this system to allow for nonstandard patterns of hair growth. This is a serious deficiency as women exhibit several different patterns of facial and body hair growth and are more likely to present with hirsutism if they have hair growth on the face, chest and upper back than on other sites.18 The score is likely to be inflated by hairy thighs which occur in 45% of premenopausal Scandinavian women16 and hair on this site does not appear to respond to antiandrogen therapy.19
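The arithmetic of the score can be sketched as follows. This is a minimal illustration and not a validated clinical tool. The zone names are our assumption of the conventional nine scored and two unscored sites (the text above gives only the counts), and the example grades are invented.

```python
# Sketch of the Ferriman and Gallwey arithmetic: 11 zones graded 0-4,
# of which nine contribute to the hirsutism score.
# ASSUMPTION: zone names below are illustrative labels, not taken from the text.
SCORED_ZONES = (
    "upper_lip", "chin", "chest", "upper_back", "lower_back",
    "upper_abdomen", "lower_abdomen", "upper_arm", "thigh",
)
UNSCORED_ZONES = ("forearm", "lower_leg")
MAX_GRADE = 4


def hirsutism_score(grades):
    """Sum the grades of the nine scored zones after validating all 11 zones."""
    expected = set(SCORED_ZONES) | set(UNSCORED_ZONES)
    if set(grades) != expected:
        raise ValueError("grades must cover exactly the 11 zones")
    for zone, grade in grades.items():
        if not isinstance(grade, int) or not 0 <= grade <= MAX_GRADE:
            raise ValueError(f"grade for {zone} must be an integer from 0 to 4")
    return sum(grades[zone] for zone in SCORED_ZONES)


# Invented example: hair on the lip, chin, thigh and forearm only.
example = {zone: 0 for zone in SCORED_ZONES + UNSCORED_ZONES}
example.update({"upper_lip": 2, "chin": 3, "thigh": 4, "forearm": 3})

example_score = hirsutism_score(example)    # forearm grade is ignored
MAX_SCORE = MAX_GRADE * len(SCORED_ZONES)   # upper bound of the score
print(example_score, MAX_SCORE)
```

In this invented case the thigh alone contributes almost half of the total, and the forearm grade does not count at all, which illustrates how little the sum says about the pattern of hair growth.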
A greater problem lies in the standardization of hair growth scores. We have previously described the difference in mean (and variance) of hair growth scores between studies.20 The severity of hirsutism should be fairly similar between studies, as the subjects recruited need to be sufficiently hirsute for adequate hair scores to be made before and after treatment. Yet the difference is great, and our prejudice is that it is largely due to variation in observer scores. This has been confirmed by Wild et al., who reported a within-patient difference in Ferriman and Gallwey hair scores of 12 points between different members of their own research group, on a scale for which the maximum score is only 36.21 Furthermore, only a tiny minority of studies report the investigator variability in this key clinical measurement, in contrast to the extensive quality data reported for laboratory measurements.
As acne occurs in a significant number of women after the age of 25 years, how can it be considered a feature of PCOS unless it is defined as abnormal? Although the dermatological literature is replete with scoring systems which have been shown to be reproducible from photographic images,22 the PCOS literature merely indicates the presence or absence of acne.
The position for androgenetic alopecia is less clear. Whilst it is undoubtedly an androgen-mediated process, there has been difficulty relating this type of hair loss to increased circulating androgens23,24 and a link to polycystic ovaries has been proposed.25 Indeed, it may be related more closely to iron deficiency than androgens.24 Despite this there are two well-recognized patterns of androgen-mediated hair loss26,27 and the incidence of balding in premenopausal women in the UK is 13%.28
None of the myriad PCOS guidelines gives any real indication of the definition of the term biochemical hyperandrogenaemia, although some of the issues were raised in the Rotterdam consensus paper.3 There is a lack of clarity on which androgen or androgens should be measured, how often they should be measured to exclude normal values, what constitutes a normal androgen concentration and which analytical technique should be employed. The latter aspect has been addressed by the Endocrine Society in a recent position paper, but its strength is in highlighting the problems rather than the solutions.29 However, even this paper fails to mention that one of the better performing direct testosterone assays is beset by a significant cross reactivity with dehydroepiandrosterone sulphate (DHEAS).30
First, which androgen should be measured? There is an implication that testosterone, both total and free, DHEAS and possibly androstenedione should be measured. There are significant issues raised over the analytical performance of free testosterone assays and therefore an alternative is the free androgen index [(total testosterone / sex hormone binding globulin (SHBG)) × 100], but this will be dependent on another measurement, SHBG, which is itself influenced by the degree of hyperinsulinaemia. There appears to be an underlying philosophy in the guidelines that the aim should be to identify a state of biochemical hyperandrogenism, and this is the reason for including free testosterone, because it is more frequently elevated. However, this is a circular argument as an elevated androgen is a diagnostic criterion. Nevertheless, it should be noted that most laboratories define their reference limits as 95% confidence intervals and therefore 2·5% of the normal population will be above these limits. If the four androgens mentioned above are measured there will be an approximately 10% chance that at least one will be abnormal, and this will be increased if calculated variables are included. It should be obvious that if subjects are investigated on multiple occasions, then the chance of at least one abnormal result will rise further with each occasion.
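The size of this effect can be sketched with a short calculation. It assumes, for simplicity, that each measurement is an independent test with a 2·5% chance of exceeding the upper reference limit in a normal subject; in practice androgen concentrations are correlated, so the figures are indicative only. The function name is ours.

```python
def chance_of_any_abnormal(n_tests, upper_tail=0.025):
    """Probability that at least one of n_tests results exceeds the upper
    reference limit in a normal subject.
    ASSUMPTION: tests are independent, each with the same upper-tail fraction."""
    return 1.0 - (1.0 - upper_tail) ** n_tests


# Four androgens measured once.
one_occasion = chance_of_any_abnormal(4)

# The same four androgens measured on three separate occasions.
three_occasions = chance_of_any_abnormal(4 * 3)

print(f"one occasion:    {one_occasion:.1%}")
print(f"three occasions: {three_occasions:.1%}")
```

Under these assumptions the chance of at least one spuriously raised result is a little under 10% for a single panel of four androgens and rises to roughly one in four when the panel is repeated three times.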
Second, what is a normal value? The Endocrine Society position statement suggests that each laboratory should collaborate with its endocrinologists to determine its own reference limits for testosterone. This is good practice, but reference limits require a population of 120 subjects and, when one considers the range of preanalytical factors that influence serum testosterone (see Table 2), it is clear that this is not practicable for any other than a research unit. The sum of the factors relating to day-to-day variation of total testosterone measurement has been considered by the assessment of its biological variation in women with and without PCOS. Samples taken in the morning throughout the cycle showed an individual index of variation of 0·43% and 0·69%, respectively.31
|Pulsatile release during the day44|
|Diurnal rhythm: am > pm42,45,46|
|Menstrual cycle: luteal > follicular31,47|
|Season (no variation in total testosterone; free testosterone shows 30% difference): summer > winter43,48|
|Age (years) in women with and without polycystic ovary syndrome (PCOS): 20s > 40s49,50|
|Cross reactivity with other endogenous steroids30|
|Interference by endogenous antibodies51|
|Poor performance in the female range: < 8 nmol/l1,29|
Moreover, the data in Table 2 have been based on immunoassays, some of which will have been direct, and will potentially need to be repeated with mass spectrometry techniques as these become routinely available.32 Finally, it has been proposed that even subjects with normal circulating androgens on repeated testing can exhibit an occult form of hyperandrogenaemia which is uncovered by hCG stimulation testing.33
The Rotterdam consensus has discarded the use of LH and FSH ratios as diagnostic criteria, which is a good decision. Fauser et al.34 showed that only 50% of patients with polycystic ovaries (PCO) had an elevated LH and 43% of patients with high LH had PCO. Given the population of 99 subjects of whom 35 had ultrasound proven PCO, a single LH measurement has a positive predictive value (PPV) of only 18%; if we reduce the prevalence to 7% as in the populations described by Escobar-Morreale et al.35 and Knochenhauer et al.,36 the PPV falls further to 3·6%.
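The dependence of PPV on prevalence follows from Bayes' theorem and can be sketched as below. The sensitivity of 50% is taken from the text; the specificity of 90% is a purely hypothetical value chosen for illustration, so the output is not expected to reproduce the figures quoted above, which rest on study data not given here.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = true positives / (true positives + false positives)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


SENSITIVITY = 0.50   # from the text: 50% of women with PCO had an elevated LH
SPECIFICITY = 0.90   # HYPOTHETICAL value, for illustration only

ppv_study = positive_predictive_value(SENSITIVITY, SPECIFICITY, 35 / 99)
ppv_population = positive_predictive_value(SENSITIVITY, SPECIFICITY, 0.07)

print(f"prevalence 35/99: PPV = {ppv_study:.2f}")
print(f"prevalence 7%:    PPV = {ppv_population:.2f}")
```

Whatever specificity is assumed, the same test gives a markedly lower PPV at a population prevalence of 7% than in a study group in which about one-third of subjects have PCO.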
There is an inherent problem in developing a robust definition of hyperandrogenaemia as it is one of the defining characteristics of PCOS. It is therefore only possible to define the term in a negative way. Moreover, because the incidence of PCOS, as opposed to the presence of polycystic ovaries, has only recently been established (see above), the majority of previous studies of the diagnostic performance of circulating androgens are fatally flawed, as they did not include sufficient controls to balance the appropriate incidence of PCOS.
Assessment of anovulation
Anovulation is assessed by measuring the serum progesterone during the mid-luteal phase. The peak value for progesterone is maintained for only a short time. Indeed, the commonest reason for a low value is that the sample was not taken at the appropriate time. The most widely used value in the UK to indicate ovulation is 30 nmol/l on Day 21. It is of note that there is significant bias between the methods currently available and this is not reflected in the interpretation given on laboratory report forms.37 This bias currently ranges from +12% to –14% between the highest and the lowest method means, and this clearly introduces further diagnostic confusion regarding the ovulatory status of an individual.
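The practical consequence of method bias around a fixed cut-off can be sketched as follows. We assume, for illustration, that the quoted biases act as simple proportional errors on the true concentration and that the same 30 nmol/l threshold is applied regardless of method; the function names are ours.

```python
OVULATION_THRESHOLD = 30.0   # nmol/l, the UK Day 21 value quoted in the text
LOW_BIAS = -0.14             # lowest-reading method, from the text
HIGH_BIAS = 0.12             # highest-reading method, from the text


def reported_value(true_conc, bias):
    """ASSUMPTION: bias is a simple proportional error on the true value."""
    return true_conc * (1.0 + bias)


def indicates_ovulation(true_conc, bias, threshold=OVULATION_THRESHOLD):
    """Apply the fixed threshold to the value a biased method would report."""
    return reported_value(true_conc, bias) >= threshold


def discordant_band(threshold=OVULATION_THRESHOLD):
    """Range of true concentrations that the two methods classify differently."""
    return threshold / (1.0 + HIGH_BIAS), threshold / (1.0 + LOW_BIAS)


lower, upper = discordant_band()
print(f"true values from {lower:.1f} to {upper:.1f} nmol/l are method dependent")
print(indicates_ovulation(30.0, LOW_BIAS), indicates_ovulation(30.0, HIGH_BIAS))
```

Under these assumptions a woman whose true progesterone is exactly 30 nmol/l would be reported as anovulatory by one method and ovulatory by the other, and the same holds for any true value between about 27 and 35 nmol/l.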
The definition of polycystic ovaries in the Rotterdam consensus was based on an assessment of the published data correlating the ‘best fit’ for ultrasound morphology with endocrine and clinical features of the syndrome.38 Since its publication, there have been papers questioning the defining characteristics. Should the stroma be included?6 Is the ovarian volume correct?39 Is it sufficient to include only 12 follicles?40 However, the most important recent finding, which agrees with the argument above for clinical examination and biochemical measurements, is that ultrasound imaging is markedly dependent on the operator. Amer and colleagues41 recorded 27 scans and showed them randomly in duplicate to four experienced imagers. The four imagers agreed with themselves in 63–74% of the cases, whereas the observers agreed with each other in only 51% of the cases, which showed a high probability of inconsistency between operators.
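Observer agreement of this kind could be reported with a calculation along the following lines. This is a sketch using simple percentage agreement between pairs of observers. The ratings are invented, and the definition of agreement used by Amer and colleagues may differ from the pairwise one assumed here.

```python
from itertools import combinations


def percent_agreement(ratings_a, ratings_b):
    """Percentage of scans on which two sets of ratings give the same call."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both raters must score the same scans")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)


def mean_pairwise_agreement(ratings_by_observer):
    """Average percentage agreement over every pair of observers."""
    pairs = list(combinations(ratings_by_observer.values(), 2))
    return sum(percent_agreement(a, b) for a, b in pairs) / len(pairs)


# INVENTED data: 1 = polycystic, 0 = normal, for eight scans and three observers.
ratings = {
    "observer_A": [1, 1, 0, 0, 1, 0, 1, 1],
    "observer_B": [1, 0, 0, 0, 1, 1, 1, 1],
    "observer_C": [0, 1, 0, 1, 1, 0, 1, 0],
}

between_observers = mean_pairwise_agreement(ratings)
print(f"mean between-observer agreement: {between_observers:.1f}%")
```

The same `percent_agreement` function applied to an observer's first and second reading of the duplicated scans would give the within-observer figure.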
Proposal for new criteria
We suggest that formal definitions be developed and propose the following criteria:
Clinical hyperandrogenism. All studies should use and cite well-validated scoring systems for acne, balding and hirsutism and include the within- and between-investigator variability.
Laboratory hyperandrogenism. In order to reduce variability, we would suggest that a single blood sample for testosterone is taken in the morning on Days 1–5 of a menstrual cycle. Specific criteria will need to be addressed for women with amenorrhoea. Measurement of multiple analytes will serve only to add variation and should not be used in the definition. The testosterone method and precision should be cited, and age-related reference ranges for that method should be developed from a cohort of at least 120 subjects who are defined as not having PCOS using clinical and imaging techniques alone, so that the diagnosis is not affected by testosterone method bias.
Diagnosis of anovulation. The progesterone method and precision used to confirm anovulation should be cited and the analytical bias relative to an international reference preparation should be stated, for example, one from the Institute for Reference Materials and Measurements (http://www.irmm.jrc.be).
Ultrasound. Specific criteria for ultrasound imaging have already been addressed.38
Insulin sensitivity. There is considerable interest in insulin sensitivity in PCOS and consideration should be given to including this variable in the diagnostic strategy. However, at present there may be too many preanalytical42 and analytical issues43 to allow this diagnostic test to be included.