Researchers and organizations often use evidence from randomized controlled trials (RCTs) to determine the efficacy of a treatment or intervention under ideal conditions. Observational studies are often used to measure the effectiveness of an intervention in 'real world' scenarios. Numerous study designs and modifications of existing designs, both randomized and observational, are used for comparative effectiveness research in an attempt to give an unbiased estimate of whether one treatment is more effective or safer than another for a particular population.
A systematic analysis of study design features, risk of bias, parameter interpretation, and effect size for all types of randomized and non-experimental observational studies is needed to identify specific differences in design types and potential biases. This review summarizes the results of methodological reviews that compare the outcomes of observational studies with randomized trials addressing the same question, as well as methodological reviews that compare the outcomes of different types of observational studies.
To assess the impact of study design (including RCTs versus observational study designs) on the effect measures estimated.
To explore methodological variables that might explain any differences identified.
To identify gaps in the existing research comparing study designs.
We searched seven electronic databases, from January 1990 to December 2013.
Along with MeSH terms and relevant keywords, we used the sensitivity-specificity balanced version of a validated strategy to identify reviews in PubMed, augmented with one term ("review" in article titles) so that it better targeted narrative reviews. No language restrictions were applied.
We examined systematic reviews that were designed as methodological reviews to compare quantitative effect size estimates measuring efficacy or effectiveness of interventions tested in trials with those tested in observational studies. Comparisons included RCTs versus observational studies (including retrospective cohorts, prospective cohorts, case-control designs, and cross-sectional designs). Reviews were not eligible if they compared randomized trials with other studies that had used some form of concurrent allocation.
Data collection and analysis
In general, outcome measures included relative risks or rate ratios (RR), odds ratios (OR), and hazard ratios (HR). Using results from observational studies as the reference group, we examined the published estimates to see whether the ratio of odds ratios (ROR) indicated a relatively larger or smaller effect.
Within each identified review, if an estimate comparing results from observational studies with RCTs was not provided, we pooled the estimates for observational studies and RCTs. Then, we estimated the ratio of ratios (risk ratio or odds ratio) for each identified review using observational studies as the reference category. Across all reviews, we synthesized these ratios to get a pooled ROR comparing results from RCTs with results from observational studies.
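The ratio-of-ratios step described above can be sketched as follows. This is a minimal illustration and not the review's own code: it assumes the two pooled estimates are independent, recovers standard errors from the reported 95% CIs on the log scale, and uses hypothetical input values. The function names `se_from_ci` and `ratio_of_ratios` are invented for this example.

```python
import math

Z95 = 1.959963984540054  # standard normal quantile for a two-sided 95% CI


def se_from_ci(lower, upper):
    """Standard error of a log ratio, recovered from its 95% CI bounds."""
    return (math.log(upper) - math.log(lower)) / (2 * Z95)


def ratio_of_ratios(rct, obs):
    """Ratio of an RCT estimate to an observational estimate (the reference).

    Each argument is a (point, lower, upper) tuple on the ratio scale.
    Assumption: the two estimates are independent, so their variances
    add on the log scale.
    """
    log_ror = math.log(rct[0]) - math.log(obs[0])
    se = math.sqrt(se_from_ci(rct[1], rct[2]) ** 2
                   + se_from_ci(obs[1], obs[2]) ** 2)
    return (math.exp(log_ror),
            math.exp(log_ror - Z95 * se),
            math.exp(log_ror + Z95 * se))


# Hypothetical pooled odds ratios, for illustration only
rct_or = (0.80, 0.65, 0.98)
obs_or = (0.75, 0.66, 0.85)

ror, lo, hi = ratio_of_ratios(rct_or, obs_or)
print(f"ROR = {ror:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

A ROR whose confidence interval includes 1 indicates no statistically significant difference between the two design types for that review.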
Our initial search yielded 4406 unique references. Fifteen reviews met our inclusion criteria, 14 of which were included in the quantitative analysis.
The included reviews analyzed data from 1583 meta-analyses that covered 228 different medical conditions. The mean number of included studies per paper was 178 (range 19 to 530).
Eleven (73%) reviews had low risk of bias for explicit criteria for study selection, nine (60%) had low risk of bias for investigators' agreement on study selection, five (33%) included a complete sample of studies, seven (47%) assessed the risk of bias of their included studies, seven (47%) controlled for methodological differences between studies, eight (53%) controlled for heterogeneity among studies, nine (60%) analyzed similar outcome measures, and four (27%) were judged to be at low risk of reporting bias.
Our primary quantitative analysis, including 14 reviews, showed that the pooled ROR comparing effects from RCTs with effects from observational studies was 1.08 (95% confidence interval (CI) 0.96 to 1.22). Of 14 reviews included in this analysis, 11 (79%) found no significant difference between observational studies and RCTs. One review suggested observational studies had larger effects of interest, and two reviews suggested observational studies had smaller effects of interest.
Similar to the effect across all included reviews, effects from reviews comparing RCTs with cohort studies had a pooled ROR of 1.04 (95% CI 0.89 to 1.21), with substantial heterogeneity (I² = 68%). Three reviews compared effects of RCTs and case-control designs (pooled ROR 1.11, 95% CI 0.91 to 1.35).
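For readers unfamiliar with the heterogeneity statistic, the sketch below shows how Cochran's Q, I², and a random-effects pooled estimate are conventionally computed from per-review log RORs. The abstract does not state which estimator was used; the DerSimonian-Laird method is assumed here, and the input values are hypothetical. The function name `pool_random_effects` is invented for this example.

```python
import math


def pool_random_effects(log_effects, ses):
    """DerSimonian-Laird random-effects pooling (assumed estimator).

    log_effects: per-review effects on the log scale (e.g. log ROR)
    ses:         their standard errors
    Returns (pooled log effect, its standard error, tau^2, I^2 in percent).
    """
    # Fixed-effect (inverse-variance) weights and estimate
    w = [1.0 / s ** 2 for s in ses]
    sw = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, log_effects)) / sw

    # Cochran's Q and the I^2 statistic
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, log_effects))
    df = len(log_effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # Between-review variance, truncated at zero
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights and pooled estimate
    w_star = [1.0 / (s ** 2 + tau2) for s in ses]
    pooled = sum(wi * y for wi, y in zip(w_star, log_effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2, i2


# Hypothetical per-review RORs and standard errors, for illustration only
log_rors = [math.log(x) for x in (0.90, 1.10, 1.30, 1.00)]
std_errs = [0.05, 0.08, 0.10, 0.06]

pooled, se, tau2, i2 = pool_random_effects(log_rors, std_errs)
print(f"pooled ROR = {math.exp(pooled):.2f}, I^2 = {i2:.0f}%")
```

I² expresses the share of total variation across reviews that is attributable to heterogeneity rather than chance.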
No significant differences in point estimates were noted across heterogeneity, pharmacological intervention, or propensity score adjustment subgroups. No reviews had compared RCTs with observational studies that used two of the most common causal inference methods, instrumental variables and marginal structural models.
Our results across all reviews (pooled ROR 1.08) are very similar to results reported by similarly conducted reviews. As such, we have reached similar conclusions: on average, there is little evidence for significant effect estimate differences between observational studies and RCTs, regardless of specific observational study design, heterogeneity, or inclusion of studies of pharmacological interventions. Factors other than study design per se need to be considered when exploring reasons for a lack of agreement between results of RCTs and observational studies. Our results underscore that it is important for review authors to consider not only study design, but also the level of heterogeneity in meta-analyses of RCTs or observational studies. A better understanding of how these factors influence study effects might yield estimates reflective of true effectiveness.
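Researchers and organizations often use evidence from randomized controlled trials (RCTs) to judge the efficacy of a treatment under ideal conditions. Observational studies are often used to measure the effectiveness of an intervention in 'real world' settings. Comparative effectiveness research uses many study designs and modifications of existing designs, both randomized and observational, in an attempt to estimate accurately whether one treatment is more effective or safer than another for a particular population.
A systematic analysis of study design features, risk of bias, parameter interpretation, and effect size across all types of randomized and non-experimental observational studies is needed to identify specific differences between design types and potential biases. This review summarizes the results of methodological reviews, including those that compare the outcomes of observational studies and randomized trials addressing the same question and those that compare the outcomes of different types of observational studies.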
To assess the impact of study design (including RCTs versus observational study designs) on the effect measures estimated.
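Along with MeSH terms and relevant keywords, we used a validated strategy that balances sensitivity and specificity to identify reviews in PubMed, augmented with one term ("review" in article titles) so that it better targeted narrative reviews. No language restrictions were applied.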
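We examined systematic reviews that were designed as methodological reviews to compare quantitative effect size estimates, measuring the efficacy or effectiveness of interventions, from trials with those from observational studies. Comparisons included RCTs versus observational studies (including retrospective cohorts, prospective cohorts, case-control designs, and cross-sectional designs). Reviews that compared randomized trials with other studies that had used some form of concurrent allocation were not eligible.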
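In general, outcome measures included relative risks or rate ratios (RR), odds ratios (OR), and hazard ratios (HR). Using results from observational studies as the reference group, we examined the published estimates to see whether the ratio of odds ratios (ROR) indicated a relatively larger or smaller effect.
Where an included review did not provide an estimate comparing results from observational studies with RCTs, we pooled the estimates for observational studies and for RCTs within that review. We then estimated the ratio of ratios (risk ratio or odds ratio) for each included review, using observational studies as the reference category. We synthesized these ratios across all included reviews to obtain a pooled ROR comparing RCTs with observational studies.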
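Our initial search yielded 4406 unique references. Fifteen reviews met the inclusion criteria, 14 of which were included in the quantitative analysis.
The included reviews analyzed data from 1583 meta-analyses that covered 228 different medical conditions. The mean number of included studies per paper was 178 (range 19 to 530).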
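Eleven (73%) reviews had low risk of bias for explicit criteria for study selection, nine (60%) had low risk of bias for investigators' agreement on study selection, five (33%) included a complete sample of studies, seven (47%) assessed the risk of bias of their included studies, seven (47%) controlled for methodological differences between studies, eight (53%) controlled for heterogeneity among studies, nine (60%) analyzed similar outcome measures, and four (27%) were judged to be at low risk of reporting bias.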
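Our primary quantitative analysis, which included 14 reviews, showed that the pooled ROR comparing effects from RCTs with effects from observational studies was 1.08 (95% CI 0.96 to 1.22). Of the 14 reviews included in this analysis, 11 (79%) found no significant difference between observational studies and RCTs. One review suggested that observational studies had larger effects of interest, and two reviews suggested that observational studies had smaller effects of interest.
Similar to the effect across all included reviews, reviews comparing RCTs with cohort studies had a pooled ROR of 1.04 (95% CI 0.89 to 1.21), with substantial heterogeneity (I² = 68%). Three reviews compared effects of RCTs and case-control designs (pooled ROR 1.11, 95% CI 0.91 to 1.35).
No significant differences in point estimates were noted across heterogeneity, pharmacological intervention, or propensity score adjustment subgroups. No reviews had compared RCTs with observational studies that used two of the most common causal inference methods, instrumental variables and marginal structural models.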
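Our results across all reviews (pooled ROR 1.08) are very similar to those reported by similarly conducted reviews. We have therefore reached similar conclusions: on average, there is little evidence of significant differences in effect estimates between observational studies and RCTs, regardless of the specific observational study design, heterogeneity, or inclusion of studies of pharmacological interventions. Factors other than study design itself need to be considered when exploring reasons for a lack of agreement between the results of RCTs and observational studies. Our results underscore that it is important for review authors to consider not only study design but also the level of heterogeneity in meta-analyses of RCTs or observational studies. A better understanding of how these factors influence study effects might yield estimates that reflect true effectiveness.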