Journal of the Royal Statistical Society: Series A (Statistics in Society)

Edited By: J. Carpenter and H. Goldstein

Impact Factor: 1.852

ISI Journal Citation Reports © Ranking: 2016: 11/49 (Social Sciences Mathematical Methods); 21/124 (Statistics & Probability)

Online ISSN: 1467-985X

Associated Title(s): Journal of the Royal Statistical Society: Series B (Statistical Methodology), Journal of the Royal Statistical Society: Series C (Applied Statistics), Significance

176:2


Gender wage differentials in Mexico: a distributional approach, by G. K. Popli, Journal of the Royal Statistical Society, Series A, Statistics in Society, Volume 176 (2013), part 2, pages 295–319

DATA
The specific data set used for this paper is:
ENIGH: Encuesta Nacional de Ingresos y Gastos de los Hogares (National Survey on Household Income and Expenditure) for the years 1996 and 2006.

The original dataset can be obtained from:
INEGI: Instituto Nacional de Estadistica y Geografia (National Institute of Statistics and Geography).
http://www.inegi.org.mx/
Note: the original data are in Spanish.

What is provided here is the clean version of the data.
Data files for 1996 (1996data.txt) and 2006 (2006data.txt) are provided.
A programme is also provided (progFV.txt).
This programme along with the data provided will reproduce the empirical results shown in the paper, and any robustness analysis discussed in the paper.
The data were analyzed using Stata version 11.
A file called varlist.txt is also provided which gives details of the variable names in the data files.

Gurleen K. Popli
Department of Economics
University of Sheffield
9 Mappin Street
S1 4DT
UK
E-mail: g.popli@shef.ac.uk

Dataset (3MB)

Epidemics in semi-isolated communities: statistical perspectives on acute childhood diseases in English public boarding schools, 1930–1939, by M. Smallman-Raynor and A. D. Cliff, Journal of the Royal Statistical Society, Series A, Statistics in Society, Volume 176 (2013), part 2, pages 321 -- 346

This paper uses information contained in two databases.

Database 1: Unpublished Secretary’s Reports

The database consists of the aggregate termly count for 24 disease categories in 27 boys’ and girls’ schools, Lent Term 1930–Lent Term 1939. Information abstracted from School Epidemics Committee, Secretary's Report, National Archives (Files FD 1/5063–1/5069).

Data column codes
Column 1: Year (1930, . . ., 1939)
Column 2: Term (Christmas, Lent, Summer)
Columns 3–26: Case count (24 disease categories)

Database 2: Published MRC Special Reports

For six common acute childhood infections, the database consists of collated cases and case rates per 100 population exposed to risk by school and term. Information is presented for 599 observation terms over the set of 27 schools. Observation terms with attack rates 1.0 per 100 population exposed to risk for a given disease have been attributed zero values (counts and rates). Information abstracted from School Epidemics Committee (1938, Tables LXXIII, LXXVII, LXXIX–LXXXI, LXXXIII, LXXXV, pp. 145–67 passim) and Cheeseman (1950, Tables XXIV–XXVI, XXIX–XXXII, pp. 38–51 passim).

Data column codes
Column 1: Year (1930, . . ., 1939)
Column 2: Term (Christmas, Lent, Summer)
Column 3: School
Column 4: Gender (Boys school, Girls school)
Column 5: Chickenpox case count
Column 6: Measles case count
Column 7: Mumps case count
Column 8: Rubella case count
Column 9: Scarlet fever case count
Column 10: Whooping cough case count
Column 11: Chickenpox attack rate per 100 population exposed to risk
Column 12: Measles attack rate per 100 population exposed to risk
Column 13: Mumps attack rate per 100 population exposed to risk
Column 14: Rubella attack rate per 100 population exposed to risk
Column 15: Scarlet fever attack rate per 100 population exposed to risk
Column 16: Whooping cough attack rate per 100 population exposed to risk
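
The attack rates in columns 11–16 follow the usual definition: cases divided by the population exposed to risk, scaled to 100. The Python sketch below illustrates that calculation only. The function name and the example figures are hypothetical and are not taken from the database.

```python
def attack_rate_per_100(cases, exposed):
    """Return cases per 100 population exposed to risk."""
    if exposed <= 0:
        raise ValueError("exposed population must be positive")
    return 100.0 * cases / exposed


# Hypothetical school term: 12 measles cases among 400 pupils at risk.
rate = attack_rate_per_100(12, 400)
print(rate)  # 3.0
```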

References
Cheeseman, E.A. (1950) Epidemics in Schools: An Analysis of the Data Collected During the Years 1935 to 1939. Medical Research Council Special Report Series No. 271. London: HMSO.
School Epidemics Committee (1938) Epidemics in Schools: An Analysis of the Data Collected during the First Five Years of a Statistical Inquiry. Medical Research Council Special Report Series No. 227. London: HMSO.

Matthew Smallman-Raynor
School of Geography
University of Nottingham
University Park
Nottingham
NG7 2RD
UK

E-mail: matthew.smallman-raynor@nottingham.ac.uk

Dataset

The group size and loyalty of football fans: a two-stage estimation procedure to compare customer potential across teams, by L. Brandes, E. Franck and P. Theiler, Journal of the Royal Statistical Society, Series A, Statistics in Society, Volume 176 (2013), part 2, pages 347 -- 369

There are four files available for this paper:

(i) Loyalty_Bundesliga-1996-2004.txt
(ii) Code_Tables2-4_Estimations.txt
(iii) mytobit_lf.txt
(iv) Code_Table5.txt

File Loyalty_Bundesliga-1996-2004.txt contains the data that were used for all estimations and for creating the tables.
File Code_Tables2-4_Estimations.txt contains Stata code to generate the tables and estimations.
File mytobit_lf.txt is a Stata ado file that estimates a Tobit model using the likelihood function code in Stata. The content of this file needs to be stored as mytobit_lf.ado in order to execute Code_Tables2-4_Estimations.txt.
File Code_Table5.txt contains Stata code to generate Table 5 and calculate correlation coefficients.
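
For orientation, the log-likelihood that a Tobit estimator maximises can be written in a few lines. The sketch below is Python, not the Stata code in mytobit_lf.txt. It assumes right-censoring at a known limit, such as attendance capped by capacity at a sold-out match. All function and argument names are illustrative.

```python
import math


def norm_logpdf(z):
    """Log density of the standard normal distribution."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def norm_cdf(z):
    """Distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tobit_loglik(y, xb, limit, sigma):
    """Log-likelihood of a Tobit model right-censored at limit[i].

    y     : observed outcomes
    xb    : linear predictions x_i'beta
    limit : censoring points; y[i] >= limit[i] marks a censored observation
    sigma : standard deviation of the latent error (must be positive)
    """
    total = 0.0
    for yi, mi, ci in zip(y, xb, limit):
        if yi >= ci:
            # Censored: probability that the latent outcome is at or above
            # the limit, written as a lower-tail probability by symmetry.
            total += math.log(norm_cdf((mi - ci) / sigma))
        else:
            # Uncensored: normal density of the residual.
            total += norm_logpdf((yi - mi) / sigma) - math.log(sigma)
    return total
```

In practice such a function would be handed to a numerical optimiser over the coefficients and sigma; the sketch only evaluates the likelihood at given values.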

The data contains the variables in the following order:

  season               season number (from 2 to 9, own coding, relative to 1995/96 season)
  fixture              fixture within season (3-34)
  heimteam             name of home team
  gastteam             name of away team
  home                 home team (coded)
  away                 away team (coded)
  attend               match attendance
  soldout              indicator variable: match soldout = 1
  hstand               home: ranking
  astand               away: ranking
  hbudg                home: budget
  abudg                away: budget
  midweek              indicator variable: match on Mon-Thurs = 1
  hprom                home: promoted (indicator variable: promoted = 1)
  aprom                away: promoted (indicator variable: promoted = 1)
  htop2                home: top2 (indicator variable: top2 = 1)
  atop2                away: top2 (indicator variable: top2 = 1)
  dbtime               travel time with German Railway
  unemp                unemployment rate
  hrep20               home: reputation
  arep20               away: reputation
  hseries3             home: three wins in a row (indicator variable: 3 consecutive wins = 1)
  aseries4             away: four wins in a row (indicator variable: 4 consecutive wins = 1)
  hchampnew            home: championship contention (indicator variable: contention = 1)
  achampnew            away: championship contention (indicator variable: contention = 1)
  hrelnew              home: relegation contention (indicator variable: contention = 1)
  arelnew              away: relegation contention (indicator variable: contention = 1)
  stadwrk              stadium under construction? (indicator variable: Yes = 1)
  newstad              new stadium? (indicator variable: Yes = 1)
  seas2                indicator variable: season == 2 => 1
  seas3                indicator variable: season == 3 => 1
  seas4                indicator variable: season == 4 => 1
  seas5                indicator variable: season == 5 => 1
  seas6                indicator variable: season == 6 => 1
  seas7                indicator variable: season == 7 => 1
  seas8                indicator variable: season == 8 => 1
  seas9                indicator variable: season == 9 => 1
  rain                 rain on match day before kick-off? (indicator variable: Yes == 1)
  snow                 snow on match day before kick-off? (indicator variable: Yes == 1)
  hdort                indicator variable: home team is Dortmund = 1
  hmun                 indicator variable: home team is Bayern Munich = 1
  hschal               indicator variable: home team is Schalke = 1
  hmglb                indicator variable: home team is Moenchengladbach = 1
  hham                 indicator variable: home team is Hamburg = 1
  hrost                indicator variable: home team is Rostock = 1
  h1860                indicator variable: home team is 1860 Munich = 1
  hbrem                indicator variable: home team is Bremen = 1
  hstutt               indicator variable: home team is Stuttgart = 1
  hfreib               indicator variable: home team is Freiburg = 1
  hkoln                indicator variable: home team is Koeln = 1
  hduss                indicator variable: home team is Dusseldorf = 1
  hlever               indicator variable: home team is Leverkusen = 1
  hstpaul              indicator variable: home team is St. Pauli = 1
  hkais                indicator variable: home team is Kaiserslautern = 1
  hfrank               indicator variable: home team is Frankfurt = 1
  huerd                indicator variable: home team is Uerdingen = 1
  hboch                indicator variable: home team is Bochum = 1
  hduis                indicator variable: home team is Duisburg = 1
  hbiele               indicator variable: home team is Bielefeld = 1
  hberlin              indicator variable: home team is Berlin = 1
  hwolfs               indicator variable: home team is Wolfsburg = 1
  hnurn                indicator variable: home team is Nuremberg = 1
  hunterh              indicator variable: home team is Unterhaching = 1
  hulm                 indicator variable: home team is Ulm = 1
  hcottb               indicator variable: home team is Cottbus = 1
  hhann                indicator variable: home team is Hannover = 1
  weekend              indicator variable: match on Fri-Sun = 1
  uoo                  match uncertainty of outcome
  logprice             logprice
  dbtimesqr            travel time with German Railway, squared
  temperature          temperature in degrees Celsius
  hstand_ls            home: ranking at end of previous season
  astand_ls            away: ranking at end of previous season
  soldout_instrument   home: number of soldout games in previous season
  seasontickets        home: number of season tickets sold
  available_capacity   home: stadium capacity minus season tickets sold
  capacity             home: stadium capacity
  away_contingent      home: stadium contingent for away fans
  cap2                 home: adjusted available capacity
  soldout2             adjusted sell outs
  attendnst2           attendance, no season tickets, adjusted for away fan contingent
  hmarket              home: market size (male population)
  potentialTV          match was potentially broadcast on TV
  season_ticket_share  share of season tickets sold

           
            
Leif Brandes
Department of Business Administration
University of Zurich
Plattenstrasse 14
8032 Zurich
Switzerland

E-mail: Leif.Brandes@business.uzh.ch

Dataset

Assessing the accuracy of non-random business conditions survey: a novel approach, by D. de Munnik, M. Illing and D. Dupuis, Journal of the Royal Statistical Society, Series A, Statistics in Society, Volume 176 (2013), part 2, pages 371 -- 388

Three program files are provided.

All code has been written in FAME input files.

First, the file “deMunnikIllingDupuis_PopulationCreationFile_21Feb2012.inp” provides an example of how to construct the FAME database for the pseudo population. Most of the firm-specific information has been suppressed because permission was not granted to release it.

Second, the file “deMunnikIllingDupuis_FullModel_Feb2012.inp” provides an example of how to construct the Monte Carlo simulation and save the results. These results also produce the coverage results by recording when a firm is selected by the model.

Third, the file “deMunnikIllingDupuis_SRSModel_Feb2012.inp” provides an example of how to run the stratified random sample for comparison.

Daniel de Munnik
Canadian Economic Analysis Department
Atlantic Regional Office
Bank of Canada
1701-13 Hollis Street
Halifax
Nova Scotia
B3J 1V4
Canada

E-mail: dmunnik@bankofcanada.ca

Dataset

A spatial Poisson hurdle model for exploring geographic variation in emergency department visits, by B. Neelon, P. Ghosh and P. F. Loebe, Journal of the Royal Statistical Society, Series A, Statistics in Society, Volume 176 (2013), part 2, pages 389 -- 413

NOTE 1: YOU REALLY ONLY NEED TO RUN "RUN MODEL.R", WHICH EXECUTES R2WINBUGS. IF YOU WANT TO RUN INTERACTIVELY IN WINBUGS, YOU CAN USE MODEL.TXT, DATA.TXT AND INITS.TXT WITHIN WINBUGS

NOTE 2: YOU WILL HAVE TO CHANGE FILE PATH NAMES

NOTE 3: THIS EXAMPLE DOESN'T INCLUDE B-SPLINES AS IN THE JRSS-A PAPER -- See the splines package in R (specifically the bs function) for details on fitting B-splines

File List:

1) Make Data.r = Simulates Poisson Hurdle Model using Durham County block group adjacency matrix, A.txt

2) A.txt = Durham County adjacency matrix

3) writeDatafileR.txt = Program for writing data for use by WinBUGS (Written by Terry Elrod)

4) data.txt = Data File

5) model.txt = WinBUGS model

6) inits.txt = WinBUGS initial values

7) Run Model.r = Calls WinBUGS via R2WinBUGS -- USE THIS TO RUN MODEL

8) Results.r = compiles results and produces posterior means
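
As a reminder of what a Poisson hurdle model describes, the sketch below gives its probability mass function in Python: a point mass at zero plus a zero-truncated Poisson for positive counts. It is a generic textbook form and leaves out the covariates and spatial random effects of the model in the paper. The function and parameter names are illustrative and do not correspond to anything in the files listed above.

```python
import math


def hurdle_poisson_pmf(k, pi, lam):
    """P(Y = k) under a Poisson hurdle model.

    pi  : probability of clearing the hurdle, i.e. P(Y > 0)
    lam : rate of the zero-truncated Poisson for positive counts (> 0)
    """
    if k < 0:
        return 0.0
    if k == 0:
        return 1.0 - pi
    # Poisson log-probability of k, then rescaled by P(Poisson > 0).
    log_pois = -lam + k * math.log(lam) - math.lgamma(k + 1)
    return pi * math.exp(log_pois) / (-math.expm1(-lam))


# Probabilities over all counts sum to one (up to truncation of the tail).
total = sum(hurdle_poisson_pmf(k, 0.3, 2.0) for k in range(200))
```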

Brian Neelon
Department of Biostatistics
Box 2721
Duke University Medical Center
Durham
NC 27710-2721
USA

E-mail: brian.neelon@duke.edu

Dataset

The Master of the Royal Mint: how much money did Isaac Newton save Britain?, by A. Belenkiy, Journal of the Royal Statistical Society, Series A, Statistics in Society, Volume 176 (2013), part 2, pages 481 -- 498

The data for Table from the Jury Verdicts at the trials of the Pyx, 1696-1727 (Mint 7.130, ff. 50-80. The National Archives, Kew, Richmond, Surrey) are contained in the file 'Newton at the Mint_DATA'.

Column 1: Year of trial of the Pyx
Column 2: Master of the Mint
Column 3: Total amount of gold in the Pyx by weight, W (p.oz.dwt.gr)
Column 4: Total amount of gold in the Pyx by weight, converted into coins, 44.5·W (£:sh:d)
Column 5: Total amount of gold coins in the Pyx, T (guineas)
Column 6: Deficiency per pound of gold coins, D = 5760(T-44.5·W)/T (grains)
Column 7: Weight of a sample pound of gold coins, w (oz.dwt.gr)
Column 8: Deficiency in grains per sample pound d = 5760–w (grains)

Unknown values in the data are denoted by '?'.
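
The two deficiency formulas in columns 6 and 8 can be checked with a short Python sketch. It relies on the troy system (1 pound = 12 oz, 1 oz = 20 dwt, 1 dwt = 24 grains, hence 5760 grains per pound) and on the factor 44.5 from column 4. The function names are hypothetical, the weight W is assumed to be supplied in decimal troy pounds, and the example values are invented for illustration.

```python
GRAINS_PER_DWT = 24
DWT_PER_OZ = 20
GRAINS_PER_TROY_POUND = 5760  # 12 oz * 20 dwt * 24 grains
GUINEAS_PER_POUND_WEIGHT = 44.5


def to_grains(oz, dwt, gr):
    """Convert a weight given as oz.dwt.gr into grains."""
    return (oz * DWT_PER_OZ + dwt) * GRAINS_PER_DWT + gr


def sample_deficiency(oz, dwt, gr):
    """Column 8: d = 5760 - w, in grains."""
    return GRAINS_PER_TROY_POUND - to_grains(oz, dwt, gr)


def pyx_deficiency(total_guineas, weight_pounds):
    """Column 6: D = 5760 (T - 44.5 W) / T, in grains."""
    expected = GUINEAS_PER_POUND_WEIGHT * weight_pounds
    return GRAINS_PER_TROY_POUND * (total_guineas - expected) / total_guineas


# Invented example: a sample pound weighing 11 oz 19 dwt 12 gr.
print(sample_deficiency(11, 19, 12))  # 12
```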

Ari Belenkiy
8-8191 Francis Road
Richmond
British Columbia
V6Y 1A5
Canada

E-mail: ari.belenkiy@gmail.com

Dataset

Variance estimation of the Gini index: revisiting a result several times published, by M. Langel and Y. Tillé, Journal of the Royal Statistical Society, Series A, Statistics in Society, Volume 176 (2013), part 2, pages 521 -- 540

Data set:

PENN WORLD TABLE 5.6

Heston, A., Summers, R., and Aten, B. (1995). Penn World Table. Technical report, Center for International Comparisons of Production, Income and Prices at the University of Pennsylvania.

available online at: pwt.econ.upenn.edu

Data for the year 1970 were used, covering the 133 countries for which data are available. The variable of interest is real consumption per capita (with prices in constant USD based on year 1985).

Relevant data has been regrouped in the file consumption_pwt1970.txt where the variable of interest is called "con".

Program:

The computer code is split into two independent files. The code should be run using the R software (available for free at cran.r-project.org).

-> code_table1.txt contains the R-commands to produce results for Table 1 of the paper.
-> code_table2.txt contains the R-commands to produce results for Table 2 of the paper.

Note: Results for Table 2 may differ very slightly from those in the paper due to random sampling.
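
For readers who want a quick check of the quantity being studied, the sketch below computes a plain, equally weighted Gini index in Python using the standard rank-based formula. It is not the R code in code_table1.txt or code_table2.txt, it ignores sampling weights, and it does not address variance estimation. The function name is illustrative.

```python
def gini(values):
    """Gini index of non-negative values with equal weights.

    Uses G = 2 * sum(i * y_(i)) / (n * sum(y)) - (n + 1) / n,
    where y_(i) are the values sorted in increasing order.
    """
    y = sorted(values)
    n = len(y)
    total = sum(y)
    if n == 0 or total <= 0:
        raise ValueError("need at least one value and a positive total")
    weighted = sum(rank * v for rank, v in enumerate(y, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


print(gini([1, 1, 1, 1]))  # 0.0
print(gini([0, 0, 0, 1]))  # 0.75
```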

Matti Langel
University of Neuchâtel
Pierre à Mazel 7
2000 Neuchâtel
Switzerland

E-mail: matti.langel@unine.ch

Dataset

Which method predicts recidivism best?: a comparison of statistical, machine learning and data mining predictive models, by N. Tollenaar and P. G. M. van der Heijden, Journal of the Royal Statistical Society, Series A, Statistics in Society, Volume 176, part 2 (2013), pages 565 -- 584

Description of the data sets: General.dat, violence.dat, sexual.dat. These data are available on request via the first author.

rec4: reconviction in four years yes/no (1/0)
sekse: gender (0=male, 1=female)
cgebland: country of birth (1=Netherlands, 2= Morocco, 3=Neth. Antilles/Aruba, 4=Surinam, 5=Turkey, 6=Other Western countries, 7=Other non-Western countries)
vgalguz: number of previous convictions
vgalgc3: number of previous convictions partially categorised (12=11-20 previous convictions, 13=21 or more previous convictions)
leeftijd: age in years
lftinsz1inclvtt: age at first penal case
delcuziv: crime type of most serious offence in case: 1=violence, 2=sexual, 3=property with violence, 4=property without violence, 5=public order, 6=drugs, 7=motoring, 8=Miscellaneous
dichtheid: conviction density
vggev: number of previous prison terms
vgtaak: number of previous CSO's
vgboete: number of previous fines
vgtrans: number of previous PPD's (public prosecutor's disposals)
vggeweld: number of previous cases with violence offence
vgzeden: number of previous cases with sexual offence
vgvermg: number of previous cases with property with violence offence
vgvermgng: number of previous cases with property offence
vgvernoo: number of previous cases with public order offence
vgopium: number of previous cases with drug offence
vgverkeer: number of previous cases with motoring offence
vgoverig: number of previous cases with miscellaneous offence
zvermg: property with violence offence in index case
zvermgng: property offence in index case
zopium: drug offence in index case
zvernoo: public order offence in index case
zverkeer: motoring offence in index case
zoverig: miscellaneous offence in index case
zzeden: sexual offence in index case
zgeweld: violence offence in index case

Code: All code for reproducing the results is contained in Statrec_general__violence_sexual.R

Functions:
- platt: used for Platt calibration of outputs;
- acc2: calculate accuracy when sensitivity equals specificity;
- perfbat: used for computing the list of performance criteria;
- calbr: used for computing the calibration error around either the base rate or another cutoff value;
- svmlinfit: function to fit a series of support vector machines using class weights.
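
To illustrate the idea behind a criterion such as "accuracy when sensitivity equals specificity", the Python sketch below scans the observed scores as candidate cut-offs and reports the accuracy at the cut-off where the two rates are closest. It is an assumption about how such a measure can be computed, not a translation of the acc2 function in the R file. The function name is hypothetical.

```python
def accuracy_at_balance(scores, labels):
    """Accuracy at the cut-off where sensitivity and specificity are closest.

    scores : predicted probabilities or scores (higher means more likely 1)
    labels : observed outcomes coded 1/0
    Returns (accuracy, cut_off); a case is predicted 1 when score >= cut_off.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    best = None
    for cut in sorted(set(scores)):
        tp = sum(1 for s, l in zip(scores, labels) if l == 1 and s >= cut)
        tn = sum(1 for s, l in zip(scores, labels) if l == 0 and s < cut)
        gap = abs(tp / pos - tn / neg)
        if best is None or gap < best[0]:
            best = (gap, (tp + tn) / len(labels), cut)
    return best[1], best[2]


print(accuracy_at_balance([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))  # (1.0, 0.8)
```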

N. Tollenaar
Research and Documentation Centre
Ministry of Security and Justice
Schedeldoekshaven 131
2311 EM
Den Haag
Zuid-Holland
The Netherlands

E-mail: n.tollenaar@minvenj.nl

Dataset
