The International Corpus of English project: A progress report

This article begins by introducing the International Corpus of English project and proceeds to summarize the findings and outcomes of an extensive review by written questionnaire conducted by the present authors (Kirk & Nelson, 2017). Although critically concerned with practice hitherto, the review also discusses possible second generation components, and the issues that need to be addressed before work on them begins. The report contains many comments from the questionnaire that respondents completed, giving a flavour of the importance with which the corpus is valued. Respondents also raise a number of fundamental questions about the nature of L2 varieties of English in multilingual contexts. An Appendix sets out a prospectus for a possible component of electronic texts. Other Appendices list the corpus's text categories and their quantities as well as the 27 national components and their directors.


INTRODUCTION
The study of world Englishes has come a long way in the past thirty years or so, ever since the establishment of the journals English World-Wide in 1980 and World Language English in 1981 and the impetus of the regional overviews collected in Bailey and Görlach (1982). From then on, the field has developed in no small measure through an explosion of national-variety data that have increasingly become available, as well as through the empirical research which has been conducted on those data, often on a comparative basis. Foremost amongst such comparative resources has been the International Corpus of English (ICE) […] and instructional writing to persuasive and creative writing (see Appendix 1); inevitably, the written texts approximate to local varieties of standardized English. The 300 spoken texts (each again of 2,000 words) encompass 15 discourse situations, ranging from private and public dialogues to scripted and unscripted monologues; from their situational contexts (broadcasting, law courts, education, and so on) and from the language used in them, an approximation towards spoken standardized English may also be inferred. In the light of subsequent research (most notably Schneider, 2007), not all ICE L2 (or ESL) varieties in their spoken form are standardized. All speakers are expected to be adults (over 18 years of age) who have completed their high school education; in fact, a great many speakers are graduates.
Following the publication of A comprehensive grammar of the English language (Quirk et al., 1985), which had been somewhat informed by the Survey of English Usage Corpus (the spoken part of which had been computerized as the London-Lund Corpus of Spoken British English) and which had shown awareness of British and American differences in the standardized language, it seemed a natural development to extend the description of the standardized language to national varieties of English worldwide, both where English is a mother tongue or native language, and those where it is an official or second or additional language. As McEnery and Hardie (2012, p. 75) (Greenbaum, 1988). For its initiator and primum mobile, Sidney Greenbaum, the principal aim and objective of ICE was 'to provide the resources for comparative studies of the English used in countries where it is either a majority first language (ENL) (for example, Canada and Australia) or an official additional language (ESL) (for example, India and Nigeria). In both language situations, English serves as a means of communication between those who live in these countries. The resources that ICE is providing for comparative studies are computer corpora, collections of samples of written and spoken English from each of the countries that are participating in the project' (Greenbaum, 1996, p. 3). Following Greenbaum's initial proposal (Greenbaum, 1988), discussions were held in 1989 at the 10th international conference on computer corpora on English language research held in the name of the International Computer Archive of Modern and Medieval English (ICAME), which was held in Bergen (Johansson & Stenström, 1991), where the project was inaugurated. At the 11th ICAME conference, in Berlin, in 1990, the main details of the corpus were discussed and agreed (Leitner, 1992a). At subsequent ICAME meetings, further arrangements were made for annotation schemes.
Thirty years on, given the many successes and achievements which the corpus has facilitated, with radical changes in technology, with ICE teams having dropped a generation themselves, and moreover with a changing world with increased global travel and migration, increased literacy, education and entry to higher education, it seemed warranted to review its practices and to consider how best the project may be developed in the next 30 years. With those 27 national L1 and L2 components compiled or being compiled (Appendix 2), ICE is truly a worldwide project. As the review came to show, however, complications inevitably arise from the need for conformity to and replication of a prearranged plan, such as the different legal and cultural contexts in which data are collected and treated, to which some cognizance has had to be given. Copyright law is not uniform, and some countries are more restrictive than others, leading to different solutions. Financial, even material resources for corpus compilation differ from country to country, and some countries could only be included because funding has come from European universities. Attitudes towards local as well as international co-operation differ from country to country, making data collection far from easy or uniform.
With teams having started at different times and now being at various stages of completion, some components are no longer contemporaneous with others. Despite an agreed set of protocols for collection, transcription, markup and annotation, which most teams have endeavoured to follow, a few teams have sought to go their own way. Although, as ICE Coordinator, the second named author of this paper has strenuously pursued the creation of proverbial unity amongst diversity to prevent fragmentation, he has found it increasingly challenging to create a prevailing, unifying ethos overall. As Nelson cautions, the project's single most important objective should be: 'to ensure that the project does not fragment into separate, non-comparable regional corpus projects. As time goes on, this fragmentation becomes more and more likely, as teams increasingly disregard the original agreed protocols. This is particularly true of some new teams'. (This article includes a number of verbatim quotations from the questionnaire responses, without author attribution.) Early critical discussion was concerned with the Britishness of the text categories and the difficulty of collection in L2 countries; which countries to include; questions of sampling and social representativeness; and funding (Leitner, 1992b; Schmied, 1996). In the early days, orthographic transcriptions and the insertion of structural markup were undertaken with only very general guidelines, and usually only later was any annotation (for example, part-of-speech tagging) added. In the meantime, hardware and corpus-exploitation software have been revolutionized, greatly facilitating accessibility and convenience for researchers, and further necessitating this review.
All the same, the goal of creating an or even the international corpus has inspired everyone, and the success lies not just in the compatibility of a great many of the national components but in the countless theses, monographs and research articles which have been based on the ICE material hitherto, now all too copious to encapsulate within a single ICE bibliography.
The purpose of the present article is to summarise the main critical findings as well as outcomes of the review (Kirk & Nelson, 2017), thereby offering an insight into how ICE is used and valued by its main creators and users.

GREENBAUM'S VISION
As expressed in various articles culminating in the volume which he edited in 1996, Greenbaum's vision for a collection of comparable corpora of English which would underpin the envisaged comparative studies, particularly of lexico-syntax and of spoken and written registers, has been amply fulfilled; the claim certainly holds more strongly for written texts than spoken. 'With the written component, there is a fair amount of confidence in the claims made by various authors', writes one questionnaire respondent, but cautions: 'With the spoken component, though, the lack of prosodic/phonetic indicators for studies on the mere lexical corpus does not particularly inspire confidence in the claims made by various people'. Greenbaum's vision contained many technical components including part-of-speech tagging and syntactic parsing as integral parts of the corpus. On the basis of the TOSCA tagger (Oostdijk, 1991), a specially dedicated ICE tagset was developed (Greenbaum, 1993; Greenbaum & Ni, 1996) and subsequently an ICE parser (Buckley, 1996; Fang, 1996). In the end, only ICE-GB made use of those tools (released in 1998; Nelson, Wallis, & Aarts, 2002), although an attempt was made to use them to annotate ICE-Philippines (Wallis, n.d.). In addition to the annotations for ICE-GB, software for analysing the tagged and parsed ICE-GB corpus, and for displaying and outputting the results, was developed as the International Corpus of English Corpus Utility Package (ICECUP). As a corpus exploration platform, ICECUP, now in version 4, developed by Sean Wallis, is designed to make it easy for researchers to investigate especially a parsed corpus and output their results (Wallis, Aarts, & Nelson, 2000). More recently, under an initiative set up by the second named author of this paper, as Coordinator, ten ICE corpora were POS-tagged using the CLAWS7 tagset. At the same time each was semantically tagged using the UCREL semantic analysis system (USAS) (Rayson, Piao, & Archer, 2004). SPICE-Ireland (Kirk, Kallen, Lowry, Rooney, & Mannion, 2011) has been pragmatically tagged with respect to speech acts (after the defining notions by Searle, 1969, 1976), utterance tags (which include declarative as well as interrogative polarity-marking sentence tags, but also vocatives and discourse markers used sentence- or utterance-finally), quotatives (citations of speech attributed to another speaker and supposedly rendered verbatim or directly) and, of course, discourse markers (such as well or kind of) (Kallen & Kirk, 2012; Kirk, 2016). A further initiative has been the adoption of standoff architecture using ANNIS, as developed by an independent project for ICE-Ireland (Kirk, 2017).

PROJECT COORDINATION
Since its inception, the ICE project has been identified not simply with the Coordinator (initially Greenbaum, and since 2001 the second named present author) but especially with the website, managed privately by Nelson, through which user licences, corpora and manuals have been distributed and much primary descriptive information about the project is made available. As an outcome of the review and subsequent discussion, the home base has moved to Zurich […] within the last ten years or so, these components have adopted XML format as standard, replacing the SGML format originally used (Wong, Cassidy, & Peters, 2011), and are benefitting from the technological aids for transcription or data management which are now available. Symposia were held and a set of papers appeared in the ICAME Journal 34 (April 2010). These papers already constitute a review of ICE at that stage and make innovative proposals (hence the designation 'ICE Age 2').
The move is to be welcomed for another reason: Zurich is the home base of the ICEonline project (https://es-iceonline.uzh.ch), developed over a number of years by Hans Martin Lehmann and Gerold Schneider (Lehmann, 2015; Lehmann & Schneider, 2012). Through a web-based front-end server, ICEonline currently provides online access to nine completed and a further six partially completed ICE corpora, individually or collectively, in any combination. Moreover, the corpora have been homogenized and regularized for markup and have also been POS-tagged with CLAWS7 and automatically parsed using a functional dependency grammar developed by Schneider (2008). ICEonline has also synchronized the biodata for its component corpora with the texts, insofar as those data are available, for mutual investigation, often of a sociolinguistic kind. With the component corpora in one place, ICEonline is the most fully developed version of the international corpus as a single composite and consolidated entity […] and support for users. At the same time, it is envisaged that the ICEonline project will continue to homogenize, regularize and generally tidy up both the text and the biodata of first generation corpora as they become completed, POS-tag and parse the corpora, make them available to the wider community, and provide support. In so doing, it will meet many of the strong desires for homogenization and regularization expressed in the review. What is envisaged is that the new arrangements will go some way towards satisfying the needs for access ('the consolidation of existing ICE resources into one unified super-corpus which can be accessed and searched at one online site') and dissemination (under licence, through download from a central server). The review found a strong desire for regular meetings and other communications, so it seems likely that the lead in arranging annual meetings in conjunction with ICAME conferences will be taken by the Zurich team.

SECOND GENERATION CORPORA
A motivation for the review was the desire for some of the earliest-completed national components to be replicated a generation or so later, as has happened, for instance, with the so-called Brown family of corpora of written English (McEnery & Hardie, 2012, pp. 97-100), that is, 'to compile parallel components for the first generation components which will allow diachronic comparison'. However, there was also a feeling that additional corpora or second generation corpora should not be at the expense of or 'as an alternative to completing and fully annotating the first generation corpora'. One respondent urges firmly that 'the first generation ICE corpora should, in the first instance, be all completed and available for comparison'. Appendix 2 provides completion dates, where known, for corpora still being compiled.
Enthusiasm for second generation (replication) corpora was indicated in many responses, such as: 'I believe it is already a good time to compile second generation components, most especially because diachronic analyses of Englishes have already become a trend in English linguistics'; 'They should be updated to allow diachronic research and for comparisons with newer ICE corpora (every 25 years would be great!)'; 'Updating the earliest components would be highly beneficial and open up many new avenues for research; it would be expedient to devise an updated corpus design first and to update the earliest components accordingly'; 'That is a fantastic idea and in every sense possible it should be structured as closely as possible as the original corpus to achieve diachronic comparability' . Another respondent comments that second generation corpora 'should be constructed so that there is as much comparability to first generation as possible. Of course, adjustments will have to be made (old text types don't exist anymore, new text types have come into being)' . And, indeed, some urge the inclusion of electronic texts and emerging text types that have increased in importance in recent years. More specifically, what ICE should do, urges one respondent, is to 'set a realistic date for a new suite of corpora, to enable recording of spoken texts to be conducted in as short as possible a time frame (say, 2021/2), so we get a Brown-like three-decade interval between the early 1990s and the next suite, for diachronic studies' . There emerged some confusion over the name 'second generation' . For the present authors, 'second generation' is intended to refer to a second, later corpus of the same national variety. As such, at present, there is, then, no second generation corpus. 
However, 'second generation' or 'Generation 2' was understood by some to refer to the set of 'ICE Age 2' corpora because, although first-time corpora, they are using updated technological methods including audio alignment as well as making some changes to the text categories being collected. One respondent comments: 'I think any new work now must be seen as a 'Generation 2' corpus, and there can be new rules for what constitutes a Generation 2 corpus with regard to annotation, text type, and regional-demographic-political definition. If a clear picture of ICE Generation 2 is developed, then there is no reason why there shouldn't be a Generation 2 corpus for the existing corpora'.
However, reservations were expressed, too. One respondent cautions that 'Again new text categories and sources are only desirable if the standard sampling frame can still be imposed on the material'. Another thinks that 'it's only really feasible if we agree on really easily accessible electronic texts. And if there's some automated way of getting transcription of spoken language. I don't see it as feasible if we try doing the data collection and transcription the way we did it before'. As another points out, this issue is answered: 'The new corpora should follow the compilation process of the "ICE-Age 2" corpora'. Specifically, the review suggested that the following components should be considered candidates for replication as second generation corpora: ICE-India and ICE-East Africa (because of nonconformities to the standard ICE protocol and numerous complaints); ICE-Australia (because the data were never generally released); ICE-USA (because the spoken component was never completed); ICE-GB (because a second generation corpus is greatly desired). As we received no responses about ICE-Namibia or ICE-Pakistan, we are unsure of their progress or status. Before second generation corpora get off the ground, we feel that it is important for serious consideration to be given to the types of text which should be gathered in a multilingual society where English (as an L2) is only one of a number of competing official languages, so that those texts may be taken as truly representative of the status and use of English in those countries. A related issue was raised: in L2 countries, was it really 'educated' speech that was being targeted or the use of the acrolectal variety (that is, the capturing of stable features and varieties of English use in that country)? There was also the question of how mixed codes should be handled.
Such discussion about which texts categories would best suit L2 countries and, indeed, for which L2 countries second generation corpora should be undertaken might make a suitable agenda for an ICE meeting in the near future. This topic is further developed in Section 5 below.
Separate from second generation corpora, although their guidelines may come to be followed, the review yielded suggestions for additional corpora, particularly more 'regional' or subnational corpora, such as Wales, California, Francophone Canada, or other parts of Africa. The separate proposal by Ozón, Ayafor, Green, and FitzGerald (2017) to include Cameroon Pidgin raises a challenge to the inference that the language to be found in the spoken public as well as private ICE categories amounts to a form of standardized English. Yet it is the local creolized/pidginized variety rather than a local standardized variety that, according to Mair (2013, p. 264), would be a better reflection of Jamaica within a 'world system of Englishes', and more likely to be borrowed from, than standardized Jamaican English. The issue of extending ICE corpora to Kachru's 'Expanding Circle', where English has no official status (as an EFL), was also tested in the questionnaire but did not receive much support, although the 'theoretical, methodological and empirical basis [for] the expansion of the ICE concept to the Expanding Circle' is vigorously presented as 'ICE Age 3' in Edwards (2016), Edwards and Laporte (2015) and now Edwards (2017).

Sampling periods
Whereas the sampling period for the initial set of completed corpora was held constant (1990-1994), teams subsequently started at different times over the next twenty years or so, as already mentioned, so that the sampling period is no longer parallel or identical among first generation corpora. There are limits to retrospective collection; and there is an obvious preference for the here and now, the immediate and contemporary. By stealth, temporal variation has crept in, and 'having exactly parallel sampling periods is no longer feasible'. One solution is to annotate the data with a timestamp and/or specify the dates or period of collection in the corpus handbook. As one respondent comments: 'for collecting/transcribing/digitizing speech, time is the challenge, since changes in spoken languages (especially colloquial registers), can happen very quickly, and at different rates in different places. If the dates of data collection are clear, then at least researchers know how to factor them in as a variable'. If second generation ICE corpora were restricted to (say) 2020-2023, it may be that some first generation corpora still being compiled would be more closely aligned to those dates than to the original collection period. For one respondent, 'an updated ICE-GB would be more comparable with certain L2 corpora currently being compiled than the original ICE-GB'.

Speaker education
An ICE corpus contains the speech of adults over the age of 18 with completed school education. A great many speakers turn out to be university or college graduates or students, as with ICE-Ireland. Moreover, it was felt that an ICE-corpus was a corpus of adult speech (having attained and completed school education), and not of children's speech, even if, as defended in one case, 'they are aspiring to education' . Whereas the review showed considerable agreement for the level of education to remain constant with regard to speaker choice, not least because of its importance for sociolinguistic purposes, the question was raised whether, in L2 countries, secondary education in English should always be stipulated? A solution might again be more flexibility, with the details about the level of education recorded as a variable.

Spoken and written texts
As for possible revisions to the present contents, the review showed an overwhelming desire for the present set of spoken and written text categories and their quantities to remain unaltered for second generation corpora, above all to ensure comparability. However, there was also a willingness to accept that a corpus need not fill text categories for which there are no data in its country. Some completed corpora have not filled some text categories where they have had difficulty gathering certain texts (legal texts; parliamentary debates; social and business letters; and so on), and this practice is acknowledged by the review under a principle of flexibility: certain text categories may simply not be available. 'Certain text types that are not as easily obtained in certain localities due to policies, restrictions and also language (that is the language used may not be English)'. Nevertheless, the review urges that, wherever possible, each text category should be filled with the full specified quota of texts. The issue of sampling was also raised. 'As far as possible, we should control for stable proportions of participant words by gender, age, education etc. between text categories in the same corpus'. There is probably no perfect answer to the sampling issue, but best efforts should be […] (2015), Mukherjee (2015), Nelson (2015) and Peters (2015), in their responses to Davies and Fuchs (2015). A list of possible electronic texts is presented in Appendix 3 for discussion. As a consequence of the flexibility principle, the total number of words may come to differ from corpus to corpus, and some corpora may become larger than the present word total. One million words is a convenient norm but, with the reality of missing texts as well as the prospect of additional texts, second generation components may come to have different total numbers of words. This should not be a problem, however, as all inter-corpus comparisons can be relativized. Besides, in line with the flexibility principle, some first generation corpora have not filled every category. As one respondent remarks: 'Enough techniques for frequency norming do exist to overcome any length-related issues, provided that they are used sensibly, that is that the norming factor is based on a common denominator'.
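The frequency norming mentioned by this respondent can be illustrated with a minimal sketch in Python. It assumes the common convention of expressing raw counts per million words; the component names and all counts below are invented purely for illustration and are not ICE figures.

```python
# Minimal sketch of frequency norming against a common denominator.
# All names and numbers are hypothetical, not taken from any ICE component.

NORMING_BASE = 1_000_000  # common denominator: occurrences per million words


def per_million(raw_count: int, corpus_words: int) -> float:
    """Return the frequency of a feature normalized to one million words."""
    if corpus_words <= 0:
        raise ValueError("corpus_words must be positive")
    return raw_count * NORMING_BASE / corpus_words


# Two hypothetical components of different sizes.
components = {
    "component_A": {"hits": 450, "words": 1_000_000},
    "component_B": {"hits": 380, "words": 800_000},
}

normalized = {
    name: per_million(data["hits"], data["words"])
    for name, data in components.items()
}

for name, value in normalized.items():
    print(f"{name}: {value:.1f} per million words")
```

In this invented case the smaller component has fewer raw hits but the higher normalized frequency, which is why comparisons across components of unequal size need the same norming base throughout.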

Markup and annotation
As important as the choice and amounts of written texts and spoken transcriptions are, two further properties are essential for analysis and exploitation: markup and annotation. In ICE terms, markup serves to indicate the identity of texts and speakers as well as many formal aspects of an utterance such as paragraphs and sentences, turns and utterances, overlaps, pauses, comments about paralanguage, editorial insertions, and any normalizations, if indeed included. Moreover, different practices for copy editing, anonymization, regularization, normalization and layout exist among the present constituent corpora (Nelson, 1991, 1995, 1996). Some harmonization has been undertaken for the components included in ICEonline. By contrast, annotations are additions to texts, in what from now on will be an XML format, which indicate linguistic properties that a particular item or structure might have. Annotations might indicate a part of speech, a syntactic structure or function, a semantic classification, a pragmatic function, tone movements, and so on. The review recommends that, wherever possible, corpus texts be annotated and post-edited with regard to POS tags as a minimum, either using the ICE tagset (ideally) or the by now de facto standard tagset, CLAWS7. ICEonline has made considerable advances, offering POS tagging with CLAWS7 and the Penn Treebank tagset as well as syntactic annotation (Lehmann & Schneider, 2012; Schneider, 2008). As one respondent remarked, the addition of 'more annotations […] will make any linguistic analysis easier'. However, there was some recognition that annotation often required time-consuming manual insertion, as happened with the prosodic and pragmatic annotations in SPICE-Ireland (Kirk, 2017).

Biodata
An essential constituent of a corpus is the provision of sociolinguistic information about speakers. Those biodata are best presented in a database to which transcriptions can be linked for analysis and exploitation. The review recommended that detailed biodata be made easily available in an electronic format and, wherever possible, be linked to the transcription for interactive searching and exploitation. Several respondents called for 'more standardized sociobiographical speaker annotation in all corpora (in the corpus or possibly as standoff annotation)'; for 'more uniform handbooks that go with the individual corpora'; for 'an improvement on the documentation of corpora […] to ensure cross-component comparability and the regional character of feature differences'; for 'more detailed guidelines in relation to speaker selection, text genres to be sampled, transcription, etc.'. Handbooks (in any format of dissemination) with detailed biodata would certainly be desirable for each component. Some handbooks do exist (Kallen & Kirk, 2008) but, for each corpus, where it exists, the handbook has a different format. However, in ICEonline, the biodata are encoded (again, where they were available) and, following that model, a template for biodata guidelines should be drawn up for completing first and second generation corpora. Biodata should be collated in a uniform series of electronic databases for downloading by end-users. Good biodata will be crucial for second generation corpora for highlighting generational differences as well as for profiling language change in apparent time, as shown by Hansen (2017) using data from ICE-Hong Kong.
'Looking ahead' , writes one respondent, 'if information on speaker education, year of production and text type differences is available and the factors are statistically controlled for when analysing features, these parallels are not maximally important; if they are not systematically controlled for in, for example, comparisons of frequencies of features across ICE components, it is paramount that the extralinguistic characteristics are as similar as possible so that frequency differences can clearly be attributed to regional (and not educational or text-type) variation' . As Edwards (2017) shows, good biodata drawing attention to educational and social backgrounds of speakers in Expanding Circle varieties is essential for comparisons with Outer as well as Inner Circle varieties; her approach certainly provides a suitable model.

Technology and software
As ICE is an entirely computer-based project, a major concern is the need now to revise and update the methodology, which was initially addressed by the 'ICE Age 2' initiative (Gut & Fuchs, 2017; ICAME Journal 34, 2010). For one respondent, we need to 'update to modern technologies and data format'. For another, we need to 'bring the format up to date, that is convert it to XML'. For yet another, we need to 'establish modern corpus compilation standards - some teams still use text processors (for example Microsoft Word) to type up the transcriptions of spoken material. This is the technological standard of the 1980s. Such methods are much less efficient and more error-prone than modern software for corpus compilation. We need a dialogue among ICE teams that will result in recommendations of what software and corpus compilation methods should be used because the current approach (of much diversity in these methods) negatively impacts the comparability of the corpora'. Hopefully that dialogue will come to be held at future annual meetings […] and Pacx as a corpus management system (Gut & Fuchs, 2017; Wunder, Voormann, & Gut, 2010). The use of Pacx creates XML encoding for structural markup reliably and is particularly suitable for annotation, including the linking of socio-biographical information with the transcription. Their standoff architectures, which enable additions to the annotation to be incorporated easily, are similar to ANNIS, into which ICE-Ireland and SPICE-Ireland have been converted (Kirk, 2017). Some urged the adoption of new computer software and techniques, including the transfer of annotation from SGML to XML; the development of procedures for corpus compilation and annotation software; and the development of software with filters for sociobiographical and other parameters.
Also urged in this connection was that a way should be found to align the biodata with the texts, so that social variables (such as age and sex) can be used as search arguments in constructing queries (Hansen, 2017). One respondent called for 'the development of a single, convenient tool with which to search all the speaker and other situational factors across ICE-corpora, ideally online […] and a search tool to exploit the annotation for all ICE corpora taken together'. This has been done for ICE-GB and, again, in ICEonline, but in most instances the corpus and biodata are stored as separate (usually Excel) files, a relationship which Pacx has been devised to handle (Gut & Fuchs, 2017).
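The alignment of biodata with texts described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, field names, speaker and text identifiers, and the example utterances are all hypothetical, and the sketch does not reproduce how ICE-GB, ICEonline or Pacx actually store or query their data.

```python
# Hedged sketch: joining speaker biodata to transcribed utterances so that
# social variables can serve as search arguments. All identifiers, field
# names and sample data are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class Speaker:
    speaker_id: str
    age_band: str
    sex: str
    education: str


@dataclass(frozen=True)
class Utterance:
    text_id: str
    speaker_id: str
    words: str


def select_utterances(utterances, speakers, **criteria):
    """Return utterances whose speaker matches every given biodata value."""
    by_id = {s.speaker_id: s for s in speakers}
    selected = []
    for utt in utterances:
        speaker = by_id.get(utt.speaker_id)
        if speaker is None:
            continue  # no biodata recorded for this speaker
        if all(getattr(speaker, key) == value for key, value in criteria.items()):
            selected.append(utt)
    return selected


speakers = [
    Speaker("SPK1", "26-45", "f", "university"),
    Speaker("SPK2", "18-25", "m", "secondary"),
]
utterances = [
    Utterance("TEXT-001", "SPK1", "well I think so"),
    Utterance("TEXT-001", "SPK2", "kind of yeah"),
    Utterance("TEXT-002", "SPK3", "no biodata for me"),
]

female_graduates = select_utterances(
    utterances, speakers, sex="f", education="university"
)
print([u.words for u in female_graduates])
```

Keeping the biodata in their own table and joining on a speaker identifier mirrors the standoff idea: new social variables can be added to the speaker records without touching the transcriptions.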

COMPARABILITY
The review questionnaire posed the following question: How important is it that the ICE corpora be exactly parallel to each other in terms of text types, sampling periods, speaker education, and so on? One of the strongest messages to emerge from the review was the crucial need for homogenization and regularization of all the major elements in each component: text format, markup, annotation and the biodata; another was for the consolidation of the entire project. Those not inconsiderable tasks have already been achieved, through the painstaking checking and editing carried out over many years largely by Hans Martin Lehmann, for those corpus texts which have hitherto come to be included in ICEonline. Nevertheless, it behoves corpora still being compiled as well as second generation corpora to comply with the guidelines and protocols as fully and as accurately as possible, to ensure maximum harmonization and regularization before components ever reach the server in Zurich and are incorporated into ICEonline. Completion of first generation corpora has featured as a major priority in the review (as presented above); as already mentioned, several compilers have indeed indicated that they expect to have completed their components within the next couple of years (see Appendix 2).
The point of the above question about strict parallelity was affirmed very strongly: 'the basic idea and great advantage of ICE'. Similar views were plentiful: 'they [the corpora] should be exactly parallel (that was the main point when ICE was set up)'; that 'comparability should be the explicit goal'; that 'keeping the external or situational variables as stable as possible ensures the comparability of the corpora - one of the key strengths of the ICE corpora'; that 'without the core sampling frame being accessible for each regional component the whole enterprise is put in danger'; that [it is] 'very important - otherwise full comparability cannot be made'; that 'it's crucial, otherwise there is no point in the project'. 'It's very important, and should be retained as an overall objective. But we must be flexible enough to adapt to local circumstances (as we have been)'. More specifically, it was urged that: '[i]t is paramount that the extralinguistic characteristics are as similar as possible so that frequency differences can clearly be attributed to regional (and not educational or text-type) variation'. Other voices suggested that the comparability should be more 'aspirational' than exactly or relatively parallel. Rather, the corpora should be 'as parallel as possible', 'as close to a common standard as possible, with local adjustments where necessary', for 'the linguistic reality is of course different in different countries' and there arises 'a need to make compromises'. 'It is very important, insofar as it is achievable'. 'We should actively strive to maintain a measure of comparability though. They aren't exactly parallel, except a number (most?) do have the same text types'.
As another comments: 'though strict uniformity will never be possible, it is an essential feature […]'. However, following on from the discussion about second generation corpora in Section 4, deeper problems were also raised, such as ecological validity: 'What are authentic English-medium genres in multilingual English cultures?'
Whereas ICE text categories are well representative of language distribution and use in L1 countries, it may be that […] Expanding Circle (EFL) as well as Outer Circle (ESL) countries, and which will be key to the planning of text categories for second generation corpora. As previously mentioned, choice of text categories will inevitably determine overall corpus size.
A further question was more philosophical: 'How can ICE reckon with criticisms about how strictly delineating varieties stem from a (possibly now outmoded) view of languages and language varieties as bounded and discrete?' This issue almost certainly relates to the Kachruvian model and the Quirkian notion of a 'monochrome international standard language' with only local deviations. Saraceni (2015, p. 4) argues that the world Englishes framework is 'lagging behind' sociolinguistic developments of globalization in the twenty-first century which are better explained in terms of 'super-diversity', 'hybridity', 'translanguaging' and 'metrolingualism'. As Saraceni (2015, pp. 132-134) […] language worlds' - a shift away from analysing varieties of English as structural sets (such as in Kortmann and Schneider, 2004) or 'decontextualized linguistic systems' (Mair, 2013, p. 254) to an approach which grapples with understanding 'language borders and of how people manipulate them creatively' (Mair, 2013, p. 264). Abstracting out from this dynamic, communicative approach calls to mind Mair's (2013, p. 264) new theoretical model of a 'world system of standard and non-standard Englishes', claimed by him as 'better equipped to handle uses of English in domains beyond the post-colonial nation state' (Mair, 2013, p. 253). Exactly how second generation ICE corpora should reflect English in Outer Circle countries has thus become a much more challenging issue. Besides, from a study of 'the use of pidgins and creoles in web forums serving West African and Caribbean diasporas' (Mair, 2013, p. 253), much more is known about Outer Circle varieties, their differences from standard English no longer to be regarded as substratally deviant but rather as lexifications from local pidgins and creoles in a multilingual community where the use of English(es) is but a sociolinguistically-significant choice among competing languages (Mair, 2013, 2015).
How far are speakers in an Outer Circle ICE-corpus speakers of an acrolectal variety approximating to the standardized language (as envisaged at the outset of ICE) or rather simply a local mesolectal or basilectal/pidgin variety (Deuber, 2014; Mukherjee, 2015), mindful that local standardization may not have taken place (Schneider, 2007)? Add to that the ever-increasing role of English in the Expanding Circle countries, as already acknowledged, the blurring of status between EFL and ESL countries, and

CONCLUSION
This article has presented the main outcomes of the review of the ICE project which we undertook in 2016-2017 and has begun to chart the gradual evolutionary way ahead, as the project enters its next 30-year phase, with a new home base in Zurich. The review uncovered many recommendations for second generation corpora, some of which (in respect of the actual corpus data, their markup and annotation, and also the biodata) may yet come to apply to unfinished first generation corpora as well. Other issues such as copyright and research ethics could not be touched upon in this report. The biggest challenges facing second generation corpora strike us as these: to square the desire for strict parallelity with first generation corpora with the need to sample genres of texts which appropriately and adequately represent the present uses and functions of English in the national variety in question, particularly in multilingual contexts; and to consider the inclusion of Expanding Circle varieties where the public uses and speaker profiles might more readily match Inner Circle varieties than Outer Circle varieties. The reality is that there are several competing and complementary objectives, which are not, however, exclusive and may be pursued simultaneously. Completion of first generation corpora certainly remains a high priority; but annotation and alignment as well as computer software figure as priorities, too, as do the inclusion of electronic texts, good biodata, good documentation, and the much-desired regular project meetings. As one seasoned respondent writes, 'All these objectives are important, but some are more important than others'. And another: 'All this is desirable, if and only if it does not compromise on the main objective of producing comparative descriptive studies'. Here, however, is not the place to review the very many studies and uses to which ICE corpora have been put.
In the 30 years since ICE was first proposed, the project has come a remarkably long way and its progress has certainly been, by any standards, extraordinarily impressive. But much more remains to be done, and could and should be done, not least by harnessing to advantage the massive technological developments of recent years. The compilation of an ICE corpus has proven a complex, demanding and responsible undertaking. The project, with currently 27 component corpora, has contributed immeasurably to research and to the advancement of knowledge on world Englishes. As Loureiro-Porto (2017, p. 448) rightly remarks: 'the validity of ICE is wholly unquestioned'. The need for ongoing continuity, effective coordination and directive leadership has never been greater.

ACKNOWLEDGEMENTS
We are deeply grateful to each of the many respondents of our questionnaire and of our Interim Report.
[…] (2017), Gut and Fuchs (2017), Kirk (2017), Loureiro-Porto (2017), with an introductory overview by Nelson (2017) himself.
2 Early references to ICE are Greenbaum (1988, 1990a, 1990b, 1990c, 1991a, 1991b, 1992, 1993, 1994, 1996a, 1996b) and Nelson (1991, 1995, 1996, 2002a, 2002b).
6 One respondent urged: 'We need to find an optimal route (or routes) for disseminating corpora, so that the huge work carried out by ICE teams in the past can be used by the corpus linguistics community. If we do not do this, we will be seen as largely irrelevant in the world of "big data". A dissemination project may include parallel strands: publishing on an online platform possibly tagged and published; streamlining access to downloadable data with tools; and exemplification and publicity to motivate their use'. ICEonline will largely fulfil this desire.
[…] what follows is a proposal of our own. Electronic texts may be subdivided into four categories: extended-written, written-like, spoken-like, and multi-media.

A3.1 Extended written texts
Extended written texts are the written texts of websites of various kinds, often also containing multi-media: (public) institutional/administrative websites; corporate sites; commercial sites; cultural websites; clubs and societies websites; and so on. These are mostly equivalent to monologic, informational texts.

A3.2 Written-like in mode, form and function, unlikely to have been professionally edited: the electronic medium begat the text category
There could be two sub-categories: written-formal and written-informal. Written-formal approximates to formal uses of writing (for example informational, transactional) as well as to norms of writing in such contexts. Written-informal approximates to social communication for the purpose of maintaining good social relationships, using colloquial, oral and other linguistic devices marking informality.
E-mails are asynchronous written messages between people, sent to known or identifiable recipients using digital devices such as computers, tablets and mobile phones. They are generally similar in nature to many formal types of written communication (such as letters and memos), but, as they are also used for informal communication, they may not always follow the same rules of formality in terms of capitalization, spelling, and so on, and may thus also have a more 'oral' character. In terms of corpus processing, it may be necessary to identify and remove quoted content repeated from the original message. One problem with doing so, however, is that a writer may use such quoted material in an attempt to create 'synchronicity' with the original sender by referring to all or parts of the original message verbatim, rather than by paraphrasing what is being responded to.
Texts or SMSs (Short Message Service) are asynchronous written messages using mobile telephony systems. The activity is often referred to as 'texting'. Due to limited space and potential cost, these texts exhibit cost-saving features, such as employing 'telegraph style', acronyms or other types of 'abbreviations' (for example LoL, 4 instead of for, Ur for your, and btw for 'by the way').
Tweets are asynchronous written messages using the social networking service Twitter, where users post and interact with written messages, each tweet being restricted to 140 characters. Tweets exhibit space-saving features similar to those of SMSs, for similar reasons.
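The processing step mentioned for e-mails, identifying quoted content repeated from the original message, might be sketched as follows. This is a minimal Python illustration under a stated assumption: quoted lines are taken to be those beginning with the conventional `>` prefix, which is common in plain-text e-mail but by no means universal. The function name `split_quoted` and the sample message are invented. Because quoted material may be used deliberately to create 'synchronicity', the sketch separates the quoted lines rather than discarding them, so that a compiler can decide case by case.

```python
def split_quoted(body, prefix=">"):
    """Separate a writer's own lines from lines quoted from an earlier message.

    Assumption: quoted lines start with `prefix` (after optional indentation).
    Returns (new_text, quoted_text); nothing is thrown away.
    """
    new_lines, quoted_lines = [], []
    for line in body.splitlines():
        if line.lstrip().startswith(prefix):
            quoted_lines.append(line)
        else:
            new_lines.append(line)
    return "\n".join(new_lines).strip(), "\n".join(quoted_lines)


# Invented sample reply with interleaved quotations.
message = (
    "Hi Ann,\n"
    "> Can we meet on Friday?\n"
    "Yes, Friday suits me.\n"
    "> Shall I book a room?\n"
    "Please do.\n"
)

new_text, quoted_text = split_quoted(message)
print(new_text)
print(quoted_text)
```

In this invented reply the writer's answers only make sense against the quoted questions, which illustrates why wholesale removal of quoted content may distort the interactional structure of the text.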

A3.3 Spoken-like in mode, form and function: monologic and dialogic text categories extended to the electronic medium
Chat may refer to short written messages conveyed over the Internet on a dialogic basis between a sender and receiver.
Chat messages are generally short in order to enable other participants to respond quickly, thereby creating a feeling similar to a spoken conversation. Although chats are designed to allow synchronous communication, asynchronicity may often be introduced, due to one 'interlocutor' typing faster than another, with the other not responding quickly enough. In other words, even though chats are designed to produce 'adjacency pairs', there is often no such regularity between initiation and response. Colloquial, informal features of spontaneous discourse abound, as well as acronyms and other cost-saving devices of the electronic medium.
Discussion groups are used by individuals to exchange written comments in an interactive, dialogic, but asynchronous, way. They are often separated into individual threads, so that topics generally remain consistent. As contributions may contain considered responses, more typical of written language, they may have correspondingly fewer colloquial features. Regarding corpus processing, though, a similar issue with repeated/quoted content may exist.

A3.4 Multi-media: because multi-media, the electronic medium begat the text categories, crucially the linkage of websites and video with text or speech
Facebook postings are asynchronously exchanged written and often visual messages between people online using this particular social media/social networking service.
Skype (skypeing) is an online application which enables direct dialogic spoken, written (chat) and video exchanges of any length, and also includes options for leaving voice messages.
Blogs (< weblogs) are written messages, typically either a narrative conveying personal information or a commentary or position-statement articulating views about a topic of current public interest. They are usually written in an informal, colloquial style, issued relatively frequently, sometimes daily, and usually dated. Blogs may include images or illustrations or embed video material. Some blogs enable readers to respond or comment. Blogs are often thought of as the most frequent or typical use of Internet communication.
Vlogs (< video blog) and podcasts (a portmanteau of iPod and broadcast) are spoken forms of blogs and may embed video or have supporting text, images, and other metadata.
Citizen broadcasts are spoken video broadcasts transmitted by individuals across the internet. They may be monologic or involve interaction between the participant speakers. They are not addressed to a particular audience.
Text length: Electronic texts are relatively short, so that many 2,000-word 'texts' will be composite texts. There are many views about the length of individual texts with regard to balance and representativeness, but we feel that the 2,000-word sample, whether only an excerpt in the case of a much longer work or a collation of (say) text messages (SMSs) or tweets, remains useful as a standard, comparable unit (certainly over 'whole' texts, regardless of length, as advocated by some).
See Table A2. Having said that, if we are applying the flexibility principle to the inclusion of text categories and to the number of texts in a category, then it follows that the principle should also apply to text length. Instead of creating composite texts, for each electronic text category, the total number of words (say 20,000 words) could simply comprise all the individual texts which make up that total: if an SMS contains an average of 20 words, it would simply be a question of collecting 1,000 SMSs individually, without recourse to ad hoc composite groupings of c. 2,000 words per grouping.
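The two sampling options discussed above can be compared in a short Python sketch. It is illustrative only: the function names are invented, word counts are approximated by splitting on whitespace, and the greedy packing rule is one possible way of forming composite texts, not a procedure prescribed by ICE.

```python
def composite_groups(messages, target=2000):
    """Greedily pack short messages into composite 'texts' of at most `target` words."""
    groups, current, count = [], [], 0
    for msg in messages:
        n = len(msg.split())  # crude whitespace-based word count
        if current and count + n > target:
            groups.append(current)
            current, count = [], 0
        current.append(msg)
        count += n
    if current:
        groups.append(current)
    return groups


def messages_needed(total_words, avg_words):
    """Number of individual messages required to reach `total_words` (rounded up)."""
    return -(-total_words // avg_words)  # ceiling division


# Invented data: 1,000 dummy SMSs of exactly 20 words each (20,000 words in all).
sms = ["word " * 20] * 1000

groups = composite_groups(sms)                 # option 1: composite 2,000-word texts
individual = messages_needed(20_000, 20)       # option 2: count individual messages

print(len(groups), len(groups[0]), individual)
```

With these invented figures, the same 20,000 words are represented either as ten composite texts of 100 messages each or as 1,000 individually stored messages; the second option avoids arbitrary grouping decisions at the cost of a much larger number of text units to document.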