ICC: a new resource for corpus-based contrastive linguistics

An International Comparable Corpus

John Kirk & Anna Cermáková

There is broad agreement that the International Corpus of English (ICE) project has been highly successful because it has facilitated numerous comparisons of L1 and L2 national varieties of English worldwide. Those comparisons encompass the lexical and morpho-syntactic structural levels, as well as discourse types and written registers (cf. e.g. Greenbaum 1996; Hundt & Gut 2012; Aarts et al. 2013; and the papers in the Special Issues of World Englishes vol. 15(1) (1996) and vol. 36(3) (2017), to mention but a few key studies). No small part of this success rests on the fact that, for each national variety, a set of spoken and written text categories has been chosen that is deemed representative of that variety: 15 spoken discourse situations (totalling 60%) and 17 written registers (totalling 40%). A major review of the ICE project has been undertaken, and its results and outcomes are to be agreed upon at ICAME in Prague in May 2017. It seems likely that the text categories will be expanded to include electronic texts and that some flexibility in the choice of text categories will become possible.

At the same time, spoken and/or written corpora have been compiled for other languages (cf. the list of non-English corpora in e.g. O’Keeffe et al. (2007: 294-296) or the non-English corpora discussed in Xiao (2008) or Ostler (2008)). Xiao makes comparisons with corpora of English: for instance, the Polish National Corpus replicates the structure of the British National Corpus (Xiao 2008: 387), as does the Czech National Corpus (Cermák 1997), which contains spoken texts similar to those of the demographically sampled component of the BNC (Xiao 2008: 388-389, Cermák 2009). However, no corpus of another language appears to have been compiled with the range and balance of text categories, and the quantities of texts, found in an ICE corpus. The existing corpora in various languages have generally been compiled on very different principles and do not allow direct cross-linguistic contrastive comparisons.

Corpus-based contrastive studies are a growing research area, and researchers have voiced the need for a more rigorous analytical framework (e.g. Aijmer et al. 1996, Altenberg & Granger 2002, Marzo et al. 2012, Aijmer & Altenberg 2013, Altenberg & Aijmer 2013). The majority of contrastive studies are carried out on two languages only, one of the reasons being the lack of comparable data. Contrastive analysis relies on two types of data (Granger 2003): translation (parallel) corpora and comparable corpora (cf. McEnery & Xiao 2007). While translation corpora contain original texts and their translations, comparable corpora contain original texts in two or more languages that have been selected according to comparable criteria for text categories and for the quantity of text in each category. Examples are the Lancaster Corpus of Mandarin Chinese, which uses the same sampling frame as the Lancaster/Oslo-Bergen Corpus, and the Aarhus Corpus of Contract Law (both cited in McEnery & Hardie 2012: 19; cf. also e.g. Sharoff et al. 2014). Comparable corpora are an essential data source for contrastive analyses, since translation corpora are usually limited in the text types they cover (Johansson 2007).

What we are introducing is not a parallel translation corpus such as the English-Swedish Parallel Corpus, the English-Norwegian Parallel Corpus (ENPC), or the InterCorp corpus; rather, it is an International Comparable Corpus (ICC – pronounced to rhyme with lick), open to as many languages as wish to come on board. Phase I will start with national, standard(ised) European languages. Interest in collaborating on this project has been expressed for the following languages: German, French, Czech, Slovak, Polish, Finnish, Norwegian, Swedish, and Scottish Gaelic. The first collaborative meeting is to be held in June 2017 in Prague.

The ultimate goal of this project is to facilitate contrastive studies between English and other languages using highly comparable datasets of spoken, written and probably electronic registers. A striking and unique feature of each new corpus will be its substantial spoken component, at present comprising 600,000 words (or 60% of the current total). The revised ICE format, to be adopted here, is likely to safeguard this large proportion of spoken texts but will include electronic texts as well. Such provision of spoken data across 15 or so discourse situations for contrastive analysis will be unprecedented and invaluable for future research. It will also allow much-needed cross-linguistic comparisons of spoken language. Further investigations may include the area of pragmatics, such as pragmatic discourse markers (cf. e.g. Aijmer & Vandenbergen 2006).

The proposed comparable corpus ICC will add substantially to existing contrastive corpus-based research (e.g. studies of English-German contrasts, such as König & Gast (2012), or English-Norwegian contrasts, such as Ebeling & Ebeling (2013)), and will allow replicability and comparisons with other languages. In other words, a corpus-based empirical approach to each pair of languages contrasted, with spin-offs for the others, will become possible. A further application will almost certainly be possible in bilingual lexicography (as shown by the papers in Sharoff et al. 2013).

Following the launch of ICE Phase II at the ICAME conference in Prague in May 2017, the first ICC meeting will take place in Prague at the Institute of the Czech National Corpus, Charles University, in June 2017. Subsequently, we will introduce this exciting new international, multilingual corpus project at Corpus Linguistics 2017 in Birmingham in July 2017, where we will present some of the issues and challenges it raises as well as the solutions being adopted.


