Corpus, concordance and Cups of COCA *


A few years back, Hugh Dellar posted a useful and provocative piece on his blog entitled ‘What have corpora ever done for us?’ For those interested, the post is still online, even though that particular blog is no longer active.

I replied at the time (as did several others), suggesting that while corpora do not provide answers to every aspect of teaching, three things they have done are as follows:

1. Corpora have provided us with excellent learner dictionaries, packed full of real examples and useful information about frequency, meaning and usage.
2. Corpora have provided learners with an initial list of high frequency lexis on which they can focus at the early stages of learning. Without this, many learners (and teachers and textbooks) may spend a lot of time learning/teaching obscure items which are of very little use.
3. Corpora have provided teachers and researchers with real evidence about how language is used beyond sentence level. This is something which should at least inform what is taught in classrooms and must be better than guesswork. It is surely worth knowing that commonly taught aspects of language such as modal verbs are not the only way to express modality, and how such modality operates in discourse. As one example, in their corpus-informed grammar, Carter and McCarthy (2006: 678) give examples of modal adverbs such as ‘possibly’ as another way of expressing modality.

Despite such advantages, working with pre- and in-service teachers has shown me that even these simple benefits are not always covered in training and, as a result, teachers can be unsure of how to start using corpora to inform their teaching. In this post, I want to suggest some ways in which teachers could apply the above points, before introducing a book which, in my view at least, is a helpful resource for teachers. So, how can corpora help? Here are some suggestions, related to the above points:

1. Learner dictionaries are now all based on corpora and allow us to check frequency quickly and simply. Many follow an entry with codes such as W1, W2 or S1, S2, which show that a word is among the first or second thousand most frequent written or spoken words in the corpus used. Some also use stars to show frequency or print the most frequent words in red. Examples of both can be found in the Macmillan Dictionary and the Longman Dictionary of Contemporary English. This at least gives teachers and learners some guidance about whether they need to spend time on teaching or learning items: if an item is of high frequency, it is likely to be worth more of our time than something which is of low frequency. Dictionaries also provide examples of items in context, with their sample sentences drawn from the corpus they have used. With some simple editing, these samples can easily be used to contextualise language in familiar tasks such as matching activities, so that teachers do not feel they have to invent examples for themselves. Target items can be highlighted within the sentence and matched to meanings, also taken from the dictionary. Here is an example, adapted from the Longman Dictionary of Contemporary English:
Target item = crash diet
Sample sentence = A crash diet will leave you hungry, you will eat more and you will not really lose weight.
Meaning = an attempt to lose a lot of weight quickly by strictly limiting how much you eat

2. A lot of researchers have argued that the two thousand most frequent words need to be learnt quickly, as they make up a large proportion of the texts which students hear and read, as well as of what they will need to produce (see O’Keeffe, McCarthy and Carter 2007 for more discussion of this point). On a simple level, lists of these words can easily be consulted as a means of checking that there is ample coverage of them on any course. Here, for example, are the lists made for the British National Corpus, provided by Leech, Rayson and Wilson (2001).
Lextutor also provides a simple means of checking texts for these words. Simply copy and paste a text into the ‘vocab profile’ section of the site and choose (as an example) BNC 1-20k, and the software will highlight which words come from which frequency band. Related to this, Martinez and Schmitt (2012) have produced the Phrasal Expressions list of spoken and written lexical chunks, which are grouped in order of the first and second thousand most frequent words from the British National Corpus. You can find the list (alongside many other useful articles and ideas) on Norbert Schmitt’s website.
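
For readers curious about what a vocabulary profiler does behind the scenes, here is a minimal Python sketch of the general idea. It is not Lextutor's own method. The band labels (K1, K2), the function names and, above all, the tiny word lists are invented for illustration; a real profile would load full frequency lists such as those from the British National Corpus.

```python
import re
from collections import Counter

# Invented placeholder bands for the demo only. These are NOT real
# BNC frequency lists; swap in full lists before drawing conclusions.
BANDS = {
    "K1": {"a", "will", "leave", "you", "eat", "more", "and", "not", "really", "lose"},
    "K2": {"diet", "hungry", "weight"},
}


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def profile(text, bands):
    """Count how many tokens of the text fall into each frequency band."""
    counts = Counter()
    for token in tokenize(text):
        # First band whose list contains the token, otherwise "off-list".
        label = next((name for name, words in bands.items() if token in words), "off-list")
        counts[label] += 1
    return counts


def coverage(counts):
    """Convert raw band counts into percentages of all tokens."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {band: round(100 * n / total, 1) for band, n in counts.items()}


# The sample sentence for 'crash diet' used earlier in this post.
sentence = ("A crash diet will leave you hungry, you will eat more "
            "and you will not really lose weight")
counts = profile(sentence, BANDS)
print(counts)
print(coverage(counts))
```

Any word that appears in none of the lists is reported as 'off-list', which is roughly how a profiler flags items that may be too rare to prioritise.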

3. Many corpora are now open access and it is relatively quick for teachers to check how items are used in spoken and written contexts. If we take the example of modality, a search in the conversation sub-corpus of the BYU-BNC (Davies 2004-) is quick and simple to undertake. Choose the ‘s_conv’ option from ‘sections’ and then type the form being searched for into the search bar. If we take the highly frequent ‘should’ as a simple example, we can quickly see it has 4379 occurrences in this section of the corpus, or 1,091.4 occurrences per million words. Another search (* should *) tells us which items most often come before and after this word; in this example, the most common pattern is ‘I should think’. When we click on these items and look at the concordance lines, we can quickly see that the most common use in this conversation sub-corpus is to express, with a fair degree of certainty, what we think to be true, rather than to give advice with ‘you should’ (the function which is most often taught first for this item). Here are some of the sample concordance lines, which I have edited slightly.
1. (SP:PS029) Eh? (SP:PS02E) I should think you were getting rea–, a real panic then. (SP:PS029) When? (SP:P
2. and (pause) weighs nearly (SP:PS02H) Mm. (SP:PS02G) she must weight fourteen, fifteen stone I should think.
3. it must be some (SP:PS02H) Can’t she? (SP:PS02G) (unclear) Her latest beau I should think, I don’t know. (SP:PS02H) Maybe. Maybe maybe. (pause) (SP:PS02G)
4. (SP:PS02H) Chucked all the duff ones out (unclear) (SP:PS02G) At least, yeah, I should think nine tenths of them would go, very (unclear) (pause) there’s yesterday’s
5. coming up, must be a year now mustn’t it? (SP:PS6TB) Yeah I should think it is that. (SP:PS02G) No I can’t remember exactly what month.

Samples such as these can be used by a teacher to inform their teaching of this item or (with some editing) as exercises in class for students to analyse the item. We might ask questions about the above samples such as: a) which words come immediately before and immediately after ‘should’ in these examples? b) is ‘should’ used to give advice here or to make a guess about what the speaker thinks is true? c) why might a speaker choose ‘I should think’ and not ‘I think’? Such exercises are nothing new: Johns first suggested activities of this type (and many more excellent ideas) back in 1991 in his proposals for Data Driven Learning (DDL).
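
Teachers who are comfortable with a little scripting could answer question a) automatically. The Python sketch below is only an illustration and is not how the BYU interface works. It takes the five sample lines from this post (trimmed slightly), strips the speaker codes, and lists the words on either side of ‘should’. The function names are my own, and the size of the conversation section is not taken from the corpus documentation: it is simply worked back from the two figures quoted above (4379 hits, 1,091.4 per million).

```python
import re
from collections import Counter

# Sample concordance lines from this post, trimmed slightly.
LINES = [
    "(SP:PS029) Eh? (SP:PS02E) I should think you were getting rea–, a real panic then. (SP:PS029) When?",
    "and (pause) weighs nearly (SP:PS02H) Mm. (SP:PS02G) she must weight fourteen, fifteen stone I should think.",
    "it must be some (SP:PS02H) Can’t she? (SP:PS02G) (unclear) Her latest beau I should think, I don’t know.",
    "(SP:PS02H) Chucked all the duff ones out (unclear) (SP:PS02G) At least, yeah, I should think nine tenths of them would go",
    "coming up, must be a year now mustn’t it? (SP:PS6TB) Yeah I should think it is that.",
]

# Speaker codes and transcription notes such as (pause) or (unclear).
TAG = re.compile(r"\((?:SP:\w+|pause|unclear)\)")


def tokens(line):
    """Remove tags, lower-case the line and split it into word tokens."""
    cleaned = TAG.sub(" ", line)
    return re.findall(r"[a-z’']+", cleaned.lower())


def kwic(lines, node, width=4):
    """Return (left context, node, right context) for every hit of the node word."""
    rows = []
    for line in lines:
        toks = tokens(line)
        for i, tok in enumerate(toks):
            if tok == node:
                left = " ".join(toks[max(0, i - width):i])
                right = " ".join(toks[i + 1:i + 1 + width])
                rows.append((left, tok, right))
    return rows


def neighbours(lines, node):
    """Count the words found immediately before and after the node word."""
    before, after = Counter(), Counter()
    for left, _, right in kwic(lines, node, width=1):
        if left:
            before[left] += 1
        if right:
            after[right] += 1
    return before, after


def per_million(count, corpus_words):
    """Normalise a raw count to occurrences per million words."""
    return count / corpus_words * 1_000_000


for left, node, right in kwic(LINES, "should"):
    print(f"{left:>30}  {node.upper()}  {right}")

before, after = neighbours(LINES, "should")
print(before.most_common(3), after.most_common(3))

# Section size implied by the two figures in the post (an inference, not an official figure).
implied_size = 4379 / 1091.4 * 1_000_000
print(round(implied_size))
```

Note that this simple version ignores speaker changes, so a word from one turn can be counted as the neighbour of a word in the next; for classroom purposes that is usually a tolerable simplification.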

Teachers preferring a more step-by-step approach to using corpora may well seek some kind of published guide. An excellent example of such a book has recently been published by Mura Nava, entitled ‘Quick cups of COCA’, a free e-book which works with the Corpus of Contemporary American English (Davies 2008-). Nava takes the reader through a series of ways to use the corpus to inform teaching decisions in a user-friendly way, and the examples are usefully grounded in his own teaching. Here is a short sample from the book (Nava 2016: 3):

A student in my TOEIC class, when we were looking at adjective endings -ED and -ING, asked what was the difference between “unmotivated” and “demotivated”. I replied that demotivated describes someone after some experience whereas unmotivated is a general state of being. I wasn’t too sure if that was sufficient so whilst the class was engaged in the following part of the lesson I used the wildcard asterisk, see image above. I found out that the instances of “demotivated” were pretty low compared to “unmotivated”. I only transmitted the frequency information to the said student. If I had more time and a projector hooked up to the computer I would have looked at the example sentences in each case.

Amongst the many useful aspects of this book, readers can click on search images such as the one above and find the actual results to check understanding. The book also includes a range of different types of search, which allows for a helpful and varied exploration of the corpus and is highly likely to include the types of searches teachers may need to undertake. It is available for free download or reading online here. Incidentally, Mura also published a response to the original Hugh Dellar post here.

Overall then, it is fairly simple to use corpora to inform what is taught in classrooms, and they give us valuable evidence about language form and function. Given the recent calls more generally for evidence-based practice in ELT (for example, Mayne 2014), perhaps it is now time they became a standard part of teacher training.

*The title of this post is an admittedly poor play on ‘Corpus, concordance, collocation’, a seminal work in Corpus Linguistics by John Sinclair. Well worth a read – see references below.


Carter, R. and McCarthy, M. (2006). Cambridge grammar of English: A comprehensive guide to spoken and written grammar and usage. Cambridge: Cambridge University Press.

Davies, M. (2004-). BYU-BNC. (Based on the British National Corpus from Oxford University Press). Available online at

Davies, M. (2008-). The Corpus of Contemporary American English (COCA): 520 million words, 1990-present. Available online at

Johns, T. (1991). Should you be persuaded – two samples of data-driven learning materials. In Johns, T. and King, P. (eds) (1991). Classroom Concordancing. ELR Journal, 4, 1-16.

Leech, G., Rayson, P. and Wilson, A. (2001). Word frequencies in written and spoken English. London: Longman.

Martinez, R. and Schmitt, N. (2012). A phrasal expressions list. Applied Linguistics, 33 (3), 299-320.

Mayne, R. (2014). A guide to pseudo-science in English language teaching. IATEFL Conference presentation. Available online at

Nava, M. (2013). This corpora-bashing parrot has ceased to be. Available online at

Nava, M. (2016). Quick Cups of COCA. Available online at

O’Keeffe, A., McCarthy, M. and Carter, R. (2007). From Corpus to Classroom: language use and language teaching. Cambridge: Cambridge University Press.

Sinclair, J.M. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.


About chrisjones70

Senior Lecturer in TESOL/Applied Linguistics at the University of Liverpool. All views and posts here are my own. Twitter : @ELTResearch
