Corpus Linguistics Glossary

Terms and Definitions

Alias: A user-designated synonym for a Unix command or sequence of commands. For example, if you designated m to be your alias for mailx, then typing m will always run this mail program.

Alignment: The matching or linking of a text and its translation(s), usually paragraph by paragraph and/or sentence by sentence. Texts are often aligned in this way so that bilingual CONCORDANCES can be retrieved. Some alignment can be done automatically by software, although the best results are usually produced when a human user checks the automatic alignment and corrects it where necessary.

Alphanumeric: Of ASCII characters, any string composed of only upper- or lower-case English letters or Arabic numerals.

AMALGAM (Automatic Mapping Among Lexico-Grammatical Annotation Models)

Anaphora: Pronouns, noun phrases, etc. which refer to something already mentioned in a text; sometimes the term is used more loosely (and, technically, incorrectly) to refer in general to items which co-refer, even when they do not occur in the text itself (exophora) or when they refer forwards rather than backwards in the text (cataphora).

Annotation: (1) The practice of adding explicit additional information to the machine-readable text; (2) The physical representation of such information.

ARCHER: A Representative Corpus of Historical English Registers

ASCII: The American Standard Code for Information Interchange is a standard character set that maps character codes 0 through 127 (low ASCII) onto control functions, punctuation marks, digits, upper- and lower-case letters, and other symbols.

Attribute: In SGML, a quantifier within the opening tag for an element which specifies a value for some named property of that element.

Authenticity: a feature that characterizes naturally occurring corpus data

BFT (Binary File Transfer): A way of sending files by ftp. The files are sent in binary code, not translated into ASCII, which would risk some information loss.

CALL: computer-aided (or assisted) language learning

CAMET: Computer Archive of Modern English Texts, a project started by Geoffrey Leech of the Department of Language and Modern English in 1970.

Character encoding: a system of using numeric values to represent characters

COCOA (word COunt and COncordance generation on Atlas): Originally an early concordance program; its name now refers to a method of text encoding used by the Oxford Concordance Program and other software.

Colligation: the collocation of a node word with a particular grammatical class of words

Collocation: the characteristic co-occurrence of patterns of words

Comparable corpus: a corpus which is composed of L1 data collected from different languages using the same sampling techniques

Comparative corpus: a corpus containing components of varieties of the same language

Concordance: an alphabetical index of a search pattern in a corpus, showing every contextual occurrence of the search pattern

Corpus balance: the range of different types of language that a corpus claims to cover

Corpus header: the part of a corpus that provides necessary bibliographical information, taxonomies used and other metadata relating to a corpus

Corpuses: a less commonly used plural form of corpus

Cross-tabulation: a table showing the frequencies for each variable across each sample

Co-text: a more precise term than context or verbal context, used to refer to the words on either side of a selected word or phrase

Dispersion: a term in descriptive statistics which refers to a quantifiable variation of measurements of differing members of a population within the scale on which they are measured

Ditto tag: in corpus annotation, the practice of assigning the same part-of-speech code to each word in an idiomatic expression

DTD: Document Type Definition, used in markup languages such as HTML, SGML and XML

Error-tagging: assigning codes indicating the types of errors occurring in a learner corpus

Factor analysis: a statistical analysis commonly used in the social and behavioural sciences to summarize the interrelationships among a large group of variables in a concise fashion

Fisher's exact test: an alternative to the chi-square or log-likelihood test that measures the exact statistical significance level

Frequency: also called raw frequency, the actual count of a linguistic feature in a corpus

Interlanguage: the learner’s knowledge of the L2 which is independent of both the L1 and the actual L2

Keyword: words in a corpus whose frequency is unusually high (positive keywords) or low (negative keywords) in comparison with a reference corpus

KWIC: key-word-in-context concordance
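
As an illustration of the KWIC format, the following is a minimal sketch in Python rather than the method of any particular concordancer. The function name kwic, the regular-expression tokenizer and the sample sentence are all assumptions made for this example.

```python
import re

def kwic(text, node, span=4):
    """Return (left co-text, node, right co-text) for each hit of `node`.

    Matching is case-insensitive; `span` is the number of words kept on
    each side. The tokenizer is a deliberately simple assumption.
    """
    tokens = re.findall(r"\w+(?:'\w+)?", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, tok, right))
    return lines

sample = "The cat sat on the mat. The dog chased the cat around the garden."
for left, node, right in kwic(sample, "cat", span=3):
    # Right-align the left co-text so the node words line up in a column.
    print(f"{left:>20}  {node}  {right}")
```

Sorting the returned tuples on the left or right co-text gives the alphabetical ordering mentioned under Concordance.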

Lemmatization: grouping together all of the different inflected forms of the same word

Lexicon: an inventory of word forms in a given language

Log-likelihood test: also known as an LL test, an alternative to the chi-square test

Markup: a system of standard codes inserted into a document stored in electronic form to provide information about the text itself and govern formatting, printing or other processing

Mean: the arithmetic average, which can be calculated by adding all of the scores together and then dividing the sum by the number of scores

Merger: combination of two or more words (e.g. can’t and gonna)

Metadata: a term used to describe data about data, typically the contextual information of corpus samples

MI: mutual information, a statistical formula borrowed from information theory
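
A minimal sketch of one common form of the MI score, assuming the simple pointwise version MI = log2((f_xy * N) / (f_x * f_y)), where f_xy is the co-occurrence frequency, f_x and f_y are the frequencies of the node and the collocate, and N is the corpus size. Some tools also adjust for the span size; that adjustment is left out here, and the names are illustrative.

```python
import math

def mutual_information(cooccurrences, node_freq, collocate_freq, corpus_size):
    """Pointwise MI: log2 of observed over expected co-occurrence."""
    # Expected co-occurrences if the two words were independent.
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(cooccurrences / expected)

# Node occurs 16 times, collocate 32 times, together 8 times, in 1,024 tokens.
print(mutual_information(8, 16, 32, 1024))  # 4.0
```

A higher score means the two words occur together more often than chance would predict.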

Microconcord: a concordance package published by Oxford University Press

Monitor corpus: a corpus that is constantly supplemented with fresh material and keeps increasing in size

Normalization: a process which makes frequencies from samples of markedly different sizes comparable by bringing them to a common base
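
The arithmetic can be sketched as follows. This is a minimal example, assuming a common base of one million words; the function name and the figures are invented for illustration.

```python
def normalize(raw_freq, corpus_size, base=1_000_000):
    """Convert a raw frequency to a frequency per `base` words."""
    return raw_freq * base / corpus_size

# The same word in two samples of very different sizes.
small = normalize(50, 200_000)       # 50 hits in 200,000 words
large = normalize(400, 4_000_000)    # 400 hits in 4,000,000 words
print(small, large)  # 250.0 100.0
```

Although the larger sample has the higher raw frequency, the word is relatively more frequent in the smaller one.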

Parallel corpus: a corpus which is composed of source texts and their translations in one or more different languages; also known as a translation corpus

Parsing: also called treebanking or bracketing, a process that analyzes the sentences in a corpus into their constituents

Population: the entire set of items from which samples can be drawn

POS: part-of-speech

Post-editing: human correction of automatically processed data

Range: the difference between the highest and lowest frequencies

Reference corpus: a balanced, representative corpus for general usage; in keyword analysis, a corpus that is used to provide a reference wordlist

Representativeness: a corpus is thought to be representative of the language variety it is supposed to represent if the findings based on its contents can be generalized to the said language variety

Recoverability: the possibility for the user to recover the basic original text from any text which has been annotated with further information

RP: Received Pronunciation, the notional standard form of spoken British English

Sample: elements that are selected intentionally as a representation of the population being studied

Sample corpus: as opposed to a monitor corpus, a sample corpus is of finite size and consists of text segments selected to provide a static snapshot of language

Semantic prosody: the collocational meaning arising from the interaction between a given node word and its collocates

SEU: Survey of English Usage

Skeleton parsing: also called shallow parsing, a parsing technique that uses less fine-grained constituent types than would be present in a full parse

Sort: arrange concordances or a wordlist in a certain order

Span: a term used to refer to the measurement, in words, of the co-text of a word selected for study.
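
A minimal sketch of how a span is applied when collecting the co-text of a node word, assuming the text has already been split into tokens. The function name and the sample phrase are illustrative.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count the words found within `span` words either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - span):i])   # left co-text
            counts.update(tokens[i + 1:i + 1 + span])   # right co-text
    return counts

tokens = "the strong tea and the strong coffee".split()
print(collocates(tokens, "strong", span=1))
```

Widening the span brings in more co-text but also more words that have no real relationship with the node.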

Specialized corpus: a corpus that is domain or genre specific and is designed to represent a sublanguage

SPSS: Statistical Package for the Social Sciences

Standardized type-token ratio: similar to type-token ratio, but computed every n (e.g. 1,000) words as the WordSmith Wordlist goes through each text file
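
A minimal sketch of the calculation, assuming that the ratio is computed for each consecutive chunk of n tokens, that a trailing chunk shorter than n is discarded, and that the result is the mean of the chunk ratios. The function names are illustrative, and this is not WordSmith's own code.

```python
def type_token_ratio(tokens):
    """Number of distinct word forms divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

def standardized_ttr(tokens, chunk_size=1000):
    """Mean type-token ratio over consecutive full chunks of tokens."""
    ratios = [
        type_token_ratio(tokens[start:start + chunk_size])
        for start in range(0, len(tokens) - chunk_size + 1, chunk_size)
    ]
    if not ratios:
        raise ValueError("text is shorter than one chunk")
    return sum(ratios) / len(ratios)

tokens = "a b a b c d c d e".split()
print(type_token_ratio(tokens))                 # 5 types / 9 tokens
print(standardized_ttr(tokens, chunk_size=4))   # 0.5
```

Because every chunk has the same length, the standardized figure can be compared across texts of different sizes, which the plain ratio cannot.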

Sub corpus: a component of a corpus, usually defined using certain criteria such as text types and domains

Tagging: an alternative term for annotation, especially word-level annotation such as POS tagging and semantic tagging

Tagset: a collection of tags in the form of a scheme for annotating corpora.

Text chunking: the practice of dividing sentences into non-overlapping segments on the basis of fairly superficial analysis

Token: an occurrence of any given word form

Tokenization: also called segmentation, a process that divides running text into legitimate word tokens, especially important for languages such as Chinese that do not delimit words with white spaces

Transcription: converting spoken data into a written form

Treebank: an alternative term for a parsed corpus

T-test: an alternative statistical test to the chi-square test

Type: a word form

Type-token ratio: the ratio between types and tokens, useful when comparing samples of roughly equal length

Unicode: a character encoding system designed to support the interchange, processing, and display of all of the written texts of the diverse languages of the world
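
A short sketch contrasting ASCII with Unicode, using Python's built-in string handling. UTF-8 is assumed here as the encoding form, and the helper name is_ascii is illustrative.

```python
def is_ascii(text):
    """True if every character has a code in the ASCII range 0-127."""
    return all(ord(ch) < 128 for ch in text)

print(ord("A"))                    # 65, the same in ASCII and Unicode
print(is_ascii("corpus"))          # True

accented = "\u00e9"                # e with an acute accent
print(ord(accented))               # 233, outside the ASCII range
print(accented.encode("utf-8"))    # b'\xc3\xa9', two bytes in UTF-8

chinese = "语"
print(is_ascii(chinese))           # False
print(len(chinese.encode("utf-8")))  # 3 bytes in UTF-8
```

A corpus saved in one encoding and read in another is a common source of garbled characters.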

Wildcard: a special character such as an asterisk (*) or a question mark (?) that can be used to represent one or more characters in pattern matching
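
A minimal sketch of wildcard matching against a wordlist, using Python's standard fnmatch module. In this module an asterisk matches any number of characters, including none, and a question mark matches exactly one character; individual corpus tools may define their wildcards differently. The helper name and the wordlist are illustrative.

```python
from fnmatch import fnmatchcase

def match_wildcard(words, pattern):
    """Return the words that match a wildcard pattern, case-sensitively."""
    return [word for word in words if fnmatchcase(word, pattern)]

wordlist = ["walk", "walks", "walked", "walking", "talk"]
print(match_wildcard(wordlist, "walk*"))  # ['walk', 'walks', 'walked', 'walking']
print(match_wildcard(wordlist, "walk?"))  # ['walks']
print(match_wildcard(wordlist, "?alk"))   # ['walk', 'talk']
```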

Wordlist: a list of words occurring in a corpus, possibly with frequency information

WordSmith: a corpus exploration package with sophisticated statistical analysis, published by Oxford University Press

Z-test: an alternative statistical test to the chi-square test


Baker, Paul, Andrew Hardie & Tony McEnery. A Glossary of Corpus Linguistics. Edinburgh: Edinburgh University Press, 2006.

Olohan, Maeve. Introducing Corpora in Translation Studies. New York: Routledge, 2004.

Wang, Kefei. Research and Application of Bilingual Parallel Corpora. Beijing: Foreign Language Teaching and Research Press, 2004.