Multi-CAST: Overview

Geoffrey Haig, Stefan Schnell

Multi-CAST (Multilingual Corpus of Annotated Spoken Texts) is a collection of non-elicited, spoken texts from different languages, most of them monologic narratives. The corpus was compiled and annotated under the supervision of Geoffrey Haig and Stefan Schnell, with technical implementation undertaken by LAC at the University of Cologne.

The collection is composed of texts from a variety of languages, all of which adhere to the same principles of composition and design. Every text in each of the corpora comes with a sound file, a translation, morphological glosses, and syntactic annotations following the GRAID annotation scheme, along with background information on the recordings and additional sources. The annotated texts are available as EAF files, an XML-based file format produced by the annotation software ELAN.
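Because EAF files are XML, they can be read with a general-purpose XML parser. The following is a minimal sketch in Python, not an official Multi-CAST tool. It relies only on the general EAF layout (TIER elements containing ANNOTATION_VALUE elements); the tier names "utterance" and "graid" and the sample content are hypothetical and may differ from the tier names used in the actual corpus files.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced EAF document for illustration only.
# Real Multi-CAST files contain more tiers and metadata, and their
# tier names may differ from the ones assumed here.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
  </TIME_ORDER>
  <TIER TIER_ID="utterance">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>the man left</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="graid" PARENT_REF="utterance">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a2" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>np.h:s v:pred</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""


def read_tiers(eaf_source):
    """Map each tier ID to the list of its annotation values, in file order.

    eaf_source may be a path to an .eaf file or an open file-like object.
    """
    root = ET.parse(eaf_source).getroot()
    tiers = {}
    for tier in root.iter("TIER"):
        # Both time-aligned and referring annotations store their
        # text in an ANNOTATION_VALUE child element.
        values = [v.text or "" for v in tier.iter("ANNOTATION_VALUE")]
        tiers[tier.get("TIER_ID")] = values
    return tiers


if __name__ == "__main__":
    for tier_id, values in read_tiers(io.StringIO(SAMPLE_EAF)).items():
        print(tier_id, values)
```

To read a downloaded corpus file instead of the sample, pass its path to read_tiers and inspect the returned keys to see which tiers the file actually contains.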

A detailed description of the corpus and its design is available in the Multi-CAST corpus overview and usage guide (Schiborr 2016). For details on the annotation scheme employed, please refer to the GRAID Manual 7.0 (Haig & Schnell 2014).


All material in Multi-CAST is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0).

Citing Multi-CAST

Multi-CAST should be cited as follows:

Haig, Geoffrey & Schnell, Stefan (eds.). 2016. Multi-CAST (Multilingual Corpus of Annotated Spoken Texts), date accessed.


Data collection and annotation of part of the collection were graciously supported by the Australian Research Council as part of the DECRA project Typology of language use (2012–2015), hosted by La Trobe University, and by the VolkswagenStiftung-funded Documentation of Endangered Languages (DOBES) project (2000–2007 and 2006–2012). The Department of General Linguistics at the University of Bamberg contributed departmental funding and research infrastructure to the project.