This is the electronic version of Chapter 10 (pp. 263-310) of the following book (by permission from the publisher):

Edwards, Jane A. & Martin D. Lampert (eds). TALKING DATA: TRANSCRIPTION AND CODING IN DISCOURSE RESEARCH. London and Hillsdale, NJ: Erlbaum. 336 pp. 0-8058-0349-1 [ppr] US $27.50; 0-8058-0348-3 [hdbk] US $59.95; (Prepaid: $24.75 & $53.95)

Discourse, spoken language corpora. Transcription and coding systems from contrasting approaches to spoken language situated in their theoretical frameworks with sample analyses. Overview chapters present global design principles. Includes a large compendium of computerized corpora and related resources.

To order in US: 1-800-926-6579

I would appreciate knowing of inaccuracies or additional resources which should be mentioned in an update to be submitted as appropriate to the ICAME fileserver in Bergen (see below).

-Jane Edwards (edwards@cogsci.berkeley.edu)

--------------------------------------------------------------------------

Chapter 10: Survey of Electronic Corpora and Related Resources for Language Researchers

Jane A. Edwards
University of California at Berkeley

CONTENTS

1. INTRODUCTION
2. INFORMATION SOURCES
   A. Centers and Associations
      (1) NCCH (Norwegian Computing Centre for Humanities)
      (2) CTI (Computers in Teaching Initiative Centre for Textual Studies)
      (3) CETH (Center for Electronic Texts in the Humanities)
      (4) ACH (Association for Computers and the Humanities)
      (5) ALLC (Association for Literary and Linguistic Computing)
      (6) ACL (Association for Computational Linguistics)
   B. Electronic Mail Distribution Lists and Discussion Lists
      (1) HUMBUL
      (2) CORPORA
      (3) HUMANIST
      (4) LINGUIST
      (5) LN, Langage Naturel
      (6) PROSODY
      (7) Comserve
      (8) Applied linguistics (TESL-L, SLART-L, MULTI-L, LTEST-L)
      (9) FUNKNET
      (10) info-childes and info-psyling
      (11) ASLING-Linguistics of Signed Languages
      (12) List of lists
   C. Email Addresses
3. TEXT ENCODING STANDARDS (TEI, IPA, SAM, TOBI)
4. DATA SOURCES
   A. Electronic Data Archives and Repositories
      (1) OTA (Oxford Text Archive)
      (2) ICAME (International Computer Archive of Modern English)
      (3) CHILDES (The Child Language Exchange System)
      (4) CETH (Center for Electronic Texts in the Humanities)
      (5) The AIATSIS Aboriginal Studies Electronic Data Archive
      (6) Project Gutenberg
      (7) Library of the Future
   B. Surveys of Electronic Language Data
      (1) Oxford Text Archive (OTA) catalogue
      (2) University of Lancaster Survey
      (3) Georgetown University Catalog of Archives and Projects
      (4) Walker and Zampolli survey
      (5) List of Electronic Texts in Philosophy
      (6) List of Electronic Dictionaries
      (7) Catalog of the University of Cambridge Literature and Linguistics Computing Centre
      (8) Linguistic Society of America List
      (9) The Marchand list of CD-ROM Projects
      (10) ARL Directory of Electronic Publications
5. CORPORA AND TEXTBANKS
   A. Running text: English Language
      (1) Brown Corpus
      (2) Lancaster-Oslo/Bergen (LOB)
      (3) London-Lund Corpus
      (4) Lancaster Spoken English Corpus (SEC)
      (5) PIXI Corpora
      (6) Helsinki Corpus of Historical English
      (7) Macquarie (University) Corpus
      (8) Kolhapur Corpus of Indian English
      (9) American Heritage Intermediate Corpus
      (10) Birmingham Collection of English Text (BCET)
      (11) Longman/Lancaster English Language Corpus
      (12) Corpus of Spoken American English (CSAE)
      (13) International Corpus of English (ICE)
      (14) British National Corpus Initiative (BNC)
      (15) Bellcore Lexical Research Corpora
      (16) Association for Computational Linguistics Data Collection Initiative (ACL/DCI)
      (17) European Corpus Initiative (ACL/ECI)
      (18) Cambridge Language Survey (CLS)
      (19) Linguistic Data Consortium (LDC)
      (20) American News Stories
      (21) Nijmegen TOSCA Corpus
      (22) Melbourne-Surrey Corpus
      (23) Corpus of English-Canadian Writing
      (24) Warwick Corpus
      (25) Cornell corpus
      (26) NEXIS, LEXIS, MEDIS (Mead Data Central) and WESTLAW (West Corporation)
   B. Running text: French Language
      (1) OTA holdings
      (2) Hansard Canadian Parliamentary Sessions
      (3) Ottawa-Hull Corpus of Spoken French
      (4) Tresor de la Langue Francaise (TLF or ARTFL)
   C. Running text: German Language
      (1) Mannheim Corpus
      (2) Bonner Zeitungskorpus
      (3) Freiburger Corpus
      (4) LIMAS Corpus
      (5) Pfeffer Spoken German Corpus
      (6) Ulm Textbank
      (7) Muenster Textbank
   D. Running text: Italian Language
      (1) PIXI corpora
      (2) Pisa corpus
   E. Running text: Other Languages
      (1) Native American Languages
      (2) Australian Indigenous Languages
      (3) Danish
      (4) Estonian
      (5) Finnish
      (6) Spanish
      (7) Swedish
      (8) Yugoslavian
   F. Running text: Language Acquisition
      (1) Child Language Acquisition (CHILDES, PoW)
      (2) Adult Second Language Acquisition (ESFSLDB, Montreal)
   G. Phonetic Databases
      (1) DARPA Speech Recognition Research Databases
      (2) Phonetic Database (PDB)
      (3) Multi-Language Speech Database
   H. Electronic Dictionaries
      (1) See the Wooldridge list
      (2) Oxford Text Archive (OTA) holdings
      (3) Oxford English Dictionary (OED)
      (4) Le Robert Electronique
   I. Lexical Databanks
      (1) MRC Psycholinguistic Database
      (2) Consortium for Lexical Research (CLR)
      (3) Centre for Lexical Information (CELEX)
      (4) Acquisition of Lexical Knowledge (ACQUILEX)
      (5) Cambridge Language Survey (CLS)
      (6) Japanese Electronic Dictionary Research Project
   J. Treebanks
      (1) Lancaster-Leeds Treebank
      (2) Lancaster Parsed Corpus
      (3) Linguistic DataBase System (LDB)
      (4) Penn Treebank Project
      (5) Treebank of Written and Spoken American English
   K. Translation into English
6. LITERATURE PERTAINING TO ELECTRONIC CORPORA
ACKNOWLEDGMENTS
REFERENCES
APPENDIX
This chapter is also available via anonymous ftp from cogsci.berkeley.edu, in compressed format, in the "pub" directory under the filename "CorpusSurvey.Z". It is also available as the LINGUIST file "CORPORA FAQ", which can be retrieved by email by sending the message "GET CORPORA FAQ LINGUIST" to LISTSERV@tamvm1.tamu.edu. (For a list of all the archived LINGUIST files, send the message "INDEX LINGUIST" to the same address.)
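
For readers who prefer to script the two retrieval routes described above, the following is a minimal Python sketch, not a tested recipe. The host, directory, filename, LISTSERV address and commands are taken from the text; the function names, the sender address and the local file path are illustrative assumptions, and the servers named here may no longer be reachable.

```python
from email.message import EmailMessage
from ftplib import FTP

# Values quoted from the text above; they may be out of date.
FTP_HOST = "cogsci.berkeley.edu"
FTP_DIR = "pub"
FTP_FILE = "CorpusSurvey.Z"
LISTSERV = "LISTSERV@tamvm1.tamu.edu"


def build_listserv_request(command, sender):
    """Build (but do not send) an email whose body is a LISTSERV command.

    `sender` is a hypothetical reader address; sending the message is left
    to whatever mail setup the reader already has.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = LISTSERV
    msg.set_content(command)  # the command goes in the message body
    return msg


def fetch_survey(local_path=FTP_FILE):
    """Download the compressed chapter by anonymous ftp (assumes the host is up)."""
    with FTP(FTP_HOST) as ftp:
        ftp.login()        # no arguments means an anonymous login
        ftp.cwd(FTP_DIR)
        with open(local_path, "wb") as out:
            ftp.retrbinary(f"RETR {FTP_FILE}", out.write)
    return local_path


# Example (no network access is attempted here):
#   request = build_listserv_request("GET CORPORA FAQ LINGUIST", "reader@example.org")
#   index = build_listserv_request("INDEX LINGUIST", "reader@example.org")
```

The ".Z" suffix indicates a file packed with the Unix compress utility, so a file fetched this way would still need to be uncompressed before reading.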