Information compiled by Maite Taboada (Linguistics)
The Linguistics Department has been a member of the Linguistic Data Consortium (http://www.ldc.upenn.edu) in the years 2002, 2004, 2005, 2006, 2007, 2008, 2009, 2013, 2015 and 2017. The LDC creates and distributes speech and text corpora and lexicons (in English and other languages) that could be of use to researchers in various areas (linguistics, computer science, communication, psychology, education...).
The membership is extended to all SFU students, faculty and staff. This means we have access to a number of corpora released in each membership year, an on-line user account, and the possibility of purchasing other corpora at discount rates. Membership for 2005-2007 was a subscription membership, which means that we received a copy of everything released in those years. Membership for 2008, 2009, 2013, 2015 and 2017 is standard, which means we are restricted to 16 corpora per year.
I will summarize here how you can access the data. You can also check the FAQ for members, the instructions on how to obtain corpora, and the general membership agreement (similar to the one we signed):
http://www.ldc.upenn.edu/Membership/FAQ_Members.shtml
http://www.ldc.upenn.edu/Obtaining/
http://www.ldc.upenn.edu/Membership/Agreements/nfp97.membership.html
The LDC releases a number of corpora each year, both text and speech. SFU owns all corpora released in 2002, 2004, 2005, 2006 and 2007, plus others that we purchased from earlier years, and a reduced number for 2008 and 2009. The corpora we hold are listed below; see http://www.ldc.upenn.edu/Catalog/ByYear.jsp for full descriptions:
2013
LDC2013T02 Chinese-English Biology and Chemistry Abstract Parallel Text
2011
LDC2011E14 MED-11 - Metadata and Documentation (MED 10, DEV-T Part 1 and Event Kits)
2009
LDC2009T02 GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1
LDC2009T06 GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2
LDC2009T15 GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1
LDC2009T12 2008 CoNLL Shared Task Data
LDC2009T05 2008 NIST Metrics for Machine Translation (MetricsMATR08) Development Data
LDC2009T22 Arabic Newswire English Translation Collection
LDC2009T23 FactBank 1.0
LDC2009T03 GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1
LDC2009T09 GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2
LDC2009T07 Unified Linguistic Annotation Text Collection
LDC2009T25 Web 1T 5-gram, 10 European Languages
LDC2009T13 English Gigaword Fourth Edition
LDC2009T29 ACL Anthology Reference Corpus
LDC2009T30 Arabic Gigaword Fourth Edition
LDC2009T24 OntoNotes
LDC2009T21 Spanish Gigaword
2008
LDC2008T08 GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2
LDC2008T18 GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3
LDC2008T06 GALE Phase 1 Chinese Blog Parallel Text
LDC2008T05 Penn Discourse Treebank Version 2.0
LDC2008T07 Chinese Proposition Bank 2.0
LDC2008T24 COMNOM v 1.0
LDC2008T02 GALE Phase 1 Arabic Blog Parallel Text
LDC2008T09 GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2
LDC2008T23 NomBank v 1.0
LDC2008T13 BLLIP North American News Text, Complete
LDC2008T15 North American News Text, Complete
LDC2008T22 Czech Academic Corpus 2.0
LDC2008T01 Hungarian-English Parallel Text, Version 1.0
LDC2008T19 NYT annotated corpus
2007
LDC2007T22 2001 Topic Annotated Enron Email Data Set
LDC2007S10 2003 NIST Rich Transcription Evaluation Data
LDC2007S12 2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data
LDC2007S11 2004 Spring NIST Rich Transcription (RT-04S) Development Data
LDC2007T40 Arabic Gigaword Third Edition
LDC2007S03 ARL Urdu Speech Database, Training Data
LDC2007T38 Chinese Gigaword Third Edition
LDC2007T36 Chinese Treebank 6.0 (CTB6.0)
LDC2007S08 CSLU: Foreign Accented English Release 1.2
LDC2007S18 CSLU: Kids' Speech Version 1.1
LDC2007S13 CSLU: Apple Words and Phrases
LDC2007S05 CSLU: Yes/No Version 1.2
LDC2007T02 English Chinese Translation Treebank v 1.0
LDC2007T07 English Gigaword Third Edition
LDC2007S02 Fisher Levantine Arabic Conversational Telephone Speech
LDC2007T04 Fisher Levantine Arabic Conversational Telephone Speech, Transcripts
LDC2007T24 GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1
LDC2007T23 GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1
LDC2007T20 GALE Phase 1 Distillation Training
LDC2007T08 ISI Arabic-English Automatically Extracted Parallel Text
LDC2007T09 ISI Chinese-English Automatically Extracted Parallel Text
LDC2007S01 Levantine Arabic Conversational Telephone Speech
LDC2007T01 Levantine Arabic Conversational Telephone Speech, Transcripts
LDC2007S09 Mandarin Affective Speech
LDC2007T19 MITRE 1997 Mandarin Broadcast News Speech Translations (Hub-4NE)
LDC2007S15 Nationwide Speech Project
LDC2007T21 OntoNotes v 1.0
LDC2007T03 Tagged Chinese Gigaword
LDC2007V02 TRECVID 2003 Keyframes & Transcripts
LDC2007V01 TRECVID 2005 Keyframes & Transcripts
2006
LDC2006S44 2004 NIST Speaker Recognition Evaluation
LDC2006T06 ACE 2005 Multilingual Training Corpus
LDC2006S46 Arabic Broadcast News Speech
LDC2006T20 Arabic Broadcast News Transcripts
LDC2006T02 Arabic Gigaword Second Edition
LDC2006S15 CSLU: Spelled and Spoken Words
LDC2006S14 CSLU: Stories v 1.2
LDC2006S35 CSLU: Multilanguage Telephone Speech Version 1.2
LDC2006S26 CSLU: Speaker Recognition Version 1.1
LDC2006S16 CSLU: Spoltech Brazilian Portuguese Version 1.0
LDC2006S01 CSLU Voices
LDC2006S39 CSLU: Names Release 1.3
LDC2006T10 English-Arabic Treebank V1.0
LDC2006T17 French Gigaword First Edition
LDC2006S43 Gulf Arabic Conversational Telephone Speech
LDC2006T15 Gulf Arabic Conversational Telephone Speech, Transcripts
LDC2006S45 Iraqi Arabic Conversational Telephone Speech
LDC2006T16 Iraqi Arabic Conversational Telephone Speech, Transcripts
LDC2006S42 Korean Broadcast News Speech
LDC2006T14 Korean Broadcast News Transcripts
LDC2006T03 Korean Propbank
LDC2006T09 Korean Treebank Annotations Version 2.0
LDC2006S29 Levantine Arabic QT Training Data Set 5, Speech
LDC2006T07 Levantine Arabic QT Training Data Set 5, Transcripts
LDC2006S33 Middle East Technical University Turkish Microphone Speech V 1.0
LDC2006T04 Multiple Translation Chinese (MTC) Part 4
LDC2006S13 N4 NATO Native and Non-Native Speech
LDC2006S31 NIST 2003 Language Recognition Evaluation
LDC2006T01 Prague Dependency Treebank 2.0
LDC2006S34 Russian through Switched Telephone Network (RuSTeN)
LDC2006T12 Spanish Gigaword First Edition
LDC2006S30 Speech Controlled Computing
LDC2006T18 TDT5 Multilingual Text
LDC2006T19 TDT5 Topics and Annotations
LDC2006T08 Timebank 1.2
LDC2006T13 Web 1T 5-gram Version 1
LDC2006S37 West Point Heroico Spanish Speech
LDC2006S36 West Point Korean Speech
2005
LDC2005T09 ACE 2004 Multilingual Training Corpus
LDC2005T07 ACE Time Normalization (TERN) 2004 English Training Data V1.0
LDC2005T35 ANC Second Release
LDC2005T03 Arabic CTS Levantine Fisher Training Data Set 3, Transcripts
LDC2005S07 Arabic CTS Levantine Fisher Training Data Set 3, Speech
LDC2005T02 Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis)
LDC2005T20 Arabic Treebank: Part 3 (full corpus) v2.0 (MPG + Syntactic Analysis)
LDC2005T30 Arabic Treebank: Part 4 v1.0 (MPG Annotation)
LDC2005S22 Articulation Index
LDC2005T33 BBN Pronoun Coreference and Entity Type Corpus
LDC2005S08 BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
LDC2005T13 CCGbank
LDC2005S26 CSLU: 22 Languages Corpus
LDC2005T34 Chinese <-> English Name Entity Lists (v1.0)
LDC2005T10 Chinese English News Magazine Parallel Text
LDC2005T14 Chinese Gigaword Second Edition
LDC2005T06 Chinese News Translation Text Part 1
LDC2005T23 Chinese Proposition Bank 1.0
LDC2005T01 Chinese Treebank 5.0
LDC2005T01U01 Chinese Treebank 5.1
LDC2005T08 Discourse Graphbank
LDC2005T12 English Gigaword Second Edition
LDC2005S13 Fisher English Training Part 2, Speech
LDC2005T19 Fisher English Training Part 2, Transcripts
LDC2005T28 HARD 2004 Text
LDC2005T29 HARD 2004 Topics and Annotations
LDC2005S15 HKUST Mandarin Telephone Speech, Part 1
LDC2005T32 HKUST Mandarin Telephone Transcript Data, Part 1
LDC2005S14 Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)
LDC2005T24 MDE RT-04 Training Data Text/Annotations
LDC2005S16 MDE RT04 Training Data Speech
LDC2005L01 Mawukakan Lexicon
LDC2005T05 Multiple-Translation Arabic (MTA) Part 2
LDC2005S25 Santa Barbara Corpus of Spoken American English Part-IV
LDC2005S11 TDT4 Multilingual Broadcast News Speech Corpus
LDC2005T16 TDT4 Multilingual Text and Annotations
LDC2005S30 The West Point Company G3 American English Speech Data Corpus
LDC2005S28 West Point Croatian Speech Corpus
2004
LDC2004S10 Santa Barbara Corpus of Spoken American English Part-III
LDC2004T17 Arabic News Translation Text Part 1
LDC2004T15 2000 Communicator Dialogue Act Tagged
LDC2004T16 2001 Communicator Dialogue Act Tagged
LDC2004S04 2002 NIST Speaker Recognition Evaluation
LDC2004T02 Arabic Treebank: Part 2 v 2.0
LDC2004T11 Arabic Treebank: Part 3 v 1.0
LDC2004T05 Chinese Treebank Version 4.0
LDC2004S01 Czech Broadcast News Speech
LDC2004T01 Czech Broadcast News Transcripts
LDC2004V01 FORM1 Kinematic Gesture
LDC2004T08 Hong Kong Parallel Text
LDC2004S02 ICSI Meeting Speech
LDC2004T04 ICSI Meeting Transcripts
LDC2004S05 ISL Meeting Speech Part 1
LDC2004T10 ISL Meeting Transcripts Part 1
LDC2004L01 Klex: Finite-State Lexical Transducer for Korean
LDC2004S08 MDE RT-03 Training Data Speech
LDC2004T12 MDE RT-03 Training Data Text and Annotations
LDC2004T03 Morphologically Annotated Korean Text
LDC2004T07 Multiple-Translation Chinese (MTC) Part 3
LDC2004S09 NIST Meeting Pilot Corpus Speech
LDC2004T13 NIST Meeting Pilot Corpus Transcripts and Metadata
LDC2004T14 Proposition Bank I
LDC2004T09 TIDES Extraction (ACE) 2003 Multilingual Training Data
LDC2004T18 Arabic English Parallel News Part 1
LDC2004S07 Switchboard Cellular Part 2 Audio
LDC2004S12 Talkbank Ethology Data: Field Recordings of Vervet Monkey Calls
LDC2004S13 Fisher English Training Speech Part 1 Speech
LDC2004T19 Fisher English Training Speech Part 1, Transcripts
LDC2004L02 Buckwalter Arabic Morphological Analyzer Version 2.0
LDC2004S11 2002 Rich Transcription Broadcast News and Conversational Telephone Speech
LDC2004T23 Prague Arabic Dependency Treebank 1.0
LDC2004T25 Prague Czech-English Dependency Treebank Version 1.0
2003
LDC2003T20 ANC First Release
LDC2003S06 Santa Barbara Corpus of Spoken American English Part-II
2002
LDC2002S11 1997 HUB4 English Evaluation Speech and Transcripts
LDC2002S22 1997 HUB5 Arabic Evaluation
LDC2002T39 1997 HUB5 Arabic Transcripts
LDC2002S24 1997 HUB5 German Evaluation
LDC2002S25 1997 HUB5 Spanish Evaluation
LDC2002S10 1998 HUB5 English Evaluation
LDC2002S56 2000 Communicator Evaluation
LDC2002S13 2001 HUB5 English Evaluation
LDC2002S12 2001 HUB5 Mandarin Evaluation
LDC2002S34 2001 NIST Speaker Recognition Evaluation Corpus
LDC2002L49 Buckwalter Arabic Morphological Analyzer Version 1.0
LDC2002S37 Callhome Egyptian Arabic Speech Supplement
LDC2002T38 Callhome Egyptian Arabic Transcripts Supplement
LDC2002L27 Chinese-English Translation Lexicon Version 3.0
LDC2002S28 Emotional Prosody Speech and Transcripts
LDC2002T26 Korean English Treebank Annotations
LDC2002T01 Multiple-Translation Chinese Corpus
LDC2002T07 RST Discourse Treebank
LDC2002S06 Switchboard-2 Phase III Audio
LDC2002T31 The AQUAINT Corpus of English News Text
LDC2002S04 Translanguage English Database (TED) Speech
LDC2002T03 Translanguage English Database (TED) Transcripts
LDC2002S35 Voicemail Corpus Part II
LDC2002S02 West Point Arabic Speech Corpus
2001
LDC2001T09 Speech in Noisy Environments (SPINE2) Part 3 Transcripts
LDC2001S08 Speech in Noisy Environments (SPINE2) Part 3 Audio
LDC2001S16 Grassfields Bantu Fieldwork: Ngomba Tone Paradigms
LDC2001T61 CALLHOME Spanish Dialogue Act Annotation
2000
LDC2000S85 Santa Barbara Corpus of Spoken American English Part-I
LDC2000T51 TREC Spanish
1999
LDC99T42 Treebank-3
1998
LDC98S71 1997 English Broadcast News Speech (HUB4)
LDC98T28 1997 English Broadcast News Transcripts (HUB4)
LDC98S74 1997 Spanish Broadcast News Speech (HUB4-NE)
LDC98T29 1997 Spanish Broadcast News Transcripts (HUB4-NE)
LDC98T27 Hub-5 Spanish Transcripts
LDC98S70 Hub-5 Spanish Telephone Speech Corpus
LDC98S72 Taiwanese Putonghua Speech and Transcripts
1997
LDC97S44 1996 English Broadcast News Speech (HUB4)
LDC97T22 1996 English Broadcast News Transcripts (HUB4)
LDC97S42 CALLHOME American English Speech
LDC97T14 CALLHOME American English Transcripts
LDC97T15 CALLHOME German Transcripts
1996
LDC96S35 CALLHOME Spanish Speech
LDC96T17 CALLHOME Spanish Transcripts
LDC96T18 CALLHOME Japanese Transcripts
LDC96S34 CALLHOME Mandarin Chinese Speech
LDC96T16 CALLHOME Mandarin Chinese Transcripts
LDC96S57 CALLFRIEND Spanish-Caribbean Dialect
1994
LDC94T5 ECI Multilingual Text
LDC94T4A UN Parallel Text (Complete)
Some of the materials are on CD-ROM and some are on-line. You can consult the library catalogue: search for "Linguistics Data Consortium" under Keywords or Author (you may have to try both "Linguistic Data Consortium" and "Linguistics Data Consortium"). Search the catalogue frequently; new items arrive regularly, and they might take a little while to be catalogued:
- For the CD-ROMs:
Go to the 7th floor of the Bennett Library (Maps area, Data department):
http://www.lib.sfu.ca/about/floorplans/maps.htm
Ask the staff in the Data department. You can borrow the CDs. If you have trouble, ask the staff in the Research Data Library: http://www.sfu.ca/rdl/
Do not email me about access to the library data!
- For the online resources:
Ask Maite Taboada (mtaboada@sfu.ca) or Anoop Sarkar (anoop@cs.sfu.ca) for access.
I have created a mailing list for all issues concerning the LDC. The list is (replace the text in quotes with @):
ldc-corpora "the at sign here" sfu.ca
I am trying to reach all members of the SFU community who might be interested. If you know of someone who would like to receive further information, please send them a pointer to this page, or have them send email to me or to the list.
mtaboada @ sfu.ca