Browsing Linguistics Datasets by Title

WakeSpace Repository

  • Unknown author (2013)
  • Tracey, Jennifer; Lee, Haejoong; Strassel, Stephanie (2019-03-27)
    BOLT English Discussion Forums was developed by the Linguistic Data Consortium (LDC) and consists of 830,440 discussion forum threads in English harvested from the Internet using a combination of manual and automatic ...
  • Song, Zhiyi; Fore, Dana; Strassel, Stephanie; Lee, Haejoong; Wright, Jonathan (2019-03-27)
    BOLT English SMS/Chat was developed by the Linguistic Data Consortium (LDC) and consists of naturally-occurring Short Message Service (SMS) and Chat (CHT) data collected through data donations and live collection involving ...
  • Doran, Christine; Burger, John; Henderson, John; Zarrella, Guido (Linguistic Data Consortium, 2013-01-15)
    Chinese-English Biology and Chemistry Abstract Parallel Text was developed by The MITRE Corporation. It consists of parallel sentences from a collection of chemistry and biology-related scientific article abstracts published ...
  • Unknown author (2013)
  • CHM150 
    Mena, Carlos; Herrera, Abel (Linguistic Data Consortium, 2016-06-15)
    CHM150 (Corpus Hecho en México 150) was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 1.63 hours of Mexican ...
  • Unknown author (2015-08-31)
  • Unknown author (2018-01-03)
  • Unknown author (2020-02-27)
  • Unknown author (2017-04-04)
  • Taulé, Mariona; Martí, Maria Antonia; Bies, Ann; Gari, Aina; Nofre, Montserrat; Song, Zhiyi; Strassel, Stephanie; Ellis, Joe (Linguistic Data Consortium, 2018-01-16)
    DEFT Spanish Treebank was developed by the Linguistic Data Consortium (LDC) and the Language and Computation Center (CLiC), University of Barcelona. It contains treebank annotation of international Spanish newswire text ...
  • Unknown author (2012-11-06)
    Digital Archive of Southern Speech (DASS), Linguistic Data Consortium (LDC) catalog number LDC2012S03 and ISBN 1-58563-603-7, was developed by the University of Georgia. It is a subset of the Linguistic Atlas of the Gulf ...
  • Linguistic Data Consortium (Linguistic Data Consortium, 1994)
    The first release of the European Corpus Initiative, the Multilingual Corpus 1 (ECI/MCI), has 46 subcorpora in 27 (mainly European) languages. The total size of these is roughly 92 million (lexical) words. The corpora are ...
  • Bies, Ann; Mott, Justin; Warner, Colin (Linguistic Data Consortium, 2015-07-15)
    English News Text Treebank: Penn Treebank Revised was developed by the Linguistic Data Consortium (LDC) with funding through a gift from Google Inc. It consists of a combination of automated and manual revisions of the ...
  • Muir, Kate; Joinson, Adam; Cotterill, Rachel; Dewdney, Nigel (Linguistic Data Consortium, 2016-07-15)
    English Speed Networking Conversational Transcripts was developed at the University of the West of England and contains 388 transcripts of English face-to-face and instant messaging conversations about business ideas ...
  • Unknown author (2021-12-13)
  • Unknown author (2013-03-27)
  • Li, Xuansong; Grimes, Stephen; Strassel, Stephanie; Ma, Xiaoyi; Xue, Nianwen; Marcus, Mitch; Taylor, Ann (Linguistic Data Consortium, 2015-03-16)
    GALE Chinese-English Parallel Aligned Treebank -- Training was developed by the Linguistic Data Consortium (LDC) and contains 229,249 tokens of word aligned Chinese and English parallel text with treebank annotations. This ...
  • Li, Xuansong; Grimes, Stephen; Strassel, Stephanie (Linguistic Data Consortium, 2014-11-15)
    GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 2 was developed by the Linguistic Data Consortium (LDC) and contains 65,069 tokens of word aligned Chinese and English parallel text enriched with ...
  • Li, Xuansong; Grimes, Stephen; Strassel, Stephanie (Linguistic Data Consortium, 2015-02-16)
    GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3 was developed by the Linguistic Data Consortium (LDC) and contains 242,020 tokens of word aligned Chinese and English parallel text enriched with ...