Chinese Treebank 8.0

WakeSpace Repository

dc.date.accessioned 2014-09-08T17:14:29Z
dc.date.available 2014-09-08T17:14:29Z
dc.date.issued 2013
dc.identifier.issn 1-58563-661-4
dc.identifier.uri http://hdl.handle.net/10339/39379
dc.description Chinese Treebank 8.0, Linguistic Data Consortium (LDC) Catalog Number LDC2013T21 and ISBN 1-58563-661-4, consists of approximately 1.5 million words of annotated and parsed text from Chinese newswire, government documents, magazine articles, various broadcast news and broadcast conversation programs, web newsgroups and weblogs. The Chinese Treebank project began at the University of Pennsylvania in 1998, continued at the University of Colorado and then moved to Brandeis University. The project's goal is to provide a large, part-of-speech tagged and fully bracketed Chinese language corpus. The first delivery, Chinese Treebank 1.0, contained 100,000 syntactically annotated words from Xinhua News Agency newswire. It was later corrected and released in 2001 as Chinese Treebank 2.0 (LDC2001T11), which consisted of approximately 100,000 words. The LDC released Chinese Treebank 4.0 (LDC2004T05), an updated version containing roughly 400,000 words, in 2004. A year later, LDC published the 500,000-word Chinese Treebank 5.0 (LDC2005T01). Chinese Treebank 6.0 (LDC2007T36), released in 2007, consisted of 780,000 words. Chinese Treebank 7.0 (LDC2010T08), released in 2010, added new annotated newswire data, broadcast material and web text, for an approximate total of one million words. Chinese Treebank 8.0 adds new annotated data from newswire, magazine articles and government documents. Data: There are 3,007 text files in this release, containing 71,369 sentences, 1,620,561 words and 2,589,848 characters (hanzi or foreign). The data is provided in UTF-8 encoding, and the annotation has Penn Treebank-style labeled brackets. Details of the annotation standard can be found in the enclosed segmentation, POS-tagging and bracketing guidelines. The data is provided in four different formats: raw text, word segmented, word segmented and POS-tagged, and syntactically bracketed. All files were automatically verified and manually checked. en_US
dc.title Chinese Treebank 8.0 en_US
dc.type Dataset en_US

