This proposal aims to create a treebank, a database of syntactic parses of the sentences in a text corpus, for the Chinese Buddhist Canon. Standard techniques for textual analysis are difficult to apply to the Canon: with 52 million characters divided into 1,514 individual texts, its sheer volume overwhelms the community of scholars and students who can read it. Furthermore, because the texts are translations from Indic languages into Chinese made between the 2nd and the 11th centuries, the Canon is also challenging in its complexity.
We need new techniques to study vast historical texts over time and space, at a scale that was not feasible before digital collections and computational methods became available. The Canon offers not only 1,000 years of linguistic data but also metadata such as the year and place of compilation and the names of translators. Such techniques would allow scholars to pose questions over the entire scope of the corpus, such as who influenced whom, which concepts survived over time, and what macroscopic, structural patterns the texts exhibit. Within this larger context, we propose to address two goals.
Our first goal is to create a treebank of syntactically analyzed texts for a subset of the Buddhist Canon. The past decade has seen many digitization and annotation projects for historical texts ranging from classical Arabic to Middle English; we will be the first to build a large-scale treebank of classical Chinese. In this pursuit, scholarship and pedagogy are intertwined: scholars draw on treebanks for quantitative evidence in their research, while students use them as reading support and contribute to them as a learning exercise.
Building treebanks is, however, labor-intensive. Our second goal is to extend the treebank to the rest of the Canon by leveraging recent advances in automatic natural language parsing. A syntactic parser will be trained on the hand-crafted treebank from the first goal and then applied to the remaining texts. The resulting treebank for the entire Canon will be openly available to the public.
The Buddhist Canon has been an important object of scholarship for centuries. This work advances the process, already underway, of applying emerging methods from computational linguistics to the study of historical languages. It has the potential to complement traditional research and pedagogical methodology with data-driven, quantitative analyses, and to give scholars and students the opportunity to work effectively with much larger volumes of text than ever before.