Ongoing Projects

Title: Developing a collaborative platform for collecting, coding, analyzing and visualizing patterns in language data
Funding Amount: HKD $130,000
Funding Source: 2012-2013 CityU Strategic Research Grant
Duration: 2012-

Title: A Treebank for the Chinese Buddhist Canon
Funding Amount: HKD $783,000
Funding Source: 2012 GRF/ECS Grant
Duration: 2012-


Developing a collaborative platform for collecting, coding, analyzing and visualizing patterns in language data (2012-2013 CityU Strategic Research Grant)

Data collection and analysis are essential to linguistic research. With the increasing quantities of text data made available on a daily basis, people are confronted with two major problems: how to effectively annotate such data with linguistically sophisticated coding, and how to analyze the coded data for linguistic insight that enriches our understanding of the underlying functions and semantics.

In this project we propose to develop a collaborative platform for the analysis and coding of linguistic data, emphasizing the functional-semantic aspect of discourse. The platform consists of two integral components: a theory-neutral annotation tool for collecting and collaboratively annotating and coding heterogeneous linguistic data, and a visualization interface for discovering patterns in large bodies of text data. This interface, called the Visual Semantics Interface (VSI), is an innovative tool for exploring the functional-semantic structure of discourse. We aim to provide a centrally managed linguistic resource hub for collecting, coding, analyzing and visualizing patterns in language data.

The proposed collaborative Web-based platform will facilitate multiuser annotation, querying and visualization of large-scale corpora for interactive functional-semantic analysis of discourse. The major innovations featured in the project include: (1) An 'instant-visualization' editor, built on a generic multilingual text indexing system, that facilitates annotating and verifying structured text using graphical visualizations generated in real time. (2) Tools for aligning coded data about images, sound clips and video streams with textual information, which facilitate multimodal investigation. (3) A collaborative Visual Semantics Interface that generates visualizations in response to queries aimed at discovering lexical and grammatical patterning. (4) A plugin mechanism for intelligent processing, including an automatic coding system based on an ensemble of machine learning algorithms that learn from annotated instances to speed up further annotation and assist in automating the VSI.
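To make item (4) concrete, the following is a minimal sketch of how an ensemble might learn codes from annotated instances and then suggest codes for new tokens. It is not the project's implementation: the three toy learners, the function names (`train_lookup`, `train_ensemble`), the example tokens and the code labels "Process" and "Participant" are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]           # (token, code) pair from human annotation
Predictor = Callable[[str], str]    # maps a token to a predicted code


def most_common(labels: List[str]) -> str:
    """Return the most frequent label (the first one seen wins ties)."""
    return Counter(labels).most_common(1)[0][0]


def train_lookup(data: List[Example], key: Callable[[str], str]) -> Predictor:
    """Learn key(token) -> most frequent code, with the overall majority as fallback."""
    by_key: Dict[str, List[str]] = defaultdict(list)
    for token, code in data:
        by_key[key(token)].append(code)
    table = {k: most_common(v) for k, v in by_key.items()}
    default = most_common([code for _, code in data])
    return lambda token: table.get(key(token), default)


def train_ensemble(data: List[Example]) -> Predictor:
    """Train three simple learners and combine their predictions by majority vote."""
    members = [
        train_lookup(data, key=lambda t: t.lower()),         # whole-word memory
        train_lookup(data, key=lambda t: t.lower()[-2:]),    # two-letter suffix
        train_lookup(data, key=lambda t: str(len(t) > 4)),   # crude length cue
    ]
    return lambda token: most_common([m(token) for m in members])


# Hypothetical annotated instances; the code labels are illustrative only.
annotated = [
    ("runs", "Process"), ("walks", "Process"), ("talks", "Process"),
    ("dog", "Participant"), ("cat", "Participant"), ("teacher", "Participant"),
]
suggest = train_ensemble(annotated)
```

Calling `suggest("speaks")` returns a suggested code that a human annotator could accept or correct. In a workflow like the one described above, corrected instances would be added to the annotated data and the ensemble retrained.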

Using the VSI platform, we have created an initial multi-purpose corpus for bootstrapping the intelligent coding system. Once completed, the annotation platform, the visualization interface and the corpus will be made freely available to the public to facilitate further collaboration. A recent trial run of an initial implementation of the VSI with undergraduate students in a text linguistics course demonstrated its potential for facilitating collaborative investigation of texts. Such trials also provide input for further development and improvement of the VSI's design, leading to enhanced multi-layer VSI graphics and greater automation of the linguistic encoding of texts.



A Treebank for the Chinese Buddhist Canon (2012 GRF/ECS Grant)

This proposal aims to create a treebank, a database of syntactic parses of the sentences in a text corpus, for the Chinese Buddhist Canon. Standard techniques for textual analysis are difficult to apply to the Canon: with 52 million characters divided into 1,514 individual texts, its sheer volume overwhelms the community of scholars and students who can read it. Furthermore, because the texts are translations from Indic languages into Chinese produced between the 2nd and 11th centuries, the Canon is also challenging in its complexity.
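As an illustration of what "a database of syntactic parses" can look like in practice, here is a minimal sketch that stores one sentence as a tree. The bracketed notation, the `Node` class, the tag names and the toy analysis of the sentence are assumptions made for this example; the proposal does not specify the annotation scheme.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of a parse tree: a label plus children (none for a token)."""
    label: str
    children: List["Node"] = field(default_factory=list)

    def leaves(self) -> List[str]:
        """Return the tokens of the sentence, left to right."""
        if not self.children:
            return [self.label]
        tokens: List[str] = []
        for child in self.children:
            tokens.extend(child.leaves())
        return tokens


def parse_bracketed(text: str) -> Node:
    """Parse a bracketed tree such as "(S (NP x) (VP y))" into Node objects."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Node:
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # consume "("
            node = Node(tokens[pos])      # phrase or category label
            pos += 1
            while tokens[pos] != ")":
                node.children.append(read())
            pos += 1                      # consume ")"
            return node
        leaf = Node(tokens[pos])          # a bare token
        pos += 1
        return leaf

    return read()


# A one-sentence "treebank"; the analysis and tag names are illustrative only.
treebank = [parse_bracketed("(S (ADVP 如是) (NP 我) (VP 聞))")]
sentence_tokens = treebank[0].leaves()
```

Once sentences are stored this way, structural queries (for example, listing every phrase with a given label) become simple tree traversals rather than string searches.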

We need new techniques to study vast historical texts over time and space at a scale that was not feasible before digital collections and computational methods became available. The Canon offers not only 1,000 years of linguistic data but also metadata such as the year and place of compilation and the names of translators. Such techniques would allow scholars to pose questions over the entire scope of the corpus, such as who influenced whom, which concepts survived over time, and what macroscopic structural patterns appear in the texts. Within this larger context, we propose to address two goals.

Our first goal is to create a treebank of syntactically analyzed texts for a subset of the Buddhist Canon. The past decade has seen many digitization and annotation projects for historical texts, ranging from classical Arabic to Middle English; we will be the first to build a large-scale treebank of classical Chinese. In this pursuit, scholarship and pedagogy are intertwined: scholars exploit treebanks for quantitative evidence in their research, while students use them as reading support and contribute to them as a learning exercise.

Building treebanks is, however, labor-intensive. Our second goal is to complete the treebank for the rest of the Canon by leveraging recent advances in automatic natural language parsing. A syntactic parser will be trained on the hand-crafted treebank from the first goal and then applied to the remaining texts. The resulting treebank for the entire Canon will be openly available to the public.

The Buddhist Canon has been an important object of scholarship for centuries. This work advances the process, already underway, of applying emerging methods from computational linguistics to the study of historical languages. It has the potential to complement traditional research and pedagogical methodology with data-driven, quantitative analyses, and to give scholars and students the opportunity to work effectively with much larger volumes of text than ever before.