On-going Projects

Title: Meeting the Challenge of Teaching and Learning Language in the University: Enhancing Linguistic Competence and Performance in English and Chinese
Funding Amount: HKD $3,398,646
Funding Source: UGC Teaching & Learning Related Proposals (2016-19 Triennium)
Duration: 2017-

Title: Developing a collaborative platform for collecting, coding, analyzing and visualizing patterns in language data
Funding Amount: HKD $130,000
Funding Source: 2012-2013 CityU Strategic Research Grant
Duration: 2012-

Title: A Treebank for the Chinese Buddhist Canon
Funding Amount: HKD $783,000
Funding Source: 2012 GRF/ECS Grant
Duration: 2012-

Meeting the Challenge of Teaching and Learning Language in the University: Enhancing Linguistic Competence and Performance in English and Chinese (UGC Teaching & Learning Related Proposals)

This project aims to enhance university students’ linguistic competence and performance in meeting the academic challenges of higher education. It integrates FOUR mutually inclusive and complementary sub-components to provide a multi-dimensional and balanced teaching and learning environment for upgrading English and Chinese language abilities.

The goal of Sub-component #1 is to develop an intelligent web-based system that helps university students improve their proficiency in EFL writing and prepares them for more demanding writing tasks in both occupational and academic settings. Through the project, students will gain mastery in both comprehending and producing the full range of alternatives between what is common in everyday speech and what is expected, and thus required, in academic writing.

The goal of Sub-component #2 is to ensure an effective upgrade of language ability and a lasting impact on all language learners through a discovery-enriched adventure of making sense of Chinese and English. The goal will be achieved via FIVE crucial components of learning, abbreviated as MERIT: Motivation, Engagement, Relevance, Innovation, and Transferring of knowledge. Through discovery-enriched activities, the MERIT programme is designed to motivate students to engage actively in finding relevant issues and discovering innovative solutions for practical knowledge transfer. Students will work together to create a website called Ask-Me-Why to share and publicize the learning outcomes.

Sub-component #3 aims at improving adult learners’ Chinese language skills, both written and spoken (Cantonese and Putonghua), and their ability to communicate, through the creation of a web-based multi-modal learning environment known as the International Chinese Language Salon (ICLS). The ICLS platform will provide interactive speaking coaching and online writing tutoring, with individual feedback provided to each student, as well as a variety of online resources, such as video tutorials, audio recordings, and grammatical and phonetic reference guides.

The goal of Sub-component #4 is to create An English Pronunciation Guide for Cantonese Students in Hong Kong (The Guide), which will include (1) a description of the International Phonetic Alphabet (IPA), the universal system of symbols for representing speech sounds of human languages; (2) a description of the English sound system and a comparison of the sound systems of British English and North American English; (3) a description of the Cantonese sound system and a comparison between the Cantonese and English sound systems; and (4) lists of possible pronunciation errors in English made by Cantonese students.

Developing a collaborative platform for collecting, coding, analyzing and visualizing patterns in language data (2012-2013 CityU Strategic Research Grant)

Data collection and analysis are essential to linguistic research. With the increasing quantities of text data made available on a daily basis, people are confronted with two major problems: how to effectively annotate such data with linguistically sophisticated coding, and how to analyze the coded data for linguistic insight that enriches our understanding of the underlying functions and semantics.

In this project we propose to develop a collaborative platform for the analysis and coding of linguistic data, emphasizing the functional-semantic aspect of discourse. The platform consists of two integral components: a theory-neutral annotation tool for collecting and collaboratively annotating/coding heterogeneous linguistic data, and an innovative visualization interface for discovering patterns in big text data. Our visualization interface, called the Visual Semantics Interface (VSI), is an innovative tool for exploring the functional-semantic structure of discourse. We aim to provide a centrally managed linguistic resource hub for collecting, coding, analyzing and visualizing patterns in language data.

This proposed collaborative Web-based platform will facilitate multiuser annotation, querying and visualization of large-scale corpora for interactive functional-semantic analysis of discourse. Major innovations featured in the project include: (1) an ‘instant-visualization’ editor, built on a generic multilingual text indexing system, that facilitates the annotation and verification of structured text using graphical visualization generated in real time; (2) tools for aligning coded data about images, sound clips and video streams with textual information, which facilitate multimodal investigation; (3) a collaborative Visual Semantics Interface that generates visualizations in response to queries aimed at discovering lexical and grammatical patterning; and (4) a plugin mechanism for intelligent processing, involving an automatic coding system based on an ensemble of machine learning algorithms that learn from annotated instances to speed up further annotation and assist in automating the VSI.
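As a rough illustration of how an ensemble might suggest codes for new text once some instances have been annotated, the following Python sketch combines three very simple learners by majority vote. Everything here is an assumption made for illustration: the project description does not say which algorithms the ensemble uses or how their outputs are combined, and the function names, the toy code labels and the example segments are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple

Annotated = List[Tuple[str, str]]   # (text segment, code) pairs; segments are non-empty
Classifier = Callable[[str], str]


def train_majority(data: Annotated) -> Classifier:
    """Learner 1: always predict the most frequent code in the annotated data."""
    most_common = Counter(code for _, code in data).most_common(1)[0][0]
    return lambda segment: most_common


def train_keyword(data: Annotated, position: int) -> Classifier:
    """Learner 2/3: predict the code most often seen with the word at
    `position` (0 = first word, -1 = last word); fall back to the majority code."""
    fallback = train_majority(data)
    table: Dict[str, Counter] = defaultdict(Counter)
    for segment, code in data:
        table[segment.split()[position]][code] += 1

    def predict(segment: str) -> str:
        word = segment.split()[position]
        if word in table:
            return table[word].most_common(1)[0][0]
        return fallback(segment)

    return predict


def suggest_code(segment: str, classifiers: List[Classifier]) -> str:
    """Combine the learners by simple majority vote."""
    votes = Counter(clf(segment) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Hypothetical annotated instances with toy codes.
annotated: Annotated = [
    ("did she leave ?", "question"),
    ("please sit down .", "command"),
    ("she left early .", "statement"),
    ("he arrived late .", "statement"),
    ("did he arrive ?", "question"),
    ("they study syntax .", "statement"),
]

classifiers = [
    train_majority(annotated),
    train_keyword(annotated, 0),
    train_keyword(annotated, -1),
]

print(suggest_code("did they finish ?", classifiers))   # question
```

In a setting like the one described, such suggestions would presumably be shown to annotators for confirmation rather than applied automatically, so that each confirmed instance can be added to the training data.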

Using the VSI platform, we have created an initial multi-purpose corpus for bootstrapping the intelligent coding system. Once completed, the annotation platform, the visualization interface and the corpus will be made freely available to the public to facilitate further collaboration. A recent trial run of an initial implementation of the VSI with undergraduate students in a text linguistics course demonstrated the VSI’s potential for facilitating collaborative investigation of texts. Such experiments also provide input for further development and improvement of the VSI’s design, leading to enhanced multi-layer VSI graphics and greater automation of the linguistic encoding of texts.

A Treebank for the Chinese Buddhist Canon (2012 GRF/ECS Grant)

This proposal aims to create a treebank --- a database of syntactic parses of sentences in a text corpus --- for the Chinese Buddhist Canon. It is difficult to apply standard techniques for textual analysis to the Canon: with 52 million characters, divided into 1,514 individual texts, its sheer volume overwhelms the community of scholars and students who can read it. Furthermore, since the texts represent translations from Indic languages into Chinese from the 2nd to the 11th centuries, the Canon presents a challenging level of complexity.
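To make the notion of a treebank entry concrete, here is a minimal Python sketch that stores one parsed sentence as a dependency tree. The dependency representation, the `Token` and `TreebankEntry` names, the relation labels, the text identifier and the analysis of the example phrase are all illustrative assumptions; the project description does not specify an annotation scheme or storage format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    index: int       # 1-based position in the sentence
    form: str        # the word as it appears in the text
    head: int        # index of the governing token; 0 marks the root
    relation: str    # hypothetical syntactic relation label


@dataclass
class TreebankEntry:
    text_id: str                       # hypothetical identifier of the source text
    tokens: List[Token] = field(default_factory=list)

    def root(self) -> Token:
        """Return the token that heads the whole sentence."""
        return next(t for t in self.tokens if t.head == 0)

    def dependents(self, index: int) -> List[Token]:
        """Return the tokens directly governed by the token at `index`."""
        return [t for t in self.tokens if t.head == index]


# Illustrative analysis only; not taken from the project's treebank.
entry = TreebankEntry(
    text_id="example-0001",
    tokens=[
        Token(1, "如是", 3, "advmod"),
        Token(2, "我", 3, "nsubj"),
        Token(3, "聞", 0, "root"),
    ],
)

print(entry.root().form)                           # 聞
print([t.form for t in entry.dependents(3)])       # ['如是', '我']
```

A collection of such entries, together with metadata about each text, is the kind of resource that can be searched for syntactic patterns across a corpus.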

We need new techniques to study vast historical texts over time and space --- the Canon has not only 1000 years of linguistic data but also metadata such as the year and place of compilation and names of translators --- at a scale that was not feasible before we had access to digital collections and computational methods. These techniques would allow scholars to pose questions over the entire scope of the corpus, such as who influenced whom, which concepts survived over time, and what macroscopic structural patterns the texts exhibit. Within this larger context, we propose to address two goals.

Our first goal is to create a treebank of syntactically analyzed texts for a subset of the Buddhist Canon. The past decade has seen many digitization and annotation projects for historical texts ranging from classical Arabic to Middle English; we will be the first to build a large-scale treebank of classical Chinese. In this pursuit, scholarship and pedagogy are intertwined: scholars exploit treebanks for quantitative evidence for their research, while students use them as reading support and contribute to them as a learning exercise.

Building treebanks is, however, labor intensive. Our second goal is to complete the treebank for the rest of the Canon by leveraging recent advances in automatic natural language parsing. A syntactic parser will be trained on the hand-crafted treebank from the first goal, and then applied to the remaining texts. The resulting treebank for the entire Canon will be openly available to the public.

The Buddhist Canon has been an important object of scholarship for centuries. This work advances the process, already underway, of applying emerging methods from computational linguistics to the study of historical languages. It has the potential to complement traditional research and pedagogical methodology with data-driven, quantitative analyses, and gives scholars and students the opportunity to effectively work with much larger volumes of texts than ever before.