Multilingual and Multimodal English-Mandarin-Cantonese-Japanese parallel corpus and an online parallel concordancing platform for comparative linguistic studies


Previous corpus-based research has mostly focused on a single language, i.e. it has been monolingual (Xiao & McEnery, 2004; Ishikawa, 2008; McEnery & Hardie, 2011; Tsou, Kwong, Lu and Tsoi, 2011), and research on parallel corpora is less common (Chujo, Utiyama and Nishigaki, 2005; Wang, 2001; Wang, 2005; Lu, Chow and Tsou, 2012). Comparative linguistic research on genetically unrelated languages that adopts a multilingual and multimodal corpus-based approach is rarer still. As multilingualism has been identified as one of the key research agendas of LML, and given the research backgrounds of the proposed team members in four major world languages, namely Mandarin, English, Japanese, and Cantonese, we believe that a multilingual corpus project focusing on comparative linguistic studies of these languages will provide a research platform that promotes quality team research in this area in the department. We plan to develop research agendas in a number of directions:

  • Compilation of a multilingual and multimodal English-Mandarin-Cantonese-Japanese parallel corpus for research and pedagogical purposes;
  • Development of a parallel concordancing computer program that allows users to search the multilingual parallel corpus and that automatically produces parallel concordance examples and other useful corpus statistics;
  • Development of an online platform that hosts the multilingual parallel corpus and the parallel concordancing program, allowing researchers and language teachers/learners to fully explore the corpus data;
  • Comparative linguistic studies based on the multilingual parallel corpus data: Mandarin-English, Mandarin-Cantonese, English-Cantonese, English-Japanese, Japanese-Mandarin, and Japanese-Cantonese comparative studies. Linguistic investigation of certain features across four languages can also be carried out.
  • Translation studies based on the multilingual parallel corpus data.

The long-term aim of this project is to build on the research team’s expertise in parallel corpus development and Mandarin-English comparative studies (Wang Lixun), Japanese-English comparative studies (Zoe Luk), Japanese-Mandarin / Cantonese comparative studies (Kataoka Shin) and Cantonese-Mandarin comparative studies (Chin Chi On) to develop team research agendas for LML in the areas of multilingualism, comparative language studies, and translation studies. The short-term aim is to conduct a pilot project involving the development of an online platform that hosts six searchable parallel corpora (Mandarin-English; Mandarin-Cantonese; English-Cantonese; English-Japanese; Japanese-Mandarin; Japanese-Cantonese) for research and pedagogical purposes. From a theoretical perspective, comparing different languages will give us a better understanding of what human language is, because different languages encode messages and concepts differently. From a pedagogical perspective, the findings of comparative linguistics can improve language teaching: many subtle differences between languages have yet to be explained (e.g., why one language uses a transitive verb and another an intransitive verb to describe the same event), and being able to explain these differences explicitly may help language learners grasp concepts that are absent from their first language.


Principal researcher:

Wang Lixun


Team members:

Chin Chi On, Kataoka Shin, Zoe Luk Pei Sui

The Pilot Project
In the pilot project, around 100,000 words of language data in each language will be collected for each of the six parallel corpora. The data will come from movie subtitles (Hollywood, Japanese, Mandarin and Cantonese movies), comprising the original texts plus their equivalent translations. The multilingual parallel corpus can be regarded as multimodal because researchers can readily access and watch the movies themselves when carrying out research based on the corpus data. Once the multilingual parallel corpus is complete, the research team intends to carry out comparative linguistic studies on topics such as the use of deixis and ellipsis in English, Mandarin, Cantonese and Japanese.

Deliverable outcomes:

A multilingual and multimodal English-Mandarin-Cantonese-Japanese parallel corpus;

A parallel concordancing program that can search the parallel corpus and automatically produce parallel concordance examples;

A website hosting the parallel corpus and the concordancing program, so that researchers and language teachers/learners can explore the parallel corpus data online conveniently.
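
To illustrate what a parallel concordance search might look like, the following is a minimal Python sketch and not the project's actual program. It assumes the corpus is already aligned into source/translation segment pairs; the AlignedPair structure, the parallel_concordance function, the width parameter and the sample sentences are all hypothetical and were invented for this illustration. Matching is done on substrings rather than on whitespace-separated words, since Mandarin, Cantonese and Japanese text is not delimited by spaces.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    """One aligned subtitle segment and its translation (hypothetical structure)."""
    source: str
    target: str


def parallel_concordance(pairs, keyword, width=30):
    """Return a (left context, match, right context, translation) tuple
    for every case-insensitive occurrence of keyword on the source side."""
    if not keyword:
        return []
    pattern = re.compile(re.escape(keyword), flags=re.IGNORECASE)
    hits = []
    for pair in pairs:
        for m in pattern.finditer(pair.source):
            left = pair.source[max(0, m.start() - width):m.start()]
            right = pair.source[m.end():m.end() + width]
            hits.append((left, m.group(0), right, pair.target))
    return hits


# Invented English-Mandarin sample segments, for illustration only.
SAMPLE = [
    AlignedPair("I will wait for you here.", "我会在这里等你。"),
    AlignedPair("Here is your ticket.", "这是你的票。"),
    AlignedPair("See you tomorrow.", "明天见。"),
]

if __name__ == "__main__":
    results = parallel_concordance(SAMPLE, "here")
    for left, match, right, translation in results:
        # Keyword-in-context line followed by the aligned translation.
        print(f"{left:>30}[{match}]{right:<30} | {translation}")
    print(f"Occurrences found: {len(results)}")
```

The number of hits returned gives a simple frequency count, one example of the kind of corpus statistic that could be reported alongside the concordance lines. A real system would also need segment alignment, tokenisation for each language, and indexing for speed, none of which is covered by this sketch.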

