The Automatic Annotation of the SWBD-ISO Corpus of Spoken Dialogue

Introduction

The Switchboard Corpus of Transcribed Dialogue, distributed through the Linguistic Data Consortium, contains 95 hours of sound recordings of 1,155 conversations comprising 115,000 speaker turns. It was previously annotated for dialogue acts according to the DAMSL scheme. In this project, it was re-annotated according to the new international standard for dialogue act annotation published by the International Organization for Standardization (ISO), of which Alex Fang is one of the authors.

Motivation

The project set out to construct an appropriately annotated corpus of spoken dialogue, with three aims: to identify the semantics and communicative functions of spoken utterances, to better understand the interactive mechanisms of human speech, and to help design a computational model of speech-mediated interaction. More specifically, the resource is expected to yield knowledge that enhances the naturalness of human-machine dialogue, reduces dysfluencies in such dialogue, improves the design of the dialogue manager in interactive dialogue systems, and increases user satisfaction.

Methodology and results

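Linguistic knowledge was applied automatically to the transcriptions, which covered 94.2% of the corpus; the remainder was annotated manually. The table below lists the 15 most frequent ISO dialogue acts (DAs) by token count, with each act's share of tokens and the cumulative share.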

Top-15 ISO DAs            Tokens  Token%    Cum%
Inform                 1,265,448   82.88   82.88
AutoPositive              66,501    4.35   87.23
PropositionalQuestion     39,602    2.60   89.83
SetQuestion               15,841    1.04   90.87
Answer                    11,784    0.77   91.64
CheckQuestion             11,053    0.72   92.36
InitialGoodbye             9,442    0.62   92.98
Question                   5,068    0.33   93.31
ChoiceQuestion             4,502    0.29   93.60
Completion                 3,188    0.21   93.81
Stalling                   2,998    0.20   94.01
Disconfirm                 1,660    0.11   94.12
AutoNegative                 798    0.05   94.17
Offer                        590    0.04   94.21
AcceptApology                358    0.02   94.23
Total                  1,438,833           94.23
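
As a quick consistency check on the table above, the token total and the cumulative percentages can be re-derived from the per-act figures. The Python sketch below is illustrative only: the variable names are assumptions, the figures are copied from the table, and it does not claim to reproduce how the project itself computed them.

```python
from decimal import Decimal
from itertools import accumulate

# (dialogue act, token count, token share in %), copied from the table above.
# Shares are kept as strings so Decimal can sum them without float rounding.
ROWS = [
    ("Inform", 1265448, "82.88"),
    ("AutoPositive", 66501, "4.35"),
    ("PropositionalQuestion", 39602, "2.60"),
    ("SetQuestion", 15841, "1.04"),
    ("Answer", 11784, "0.77"),
    ("CheckQuestion", 11053, "0.72"),
    ("InitialGoodbye", 9442, "0.62"),
    ("Question", 5068, "0.33"),
    ("ChoiceQuestion", 4502, "0.29"),
    ("Completion", 3188, "0.21"),
    ("Stalling", 2998, "0.20"),
    ("Disconfirm", 1660, "0.11"),
    ("AutoNegative", 798, "0.05"),
    ("Offer", 590, "0.04"),
    ("AcceptApology", 358, "0.02"),
]

# Sum of the token counts for the 15 listed dialogue acts.
token_total = sum(tokens for _, tokens, _ in ROWS)

# Running sum of the per-act shares, i.e. the cumulative column.
cumulative = list(accumulate(Decimal(share) for _, _, share in ROWS))

for (act, tokens, share), cum in zip(ROWS, cumulative):
    print(f"{act:<23}{tokens:>9,}{share:>8}{str(cum):>8}")
print(f"{'Total':<23}{token_total:>9,}{'':>8}{str(cumulative[-1]):>8}")
```

Running the sketch reproduces the final row of the table: 1,438,833 tokens and a cumulative share of 94.23%.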

Conclusion

That over 94% of the transcriptions could be handled automatically in the project demonstrates that linguistic knowledge, when properly applied through computer technologies, is a powerful instrument that reduces time and labour and increases accuracy and efficiency, hence maximising productivity.

References

Fang, A C, Cao, J, & Liu, X (2013). SWBD-ISO: A Corpus-Based Study of the Most Recent ISO Standards for Dialogue Act Annotation. Contemporary Linguistics 《當代語言學》, Vol 15, No 4. pp 439-458.
Fang, A C, Cao, J, Liu, X, & Bunt, H (2013). Lexical Characteristics of Dialogue Acts in the Switchboard Corpus of Telephone Conversations. Journal of Foreign Language Teaching and Research 《外語教學與研究》, Vol 44, No 3. pp 28-40.
Fang, A C, Bunt, H, Cao, J, & Liu, X (2011). Relating the Semantics of Dialogue Acts to Linguistic Properties: A machine learning perspective through lexical cues. In Proceedings of the 5th