Codes, according to Knight (2011: 197-198), are ‘physical forms of discourse-level annotations which mark specific semantic, pragmatic, and/or functional characteristics of discourse in a corpus. These are often used as a point of entry for analysing specific features of corpora.’ Although many coding schemes have been implemented, and some are adaptable to other projects, no standard coding scheme has yet been universally adopted. As coding for multiple modes of behaviour (and across multiple speakers) can be time-consuming, it is recommended that a coding scheme tailored to the needs of a given research project be established from the outset. Variation between schemes depends on a number of factors, including:
- The specific modes and non-verbal behaviours that are intended to be analysed.
- The system of annotation used and the tools that facilitate it, ranging from annotations embedded directly into transcripts to sophisticated multi-modal tools that represent levels of co-occurrence on tiers (see the sketch after this list).
- The level of detail that the researcher intends to analyse. This will place a scheme on a spectrum from broad to fine-grained (Yang et al. 2022). For example, some researchers may be interested in phases of behaviour, while others may be interested only in instances.
- The degree to which automaticity is integrated into the system of annotation, using tools such as the SPeeding Up the Detection of Non-iconic and Iconic Gestures (SPUDNIG) tool, which aims to speed up the annotation of hand gestures in ELAN.

This list is not intended to be definitive or exhaustive, but rather to document examples of a variety of approaches to coding which others may use as a starting point or reference when creating or adapting a bespoke coding scheme. What follows is a list of notable projects that have created multi-modal datasets and/or bespoke and replicable coding schemes.
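By way of illustration only, the following minimal Python sketch (the tier names, values and helper function are hypothetical and not drawn from any of the projects below) shows how time-aligned annotations on separate tiers can be stored and queried for temporal co-occurrence, the kind of representation that tier-based tools such as ELAN or Anvil provide natively.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    start: float   # seconds
    end: float     # seconds
    value: str     # code drawn from the scheme's controlled vocabulary

# Hypothetical tiers: one per mode/speaker, in an ELAN- or Anvil-style layout.
tiers = {
    "speech_A": [Annotation(0.0, 2.1, "yeah I think so")],
    "gesture_A": [Annotation(0.4, 1.2, "head nod")],
    "gaze_A": [Annotation(0.0, 3.0, "at interlocutor")],
}

def co_occurring(tiers, tier_id, ann):
    """Return annotations on other tiers that overlap a given annotation in time."""
    hits = []
    for other_id, annotations in tiers.items():
        if other_id == tier_id:
            continue
        for other in annotations:
            if other.start < ann.end and ann.start < other.end:  # temporal overlap
                hits.append((other_id, other))
    return hits

print(co_occurring(tiers, "speech_A", tiers["speech_A"][0]))
```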
Project/corpus: Video-mediated English as a Lingua Franca Conversations (ViMELF)
Reference: Brunner, M. L., & Diemer, S. (2021). Multimodal meaning making: The annotation of nonverbal elements in multimodal corpus transcription. Research in Corpus Linguistics, 9(1), 63-88.
Link: https://varieng.helsinki.fi/CoRD/corpora/ViMELF/
System: Annotations integrated into textual transcription
Research focus: The analysis of multi-modal features of communication using English as a lingua franca in virtual meetings.
Description: Consists of 55 Non-Verbal Elements (NVEs). Nine are associated with facial expressions, four with the head (including gaze features), three with physical stance and two with the speakers’ background. Thirty-nine features are associated with the hands or body; these include hand and body movements, but also actions such as standing up, walking, or camera movements that force a shift of perspective.
Project/corpus: MUMIN – A Nordic Network for MUltiModal Interfaces
Reference: Allwood, J., Cerrato, L., Jokinen, K., Navarretta, C. and Paggio, P. (2007a), The MUMIN coding scheme for the annotation of feedback, turn management and sequencing phenomena. Language Resources and Evaluation 41(3): pp. 273–87.
System: Annotations on tiers using Anvil multi-modal corpus tool
Research focus: Originally created to experiment with annotation of multimodal communication in short clips from movies and in video clips of interviews taken from Swedish, Finnish and Danish television broadcasting but also intends to be a general instrument for the study of gestures and facial displays in interpersonal communication, in particular the role played by multimodal expressions for feedback, turn management and sequencing.
Description: This coding scheme works on two levels: form and function.
Form is tagged depending on the feature. Hand forms are tagged for ‘handedness’ (i.e. both hands or a single hand) and trajectory (i.e. up, down, sideways, complex or other).
The functions relate to feedback, turn management and sequencing. Each of these has sub-categories and short tags. For example, ‘sequencing’ is categorised as opening, continuing and closing, tagged as S-open, S-continue and S-close respectively. Semiotic types are categorised as Indexical Deictic, Indexical Nondeictic, Iconic and Symbolic.
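As a rough sketch only, a MUMIN-style two-level annotation could be held as a record pairing form attributes with functional tags. The controlled vocabularies below are limited to values quoted above; the record layout itself is an illustrative assumption, not the official MUMIN format.

```python
# Controlled vocabularies taken from the values quoted in the description above.
HANDEDNESS = {"both", "single"}
TRAJECTORY = {"up", "down", "sideways", "complex", "other"}
SEQUENCING = {"S-open", "S-continue", "S-close"}
SEMIOTIC = {"Indexical Deictic", "Indexical Nondeictic", "Iconic", "Symbolic"}

# One hypothetical gesture record, coded on the two levels (form and function).
annotation = {
    "form": {"handedness": "both", "trajectory": "up"},
    "function": {"sequencing": "S-continue"},
    "semiotic_type": "Iconic",
}

# Simple validation against the controlled vocabularies.
assert annotation["form"]["handedness"] in HANDEDNESS
assert annotation["form"]["trajectory"] in TRAJECTORY
assert annotation["function"]["sequencing"] in SEQUENCING
assert annotation["semiotic_type"] in SEMIOTIC
```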
Project/corpus: HamNoSys, The Hamburg Sign Language Notation System component of the DGS corpus of German sign language
Reference: Hanke, T. (2004). ‘HamNoSys – representing sign language data in language resources and language processing contexts’. In Streiter, O. and Vettori, C. (eds), LREC 2004, Workshop Proceedings: Representation and Processing of Sign Languages. Paris: ELRA, pp. 1-6.
Link: https://www.sign-lang.uni-hamburg.de/dgs-korpus/index.php/hamnosys-97.html
Research focus: A system used by the DGS corpus of German sign language which aims to collect sign language data from deaf/Deaf people and make it accessible in an annotated corpus.
System: HamNoSys symbols are available as a Unicode font, with the characters mapped into the Private Use area of Unicode
Description: Uses a range of symbols to code for hand gesture and facial features while signing. Hand signs are coded for shape, orientation, action and location. A system for mouth gestures enumerates identified gestures. For example, C01 signifies cheeks puffed.
Project/corpus: NA
Reference: Cerrato, L. (2004) A coding scheme for the annotation of feedback phenomena in conversational speech. In Proceedings of the LREC Workshop on Models of Human Behaviour for the Specification and Evaluation of Multimodal Input and Output Interfaces (pp. 25-28).
Link: Proceedings of LREC workshop 2004 available via www.researchgate.net
Research focus: To produce a system to label feedback phenomena in conversational speech
System: Used with the annotation tool Multitool
Description: Feedback expressions are typologically labelled as: Words (W), Phrases (P), Sentences (S) and Gestures (G). Direction of feedback is coded as given (Giv) or elicited (Eli).
The following functions of feedback expressions are coded, with labels in parentheses: Continuation (CpI/CpY), Acceptance (A), Refusal (R), Expressive (Ex), Require confirmation (Req-C), Check that the interlocutor is following (Fol) and Desire to receive more information (Mo).
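A minimal sketch of how these labels could be checked programmatically is given below; the validator function is illustrative and not part of Cerrato’s (2004) materials, but the label sets are those quoted above.

```python
# Controlled vocabularies drawn from the labels listed in the description above.
TYPES = {"W", "P", "S", "G"}         # Words, Phrases, Sentences, Gestures
DIRECTIONS = {"Giv", "Eli"}          # given, elicited
FUNCTIONS = {"CpI", "CpY", "A", "R", "Ex", "Req-C", "Fol", "Mo"}

def is_valid_label(feedback_type, direction, function):
    """Check a (type, direction, function) triple against the controlled vocabularies."""
    return feedback_type in TYPES and direction in DIRECTIONS and function in FUNCTIONS

print(is_valid_label("W", "Giv", "A"))   # True
print(is_valid_label("G", "Eli", "Xx"))  # False
```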
Project/corpus: An unnamed corpus of thirty-eight native English-speaking students at the University of Manchester narrating cartoon stories while interacting with one of the experimenters, who asked them questions about the actions and the characters in the cartoons.
Reference: Holler, J. and Beattie, G. (2003). ‘How iconic gestures and speech interact in the representation of meaning: are both aspects really integral to the process?’ Semiotica, 146, 81-116.
Link: DOI: 10.1515/semi.2003.083
System: Binary system, no tools mentioned
Research focus: An investigation of the communicative role of iconic hand gestures
Description: Each iconic gesture is coded 0 or 1, where 0 indicates that the information contained in the accompanying speech or in the original scene is represented by the iconic gesture, and 1 indicates that it is not.
Project/corpus: Annotation of audiovisual corpora: MultiModal MultiDimensional (M3D) labeling scheme
Reference: Rohrer, P.L.; Vilà-Giménez, I.; Florit-Pons, J.; Esteve-Gibert, N.; Ren, A.; Shattuck-Hufnagel, S.; Prieto, P. (2020). ‘The MultiModal MultiDimensional (M3D) Labelling Scheme for the Annotation of Audiovisual Corpora’, In Proceedings of the 7th Gesture and Speech in Interaction (GESPIN), KTH Speech, Music & Hearing and Språkbanken Tal, Stockholm, Sweden, 7–9 September 2020.
Link: https://osf.io/ankdx/
System: Tiers in ELAN, with a template that has been made publicly available (see link). The prosodic dimension is analysed using Praat.
Description: The M3D scheme proposes the labelling of gestures and other communicative movements across three dimensions: the Form Dimension, the Semantic/Pragmatic Dimension, and the Prosodic Dimension.
The form dimension is coded using parent tiers that code the articulator and further tiers that code form, for example Handshape, Palm Orientation, Trajectory Direction and Trajectory Shape, coded with controlled vocabularies.
The pragmatic categories are: Referential, Operational, Modal, Performative, Discourse Marking, and Interactional.
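The tier organisation might be mirrored in a simple data structure such as the sketch below. The tier names follow the dimensions described above; the nesting and any details not quoted in the text are illustrative assumptions, not the official M3D template.

```python
# A rough, hypothetical mirror of an M3D-style tier layout.
M3D_TEMPLATE = {
    "Form": {
        "Articulator": {            # parent tier coding the articulator
            "Handshape": [],        # child tiers with controlled vocabularies
            "Palm Orientation": [],
            "Trajectory Direction": [],
            "Trajectory Shape": [],
        }
    },
    "Semantic/Pragmatic": {
        "Pragmatic category": [
            "Referential", "Operational", "Modal",
            "Performative", "Discourse Marking", "Interactional",
        ]
    },
    "Prosodic": {},  # analysed separately in Praat
}

for dimension, tiers in M3D_TEMPLATE.items():
    print(dimension, "->", list(tiers))
```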
Project/corpus: Nottingham Multi-Modal Corpus (NMMC)
Reference: Knight, D. (2011). Multimodality and active listenership: A corpus approach. Bloomsbury Publishing.
Link: (to purchase book) https://www.bloomsbury.com/us/multimodality-and-active-listenership-9781441167231/
Research focus: The pragmatic functions of signals of active listenership; both spoken backchannels and head nods.
System: The now discontinued Digital Replay System (DRS)
Description: Using a novel coding matrix with alphabetised types, nod backchannel forms are coded for intensity and duration and functions derived from O’Keeffe and Adolphs’ (2008) model of response token functional categories (Continuer, Convergence, Information Receipt and Engagement Response).
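A minimal, hypothetical sketch of such a form-by-function matrix is given below. The intensity/duration form labels are invented placeholders; only the four functional categories are taken from O’Keeffe and Adolphs (2008) as cited above.

```python
# Functional categories from O'Keeffe and Adolphs (2008), as cited above.
FUNCTIONS = ["Continuer", "Convergence", "Information Receipt", "Engagement Response"]
# Placeholder form labels combining intensity and duration (illustrative only).
FORMS = ["short, low intensity", "short, high intensity",
         "long, low intensity", "long, high intensity"]

# Cross-tabulate counts of observed (form, function) pairs for nod backchannels.
matrix = {form: {func: 0 for func in FUNCTIONS} for form in FORMS}
matrix["short, low intensity"]["Continuer"] += 1  # one observed nod, for illustration

for form, counts in matrix.items():
    print(form, counts)
```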
Project/corpus: VACE meeting corpus
Reference: Chen, L., Rose, R. T., Qiao, Y., Kimbara, I., Parrill, F., Welji, H., Han, T. X., Tu, J., Huang, Z., Harper, M., Quek, F., Xiong, Y., McNeill, D., Tuttle, R. and Huang, T. (2006). ‘VACE multimodal meeting corpus’. In Proceedings of the International Workshop on Machine Learning for Multimodal Interaction, pp. 40–51.
Research focus: Multimodal cues for understanding meetings, focusing on the interaction among speech, gesture, posture, and gaze.
System: MacVissta multimodal analysis tool
Description: To investigate the role of gaze in turn-taking, gaze is coded for each speaker in terms of its object (who or what the gaze is directed at) at each moment.
Project/corpus: Bielefeld Speech and Gesture Alignment corpus (SaGA)
Reference: Lücking, A., Bergman, K., Hahn, F., Kopp, S. and Rieser, H. (2010), The Bielefeld speech and gesture alignment corpus (SaGA). In Proceedings of the LREC Workshop on Multimodal Corpora, Mediterranean Conference Centre, Malta, 18 May 2010, pp. 92–8.
Link: https://pub.uni-bielefeld.de/record/2001935
System: Tiers in ELAN
Research focus: The study of the alignment of speech and gesture.
Description: A detailed hierarchical structure of codes that considers (in descending order in the hierarchy) sequence, phrase (e.g. deictic, beat, iconic), phase (e.g. preparation, stroke), practice (e.g. grasping, shaping, sizing), perspective (e.g. survey, speaker) and referent (e.g. object, location, action).
A full annotation manual is available via the link above.
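A single gesture record following this hierarchy might look like the sketch below. The field values are limited to examples quoted in the description above; the record layout itself is illustrative, not the SaGA annotation format.

```python
# One hypothetical gesture record ordered as in the hierarchy described above:
# sequence > phrase > phase > practice > perspective > referent.
gesture = {
    "sequence": "seq-01",        # identifier for the gesture sequence (placeholder)
    "phrase": "iconic",          # e.g. deictic, beat, iconic
    "phase": "stroke",           # e.g. preparation, stroke
    "practice": "shaping",       # e.g. grasping, shaping, sizing
    "perspective": "speaker",    # e.g. survey, speaker
    "referent": "object",        # e.g. object, location, action
}
print(gesture)
```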
Project/corpus: Corpus of Academic Spoken English (CASE)
Reference: Diemer, S., Brunner, M. L., & Schmidt, S. (2016). Compiling computer-mediated spoken language corpora: Key issues and recommendations. International Journal of Corpus Linguistics, 21(3), 348-371.
Research focus: To analyse informal academic communication on a computer-mediated platform (Skype)
System: Embedded annotations into transcript using curly brackets e.g. {shrugs}
Description: Concise descriptive language using present tense verbs to describe head gestures (including gaze), facial expressions, hands/body, physical stance and background movement and noises.
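A minimal sketch of how such curly-bracket annotations can be pulled out of a transcript line is shown below; the example utterance is invented, but the {curly bracket} convention is the one described above.

```python
import re

# Example transcript line with CASE-style embedded annotations (invented utterance).
line = "I'm not sure about that {shrugs} but we could check {looks at notes}"

nonverbal = re.findall(r"\{([^}]*)\}", line)       # annotation contents
speech_only = re.sub(r"\s*\{[^}]*\}", "", line)    # utterance with annotations removed

print(nonverbal)     # ['shrugs', 'looks at notes']
print(speech_only)   # "I'm not sure about that but we could check"
```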
Project/corpus: The AMI meeting corpus
Reference: Carletta, J. (2006) ‘Announcing the AMI Meeting Corpus’, The ELRA Newsletter 11(1), January-March, p. 3-5.
Link: https://groups.inf.ed.ac.uk/ami/corpus/overview.shtml
Research focus: To produce research and technology that will help groups interact better.
System: Annotations created using NXT, with transcription and time-stamped labellings imported from ChannelTrans and EventEditor respectively.
Description: Annotations describe: named entities (e.g. colours, shapes); dialogue acts (e.g. backchannels); topic segmentation (e.g. openings, agenda items); abstractive summaries giving general abstracts of a meeting, including any problems encountered; extractive summaries identifying parts of a meeting that support the contents of an abstractive summary; limited head gestures, including nodding; limited hand gestures used in turn-taking; movement around the room; face, mouth and hand location for the development of tracking software; and coarse gaze annotation (e.g. looking at the whiteboard).
Project/corpus: A multi-scale investigation of the human communication system’s response to visual disruption
Reference: Trujillo, J. P., Levinson, S. C. and Holler, J. (2022). ‘A multi-scale investigation of the human communication system’s response to visual disruption’. R. Soc. Open Sci. 9: 211489. https://doi.org/10.1098/rsos.211489
Research focus: To test whether gesture and speech are dynamically co-adapted to meet communicative needs.
System: The SPUDNIG application, used to process videos with OpenPose, a video-based motion-tracking algorithm. The output was subsequently checked and annotated manually.
Link: (article and supplementary material) https://royalsocietypublishing.org/doi/full/10.1098/rsos.211489
Description: Following the automated process using the tools above, each gesture is annotated as representational, abstract deictic, pragmatic, emblematic or interactive.
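The manual step described above might be sketched as follows: automatically detected gesture intervals are assigned a category from the controlled list. The interval data and helper function are hypothetical; only the five category names are taken from the description above.

```python
# Gesture categories quoted in the description above.
GESTURE_TYPES = {"representational", "abstract deictic", "pragmatic",
                 "emblematic", "interactive"}

# Placeholder intervals standing in for automatically detected gestures (seconds).
detected = [
    {"start": 12.40, "end": 13.10},
    {"start": 15.02, "end": 15.85},
]

def label_gesture(interval, gesture_type):
    """Attach a manually chosen gesture category to an automatically detected interval."""
    if gesture_type not in GESTURE_TYPES:
        raise ValueError(f"Unknown gesture type: {gesture_type}")
    return {**interval, "type": gesture_type}

labelled = [label_gesture(detected[0], "pragmatic"),
            label_gesture(detected[1], "representational")]
print(labelled)
```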
Project/corpus: NEUROpsychological GESture (NEUROGES)
Reference: Lausberg, H. (2019). The NEUROGES® Analysis System for Nonverbal Behavior and Gesture. The Complete Research Coding Manual including an Interactive Video Learning Tool and Coding Template. Peter Lang, Berlin.
Link: https://neuroges.neuroges-bast.info/
Research focus: The creation of an analysis system for nonverbal behaviour, focussing on hand movement and gestural behaviour.
System: The NEUROGES–ELAN system
Description: An algorithmic coding system combining kinetic and functional coding. Categories within the algorithms are chosen for their relationship to neuropsychological states or their neurobiological correlates. The system encodes within three modules: Module I is for kinetic gesture coding, Module II is for bimanual relationship coding and Module III is for functional gesture coding.
Project/corpus: Database of Speech and Gesture (DoSaGE)
Reference: Kong, A. P. H., Law, S. P., Kwan, C. C. Y., Lai, C., & Lam, V. (2015). A coding system with independent annotations of gesture forms and functions during verbal communication: Development of a Database of Speech and GEsture (DoSaGE). Journal of nonverbal behavior, 39(1), 93-111.
Research focus: To examine how speakers’ age and linguistic performance were related to the frequency of gesture employment using a large sample of normal speakers.
System: Using ELAN, three independent tiers were generated to annotate (1) the linguistic information of the transcript, (2) the forms of gestures that appeared, and (3) the function of each gesture used.
Description: Forms were classified using six categories: (1) Iconic, (2) Metaphoric, (3) Deictic, (4) Emblems, (5) Beat and (6) Non-identifiable.
Eight functions are classified as follows: (1) Providing additional information to the message conveyed, (2) Enhancing the speech content, (3) Providing alternative means of communication, (4) Guiding and controlling the flow of speech, (5) Reinforcing the intonation or prosody of speech, (6) Assisting lexical retrieval, (7) Assisting sentence re-construction and (8) No specific function deduced.
Project/Corpus: REmote COL-laborative and Affective interactions (RECOLA)
Reference: Ringeval, F., Sonderegger, A., Sauer, J., & Lalanne, D. (2013). Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In 2013 10th IEEE international conference and workshops on automatic face and gesture recognition (FG) (1-8). IEEE.
Research focus: To assess emotional arousal and valence in dyadic video conferences
Link: https://diuf.unifr.ch/main/diva/recola/download.html
System: Original web-based tool for annotation called ANNEMO
Description: The two affective dimensions (arousal and valence) were annotated separately and time-continuously, using a slider with values ranging from -1 to +1 in steps of 0.01. The social dimensions were rated using 7-point Likert scales on five dimensions: agreement, dominance, engagement, performance and rapport.
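The value ranges described above can be sketched as follows; the quantisation helper is illustrative and not part of the ANNEMO tool, but the slider range, step size and the five social dimensions are those quoted above.

```python
def quantise_slider(value, step=0.01, lo=-1.0, hi=1.0):
    """Clamp a raw slider value to [lo, hi] and round it to the nearest step."""
    clamped = max(lo, min(hi, value))
    return round(round(clamped / step) * step, 2)

SOCIAL_DIMENSIONS = ["agreement", "dominance", "engagement", "performance", "rapport"]
LIKERT = range(1, 8)  # 7-point scale

print(quantise_slider(0.734))   # 0.73
print(quantise_slider(-1.2))    # -1.0
print(list(LIKERT))             # [1, 2, 3, 4, 5, 6, 7]
```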
Project/Corpus: Human-Computer Interaction Technologies (HuComTech)
Reference: Pápay, K., Szeghalmy, S. & Szekrényes, I. (2011): HuComTech Multimodal Corpus Annotation. Argumentum 7, 330–347.
Link: CLARIN blog post
Research focus: To investigate the nature and temporal alignment of verbal and nonverbal features of spontaneous speech, to compare the characteristics of formal and informal communication in Hungarian, and to improve human-machine communication applications (such as chatbots) by equipping them with a comprehensive set of knowledge about human-human communicative behaviour.
System: Tiers in ELAN
Description: Annotations were carried out based on either one mode (audio only or video only) or two modes (audio and video). The corpus also includes syntactic, prosodic and pragmatic annotation. The syntactic annotation was restricted to the identification and classification of clauses and sentences.
The pragmatic annotation was carried out on two separate levels, multimodal (based on both audio and video) and unimodal (based on video only). Multimodal pragmatic annotation codes communicative functions and speaker intentions.
References:
Blache, P., Ferré, G., and Rauzy, S. (2008). ‘An XML coding scheme for multimodal corpus annotation’, Corpus Linguistics, 1-17.
Kendon A. (1967). ‘Some functions of gaze-direction in social interaction’, Acta Psychologica, 26, 22–63.
Knight, D. (2011). Multimodality and active listenership: A corpus approach. Bloomsbury Publishing.
McNeill, D. (1992). Hand and Mind: What Gestures Reveal about Thought. Chicago: The University of Chicago Press.
O’Keeffe, A. and Adolphs, S. (2008), Using a corpus to look at variational pragmatics: response tokens in British and Irish discourse. In Schneider, K. P. and Barron, A. (eds) Variational Pragmatics. Amsterdam, Netherlands: John Benjamins, pp. 69–98.
Yang, H., Zhao, Y., Liu, J., Wu, Y., & Qin, B. (2022). MACSA: A Multimodal Aspect-Category Sentiment Analysis Dataset with Multimodal Fine-grained Aligned Annotations. arXiv preprint arXiv:2206.13969.