Research: Current Projects

Latest Projects

ADAPTATIONS IN CONVERSATION: ENGAGING VOICES, FACES, BRAINS AND MACHINES

Funding: Natural Sciences and Engineering Research Council of Canada (NSERC)

Research Team: Yue Wang (PI, Linguistics, SFU), Paul Tupper (Mathematics, SFU), Maggie Clarke (SFU ImageTech Lab), Dawn Behne (Psychology, Norwegian University of Science and Technology), Allard Jongman (Linguistics, University of Kansas), Joan Sereno (Linguistics, University of Kansas), and members of the SFU Language and Brain Lab.

We are routinely engaged in face-to-face conversations, voice-only phone chats, and, increasingly, video-based online communication. We interact with people from different language backgrounds, and nowadays even with AI chatbots. Our experiences thus involve adjustments to our speech production and perception depending on whom we communicate with and in what environments. In this research, we explore how conversation partners from different backgrounds (e.g., native-nonnative, human-AI) adjust their speech for successful communication. Specifically, we collect audio-video recordings of live conversations involving interactive computer-game tasks that elicit words with specific sound contrasts, and we examine how interlocutors adapt their speech to resolve miscommunications as the conversation progresses. Speakers’ facial movements during target sound productions and the acoustic correlates of the same productions are analyzed, along with how these differences are perceived. Neural data are also collected to study brain activity during the conversation. Finally, the visual, acoustic, perceptual, and neural data are brought together using computational modeling to develop predictions about which adaptive attributes improve the likelihood of accurate communication.

CREATING ADAPTIVE VOCAL INTERFACES IN HUMAN-AI INTERACTIONS

Funding: FASS Breaking Barriers Grant, SFU Faculty of Arts and Social Sciences

Research Team: Yue Wang (PI, Linguistics, SFU), Henny Yeung (Co-Investigator, Linguistics, SFU), Angelica Lim (Co-Investigator, Computer Science, SFU), and members of the Language and Brain Lab, Language and Development Lab, and Rosie Lab.

AI-powered vocal interfaces are rapidly increasing in prevalence. A pressing issue, consequently, is that communication with these interfaces can break down, especially when speaking or listening is challenging (for language learners, children, individuals with speech or hearing impairments, in noisy conditions, etc.). The goal of this research is to investigate, across three experiments, how humans and vocal interfaces adapt their speech in the face of these misunderstandings. Specifically, Study 1 asks how human speech production changes in response to misperceptions by AI-powered vocal interfaces. Study 2 creates an adaptive AI-powered vocal interface that communicates better when humans misunderstand. Study 3 brings this work outside SFU to the community, examining naturalistic interactions between humans and social robots that implement the adaptive conversational platform developed in Studies 1 and 2. Findings will improve the technology behind existing virtual assistants, fostering technological engagement in education and in other diverse, multilingual environments.

HYPER-ARTICULATION IN AUDITORY-VISUAL COMMUNICATION

Funding: Social Sciences and Humanities Research Council of Canada (SSHRC)

Research Team: Yue Wang (PI, Linguistics, SFU), Allard Jongman (Linguistics, University of Kansas), Joan Sereno (Linguistics, University of Kansas), Dawn Behne (Psychology, Norwegian University of Science and Technology), Ghassan Hamarneh (Computer Science, SFU), Paul Tupper (Mathematics, SFU), and members of the SFU Language and Brain Lab.

Human speech involves multiple styles of communication. In adverse listening environments or in challenging linguistic conditions, speakers often alter their productions using a clarified articulation style termed “hyperarticulation”, with the intention of improving intelligibility and comprehension for listeners. Questions thus arise as to what strategies speakers use to enhance their speech and whether these strategies are effective in improving intelligibility and comprehension. This research examines hyperarticulation in words differing in voice and facial cues to identify which speech-enhancing cues are important for making words more distinctive. We examine (1) both acoustic voice properties and visual mouth configurations in hyperarticulated words, using innovative computerized sound and image analysis techniques; (2) the intelligibility of hyperarticulated words, presenting the speaker’s voice and/or face to perceivers for word identification; and (3) the relationship between speaker and perceiver behavior, based on computational and mathematical modeling, to determine how speakers and perceivers cooperate to encode and decode hyperarticulated cues in order to achieve optimal communication.

VISUAL PROCESSING OF PROSODIC AND SEGMENTAL SPEECH CUES: AN EYE-TRACKING STUDY

Funding: Social Sciences and Humanities Research Council of Canada (SSHRC)

Research Team: Yue Wang (Co-PI, Linguistics, SFU), Henny Yeung (Co-PI, Linguistics, SFU), and members of the SFU Language and Brain Lab and Language and Development Lab.

Facial gestures carry important linguistic information and improve speech perception. Research, including our own (Garg, Hamarneh, Jongman, Sereno, and Wang, 2019; Tang, Hannah, Jongman, Sereno, Hamarneh, and Wang, 2015), indicates that movements of the mouth help convey segmental information, while eyebrow and head movements help convey prosodic and syllabic information. Perception studies using eye-tracking techniques have also shown that familiarity with a language influences looking time at different facial areas (Barenholtz, Mavica, and Lewkowicz, 2016; Lewkowicz and Hansen-Tift, 2012). However, it is not clear to what extent attention to different facial areas (e.g., mouth vs. eyebrows) differs for prosodic versus segmental information, or as a function of language familiarity. Using eye-tracking, the present study investigates three questions. First, we focus on differences in eye-gaze patterns to see how different prosodic structures are processed in a familiar versus a non-familiar language. Second, we focus on monolingual processing of segmental and prosodic information. Third, we compare the results for segmental and prosodic differences in familiar versus non-familiar languages. The results of this research have significant implications for improving strategies for language learning and early intervention.

CONSUMING TEXT AND AUDIO FAKE NEWS IN A FIRST AND SECOND LANGUAGE

Funding: SFU FASS Kickstarter Grant

Research Team: Henny Yeung (PI), Maite Taboada (Co-PI), Yue Wang (Collaborator), Linguistics, SFU.

Interest in digital media—particularly in disinformation, or “fake news”—has surged. Almost all work on this topic, however, has looked only at native speakers’ consumption of English-language media. We ask here how fake news is consumed in one’s first language (L1) versus a second language (L2), since decision-making, moral judgments, and lie detection are all influenced by whether one uses a first or second language. Only a few prior studies have asked how we consume fake news in an L2, and their results are mixed, limited to a single dimension of media evaluation (believability), and confined to text rather than audio speech. Objective 1 of this study is to ask how textual and acoustic signatures of truthfulness differ in written and audio news excerpts in English, French, and Mandarin. Objective 2 asks how L1-English vs. L1-French and L1-Mandarin speakers may show distinct tendencies to believe in, change attitudes about, and engage with (true or fake) text and audio news clips in English. The results have profound implications for Canada, where L2 consumers of English-language media are numerous.

Recent Projects

AUTOMATED LIP-READING: EXTRACTING SPEECH FROM VIDEO OF A TALKING FACE

Funding: Next Big Question Fund, SFU's Big Data Initiative

Research Team: Yue Wang (PI, SFU Linguistics); Ghassan Hamarneh (SFU Computing Science); Paul Tupper (SFU Mathematics); Dawn Behne (Psychology, Norwegian University of Science and Technology, Norway); Allard Jongman and Joan Sereno (Linguistics, University of Kansas, USA).

In face-to-face conversation, voice and coordinated facial movements are used simultaneously to perceive speech. In noisy environments, seeing a speaker’s facial movements makes speech perception easier. Similarly, with multimedia, we rely on visual cues when the audio is not transmitted well (e.g., during video conferencing) or in noisy backgrounds. In the current era of social media, we increasingly encounter multimedia-induced challenges where the audio signal in a video is of poor quality or misaligned (e.g., via Skype). The next big question for speech scientists, and one relevant to all multimedia users, is what speech information can be extracted from a face and whether the corresponding audio signal can be recreated from it to enhance speech intelligibility. This project tackles the issue by integrating machine-learning and linguistic approaches to develop an automatic face-reading system that identifies and extracts attributes of visual speech to reconstruct the acoustic information of a speaker’s voice.

COMMUNICATING PITCH IN CLEAR SPEECH

Funding: Natural Sciences and Engineering Research Council of Canada (NSERC)

Research Team: Yue Wang (PI, SFU Linguistics); Allard Jongman, Joan Sereno, and Rustle Zeng (Linguistics, University of Kansas, USA); Paul Tupper (SFU Mathematics); Ghassan Hamarneh (SFU Computing Science); Keith Leung (SFU Linguistics); Saurabh Garg (SFU Language and Brain Lab, and Pacific Parkinson’s Research Centre, UBC).

This research investigates the role of clear speech in communicating pitch-related information: lexical tone. The objectives are to identify how speakers modify their tone production while still maintaining tone category distinctions, and how perceivers utilize tonal enhancement and categorical cues from different forms of input. These questions are addressed in a series of inter-related studies examining articulation, acoustics, intelligibility, and neuro-processing of clear-speech tones.

MULTI-LINGUAL AND MULTI-MODAL SPEECH PERCEPTION, PROCESSING, AND LEARNING

Funding: Social Sciences and Humanities Research Council of Canada (SSHRC) (2012-2017)

Research Team: Yue Wang (PI, Linguistics, SFU), Joan Sereno (Linguistics, University of Kansas), Allard Jongman (Linguistics, University of Kansas), Ghassan Hamarneh (Computer Science, SFU), and members of the SFU Language and Brain Lab.

The temporal alignment of what we hear and see is fundamental to the cognitive organization of information from our environment. Research indicates that a perceiver’s experience influences sensitivity to audio-visual (AV) synchrony. We theorize that experience that enhances sensitivity to speech sound distinctions in the temporal domain should also enhance sensitivity in AV synchrony perception. On this basis, a perceiver whose native language (L1) involves duration-based phonemic distinctions would be expected to be more sensitive to AV synchrony in speech than a perceiver whose L1 makes less use of temporal cues. In the current study, simultaneity judgment data from participants differing in L1 experience with phonemic duration (e.g., English, Norwegian, Estonian) were collected using speech tokens with different degrees of AV alignment: from the audio preceding the video (audio-lead), to the audio and video being physically aligned (synchronous), to the video preceding the audio (video-lead). The findings of this research contribute to understanding the relationship between experience and AV synchrony perception.