

Multimodal classification of the focus of attention


In the German SmartWeb project, the user interacted with the web via a smartphone to get information on, for example, points of interest. To avoid tedious mechanisms such as push-to-talk while still being able to tell whether the user is addressing the system, talking to themselves, or talking to a third person, we developed a module that monitors speech and video in parallel. Our database was recorded in a real-life setting, indoors as well as outdoors, under unfavourable acoustic and lighting conditions. With acoustic features, we classify up to four different types of addressing (talking to the system: On-Talk; reading from the display: Read Off-Talk; paraphrasing information presented on the screen: Paraphrasing Off-Talk; talking to a third person or to oneself: Spontaneous Off-Talk). With the camera of the smartphone, we record the user's face and decide whether the user is looking at the phone or somewhere else. We use three different types of turn features based on classification scores of frame-based face detection and word-based analysis: 13 acoustic-prosodic features, 18 linguistic features, and 9 video features. The classification rate for acoustics only is up to 62% for the four-class problem, and up to 77% for the most important two-class problem, "user is focussing on interaction with the system or not". For video only, it is 45% and 71%, respectively. By combining the two modalities and additionally using linguistic information, classification performance for the two-class problem rises to 85%.

Investigated problem

Classifying whether or not the user of a smartphone is addressing the system by speech input.
The image shows the user from the perspective of the smartphone camera:

Classification of On-View/Off-View, ROT (read off-talk), POT (paraphrasing off-talk), SOT (spontaneous off-talk), and NOT (no off-talk = on-talk):


For automatic classification of the focus of attention, we used Haar wavelets with AdaBoost for On-View/Off-View detection and prosodic features for On-Talk/ROT/POT/SOT classification. Linguistic features (POS = part of speech, e.g. whether a content word follows) are used in the third classification task. Fusion of the modalities was based on meta features that describe each sub-system in a low-dimensional feature space.
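The fusion idea can be illustrated with a minimal sketch. It assumes that each sub-system (acoustic, linguistic, video) contributes a single score per turn and that these three scores form the meta-feature vector. The page does not state which classifier was used for fusion, so the logistic regression here is a stand-in, and the function names and the toy scores are hypothetical and made up purely for illustration.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real value to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_fusion(meta_features, labels, epochs=500, lr=0.5):
    """Fit logistic-regression weights on meta-feature vectors
    with plain batch gradient descent (assumed fusion classifier)."""
    n_dims = len(meta_features[0])
    n = len(labels)
    weights = [0.0] * n_dims
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_dims
        grad_b = 0.0
        for x, y in zip(meta_features, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log loss w.r.t. the logit
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def on_focus_probability(x, weights, bias):
    """Probability that a turn is On-Focus given its meta features."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Made-up meta features per turn (NOT project data):
# (acoustic On-Talk score, linguistic On-Talk score, video On-View score)
TOY_TURNS = [
    (0.9, 0.8, 0.9), (0.8, 0.7, 0.6), (0.7, 0.9, 0.8), (0.6, 0.6, 0.9),
    (0.2, 0.3, 0.1), (0.3, 0.1, 0.4), (0.1, 0.2, 0.2), (0.4, 0.3, 0.1),
]
TOY_LABELS = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = On-Focus, 0 = Off-Focus

weights, bias = train_fusion(TOY_TURNS, TOY_LABELS)
```

Calling `on_focus_probability((0.9, 0.9, 0.9), weights, bias)` then yields a value above 0.5 for a turn on which all three sub-systems agree on system-directed behaviour. The appeal of fusing at this level is that the combining classifier only sees a handful of dimensions, regardless of how many raw features each sub-system uses internally.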


The demonstrator allows step-by-step speech recording, recognition, and classification (left). On the right, the word-based recognition results are shown together with their On-Talk scores. In the middle, the sentence-based scores for On-Talk/Off-Talk, On-View/Off-View, and, after fusion, On-Focus/Off-Focus are shown.
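How word-based or frame-based scores turn into one sentence-based score is not spelled out here. The following sketch assumes the simplest possible rule, averaging the per-unit scores and thresholding at 0.5; the helper names, the threshold, and the example values are hypothetical and only illustrate the idea.

```python
from statistics import mean


def sentence_score(unit_scores):
    """Average per-word (or per-frame) scores into one sentence-level score.
    Averaging is an assumption, not the documented aggregation rule."""
    if not unit_scores:
        raise ValueError("need at least one score")
    return mean(unit_scores)


def decide(score, positive, negative, threshold=0.5):
    """Map a sentence-level score to one of two labels."""
    return positive if score >= threshold else negative


# Made-up example values (NOT project data)
word_on_talk_scores = [0.9, 0.8, 0.4, 0.7]   # one score per recognised word
frame_on_view_scores = [1.0, 1.0, 0.0, 1.0, 0.0]  # one score per video frame

talk_label = decide(sentence_score(word_on_talk_scores), "On-Talk", "Off-Talk")
view_label = decide(sentence_score(frame_on_view_scores), "On-View", "Off-View")
```

With these example values, both sentence-level scores lie above the threshold, so the sentence is labelled On-Talk and On-View.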

