Use Cases in the Language Domain

Scientific Tutoring

Teaching is naturally multimodal, involving speech, text, video, screencasts, stylus inputs for formulas and diagrams, and background knowledge from textbooks. To function as personal tutors, MMFM must adapt to individual learning styles and preferences. Capabilities include:

Personalized learning support through simplified, summarized multimodal study materials in the student’s native language

Content rearrangement to organize lesson materials into structured notes matching user preferences

Lesson commentary using personalized speech with synced highlights in videos and adaptive explanation difficulty

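The personalization workflow above can be sketched as a typed request schema. The class names, fields, and prompt wording below are illustrative assumptions, not an actual MMFM API.

```python
from dataclasses import dataclass, field

@dataclass
class ContentPart:
    """One modality-tagged piece of lesson material (hypothetical schema)."""
    modality: str  # e.g. "text", "speech", "video", "stylus"
    data: str      # payload or a reference such as a file URI

@dataclass
class TutoringRequest:
    """Bundles the learner profile with multimodal study materials."""
    native_language: str
    difficulty: str  # e.g. "beginner", "intermediate", "advanced"
    parts: list[ContentPart] = field(default_factory=list)

    def summary_prompt(self) -> str:
        """Build a simplification/summarization instruction for the model."""
        modalities = ", ".join(p.modality for p in self.parts)
        return (f"Summarize the attached {modalities} materials in "
                f"{self.native_language} at {self.difficulty} level.")

request = TutoringRequest(
    native_language="German",
    difficulty="beginner",
    parts=[ContentPart("video", "lecture_03.mp4"),
           ContentPart("stylus", "formula_strokes.json")],
)
print(request.summary_prompt())
```

Keeping the learner profile separate from the materials makes it easy to reuse the same lesson content across students with different languages and difficulty settings.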

Multimodal and Multilingual Accessibility in Social Media

MMFM enhance accessibility by translating and interpreting audio, video, and text to support users with visual, hearing, or cognitive impairments. Key capabilities include:


Audiovisual description-augmented captioning and subtitling for deaf and hard-of-hearing users.

Sign language dialogue systems supporting both comprehension and generation of sign language content, attentive to prosody and meaning.

Speech-to-text simplification and summarization for various comprehension levels, including children, language learners, and cognitively impaired individuals.

These features help ensure compliance with the European Accessibility Act (EAA).
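As a minimal illustration of the captioning step, the sketch below greedily segments a transcript into subtitle-length lines. The 37-character default follows common subtitling guidelines; the function name and greedy strategy are our own simplifications, not a production captioning pipeline.

```python
def segment_captions(words: list[str], max_chars: int = 37) -> list[str]:
    """Greedily pack transcript words into subtitle lines of bounded length."""
    lines: list[str] = []
    current = ""
    for word in words:
        # Start a new line when adding the word (plus a space) would overflow.
        if current and len(current) + 1 + len(word) > max_chars:
            lines.append(current)
            current = word
        else:
            current = f"{current} {word}" if current else word
    if current:
        lines.append(current)
    return lines

transcript = "the quick brown fox jumps over the lazy dog"
print(segment_captions(transcript.split(), max_chars=15))
# → ['the quick brown', 'fox jumps over', 'the lazy dog']
```

A real system would also align each line with audio timestamps and break at clause boundaries rather than purely on length.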

Multimodal Joint Editing

Hybrid-modality documents, such as forms with handwriting or annotated reports, present challenges in editing. MMFM simplify this process through intuitive, natural interactions. Capabilities include:

1. Multimodal command recognition to extract content and commands from handwritten input, such as strike-through for deletion or underlining for emphasis.

2. Instruction-following editing through speech and gesture commands like “replace all instances of X with Y”.

3. Ambiguous entity resolution to correctly interpret vague references in user input.
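A toy version of the second capability: parsing a transcribed spoken command and applying it to a document. A real system would use the model itself for intent parsing; the regex below is a deliberately simple stand-in with an invented command grammar.

```python
import re

# Matches commands of the form "replace all instances of X with Y".
# The lazy ".+?" keeps the "with" split unambiguous for simple targets.
REPLACE_CMD = re.compile(
    r"replace all instances of (?P<old>.+?) with (?P<new>.+)",
    re.IGNORECASE,
)

def apply_spoken_command(document: str, command: str) -> str:
    """Apply a recognized edit command to the document text."""
    match = REPLACE_CMD.fullmatch(command.strip())
    if not match:
        raise ValueError(f"unrecognised command: {command!r}")
    return document.replace(match.group("old"), match.group("new"))

doc = "foo bar foo"
print(apply_spoken_command(doc, "Replace all instances of foo with baz"))
# → baz bar baz
```

Raising on unrecognized commands, rather than guessing, leaves room for the ambiguous-entity-resolution step to ask the user a clarifying question.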

Real-Time, In-the-Wild Multilingual Speech and Video Translation

Accurate, real-time translation of spontaneous speech and video supports global interaction and knowledge sharing. MMFM capabilities include:


Simultaneous speech translation across 20 languages with minimal latency.

Visual speech recognition enhancement using lip movement data in noisy or unclear audio conditions.

Speaker diarization to identify “who spoke when” in multilingual conversations, even with language switching.
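To make the diarization output concrete, here is a minimal sketch that renders speaker- and language-tagged segments as a “who spoke when” timeline. The segment schema and field names are assumptions for illustration, not a fixed interchange format.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """A diarized span of speech, tagged with speaker and detected language."""
    start: float   # seconds from conversation start
    end: float
    speaker: str
    language: str  # ISO 639-1 code, e.g. "en"

def timeline(segments: list[Segment]) -> list[str]:
    """Render segments in time order, exposing per-speaker language switches."""
    return [f"{s.start:6.1f}-{s.end:6.1f}s  {s.speaker} [{s.language}]"
            for s in sorted(segments, key=lambda s: s.start)]

segs = [
    Segment(0.0, 4.2, "spk1", "en"),
    Segment(4.2, 7.9, "spk2", "de"),
    Segment(7.9, 9.5, "spk1", "de"),  # spk1 switches language mid-conversation
]
for line in timeline(segs):
    print(line)
```

Sorting by start time keeps the view readable even when segments arrive out of order from a streaming recognizer, and the per-segment language tag is what surfaces code-switching by a single speaker.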