Hearing What Isn’t Said: Emotion Signals from Audio in Psychiatric Documentation

Much of what matters in a clinical conversation is heard, not written.

Today, Careifai Engine v1.0 transcribes audio and uses the text as part of the assessment material. But a consultation is more than words: tempo, pauses, and affect shape how psychologists and psychiatrists understand the patient. That signal is largely lost in text.

We’re now starting the work to extract the unsaid—building an audio-only pipeline that detects emotion-related information in clinical conversations. The transcript stays, but mainly as a carrier for timestamps and display. Inference happens in the audio domain.

What we’re building (MVP)

  • Signals: neutral tone, arousal/excitement, irritation/anger (incl. raised voice), and pauses/hesitation (silence duration).

  • Output: the transcript is enriched with time-aligned tags (e.g., [NEUTRAL], [ANGER], [PAUSE≥2s]) per utterance/segment, with timestamps and optional confidence (see the sketch after this list).

  • Integration: tags appear directly in Summarizer and act as functional features in downstream assessment—not just metadata.
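To make the output concrete, here is a minimal sketch of how a time-aligned tag could be represented and rendered next to a transcript line. The class and field names are illustrative, not our production schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EmotionTag:
    label: str                          # e.g. "NEUTRAL", "ANGER", "PAUSE"
    start_s: float                      # segment start, seconds into the recording
    end_s: float                        # segment end
    confidence: Optional[float] = None  # optional model confidence

def render_tag(tag: EmotionTag) -> str:
    """Render a tag the way it could appear next to a transcript line."""
    if tag.label == "PAUSE":
        return f"[PAUSE≥{int(tag.end_s - tag.start_s)}s]"
    conf = f" {tag.confidence:.2f}" if tag.confidence is not None else ""
    return f"[{tag.label}{conf} @ {tag.start_s:.1f}–{tag.end_s:.1f}s]"

print(render_tag(EmotionTag("ANGER", 192.0, 198.5, confidence=0.82)))  # [ANGER 0.82 @ 192.0–198.5s]
print(render_tag(EmotionTag("PAUSE", 201.0, 203.4)))                   # [PAUSE≥2s]
```

Because every tag carries its own timestamps and confidence, Summarizer can show it inline with the transcript and downstream assessment logic can consume the same structure directly.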

Technical direction (for those who build)

  • Early fusion of audio: We train the model on pitch, energy/volume, timbre (spectral shape), tempo, and pauses on the same timeline (not in isolation). This lets it capture patterns like “rising pitch + faster tempo + shorter pauses” that often carry clinical meaning (see the feature-fusion sketch after this list).

  • Shared segment embeddings: The goal is stable representations so identical acoustic phenomena are recognised across speakers, microphones, and recording conditions.

  • Pre-train → fine-tune: First learn general audio patterns on loosely labelled speech emotion recognition (SER) tasks; then fine-tune for our concrete goals: segment classification (which emotion, where) and temporal localisation (when it occurs).

  • Temporal alignment: The model reasons purely in audio, but every emotion tag gets an exact timestamp and is projected onto the right line in the transcript for readability, search, and navigation (a projection sketch follows below).
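To illustrate the early-fusion idea, here is a minimal sketch that extracts pitch, energy, a timbre proxy, a tempo-related proxy, and a pause indicator on one shared frame timeline and stacks them into a single matrix. It assumes the open-source librosa and numpy packages; the parameter values and the simple energy threshold for pauses are illustrative assumptions, not our production settings.

```python
import numpy as np
import librosa

def fused_features(path: str, hop_length: int = 512) -> np.ndarray:
    """Stack frame-level prosodic cues on one shared timeline (early fusion)."""
    y, sr = librosa.load(path, sr=16_000, mono=True)

    # Pitch contour (fundamental frequency); NaN where unvoiced, zeroed here.
    f0, _, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr, hop_length=hop_length)
    f0 = np.nan_to_num(f0, nan=0.0)

    # Energy/volume per frame.
    rms = librosa.feature.rms(y=y, hop_length=hop_length)[0]

    # Timbre proxy: spectral centroid (brightness of the spectrum).
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop_length)[0]

    # Tempo-related proxy: onset strength (how "eventful" each frame is).
    onset = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length)

    # Pause indicator: frames whose energy falls below a simple threshold.
    pause = (rms < 0.01).astype(np.float32)

    # Align lengths and stack: shape (n_frames, 5), one row per ~32 ms frame.
    n = min(map(len, (f0, rms, centroid, onset, pause)))
    return np.stack([f0[:n], rms[:n], centroid[:n], onset[:n], pause[:n]], axis=1)
```

The point of the stack is that the encoder sees these cues together per frame, so co-occurring patterns (rising pitch with shrinking pauses, for example) are learnable rather than lost in separate feature streams.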
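And for the temporal-alignment step above, a minimal sketch of the projection logic: an emotion tag detected purely in the audio domain is attached to the transcript utterance it overlaps most in time. Utterance boundaries are assumed to come from the existing ASR output; the names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Utterance:
    start_s: float   # utterance start, from ASR timestamps
    end_s: float     # utterance end
    text: str

def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the time interval shared by [a_start, a_end] and [b_start, b_end]."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def project_tag(tag_start: float, tag_end: float, utterances: list[Utterance]) -> Optional[Utterance]:
    """Return the utterance that shares the most time with an audio-domain tag, if any."""
    best = max(utterances, key=lambda u: overlap(tag_start, tag_end, u.start_s, u.end_s), default=None)
    if best is None or overlap(tag_start, tag_end, best.start_s, best.end_s) == 0.0:
        return None
    return best
```

A tag that overlaps no utterance at all (a long pause, for instance) could instead be rendered on its own line with its timestamps, so the transcript view stays complete.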

Why this matters clinically

  • Some risk and state markers are heard rather than spelled out.

  • Sustained neutrality/monotony can be as informative as overt affect.

  • Pause lengths and shifts in arousal add context that standard ASR drops.

The result: richer source material, faster structuring, and more time with the patient.

Plan & principles

  • Start simple with a few robust signals → iterate with real use cases.

  • Audio-only inference to avoid “guessing emotions” from text.

  • Evaluation: clinical utility first (time saved, clarity in assessments); complemented by precision/recall, temporal accuracy, robustness (speakers/mics/noise), and inter-annotator agreement on a small gold set (a scoring sketch follows this list).

  • Privacy: only consented, de-identified data; designed for Swedish care settings but scalable to other languages.
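As a concrete example of the technical half of that evaluation, here is a minimal scoring sketch for one label: segment-level precision/recall against a small gold set, plus mean onset error as a simple temporal-accuracy proxy. The matching rule and the one-second tolerance are illustrative assumptions.

```python
def evaluate(pred, gold, label: str, tol_s: float = 1.0):
    """pred/gold: lists of (label, start_s, end_s) tuples for one recording."""
    pred = [p for p in pred if p[0] == label]
    gold = [g for g in gold if g[0] == label]
    matched, onset_errors, tp = set(), [], 0
    for _, p_start, _p_end in pred:
        # A prediction counts as a hit if its onset lies within tol_s of an unmatched gold onset.
        for i, (_, g_start, _g_end) in enumerate(gold):
            if i not in matched and abs(p_start - g_start) <= tol_s:
                matched.add(i)
                onset_errors.append(abs(p_start - g_start))
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    mean_onset_error = sum(onset_errors) / len(onset_errors) if onset_errors else None
    return precision, recall, mean_onset_error
```

Clinical utility still comes first; this kind of score mainly tells us whether the model is stable enough across speakers, microphones, and noise conditions to be worth a clinician's attention.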

Collaborate with us

If you have experience in SER in Swedish, robust voice activity detection (VAD) and segmentation, or self-supervised audio encoders for health data, we’d love to compare notes on feature design, cross-mic robustness, and metrics that actually correlate with clinical value.

Careifai builds tools that reduce admin and free up clinical time. With the next step in our audio engine, we aim to make the unsaid visible, reliably and in a form that is useful in everyday care.
