April 10, 2026 · ai · 5 min read

Multimodal Emotion Recognition: What Actually Worked (and What Didn’t)

I built a multimodal emotion recognition system using audio and text, expecting combined models to outperform single modalities. In practice, simpler models and selective feature extraction performed more reliably than complex integrations.

Context

The goal was to build a system that can detect human emotions using both voice and text. The assumption was simple: combining modalities should improve accuracy, since tone and semantics complement each other.

The system was designed around a merged multimodal dataset, where each sample includes both audio and its corresponding text.


Approach

For audio processing, I extracted features using openSMILE (eGeMAPS) and evaluated CNN and LSTM models.
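
eGeMAPS features are functionals — summary statistics computed over frame-level low-level descriptors such as pitch and energy. As an illustration of the idea (not the actual openSMILE pipeline, and with a toy descriptor count), a minimal numpy sketch:

```python
import numpy as np

def functionals(lld_frames: np.ndarray) -> np.ndarray:
    """Summarize frame-level descriptors (frames x features) into one
    fixed-length vector per clip, eGeMAPS-style: mean, std, and
    20th/80th percentiles of each descriptor track."""
    return np.concatenate([
        lld_frames.mean(axis=0),
        lld_frames.std(axis=0),
        np.percentile(lld_frames, 20, axis=0),
        np.percentile(lld_frames, 80, axis=0),
    ])

# e.g. 100 frames of 5 hypothetical descriptors (pitch, energy, ...)
rng = np.random.default_rng(0)
clip = rng.normal(size=(100, 5))
vec = functionals(clip)
print(vec.shape)  # (20,)
```

The key property is that every clip, regardless of duration, maps to the same fixed-length vector, which is what makes these features easy to feed into a classifier.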

For text, I used TF-IDF vectorization and tested multiple approaches including SVM, VADER, and RoBERTa-based models.

The final system combined audio and text features into a single pipeline and trained a unified model.
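
Concretely, "combined into a single pipeline" here means early fusion: concatenating each sample's audio feature vector with its text feature vector before training one model. A minimal sketch (the dimensions are illustrative; 88 matches the eGeMAPS functional count, the TF-IDF size is a toy value):

```python
import numpy as np

rng = np.random.default_rng(0)
audio_feats = rng.random((8, 88))   # 8 samples x 88 eGeMAPS functionals
text_feats = rng.random((8, 300))   # 8 samples x 300 TF-IDF dims (toy size)

# Early fusion: one joint feature matrix for a single unified model
fused = np.hstack([audio_feats, text_feats])
print(fused.shape)  # (8, 388)
```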


What Actually Worked

1. CNN > LSTM for audio

Although LSTMs are often assumed to be the better fit for sequential data, the CNN performed better in practice.

  • Faster training
  • Better convergence
  • More stable results on spectrogram-based features

This was mainly due to how efficiently the CNN captured local time-frequency patterns in the spectrogram representations.
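
One way to see why convolution suits spectrogram input: the same small filter is slid over every time-frequency patch, so a local pattern (e.g. the edge of an energy burst) is detected wherever it occurs. A toy valid-mode 2D convolution in numpy, not the trained model itself:

```python
import numpy as np

def conv2d_valid(spec: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide a small kernel over a spectrogram (freq x time),
    producing a map of local-pattern responses."""
    kh, kw = kernel.shape
    H, W = spec.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(spec[i:i+kh, j:j+kw] * kernel)
    return out

spec = np.zeros((6, 8))
spec[2:4, 3:5] = 1.0              # a localized energy burst
edge = np.array([[1.0, -1.0]])    # detects change along the time axis
resp = conv2d_valid(spec, edge)
print(resp.shape)  # (6, 7)
```

The response map peaks exactly where the burst starts and ends, regardless of where in the clip the burst sits — the translation invariance that made the CNN stable on this data.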


2. Simpler text models were more reliable

Transformer-based approaches and rule-based methods performed well individually, but combining them reduced overall accuracy.

A simple SVC model trained on TF-IDF features:

  • Produced more stable results
  • Required less tuning
  • Generalized better under noisy data
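
A minimal version of that text pipeline, assuming scikit-learn and using toy sentences and labels purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "i am so happy today",
    "this is wonderful news",
    "i feel terrible",
    "everything went wrong",
]
labels = ["positive", "positive", "negative", "negative"]

# TF-IDF features + linear SVC: few hyperparameters, fast to retrain,
# and easy to inspect when results look off
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["such wonderful happy news"]))
```

The whole pipeline is two components with sensible defaults, which is a large part of why it was easier to keep stable than the transformer-based alternatives.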

3. Feature extraction mattered more than model complexity

Features extracted with Librosa initially gave strong training results but generalized poorly to test data.

Switching to openSMILE improved:

  • Generalization
  • Consistency between train/test

This showed that better features > more complex models.


What Didn’t Work

Multimodal ≠ better by default

Combining audio and text did not automatically improve performance.

In some cases, fusion:

  • Added noise
  • Reduced accuracy

The issue was not the idea, but the integration:

  • Feature misalignment
  • Different signal qualities
  • Model incompatibility
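
One concrete mitigation for the "different signal qualities" problem is to standardize each modality separately before concatenation, so that neither feature set dominates purely by scale. A sketch with made-up scales (not the project's exact preprocessing):

```python
import numpy as np

def zscore(X: np.ndarray) -> np.ndarray:
    """Standardize each feature column to zero mean, unit variance."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

rng = np.random.default_rng(1)
audio = rng.normal(loc=50.0, scale=10.0, size=(16, 4))  # large-scale features
text = rng.normal(loc=0.0, scale=0.01, size=(16, 6))    # tiny-scale features

# Scale each modality on its own, then fuse
fused = np.hstack([zscore(audio), zscore(text)])
print(fused.shape)  # (16, 10)
```

Without this step, a distance- or margin-based model would effectively see only the audio features; after per-modality scaling, both contribute on comparable terms.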

Key Takeaways

  • More complex models don’t guarantee better results
  • Multimodal systems require careful alignment, not just data merging
  • Feature quality has a bigger impact than model choice
  • Simpler pipelines are often more robust under real constraints

Final Thought

This project shifted my approach to ML systems.

Instead of chasing more data or more complex architectures, I now focus on:

  • Controlled experiments
  • Clear tradeoffs
  • Systems that remain stable under imperfect conditions

Notes

This project was developed as part of a team effort.

Collaborator:

  • Güneşsu Açık

Advisor:

  • Prof. Dr. Banu DİRİ

This article focuses on my approach to system design and the decisions made during development.
