April 10, 2026 · 5 min read
Multimodal Emotion Recognition: What Actually Worked (and What Didn’t)
I built a multimodal emotion recognition system using audio and text, expecting combined models to outperform single modalities. In practice, simpler models and selective feature extraction performed more reliably than complex integrations.
Context
The goal was to build a system that can detect human emotions using both voice and text. The assumption was simple: combining modalities should improve accuracy, since tone and semantics complement each other.
The system was designed around a merged multimodal dataset, where each sample includes both audio and its corresponding text.
Approach
For audio processing, I extracted features using openSMILE (eGeMAPS) and evaluated CNN and LSTM models.
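The core idea behind eGeMAPS-style features is that a variable-length clip is collapsed into a fixed-length vector of summary statistics ("functionals") over low-level descriptors such as pitch and loudness. A minimal numpy sketch of that idea, using a synthetic pitch contour as a stand-in for a real descriptor track (the `functionals` helper and its statistics are illustrative, not the actual eGeMAPS set):

```python
import numpy as np

def functionals(lld: np.ndarray) -> np.ndarray:
    """Collapse a variable-length low-level-descriptor track into
    fixed-length summary statistics, as eGeMAPS-style functionals do."""
    return np.array([
        lld.mean(),
        lld.std(),
        np.percentile(lld, 20),
        np.percentile(lld, 80),
    ])

# Synthetic stand-in for a pitch contour (Hz) extracted from one clip.
rng = np.random.default_rng(0)
pitch = 120 + 30 * np.sin(np.linspace(0, 6, 500)) + rng.normal(0, 5, 500)

features = functionals(pitch)
print(features.shape)  # (4,) -- one fixed-length vector per clip
```

The payoff is that every clip, whatever its duration, maps to the same feature dimensionality, which is what lets a downstream classifier train on batches of clips.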
For text, I used TF-IDF vectorization and tested multiple approaches including SVM, VADER, and RoBERTa-based models.
The final system combined audio and text features into a single pipeline and trained a unified model.
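Early fusion of this kind can be sketched as a simple per-sample concatenation; the dimensions below are illustrative (88 matches the eGeMAPS functional count, the TF-IDF width depends on the vocabulary), and the random arrays stand in for real extracted features:

```python
import numpy as np

# Illustrative shapes: 88 eGeMAPS functionals per clip, 300 TF-IDF dims.
n_samples = 4
audio_feats = np.random.default_rng(1).normal(size=(n_samples, 88))
text_feats = np.random.default_rng(2).normal(size=(n_samples, 300))

# Early fusion: one row per sample, with both modalities side by side.
fused = np.hstack([audio_feats, text_feats])
print(fused.shape)  # (4, 388)
```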
What Actually Worked
1. CNN > LSTM for audio
Although LSTMs are theoretically better suited to sequential data, the CNN performed better in practice:
- Faster training
- Better convergence
- More stable results on spectrogram-based features
This was mainly because the CNN picked up local time-frequency patterns in the spectrogram-like representations efficiently.
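The reason a CNN suits spectrograms is that a small 2D kernel responds to a local time-frequency pattern wherever it appears. A toy numpy sketch of the core operation (a naive "valid" cross-correlation, the building block of a convolutional layer; the toy spectrogram and kernel are purely illustrative):

```python
import numpy as np

def conv2d_valid(spec: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive 'valid' 2D cross-correlation, the core op of a CNN layer."""
    kh, kw = kernel.shape
    h, w = spec.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(spec[i:i+kh, j:j+kw] * kernel)
    return out

# Toy "spectrogram" with an energy burst at frequency bin 3, time frame 5.
spec = np.zeros((8, 12))
spec[3:5, 5:7] = 1.0

# A 2x2 kernel fires most strongly exactly where the burst sits.
kernel = np.ones((2, 2))
response = conv2d_valid(spec, kernel)
print(np.unravel_index(response.argmax(), response.shape))  # (3, 5)
```

Because the same kernel slides over the whole input, the model needs far fewer parameters than a recurrent network unrolled over every frame, which is consistent with the faster, more stable training observed here.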
2. Simpler text models were more reliable
Transformer-based approaches (RoBERTa) and rule-based methods (VADER) each performed well on their own, but combining them reduced overall accuracy.
A simple SVC model with TF-IDF:
- Produced more stable results
- Required less tuning
- Generalized better under noisy data
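The TF-IDF + SVC pipeline is short enough to sketch in full with scikit-learn; the toy corpus and labels below are illustrative stand-ins for the real emotion-labeled dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative corpus; the real model trained on emotion labels.
texts = [
    "I am so happy and excited today",
    "this is wonderful great news",
    "I feel terrible and sad",
    "this is awful I am miserable",
]
labels = ["positive", "positive", "negative", "negative"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)

print(clf.predict(["what wonderful happy news"]))
```

With only two components and a handful of hyperparameters, there is very little to tune, which is a large part of why this pipeline stayed stable where heavier ensembles did not.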
3. Feature extraction mattered more than model complexity
Features extracted with Librosa initially produced strong training results but generalized poorly to the test set.
Switching to openSMILE improved:
- Generalization
- Consistency between train/test
This showed that better features > more complex models.
What Didn’t Work
Multimodal ≠ better by default
Combining audio and text did not automatically improve performance.
In some cases, fusing the modalities:
- Added noise
- Reduced accuracy
The issue was not the idea, but the integration:
- Feature misalignment
- Different signal qualities
- Model incompatibility
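One concrete instance of the scale-mismatch problem: audio functionals live on very different numeric ranges (Hz, dB) than TF-IDF weights, so naive concatenation lets one modality dominate. A hedged sketch of the fix, standardizing each modality separately before fusion (random arrays stand in for real features):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Illustrative scales: audio functionals around 200 +/- 50,
# TF-IDF weights confined to [0, 1].
audio = rng.normal(loc=200.0, scale=50.0, size=(16, 8))
text = rng.uniform(0.0, 1.0, size=(16, 20))

# Naive fusion: the large-scale modality dominates any
# distance- or margin-based model.
naive = np.hstack([audio, text])

# Per-modality standardization puts the features on comparable footing.
fused = np.hstack([StandardScaler().fit_transform(audio),
                   StandardScaler().fit_transform(text)])

print(naive.std(axis=0).max() / naive.std(axis=0).min())  # large ratio
print(fused.std(axis=0).max())                            # ~1.0
```

Scaling alone does not solve semantic misalignment between modalities, but skipping it is one of the quieter ways a fused model ends up worse than its parts.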
Key Takeaways
- More complex models don’t guarantee better results
- Multimodal systems require careful alignment, not just data merging
- Feature quality has a bigger impact than model choice
- Simpler pipelines are often more robust under real constraints
Final Thought
This project shifted my approach to ML systems.
Instead of chasing more data or more complex architectures, I now focus on:
- Controlled experiments
- Clear tradeoffs
- Systems that remain stable under imperfect conditions
Notes
This project was developed as part of a team effort.
Collaborator:
- Güneşsu Açık
Advisor:
- Prof. Dr. Banu DİRİ
This article focuses on my approach to system design and the decisions made during development.