April 10, 2026 · ai · 10 min read

Building an Image-Based Clothing Recommendation System

An end-to-end image-based recommendation system that combines classification, feature extraction, and similarity ranking to generate meaningful clothing recommendations from user-uploaded images.

Introduction

This project builds an image-based clothing recommendation system that helps users find visually and semantically similar products in large e-commerce catalogs.

At a high level, the system takes an input image and returns similar clothing items. In practice, the problem is not just about similarity: it is about combining multiple components into a coherent system.


Problem

E-commerce platforms contain a massive number of clothing items. As product diversity increases, users struggle to find visually similar or relevant items efficiently.

A naive approach would be:

  • extract features from an image
  • compute similarity
  • return nearest neighbors

This approach fails in practice.

Similarity alone often produces:

  • visually similar but semantically incorrect items
  • matches across different categories
  • irrelevant recommendations due to shared textures or colors

The real problem is not just similarity: it is controlled similarity under constraints, such as keeping matches within the query's category.


System Overview

The system is built as a multi-stage pipeline:

Image → Classification → Feature Extraction → Similarity → Filtering → Recommendations

Each stage solves a different failure mode.


Dataset

The dataset was constructed by scraping product images from e-commerce platforms.

  • ~54,000 images
  • 22 clothing categories
  • images include real-world noise (models, backgrounds, lighting)

Key challenges:

  • inconsistent labeling
  • multiple objects per image
  • background noise affecting feature extraction

Classification Layer (YOLOv8)

To prevent semantically incorrect matches, a classification step was introduced.

YOLOv8 was used to:

  • detect clothing category
  • assign class labels to images

Why classification first?

Without classification:

  • similarity search returns cross-category results
  • ranking mixes items from unrelated categories, so the top results are hard to trust

With classification:

  • search space is constrained
  • results become more relevant
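
As a rough illustration of how a predicted label constrains the search space, the sketch below groups catalog items by category so a query only considers items of its own class. The item ids, labels, and helper names (`build_category_index`, `candidates_for`) are hypothetical and not taken from the project.

```python
from collections import defaultdict

def build_category_index(labels):
    """Group item ids by category label so a query only scans its own class."""
    index = defaultdict(list)
    for item_id, label in labels.items():
        index[label].append(item_id)
    return index

def candidates_for(predicted_label, index):
    """Return only the items that share the predicted category."""
    return list(index.get(predicted_label, []))

# Hypothetical catalog labels, for illustration only.
labels = {
    "img_001": "dress",
    "img_002": "shirt",
    "img_003": "dress",
    "img_004": "jacket",
}
index = build_category_index(labels)

print(candidates_for("dress", index))
```

Whether the category filter runs before or after similarity scoring does not change the final result, as long as the top-N cut is taken after filtering. Filtering first simply avoids scoring items that would be discarded anyway.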

Feature Extraction (ResNet50)

ResNet50 was used to extract high-dimensional feature vectors from images.

Key choices:

  • pretrained weights (ImageNet)
  • transfer learning
  • feature extraction without retraining

The output is a dense vector representing visual features of the image.


Beyond CNN Features: Color Features

CNN features alone were not sufficient.

Observed issue:

  • items with similar structure (shape, cut) are not necessarily perceived as similar clothing, for example when their colors differ

To improve results, two color descriptors were added as additional features:

  • color mean
  • color histogram

Final feature vector:

[ResNet Features + Color Mean + Color Histogram]

This significantly improved recommendation quality.
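
A minimal sketch of how such a combined vector could be assembled, assuming pixels arrive as a list of 8-bit (R, G, B) tuples and the CNN features as a plain list of floats. The bin count, the scaling to [0, 1], and the function names are assumptions made for illustration; the article does not specify them.

```python
def color_mean(pixels):
    """Mean of each RGB channel, scaled to [0, 1]."""
    n = len(pixels)
    return [sum(p[c] for p in pixels) / (255.0 * n) for c in range(3)]

def color_histogram(pixels, bins=4):
    """Per-channel histogram with `bins` buckets; each channel sums to 1."""
    n = len(pixels)
    hist = []
    for c in range(3):
        counts = [0] * bins
        for p in pixels:
            # Map an 8-bit value (0..255) to a bucket index (0..bins-1).
            counts[min(p[c] * bins // 256, bins - 1)] += 1
        hist.extend(count / n for count in counts)
    return hist

def combined_vector(cnn_features, pixels, bins=4):
    """Concatenate [CNN features + color mean + color histogram]."""
    return list(cnn_features) + color_mean(pixels) + color_histogram(pixels, bins)

# Toy inputs: two red and two blue pixels, and a made-up 4-value CNN vector.
pixels = [(255, 0, 0), (255, 0, 0), (0, 0, 255), (0, 0, 255)]
cnn = [0.1, 0.2, 0.3, 0.4]
vec = combined_vector(cnn, pixels)

print(len(vec))  # 4 CNN values + 3 means + 3 * 4 histogram bins
```

A real ResNet50 vector is far longer than the color block, so in practice the relative scale of the two parts matters. That is the alignment issue discussed under Challenges.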


Similarity Metrics

Three similarity measures were tested:

Euclidean Distance

  • unstable in high-dimensional space
  • sensitive to magnitude differences
  • rejected

Cosine Similarity

  • better than Euclidean
  • measures angle between vectors
  • still inconsistent in some cases

Pearson Correlation

  • measures linear relationship
  • more stable across combined features
  • selected as final approach
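
To make the differences between the three measures concrete, here is a small pure-Python sketch on toy vectors. It only illustrates general properties of the metrics (sensitivity to scale and to a constant offset). It is not the project's evaluation, and the article does not state why Pearson was more stable on this dataset.

```python
import math

def euclidean(a, b):
    """Straight-line distance; grows with magnitude differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine of the angle between a and b; ignores overall scale."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pearson(a, b):
    """Cosine similarity after subtracting each vector's own mean.

    Undefined (division by zero) if either vector is constant.
    """
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])

a = [1.0, 2.0, 3.0, 4.0]
scaled = [2.0 * x for x in a]    # same direction, larger magnitude
shifted = [x + 10.0 for x in a]  # same pattern, constant offset

print(euclidean(a, scaled), cosine(a, scaled), pearson(a, scaled))
print(euclidean(a, shifted), cosine(a, shifted), pearson(a, shifted))
```

On these inputs, Euclidean distance reacts to both the scaling and the offset, cosine ignores the scaling but drops below 1 under the offset, and Pearson stays at 1 in both cases.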

Final similarity:

similarity(A, B) = PearsonCorrelation(A, B)
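
For reference, the standard definition of the Pearson correlation between two d-dimensional feature vectors A and B is given below. The symbols d, A_i, and B_i are notation introduced here, not taken from the project.

```latex
r(A, B) =
  \frac{\sum_{i=1}^{d} (A_i - \bar{A})(B_i - \bar{B})}
       {\sqrt{\sum_{i=1}^{d} (A_i - \bar{A})^2}\,
        \sqrt{\sum_{i=1}^{d} (B_i - \bar{B})^2}},
\qquad
\bar{A} = \frac{1}{d}\sum_{i=1}^{d} A_i,
\quad
\bar{B} = \frac{1}{d}\sum_{i=1}^{d} B_i
```

This is the cosine similarity of the two vectors after each has had its own mean subtracted. The value lies in [-1, 1] and is undefined when either vector is constant.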


Recommendation Logic

The recommendation process:

  1. User uploads an image
  2. YOLOv8 predicts category
  3. ResNet50 extracts features
  4. Color features are extracted
  5. Combined feature vector is created
  6. Similarity is computed against dataset
  7. Results are filtered by category
  8. Top-N items are returned
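
The steps above can be sketched as a single function. This is a simplified outline under stated assumptions: `classify` and `extract` are hypothetical stand-ins for the YOLOv8 classifier and the ResNet50-plus-color feature extractor, and the catalog is a small in-memory list rather than the real feature store.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length vectors (neither may be constant)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da)) * math.sqrt(sum(y * y for y in db))
    return num / den

def recommend(image, classify, extract, catalog, top_n=5):
    """Return ids of the top_n most similar catalog items in the predicted category."""
    category = classify(image)                     # step 2
    query = extract(image)                         # steps 3-5
    scored = [(pearson(query, item["vector"]), item) for item in catalog]  # step 6
    kept = [(s, item) for s, item in scored if item["category"] == category]  # step 7
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [item["id"] for _, item in kept[:top_n]]  # step 8

# Hypothetical stand-ins: the "image" already carries its label and vector.
def classify(image):
    return image["label"]

def extract(image):
    return image["vector"]

catalog = [
    {"id": "a", "category": "dress", "vector": [1.0, 2.0, 3.0, 4.0]},
    {"id": "b", "category": "dress", "vector": [4.0, 3.0, 2.0, 1.0]},
    {"id": "c", "category": "shirt", "vector": [1.0, 2.0, 3.0, 4.1]},
    {"id": "d", "category": "dress", "vector": [1.0, 2.0, 3.0, 5.0]},
]
query_image = {"label": "dress", "vector": [1.0, 2.0, 3.0, 4.0]}

print(recommend(query_image, classify, extract, catalog, top_n=2))
```

Item "c" is nearly identical to the query in feature space but belongs to a different category, so the filter removes it. That is the cross-category failure the classification step is meant to prevent.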

System Design Decisions

1. Classification before similarity

Improves semantic correctness

2. Hybrid feature representation

Combining deep features + handcrafted features improves ranking

3. Pearson over cosine

Better stability for this dataset

4. No database, file-based storage

Features stored as serialized vectors (pickle)
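
A minimal sketch of file-based feature storage with pickle, assuming the features are kept as a dictionary from item id to vector. The file name, dictionary layout, and helper names are illustrative assumptions; the article only states that vectors were serialized with pickle.

```python
import os
import pickle
import tempfile

def save_features(features, path):
    """Serialize the id -> vector mapping to a single file."""
    with open(path, "wb") as f:
        pickle.dump(features, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_features(path):
    """Load the mapping back into memory. Only unpickle files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Hypothetical precomputed vectors, for illustration only.
features = {
    "img_001": [0.12, 0.80, 0.33],
    "img_002": [0.45, 0.10, 0.91],
}

path = os.path.join(tempfile.mkdtemp(), "features.pkl")
save_features(features, path)
restored = load_features(path)

print(restored == features)
```

This keeps the setup simple, since the catalog vectors are computed once and loaded at startup. The trade-off is that the whole file must fit in memory and cannot be queried or updated incrementally the way a database could.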


Performance

  • YOLOv8 Top-1 Accuracy: ~0.87
  • Dataset size: ~54k images
  • High recommendation relevance in controlled tests

Challenges

1. Noisy images

Images contain:

  • models
  • backgrounds
  • irrelevant objects

This affects both classification and feature extraction.


2. Feature alignment problem

Different feature types (CNN + color) behave differently.

Combining them required:

  • normalization
  • careful weighting
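
One common way to handle this, shown here as a sketch rather than the project's actual scheme, is to L2-normalize each feature block separately and scale it by a weight before concatenation. The weights and toy vectors below are hypothetical; the article does not give the values that were used.

```python
import math

def l2_normalize(block):
    """Scale a block to unit length; an all-zero block is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in block))
    return [x / norm for x in block] if norm > 0 else list(block)

def weighted_concat(blocks, weights):
    """Normalize each block, multiply by its weight, then concatenate."""
    out = []
    for block, weight in zip(blocks, weights):
        out.extend(weight * x for x in l2_normalize(block))
    return out

# Toy blocks on very different scales.
cnn = [30.0, 40.0]   # large raw magnitudes
color = [0.3, 0.4]   # small raw magnitudes

vec = weighted_concat([cnn, color], [1.0, 0.5])
print(vec)
```

Without the per-block normalization, the block with the larger raw magnitudes would dominate any similarity score computed over the concatenated vector. The weight then becomes an explicit knob for how much each feature type should count.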

3. Similarity is not enough

Pure similarity systems fail without constraints.

This was the biggest realization.


Key Insight

The biggest takeaway from this project:

Recommendation systems are not model problems.
They are system design problems.


Conclusion

This project demonstrates that building a real-world recommendation system requires more than applying a single model.

It requires:

  • structured pipelines
  • multiple representations
  • careful decision-making across components

Even for a seemingly simple task like "find similar items", system complexity emerges quickly.

The final system is not just a model: it is a combination of decisions.


Notes

This project was developed as part of a team effort.

Collaborator:

  • Güneşsu Açık

Advisor:

  • Prof. Dr. Banu DİRİ

This article focuses on my approach to system design and the decisions made during development.
