
mjmaher987/Sentiment-Analysis-Project


🎬 Sentiment Analysis of Movie Reviews

Sentiment analysis on the Large Movie Review Dataset (IMDB), comparing a wide range of traditional ML and neural models across two feature-engineering approaches.

  • Author: Mohammad Javad Maheronnaghsh
  • Context: Machine Learning course project
  • 📄 Full write-up: project documentation.pdf

📋 Overview

The goal is to classify movie reviews by sentiment. The project walks through the full ML lifecycle — clean → vectorize → train → evaluate → compare — and deliberately contrasts two feature approaches so their effect on every model is visible:

  1. Approach 1 — Basic tools: hand-built text cleaning + classic vectorization (bag-of-words / TF-IDF style).
  2. Approach 2 — Ready embeddings: pretrained / learned word embeddings (Word2Vec, FastText, GloVe).

The same set of models is trained under both approaches, then the project closes with a small Kaggle-style competition including a model ensemble.
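The two feature approaches can be illustrated with a minimal, standard-library-only sketch. It is not the notebook's code: the function names `tfidf_vectors` and `mean_embedding`, the unsmoothed IDF formula, the toy embedding table, and the choice to average word vectors into one document vector are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Approach 1 sketch: map tokenized documents to sparse TF-IDF dicts."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = {}
        for term, count in counts.items():
            tf = count / len(doc)
            idf = math.log(n_docs / df[term])  # plain IDF, no smoothing (assumption)
            vec[term] = tf * idf
        vectors.append(vec)
    return vectors


def mean_embedding(tokens, embeddings, dim):
    """Approach 2 sketch: average the vectors of tokens found in the table."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim  # no known token: fall back to a zero vector
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy data; real runs would use the cleaned IMDB reviews and Word2Vec /
# FastText / GloVe vectors instead of this hypothetical 2-d table.
docs = [["great", "movie"], ["boring", "movie"]]
toy_embeddings = {"great": [1.0, 0.0], "boring": [-1.0, 0.0], "movie": [0.0, 1.0]}

tfidf = tfidf_vectors(docs)
doc_vec = mean_embedding(docs[0], toy_embeddings, dim=2)
```

In the toy example, "movie" appears in every document and so receives a TF-IDF weight of zero, while the embedding average keeps it as part of the document vector. In practice scikit-learn's `TfidfVectorizer` would normally replace the hand-written TF-IDF.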


🔄 Pipeline

flowchart LR
    A[IMDB reviews<br/>50,000 labeled] --> B[Preprocessing<br/>lowercase · strip punctuation<br/>remove stopwords · tokenize]
    B --> C1[Approach 1<br/>basic vectors / TF-IDF]
    B --> C2[Approach 2<br/>Word2Vec · FastText · GloVe]
    C1 --> D[Train models]
    C2 --> D
    D --> E[Evaluate<br/>accuracy · precision · recall · F1]
    E --> F[Compare + Ensemble<br/>Kaggle competition]

📊 Dataset

  • 50,000 IMDB movie reviews, balanced between positive and negative sentiment.
  • Split into train / validation / test for fitting, hyperparameter tuning, and final evaluation.
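A class-balanced train / validation / test split could look like the sketch below. The README does not state the split ratios, so the 80/10/10 fractions, the seed, and the helper name `stratified_split` are assumptions.

```python
import random


def stratified_split(labels, val_frac=0.1, test_frac=0.1, seed=0):
    """Return (train, val, test) index lists, preserving class balance."""
    rng = random.Random(seed)
    # Group example indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = int(len(indices) * val_frac)
        n_test = int(len(indices) * test_frac)
        val.extend(indices[:n_val])
        test.extend(indices[n_val:n_val + n_test])
        train.extend(indices[n_val + n_test:])
    return train, val, test


labels = [0, 1] * 50  # toy stand-in for the balanced sentiment labels
train_idx, val_idx, test_idx = stratified_split(labels)
```

Splitting per class keeps the positive/negative ratio identical in all three subsets, which matters when accuracy is the headline metric.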

Preprocessing steps: lowercasing · punctuation removal · stop-word removal · tokenization · vectorization.
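The cleaning steps listed above can be sketched as follows. This is an illustrative version rather than the notebook's implementation: the `preprocess` name and the tiny `STOPWORDS` set are assumptions (the project lists NLTK in its stack, whose English stop-word list is much longer), and tokenization here happens just before stop-word filtering.

```python
import string

# Tiny illustrative stop-word set (assumption, not the project's list).
STOPWORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "of", "to"}


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    lowered = text.lower()
    # Map punctuation to spaces so "good,bad" does not merge into one token.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = lowered.translate(table).split()
    return [t for t in tokens if t not in stopwords]


tokens = preprocess("This movie was GREAT, and the ending is a surprise!")
```

The resulting token lists are what a vectorizer or an embedding lookup would consume in the next pipeline stage.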


🧪 Models

The following models are implemented and evaluated under both feature approaches:

| Family | Models |
| --- | --- |
| Linear | Logistic Regression · SVM |
| Tree-based | Decision Tree · AdaBoost (Decision Tree base) |
| Neural networks | Simple feed-forward NN · Deep NN (dropout + regularization) |
| Embedding model | FastText (supervised) |

Neural networks use the Adam optimizer with cross-entropy loss; the FastText model is trained in supervised mode on the training sentences.
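To make the optimizer and loss concrete, here is a minimal pure-Python sketch of Adam minimizing binary cross-entropy for a one-feature logistic model. It stands in for the Keras networks only conceptually: the model, the toy data, the hyperparameter values, and the names `train_adam` and `binary_cross_entropy` are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy; eps keeps log() away from zero."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def train_adam(xs, ys, steps=200, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Fit p = sigmoid(w * x + b) with Adam on the cross-entropy loss."""
    params = [0.0, 0.0]  # w, b
    m = [0.0, 0.0]       # first-moment (mean) estimates
    v = [0.0, 0.0]       # second-moment (uncentered variance) estimates
    for t in range(1, steps + 1):
        preds = [sigmoid(params[0] * x + params[1]) for x in xs]
        # Gradients of the mean cross-entropy with respect to w and b.
        grad_w = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
        grad_b = sum(p - y for p, y in zip(preds, ys)) / len(xs)
        for i, g in enumerate((grad_w, grad_b)):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** t)  # bias correction
            v_hat = v[i] / (1 - beta2 ** t)
            params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params


xs = [-2.0, -1.0, 1.0, 2.0]  # toy 1-d features
ys = [0, 0, 1, 1]            # toy binary labels
w, b = train_adam(xs, ys)
loss_before = binary_cross_entropy(ys, [0.5] * len(ys))
loss_after = binary_cross_entropy(ys, [sigmoid(w * x + b) for x in xs])
```

An untrained model that predicts 0.5 everywhere has a loss of ln 2 ≈ 0.693; training should push the loss well below that on this separable toy set.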


📈 Results

Test-set accuracy for each model, basic vectors vs. word embeddings:

| Model | Approach 1 (basic) | Approach 2 (embeddings) |
| --- | --- | --- |
| Logistic Regression | 0.50 | 0.54 |
| SVM | 0.51 | 0.57 |
| Decision Tree | 0.36 | 0.58 |
| AdaBoost | 0.36 | 0.63 |
| Simple Neural Network | 0.50 | 0.65 |
| Deep Neural Network | 0.51 | 0.53 |

Takeaways

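  • Word embeddings beat basic vectors for every model: accuracy rises by between 0.02 (Deep Neural Network) and 0.27 (AdaBoost) when moving from Approach 1 to Approach 2.
  • The best result (0.65 accuracy) comes from the simple neural network on embeddings, with AdaBoost on embeddings close behind at 0.63.
  • Tree-based models gain the most from embedding features: Decision Tree +0.22 and AdaBoost +0.27.
  • There is room to grow. With classes balanced between positive and negative, chance accuracy is 0.50, so the Approach 1 scores sit at or below roughly chance level and even the best score is modest. More advanced architectures, larger pretrained embeddings, and stronger regularization are natural next steps.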
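The per-model gain from Approach 1 to Approach 2 can be recomputed directly from the accuracy table; the short sketch below only re-derives numbers already reported there (the variable names are illustrative).

```python
# (Approach 1 accuracy, Approach 2 accuracy), copied from the results table.
results = {
    "Logistic Regression": (0.50, 0.54),
    "SVM": (0.51, 0.57),
    "Decision Tree": (0.36, 0.58),
    "AdaBoost": (0.36, 0.63),
    "Simple Neural Network": (0.50, 0.65),
    "Deep Neural Network": (0.51, 0.53),
}

# Accuracy gain from switching to embeddings, rounded to two decimals.
gains = {model: round(emb - basic, 2) for model, (basic, emb) in results.items()}

# Largest gain first.
ranked = sorted(gains.items(), key=lambda item: item[1], reverse=True)
for model, gain in ranked:
    print(f"{model:<22} +{gain:.2f}")
```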

Full classification reports (precision / recall / F1 per class) are printed inline in the notebook.
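For reference, the reported metrics can be computed from predictions as in the sketch below. The function name `classification_metrics` and the toy labels are assumptions; scikit-learn's `classification_report` produces the same kind of per-class summary.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy plus precision / recall / F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard against empty denominators (e.g. the class is never predicted).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


y_true = [1, 1, 1, 0, 0, 0]  # toy ground-truth labels
y_pred = [1, 1, 0, 0, 0, 1]  # toy predictions
metrics = classification_metrics(y_true, y_pred)
```

Calling the function once with `positive=1` and once with `positive=0` gives the per-class view that a classification report shows.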


🗂️ Repository Structure

Sentiment-Analysis-Project/
├── Final Edition.ipynb          # main, cleaned end-to-end notebook
├── Source Codes/                # earlier iterations / working notebooks
├── project documentation.pdf    # full project report
├── LICENSE
└── README.md

🚀 Getting Started

This project is delivered as a Jupyter notebook, so no separate scripts are required.

# clone
git clone https://github.com/mjmaher987/Sentiment-Analysis-Project.git
cd Sentiment-Analysis-Project

# typical dependencies
pip install numpy pandas scikit-learn tensorflow fasttext gensim nltk

# then open the notebook
jupyter notebook "Final Edition.ipynb"

Run the cells top-to-bottom to reproduce preprocessing, training of both approaches, evaluation, and the final ensemble.


🛠️ Tech Stack

  • Language: Python (Jupyter)
  • Classic ML: scikit-learn (Logistic Regression, SVM, Decision Tree, AdaBoost)
  • Deep Learning: TensorFlow / Keras
  • Embeddings / NLP: FastText, Word2Vec, GloVe, NLTK

🤝 Contributing

Ideas and improvements are welcome — feel free to open an issue or PR with new architectures, embeddings, or evaluation ideas.
