🎬 Sentiment Analysis of Movie Reviews

Sentiment analysis on the Large Movie Review Dataset (IMDB), comparing a wide range of traditional ML and neural models across two feature-engineering approaches.

Author: Mohammad Javad Maheronnaghsh
Context: Machine Learning course project
📄 Full write-up: project documentation.pdf

📋 Overview

The goal is to classify movie reviews by sentiment. The project walks through the full ML lifecycle — clean → vectorize → train → evaluate → compare — and deliberately contrasts two feature approaches so their effect on every model is visible:

Approach 1 — Basic tools: hand-built text cleaning + classic vectorization (bag-of-words / TF-IDF style).
Approach 2 — Ready embeddings: pretrained / learned word embeddings (Word2Vec, FastText, GloVe).
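
To make the contrast concrete, here is a minimal standard-library sketch of the two feature styles. It is not the notebook's code: the function names, the smoothed IDF weighting, the toy 2-dimensional embeddings, and the choice to average word vectors into one review vector are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Approach 1 sketch: one TF-IDF vector per document over a shared vocabulary."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each token
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    # Smoothed IDF (an assumed variant): a token found in every document gets weight 1.0
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors

def mean_embedding(tokens, embeddings, dim):
    """Approach 2 sketch: average the vectors of the tokens found in `embeddings`."""
    found = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not found:
        return [0.0] * dim  # no known token: fall back to a zero vector
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]

docs = [["great", "movie"], ["boring", "movie"]]
vocab, vectors = tfidf_vectors(docs)

# Hypothetical 2-dimensional embeddings, standing in for real
# Word2Vec / FastText / GloVe vectors
toy_embeddings = {"great": [1.0, 0.0], "boring": [-1.0, 0.0], "movie": [0.0, 1.0]}
doc_vector = mean_embedding(docs[0], toy_embeddings, dim=2)
```

In the TF-IDF vector the word shared by both reviews ("movie") is weighted lower than the distinguishing word ("great"), while the embedding vector has a fixed size regardless of vocabulary.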

The same set of models is trained under both approaches, then the project closes with a small Kaggle-style competition including a model ensemble.
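
The text above does not say how the ensemble combines its members. One common option is hard majority voting, sketched here under that assumption; `majority_vote` and the toy predictions are hypothetical and not taken from the notebook.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label list by majority vote.

    predictions_per_model: list of equally long label lists, one per model.
    With an odd number of models and two classes there are no ties.
    """
    combined = []
    for votes in zip(*predictions_per_model):
        # Pick the label that the largest number of models predicted
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical predictions from three models on four reviews (1 = positive)
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]
ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```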

🔄 Pipeline

flowchart LR
    A[IMDB reviews<br/>50,000 labeled] --> B[Preprocessing<br/>lowercase · strip punctuation<br/>remove stopwords · tokenize]
    B --> C1[Approach 1<br/>basic vectors / TF-IDF]
    B --> C2[Approach 2<br/>Word2Vec · FastText · GloVe]
    C1 --> D[Train models]
    C2 --> D
    D --> E[Evaluate<br/>accuracy · precision · recall · F1]
    E --> F[Compare + Ensemble<br/>Kaggle competition]

📊 Dataset

50,000 IMDB movie reviews, balanced between positive and negative sentiment.
Split into train / validation / test for fitting, hyperparameter tuning, and final evaluation.
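
The split ratios are not stated here, so the sketch below assumes an 80 / 10 / 10 split purely for illustration; the function name and the fixed seed are likewise hypothetical.

```python
import random

def train_val_test_split(items, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle a copy of `items` and cut it into train / validation / test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Stand-in for the 50,000 reviews: integer ids instead of review text
train, val, test = train_val_test_split(range(50_000))
print(len(train), len(val), len(test))
```

Keeping the test portion untouched until the end is what makes the final accuracy figures an honest estimate; hyperparameters are tuned on the validation portion only.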

Preprocessing steps: lowercasing · punctuation removal · stop-word removal · tokenization · vectorization.
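
The cleaning steps listed above can be sketched with the standard library alone. This is an illustration, not the notebook's implementation: the stop-word list is a tiny hypothetical one (the project lists NLTK among its tools), and the sketch tokenizes on whitespace before filtering stop words.

```python
import string

# Tiny illustrative stop-word list; the notebook's actual list is not given in the source
STOP_WORDS = {"a", "an", "the", "is", "was", "it", "this", "of", "and", "to"}

def preprocess(review):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    lowered = review.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = preprocess("This movie was GREAT, and the ending is a surprise!")
print(tokens)
```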

🧪 Models

The following models are implemented and evaluated under both feature approaches:

Family	Models
Linear	Logistic Regression · SVM
Tree-based	Decision Tree · AdaBoost (Decision Tree base)
Neural networks	Simple feed-forward NN · Deep NN (dropout + regularization)
Embedding model	FastText (supervised)

Neural networks use the Adam optimizer with cross-entropy loss; the FastText model is trained in supervised mode on the training sentences.
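
As a reminder of what the loss measures, here is a small sketch of cross-entropy for the two-class case. It assumes the binary form with a single predicted probability of the positive class; the notebook may instead use a categorical form, and `binary_cross_entropy` is a hypothetical helper rather than a framework function.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted P(positive)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Confident, correct predictions give a low loss
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
# Predicting 0.5 for everything gives ln(2), the loss of a coin flip
coin_flip = binary_cross_entropy([1, 0], [0.5, 0.5])
print(confident_right, coin_flip)
```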

📈 Results

Test-set accuracy for each model, basic vectors vs. word embeddings:

Model	Approach 1 (basic)	Approach 2 (embeddings)
Logistic Regression	0.50	0.54
SVM	0.51	0.57
Decision Tree	0.36	0.58
AdaBoost	0.36	0.63
Simple Neural Network	0.50	0.65
Deep Neural Network	0.51	0.53

Takeaways

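Word embeddings beat basic vectors for every model: accuracy rises from Approach 1 to Approach 2 in all six rows, by between 0.02 (Deep Neural Network) and 0.27 (AdaBoost).
The best result (0.65 accuracy) comes from the embedding-based Simple Neural Network, followed by embedding-based AdaBoost at 0.63.
Tree-based models gain the most from embedding features: Decision Tree improves by 0.22 and AdaBoost by 0.27.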
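There is room to grow: on a dataset balanced between two classes, random guessing scores about 0.50, so the Approach 1 accuracies (0.36 to 0.51) sit at or below chance and the best Approach 2 score is only 0.15 above it. More advanced architectures, larger pretrained embeddings, and stronger regularization are natural next steps.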
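
The per-model gains behind these takeaways can be recomputed directly from the results table. The accuracies below are copied from that table; the variable names are only illustrative.

```python
# Test-set accuracies from the results table: (Approach 1, Approach 2)
results = {
    "Logistic Regression": (0.50, 0.54),
    "SVM": (0.51, 0.57),
    "Decision Tree": (0.36, 0.58),
    "AdaBoost": (0.36, 0.63),
    "Simple Neural Network": (0.50, 0.65),
    "Deep Neural Network": (0.51, 0.53),
}

# Gain from switching to embeddings, rounded to the table's two decimals
gains = {model: round(emb - basic, 2) for model, (basic, emb) in results.items()}

# Largest gain first
ranked = sorted(gains.items(), key=lambda item: item[1], reverse=True)
for model, gain in ranked:
    print(f"{model}: +{gain:.2f}")
```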

Full classification reports (precision / recall / F1 per class) are printed inline in the notebook.
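
For readers who want to see what such a report contains, the sketch below computes accuracy plus precision, recall, and F1 for one class by hand. The labels are made up for illustration and `class_report` is a hypothetical helper, not the function the notebook calls.

```python
def class_report(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up labels for six reviews (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
positive = class_report(y_true, y_pred, label=1)
negative = class_report(y_true, y_pred, label=0)
print(accuracy, positive, negative)
```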

🗂️ Repository Structure

Sentiment-Analysis-Project/
├── Final Edition.ipynb          # main, cleaned end-to-end notebook
├── Source Codes/                # earlier iterations / working notebooks
├── project documentation.pdf    # full project report
├── LICENSE
└── README.md

🚀 Getting Started

This project is delivered as a Jupyter notebook, so no separate scripts are required.

# clone
git clone https://github.com/mjmaher987/Sentiment-Analysis-Project.git
cd Sentiment-Analysis-Project

# typical dependencies
pip install numpy pandas scikit-learn tensorflow fasttext gensim nltk

# then open the notebook
jupyter notebook "Final Edition.ipynb"

Run the cells top-to-bottom to reproduce preprocessing, training of both approaches, evaluation, and the final ensemble.

🛠️ Tech Stack

Language: Python (Jupyter)
Classic ML: scikit-learn (Logistic Regression, SVM, Decision Tree, AdaBoost)
Deep Learning: TensorFlow / Keras
Embeddings / NLP: FastText, Word2Vec, GloVe, NLTK

🤝 Contributing

Ideas and improvements are welcome — feel free to open an issue or PR with new architectures, embeddings, or evaluation ideas.
