Sentiment analysis on the Large Movie Review Dataset (IMDB), comparing a wide range of traditional ML and neural models across two feature-engineering approaches.
- Author: Mohammad Javad Maheronnaghsh
- Context: Machine Learning course project
- 📄 Full write-up: project documentation.pdf
The goal is to classify movie reviews by sentiment. The project walks through the full ML lifecycle — clean → vectorize → train → evaluate → compare — and deliberately contrasts two feature approaches so their effect on every model is visible:
- Approach 1 — Basic tools: hand-built text cleaning + classic vectorization (bag-of-words / TF-IDF style).
- Approach 2 — Ready embeddings: pretrained / learned word embeddings (Word2Vec, FastText, GloVe).
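To make the contrast between the two approaches concrete, here is a minimal standard-library sketch, not the notebook's actual code. `tfidf_vectors`, `mean_embedding`, and the toy 2-d embeddings are hypothetical names and values chosen for illustration. The IDF uses the smoothed form that scikit-learn's `TfidfVectorizer` applies by default; the row normalization that class also performs is omitted here.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Approach 1 style: one TF-IDF weight per vocabulary term."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vocab, vectors

def mean_embedding(tokens, embeddings, dim):
    """Approach 2 style: average the vectors of the known tokens."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim  # no known token: fall back to a zero vector
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

docs = [["great", "movie"], ["boring", "movie"]]
vocab, X_tfidf = tfidf_vectors(docs)

# Toy 2-d embeddings, made up for illustration only
toy_embeddings = {"great": [1.0, 0.0], "boring": [-1.0, 0.0], "movie": [0.0, 1.0]}
X_embed = [mean_embedding(doc, toy_embeddings, dim=2) for doc in docs]
```

The TF-IDF vector grows with the vocabulary and is mostly zeros, while the averaged embedding has a fixed, small dimension regardless of vocabulary size.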
The same set of models is trained under both approaches, then the project closes with a small Kaggle-style competition including a model ensemble.
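The README does not say how the ensemble combines its members. One common option is hard majority voting, sketched below under that assumption; `majority_vote` and the three prediction lists are hypothetical.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per review."""
    ensemble = []
    # zip(*...) groups the votes that all models cast for the same review
    for votes in zip(*predictions_per_model):
        # most_common(1) returns the label with the highest vote count
        ensemble.append(Counter(votes).most_common(1)[0][0])
    return ensemble

# Hypothetical predictions from three models on four reviews (1 = positive)
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]
combined = majority_vote([model_a, model_b, model_c])
```

With an odd number of binary classifiers there are no ties, which is one reason to ensemble three or five models rather than two or four.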
flowchart LR
A[IMDB reviews<br/>50,000 labeled] --> B[Preprocessing<br/>lowercase · strip punctuation<br/>remove stopwords · tokenize]
B --> C1[Approach 1<br/>basic vectors / TF-IDF]
B --> C2[Approach 2<br/>Word2Vec · FastText · GloVe]
C1 --> D[Train models]
C2 --> D
D --> E[Evaluate<br/>accuracy · precision · recall · F1]
E --> F[Compare + Ensemble<br/>Kaggle competition]
- 50,000 IMDB movie reviews, balanced between positive and negative sentiment.
- Split into train / validation / test for fitting, hyperparameter tuning, and final evaluation.
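A three-way split can be sketched as follows. The 80/10/10 ratios and the seed are assumptions, since the README does not state the proportions used, and `split_dataset` is a hypothetical helper.

```python
import random

def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle once, then cut into train / validation / test slices."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, val, test

# Integer stand-ins for the 50,000 reviews
reviews = list(range(50_000))
train, val, test = split_dataset(reviews)
```

This version does not stratify by label; on a balanced dataset of this size a random shuffle keeps the classes close to even in each slice, but a stratified split would guarantee it.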
Preprocessing steps: lowercasing · punctuation removal · stop-word removal · tokenization · vectorization.
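The cleaning steps listed above might look like this in plain Python. This is a sketch, not the notebook's code: `preprocess` is a hypothetical function, and the stop-word set is a tiny illustrative list, where a real run would more likely use a fuller one such as NLTK's English list.

```python
import string

# Tiny illustrative stop-word list (assumption: the real list is larger)
STOP_WORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "of", "to"}

# Map every punctuation character to a space so "good,but" does not merge
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    text = text.lower()
    text = text.translate(PUNCT_TO_SPACE)
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = preprocess("This movie was GREAT, and the ending is perfect!")
```

The resulting token lists are what a vectorizer or an embedding lookup would consume in the next step.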
The following models are implemented and evaluated under both feature approaches:
| Family | Models |
|---|---|
| Linear | Logistic Regression · SVM |
| Tree-based | Decision Tree · AdaBoost (Decision Tree base) |
| Neural networks | Simple feed-forward NN · Deep NN (dropout + regularization) |
| Embedding model | FastText (supervised) |
Neural networks use the Adam optimizer with cross-entropy loss; the FastText model is trained in supervised mode on the training sentences.
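For intuition about the loss, here is a small sketch of binary cross-entropy, assuming the two-class form (the README only says "cross-entropy"). `binary_cross_entropy` is a hypothetical helper written with the standard library rather than a framework call.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean negative log-likelihood of the true labels under y_prob."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Same labels, confident-and-right vs. confident-and-wrong probabilities
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
```

The loss is small when the predicted probability agrees with the label and grows sharply for confident mistakes, which is the signal the optimizer follows during training.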
Test-set accuracy for each model, basic vectors vs. word embeddings:
| Model | Approach 1 (basic) | Approach 2 (embeddings) |
|---|---|---|
| Logistic Regression | 0.50 | 0.54 |
| SVM | 0.51 | 0.57 |
| Decision Tree | 0.36 | 0.58 |
| AdaBoost | 0.36 | 0.63 |
| Simple Neural Network | 0.50 | 0.65 |
| Deep Neural Network | 0.51 | 0.53 |
Takeaways
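- Word embeddings beat basic vectors for every model: accuracy rises by between 0.02 (Deep Neural Network) and 0.27 (AdaBoost) when moving from Approach 1 to Approach 2.
- The best result (0.65 accuracy) comes from the embedding-based Simple Neural Network, followed by embedding-based AdaBoost at 0.63.
- Tree-based models gain the most from embedding features: Decision Tree improves by 0.22 and AdaBoost by 0.27.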
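- There is room to grow: Approach 1 accuracies of 0.36 to 0.51 sit at or below the 0.50 chance level for a balanced two-class dataset, and more advanced architectures, larger pretrained embeddings, and stronger regularization are natural next steps.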
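The per-model gains behind these takeaways can be recomputed from the results table. The accuracies below are copied from that table; the variable names are illustrative.

```python
# Test-set accuracies from the results table: (Approach 1, Approach 2)
results = {
    "Logistic Regression": (0.50, 0.54),
    "SVM": (0.51, 0.57),
    "Decision Tree": (0.36, 0.58),
    "AdaBoost": (0.36, 0.63),
    "Simple Neural Network": (0.50, 0.65),
    "Deep Neural Network": (0.51, 0.53),
}

# Gain from switching to embeddings, rounded to the table's precision
gains = {name: round(emb - basic, 2) for name, (basic, emb) in results.items()}
best_model = max(results, key=lambda name: results[name][1])
largest_gain = max(gains, key=gains.get)

# Print models from largest to smallest gain
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22} +{gain:.2f}")
```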
Full classification reports (precision / recall / F1 per class) are printed inline in the notebook.
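For reference, the reported metrics can be computed by hand as in the sketch below. `classification_metrics` and the example labels are hypothetical; in practice a library routine such as scikit-learn's `classification_report` produces the same per-class figures.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy plus precision / recall / F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical labels: 1 = positive review, 0 = negative review
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

Calling the function again with `positive=0` gives the figures for the negative class, which is how a per-class report is assembled.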
Sentiment-Analysis-Project/
├── Final Edition.ipynb # main, cleaned end-to-end notebook
├── Source Codes/ # earlier iterations / working notebooks
├── project documentation.pdf # full project report
├── LICENSE
└── README.md
This project is delivered as a Jupyter notebook, so no separate scripts are required.
# clone
git clone https://github.com/mjmaher987/Sentiment-Analysis-Project.git
cd Sentiment-Analysis-Project
# typical dependencies
pip install numpy pandas scikit-learn tensorflow fasttext gensim nltk
# then open the notebook
jupyter notebook "Final Edition.ipynb"
Run the cells top-to-bottom to reproduce preprocessing, training of both approaches, evaluation, and the final ensemble.
- Language: Python (Jupyter)
- Classic ML: scikit-learn (Logistic Regression, SVM, Decision Tree, AdaBoost)
- Deep Learning: TensorFlow / Keras
- Embeddings / NLP: FastText, Word2Vec, GloVe, NLTK
Ideas and improvements are welcome — feel free to open an issue or PR with new architectures, embeddings, or evaluation ideas.