

This project detects spam messages by using machine learning to classify text as either spam or not spam. The dataset, sourced from Kaggle, contains labeled examples of both categories. Key steps include preprocessing the text data, applying natural language processing (NLP) techniques, and developing a classification model. Hyperparameter tuning and evaluation were then used to improve the model’s accuracy and precision.

Exploratory data analysis (EDA) was used to understand the distribution of spam and non-spam messages. Visualizations comparing message frequencies gave a clear overview of how the two classes are distributed. Word clouds were also generated to highlight the most frequent words in each category, showing the language patterns that distinguish spam from legitimate messages.
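The two EDA checks described above (class distribution and most frequent words per class) can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the `messages` sample, the `ham` label name, and the helper names `class_counts` and `top_words` are all hypothetical, and the word counts stand in for what a word cloud would show visually.

```python
from collections import Counter

# Hypothetical toy sample; the real project uses the labeled Kaggle dataset.
messages = [
    ("spam", "Win a free prize now"),
    ("spam", "Free entry claim your prize"),
    ("ham", "Are we still meeting for lunch"),
    ("ham", "Call me when you are free"),
    ("ham", "See you at lunch"),
]

def class_counts(rows):
    """Count how many messages carry each label."""
    return Counter(label for label, _ in rows)

def top_words(rows, label, n=3):
    """Return the n most frequent lowercase words among messages with one label."""
    counts = Counter(
        word
        for row_label, text in rows
        if row_label == label
        for word in text.lower().split()
    )
    return counts.most_common(n)

print(class_counts(messages))
print(top_words(messages, "spam", n=2))
```

On a real dataset the same counts would typically feed a bar chart and a word cloud; a strong class imbalance found here is one reason to look at precision as well as accuracy.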
📋 Data Cleaning
The data was cleaned to ensure quality and consistency through the following steps:
Handling Null Values:
Text Standardization:
Converted text to lowercase.
Removed punctuation.
Eliminated stopwords.
Tokenization and Lemmatization:
Tokenized the text into meaningful units.
Applied lemmatization for better semantic representation.
Function Input: requires at least 5 words.
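The cleaning steps listed above could look roughly like the following sketch. The source does not name the libraries used, so this version relies only on the standard library: `STOPWORDS` is a tiny illustrative set, `crude_lemma` is a rough suffix-stripping stand-in for a real lemmatizer, and treating a null message as empty is an assumption about how null values were handled.

```python
import string

# Tiny illustrative stopword set (assumption); a real pipeline would use a fuller list.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "for", "you", "your", "and", "of", "in"}

def crude_lemma(token):
    """Rough stand-in for lemmatization: strip one common suffix if enough remains."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def clean_text(text):
    """Lowercase, remove punctuation, tokenize, drop stopwords, then lemmatize."""
    if text is None:  # assumption: a null message is treated as empty
        return []
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()  # whitespace tokenization
    return [crude_lemma(t) for t in tokens if t not in STOPWORDS]

print(clean_text("You are claiming FREE prizes!!!"))  # ['claim', 'free', 'prize']
```

Tokenization happens before stopword removal here because stopwords are matched per token; a dictionary-based lemmatizer would give better results than the suffix rule shown.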
🧪 Model Selection and Optimization
Several machine learning algorithms were tested, and the Random Forest Classifier delivered the best performance. Key metrics included:
Accuracy: ~97%
Precision: ~100%
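Precision was prioritized to minimize false positives, because misclassifying a legitimate message as spam can cause significant issues. The spam detection threshold was adjusted to 75%, so a message is flagged as spam only when the model's predicted spam probability reaches that level, which further improves precision at the cost of letting some borderline spam through.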
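The effect of the threshold on precision can be illustrated with a small sketch. The probabilities and labels below are made-up values, and the helper names are hypothetical; in practice the spam probabilities would come from the trained classifier (for example, the spam column of scikit-learn's `predict_proba` output).

```python
def classify_with_threshold(spam_probs, threshold=0.75):
    """Label a message spam (1) only if its spam probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in spam_probs]

def precision(y_true, y_pred):
    """TP / (TP + FP); returns 0.0 when nothing was predicted as spam."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0

def accuracy(y_true, y_pred):
    """Fraction of messages whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Hypothetical true labels (1 = spam) and predicted spam probabilities.
y_true = [1, 1, 1, 0, 0, 0]
spam_probs = [0.95, 0.80, 0.60, 0.70, 0.20, 0.05]

for threshold in (0.5, 0.75):
    y_pred = classify_with_threshold(spam_probs, threshold)
    print(threshold, precision(y_true, y_pred), accuracy(y_true, y_pred))
```

In this toy data, raising the threshold from 0.5 to 0.75 removes the one false positive (precision goes from 0.75 to 1.0) but also misses one real spam message, which is the trade-off accepted when precision is the priority.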
🚀 Key Features
High precision and accuracy in spam detection.
Threshold-based classification to reduce false positives.
Robust preprocessing pipeline for textual data.