Classify Spam Emails with Natural Language Processing
Abstract
Have you ever wondered how your phone or email automatically detects and filters junk messages into spam? This feature helps ensure you focus on the messages that truly matter. In this project, you’ll build your own spam detector using natural language processing (NLP) techniques.
Summary
Prerequisites: None
Material Availability: Readily available
Safety: No issues
Objective
Identify the most common words in spam emails and determine which are the most effective for detecting spam.
Introduction
Language is a powerful tool we use to communicate, but not all messages are welcome. Have you ever wondered how your email or phone automatically detects and filters spam messages? Some junk emails are obvious, with phrases like “Congratulations! You’ve won a prize!” while others are more subtle, attempting to mimic real messages to trick you into clicking on harmful links.
Cybersecurity experts and developers work to identify and block spam, but with millions of messages sent every day, manually sorting them is not practical. How can we efficiently detect spam on such a large scale?
This is where Natural Language Processing (NLP) comes in. Natural Language Processing is a branch of Artificial Intelligence (AI) that teaches computers to understand human language. Using NLP, computers can analyze large amounts of text and recognize patterns that distinguish spam from legitimate messages, helping to keep inboxes clean and secure.
In this project, you will identify the most effective tokens (individual units of text) for detecting spam. Then, you will analyze your findings and refine your spam detection strategy.
Terms and Concepts
- Natural Language Processing (NLP)
- Artificial Intelligence (AI)
- Token
- Tokenization
- Lemmatization
- Stopword
- Count vector
- Feature matrix
- Target array
- Precision
- Recall
- F1-score
- Support
- Accuracy
Questions
- Why is it impractical for humans to manually sort through all spam messages?
- How does Natural Language Processing (NLP) help in spam detection?
- Can you think of a time when you received a suspicious message or email? What made it seem like spam?
- Why do you think some spam messages are harder to detect than others?
Bibliography
The dataset we will be using can be found here:
- Naidu, Chandramouli. (2021, March). Spam Classification for Basic NLP. Kaggle. Retrieved February 18, 2025.
Original code based on this project:
- Hogg, Greg. (2022, February). NLP Tutorial in Python - Spam Classification. YouTube. Retrieved February 18, 2025.
To learn more about Natural Language Processing (NLP):
- IBM Technology. (2021, August). What is NLP (Natural Language Processing)?. YouTube. Retrieved February 18, 2025.
- Murel, Jacob, & Kavlakoglu, Eda. (n.d.). What are stemming and lemmatization?. IBM. Retrieved February 18, 2025.
To learn more about why we split a dataset into train and test:
- Turp, Misra. (2023, February). Why do we split data into train test and validation sets?. YouTube. Retrieved February 18, 2025.
To learn more about reading a classification report:
- Kohli, Shivam. (2019, November). Understanding a Classification Report For Your Machine Learning Model. Medium. Retrieved February 18, 2025.
Materials and Equipment
- Computer with Internet access
Experimental Procedure

Overview
This project guides you through building a machine learning model to classify emails as spam or non-spam using natural language processing (NLP). Your main task is to define the most relevant tokens (specific words or phrases) that will help the model accurately identify spam messages. The model will only count and analyze the tokens you select for this purpose.
Setting Up the Project Environment
- You will need a Google account. If you do not have one, make one when prompted.
- Navigate to Google Drive, click on “My Drive,” then create a new folder and name it “spam_filter.”
- Download the spam_classification.ipynb file from Science Buddies. This is the code you will need to process your data.
- Download the spam_data.csv file from Science Buddies. This is the file containing the spam and non-spam emails.
- Inside your “spam_filter” folder on Google Drive, upload both the spam_classification.ipynb file and the spam_data.csv file.
- Double-click on the spam_classification.ipynb file. This should automatically open in Google Colab.
- Read the Troubleshooting Tips and How to Use This Notebook sections. Follow the instructions you find in those sections.
- Run all the blocks under Importing Libraries to ensure you have access to all the functions we will use for this project.
1. Loading the Data into a Pandas DataFrame
- (Code Block 1A) Run this code block to make the files on your Google Drive available to use in the notebook.
- (Code Block 1B) Run this code block to create a DataFrame, which is like a table that will be used to load and manipulate the data in the notebook. You will see the data from your .csv file populate in a table below the code block.
2. Preprocessing
2.1 Tokenize a Test Message
Before preprocessing the entire dataset, we will explore fundamental NLP techniques, such as tokenization, lemmatization, and stopword removal.
- (Code Block 2A) This code block cleans a passage of text by removing punctuation and special characters, keeping only words. It uses RegexpTokenizer from NLTK (Natural Language Toolkit) with the pattern \w+, which matches runs of letters, numbers, and underscores. When applied, it turns the text into a list of words, ignoring symbols. This process is called tokenization, which means breaking text into smaller parts (tokens) like words or phrases to make it easier to analyze.
- (Code Block 2B) This code block converts all words to lowercase to keep them uniform. This helps treat words like “Hello” and “hello” as the same, making text processing easier. The final list of lowercase words is then displayed.
- (Code Block 2C) This code block reduces words to their base form using lemmatization, a process that converts words to their dictionary form based on meaning and context (e.g., “running” -> “run”, “better” -> “well”). This ensures linguistic accuracy and improves text consistency for NLP. The lemmatized words are then displayed. To learn more about lemmatization, see the IBM article on stemming and lemmatization listed in the Bibliography.
- (Code Block 2D) This code block loads a list of common English stopwords (e.g., “the”, “and”, “is”). Stopwords are frequent words that usually do not add much meaning to a sentence and are often removed in text processing to improve efficiency. The code retrieves these stopwords from the English language list and then displays them.
- (Code Block 2E) This code block creates a list of HTML-related words (like “html”, “div”, “table”, “href”) that may appear in text but are not useful for analysis. It then adds these words to the existing list of stopwords.
- HTML tags can appear in non-spam emails because many emails are formatted using HTML to include features like bold text, images, links, tables, and colors. Most modern email clients (like Gmail or Outlook) support HTML formatting to improve readability and design.
- These tags are normal and not necessarily spam-related, but they can still clutter text processing tasks, so we will remove them.
- (Code Block 2F) This code block removes unimportant words (like common stopwords and HTML) from the lemmatized tokens, making the text cleaner and more useful for text analysis.
- (Code Block 2G) Now that we have covered the different text-cleaning steps, we will define a function that performs all of them automatically. This function, message_to_token_list(s), takes a string s and applies the following steps:
- Tokenization: Splits the text into words.
- Lowercasing: Converts all words to lowercase for uniformity.
- Lemmatization: Converts words to their base form (e.g., “running” -> “run”).
- Stopword Removal: Removes common words (e.g., “the”, “is”) and HTML-related terms.
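The four steps above can be sketched using only the Python standard library. This is a minimal illustration, not the notebook's code: re.findall stands in for NLTK's RegexpTokenizer, and TOY_LEMMAS and TOY_STOPWORDS are small hypothetical stand-ins for the lemmatizer and the full stopword list, so the example runs without downloading any NLTK data.

```python
import re

# Hypothetical stand-ins so the sketch runs without NLTK or its data files.
TOY_LEMMAS = {"running": "run", "prizes": "prize", "links": "link", "won": "win"}
TOY_STOPWORDS = {"the", "is", "and", "a", "you", "to", "html", "div", "href"}


def message_to_token_list(s):
    """Clean a message: tokenize, lowercase, lemmatize, drop stopwords."""
    tokens = re.findall(r"\w+", s)                            # 1. tokenization
    lowercased = [t.lower() for t in tokens]                  # 2. lowercasing
    lemmatized = [TOY_LEMMAS.get(t, t) for t in lowercased]   # 3. lemmatization (toy lookup)
    return [t for t in lemmatized if t not in TOY_STOPWORDS]  # 4. stopword removal


print(message_to_token_list("<div>You WON the prizes!</div> Click the links."))
# ['win', 'prize', 'click', 'link']
```

Notice that the HTML tag name "div" is tokenized like any other word, which is why HTML-related terms are added to the stopword list before filtering.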
2.2 Find the Most Common Tokens in Non-Spam Emails
- (Code Block 2H) This code block counts the unique words in non-spam emails. It goes through each message, breaks it into words, and tracks how often each word appears. In the end, it gives the total number of unique words in non-spam emails. Don't worry if this code block takes a bit to run; there are a lot of emails!
- (Code Block 2I) This code block sorts the tokens by word count, from highest to lowest. It then displays the sorted list of words and their counts.
- What are the most common words in non-spam emails?
- (Code Block 2J) This code block creates a bar chart to show the top 50 most frequent words in non-spam emails. It takes the most common words and their counts to make a chart with the words on the x-axis and their frequencies on the y-axis.
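The counting and sorting steps in this section can be sketched with Python's collections.Counter. The token lists below are hypothetical, already-cleaned examples, and the notebook may organize its counting differently; treat this as an illustration of the idea only.

```python
from collections import Counter

# Hypothetical, already-cleaned token lists standing in for non-spam emails.
token_lists = [
    ["meeting", "tomorrow", "agenda"],
    ["meeting", "notes", "attached"],
    ["lunch", "tomorrow", "meeting"],
]

token_counter = Counter()
for tokens in token_lists:
    token_counter.update(tokens)  # add one count per occurrence of each token

print(len(token_counter))                  # number of unique tokens: 6
top_tokens = token_counter.most_common(2)  # sorted from highest to lowest count
print(top_tokens)                          # [('meeting', 3), ('tomorrow', 2)]
```

The word and count pairs returned by most_common are in the right shape to feed a bar chart, with the words on the x-axis and the counts on the y-axis.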
2.3 Find the Most Common Tokens in Spam Emails
- (Code Block 2K) This code block counts the unique words in spam emails. It goes through each message, breaks it into words, and tracks how often each word appears. In the end, it gives the total number of unique words in spam emails. Don't worry if this code block takes a while to run; there are a lot of emails!
- (Code Block 2L) This code block sorts the tokens by word count, from highest to lowest. It then displays the sorted list of words and their counts.
- What are the most common words in spam emails?
- (Code Block 2M) This code block creates a bar chart to show the top 50 most frequent words in spam emails. It takes the most common words and their counts to make a chart with the words on the x-axis and their frequencies on the y-axis.
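Once you have counts for both categories, one way to compare them is to look at which frequent words appear in both lists and which appear only in spam. The sketch below uses hypothetical counts and a hypothetical helper, top_words; it is not part of the notebook.

```python
from collections import Counter

# Hypothetical token counts for each category (illustration only).
non_spam_counts = Counter({"email": 40, "list": 35, "meeting": 30, "free": 5})
spam_counts = Counter({"free": 50, "email": 45, "click": 30, "money": 25})


def top_words(counter, n):
    """Return the set of the n most frequent tokens in a Counter."""
    return {word for word, _ in counter.most_common(n)}


top_non_spam = top_words(non_spam_counts, 3)
top_spam = top_words(spam_counts, 3)

overlap = top_spam & top_non_spam    # frequent in both: weak spam signals
spam_only = top_spam - top_non_spam  # frequent only in spam: candidate tokens

print(sorted(overlap))    # ['email']
print(sorted(spam_only))  # ['click', 'free']
```

In this toy data, "email" is common in both categories, so it would tell the model little, while "click" and "free" stand out as spam-specific.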
2.4 Decide Which Tokens to Count
You will create a list of spam-related words (spam_tokens), focusing on terms like “free” or “urgent” to help the algorithm detect spam. Limiting the number of tokens helps keep the model efficient by avoiding unnecessary complexity, as tracking too many words can slow down the model and distract it from focusing on the tokens that truly matter.
- (Code Block 2N) In this code block, you will choose the tokens for the algorithm to track. Under the #TODO comment, there is an empty list called spam_tokens. Enter the tokens you want to use, ensuring each token is surrounded by single or double quotes and separated by commas (e.g., spam_tokens = ['http', 'com', 'email']). You can add as many or as few tokens as you would like.
- Review the most common words from both non-spam and spam emails, and check for any overlap. Words that appear frequently in both may not be useful for distinguishing spam.
- Spam emails often create urgency, pushing users to click on scam links. Do you notice words like “hurry” or “urgent”?
- Some spam emails also sound too good to be true, promising prizes or free gifts. Look for words like “free” or “win”.
- What other words might signal an email is spam?
- Remember that adding too many words can slow down the model and make it less effective. To keep the model efficient, focus on words that are truly relevant to spam detection.
- (Code Block 2O) This code block assigns a unique number to each token you added to spam_tokens in the previous code block. It pairs each word with a number starting from 0. The result is a dictionary where each word has a unique number, making it easier to use in machine learning.
- (Code Block 2P) This code block defines a function that turns a message into a count vector, showing how often spam-related words appear. It creates an array of 0s the same length as spam_tokens, processes the message into tokens, and checks each token. If a token is in spam_tokens, it finds its index and increases the count. The result is a numeric representation of the message, useful for spam detection.
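The token-to-number dictionary and the count vector can be sketched as follows. The token choices, the names token_to_index and tokens_to_count_vector, and the use of a plain Python list instead of an array are assumptions made for illustration. To stay self-contained, this version also takes an already-cleaned token list rather than a raw message.

```python
# Hypothetical token choices; in the notebook, you fill in this list yourself.
spam_tokens = ["free", "win", "urgent"]

# Pair each token with its position: {"free": 0, "win": 1, "urgent": 2}
token_to_index = {token: i for i, token in enumerate(spam_tokens)}


def tokens_to_count_vector(tokens):
    """Count how often each tracked token appears in a cleaned token list."""
    count_vector = [0] * len(spam_tokens)  # one slot per tracked token
    for token in tokens:
        if token in token_to_index:        # ignore tokens we are not tracking
            count_vector[token_to_index[token]] += 1
    return count_vector


vector = tokens_to_count_vector(["win", "free", "prize", "free", "today"])
print(vector)  # [2, 1, 0]
```

Here "free" appears twice and "win" once, while "prize" and "today" are ignored because they are not in spam_tokens.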
2.5 Split Dataset to Train and Test
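We will split our dataset into a training set and a test set so that the model is evaluated on data it has never seen before. This lets us check whether the model has overfit, meaning it memorized the training emails instead of learning patterns that generalize. To learn more about splitting data into train and test sets, see the video listed in the Bibliography.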
- (Code Block 2Q) We have provided the code to split the dataset into training and testing parts. Run this code block.
- This code block will also print out the sizes of train_df and test_df. If you see the train_df size as (4636, 3), that means there are 4636 emails, each with 3 columns (e.g., MESSAGE, CATEGORY, and FILE_NAME).
- (Code Block 2R) In this code block, you can adjust the index to any number between 0 and 4635 to view the count vector of a specific email in the dataset.
- If the array consists mostly of 1s and 0s, it is likely a non-spam email.
- However, if the array contains higher counts, it is probably a spam email, assuming you have selected tokens that are strongly associated with spam.
- (Code Block 2S) In this next code block, you will see the data for the index you selected in Code Block 2R. Check the value for CATEGORY to determine whether the email is spam or not. Remember that 0 represents non-spam and 1 represents spam.
- (Code Block 2T) This code block will display the email for the index you selected. You can view different emails by changing the index in Code Block 2R and then running Code Blocks 2R to 2T. Can you usually identify which emails are spam or not based on their count vector?
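The idea behind a train/test split can be sketched in a few lines of plain Python. The notebook's own splitting code is not shown here, so the function name train_test_split_rows, the 20% test fraction, and the seed are all assumptions chosen for illustration.

```python
import random

# Hypothetical labeled emails: (message, category) with 0 = non-spam, 1 = spam.
emails = [(f"message {i}", i % 2) for i in range(10)]


def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows, then cut it into train and test parts."""
    shuffled = rows[:]                     # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train_rows, test_rows = train_test_split_rows(emails)
print(len(train_rows), len(test_rows))  # 8 2
```

Shuffling before cutting matters: if the file happened to list all spam emails first, an unshuffled split could leave one category missing from the test set.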
3. Convert Text Data into Numerical Data
This section focuses on transforming the preprocessed email dataset into a format suitable for machine learning.
- (Code Block 3A) This code block defines a function called df_to_X_y, which turns the DataFrame into two parts: X (word counts) and y (labels). It gets y from the ‘CATEGORY’ column, where 0 is non-spam and 1 is spam. For each email in ‘MESSAGE’, it creates a count vector. It then returns X and y for machine learning.
- (Code Block 3B) This code block turns the training and testing DataFrames into feature matrices and target arrays using the df_to_X_y function. X_train and X_test hold the word count data, while y_train and y_test hold the labels. Finally, it prints the shapes of these arrays to check the data.
- Feature matrix (X): A table where each row represents an email and each column shows how many times a specific word appears (count vector).
- Note: You can view the feature matrix by creating a new code block, typing in X_train or X_test, then running that code block.
- Target array (y): A list of labels that indicate whether each email is spam (1) or non-spam (0).
- Note: You can view the target array by creating a new code block, typing in y_train or y_test, and then running that code block.
- (Code Block 3C) This code block scales the training and testing data so all values are between 0 and 1. Scaling helps machine learning models work better when the data has different value ranges.
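Scaling to the 0 to 1 range is often done with min-max scaling, where each value becomes (value - min) / (max - min) for its column. The sketch below assumes this is the kind of scaling used and that the range is learned from the training rows only; the helper names fit_min_max and apply_min_max and the sample data are hypothetical.

```python
def fit_min_max(rows):
    """Record the minimum and maximum of every column in the training rows."""
    columns = list(zip(*rows))
    return [min(col) for col in columns], [max(col) for col in columns]


def apply_min_max(rows, mins, maxs):
    """Rescale each value to (value - min) / (max - min), column by column."""
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0
            for value, lo, hi in zip(row, mins, maxs)
        ])
    return scaled


# Hypothetical count vectors: each row is an email, each column a tracked token.
X_train = [[0, 2], [4, 0], [2, 1]]
X_test = [[1, 4]]

mins, maxs = fit_min_max(X_train)  # learn the range from training data only
X_train_scaled = apply_min_max(X_train, mins, maxs)
X_test_scaled = apply_min_max(X_test, mins, maxs)
print(X_train_scaled)  # [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
print(X_test_scaled)   # [[0.25, 2.0]]
```

Under this assumption, a test email with a higher count than any training email (the 4 in the second column) scales to a value above 1. That is expected when the scaler never sees the test data.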
4. Train and Evaluate the Model
This section introduces two machine learning models for spam email classification: Logistic Regression and Random Forest.
- (Code Block 4A) This code block trains a logistic regression model to predict whether emails are spam or not. It first creates and trains the model using the train dataset. Next, it uses the trained model to make predictions on the test data. Finally, it prints out a classification report that shows performance metrics like precision, recall, F1-score, and support for both spam and non-spam categories, as well as overall accuracy.
- (Code Block 4B) This code block trains a random forest classifier to predict whether emails are spam or not. It first creates and trains the model using the train dataset. Next, it uses the trained model to make predictions on the test data. Finally, it prints out a classification report that shows performance metrics like precision, recall, F1-score, and support for both spam and non-spam categories.
- Compare the classification reports from both models. How well did each model perform? If needed, revisit Code Block 2N and adjust the selected tokens to improve performance by choosing more relevant keywords.
- Tip: To rerun all the code blocks in the notebook, go to the top menu and select Runtime -> Run all.
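To make the numbers in a classification report less mysterious, the sketch below computes precision, recall, F1-score, support, and accuracy by hand. The labels and predictions are hypothetical, and class_metrics is a helper written for this illustration, not a function from the notebook.

```python
def class_metrics(y_true, y_pred, label):
    """Precision, recall, F1-score, and support for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # correctly flagged
    fp = sum(1 for t, p in pairs if t != label and p == label)  # wrongly flagged
    fn = sum(1 for t, p in pairs if t == label and p != label)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # how many emails truly belong to this class
    return precision, recall, f1, support


# Hypothetical test labels and model predictions (1 = spam, 0 = non-spam).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
spam_metrics = class_metrics(y_true, y_pred, label=1)
print(accuracy)      # 0.8
print(spam_metrics)  # (0.75, 0.75, 0.75, 4)
```

For the spam class here, precision of 0.75 means 3 of the 4 emails flagged as spam really were spam, and recall of 0.75 means 3 of the 4 actual spam emails were caught.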
Variations
- Either make your own dataset or find one, and instead of binary classification (spam vs. non-spam), categorize emails into multiple classes such as spam, promotions, social, work, and personal.
- Instead of manual token selection, use techniques such as TF-IDF vectorization, n-grams (bigrams, trigrams), or Word2Vec or FastText embeddings.