
Classify Spam Emails with Natural Language Processing

Abstract

Have you ever wondered how your phone or email automatically detects and filters junk messages into spam? This feature helps ensure you focus on the messages that truly matter. In this project, you’ll build your own spam detector using natural language processing (NLP) techniques.

Summary

Time Required: Short (2-5 days)
Prerequisites: None
Material Availability: Readily available
Cost: Very Low (under $20)
Safety: No issues

Credits
Science Buddies is committed to creating content authored by scientists and educators. Learn more about our process and how we use AI.

Objective

Identify the most common words in spam emails and determine which are the most effective for detecting spam.

Introduction

Language is a powerful tool we use to communicate, but not all messages are welcome. Have you ever wondered how your email or phone automatically detects and filters spam messages? Some junk emails are obvious, with phrases like “Congratulations! You’ve won a prize!” while others are more subtle, attempting to mimic real messages to trick you into clicking on harmful links.

Cybersecurity experts and developers work to identify and block spam, but with millions of messages sent every day, manually sorting them is not practical. How can we efficiently detect spam on such a large scale?

This is where Natural Language Processing (NLP) comes in. Natural Language Processing is a branch of Artificial Intelligence (AI) that teaches computers to understand human language. Using NLP, computers can analyze large amounts of text and recognize patterns that distinguish spam from legitimate messages, helping to keep inboxes clean and secure.

In this project, you will identify the most effective tokens (individual units of text) for detecting spam. Then, you will analyze your findings and refine your spam detection strategy.

Terms and Concepts

Questions

Bibliography

The dataset we will be using can be found here:

Original code based on this project:

To learn more about Natural Language Processing (NLP):

To learn more about why we split a dataset into train and test:

To learn more about reading a classification report:

Materials and Equipment

Experimental Procedure

This project follows the Engineering Design Process. Confirm with your teacher if this is acceptable for your project, and review the steps before you begin.

Overview

This project guides you through building a machine learning model to classify emails as spam or non-spam using natural language processing (NLP). Your main task is to define the most relevant tokens (specific words or phrases) that will help the model accurately identify spam messages. The model will only count and analyze the tokens you select for this purpose.

Setting Up the Project Environment

  1. You will need a Google account. If you do not have one, make one when prompted.
  2. Navigate to Google Drive, click on “My Drive,” then create a new folder and rename it “spam_filter.”
  3. Download the spam_classification.ipynb file from Science Buddies. This is the code you will need to process your data.
  4. Download the spam_data.csv file from Science Buddies. This is the file containing the spam and non-spam emails.
  5. Inside your “spam_filter” folder on Google Drive, upload both the spam_classification.ipynb file and the spam_data.csv file.
  6. Double-click on the spam_classification.ipynb file. This should automatically open in Google Colab.
    1. Read the Troubleshooting Tips and How to Use This Notebook sections. Follow the instructions you find in those sections.
    2. Run all the blocks under Importing Libraries to ensure you have access to all the functions we will use for this project.

1. Loading the Data into a Pandas DataFrame

  1. (Code Block 1A) Run this code block to make the files on your Google Drive available to use in the notebook.
  2. (Code Block 1B) Run this code block to create a DataFrame, which is like a table that will be used to load and manipulate the data in the notebook. You will see the data from your .csv file populate in a table below the code block.

2. Preprocessing

2.1 Tokenize a Test Message

Before preprocessing the entire dataset, we will explore fundamental NLP techniques, such as tokenization, lemmatization, and stopword removal.

  1. (Code Block 2A) This code block cleans a passage of text by removing punctuation and special characters, keeping only words. It uses NLTK’s (Natural Language ToolKit) RegexpTokenizer with \w+, which picks out letters, numbers, and underscores. When applied, it turns the text into a list of words, ignoring symbols. This process is called tokenization, which means breaking text into smaller parts (tokens) like words or phrases to make it easier to analyze.
  2. (Code Block 2B) This code block converts all words to lowercase to keep them uniform. This helps treat words like “Hello” and “hello” as the same, making text processing easier. The final list of lowercase words is then displayed.
  3. (Code Block 2C) This code block reduces words to their base form using lemmatization, a process that converts words to their dictionary form based on meaning and context (e.g., “running” -> “run”, “better” -> “well”). This keeps different forms of the same word together and improves text consistency for NLP. The lemmatized words are then displayed. To learn more about lemmatization, click on this link.
  4. (Code Block 2D) This code block loads a list of common English stopwords (e.g., “the”, “and”, “is”). Stopwords are frequent words that usually do not add much meaning to a sentence and are often removed in text processing to improve efficiency. The code retrieves these stopwords from the English language list and then displays them.
  5. (Code Block 2E) This code block creates a list of HTML-related words (like “html”, “div”, “table”, “href”) that may appear in text but are not useful for analysis. It then adds these words to the existing list of stopwords.
    1. HTML tags can appear in non-spam emails because many emails are formatted using HTML to include features like bold text, images, links, tables, and colors. Most modern email clients (like Gmail or Outlook) support HTML formatting to improve readability and design.
    2. These tags are normal and not necessarily spam-related, but they can still clutter text processing tasks, so we will remove them.
  6. (Code Block 2F) This code block removes unimportant words (like common stopwords and HTML) from the lemmatized tokens, making the text cleaner and more useful for text analysis.
  7. (Code Block 2G) Now that we covered the different text-cleaning steps, we will define a function that will perform all of the text-cleaning steps automatically. This function message_to_token_list(s) takes a string s and applies the following steps:
    1. Tokenization – Splits the text into words.
    2. Lowercasing – Converts all words to lowercase for uniformity.
    3. Lemmatization – Converts words to their base form (e.g., “running” -> “run”).
    4. Stopword Removal – Removes common words (e.g., “the”, “is”) and HTML-related terms.
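
The four steps above can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library, not the notebook's actual code: the notebook relies on NLTK, while here the stopword set and the lemma lookup table (`STOPWORDS` and `LEMMAS`, both hypothetical and deliberately tiny) are stand-ins for NLTK's full stopword list and lemmatizer.

```python
import re

# Tiny stand-ins for illustration only; the notebook uses NLTK's full resources.
STOPWORDS = {"the", "and", "is", "a", "to", "of", "you", "html", "div", "table", "href"}
LEMMAS = {"running": "run", "prizes": "prize", "winners": "winner", "links": "link"}


def message_to_token_list(s):
    """Tokenize, lowercase, lemmatize, and remove stopwords from one message."""
    tokens = re.findall(r"\w+", s)                    # 1. tokenization with the \w+ pattern
    lowered = [t.lower() for t in tokens]             # 2. lowercasing
    lemmatized = [LEMMAS.get(t, t) for t in lowered]  # 3. lookup-table "lemmatizer" (assumption)
    return [t for t in lemmatized if t not in STOPWORDS]  # 4. stopword and HTML-term removal


print(message_to_token_list("Congratulations! You are running out of time to claim the FREE prizes."))
```

Because the lookup table only knows a handful of words, most tokens pass through unchanged; a real lemmatizer covers far more vocabulary.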

2.2 Find the Most Common Tokens in Non-Spam Emails

  1. (Code Block 2H) This code block counts the unique words in non-spam emails. It goes through each message, breaks it into words, and tracks how often each word appears. In the end, it gives the total number of unique words in non-spam emails. Don't worry if this code block takes a while to run; there are a lot of emails!
  2. (Code Block 2I) This code block sorts the tokens by word count, from highest to lowest. It then displays the sorted list of words and their counts.
    1. What are the most common words in non-spam emails?
  3. (Code Block 2J) This code block creates a bar chart to show the top 50 most frequent words in non-spam emails. It takes the most common words and their counts to make a chart with the words on the x-axis and their frequencies on the y-axis.
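
As a rough sketch of the counting and sorting described above, Python's `collections.Counter` can track how often each token appears and return the tokens from most to least frequent. The helper names (`tokenize`, `count_tokens`) and the two sample messages are made up for illustration, and the tokenizer is simplified compared with the notebook's full cleaning function.

```python
import re
from collections import Counter


def tokenize(message):
    """Simplified tokenizer: lowercase the text and keep runs of word characters."""
    return re.findall(r"\w+", message.lower())


def count_tokens(messages):
    """Count how often each token appears across all messages."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts


# Made-up sample messages, not taken from the dataset.
ham_messages = [
    "Meeting moved to Friday, see the agenda attached",
    "Can you review the agenda before the meeting?",
]
ham_counts = count_tokens(ham_messages)

print(len(ham_counts))            # number of unique tokens
print(ham_counts.most_common(3))  # tokens sorted from highest to lowest count
```

The same function can be run on the spam messages to produce the second frequency list.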

2.3 Find the Most Common Tokens in Spam Emails

  1. (Code Block 2K) This code block counts the unique words in spam emails. It goes through each message, breaks it into words, and tracks how often each word appears. In the end, it gives the total number of unique words in spam emails. Don't worry if this code block takes a while to run; there are a lot of emails!
  2. (Code Block 2L) This code block sorts the tokens by word count, from highest to lowest. It then displays the sorted list of words and their counts.
    1. What are the most common words in spam emails?
  3. (Code Block 2M) This code block creates a bar chart to show the top 50 most frequent words in spam emails. It takes the most common words and their counts to make a chart with the words on the x-axis and their frequencies on the y-axis.
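
Once both frequency lists exist, one possible way to compare them is to rank tokens by how much more often they occur in spam than in non-spam. The sketch below is not part of the notebook. The function name `distinctive_tokens`, the add-one smoothing, and the counts are all assumptions chosen for illustration; the numbers are invented and do not come from the dataset.

```python
from collections import Counter


def distinctive_tokens(spam_counts, ham_counts, top_n=5, smoothing=1.0):
    """Rank spam tokens by the ratio of their spam rate to their non-spam rate."""
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    vocab_size = len(set(spam_counts) | set(ham_counts))
    scores = {}
    for token in spam_counts:
        # Smoothing keeps the ratio finite when a token never appears in non-spam.
        spam_rate = (spam_counts[token] + smoothing) / (spam_total + smoothing * vocab_size)
        ham_rate = (ham_counts[token] + smoothing) / (ham_total + smoothing * vocab_size)
        scores[token] = spam_rate / ham_rate
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Invented counts for illustration only.
spam_counts = Counter({"free": 40, "click": 30, "email": 25, "com": 20})
ham_counts = Counter({"email": 30, "com": 22, "meeting": 18, "free": 2})

print(distinctive_tokens(spam_counts, ham_counts, top_n=2))
```

In this toy example, "email" and "com" are frequent in both lists, so they rank low even though their raw spam counts are high.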

2.4 Decide Which Tokens to Count

You will create a list of spam-related words (spam_tokens), focusing on terms like “free” or “urgent” to help the algorithm detect spam. Limiting the number of tokens helps keep the model efficient by avoiding unnecessary complexity, as tracking too many words can slow down the model and distract it from focusing on the tokens that truly matter.

  1. (Code Block 2N) In this code block, you will choose the tokens for the algorithm to track. Under the #TODO comment, there is an empty list called spam_tokens. Enter the tokens you want to use, ensuring each token is surrounded by single or double quotes and separated by commas (e.g., spam_tokens = ['http', 'com', 'email']). You can add as many or as few tokens as you would like.
    1. Review the most common words from both non-spam and spam emails, and check for any overlap. Words that appear frequently in both may not be useful for distinguishing spam.
    2. Spam emails often create urgency, pushing users to click on scam links. Do you notice words like “hurry” or “urgent”?
    3. Some spam emails also sound too good to be true, promising prizes or free gifts. Look for words like “free” or “win”.
    4. What other words might signal an email is spam?
    5. Remember that adding too many words can slow down the model and make it less effective. To keep the model efficient, focus on words that are truly relevant to spam detection.
  2. (Code Block 2O) This code block assigns a unique number to each token you added to spam_tokens in the previous code block. It pairs each word with a number starting from 0. The result is a dictionary where each word has a unique number, making it easier to use in machine learning.
  3. (Code Block 2P) This code block defines a function that turns a message into a count vector, showing how often spam-related words appear. It creates an array of 0s the same length as spam_tokens, processes the message into tokens, and checks each token. If a token is in spam_tokens, it finds its index and increases the count. The result is a numeric representation of the message, useful for spam detection.
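
The token-to-number dictionary and the count vector described above might look roughly like the following. This is a simplified sketch, not the notebook's code: the three tokens are example choices rather than a recommended list, a plain Python list stands in for the array, and a basic regular expression replaces the full text-cleaning function.

```python
import re

spam_tokens = ["free", "win", "urgent"]  # example choices only

# Pair each token with a position, starting from 0.
token_to_index = {token: i for i, token in enumerate(spam_tokens)}


def message_to_count_vector(message):
    """Return how many times each tracked token appears in the message."""
    count_vector = [0] * len(spam_tokens)
    for token in re.findall(r"\w+", message.lower()):  # simplified preprocessing
        if token in token_to_index:
            count_vector[token_to_index[token]] += 1
    return count_vector


print(token_to_index)
print(message_to_count_vector("URGENT: win a free phone, free shipping!"))
```

Each position in the output corresponds to one entry in `spam_tokens`, so the order of the list determines the meaning of each number.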

2.5 Split Dataset to Train and Test

We will split our dataset into train and test sets so that the model is evaluated on data it has never seen before, which lets us detect overfitting. Click on this link to learn more about splitting our data into train and test.

  1. (Code Block 2Q) We have provided the code to split the dataset into training and testing parts. Run this code block.
    1. This code block will also print out the sizes of train_df and test_df. If the train_df size is shown as (4636, 3), that means there are 4636 emails, each with 3 columns (e.g., MESSAGE, CATEGORY, and FILE_NAME).
  2. (Code Block 2R) In this code block, you can adjust the index to any number between 0 and 4635 to view the count vector of a specific email in the dataset.
    1. If the array consists mostly of 1s and 0s, it is likely a non-spam email.
    2. However, if the array contains higher counts, it is probably a spam email, assuming you have selected tokens that are strongly associated with spam.
  3. (Code Block 2S) In this next code block, you will see the data for the index you selected in Code Block 2R. Check the value for CATEGORY to determine whether the email is spam or not. Remember that 0 represents non-spam and 1 represents spam.
  4. (Code Block 2T) This code block will display the email for the index you selected. You can view different emails by changing the index in Code Block 2R and then running Code Blocks 2R to 2T. Can you usually identify which emails are spam or not based on their count vector?
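
To make the idea of a train/test split concrete, here is a minimal standard-library sketch. It is not the notebook's code: the function name `train_test_split_rows`, the 20% test fraction, and the seed are assumptions for illustration, and the rows are placeholder (message, label) pairs.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows, then hold out a fraction of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # a fixed seed makes the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder rows: (message, category) with 0 = non-spam and 1 = spam.
rows = [(f"message {i}", i % 2) for i in range(10)]
train_rows, test_rows = train_test_split_rows(rows)

print(len(train_rows), len(test_rows))
```

Every row lands in exactly one of the two parts, so the model never trains on an email that is later used to test it.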

3. Convert Text Data into Numerical Data

This section focuses on transforming the preprocessed email dataset into a format suitable for machine learning.

  1. (Code Block 3A) This code block defines a function called df_to_X_y which turns the DataFrame into two parts: X (word counts) and y (labels). It gets y from the ‘CATEGORY’ column, where 0 is non-spam and 1 is spam. For each email in ‘MESSAGE’, it creates a count vector. It then returns X and y for machine learning.
  2. (Code Block 3B) This code block turns the training and testing DataFrames into feature matrices and target arrays using the df_to_X_y function.
    1. Feature matrix (X): A table where each row represents an email and each column shows how many times a specific word appears (count vector)
      1. Note: You can view the feature matrix by creating a new code block, typing in X_train or X_test, then running that code block.
    2. Target array (y): A list of labels that indicate whether each email is spam (1) or non-spam (0).
      1. Note: You can view the target array by creating a new code block, typing in y_train or y_test, and then running that code block.
    3. X_train and X_test hold the word count data, while y_train and y_test hold the labels. Finally, it prints the shapes of these arrays to check the data.
  3. (Code Block 3C) This code block scales the training and testing data so all values are between 0 and 1. Scaling helps machine learning models work better when the data has different value ranges.
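
The scaling step can be illustrated with a small hand-written min-max scaler. This is a sketch under assumptions, not the notebook's implementation: the function names are hypothetical, and it assumes the common practice of learning each column's minimum and maximum from the training data only and then reusing them for the test data.

```python
def fit_min_max(X_train):
    """Learn each column's minimum and maximum from the training data."""
    columns = list(zip(*X_train))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(X, mins, maxs):
    """Rescale each value to (value - min) / (max - min), column by column."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, mins, maxs)]
        for row in X
    ]


X_train = [[0, 2], [4, 0], [2, 1]]  # toy count vectors
X_test = [[1, 4]]

mins, maxs = fit_min_max(X_train)
print(apply_min_max(X_train, mins, maxs))
print(apply_min_max(X_test, mins, maxs))
```

Under this assumption, training values always land between 0 and 1, while a test value larger than anything seen in training (the 4 in the second column here) scales to a number above 1.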

4. Train and Evaluate the Model

This section introduces two machine learning models for spam email classification: Logistic Regression and Random Forest.

  1. (Code Block 4A) This code block trains a logistic regression model to predict whether emails are spam or not. It first creates and trains the model using the train dataset. Next, it uses the trained model to make predictions on the test data. Finally, it prints out a classification report that shows performance metrics like precision, recall, F1-score, and support for both spam and non-spam categories, as well as overall accuracy.
    1. Tip: Check out this link to learn more about these metrics and how to read a classification report.
  2. (Code Block 4B) This code block trains a random forest classifier to predict whether emails are spam or not. It first creates and trains the model using the train dataset. Next, it uses the trained model to make predictions on the test data. Finally, it prints out a classification report that shows performance metrics like precision, recall, F1-score, and support for both spam and non-spam categories.
  3. Compare the classification reports from both models. How well did each model perform? If needed, revisit Code Block 2N and adjust the selected tokens to improve performance by choosing more relevant keywords.
    1. Tip: To rerun all the code blocks in the notebook, go to the top menu and select Runtime -> Run all.
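
For readers who want to see the training and evaluation steps outside the notebook, the sketch below fits both model types with scikit-learn and prints a classification report for each. It is an illustration under assumptions rather than the notebook's code: the data is synthetic (randomly generated count vectors in which spam rows simply have higher counts), and the model settings shown are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)

# Synthetic count vectors for 3 tracked tokens; spam rows get higher counts.
X_ham = rng.poisson(0.2, size=(200, 3))
X_spam = rng.poisson(2.0, size=(200, 3))
X = np.vstack([X_ham, X_spam]).astype(float)
y = np.array([0] * 200 + [1] * 200)  # 0 = non-spam, 1 = spam

# Shuffle, then keep the last 80 rows for testing.
order = rng.permutation(len(y))
X, y = X[order], y[order]
X_train, X_test = X[:320], X[320:]
y_train, y_test = y[:320], y[320:]

models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)     # train on the training set
    y_pred = model.predict(X_test)  # predict on unseen data
    print(name)
    print(classification_report(y_test, y_pred, target_names=["non-spam", "spam"]))
```

Because the synthetic classes are easy to separate, both models score well here; results on real emails depend heavily on which tokens were selected.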

Ask an Expert

Do you have specific questions about your science project? Our team of volunteer scientists can help. Our Experts won't do the work for you, but they will make suggestions, offer guidance, and help you troubleshoot.

Variations

  • Either make your own dataset or find one, and instead of binary classification (spam vs. non-spam), categorize emails into multiple classes such as spam, promotions, social, work, and personal. 
  • Instead of manual token selection, use techniques such as TF-IDF vectorization, n-grams (bigrams, trigrams), or Word2Vec or FastText embeddings.
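
As a starting point for the second variation, here is a small sketch of bigrams and a basic TF-IDF weighting written with the standard library. It is an illustration only: the function names are hypothetical, the two documents are made up, and the formula used (term frequency times log of total documents over documents containing the token) is one of several TF-IDF variants. Libraries such as scikit-learn use a smoothed version that gives different numbers.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return each pair of neighboring tokens."""
    return list(zip(tokens, tokens[1:]))


def tf_idf(documents):
    """documents: list of token lists. Returns one {token: weight} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter(token for doc in documents for token in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return weights


# Made-up token lists for illustration.
docs = [["free", "prize", "click"], ["meeting", "agenda", "click"]]

print(bigrams(docs[0]))
print(tf_idf(docs)[0])
```

A token that appears in every document, like "click" here, receives a weight of zero, which is how TF-IDF downplays words that do not help tell documents apart.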

Careers

If you like this project, you might enjoy exploring these related careers:

Career Profile
Many aspects of people's daily lives can be summarized using data, from what is the most popular new video game to where people like to go for a summer vacation. Data scientists (sometimes called data analysts) are experts at organizing and analyzing large sets of data (often called "big data"). By doing this, data scientists make conclusions that help other people or companies. For example, data scientists could help a video game company make a more profitable video game based on players'…

Cite This Page

General citation information is provided here. Be sure to check the formatting, including capitalization, for the method you are using and update your citation, as needed.

MLA Style

Ngo, Tracey. "Classify Spam Emails with Natural Language Processing." Science Buddies, 7 July 2025, https://www.sciencebuddies.org/science-fair-projects/project-ideas/ArtificialIntelligence_p025/artificial-intelligence/spam_classification. Accessed 14 June 2026.

APA Style

Ngo, T. (2025, July 7). Classify Spam Emails with Natural Language Processing. Retrieved from https://www.sciencebuddies.org/science-fair-projects/project-ideas/ArtificialIntelligence_p025/artificial-intelligence/spam_classification


Last edit date: 2025-07-07