
Predict Air Quality with Machine Learning


Abstract

Air pollution is a growing issue, especially in cities, due to the rise of industrial activities, increased fossil fuel consumption from things like car use, and natural events like wildfires. This pollution is measured by the Air Quality Index (AQI), which shows how clean or polluted the air is. But what if we could predict changes in AQI to help communities plan ahead to protect people who are at risk from air pollution? That’s where machine learning comes in. In this project, you will gather air quality data for a location of your choice and use a type of machine learning model called a Long Short-Term Memory (LSTM) model to forecast future AQI levels.

Summary

Areas of Science
Difficulty
Method
Time Required: Short (2-5 days)
Prerequisites: None
Material Availability: Readily available
Cost: Very Low (under $20)
Safety: No issues

Credits
Science Buddies is committed to creating content authored by scientists and educators. Learn more about our process and how we use AI.

Objective

Collect air quality data from the EPA website for a city or county of your choice and explore how well an LSTM machine learning model can predict AQI one week, four weeks, and one year into the future.

Introduction

Air quality is an increasingly pressing global health concern. According to the World Health Organization (WHO), 99% of the world's population breathes air that exceeds WHO safety limits for pollutants. The main harmful pollutants are ozone (O3), which can trigger asthma and other respiratory issues; particulate matter (PM2.5 and PM10), which consists of tiny particles that penetrate deep into the lungs and can worsen heart and lung diseases; carbon monoxide (CO), a gas that interferes with the body’s ability to transport oxygen; sulfur dioxide (SO2), which can cause respiratory problems and contribute to acid rain; and nitrogen dioxide (NO2), which can inflame airways and reduce lung function. Each of these pollutants contributes differently to air quality, and they are all factored into the Air Quality Index (AQI).

Figure 1. An image comparing the size of PM2.5 and PM10 to human hair and sand. Image Credit: EPA / Public Domain

Air quality is commonly measured using the Air Quality Index (AQI), which categorizes air conditions into six levels ranging from “good” to “hazardous.” These categories are essential for understanding the impact of various air pollutants and for issuing timely warnings when air quality deteriorates. Predicting AQI is crucial because it can help protect vulnerable populations, such as those with existing health conditions, by encouraging precautionary actions like limiting outdoor exposure or using air filtration systems. Early detection of declining air quality, whether from industrial pollution, deforestation, or wildfires, also allows authorities to intervene and reduce potential harm. Machine learning offers one way to make these predictions.

Figure 2. The Air Quality Index includes AQI categories and colors, corresponding index values and cautionary statements for different levels of health concern. Image Credit: EPA / Public Domain
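To make the six categories concrete, here is a minimal sketch that maps an AQI value to its EPA category name. The function name `aqi_category` and the lookup table are our own illustration and are not part of the project notebook.

```python
# Upper bound of each EPA AQI category, in increasing order.
AQI_CATEGORIES = [
    (50, "Good"),
    (100, "Moderate"),
    (150, "Unhealthy for Sensitive Groups"),
    (200, "Unhealthy"),
    (300, "Very Unhealthy"),
]


def aqi_category(aqi: float) -> str:
    """Return the EPA category name for an AQI value (illustrative helper)."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    for upper_bound, name in AQI_CATEGORIES:
        if aqi <= upper_bound:
            return name
    return "Hazardous"  # 301 and above


for value in (42, 135, 320):
    print(value, aqi_category(value))
```

A helper like this can be handy later for turning the model's numeric forecasts into the same color-coded warnings people see in public air quality reports.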

Artificial Intelligence (AI) is a branch of computer science focused on the creation of tools that can solve problems and analyze information. Machine learning is a subdivision of AI. Its goal is to create tools that can learn and improve over time using data. In this project, we will dive into models like Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, which offer a powerful way to predict air quality.

RNNs are a type of neural network specifically designed to handle sequential data, making them useful for tasks where the order of the data matters, such as time-series forecasting. What makes RNNs unique is that they have loops that allow information to persist, meaning they can retain context from previous inputs when processing new data. However, RNNs can struggle with long-term dependencies because they tend to “forget” older information as new data is processed. 

This is where LSTMs come in. LSTMs are a specialized type of RNN designed to overcome this limitation. They can selectively remember or forget information by using mechanisms called “gates” that control the flow of data. This ability makes LSTMs particularly well-suited for time-series data, where remembering information over long sequences is critical–such as predicting future air quality. 
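To see what the “gates” do, here is a minimal NumPy sketch of a single LSTM time step. It follows the standard LSTM equations, but the function name `lstm_step`, the weight layout, and the random numbers are assumptions made for illustration. In the project itself, a library handles all of this for you.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """Run one LSTM time step.

    x is the input for this step, h_prev and c_prev are the previous hidden
    and cell states. W, U, and b hold the weights for the four parts of the
    cell, stacked in the order [forget, input, candidate, output].
    """
    n_hid = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    f = sigmoid(z[0:n_hid])                # forget gate: how much old memory to keep
    i = sigmoid(z[n_hid:2 * n_hid])        # input gate: how much new information to add
    g = np.tanh(z[2 * n_hid:3 * n_hid])    # candidate values to add to memory
    o = sigmoid(z[3 * n_hid:4 * n_hid])    # output gate: how much memory to reveal
    c = f * c_prev + i * g                 # updated cell state (long-term memory)
    h = o * np.tanh(c)                     # updated hidden state (output of this step)
    return h, c


rng = np.random.default_rng(0)
n_in, n_hid = 5, 3                         # 5 input features, 3 memory units
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in))
U = rng.normal(scale=0.1, size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)

h = np.zeros(n_hid)
c = np.zeros(n_hid)
sequence = rng.normal(size=(7, n_in))      # 7 made-up days of 5 features each
for x in sequence:
    h, c = lstm_step(x, h, c, W, U, b)     # memory is carried from day to day

print(h.shape, c.shape)
```

The cell state `c` is what lets the network hold on to information across many steps: the forget gate decides what to erase and the input gate decides what to write.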


In this project, you will collect air quality data from a location of your choice and train three LSTM models to predict AQI one week, four weeks, and one year into the future, helping to anticipate and mitigate the impact of air pollution on health and the environment. 

Terms and Concepts

Questions

Bibliography

More on air pollution: 

We will download our data from the EPA website:

Why we normalize data:

Why we split data into train, validation, and test:

Learn more about RNNs: 

Learn more about LSTMs:

Materials and Equipment

Experimental Procedure

This project follows the Scientific Method. Review the steps before you begin.

Overview

In this project, you will gather air quality data from the EPA website for a location of your choice, preprocess the dataset, and train an LSTM model to predict future AQI values over three different time frames: one week, four weeks, and one year. By comparing the model’s performance across these intervals, you will evaluate whether it is more accurate for short-term or long-term predictions.

1. Gathering Our Data

  1. Navigate to the Air Quality Index Daily Values Report resource by the EPA. For the ‘Pollutant’ dropdown, select All AQI Pollutants. 

  2. Select a year for the data. Eventually, you will need to download data for five consecutive years (you will return to step 2 later for a different year). 

    1. Tip: Earlier years may have missing data, so it is best to choose more recent years. 

  3. Choose a city or county. You can pick your own, but keep in mind that not all locations have complete data.

  4. After selecting a location, click Generate Report. The data will be displayed in a table, with missing values represented as dots. If there are numerous missing values (e.g., more than 10 scattered across various columns or entire columns with no data), please select a different location that may have more complete data.

  5. Click Download CSV (spreadsheet) to save the data to your computer. Change the year and download again to repeat this for each of the five years (Steps 2-5). 

  6. Once you have the data for all five years, you can combine it using either Google Sheets or Microsoft Excel. First, we will add the data into one spreadsheet file. See the video at the beginning of this project for a walk-through of this process.

    1. Google Sheets

      1. Navigate to Google Sheets and create a new blank spreadsheet.

      2. For each CSV file, go to File > Import, choose Upload, and select the CSV file. Under Import location, choose Insert new sheet(s). You will need to import each year’s data into a new sheet. 

    2. Microsoft Excel: 

      1. Navigate to the Downloads folder on your computer. 

      2. Right-click on one of the downloaded CSV files and open it in Excel. Then, go to File > Save as and save the file as an .xlsx file.

  7. Next, you will combine all of the spreadsheets into one ‘Master’ sheet. 

    1. Google Sheets: 

      1. Add a new sheet by clicking the ‘+’ at the bottom left. 

      2. Rename it ‘Master’ by right-clicking on the tab and selecting Rename.

      3. Copy the first row (column names) from any of the other sheets and paste it into cell A1 of the ‘Master’ sheet. 

      4. For each of the sheets with a year’s data, delete the first row by right-clicking on row 1 and selecting ‘Delete row.’ Then, copy all of the data by pressing Ctrl + A and then Ctrl + C (Cmd + A and Cmd + C on a Mac), and paste the data into the first empty row of the Master sheet by pressing Ctrl + V (Cmd + V on a Mac). (Yes, you will have to do a bit of scrolling.)

      5. If needed, sort the data by right-clicking the first column and selecting Sort Sheet A-Z.

    2. Microsoft Excel: 

      1. Select Data > Get Data > From File > From Text/CSV. One at a time, choose the other four CSV files.

      2. Add a new sheet called “Master” by clicking on the ‘+’ icon on the bottom. Rename the sheet by right-clicking on the name and selecting ‘Rename.’ 

      3. Copy the first row (column names) from any of the other sheets and paste it into cell A1 of the ‘Master’ sheet. Then, for each of the sheets with a year’s data, delete the first row by right-clicking on row 1 and selecting ‘Delete row.’ Copy all of the data by pressing Ctrl + A and then Ctrl + C (Cmd + A and Cmd + C on a Mac), and paste the data into the first empty row of the Master sheet by pressing Ctrl + V (Cmd + V on a Mac). (Yes, you will have to do a bit of scrolling.)

  8. When you are finished creating the ‘Master’ spreadsheet, download the ‘Master’ sheet as a .csv file. 

    1. Google Sheets: 

      1. You can download the data by selecting File > Download > Comma Separated Values (.csv). Rename the file to aqidaily_fiveyears.csv.

    2. Microsoft Excel: 

      1. You can download the data by selecting File > Save as and choosing the CSV (Comma delimited) (*.csv) option. Rename the file to aqidaily_fiveyears.csv.
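If you are comfortable with Python, the same merge can be sketched with pandas instead of a spreadsheet. This is only an optional alternative: the function name `combine_yearly_csvs`, the yearly file names, and the tiny stand-in files below are assumptions made so the example runs on its own. It also assumes each CSV has a ‘Date’ column, as the EPA reports do.

```python
import tempfile
from pathlib import Path

import pandas as pd


def combine_yearly_csvs(csv_paths, output_path):
    """Stack several yearly AQI CSV files into one file, sorted by date."""
    frames = [pd.read_csv(path) for path in csv_paths]
    combined = pd.concat(frames, ignore_index=True)
    # Sort chronologically without changing how the dates are written.
    combined = combined.sort_values("Date", key=pd.to_datetime, kind="stable")
    combined = combined.reset_index(drop=True)
    combined.to_csv(output_path, index=False)
    return combined


# Demo with two tiny made-up files standing in for the yearly downloads.
with tempfile.TemporaryDirectory() as folder:
    folder = Path(folder)
    (folder / "aqidaily2021.csv").write_text(
        "Date,Overall AQI Value\n01/01/2021,40\n01/02/2021,55\n"
    )
    (folder / "aqidaily2020.csv").write_text(
        "Date,Overall AQI Value\n12/30/2020,61\n12/31/2020,48\n"
    )
    combined = combine_yearly_csvs(
        [folder / "aqidaily2021.csv", folder / "aqidaily2020.csv"],
        folder / "aqidaily_fiveyears.csv",
    )

print(combined)
```

With your real downloads, you would pass the paths of your five yearly files and save the result as aqidaily_fiveyears.csv.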

2. Setting Up the Project Environment

  1. You will need a Google account. If you do not have one, make one when prompted. 

  2. Download the air_quality.ipynb file from Science Buddies. This is the code you will need to process your data. 

  3. Within your Google Drive, click on ‘My Drive,’ then create a new folder and rename it “Air Quality Prediction.” Inside the folder, upload both the air_quality.ipynb file and the aqidaily_fiveyears.csv file you created earlier.

  4. Double-click on the air_quality.ipynb file. This should automatically open in Google Colab. 

    1. Read the Troubleshooting Tips and How to Use This Notebook sections. Follow the instructions you find in those sections.

    2. Run the blocks under Importing Libraries to ensure you have access to all the functions we will use for this project and the aqidaily_fiveyears.csv data you uploaded to your Drive. We are importing quite a few libraries for this project, so do not worry if this takes a bit to run. 

3. Loading the Data into a Pandas DataFrame

  1. (Code Block 3A) Run this code block to create a DataFrame, which is like a table that will be used to load and manipulate the data in the notebook. You will see the data from your .csv file populate in a table below the code block.
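For a sense of what loading a CSV into a DataFrame looks like, here is a minimal sketch. It is not the notebook's code. The small in-memory CSV stands in for aqidaily_fiveyears.csv, the column names are assumed to match the EPA download, and treating a lone ‘.’ as a missing value is an assumption based on how the EPA table displays gaps.

```python
import io

import pandas as pd

# A few made-up rows standing in for the real file.
sample_csv = io.StringIO(
    "Date,Overall AQI Value,Main Pollutant,CO,Ozone,PM10,PM25,NO2\n"
    "01/01/2020,53,PM2.5,5,31,.,53,20\n"
    "01/02/2020,44,Ozone,3,44,12,30,.\n"
    "01/03/2020,61,PM2.5,.,35,15,61,25\n"
)

# na_values tells pandas to read a lone "." as a missing value (NaN).
df = pd.read_csv(sample_csv, na_values=["."])

print(df.head())        # first rows of the table
print(df.shape)         # (number of rows, number of columns)
print(df.isna().sum())  # how many values are missing in each column
```

Checking the shape and the count of missing values right after loading is a quick way to confirm that the file you uploaded is the one you expected.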

4. Preprocessing

  1. (Code Block 4A) Run this block to convert the ‘Date’ column from text to a datetime format. This ensures the machine learning model recognizes the data as dates, allowing it to correctly process time-based patterns instead of treating the values as plain text. 

  2. (Code Block 4B) Run this block to display the Air Quality Index (AQI) data for the five years you selected. This will give you a complete view of the dataset you have downloaded. 

  3. (Code Block 4C) Next, we will drop features that we think will be uninformative for modeling. In this case, we will be dropping ‘Site Name (of Overall AQI)’, ‘Site ID (of Overall AQI)’, ‘Source (of Overall AQI)’, and ‘Main Pollutant.’ We only want to use the pollutant measurements to predict the AQI, so information about the monitoring site and the text label for the main pollutant is not needed.

  4. (Code Block 4D) In this code block, we are filling in the NULL, or blank, values (if any) with forward fill. Forward fill is commonly used in time series data (like predicting future AQI) by filling in NULL values with the last known observation. This technique makes sense in many time series contexts because often, the most recent observation is the best estimate of the next value until a new observation is recorded. Run this code block. 

  5. (Code Block 4E) LSTMs learn more easily when all of the input variables are on a similar scale, so we normalize variables whose values span different ranges. Click on this link to learn more about why we normalize our data.

    1. For instance, if you are trying to compare the ages of people and their salaries, the LSTM might focus too much on the salary (because the numbers are bigger) and ignore the age of the people (because the numbers are smaller compared to the salary numbers). 

    2. Normalization helps to put all the variables on the same level so the LSTM can learn from the data more easily. 

    3. As with the previous step, we have provided the code to normalize certain columns from our Pandas DataFrame. Add in the names of the columns that we will be normalizing under the #TODO comment. We will be normalizing the variables ‘CO,’ ‘Ozone,’ ‘PM10,’ ‘PM25,’ and ‘NO2.’

  6. (Code Block 4F) This code block defines a helper function that formats the data into sequences that the LSTM can process. Since LSTMs excel at learning sequential patterns, breaking the data into windows of time helps the model learn how each time step in a sequence influences the next one. This technique makes the LSTM capable of predicting future values based on past information. Run this code block.

  7. (Code Block 4G) In this code block, we define a variable called WINDOW_SIZE that determines how many consecutive time steps are grouped together into a “window” or sequence. This will be the number of time steps, in other words, how many days, the model will use to predict the next value (e.g., WINDOW_SIZE = 5 means the model will predict the 6th day AQI based on the previous 5 days). Here is where you will be changing the WINDOW_SIZE variable:

    1. Increasing the WINDOW_SIZE means you are using more past time steps in each sequence to predict the next value. This can give the LSTM more context to learn from (capturing longer-term dependencies), but it also increases the complexity of the model because it has more input to process (it will also be slower to train). 

    2. Decreasing the WINDOW_SIZE means each sequence uses fewer time steps to make predictions. This simplifies the input for the model but might not capture enough of the pattern, especially if the data has longer-term dependencies.
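The preprocessing steps above (date conversion, dropping columns, forward fill, and windowing) can be sketched in a few lines of pandas and NumPy. This is a simplified illustration under our own assumptions, not the notebook's code: the data is made up, only a few columns are shown, and `make_windows` is a hypothetical stand-in for the notebook's helper function.

```python
import numpy as np
import pandas as pd

# Ten made-up days of data with a few gaps (NaN) in the pollutant columns.
df = pd.DataFrame({
    "Date": [f"01/{day:02d}/2020" for day in range(1, 11)],
    "Overall AQI Value": [40, 42, 45, 50, 48, 47, 52, 60, 58, 55],
    "Main Pollutant": ["PM2.5"] * 10,
    "Ozone": [30, np.nan, 33, 35, np.nan, np.nan, 40, 41, 39, 38],
    "PM25": [40, 42, 45, np.nan, 48, 47, 52, 60, np.nan, 55],
})

# 4A: convert the text dates to real datetime values.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# 4C: drop a column we do not want to use as a feature.
df = df.drop(columns=["Main Pollutant"])

# 4D: forward fill, so each gap takes the last known value.
# (A gap in the very first row would stay empty, since nothing precedes it.)
df = df.sort_values("Date").ffill().reset_index(drop=True)


def make_windows(values, window_size):
    """Pair each run of `window_size` values with the value that follows it."""
    X, y = [], []
    for start in range(len(values) - window_size):
        X.append(values[start:start + window_size])
        y.append(values[start + window_size])
    return np.array(X), np.array(y)


# 4G: use the previous 5 days to predict the next day.
WINDOW_SIZE = 5
aqi = df["Overall AQI Value"].to_numpy(dtype=float)
X, y = make_windows(aqi, WINDOW_SIZE)

print(X.shape, y.shape)
print(X[0], "->", y[0])
```

Notice that ten days of data yield only five windows when WINDOW_SIZE is 5. A larger window leaves fewer training examples, which is one reason very large window sizes can be harder for the model to learn from.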

5. Split to Train, Validation, and Test

We will split our dataset into train, validation, and test sets so that we can detect overfitting and ensure the model is tested on data it has never seen before. Click on this link to learn more about splitting data into train, validation, and test sets.

  1. (Code Block 5A) We have provided the code to split the dataset into training, validation, and testing parts. This block also normalizes the data, for the reasons explained in Code Block 4E: scaling puts all the variables on the same level so the LSTM can learn from them more easily. The normalized variables are ‘CO,’ ‘Ozone,’ ‘PM10,’ ‘PM25,’ and ‘NO2.’ Run this code block.
  2. (Code Block 5B) This code block prints out the sizes of X_train, y_train, X_val, y_val, X_test, and y_test. If you see the X_train size as (216, 5), that means there are 216 samples (days) in the training data, each with 5 features (e.g., CO, Ozone, PM10, etc.).
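Here is a minimal sketch of a chronological split followed by min-max normalization. The 70/15/15 proportions, the function names, and the random data are assumptions for illustration, and the notebook may split and scale differently. The key idea shown is that the scaling values come from the training data only, so no information from the validation or test days leaks into training.

```python
import numpy as np


def chronological_split(X, y, train_frac=0.7, val_frac=0.15):
    """Split in time order: earliest rows for training, latest for testing."""
    n = len(X)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return (
        (X[:train_end], y[:train_end]),
        (X[train_end:val_end], y[train_end:val_end]),
        (X[val_end:], y[val_end:]),
    )


def fit_min_max(train):
    """Learn each column's minimum and range from the training data only."""
    low = train.min(axis=0)
    high = train.max(axis=0)
    span = np.where(high > low, high - low, 1.0)  # avoid dividing by zero
    return low, span


def apply_min_max(data, low, span):
    return (data - low) / span


rng = np.random.default_rng(1)
# 100 made-up days, 5 features on very different scales.
X = rng.random((100, 5)) * np.array([1.0, 10.0, 100.0, 50.0, 5.0])
y = rng.random(100) * 150

(X_train, y_train), (X_val, y_val), (X_test, y_test) = chronological_split(X, y)

low, span = fit_min_max(X_train)
X_train_scaled = apply_min_max(X_train, low, span)
X_val_scaled = apply_min_max(X_val, low, span)
X_test_scaled = apply_min_max(X_test, low, span)

print(X_train.shape, X_val.shape, X_test.shape)
print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

After scaling, every training column runs from 0 to 1, so a feature with large numbers no longer drowns out a feature with small ones. Validation and test values can fall slightly outside that range, which is expected.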

6. Training the Model

  1. (Code Block 6A) We have provided the code to create an LSTM model. Run this code block. 

  2. (Code Block 6B) We have provided the code to set up a model for training and saving. It uses a ModelCheckpoint to save the best version of the model during training to a file called ‘model.keras.’ Run this code block.

  3. (Code Block 6C) This code block trains the LSTM model using the training data we split earlier. Run this code (this is like pressing play to let the computer do its job and learn from the examples we have given it). Don’t worry if this takes a while to run! We gave it a lot of data. 
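For readers curious about what these three blocks might look like, here is a minimal Keras sketch of building an LSTM model, saving the best version with ModelCheckpoint, and training. The layer sizes, learning rate, number of epochs, and random stand-in data are all assumptions. The notebook's actual model may be configured differently.

```python
import os
import tempfile

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

WINDOW_SIZE = 7   # days of history in each input sequence
N_FEATURES = 5    # e.g., CO, Ozone, PM10, PM25, NO2


def build_model(window_size, n_features):
    """A small LSTM regressor that outputs one AQI value per sequence."""
    model = keras.Sequential([
        layers.Input(shape=(window_size, n_features)),
        layers.LSTM(64),
        layers.Dense(8, activation="relu"),
        layers.Dense(1),
    ])
    model.compile(
        loss="mse",
        optimizer=keras.optimizers.Adam(learning_rate=1e-3),
        metrics=["mae"],
    )
    return model


# Random stand-in data with the shape an LSTM expects:
# (number of sequences, days per sequence, features per day).
rng = np.random.default_rng(0)
X_train = rng.random((64, WINDOW_SIZE, N_FEATURES)).astype("float32")
y_train = rng.random(64).astype("float32")
X_val = rng.random((16, WINDOW_SIZE, N_FEATURES)).astype("float32")
y_val = rng.random(16).astype("float32")

model = build_model(WINDOW_SIZE, N_FEATURES)

# Save the model whenever the validation loss improves.
model_path = os.path.join(tempfile.mkdtemp(), "model.keras")
checkpoint = keras.callbacks.ModelCheckpoint(
    model_path, monitor="val_loss", save_best_only=True
)

model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=2, batch_size=16,
    callbacks=[checkpoint], verbose=0,
)

# Reload the best saved version and make a few predictions.
best_model = keras.models.load_model(model_path)
preds = best_model.predict(X_val[:3], verbose=0)
print(preds.shape)
```

Saving only the best version protects you from a common problem: a model that keeps training past its best point and starts to overfit.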

7. Evaluating the Model

  1. (Code Block 7A) This code block will reload the saved model. Run this code block.

  2. (Code Block 7B) This code block prints a table with the actual AQI values in the first column and the values that the model predicted in the second column. At a glance, how close are the model’s predictions?

  3. (Code Block 7C) This code block displays a graph of the predicted values next to the actual values for AQI. Are the lines close to each other?

  4. (Code Block 7D) This code block calculates the Mean Absolute Error (MAE) to assess the accuracy of the model’s predictions.

    1. The MAE shows how close predictions are to actual values by averaging the size of errors. A lower MAE means better prediction accuracy.

    2. For example, an MAE of 13 indicates that, on average, the model’s predictions differ from the actual values by 13 AQI units. So if the model predicts an AQI of 100, the actual AQI is typically somewhere around 87 to 113, although individual errors can be smaller or larger than the average.

  5. (Code Block 7E) This code block calculates the Mean Squared Error (MSE) to assess the accuracy of the model’s predictions. 

    1. The MSE measures the average squared difference between predicted and actual values. It provides insight into the accuracy of the model by penalizing larger errors more heavily, as the differences are squared. Lower MSE values indicate better model performance. 

    2. For example, an MSE of 325 means that, on average, the squared difference between the predicted and actual values is 325. Because the errors are squared, the MSE is not on the same scale as the AQI itself; its square root (about 18 in this example) is. This metric is particularly useful when you want to minimize large errors, as the squaring emphasizes their impact.

  6. (Code Block 7F) This code block calculates the R2 (R-squared) value, which is a statistical measure that assesses how well the regression predicts the actual data points.

    1. R2 = 1: Perfect prediction. The model accurately predicts future AQI levels with no errors. 

    2. R2 = 0: Poor prediction. The model’s predictions are no better than always predicting the average AQI. (R2 can even be negative if the model does worse than that.)

    3. 0 < R2 < 1: This range indicates how much of the variation in future AQI levels the model can explain. For example, an R2 value of 0.75 means the model explains 75% of the variation in AQI around its average value.

    4. How accurate were your model’s predictions? How close is your R2 value to 1? 
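To see how the three metrics are calculated, here is a short worked example in NumPy with four made-up actual and predicted AQI values. The function names are our own; the notebook may use library functions instead.

```python
import numpy as np


def mean_absolute_error(actual, predicted):
    return float(np.mean(np.abs(actual - predicted)))


def mean_squared_error(actual, predicted):
    return float(np.mean((actual - predicted) ** 2))


def r_squared(actual, predicted):
    residual = np.sum((actual - predicted) ** 2)      # error left after the model
    total = np.sum((actual - np.mean(actual)) ** 2)   # error of always guessing the average
    return float(1 - residual / total)


actual = np.array([50.0, 60.0, 70.0, 80.0])
predicted = np.array([55.0, 58.0, 75.0, 70.0])

# Errors are 5, 2, 5, and 10 AQI units.
print("MAE:", mean_absolute_error(actual, predicted))      # (5 + 2 + 5 + 10) / 4 = 5.5
print("MSE:", mean_squared_error(actual, predicted))       # (25 + 4 + 25 + 100) / 4 = 38.5
print("R2:", round(r_squared(actual, predicted), 3))       # 1 - 154 / 500 = 0.692

# A "model" that always predicts the average scores exactly 0.
average_guess = np.full_like(actual, actual.mean())
print("R2 of average guess:", r_squared(actual, average_guess))
```

The single error of 10 contributes 100 of the 154 total squared error, which shows how the MSE emphasizes large misses much more than the MAE does.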

8. Record the Metrics and Retest for Different Timesteps in the Future

  1. Currently, the model is set to predict the Air Quality Index (AQI) seven days into the future. Begin by recording the current performance metrics–Mean Absolute Error (MAE), Mean Squared Error (MSE), and R2–in the table below. 

  2. Next, update the model to predict AQI for a longer period by changing the WINDOW_SIZE parameter in Code Block 4G (where WINDOW_SIZE is defined) to 28, which will forecast four weeks ahead. Once updated, run all the code blocks again by selecting Runtime > Run all.

  3. After the code execution completes, locate the updated metrics (MAE, MSE, and R2) in Code Blocks 7D, 7E, and 7F. Record these new values in the table. 

  4. Repeat steps 2-3 but change the WINDOW_SIZE parameter to 365 to predict AQI one year into the future.

  5. Finally, reflect on the results: Did the model maintain accuracy when predicting different time frames, or does it perform better with shorter-term predictions compared to longer-term ones?

Time frame    MAE    MSE    R2
1 week
4 weeks
1 year

Table 1. Comparison of MAE, MSE, and R2 metrics for LSTM models predicting AQI at 1 week, 4 weeks, and 1 year timeframes. 
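When comparing time frames, it can help to keep two ideas separate: how many past days the model looks at, and how far ahead the value it predicts lies. The notebook controls the time frame through WINDOW_SIZE alone. The sketch below is a hypothetical variation, not the notebook's code, in which `make_horizon_windows` and its `horizon` parameter are names we made up to show the difference.

```python
import numpy as np


def make_horizon_windows(series, window_size, horizon):
    """Pair each run of `window_size` days with the value `horizon` days
    after the run ends (horizon=1 means the very next day)."""
    series = np.asarray(series, dtype=float)
    X, y = [], []
    last_start = len(series) - window_size - horizon
    for start in range(last_start + 1):
        end = start + window_size
        X.append(series[start:end])
        y.append(series[end + horizon - 1])
    return np.array(X), np.array(y)


# Day numbers 0..1825 stand in for roughly five years of daily AQI values.
series = np.arange(1826)

for horizon in (7, 28, 365):
    X, y = make_horizon_windows(series, window_size=7, horizon=horizon)
    print(f"horizon={horizon}: {len(X)} examples, first target is day {int(y[0])}")
```

One thing this makes visible is that predicting a full year ahead leaves noticeably fewer training examples from the same five years of data, which is worth keeping in mind when you interpret your table.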


Ask an Expert

Do you have specific questions about your science project? Our team of volunteer scientists can help. Our Experts won't do the work for you, but they will make suggestions, offer guidance, and help you troubleshoot.

Global Goals

The United Nations Sustainable Development Goals (UNSDGs) are a blueprint to achieve a better and more sustainable future for all.

This project explores topics key to Sustainable Cities and Communities: Make cities inclusive, safe, resilient and sustainable.
This project explores topics key to Climate Action: Take urgent action to combat climate change and its impacts.
This project explores topics key to Life on Land: Sustainably manage forests, combat desertification, halt and reverse land degradation, halt biodiversity loss.

Variations

  • Train the model on 10 years, 15 years, and 20 years' worth of data. Can the model improve with more data? If not, why do you think so?
  • Train two different models on two different areas. How well can the models predict air quality in a region with consistent levels year-round compared to a region impacted by annual wildfires?
  • Train the model on different time steps (such as a few years, two days, or anything you want). Compare the MAE, MSE, and R2 results with the original models you trained on one week, four weeks, and one year.

Careers

If you like this project, you might enjoy exploring these related careers:

Career Profile
Many aspects of people's daily lives can be summarized using data, from what is the most popular new video game to where people like to go for a summer vacation. Data scientists (sometimes called data analysts) are experts at organizing and analyzing large sets of data (often called "big data"). By doing this, data scientists make conclusions that help other people or companies. For example, data scientists could help a video game company make a more profitable video game based on players'… Read more
Career Profile
How is climate change affecting Earth? What will the changes mean for society? If these are questions that pique your curiosity, then you might be interested in a job as a climate change analyst. Climate change analysts evaluate climate data and research to determine how shifts in the climate will affect natural resources, animals, and civilizations. They use this information to make suggestions about what individuals and governments can do to ensure a higher-quality life for everyone in the… Read more
Career Profile
Have you ever noticed that for people with asthma it can sometimes be especially hard to breathe in the middle of a busy city? One reason for this is the exhaust from vehicles. Cars, buses, and motorcycles add pollution to our air, which affects our health. But can pollution impact more than our health? Cutting down trees, or deforestation, can contribute to erosion, which carries off valuable topsoil. But can erosion alter more than the condition of the soil? How does an oil spill harm fish… Read more


Cite This Page

General citation information is provided here. Be sure to check the formatting, including capitalization, for the method you are using and update your citation, as needed.

MLA Style

Ngo, Tracey. "Predict Air Quality with Machine Learning." Science Buddies, 16 Dec. 2025, https://www.sciencebuddies.org/science-fair-projects/project-ideas/ArtificialIntelligence_p022/artificial-intelligence/air-quality. Accessed 22 June 2026.

APA Style

Ngo, T. (2025, December 16). Predict Air Quality with Machine Learning. Retrieved from https://www.sciencebuddies.org/science-fair-projects/project-ideas/ArtificialIntelligence_p022/artificial-intelligence/air-quality


Last edit date: 2025-12-16