Jump to main content

Predicting Future Water Quality with Machine Learning

1
2
3
4
5
53 reviews

Abstract

Understanding the water quality of our rivers, streams, and lakes is critical for maintaining healthy ecosystems and sources of drinking water. Dissolved oxygen is one key water quality measurement that impacts aquatic life. But how can we predict how water quality might change in the future so we can intervene to keep our water healthy? With machine learning! In this project, you will gather water quality data for a location of your choice and use a random forest model to predict future dissolved oxygen levels. 

Summary

Areas of Science
Difficulty
Method
Time Required
Very Short (≤ 1 day)
Material Availability

Readily available

Cost
Very Low (under $20)
Safety

No issues

Credits
Science Buddies is committed to creating content authored by scientists and educators. Learn more about our process and how we use AI.

Objective

Collect water data from the USGS website for a river, stream, or lake to use on a random forest model and explore how well the model can predict dissolved oxygen one day, seven days, and thirty days into the future.

Introduction

Water quality is essential for the health of ecosystems, human communities, and economic activities. Freshwater sources have many important roles: they provide drinking water, support wildlife, offer recreational opportunities, and aid industrial processes. However, these water sources face challenges from pollution associated with agriculture, industry, and urban development.

Monitoring water quality helps identify and address pollutants, ensuring safe and sustainable water use. One key measure of water quality is dissolved oxygen (DO), which is the amount of oxygen dissolved in water. DO is essential for the survival of fish, invertebrates, and aerobic bacteria. Low dissolved oxygen levels can cause serious problems, such as algae blooms and “dead zones,” areas where aquatic life cannot survive. Do some research to learn more about algae blooms in your state or region.

Watch this video to learn more about dissolved oxygen and how it impacts the health of a body of water:

To improve water quality, we need proactive measures like controlling nutrient inputs, aerating water bodies, and using early warning systems. Predictive models can provide early warnings for potential water quality issues, allowing timely interventions. Research and education also play a crucial role in informing scientific research and raising public awareness. One way to predict future dissolved oxygen levels is through machine learning.

Artificial Intelligence (AI) is a branch of computer science focused on the creation of tools that can solve problems and analyze information. Machine learning is a subdivision of AI. Its goal is to create tools that can learn and improve over time using data. In this project, we will dive into decision trees, one type of machine learning algorithm. Decision trees mimic the human decision-making process by breaking down a problem into a series of sequential questions or decisions. Decision Trees often serve as powerful tools for classifying data and solving various problems.

Watch this video to learn more about decision trees. We recommend watching the video from 0:18 to 9:15:

However, decision trees have some drawbacks, like overfitting (being too specific to training data) and being sensitive to small changes in the data. To address these problems, we can use random forests, which combine several decision trees to enhance prediction accuracy. Random forests help reduce overfitting and improve model stability, making them a great option for predicting dissolved oxygen levels.

Watch this video to learn more about random forests:

In this project, your task is to use the USGS website to gather data for a water body of your choice for our random forest model. We will use several water quality properties to predict dissolved oxygen levels, including:

You can read more about each of these water quality properties on the USGS website. Once the data is collected, you will run it through the model to determine how accurate it is at predicting dissolved oxygen one day, seven days, and thirty days into the future and which features are most important in predicting future dissolved oxygen levels.

Terms and Concepts

Questions

Bibliography

Decision Trees, Random Forests, and Dissolved Oxygen:

USGS Data:

To learn more about why we split data into train and test:

To learn more about algal blooms:

Materials and Equipment

Experimental Procedure

This project follows the Scientific Method. Review the steps before you begin.

1. Gathering the Data

We will be using water quality data from USGS National Water Dashboard for this project.

  1. Navigate to the USGS National Water Dashboard website. Under USGS Stations, you will find a list of station types and what they measure (Streamflow, Surface-Water Levels, etc.). You will want to focus on the “Water Quality” stations. Click on the arrow next to each station type, and turn each of them off except the one called “Water Quality.” 
  2. Use the drop-down under “Water Quality” to select “Dissolved oxygen.” Now, the map will show only sites that have “Dissolved oxygen” measurements.
  3. You will now see yellow and black dots and triangles populating the map of the United States. The black locations mean that recent measurements are unavailable, so it is preferable that you choose a yellow point instead. Click on a location that interests you. You should get a pop-up showing the name of the area that you clicked on as well as its water quality data. Scroll down and make sure that the location you have chosen has all of the properties that we will be using for this project (see the list below). After verifying that the site has all of these properties, click on ‘Site page’ at the top of the pop-up window; otherwise pick another location:
    1. Dissolved oxygen, water, unfiltered, milligrams per liter
    2. Temperature, water, degrees Celcius
    3. Turbidity, water, unfiltered, monochrome near infra-red LED light, 780-900 nm, detection angle 90 +-2.5 degrees, formazin nephelometric units (FNU)
    4. pH, water, unfiltered, field, standard units
    5. Specific conductance, water, unfiltered, microsiemens per centimeter at 25 degrees Celcius
  4. Next, you will export the water quality data from the USGS website to a spreadsheet so it can be formatted for use in the random forest model. Choose your preferred option to create the dataset spreadsheet:
    1. Google Sheets: Navigate to Google Sheets. Click on ‘Blank spreadsheet’ and at the top, rename the file from ‘Untitled spreadsheet’ to ‘water_data.’
    2. Microsoft Excel: Open the Microsoft Excel app on your computer. This should automatically open a new spreadsheet called ‘Book1’. Rename the file by clicking on ‘File’->’Save as’ and enter ‘water_data’ where it says ‘Enter file name here.’ Save it in a location on your computer that is convenient for you.
  5. Set up your spreadsheet to get it ready for the data you will export from the USGS website. Create the following Sheets in your spreadsheet document by clicking the ‘+’ at the bottom of the spreadsheet and renaming them:
    1. Average Temperature
    2. Average Turbidity
    3. Average pH
    4. Average Specific Conductance
    5. Average Dissolved Oxygen
    6. Average Dissolved Oxygen +1 day
    7. Average Dissolved Oxygen +1 week
    8. Average Dissolved Oxygen +4 weeks
    9. Final
  6. Next, you will export the data from the USGS data page for each property listed in Step 5 to the corresponding tab on your spreadsheet.
    1. On the USGS data page (make sure you are on the Site page that you opened in Step 3 for your specific body of water), under ‘Select data to graph,’ choose ‘Temperature, water, degrees Celsius.’
    2. Click ‘Change time span’ and choose an approximately 3-month time span for your training data. Make sure that for the time span you choose, there is at least one month of data available after that time span. For example, if you choose a time span from 01/01/2024 to 03/31/2024, make sure that there is data going up to at least 05/01/2024. Click on the ‘Change time span’ button to save.
    3. Click on ‘Download data,’ select ‘Primary time series’ , and then click ‘Retrieve.’ The data should open in a new tab.
    4. Select all of the data by holding Ctrl + A on your keyboard (Cmd + A on a Mac). Once everything is selected, copy the data by holding Ctrl + C (Cmd + C on a Mac).
  7. Choose your preferred option to prepare the data in the spreadsheet:
    1. Google Sheets:
      1. Make sure you are on the ‘Average Temperature’ sheet.
      2. Click on the first cell (A1) and paste the data by holding Ctrl + V on your keyboard (Cmd + V on a Mac).
      3. Click on the text wrapping icon on the toolbar and select the ‘Overflow’ option. The data should now be formatted the way it was on the USGS website.
      4. Delete all of the header information by clicking on row 1, holding shift, and then clicking on row 28. Right-click, then select ‘Delete rows 1-28’.
      5. Click on the ‘A’ at the top of column A, then from the top menu, choose ‘Data’-> ’Split text to columns.’ The data should now be separated into columns.
      6. We only need the columns with the date (column D), the time (column E), and the water quality data (column G). Delete the unnecessary columns by right-clicking on the column name (e.g. A, B, …) and then selecting ‘Delete column.’
      7. Rename these columns in Row 1 as Date (column A), Time (column B), and Temperature (column C).
    2. Microsoft Excel:
      1. Make sure you are on the ‘Average Temperature’ sheet.
      2. Click on the first cell (A1) and paste the data by holding Ctrl + V on your keyboard (Cmd + V if you are using a MacBook).
      3. Delete all of the header information by clicking on row 1, holding shift, and then clicking on row 28. Right-click, then select ‘Delete’.
      4. Click on column A, then at the top toolbar select ‘Data’->’Text to Columns’. Select the ‘Delimited’ option then press ‘Next’ and select ‘Space’, then ‘Finish.’ The data should now be separated into columns.
      5. We only need the columns with the date (column D), the time (column E), and the water quality data (column G). Delete the unnecessary columns by right-clicking on the column name (e.g. A, B, …) and then selecting ‘Delete column.’
      6. Rename these columns in Row 1 as Date (column A), Time (column B), and Temperature (column C).
  8. Most locations collect data multiple times each day. Your next step will be to calculate the average temperature for each day. Rather than doing this by hand, you can use the spreadsheet to do the calculations for you. Rename column D to say ‘Date’ and column E to say ‘Average __’  where __ is the property name that you are currently using.
    1. Enter the first date from your dataset under your new ‘Date’ column. To add the rest of the dates, you can click on the box with the first date, then click and drag the dot at the bottom right of that box. Drag it down until the last date of the time period you have chosen.
    2. In the next column (cell E2), type ‘=AVERAGEIF(D:D,D2,C:C)’ This formula calculates the average of the values in column C (the water quality property) where the corresponding date in column D matches the date in cell D2.
    3. Double-click the dot at the bottom right corner of cell E2 to copy the formula for the rest of the dates. This will automatically calculate the average for each day in your dataset.
  9. Copy the two columns that you have created (columns D and E)  onto the sheet called ‘Final.’ You can do this by right-clicking on the column, selecting copy, then paste the column in the new sheet. Make sure to select ‘Paste special’->’Values only’ for the water quality property. (Note that after you copy the first spreadsheet, you will not have to copy the date column again).
  10. Repeat steps 6-9 until all of the water quality properties listed in step 5 have been added to the ‘Final’ sheet.
    1. For Average Turbidity, Average pH, Average Specific Conductance, and Average Dissolved Oxygen, you will use the exact same date range that you used for Average Temperature.
    2. For Average Dissolved Oxygen +1 Day, Average Dissolved Oxygen +1 Week, and Average Dissolved Oxygen +4 Weeks, you will need to adjust the dates. For example, if you initially chose a time frame from 01/01/2024 - 03/31/2024, your 1-day time frame will be 01/02/2024 - 04/01/2024, your 1-week time frame will be 01/08/2024 - 04/06/2024, and your 4-week time frame will be 02/01/2024 - 04/30/2024). Note that you will adjust the dates when you export these datasets, but you do not need to adjust the dates in the ‘Final’ sheet.
    3. On the ‘Final’ sheet, rename the properties in Row 1s to the names in the list below. This will be critical for the code to read your data, so you may want to copy and paste these directly:
      1. Average Temperature → Average Temperature C
      2. Average Turbidity → Average Turbidity FNU
      3. Average pH → Average pH (no change needed here)
      4. Average Specific Conductance → Average Specific Conductance μS/cm
      5. Average Dissolved Oxygen +1 day → Average Dissolved Oxygen +1 day mg/l
      6. Average Dissolved Oxygen +1 week → Average Dissolved Oxygen +1 week mg/l
      7. Average Dissolved Oxygen +4 weeks → Average Dissolved Oxygen +4 weeks mg/l
  11. Once you have finished creating your dataset, convert the ‘Final’ sheet into a CSV file.
    1. Google Sheets: Make sure you are on the ‘Final’ sheet. Select ‘File’->’Download’->’Comma separated values (.csv)’. Rename the file to ‘water_data.csv’ by right-clicking the file in your Downloads folder and selecting the ‘Rename’ option.
    2. Microsoft Excel: Make sure you are on the ‘Final’ sheet. Select ‘File’->’Save a Copy’. In the drop-down menu select ‘Comma delimited (*.csv)’. Rename the file to ‘water_data.csv’ by right-clicking the file in your Downloads folder and selecting the ‘Rename’ option.

2. Preparing the Workspace

  1. Within your Google Drive, click on ‘My Drive,’ then create a new folder and rename it “Water Quality.” Inside the folder, upload your ‘water_data.csv’ file.
  2. Download the ‘dissolved_oxygen.ipynb’ file from Science Buddies and upload it to the same folder. This is the code you will need to process your data.
  3. Double-click the dissolved_oxygen.ipynb file. This should automatically open Google Colab.
    1. Read the Troubleshooting Tips and How to Use This Notebook sections. Follow the instructions you find in that section.
    2. Run the block under Importing Libraries to ensure you have access to all the functions we will use for this project.

3. Loading the Data into a Pandas DataFrame

  1. (Code Block 3A) Run this code block to make the files on your Google Drive available to use in the notebook.
  2. (Code Block 3B) Run this code block to create a DataFrame, which is like a table that will be used to load and manipulate the data in the notebook. You will see the data from your .csv file populate in a table below the code block. If you get an error when running this block, check to make sure your file name matches what is outlined in the instructions above.

4. Preprocessing the Dataset

  1. Separating the Dataset into Inputs and Targets:
    1. (Code Block 4A) Run this code block to separate the dataset into two parts:
      1. ‘inputs’: Contains the water quality properties (e.g. temperature, turbidity, pH, etc.) that the model will use to try to predict future dissolved oxygen.
      2. ‘target’: Contains the future values of dissolved oxygen that the model is trying to predict (e.g., predictions for one day, one week, four weeks into the future).
    2. (Code Block 4B) Run this code block to convert the ‘inputs’ and ‘target’ data into NumPy arrays, which are efficient, multi-dimensional containers for numerical data in Python. This conversion makes it easier for the random forest model to process.
  2. Splitting the Training and Testing Data:
    1. (Code Block 4C) Splitting data into training and testing sets is important in machine learning. It helps to see how well your model works on new data. Watch this video to learn more about why we split datasets. We have provided the code to split the dataset into training and testing parts. Pay attention to how X and y look after this step, as well as the sizes of X_train, X_test, y_train, and y_test. The numbers inside the parentheses represent the number of rows and the number of characteristics in each set. For example, if you see the X_train shape as (72, 5), that means there are 72 water samples, each with 5 properties (e.g., temperature, turbidity, pH, etc.).

Coding Tip:

Following the standard coding conventions, X is commonly written in uppercase, while y is usually in lowercase.

5. Training the Model

  1. (Code Block 5A) We have provided the code to make a random forest regressor. Remember that a random forest regressor uses multiple decision trees to improve prediction accuracy and reduce the risk of overfitting by averaging their results. Run this code.
  2. (Code Block 5B) This code trains the regressor using the training data you gave it. Run this code (this is like pressing play to let the computer do its job and learn from the examples we’ve given it).
  3. (Code Block 5C) This code block prints the predictions made by our random forest model. The output is formatted as a 2D array with three columns, where each column represents predictions of dissolved oxygen levels for one day, one week, and four weeks in the future, respectively.
  4. (Code Block 5D) This code block prints the actual dissolved oxygen levels in the same format as Code Block 5C. Compare the outputs of this block with the previous block. How close are the values to each other? You will do a closer analysis in the following steps.

6. Evaluating the Model

  1. (Code Block 6A) This code block calculates the Mean Absolute Error (MAE) to assess the accuracy of the model’s predictions.
    1. The MAE shows how close predictions are to actual values by averaging the size of errors. A lower MAE means better accuracy in predictions.
    2. For example, an MAE of 0.13 indicates that, on average, the model's predictions differ from the actual values by 0.13 units.
  2. (Code Block 6B) This code block calculates the Mean Absolute Percentage Error (MAPE) to assess the accuracy of the model’s predictions.
    1. The MAPE measures the average absolute percentage error between predicted and actual values. It provides insight into the accuracy of the model in terms of relative error, expressed as a percentage. Lower MAPE values indicate better model performance.
    2. For example, a MAPE of 5% means that, on average, the predictions are off by 5% from the actual values. This metric is particularly useful when you want to understand the error in relative terms, making it easier to compare the performance across different datasets or models.

7. Visualize the Data

  1. (Code Block 7A) When you run this code block, three graphs will be created, showing the model’s predictions of dissolved oxygen levels for one day, one week, and four weeks in the future.
    1. The graph shows the actual dissolved oxygen level on the x-axis and the predicted dissolved oxygen level on the y-axis. So, if a point is at 8.5 on the x-axis and the y-axis, that point was classified correctly. If a point is at 8.5 and at 9.2 on the y-axis, it is classified incorrectly.
    2. The R2 (R-squared) value is a statistical measure that assesses how well the regression predicts the actual data points.
      1. R2 = 1: Perfect prediction. The model accurately predicts future dissolved oxygen levels with no errors.
      2. R2 = 0: Poor prediction. The model’s predictions are no better than simply using the average dissolved oxygen as a guess.
      3. 0 < R2 < 1: This range indicates how much of the variation in future dissolved oxygen levels the model can explain. For example, an R2 value of 0.75 means the model can predict 75% of the variation in dissolved oxygen levels beyond what would be predicted by just using the average value. In a perfect scenario, where the model is 100% accurate, the dots on the graph would create a straight diagonal line from the bottom left to the top right.
    3. How accurate were your model’s predictions? How close were your R2 values to 1? How did the predictions and R2 values change as your model looked further into the future?
  2. (Code Block 7B) When you run this code block, it will create a picture or visualization of one of the decision trees in our random forest model. The characteristics are arranged from the most important for determining the future dissolved oxygen level at the top to the least important for determining the future dissolved oxygen level at the bottom. At the top of each box in the decision tree, you will see a characteristic and a number, such as Average DO <= 9.4. This example means that if the dissolved oxygen level is less than 9.4, it will go to the left side of the tree; otherwise, it will go to the right side of the decision tree.
    1. There are two values you can change:
      1. On line 2 of the code block, you will see a variable called ‘tree_number.’ You can change this number to be between 0 and 100 to view the different decision trees in the random forest.
      2. One line 5 of the code block, you will see a variable called ‘max_depth.’ You can change this number to view more or less of the decision tree.
      3. Explore some of these decision trees. What water quality properties were most important for predicting future dissolved oxygen? Why do you think this might be?
  3. (Code Block 7C) This code calculates and sorts feature importances from the random forest model, then creates a horizontal bar plot to visualize each feature’s importance. This helps identify which features most influence the model’s predictions.
    1. What were the top three features that were most and least important in predicting future dissolved oxygen levels? Why do you think this might be the case?
icon scientific method

Ask an Expert

Do you have specific questions about your science project? Our team of volunteer scientists can help. Our Experts won't do the work for you, but they will make suggestions, offer guidance, and help you troubleshoot.

Global Goals

The United Nations Sustainable Development Goals (UNSDGs) are a blueprint to achieve a better and more sustainable future for all.

This project explores topics key to Clean Water and Sanitation: Ensure access to water and sanitation for all.
This project explores topics key to Climate Action: Take urgent action to combat climate change and its impacts.
This project explores topics key to Life Below Water: Conserve and sustainably use the oceans, seas and marine resources.

Variations

  • Experiment with a single decision tree to assess whether a single decision tree can achieve performance comparable to that of a random forest. You can use the DecisionTreeRegressor from the sklearn library.
  • Repeat the project by collecting data from multiple areas instead of one. Analyze if the model's accuracy decreases when exposed to more diverse and variable data. 
  • Make the model predict further into the future (e.g. 2 weeks, 3 weeks, 8 weeks, 12 weeks ahead).
  • Add more water quality properties, such as discharge, chlorophyll fluorescence (fChl), nitrate, etc.
  • See how removing certain water quality properties affects the random forest model's accuracy.
  • Expand or shorten the time period of the training data. How much data is really necessary to accurately predict future water quality at different time steps?
  • Adjust the model to predict a different water quality property (e.g. predict pH levels instead of dissolved oxygen). 

Careers

If you like this project, you might enjoy exploring these related careers:

Career Profile
Many aspects of peoples' daily lives can be summarized using data, from what is the most popular new video game to where people like to go for a summer vacation. Data scientists (sometimes called data analysts) are experts at organizing and analyzing large sets of data (often called "big data"). By doing this, data scientists make conclusions that help other people or companies. For example, data scientists could help a video game company make a more profitable video game based on players'… Read more
Career Profile
Soil and water are two of Earth's most important natural resources. Earth would not be able to sustain life without nutritive soil to grow food and clean water to drink. Soil and water conservationists foster the science and art of natural resource conservation. The scientists work to discover, develop, implement, and constantly improve ways to use land that sustains its productive capacity, and enhances the environment at the same time. Soil and water conservationists are involved in improving… Read more
Career Profile
When you think about a city that is a great place to live, what do you consider? Probably a community where the citizens are happy, healthy, and comfortable. Part of being all three is having a clean, safe, and constant water supply. Many of us take for granted that when we turn the faucet on we will be able to get a glass of water or that when we flush the toilet our waste will be carried away and treated somewhere. Well, that is what a water or wastewater engineer does. Their job is to design… Read more

News Feed on This Topic

 
, ,

Cite This Page

General citation information is provided here. Be sure to check the formatting, including capitalization, for the method you are using and update your citation, as needed.

MLA Style

Ngo, Tracey. "Predicting Future Water Quality with Machine Learning." Science Buddies, 26 Dec. 2024, https://www.sciencebuddies.org/science-fair-projects/project-ideas/ArtificialIntelligence_p019/artificial-intelligence/dissolved_oxygen. Accessed 11 June 2026.

APA Style

Ngo, T. (2024, December 26). Predicting Future Water Quality with Machine Learning. Retrieved from https://www.sciencebuddies.org/science-fair-projects/project-ideas/ArtificialIntelligence_p019/artificial-intelligence/dissolved_oxygen


Last edit date: 2024-12-26
Top
Free science fair projects.