Abstract
Understanding the water quality of our rivers, streams, and lakes is critical for maintaining healthy ecosystems and sources of drinking water. Dissolved oxygen is one key water quality measurement that impacts aquatic life. But how can we predict how water quality might change in the future so we can intervene to keep our water healthy? With machine learning! In this project, you will gather water quality data for a location of your choice and use a random forest model to predict future dissolved oxygen levels.
Summary
Readily available
No issues
Objective
Collect water data from the USGS website for a river, stream, or lake to use on a random forest model and explore how well the model can predict dissolved oxygen one day, seven days, and thirty days into the future.
Introduction
Water quality is essential for the health of ecosystems, human communities, and economic activities. Freshwater sources have many important roles: they provide drinking water, support wildlife, offer recreational opportunities, and aid industrial processes. However, these water sources face challenges from pollution associated with agriculture, industry, and urban development.
Monitoring water quality helps identify and address pollutants, ensuring safe and sustainable water use. One key measure of water quality is dissolved oxygen (DO), which is the amount of oxygen dissolved in water. DO is essential for the survival of fish, invertebrates, and aerobic bacteria. Low dissolved oxygen levels can cause serious problems, such as algae blooms and “dead zones,” areas where aquatic life cannot survive. Do some research to learn more about algae blooms in your state or region.
Watch this video to learn more about dissolved oxygen and how it impacts the health of a body of water:
To improve water quality, we need proactive measures like controlling nutrient inputs, aerating water bodies, and using early warning systems. Predictive models can provide early warnings for potential water quality issues, allowing timely interventions. Research and education also play a crucial role in informing scientific research and raising public awareness. One way to predict future dissolved oxygen levels is through machine learning.
Artificial Intelligence (AI) is a branch of computer science focused on the creation of tools that can solve problems and analyze information. Machine learning is a subdivision of AI. Its goal is to create tools that can learn and improve over time using data. In this project, we will dive into decision trees, one type of machine learning algorithm. Decision trees mimic the human decision-making process by breaking down a problem into a series of sequential questions or decisions. Decision Trees often serve as powerful tools for classifying data and solving various problems.
Watch this video to learn more about decision trees. We recommend watching the video from 0:18 to 9:15:
However, decision trees have some drawbacks, like overfitting (being too specific to training data) and being sensitive to small changes in the data. To address these problems, we can use random forests, which combine several decision trees to enhance prediction accuracy. Random forests help reduce overfitting and improve model stability, making them a great option for predicting dissolved oxygen levels.
Watch this video to learn more about random forests:
In this project, your task is to use the USGS website to gather data for a water body of your choice for our random forest model. We will use several water quality properties to predict dissolved oxygen levels, including:
- Temperature: affects chemical reactions and aquatic life
- Turbidity: measures water clarity; high levels can mean harmful particles are present
- pH: shows how acidic or alkaline the water is, which affects aquatic health
- Specific Conductance: measures dissolved ions, influencing the solubility of gases like oxygen
You can read more about each of these water quality properties on the USGS website. Once the data is collected, you will run it through the model to determine how accurate it is at predicting dissolved oxygen one day, seven days, and thirty days into the future and which features are most important in predicting future dissolved oxygen levels.
Terms and Concepts
- Water quality
- Dissolved oxygen (DO)
- Algae blooms
- Artificial Intelligence (AI)
- Machine learning
- Decision tree
- Overfitting
- Random forest
- Temperature
- Turbidity
- pH
- Specific conductance
- Mean Absolute Error (MAE)
- Mean Absolute Percentage Error (MAPE)
- R2 (R-squared) value
Questions
- How does monitoring water quality help maintain safe and sustainable water use?
- What is dissolved oxygen (DO), and why is it crucial for aquatic life?
- How can predictive models and early warning systems help in water quality management?
- What is a decision tree, and how does it mimic the human decision-making process?
- What are some drawbacks of decision trees?
- How do random forests address the problems associated with decision trees?
- How do temperature, turbidity, pH, and specific conductance affect water quality?
Bibliography
Decision Trees, Random Forests, and Dissolved Oxygen:
- StatQuest with Josh Starmer. (2021, April). Decision and Classification Trees, Clearly Explained!!!. YouTube. Retrieved June 27, 2024.
- StatQuest with Josh Starmer. (2018, February). StatQuest: Random Forests Part 1 - Building, Using and Evaluating. YouTube. Retrieved June 27, 2024.
- MITK12Videos. (2014, April). Dissolved Oxygen. YouTube. Retrieved June 27, 2024.
USGS Data:
- USGS. (n.d.) USGS | National Water Dashboard. Retrieved June 27, 2024.
To learn more about why we split data into train and test:
- Turp, Misra. (2023, February). Why do we split data into train test and validation sets?. YouTube. Retrieved June 27, 2024.
To learn more about algal blooms:
- Dennis, Ellen. (2023, October). 'Unusually large' toxic algal bloom covers 30-mile stretch of the Lower Snake River. The Spokesman-Review. Retrieved June 27, 2024.
- Utah Department of Environmental Quality. (n.d.) Photo Examples of Harmful Algal Blooms (HABs) - Utah Department of Environmental Quality. Retrieved June 27, 2024.
Materials and Equipment
- Computer with Internet access
- (Optional) Microsoft Excel
Experimental Procedure

1. Gathering the Data
We will be using water quality data from USGS National Water Dashboard for this project.
- Navigate to the USGS National Water Dashboard website. Under USGS Stations, you will find a list of station types and what they measure (Streamflow, Surface-Water Levels, etc.). You will want to focus on the “Water Quality” stations. Click on the arrow next to each station type, and turn each of them off except the one called “Water Quality.”
- Use the drop-down under “Water Quality” to select “Dissolved oxygen.” Now, the map will show only sites that have “Dissolved oxygen” measurements.
- You will now see yellow and black dots and triangles populating the map of the United States. The black locations mean that recent measurements are unavailable, so it is preferable that you choose a yellow point instead. Click on a location that interests you. You should get a pop-up showing the name of the area that you clicked on as well as its water quality data. Scroll down and make sure that the location you have chosen has all of the properties that we will be using for this project (see the list below). After verifying that the site has all of these properties, click on ‘Site page’ at the top of the pop-up window; otherwise pick another location:
- Dissolved oxygen, water, unfiltered, milligrams per liter
- Temperature, water, degrees Celcius
- Turbidity, water, unfiltered, monochrome near infra-red LED light, 780-900 nm, detection angle 90 +-2.5 degrees, formazin nephelometric units (FNU)
- pH, water, unfiltered, field, standard units
- Specific conductance, water, unfiltered, microsiemens per centimeter at 25 degrees Celcius
- Next, you will export the water quality data from the USGS website to a spreadsheet so it can be formatted for use in the random forest model. Choose your preferred option to create the dataset spreadsheet:
- Google Sheets: Navigate to Google Sheets. Click on ‘Blank spreadsheet’ and at the top, rename the file from ‘Untitled spreadsheet’ to ‘water_data.’
- Microsoft Excel: Open the Microsoft Excel app on your computer. This should automatically open a new spreadsheet called ‘Book1’. Rename the file by clicking on ‘File’->’Save as’ and enter ‘water_data’ where it says ‘Enter file name here.’ Save it in a location on your computer that is convenient for you.
- Set up your spreadsheet to get it ready for the data you will export from the USGS website. Create the following Sheets in your spreadsheet document by clicking the ‘+’ at the bottom of the spreadsheet and renaming them:
- Average Temperature
- Average Turbidity
- Average pH
- Average Specific Conductance
- Average Dissolved Oxygen
- Average Dissolved Oxygen +1 day
- Average Dissolved Oxygen +1 week
- Average Dissolved Oxygen +4 weeks
- Final
- Next, you will export the data from the USGS data page for each property listed in Step 5 to the corresponding tab on your spreadsheet.
- On the USGS data page (make sure you are on the Site page that you opened in Step 3 for your specific body of water), under ‘Select data to graph,’ choose ‘Temperature, water, degrees Celsius.’
- Click ‘Change time span’ and choose an approximately 3-month time span for your training data. Make sure that for the time span you choose, there is at least one month of data available after that time span. For example, if you choose a time span from 01/01/2024 to 03/31/2024, make sure that there is data going up to at least 05/01/2024. Click on the ‘Change time span’ button to save.
- Click on ‘Download data,’ select ‘Primary time series’ , and then click ‘Retrieve.’ The data should open in a new tab.
- Select all of the data by holding Ctrl + A on your keyboard (Cmd + A on a Mac). Once everything is selected, copy the data by holding Ctrl + C (Cmd + C on a Mac).
- Choose your preferred option to prepare the data in the spreadsheet:
- Google Sheets:
- Make sure you are on the ‘Average Temperature’ sheet.
- Click on the first cell (A1) and paste the data by holding Ctrl + V on your keyboard (Cmd + V on a Mac).
- Click on the text wrapping icon on the toolbar and select the ‘Overflow’ option. The data should now be formatted the way it was on the USGS website.
- Delete all of the header information by clicking on row 1, holding shift, and then clicking on row 28. Right-click, then select ‘Delete rows 1-28’.
- Click on the ‘A’ at the top of column A, then from the top menu, choose ‘Data’-> ’Split text to columns.’ The data should now be separated into columns.
- We only need the columns with the date (column D), the time (column E), and the water quality data (column G). Delete the unnecessary columns by right-clicking on the column name (e.g. A, B, …) and then selecting ‘Delete column.’
- Rename these columns in Row 1 as Date (column A), Time (column B), and Temperature (column C).
- Microsoft Excel:
- Make sure you are on the ‘Average Temperature’ sheet.
- Click on the first cell (A1) and paste the data by holding Ctrl + V on your keyboard (Cmd + V if you are using a MacBook).
- Delete all of the header information by clicking on row 1, holding shift, and then clicking on row 28. Right-click, then select ‘Delete’.
- Click on column A, then at the top toolbar select ‘Data’->’Text to Columns’. Select the ‘Delimited’ option then press ‘Next’ and select ‘Space’, then ‘Finish.’ The data should now be separated into columns.
- We only need the columns with the date (column D), the time (column E), and the water quality data (column G). Delete the unnecessary columns by right-clicking on the column name (e.g. A, B, …) and then selecting ‘Delete column.’
- Rename these columns in Row 1 as Date (column A), Time (column B), and Temperature (column C).
- Google Sheets:
- Most locations collect data multiple times each day. Your next step will be to calculate the average temperature for each day. Rather than doing this by hand, you can use the spreadsheet to do the calculations for you. Rename column D to say ‘Date’ and column E to say ‘Average __’ where __ is the property name that you are currently using.
- Enter the first date from your dataset under your new ‘Date’ column. To add the rest of the dates, you can click on the box with the first date, then click and drag the dot at the bottom right of that box. Drag it down until the last date of the time period you have chosen.
- In the next column (cell E2), type ‘=AVERAGEIF(D:D,D2,C:C)’ This formula calculates the average of the values in column C (the water quality property) where the corresponding date in column D matches the date in cell D2.
- Double-click the dot at the bottom right corner of cell E2 to copy the formula for the rest of the dates. This will automatically calculate the average for each day in your dataset.
- Copy the two columns that you have created (columns D and E) onto the sheet called ‘Final.’ You can do this by right-clicking on the column, selecting copy, then paste the column in the new sheet. Make sure to select ‘Paste special’->’Values only’ for the water quality property. (Note that after you copy the first spreadsheet, you will not have to copy the date column again).
- Repeat steps 6-9 until all of the water quality properties listed in step 5 have been added to the ‘Final’ sheet.
- For Average Turbidity, Average pH, Average Specific Conductance, and Average Dissolved Oxygen, you will use the exact same date range that you used for Average Temperature.
- For Average Dissolved Oxygen +1 Day, Average Dissolved Oxygen +1 Week, and Average Dissolved Oxygen +4 Weeks, you will need to adjust the dates. For example, if you initially chose a time frame from 01/01/2024 - 03/31/2024, your 1-day time frame will be 01/02/2024 - 04/01/2024, your 1-week time frame will be 01/08/2024 - 04/06/2024, and your 4-week time frame will be 02/01/2024 - 04/30/2024). Note that you will adjust the dates when you export these datasets, but you do not need to adjust the dates in the ‘Final’ sheet.
- On the ‘Final’ sheet, rename the properties in Row 1s to the names in the list below. This will be critical for the code to read your data, so you may want to copy and paste these directly:
- Average Temperature → Average Temperature C
- Average Turbidity → Average Turbidity FNU
- Average pH → Average pH (no change needed here)
- Average Specific Conductance → Average Specific Conductance μS/cm
- Average Dissolved Oxygen +1 day → Average Dissolved Oxygen +1 day mg/l
- Average Dissolved Oxygen +1 week → Average Dissolved Oxygen +1 week mg/l
- Average Dissolved Oxygen +4 weeks → Average Dissolved Oxygen +4 weeks mg/l
- Once you have finished creating your dataset, convert the ‘Final’ sheet into a CSV file.
- Google Sheets: Make sure you are on the ‘Final’ sheet. Select ‘File’->’Download’->’Comma separated values (.csv)’. Rename the file to ‘water_data.csv’ by right-clicking the file in your Downloads folder and selecting the ‘Rename’ option.
- Microsoft Excel: Make sure you are on the ‘Final’ sheet. Select ‘File’->’Save a Copy’. In the drop-down menu select ‘Comma delimited (*.csv)’. Rename the file to ‘water_data.csv’ by right-clicking the file in your Downloads folder and selecting the ‘Rename’ option.
2. Preparing the Workspace
- Within your Google Drive, click on ‘My Drive,’ then create a new folder and rename it “Water Quality.” Inside the folder, upload your ‘water_data.csv’ file.
- Download the ‘dissolved_oxygen.ipynb’ file from Science Buddies and upload it to the same folder. This is the code you will need to process your data.
- Double-click the dissolved_oxygen.ipynb file. This should automatically open Google Colab.
- Read the Troubleshooting Tips and How to Use This Notebook sections. Follow the instructions you find in that section.
- Run the block under Importing Libraries to ensure you have access to all the functions we will use for this project.
3. Loading the Data into a Pandas DataFrame
- (Code Block 3A) Run this code block to make the files on your Google Drive available to use in the notebook.
- (Code Block 3B) Run this code block to create a DataFrame, which is like a table that will be used to load and manipulate the data in the notebook. You will see the data from your .csv file populate in a table below the code block. If you get an error when running this block, check to make sure your file name matches what is outlined in the instructions above.
4. Preprocessing the Dataset
- Separating the Dataset into Inputs and Targets:
- (Code Block 4A) Run this code block to separate the dataset into two parts:
- ‘inputs’: Contains the water quality properties (e.g. temperature, turbidity, pH, etc.) that the model will use to try to predict future dissolved oxygen.
- ‘target’: Contains the future values of dissolved oxygen that the model is trying to predict (e.g., predictions for one day, one week, four weeks into the future).
- (Code Block 4B) Run this code block to convert the ‘inputs’ and ‘target’ data into NumPy arrays, which are efficient, multi-dimensional containers for numerical data in Python. This conversion makes it easier for the random forest model to process.
- (Code Block 4A) Run this code block to separate the dataset into two parts:
- Splitting the Training and Testing Data:
- (Code Block 4C) Splitting data into training and testing sets is important in machine learning. It helps to see how well your model works on new data. Watch this video to learn more about why we split datasets. We have provided the code to split the dataset into training and testing parts. Pay attention to how X and y look after this step, as well as the sizes of X_train, X_test, y_train, and y_test. The numbers inside the parentheses represent the number of rows and the number of characteristics in each set. For example, if you see the X_train shape as (72, 5), that means there are 72 water samples, each with 5 properties (e.g., temperature, turbidity, pH, etc.).
Coding Tip:
Following the standard coding conventions, X is commonly written in uppercase, while y is usually in lowercase.
5. Training the Model
- (Code Block 5A) We have provided the code to make a random forest regressor. Remember that a random forest regressor uses multiple decision trees to improve prediction accuracy and reduce the risk of overfitting by averaging their results. Run this code.
- (Code Block 5B) This code trains the regressor using the training data you gave it. Run this code (this is like pressing play to let the computer do its job and learn from the examples we’ve given it).
- (Code Block 5C) This code block prints the predictions made by our random forest model. The output is formatted as a 2D array with three columns, where each column represents predictions of dissolved oxygen levels for one day, one week, and four weeks in the future, respectively.
- (Code Block 5D) This code block prints the actual dissolved oxygen levels in the same format as Code Block 5C. Compare the outputs of this block with the previous block. How close are the values to each other? You will do a closer analysis in the following steps.
6. Evaluating the Model
- (Code Block 6A) This code block calculates the Mean Absolute Error (MAE) to assess the accuracy of the model’s predictions.
- The MAE shows how close predictions are to actual values by averaging the size of errors. A lower MAE means better accuracy in predictions.
- For example, an MAE of 0.13 indicates that, on average, the model's predictions differ from the actual values by 0.13 units.
- (Code Block 6B) This code block calculates the Mean Absolute Percentage Error (MAPE) to assess the accuracy of the model’s predictions.
- The MAPE measures the average absolute percentage error between predicted and actual values. It provides insight into the accuracy of the model in terms of relative error, expressed as a percentage. Lower MAPE values indicate better model performance.
- For example, a MAPE of 5% means that, on average, the predictions are off by 5% from the actual values. This metric is particularly useful when you want to understand the error in relative terms, making it easier to compare the performance across different datasets or models.
7. Visualize the Data
- (Code Block 7A) When you run this code block, three graphs will be created, showing the model’s predictions of dissolved oxygen levels for one day, one week, and four weeks in the future.
- The graph shows the actual dissolved oxygen level on the x-axis and the predicted dissolved oxygen level on the y-axis. So, if a point is at 8.5 on the x-axis and the y-axis, that point was classified correctly. If a point is at 8.5 and at 9.2 on the y-axis, it is classified incorrectly.
- The R2 (R-squared) value is a statistical measure that assesses how well the regression predicts the actual data points.
- R2 = 1: Perfect prediction. The model accurately predicts future dissolved oxygen levels with no errors.
- R2 = 0: Poor prediction. The model’s predictions are no better than simply using the average dissolved oxygen as a guess.
- 0 < R2 < 1: This range indicates how much of the variation in future dissolved oxygen levels the model can explain. For example, an R2 value of 0.75 means the model can predict 75% of the variation in dissolved oxygen levels beyond what would be predicted by just using the average value. In a perfect scenario, where the model is 100% accurate, the dots on the graph would create a straight diagonal line from the bottom left to the top right.
- How accurate were your model’s predictions? How close were your R2 values to 1? How did the predictions and R2 values change as your model looked further into the future?
- (Code Block 7B) When you run this code block, it will create a picture or visualization of one of the decision trees in our random forest model. The characteristics are arranged from the most important for determining the future dissolved oxygen level at the top to the least important for determining the future dissolved oxygen level at the bottom. At the top of each box in the decision tree, you will see a characteristic and a number, such as Average DO <= 9.4. This example means that if the dissolved oxygen level is less than 9.4, it will go to the left side of the tree; otherwise, it will go to the right side of the decision tree.
- There are two values you can change:
- On line 2 of the code block, you will see a variable called ‘tree_number.’ You can change this number to be between 0 and 100 to view the different decision trees in the random forest.
- One line 5 of the code block, you will see a variable called ‘max_depth.’ You can change this number to view more or less of the decision tree.
- Explore some of these decision trees. What water quality properties were most important for predicting future dissolved oxygen? Why do you think this might be?
- There are two values you can change:
- (Code Block 7C) This code calculates and sorts feature importances from the random forest model, then creates a horizontal bar plot to visualize each feature’s importance. This helps identify which features most influence the model’s predictions.
- What were the top three features that were most and least important in predicting future dissolved oxygen levels? Why do you think this might be the case?
Ask an Expert
Global Goals
The United Nations Sustainable Development Goals (UNSDGs) are a blueprint to achieve a better and more sustainable future for all.
Variations
- Experiment with a single decision tree to assess whether a single decision tree can achieve performance comparable to that of a random forest. You can use the DecisionTreeRegressor from the sklearn library.
- Repeat the project by collecting data from multiple areas instead of one. Analyze if the model's accuracy decreases when exposed to more diverse and variable data.
- Make the model predict further into the future (e.g. 2 weeks, 3 weeks, 8 weeks, 12 weeks ahead).
- Add more water quality properties, such as discharge, chlorophyll fluorescence (fChl), nitrate, etc.
- See how removing certain water quality properties affects the random forest model's accuracy.
- Expand or shorten the time period of the training data. How much data is really necessary to accurately predict future water quality at different time steps?
- Adjust the model to predict a different water quality property (e.g. predict pH levels instead of dissolved oxygen).
Careers
If you like this project, you might enjoy exploring these related careers:














