Exploring the Impact of Social Media on Mental Health with Machine Learning

Abstract

Do you ever wonder if you spend too much time online? How can the amount of time spent on social media influence someone's mental health? What other factors play a role? In this science project, you will investigate which variables correlate with better mental health scores using a machine learning algorithm called the Random Forest algorithm.

Summary

Areas of Science

Artificial Intelligence
Human Biology & Health
Human Behavior

Difficulty

Method

Scientific Method

Time Required

Very Short (≤ 1 day)

Prerequisites

None

Material Availability

Readily available

Cost

Very Low (under $20)

Safety

No issues

Credits

Tracey Ngo, Science Buddies

Laura Ohl, PhD, Science Buddies Alumni

Science Buddies is committed to creating content authored by scientists and educators. Learn more about our process and how we use AI.

https://www.youtube.com/watch?v=hlCc1Yoiv5c

Objective

In this science project, students will investigate which variables most impact mental health scores using a machine learning algorithm called Random Forest.

Introduction

Content Warning: This page includes topics about mental health, which may be upsetting to some people. If you are struggling with your mental health, remember you are not alone. There is always someone ready to listen, and it's normal and okay to reach out to someone to talk about your health.

If you or someone you know needs emotional support, please call the National Mental Health Hotline (866-903-3787) or 988 Lifeline (https://988lifeline.org/).

Have you ever noticed how a lack of sleep impacts your mental health? What about your use of social media platforms? How does that impact your mental health?

The U.S. Department of Health and Human Services released an advisory on the impact of social media on the mental health of children and adolescents. The advisory notes that social media is a part of everyday life for most young people. Some reports even showed that over 90% of people aged 13-17 use social media. Could the average daily usage of social media or particular platforms be associated with worsened mental health? Multiple studies have shown that adolescents who spend over 3 hours a day on social media platforms have double the risk of mental health problems. These problems typically present as symptoms of depression and anxiety. What about how we use social media, whether to build relationships or initiate conflict, or whether we become addicted to using it? Until we figure out what about social media makes it less safe for adolescents, the Surgeon General has recommended limiting its use to improve overall mental health.

Interestingly, datasets have been collected on adolescents to better understand the relationship between social media use and overall mental health. The dataset we recommend for this project includes many variables (e.g., addiction score, age, academic performance, etc.) to analyze and understand which factors are associated with better mental health scores. What variable do you think will impact mental health scores the most?

In this science project, you will use a Random Forest Regressor model to investigate how these different variables are related to mental health scores. Random Forest is especially useful for this type of data because it can handle both numerical and categorical variables and capture complex nonlinear relationships and interactions between features. Unlike basic linear regression, which assumes a straight-line relationship between each input and the target, Random Forest can model more realistic, complex patterns in the data without those strict assumptions, making it a powerful tool for this analysis.

Watch this video to learn more about Random Forests:

https://www.youtube.com/watch?v=gkXX4h3qYm4

Before you get started, it's essential to understand the limitations of using these types of datasets and the machine learning algorithm we recommend for this project. Most data collected from human studies includes self-reporting. This common practice in clinical and human trials requires the participant, or their caretaker, to rely on their own memory and record keeping of their behaviors, in this case, the use of technology. While this can introduce some bias at the data collection stage, studies have shown that most people answer these questions as accurately as possible. Furthermore, these types of studies are necessary, since long-term monitoring of daily behaviors is more expensive and time-consuming than asking people to self-report their behaviors to the best of their abilities.

When using datasets, particularly with Artificial Intelligence (AI) algorithms that use correlation analyses, it's important to remember that correlation doesn't mean causation. For example, just because your data analysis shows a correlation between the amount of sleep and mental health scores, it doesn't necessarily mean that disturbed sleep causes mental health issues. Instead, you need a randomized controlled trial to test this relationship and remove confounding variables. Mental health is a multi-dimensional biological process. For example, having a mental health condition could predispose someone to having worse sleep, as opposed to less sleep causing a mental health disorder. Therefore, it's essential to understand that we can't draw strong conclusions about what causes better or worse mental health just from this analysis alone. We would need additional human research studies to investigate this further and understand the relationship between these variables at a deeper level.

Terms and Concepts

Mental health
Social media
Mental health crisis
Depression
Anxiety
Addiction
Random Forest Regressor
Numerical variable
Categorical variable
Machine learning
Self-reporting
Artificial Intelligence (AI)
Correlation
Causation
Confounding variables
Label encoding
One-Hot Encoding
Mean Absolute Error (MAE)
Mean Squared Error (MSE)
R² (R-squared) value

Questions

What is mental health?
What are social media platforms, and how many are there?
Why is there a mental health crisis in the US?
How does social media impact your everyday life?
How much do youth use social media, and how do you think that impacts their mental health?
How can machine learning and AI be used to better understand the mental health crisis?
Why isn’t correlation the same as causation?
What are the limitations of self-reporting? What are the advantages?
What is a confounding variable?

Bibliography

This is the dataset we will be using:

Adil Shamim. (2025, April). Students' Social Media Addiction. Kaggle. Retrieved June 25, 2025.

To learn more about the mental health crisis:

U.S. Health and Human Services. (2025, February 19). Social media and Youth Mental Health. Retrieved June 16, 2025.

To learn more about encoding:

Misra Turp (2023, February 10). Quick explanation: One-hot encoding. YouTube. Retrieved June 25, 2025.
StatQuest with Josh Starmer. (2023, February 12). One-Hot, Label, Target, and K-Fold Target Encoding, Clearly Explained!!!. YouTube. Retrieved June 25, 2025.

To learn more about the Random Forest algorithm:

IBM Technology. (2022, February 7). What is Random Forest?. YouTube. Retrieved June 25, 2025.
scikit-learn. (n.d.). RandomForestRegressor. Retrieved June 25, 2025.

Materials and Equipment

Computer with Internet access.

Experimental Procedure

This project follows the Scientific Method. Review the steps before you begin.

Overview

This project aims to explore how social media habits and other daily factors relate to students' mental health scores. Using a dataset with variables like daily social media use, sleep hours, addiction scores, and much more, you will train a Random Forest Regressor to predict mental health scores based on these factors. The goal is to identify which variables are most associated with better or worse mental health scores.

Participant Reflection Survey

Before starting the analysis, take a moment to answer these questions as if you were a participant in the study:

How old are you?
What is your gender?
What is your academic level? (High school, undergraduate, or graduate)
On average, how many hours do you spend on social media each day?
What is your most-used platform? (e.g., TikTok, YouTube, Instagram)
Do you think your social media use has affected your academic performance? (Yes or No)
On average, how many hours do you sleep each night?
How would you rate your mental health on a scale from 1 to 10? (1 = poor, 10 = excellent)
What is your relationship status? (Single, In a relationship, or Complicated)
How many relationship conflicts have you had due to social media?
How addicted do you feel you are to social media on a scale from 1 to 10? (1 = not at all addicted, 10 = extremely addicted)

Reflecting on these questions will help you better understand the variables in the dataset and think critically about how your own habits might relate to mental health.

Setting Up the Google Colab Environment

You will need a Google account. If you do not have one, make one when prompted.
Download the social_media_addiction.ipynb file from Science Buddies. This is the code you will need to process your data.
Download the social_media_addiction.csv file from Science Buddies.
1. Open the dataset to familiarize yourself with its structure by going to your Downloads and double-clicking on the file.
2. You will see that the very last column (column M) is called Mental_Health_Score, which we will try to predict. This column is a self-rated integer from 1 (poor) to 10 (excellent), indicating overall mental well-being, allowing assessment of potential associations with social media habits and other variables.
3. In columns A-L, you will see other variables such as Avg_Daily_Usage_Hours and Sleep_Hours_Per_Night. Each of these factors will be described in more detail as you preprocess the data.
Within your Google Drive, click on ‘MyDrive,’ then create a new folder and rename it social_media_addiction. Inside the folder, upload both the social_media_addiction.ipynb file and the social_media_addiction.csv file.
Double click on the social_media_addiction.ipynb file. This should automatically open in Google Colab.
1. Read the Troubleshooting Tips and How to Use This Notebook sections. Follow the instructions you find in those sections.
2. Run the block under Importing Libraries to ensure you have access to all the functions we will use for this project.

Loading the Data into a Pandas DataFrame

(Code Block 1A) Run this code block to create a DataFrame, like a table, that will be used to load and manipulate the data in the notebook. The data from your .csv file will populate a table below the code block.
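
As a rough sketch of what a loading step like Code Block 1A might look like (the notebook's actual code may differ), pandas can read a CSV into a DataFrame. The rows below are made-up toy values, used only so the example runs on its own. In the notebook, you would pass the path to your social_media_addiction.csv file instead of the toy text.

```python
import io
import pandas as pd

# Made-up toy rows standing in for social_media_addiction.csv (not real survey data).
toy_csv = io.StringIO(
    "Student_ID,Age,Gender,Sleep_Hours_Per_Night,Mental_Health_Score\n"
    "1,19,Female,6.5,6\n"
    "2,22,Male,7.5,8\n"
    "3,20,Female,5.0,5\n"
)

# pd.read_csv accepts a file path or a file-like object.
df = pd.read_csv(toy_csv)

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # the first few rows, shown as a table
```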

Preprocessing

Dropping Columns

(Code Block 2A) First, we will drop features that will be uninformative for modeling. In this case, we will be dropping the Student_ID and Country features.
1. This is because Student_ID is a unique integer identifier assigned to each survey respondent. It does not carry any inherent predictive value and is simply an arbitrary identifier. Including it as a feature could potentially introduce noise (random or irrelevant information that can disrupt the meaningful patterns or signals in data).
2. Similarly, we will drop the Country column because some countries only had a few participants, making the data sparse and potentially introducing bias or misleading associations due to small sample sizes.
3. Run this code block to remove these uninformative features before continuing.
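
If you are curious how dropping columns works, here is a minimal sketch using pandas. The DataFrame holds made-up toy values, and the code is an illustration under that assumption rather than the notebook's exact code.

```python
import pandas as pd

# Toy DataFrame with made-up values; the column names follow the project dataset.
df = pd.DataFrame({
    "Student_ID": [1, 2, 3],
    "Country": ["USA", "India", "Canada"],
    "Age": [19, 22, 20],
    "Mental_Health_Score": [6, 8, 5],
})

# Remove the arbitrary identifier and the sparse Country column.
df = df.drop(columns=["Student_ID", "Country"])

print(df.columns.tolist())
```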

Label Encoding

When working with machine learning algorithms, it is important to convert categorical variables into numbers, since most algorithms operate on numbers rather than words. Categorical variables represent qualities like gender or group, and cannot be used directly because they lack a numerical form that algorithms can work with. Label encoding is a way to convert categorical variables into non-negative whole numbers, assigning each category a unique integer value (0, 1, 2, etc.).

(Code Block 2B) In this code block, we have provided the code to find the unique values (distinct types in the data) in the Gender column. You will use this same code again later to find unique values for other columns as well. You will see two genders in the output when you run this code (note: this particular dataset did not include additional or nonbinary gender options). You will assign these genders an integer value in the next code block.
(Code Block 2C) In this code block, you will map each Gender to a number. Under the #TODO comment, make sure to assign both Male and Female to a distinct number before moving on. It does not matter whether you map Male to 0 and Female to 1 or vice versa, as long as each gender has a unique numeric value.
1. Be sure to use quotes around each category name and separate them using commas.
2. If you get an error when you run the code block, double-check that you do not have any typos and have used commas and quotes correctly.
3. Here is an example of how the code should look.
  mapping = {"Male": 0, "Female": 1}
(Code Block 2D) In this empty code block, refer to Code Block 2B, and use the code provided there as a model to help you write the code to find the unique values of the Academic_Level column. Make sure that the column name is between either single or double quotes. Here is an example of how the code should look.
df['Academic_Level'].unique()
(Code Block 2E) In this code block, you will map each Academic_Level to a number. Academic_Level is the highest level of education the respondent is currently enrolled in. Order these from lowest to highest.
(Code Block 2F) In this empty code block, refer to Code Block 2B, and use the code provided there as a model to help you write the code to find the unique values of the Affects_Academic_Performance column.
(Code Block 2G) In this code block, you will map each Affects_Academic_Performance value to a number.
1. Note: By common coding conventions, 0 typically means ‘No’ and 1 typically means ‘Yes.’
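
The label encoding steps above can be sketched as follows. This is an illustration on made-up toy rows, not the notebook's code: the variable names gender_mapping, level_mapping, and affects_mapping are hypothetical, and the category spellings are assumptions, so always check them against the output of unique() in your own notebook.

```python
import pandas as pd

# Made-up toy rows; category spellings are assumed and may differ in the real file.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Academic_Level": ["Graduate", "High School", "Undergraduate"],
    "Affects_Academic_Performance": ["Yes", "No", "Yes"],
})

# Step 1: look at the distinct categories in a column.
print(df["Gender"].unique())

# Step 2: write one dictionary per column that maps each category to an integer.
gender_mapping = {"Male": 0, "Female": 1}
level_mapping = {"High School": 0, "Undergraduate": 1, "Graduate": 2}  # lowest to highest
affects_mapping = {"No": 0, "Yes": 1}

# Step 3: replace the words with their numbers.
# Note: .map() produces NaN for any category missing from the dictionary.
df["Gender"] = df["Gender"].map(gender_mapping)
df["Academic_Level"] = df["Academic_Level"].map(level_mapping)
df["Affects_Academic_Performance"] = df["Affects_Academic_Performance"].map(affects_mapping)

print(df)
```

If a column shows NaN values after mapping, a category name in the dictionary probably does not match the spelling in the data.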

One-Hot Encoding

Sometimes, it makes more sense to use one-hot encoding instead of label encoding. If categorical data does not have an inherent order, using label encoding can mislead the model into assuming a false ordinal relationship. For example, in our label encoding earlier, it made sense to assign "High School," "Undergraduate," and "Graduate," to 0, 1, and 2, since they are in chronological order and correspond to the person's age. However, for our Most_Used_Platform variable, there is no specific order that makes more sense than another for the different social media platforms. Therefore, one-hot encoding makes more sense for this variable. One-hot encoding works by creating a new binary column for each category, where a 1 indicates the presence of that category and a 0 indicates its absence, allowing the model to treat each category as distinct without implying any ranking or order. Check out this video for a quick explanation of one-hot encoding.

(Code Block 2H) In this code block, we will view our DataFrame again. Identify the columns that are categorical but have not been encoded yet.
(Code Block 2I) In this code block, use the unique() function again to see the values of the columns that have not been encoded yet.
(Code Block 2J) In this code block, under the #TODO comment, add in the columns you want to one-hot encode.
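
To see what one-hot encoding does to a table, here is a small sketch using the pandas get_dummies function on made-up toy rows. The notebook may use a different approach, so treat this as one possible way to do it.

```python
import pandas as pd

# Made-up toy rows for illustration only.
df = pd.DataFrame({
    "Most_Used_Platform": ["TikTok", "Instagram", "TikTok", "YouTube"],
    "Sleep_Hours_Per_Night": [6.0, 7.5, 5.5, 8.0],
})

# Replace the text column with one 0/1 column per category.
encoded = pd.get_dummies(df, columns=["Most_Used_Platform"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row has a 1 in exactly one of the new Most_Used_Platform_ columns, and the original text column is gone.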

Splitting to Train and Test

(Code Block 3A) Separating the Dataset into Inputs and Target: This code block separates the dataset into two parts:
1. X contains all the feature columns except Mental_Health_Score.
2. y contains the Mental_Health_Score column, which we will predict based on the other features.
(Code Block 3B) Splitting the Training and Test Data: Splitting data into training and testing sets is essential in machine learning. It helps to see how well your model works on new data. Watch this video to learn more about why we split datasets. Pay attention to how X and y look after this step, as well as the sizes of X_train, X_test, y_train, and y_test. For example, if you see the X_train shape as (564, 23), that means there are 564 samples (students) in the training data, each with 23 features (e.g., Age, Gender, Academic_Level, etc.). Remember that one-hot encoding will increase the number of features (e.g., Most_Used_Platform becomes multiple columns: Most_Used_Platform_Tiktok, Most_Used_Platform_LinkedIn, etc.)
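
The two steps above can be sketched with scikit-learn's train_test_split function. The data here is randomly generated, and the 80/20 split and the random_state value are assumptions chosen for illustration; the notebook's settings may differ.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Randomly generated toy data (not real survey responses).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Sleep_Hours_Per_Night": rng.uniform(4, 9, size=100),
    "Addicted_Score": rng.integers(1, 11, size=100),
    "Mental_Health_Score": rng.integers(1, 11, size=100),
})

# X holds every feature column; y holds the column we want to predict.
X = df.drop(columns=["Mental_Health_Score"])
y = df["Mental_Health_Score"]

# Hold back 20% of the rows for testing (an assumed split size).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (rows, features) for each part
print(y_train.shape, y_test.shape)
```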

Training the Model

(Code Block 4A) We have provided the code to train a Random Forest Regressor model. Run this code block. (Note: There are both Random Forest Classifier and Regressor models–since we are predicting a numerical variable, we are using the Regressor).
(Code Block 4B) In this code block, we are using our trained model to estimate mental health scores for the participants in our test dataset.
1. The model has already learned how different features (like age, sleep hours, academic performance, etc.) relate to Mental_Health_Score from the training data.
2. Now, we give it the features from the test dataset (all columns except Mental_Health_Score), and it uses what it learned to predict what each participant's mental health score would be.
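
Here is a minimal sketch of training a Random Forest Regressor and making predictions with scikit-learn. It uses a synthetic dataset with a deliberately nonlinear pattern, and the settings n_estimators=100 and random_state=42 are assumptions for illustration, not necessarily what the notebook uses.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: the target follows a curved (nonlinear) pattern in the first feature.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 3 * np.sin(X[:, 0]) + 5 + rng.normal(0, 0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train on the training rows only.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict the target for rows the model has never seen.
y_pred = model.predict(X_test)
print(y_pred[:5])
```

Because each tree predicts an average of training targets, a Random Forest's predictions always stay within the range of target values it saw during training.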

Evaluating the Model

(Code Block 5A) This code block calculates the Mean Absolute Error (MAE) to assess the accuracy of the model’s predictions.
1. The MAE shows how close predictions are to actual values by averaging the size of errors. A lower MAE means better prediction accuracy.
2. For example, an MAE of 1 indicates that, on average, the model’s predictions differ from the actual values by 1 unit. So if the model predicts a Mental_Health_Score of 7 for a student, the actual score is typically around 6 to 8. Because the MAE is an average, individual predictions can be off by more or less than 1.
(Code Block 5B) This code block calculates the Mean Squared Error (MSE) to assess the accuracy of the model’s predictions.
1. The MSE measures the average squared difference between predicted and actual values. It provides insight into the accuracy of the model by penalizing larger errors more heavily, as the differences are squared. Lower MSE values indicate better model performance.
2. For example, an MSE of 0.04 means that, on average, the squared difference between the predicted and actual value is 0.04. This metric is particularly useful when you want to minimize large errors, as the squaring emphasizes their impact, making it ideal for regression tasks where larger deviations are more significant.
(Code Block 5C) This code block calculates the R² (R-squared) value, which is a statistical measure that assesses how well the regression predicts the actual data points. Remember that, unlike basic linear regression, which assumes a straight-line relationship between each input and target, Random Forest can capture complex, non-linear patterns. With Random Forest, R² still measures how well the model predicts the data and reflects how well all the trees of the Random Forest combined fit the data's complex patterns.
1. R² = 1: Perfect prediction. The model accurately predicts Mental_Health_Score with no errors.
2. R² = 0: Poor prediction. The model’s predictions are no better than simply using the average Mental_Health_Score as a guess. (On test data, R² can even be negative if the predictions are worse than that guess.)
3. 0 < R² < 1: This range indicates how much of the variation in Mental_Health_Score the model can explain. For example, an R² value of 0.75 means the model explains 75% of the variability in mental health scores based on the input features.
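
To make the three metrics above concrete, here is a small worked sketch using scikit-learn's metric functions on made-up numbers. The two sets of predictions have the same MAE, but the one with a single large miss has a higher MSE and a lower R², which shows how squaring penalizes large errors.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual scores and two made-up sets of predictions.
y_true = np.array([6, 8, 5, 7])
pred_a = np.array([6, 7, 5, 8])  # two small misses of 1 point each
pred_b = np.array([6, 8, 5, 9])  # one larger miss of 2 points

for name, y_pred in [("A", pred_a), ("B", pred_b)]:
    mae = mean_absolute_error(y_true, y_pred)  # average size of the errors
    mse = mean_squared_error(y_true, y_pred)   # average of the squared errors
    r2 = r2_score(y_true, y_pred)              # 1 - (squared error / spread around the mean)
    print(name, "MAE:", mae, "MSE:", mse, "R2:", round(r2, 2))

# Worked by hand: A has MAE 0.5, MSE 0.5, R2 0.6; B has MAE 0.5, MSE 1.0, R2 0.2.
```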
(Code Block 5D) This code block displays a feature importance graph showing the top features the model considers most important when making predictions. Each horizontal bar represents a feature, and the length of the bar reflects how much that feature contributed to the model’s decisions. Features at the top (with longer bars) are more influential, while those further down are less important. This helps identify which inputs matter most to the model.
1. To see more features, under the #TODO comment, simply increase the number of features.
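
As a sketch of where a feature importance graph can get its numbers, a trained scikit-learn Random Forest exposes a feature_importances_ attribute. The data below is synthetic and built so that Addicted_Score matters most; Random_Noise is a hypothetical column added for comparison, and top_n is a hypothetical name for the number of features to show.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features (not real survey data).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Sleep_Hours_Per_Night": rng.uniform(4, 9, size=300),
    "Addicted_Score": rng.uniform(1, 10, size=300),
    "Random_Noise": rng.normal(size=300),  # unrelated to the target by construction
})
# Synthetic target built mostly from Addicted_Score.
y = (10 - 0.8 * X["Addicted_Score"]
     + 0.2 * X["Sleep_Hours_Per_Night"]
     + rng.normal(0, 0.2, size=300))

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# One importance value per feature; scikit-learn scales them to sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

top_n = 2  # increase this to see more features
print(importances.head(top_n))
```

Keep in mind that a high importance only means the model relied on that feature for its predictions. It does not show that the feature causes changes in the target.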
(Code Block 5G) This code block displays a graph that shows how a chosen feature (like average daily screen time) relates to mental health scores. Each point is a person, and the red line shows the trend. If it slopes down, the feature may be linked to lower mental health. If it slopes up, the feature may be linked to higher mental health. You can change the feature name under the #TODO comment to explore relationships with other variables. You may change it to one of these options: Avg_Daily_Usage_Hours, Sleep_Hours_Per_Night, Conflicts_Over_Social_Media, or Addicted_Score.
1. Note: This graph only works with numerical variables, not categorical ones.
2. If you tried a different numerical feature, how did the trend change?
3. Can we say one causes the other based on this graph? Why or why not?
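
One common way to get a trend line like the one described above is to fit a straight line through the points. The sketch below does this with NumPy's polyfit function on synthetic data that was built to slope downward. How the notebook actually draws its red line is an assumption here.

```python
import numpy as np

# Synthetic data: scores were constructed to fall as usage hours rise.
rng = np.random.default_rng(0)
usage_hours = rng.uniform(1, 8, size=100)
scores = 9 - 0.5 * usage_hours + rng.normal(0, 0.5, size=100)

# Fit a straight line (degree 1): score = slope * hours + intercept.
slope, intercept = np.polyfit(usage_hours, scores, deg=1)

print("slope:", round(slope, 2))
print("intercept:", round(intercept, 2))
```

A negative slope means the trend line goes down as the feature increases. It describes an association in the data, not a cause.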
(Code Block 5H) This code block displays a bar graph that shows the average mental health score for each category within a selected feature, such as academic level. Each bar represents a group (e.g., high school, undergraduate, graduate), and its height shows that group's average mental health score. You can change the feature name under the #TODO comment to explore other categorical variables. You may change it to one of these options: Gender, Academic_Level, or Affects_Academic_Performance.
1. Note: This graph is for categorical features we label encoded earlier, not ones we one-hot encoded.
2. Were you surprised by the differences between categories? Why or why not?
3. How might each category's sample size affect the graph's average?
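
The averages behind a bar graph like this can be computed with a pandas groupby. This sketch uses made-up toy rows and also counts how many rows fall in each group, which helps with the sample size question above.

```python
import pandas as pd

# Made-up toy rows; Academic_Level is already label encoded (0, 1, 2).
df = pd.DataFrame({
    "Academic_Level": [0, 0, 1, 1, 1, 2],
    "Mental_Health_Score": [6, 8, 5, 7, 6, 9],
})

# Average score and number of rows for each category.
summary = df.groupby("Academic_Level")["Mental_Health_Score"].agg(["mean", "count"])
print(summary)
```

In this toy table, group 2 has the highest average, but it rests on a single row, so it tells you very little.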
(Code Block 5I) This code block displays a bar graph of the average mental health score for each category of a one-hot encoded feature, like Most_Used_Platform. Each bar represents a category (e.g., Instagram, TikTok) and its height shows the average score for users in that group. You can change the feature name under the #TODO comment to explore other one-hot encoded categorical variables. You may change it to one of these options: Most_Used_Platform or Relationship_Status.
1. Note: This graph only works with features that were one-hot encoded.
2. What patterns do you notice across different categories?
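
For a one-hot encoded feature, each category lives in its own 0/1 column, so one way to get the per-category average is to select the rows where that column equals 1. The sketch below shows this on made-up toy rows; the variable names prefix and means are hypothetical, and the notebook may compute this differently.

```python
import pandas as pd

# Made-up toy rows with two one-hot encoded platform columns.
df = pd.DataFrame({
    "Most_Used_Platform_Instagram": [1, 0, 0, 1],
    "Most_Used_Platform_TikTok": [0, 1, 1, 0],
    "Mental_Health_Score": [7, 5, 6, 9],
})

prefix = "Most_Used_Platform_"

# For each one-hot column, average the scores of the rows marked with a 1.
means = {
    col[len(prefix):]: df.loc[df[col] == 1, "Mental_Health_Score"].mean()
    for col in df.columns
    if col.startswith(prefix)
}

print(means)
```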
Now that you have completed your analysis, take a moment to think about the data.
1. Which variables were most strongly correlated with higher mental health scores?
2. Which variables were most strongly correlated with lower mental health scores?
3. Were there any variables that seemed to have little or no correlation with mental health?
4. Can you think of potential causal explanations for any of these correlations? How could these be tested in a proper scientific study?
5. Finally, what other variables do you think could impact mental health that were not included in this dataset?
Reflecting on these questions will help you better understand the findings, their limitations, and how this type of data analysis can inform future research.

Ask an Expert

Do you have specific questions about your science project? Our team of volunteer scientists can help. Our Experts won't do the work for you, but they will make suggestions, offer guidance, and help you troubleshoot.

Global Goals

The United Nations Sustainable Development Goals (UNSDGs) are a blueprint to achieve a better and more sustainable future for all.

This project explores topics key to Good Health and Well-Being: Ensure healthy lives and promote well-being for all at all ages.

Variations

(Code Block 6A) This code block displays a lower-triangle heatmap showing how the top 10 features most positively correlated with Mental_Health_Score relate to each other. You might notice that the Mental_Health_Score row is all red. The heatmap uses color and numbers to show the strength of correlation between pairs of features: red for positive, blue for negative, and gray for neutral, with lighter shades for weaker correlations. Values closer to +1 indicate a strong positive correlation, meaning that as one variable increases, the other also tends to increase. This helps identify which features are not only related to mental health but also potentially redundant with each other. The heatmap shows correlations between all features, but to understand how each variable relates to mental health, focus on the row or column corresponding to Mental_Health_Score.
(Code Block 6B) This code block displays another lower-triangle heatmap showing the top 10 features most negatively correlated with Mental_Health_Score. You might notice that compared to the previous graph, the Mental_Health_Score row is more blue. Each cell displays how strongly two features are linearly related, with values close to -1 (deep blue) indicating a strong negative correlation and values near 0 showing little or no relationship. The heatmap highlights not only how each of these features relates to mental health, but also how they correlate with one another.
Check Category Counts: Investigate how many responses belong to each category. For example, are there more high school students than undergraduates or graduates? Are certain countries or platforms overrepresented in the data?
Filter the Data: If you are interested in a specific group (e.g., only college students or a specific country), filter the dataset to include only that group. This lets you explore how trends change for different populations.
Test for Statistical Significance: Learn how to use t-tests or ANOVA to see if group differences are statistically meaningful.
Look for Outliers or Anomalies: Are there responses with unusually high or low values? What could they tell you?
Tune the Random Forest Model: Experiment with the parameters of your Random Forest (like the number of trees, max_depth, or min_samples_split) to see how they affect model performance.
Find another dataset of human data containing mental health scores. Does this data set have similar or different correlations from the previous dataset? Why do you think that is?
Try a different machine learning model like K-Nearest Neighbors (KNN), Boosted Tree, etc.
Instead of predicting mental health scores, try predicting average sleep hours, academic performance, etc., and compare which features are most important for different outcomes.
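
For the correlation heatmaps described in Code Blocks 6A and 6B, the underlying numbers can be computed with the pandas corr function. The sketch below ranks features by their correlation with Mental_Health_Score on synthetic data. In this made-up data, sleep and mental health are both driven by the addiction score, so they end up correlated with each other even though neither was built from the other, which is a small illustration of a confounding variable.

```python
import numpy as np
import pandas as pd

# Synthetic data (not real survey responses).
rng = np.random.default_rng(0)
n = 200
addicted = rng.uniform(1, 10, size=n)
df = pd.DataFrame({
    "Addicted_Score": addicted,
    # Sleep depends on the addiction score, not on mental health.
    "Sleep_Hours_Per_Night": 9 - 0.3 * addicted + rng.normal(0, 0.5, size=n),
    "Age": rng.integers(16, 25, size=n),
})
# Mental health also depends only on the addiction score.
df["Mental_Health_Score"] = 10 - 0.7 * addicted + rng.normal(0, 0.5, size=n)

# Correlation of every feature with the target, most negative first.
corr_with_target = (
    df.corr()["Mental_Health_Score"]
    .drop("Mental_Health_Score")
    .sort_values()
)
print(corr_with_target)
```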

Careers

If you like this project, you might enjoy exploring these related careers:

Data Scientist

Career Profile

Many aspects of people's daily lives can be summarized using data, from what is the most popular new video game to where people like to go for a summer vacation. Data scientists (sometimes called data analysts) are experts at organizing and analyzing large sets of data (often called "big data"). By doing this, data scientists make conclusions that help other people or companies. For example, data scientists could help a video game company make a more profitable video game based on players'…

Psychologist

Career Profile

Why people take certain actions can often feel like a mystery. Psychologists help solve these mysteries by investigating the physical, cognitive, emotional, or social aspects of human behavior and the human mind. Some psychologists also apply these findings in order to design better products or to help people change their behaviors.

Cite This Page

General citation information is provided here. Be sure to check the formatting, including capitalization, for the method you are using and update your citation, as needed.

MLA Style

Ngo, Tracey, and Laura Ohl. "Exploring the Impact of Social Media on Mental Health with Machine Learning." Science Buddies, 5 Aug. 2025, https://www.sciencebuddies.org/science-fair-projects/project-ideas/ArtificialIntelligence_p029/artificial-intelligence/social_media?from=Blog. Accessed 23 July 2026.

APA Style

Ngo, T., & Ohl, L. (2025, August 5). Exploring the Impact of Social Media on Mental Health with Machine Learning. Retrieved from https://www.sciencebuddies.org/science-fair-projects/project-ideas/ArtificialIntelligence_p029/artificial-intelligence/social_media?from=Blog

Last edit date: 2025-08-05
