Exploring the Impact of Social Media on Mental Health with Machine Learning
Abstract
Do you ever wonder if you spend too much time online? How can the amount of time spent on social media influence someone's mental health? What other factors play a role? In this science project, you will investigate which variables correlate with better mental health scores using a machine learning algorithm called the Random Forest algorithm.
Objective
In this science project, students will investigate which variables most impact mental health scores using a machine learning algorithm called Random Forest.
Introduction
Content Warning: This page includes topics about mental health, which may be upsetting to some people. If you are struggling with your mental health, remember you are not alone. There is always someone ready to listen, and it's normal and okay to reach out to someone to talk about your health.
If you or someone you know needs emotional support, please call the National Mental Health Hotline (866-903-3787) or 988 Lifeline (https://988lifeline.org/).
Have you ever noticed how a lack of sleep impacts your mental health? What about your use of social media platforms? How does that impact your mental health?
The U.S. Department of Health and Human Services released an advisory on the impact of social media on the mental health of children and adolescents. It notes that social media is a part of everyday life for most young people; some reports show that over 90% of people aged 13-17 use social media. Could the average daily usage of social media, or the use of particular platforms, be associated with worsened mental health? Multiple studies have shown that adolescents who spend over 3 hours a day on social media platforms have double the risk of mental health problems, which typically present as symptoms of depression and anxiety. What about the ways we use social media: to build relationships, to initiate conflict, or to the point of becoming addicted to it? Until researchers figure out what makes social media less safe for adolescents, the Surgeon General has recommended limiting its use to improve overall mental health.
Interestingly, datasets have been collected on adolescents to better understand the relationship between social media use and overall mental health. The dataset we recommend for this project includes many variables (e.g., addiction score, age, academic performance, etc.) to analyze and understand which factors are associated with better mental health scores. What variable do you think will impact mental health scores the most?
In this science project, you will use a Random Forest Regressor model to investigate how these different variables are related to mental health scores. Random Forest is especially useful for this type of data because it can handle both numerical and categorical variables and capture complex nonlinear relationships and interactions between features. Unlike basic linear regression, which assumes a straight-line relationship between each input and the target, Random Forest can model more realistic, complex patterns in the data without those strict assumptions, making it a powerful tool for this analysis.
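To make the contrast with linear regression concrete, here is a minimal sketch on made-up data (not the project dataset). It assumes scikit-learn is installed; the U-shaped relationship, the sample sizes, and the variable names are all invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic (made-up) data with a U-shaped relationship: y = x^2 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=400)
y = x**2 + rng.normal(0, 0.3, size=400)
X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature array

# Use the first 300 points for training and the last 100 for testing.
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Fit a straight-line model and a Random Forest on the same training data.
linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Compare how well each model predicts the held-out test points.
linear_r2 = r2_score(y_test, linear.predict(X_test))
forest_r2 = r2_score(y_test, forest.predict(X_test))
print(f"Linear regression R^2: {linear_r2:.2f}")
print(f"Random Forest R^2:     {forest_r2:.2f}")
```

Because a straight line cannot follow a U shape, the linear model should score far lower than the Random Forest on this toy example.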
Watch this video to learn more about Random Forests:
Before you get started, it's essential to understand the limitations of using these types of data sets and the machine learning algorithm we recommend for this project. Most data collected from human studies includes self-reporting. This common practice in clinical and human trials requires the participant, or their caretaker, to rely on their own memory and record keeping of their behaviors, in this case, the use of technology. While this may have some bias in the data collection stage, studies have shown that most people relay answers to these questions as accurately as possible. Furthermore, these types of studies are necessary, since long-term monitoring of daily behaviors is expensive and more time-consuming than self-reporting our behaviors to the best of our abilities.
When using data sets, particularly with Artificial Intelligence (AI) algorithms that rely on correlation analyses, it's important to remember that correlation doesn't mean causation. For example, just because your data analysis shows a correlation between the amount of sleep and mental health scores, it doesn't necessarily mean that disturbed sleep causes mental health issues. Instead, you would need a randomized controlled trial to test this relationship and remove confounding variables. Mental health is a multi-dimensional biological process. For example, having a mental health condition could predispose someone to worse sleep, rather than less sleep causing a mental health disorder. Therefore, it's essential to understand that we can't draw strong conclusions about what causes better or worse mental health from this analysis alone. We would need additional human research studies to investigate this further and understand the relationship between these variables at a deeper level.
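The following simulation is one way to see how a confounding variable can produce a correlation without causation. Everything in it is hypothetical: the "stress" confounder, the coefficients, and the sample size are invented, and the data is randomly generated rather than taken from any study.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Hypothetical hidden confounder (for example, overall stress level).
stress = rng.normal(0, 1, size=n)

# Both variables depend on stress, but NOT on each other.
sleep_hours = 7 - 0.8 * stress + rng.normal(0, 0.5, size=n)
mental_health = 6 - 1.5 * stress + rng.normal(0, 0.5, size=n)

# The raw correlation looks strong even though neither variable causes the other.
raw_corr = np.corrcoef(sleep_hours, mental_health)[0, 1]


def residual(values, confounder):
    """Remove the part of `values` that a straight-line fit on `confounder` explains."""
    slope, intercept = np.polyfit(confounder, values, 1)
    return values - (slope * confounder + intercept)


# After accounting for stress, the leftover correlation is close to zero.
adjusted_corr = np.corrcoef(
    residual(sleep_hours, stress), residual(mental_health, stress)
)[0, 1]

print(f"Correlation before adjusting for stress: {raw_corr:.2f}")
print(f"Correlation after adjusting for stress:  {adjusted_corr:.2f}")
```

In a real survey the confounder is usually not measured, which is why a randomized study is needed to separate cause from correlation.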
Terms and Concepts
- Mental health
- Social media
- Mental health crisis
- Depression
- Anxiety
- Addiction
- Random Forest Regressor
- Numerical variable
- Categorical variable
- Machine learning
- Self-reporting
- Artificial Intelligence (AI)
- Correlation
- Causation
- Confounding variables
- Label encoding
- One-Hot Encoding
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- R² (R-squared) value
Questions
- What is mental health?
- What are social media platforms, and how many are there?
- Why is there a mental health crisis in the US?
- How does social media impact your everyday life?
- How much do youth use social media, and how do you think that impacts their mental health?
- How can machine learning and AI be used to better understand the mental health crisis?
- Why isn’t correlation the same as causation?
- What are the limitations of self-reporting? What are the advantages?
- What is a confounding variable?
Bibliography
This is the dataset we will be using:
- Adil Shamim. (2025, April). Students' Social Media Addiction. Kaggle. Retrieved June 25, 2025.
To learn more about the mental health crisis:
- U.S. Health and Human Services. (2025, February 19). Social media and Youth Mental Health. Retrieved June 16, 2025.
To learn more about encoding:
- Misra Turp (2023, February 10). Quick explanation: One-hot encoding. YouTube. Retrieved June 25, 2025.
- StatQuest with Josh Starmer. (2023, February 12). One-Hot, Label, Target, and K-Fold Target Encoding, Clearly Explained!!!. YouTube. Retrieved June 25, 2025.
To learn more about the Random Forest algorithm:
- IBM Technology. (2022, February 7). What is Random Forest?. YouTube. Retrieved June 25, 2025.
- scikit-learn. (n.d.). RandomForestRegressor. Retrieved June 25, 2025.
Materials and Equipment
- Computer with Internet access.
Experimental Procedure

Overview
This project aims to explore how social media habits and other daily factors relate to students' mental health scores. Using a dataset with variables like daily social media use, sleep hours, addiction scores, and much more, you will train a Random Forest Regressor to predict mental health scores based on these factors. The goal is to identify which variables are most associated with better or worse mental health scores.
Participant Reflection Survey
Before starting the analysis, take a moment to answer these questions as if you were a participant in the study:
- How old are you?
- What is your gender?
- What is your academic level? (High school, undergraduate, or graduate)
- On average, how many hours do you spend on social media each day?
- What is your most-used platform? (e.g., TikTok, YouTube, Instagram)
- Do you think your social media use has affected your academic performance? (Yes or No)
- On average, how many hours do you sleep each night?
- How would you rate your mental health on a scale from 1 to 10? (1 = poor, 10 = excellent)
- What is your relationship status? (Single, In a relationship, or Complicated)
- How many relationship conflicts have you had due to social media?
- How addicted do you feel you are to social media on a scale from 1 to 10? (1 = not at all addicted, 10 = extremely addicted)
Reflecting on these questions will help you better understand the variables in the dataset and think critically about how your own habits might relate to mental health.
Setting Up the Google Colab Environment
- You will need a Google account. If you do not have one, make one when prompted.
- Download the social_media_addiction.ipynb file from Science Buddies. This is the code you will need to process your data.
- Download the social_media_addiction.csv file from Science Buddies.
- Open the dataset to familiarize yourself with its structure by going to your Downloads and double-clicking on the file.
- You will see that the very last column (column M) is called Mental_Health_Score, which we will try to predict. This column is a self-rated integer from 1 (poor) to 10 (excellent) indicating overall mental well-being, which allows us to assess potential associations with social media habits and other variables.
- In columns A-L, you will see other variables such as Avg_Daily_Usage_Hours and Sleep_Hours_Per_Night. Each of these factors will be described in more detail as you preprocess the data.
- Within your Google Drive, click on ‘MyDrive,’ then create a new folder and rename it social_media_addiction. Inside the folder, upload both the social_media_addiction.ipynb file and the social_media_addiction.csv file.
- Double click on the social_media_addiction.ipynb file. This should automatically open in Google Colab.
- Read the Troubleshooting Tips and How to Use This Notebook sections. Follow the instructions you find in those sections.
- Run the block under Importing Libraries to ensure you have access to all the functions we will use for this project.
Loading the Data into a Pandas DataFrame
- (Code Block 1A) Run this code block to create a DataFrame, like a table, that will be used to load and manipulate the data in the notebook. The data from your .csv file will populate a table below the code block.
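Loading a CSV file into a DataFrame usually looks something like the sketch below. The notebook's actual code may differ; here a tiny made-up table stands in for social_media_addiction.csv so the example runs on its own, and only a few of the dataset's columns are shown.

```python
import io

import pandas as pd

# Tiny made-up stand-in for social_media_addiction.csv so the example runs anywhere.
# In the notebook you would pass the path of the uploaded file to read_csv instead.
sample_csv = io.StringIO(
    "Student_ID,Age,Gender,Avg_Daily_Usage_Hours,Sleep_Hours_Per_Night,Mental_Health_Score\n"
    "1,19,Female,5.2,6.5,6\n"
    "2,22,Male,2.1,7.5,8\n"
    "3,20,Female,6.0,5.0,5\n"
)

df = pd.read_csv(sample_csv)

print(df.head())  # the first rows of the table
print(df.shape)   # (number of rows, number of columns)
```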
Preprocessing
Dropping Columns
- (Code Block 2A) First, we will drop features that will be uninformative for modeling. In this case, we will be dropping the Student_ID and Country features.
- This is because Student_ID is a unique integer identifier assigned to each survey respondent. It does not carry any inherent predictive value and is simply an arbitrary identifier. Including it as a feature could potentially introduce noise (random or irrelevant information that can disrupt the meaningful patterns or signals in data).
- Similarly, we will drop the Country column because some countries only had a few participants, making the data sparse and potentially introducing bias or misleading associations due to small sample sizes.
- Run this code block to remove these uninformative features before continuing.
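Dropping columns in pandas can be sketched as follows. The small DataFrame is made up for illustration, and the notebook may write this step differently.

```python
import pandas as pd

# Made-up rows that include the two columns we want to remove.
df = pd.DataFrame({
    "Student_ID": [1, 2, 3],
    "Country": ["USA", "India", "Canada"],
    "Age": [19, 22, 20],
    "Mental_Health_Score": [6, 8, 5],
})

# drop() returns a new DataFrame without the listed columns.
df = df.drop(columns=["Student_ID", "Country"])

print(df.columns.tolist())
```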
Label Encoding
When working with machine learning algorithms, it is important to convert categorical variables into numbers, since most algorithms can only do math on numbers, not words. Categorical variables represent qualities like gender or group, and cannot be used directly because they lack a numerical form that algorithms can work with. Label encoding is a way to convert categorical variables into whole numbers, assigning each category a unique integer value (0, 1, 2, etc.).
- (Code Block 2B) In this code block, we have provided the code to find the unique values (distinct types in the data) in the Gender column. You will use this same code again later to find unique values for other columns as well. You will see two genders in the output when you run this code (note: this particular dataset did not include additional or nonbinary gender options). You will assign these genders an integer value in the next code block.
- (Code Block 2C) In this code block, you will map each Gender to a number. Under the #TODO comment, make sure to assign both Male and Female to a distinct number before moving on. It does not matter whether you map Male to 0 and Female to 1 or vice versa, as long as each gender has a unique numeric value.
- Be sure to use quotes around each category name and separate them using commas.
- If you get an error when you run the code block, double-check that you do not have any typos and have used commas and quotes correctly.
- Here is an example of how the code should look.
mapping = {"Male": 0, "Female": 1}
- (Code Block 2D) In this empty code block, refer to Code Block 2B, and use the code provided there as a model to help you write the code to find the unique values of the Academic_Level column. Make sure that the column name is between either single or double quotes. Here is an example of how the code should look.
df['Academic_Level'].unique()
- (Code Block 2E) In this code block, you will map each Academic_Level to a number. Academic_Level is the highest level of education the respondent is currently enrolled in. Order these from lowest to highest.
- (Code Block 2F) In this empty code block, refer to Code Block 2B, and use the code provided there as a model to help you write the code to find the unique values of the Affects_Academic_Performance column.
- (Code Block 2G) In this code block, you will map each Affects_Academic_Performance value to a number.
- Note: By common coding conventions, 0 typically means ‘No’ and 1 typically means ‘Yes.’
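The label encoding steps above can be sketched with the pandas map() method. The rows below are made up, and the exact category spellings ("High School", "Undergraduate", "Graduate", "Yes", "No") are assumptions; check the output of unique() in your own notebook, because any value missing from the mapping becomes NaN.

```python
import pandas as pd

# Made-up rows; the real dataset has many more students and columns.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Academic_Level": ["Undergraduate", "High School", "Graduate"],
    "Affects_Academic_Performance": ["Yes", "No", "Yes"],
})

# Step 1: look at the distinct categories in a column.
print(df["Academic_Level"].unique())

# Step 2: map each category to an integer.
# Academic level has a natural order, so number it from lowest to highest.
level_mapping = {"High School": 0, "Undergraduate": 1, "Graduate": 2}
df["Academic_Level"] = df["Academic_Level"].map(level_mapping)

# For yes/no and two-category columns, any two distinct numbers work.
df["Affects_Academic_Performance"] = df["Affects_Academic_Performance"].map({"No": 0, "Yes": 1})
df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})

print(df)
```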
One-Hot Encoding
Sometimes, it makes more sense to use one-hot encoding instead of label encoding. If categorical data does not have an inherent order, using label encoding can mislead the model into assuming a false ordinal relationship. For example, in our label encoding earlier, it made sense to assign "High School," "Undergraduate," and "Graduate," to 0, 1, and 2, since they are in chronological order and correspond to the person's age. However, for our Most_Used_Platform variable, there is no specific order that makes more sense than another for the different social media platforms. Therefore, one-hot encoding makes more sense for this variable. One-hot encoding works by creating a new binary column for each category, where a 1 indicates the presence of that category and a 0 indicates its absence, allowing the model to treat each category as distinct without implying any ranking or order. Check out this video for a quick explanation of one-hot encoding.
- (Code Block 2H) In this code block, we will view our DataFrame again. Identify the columns that are categorical but have not been encoded yet.
- (Code Block 2I) In this code block, use the unique() function again to see the values of the columns that have not been encoded yet.
- (Code Block 2J) In this code block, under the #TODO comment, add in the columns you want to one-hot encode.
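One common way to one-hot encode in pandas is the get_dummies() function, sketched below on made-up rows. Whether the notebook uses get_dummies() or another tool is an assumption here; the idea of one new 0/1 column per category is the same either way.

```python
import pandas as pd

# Made-up rows with one categorical column that has no natural order.
df = pd.DataFrame({
    "Most_Used_Platform": ["TikTok", "Instagram", "YouTube", "TikTok"],
    "Sleep_Hours_Per_Night": [6.0, 7.5, 8.0, 5.5],
})

# Replace the original column with one 0/1 column per platform.
encoded = pd.get_dummies(df, columns=["Most_Used_Platform"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row has a 1 in exactly one of the new platform columns, so no platform is treated as "greater" than another.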
Splitting to Train and Test
- (Code Block 3A) Separating the Dataset into Inputs and Target: This code block separates the dataset into two parts:
- X contains all the feature columns except Mental_Health_Score.
- y contains the Mental_Health_Score column, which we will predict based on the other features.
- (Code Block 3B) Splitting the Training and Test Data: Splitting data into training and testing sets is essential in machine learning. It helps to see how well your model works on new data. Watch this video to learn more about why we split datasets. Pay attention to how X and y look after this step, as well as the sizes of X_train, X_test, y_train, and y_test. For example, if you see the X_train shape as (564, 23), that means there are 564 samples (students) in the training data, each with 23 features (e.g., Age, Gender, Academic_Level, etc.). Remember that one-hot encoding will increase the number of features (e.g., Most_Used_Platform becomes multiple columns: Most_Used_Platform_TikTok, Most_Used_Platform_LinkedIn, etc.).
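Separating inputs from the target and splitting into training and test sets can be sketched as below. The data is randomly generated, and the 80/20 split and random_state value are assumptions; the notebook may use different settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Randomly generated stand-in data: 100 students, 2 features, 1 target.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Avg_Daily_Usage_Hours": rng.uniform(1, 8, size=100),
    "Sleep_Hours_Per_Night": rng.uniform(4, 9, size=100),
    "Mental_Health_Score": rng.integers(1, 11, size=100),
})

# X holds every column except the target; y holds only the target.
X = df.drop(columns=["Mental_Health_Score"])
y = df["Mental_Health_Score"]

# Hold out 20% of the rows for testing (an assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Shapes are (number of students, number of features).
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```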
Training the Model
- (Code Block 4A) We have provided the code to train a Random Forest Regressor model. Run this code block. (Note: There are both Random Forest Classifier and Regressor models. Since we are predicting a numerical variable, we are using the Regressor.)
- (Code Block 4B) In this code block, we are using our trained model to estimate mental health scores for the participants in our test dataset.
- The model has already learned how different features (like age, sleep hours, academic performance, etc.) relate to Mental_Health_Score from the training data.
- Now, we give it the features from the test dataset (all columns except Mental_Health_Score), and it uses what it learned to predict what each participant's mental health score would be.
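Training and predicting with scikit-learn's RandomForestRegressor follows a fit-then-predict pattern, sketched below. The data is simulated with a made-up rule (more sleep and less usage give a higher score), and the n_estimators and random_state values are assumptions rather than the notebook's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Simulated data: 200 students with two numerical features.
rng = np.random.default_rng(1)
n = 200
usage = rng.uniform(1, 8, size=n)
sleep = rng.uniform(4, 9, size=n)

# Made-up rule for the score, kept within the 1-10 range.
score = 5 + 0.8 * (sleep - 6.5) - 0.6 * (usage - 4.5) + rng.normal(0, 0.5, size=n)
score = np.clip(score, 1, 10)

X = np.column_stack([usage, sleep])
X_train, X_test, y_train, y_test = train_test_split(
    X, score, test_size=0.2, random_state=42
)

# Train: the forest learns patterns from the training features and scores.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict: the model sees only the test features, never the test scores.
y_pred = model.predict(X_test)

print(y_pred[:5])   # first five predicted scores
print(y_test[:5])   # the matching actual scores
```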
Evaluating the Model
- (Code Block 5A) This code block calculates the Mean Absolute Error (MAE) to assess the accuracy of the model’s predictions.
- The MAE shows how close predictions are to actual values by averaging the size of errors. A lower MAE means better prediction accuracy.
- For example, an MAE of 1 indicates that, on average, the model’s predictions differ from the actual values by 1 unit. If the model predicts a student's Mental_Health_Score to be 7, the actual Mental_Health_Score is typically within about 1 point of that (between 6 and 8), although individual errors can be smaller or larger.
- (Code Block 5B) This code block calculates the Mean Squared Error (MSE) to assess the accuracy of the model’s predictions.
- The MSE measures the average squared difference between predicted and actual values. It provides insight into the accuracy of the model by penalizing larger errors more heavily, as the differences are squared. Lower MSE values indicate better model performance.
- For example, an MSE of 0.04 means that, on average, the squared difference between the predicted and actual value is 0.04. This metric is particularly useful when you want to minimize large errors, as the squaring emphasizes their impact, making it ideal for regression tasks where larger deviations are more significant.
- (Code Block 5C) This code block calculates the R² (R-squared) value, a statistical measure of how well the model's predictions match the actual data points. Remember that, unlike basic linear regression, which assumes a straight-line relationship between each input and target, Random Forest can capture complex, non-linear patterns. With Random Forest, R² still measures how well the model predicts the data, reflecting how well all the trees of the Random Forest combined fit the data's complex patterns.
- R² = 1: Perfect prediction. The model predicts Mental_Health_Score with no errors.
- R² = 0: Poor prediction. The model’s predictions are no better than simply using the average Mental_Health_Score as a guess.
- 0 < R² < 1: This range indicates how much of the variation in Mental_Health_Score the model can explain. For example, an R² value of 0.75 means the model explains 75% of the variability in mental health scores based on the input features.
- (Code Block 5D) This code block displays a feature importance graph showing the top features the model considers most important when making predictions. Each horizontal bar represents a feature, and the length of the bar reflects how much that feature contributed to the model’s decisions. Features at the top (with longer bars) are more influential, while those further down are less important. This helps identify which inputs matter most to the model.
- To see more features, under the #TODO comment, simply increase the number of features.
- (Code Block 5G) This code block displays a graph that shows how a chosen feature (like average daily screen time) relates to mental health scores. Each point is a person, and the red line shows the trend. If it slopes down, the feature may be linked to lower mental health. If it slopes up, the feature may be linked to higher mental health. You can change the feature name under the #TODO comment to explore relationships with other variables. You may change it to one of these options: Avg_Daily_Usage_Hours, Sleep_Hours_Per_Night, Conflicts_Over_Social_Media, or Addicted_Score.
- Note: This graph only works with numerical variables, not categorical ones.
- If you tried a different numerical feature, how did the trend change?
- Can we say one causes the other based on this graph? Why or why not?
- (Code Block 5H) This code block displays a bar graph that shows the average mental health score for each category within a selected feature, such as academic level. Each bar represents a group (e.g., high school, undergraduate, graduate), and its height shows that group's average mental health score. You can change the feature name under the #TODO comment to explore other categorical variables. You may change it to one of these options: Gender, Academic_Level, or Affects_Academic_Performance.
- Note: This graph is for categorical features we label encoded earlier, not ones we one-hot encoded.
- Were you surprised by the differences between categories? Why or why not?
- How might each category's sample size affect the graph's average?
- (Code Block 5I) This code block displays a bar graph of the average mental health score for each category of a one-hot encoded feature, like Most_Used_Platform. Each bar represents a category (e.g., Instagram, TikTok) and its height shows the average score for users in that group. You can change the feature name under the #TODO comment to explore other one-hot encoded categorical variables. You may change it to one of these options: Most_Used_Platform or Relationship_Status.
- Note: This graph only works with features that were one-hot encoded.
- What patterns do you notice across different categories?
- Now that you have completed your analysis, take a moment to think about the data.
- Which variables were most strongly correlated with higher mental health scores?
- Which variables were most strongly correlated with lower mental health scores?
- Were there any variables that seemed to have little or no correlation with mental health?
- Can you think of potential causal explanations for any of these correlations? How could these be tested in a proper scientific study?
- Finally, what other variables do you think could impact mental health that were not included in this dataset?
- Reflecting on these questions will help you better understand the findings, their limitations, and how this type of data analysis can inform future research.
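To see where the three metrics from Code Blocks 5A-5C come from, the sketch below computes MAE, MSE, and R² by hand on four made-up predictions and checks them against scikit-learn's functions. The scores are invented for illustration and are not results from the project dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted mental health scores for four students.
y_true = np.array([7.0, 5.0, 8.0, 6.0])
y_pred = np.array([6.5, 5.5, 7.0, 6.0])

errors = y_true - y_pred  # 0.5, -0.5, 1.0, 0.0

# MAE: average size of the errors, ignoring their sign.
mae_manual = np.mean(np.abs(errors))

# MSE: average of the squared errors, so large errors count more.
mse_manual = np.mean(errors**2)

# R^2: 1 minus (model's squared error / squared error of always guessing the mean).
ss_res = np.sum(errors**2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The same metrics using scikit-learn.
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(f"MAE: {mae:.3f}  MSE: {mse:.3f}  R^2: {r2:.3f}")
```

For these four students the MAE is 0.5, the MSE is 0.375, and R² is 0.7. Notice that the single 1-point error contributes more to the MSE than the two half-point errors combined.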
Variations
- (Code Block 6A) This code block displays a lower-triangle heatmap showing how the top 10 features most positively correlated with Mental_Health_Score relate to each other. You might notice that the Mental_Health_Score row is all red. The heatmap uses color and numbers to show the strength of correlation between pairs of features: red for strong positive, gray for neutral, blue for negative, and lighter shades for weaker correlations. Values closer to +1 indicate a strong positive correlation, meaning that as this variable increases, the mental health score also tends to increase. This helps identify which features are not only related to mental health but also potentially redundant with each other. The heatmap shows correlations between all features, but to understand how each variable relates to mental health, focus on the row or column corresponding to Mental_Health_Score.
- (Code Block 6B) This code block displays another lower-triangle heatmap showing the top 10 features most negatively correlated with Mental_Health_Score. You might notice that compared to the previous graph, the Mental_Health_Score row is more blue. Each cell displays how strongly two features are linearly related, with values close to -1 (deep blue) indicating a strong negative correlation and values near 0 showing little or no relationship. The heatmap highlights not only how each of these features relates to mental health, but also how they correlate with one another.
- Check Category Counts: Investigate how many responses belong to each category. For example, are there more high school students than undergraduates or graduates? Are certain countries or platforms overrepresented in the data?
- Filter the data. If you are interested in a specific group (e.g., only college students or a specific country), filter the dataset to include only that group. This lets you explore how trends change for different populations.
- Test for Statistical Significance: Learn how to use t-tests or ANOVA to see if group differences are statistically meaningful.
- Look for Outliers or Anomalies: Are there responses with unusually high or low values? What could they tell you?
- Tune the Random Forest Model: Experiment with the parameters of your Random Forest (like the number of trees, max_depth, or min_samples_split) to see how they affect model performance.
- Find another dataset of human data containing mental health scores. Does this data set have similar or different correlations from the previous dataset? Why do you think that is?
- Try a different machine learning model like K-Nearest Neighbors (KNN), Boosted Tree, etc.
- Instead of predicting mental health scores, try predicting average sleep hours, academic performance, etc., and compare which features are most important for different outcomes.
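The numbers shown in the heatmaps from Code Blocks 6A and 6B are pairwise correlations. A minimal sketch of how such values can be computed and ranked with pandas is below; the data is simulated with a made-up rule, so the specific values say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Simulated data: sleep raises the score and addiction lowers it (made-up rule).
rng = np.random.default_rng(7)
n = 300
sleep = rng.uniform(4, 9, size=n)
addicted = rng.uniform(1, 10, size=n)
score = 5 + 0.8 * (sleep - 6.5) - 0.5 * (addicted - 5) + rng.normal(0, 0.5, size=n)

df = pd.DataFrame({
    "Sleep_Hours_Per_Night": sleep,
    "Addicted_Score": addicted,
    "Mental_Health_Score": score,
})

# corr() returns the full table of pairwise correlations (the heatmap's numbers).
corr_table = df.corr()

# Keep only the Mental_Health_Score column, drop its correlation with itself (always 1),
# and sort from most negative to most positive.
target_corr = (
    corr_table["Mental_Health_Score"]
    .drop("Mental_Health_Score")
    .sort_values()
)

print(target_corr)
```

The correlation of Mental_Health_Score with itself is always exactly 1, which is why its own cell in a heatmap shows the strongest positive color.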