Build Personal Playlists with Machine Learning

Abstract

AI has the power to organize data into categories—even when it is not given clear instructions. In this project, we will create an unsupervised K-Means learning model that organizes songs into distinct groups based on their similarities. The best part? You don't need advanced coding skills; just bring an open mind and a desire to explore the fascinating intersection of music and technology. Are you ready to embark on this exciting journey?

Summary

Time Required: Short (2-5 days)
Prerequisites: None
Material Availability: Readily available
Cost: Very Low (under $20)
Safety: No issues

Objective

Create an unsupervised K-Means learning model that categorizes songs into a set number of groups, then evaluate the accuracy of the song recommendation function.

Introduction

Imagine it is after school and you are preparing for a study session. As you set up your study area, create a plan, and gather the necessary materials, you open a music streaming platform on your device. Initially, you contemplate playing your go-to study playlist, but you decide to explore the platform's personalized playlist recommendations instead.

This gets you thinking about the intricate process behind the platform's song suggestions. It's fascinating: With a vast catalog spanning millions of songs, the platform consistently recommends tracks that match your musical tastes. Music services like Spotify can do this with the help of artificial intelligence.

Artificial Intelligence (AI) is a branch of computer science focused on building tools that can solve problems and analyze information. Machine learning is a subdivision of AI. Its goal is to create tools that can learn and improve over time using data. Machine learning encompasses various techniques, including supervised learning, in which models are trained on labeled data to make predictions, and unsupervised learning, in which models uncover patterns and relationships within data without explicit labels.

K-Means is an unsupervised machine learning algorithm that groups similar data points together into clusters, with the goal of minimizing the differences within each cluster and maximizing the differences between clusters. It is commonly used for tasks like recognizing patterns and making recommendations.
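
To make the idea concrete, here is a minimal sketch of the two steps K-Means repeats: assign every point to its nearest cluster center, then move each center to the average of its points. The toy data, the function name simple_kmeans, and the hand-picked starting centers are assumptions made for illustration; this is not the code from the project notebook.

```python
import numpy as np

def simple_kmeans(points, centers, n_iter=10):
    """Tiny K-Means loop: assign points to the nearest center, then move centers."""
    centers = centers.astype(float).copy()
    for _ in range(n_iter):
        # Distance from every point to every center (shape: n_points x k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        for k in range(len(centers)):
            if np.any(labels == k):
                centers[k] = points[labels == k].mean(axis=0)
    return labels, centers

# Toy data: two obvious groups of 2-D points (made up for illustration).
points = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                   [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
start = points[[0, 3]]  # one starting center from each group, for determinism
labels, centers = simple_kmeans(points, start)
print(labels)   # cluster number assigned to each point
print(centers)  # final cluster centers
```

Real implementations usually pick the starting centers automatically and stop once the centers no longer move.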

Compared with sorting data by hand, K-Means is much faster and can pick up patterns across many features at once that are hard for people to see. It applies the same rule to every data point, so its results are consistent, and it works with both small and large datasets. When the data changes, the model can simply be rerun to produce updated groups.

However, K-Means has some limitations. Selecting the appropriate number of clusters (K) can be challenging and subjective. K-Means also struggles with datasets that contain outliers or features of varying importance. Despite these limitations, K-Means is a valuable tool when used carefully and in combination with preprocessing techniques that enhance its performance.

K-Means exemplifies the ability of machine learning algorithms to uncover patterns and make data-driven decisions. Companies and researchers are interested in K-Means for a variety of uses. Companies like Spotify or Netflix, for instance, may use K-Means to automatically generate recommendations for their users, enhancing user engagement and satisfaction.

In this project, we will give you the basic code for implementing the K-Means model. Your job is to use this code and see how well it can recommend songs that you will enjoy.

The dataset you will be using contains information about 24 features of songs on Spotify.

Taken together, these 24 features are intended to describe how a song sounds and its appeal to listeners. These factors are crucial for artists and music platforms in creating and recommending music that resonates with listeners.

Experimental Procedure

This project follows the Scientific Method. Review the steps before you begin.

Setting up the Google Colab Environment

  1. You will need a Google account. If you do not have one, make one when prompted.
  2. Download the spotify.ipynb file from Science Buddies.
  3. Upload the file to Google Colaboratory (you will need to sign in to your Google account at this point or make an account).
  4. Read the Troubleshooting Tips and How to Use This Notebook sections. Follow the instructions you find there.
  5. Run the code block under Importing Libraries to bring in all of the functions.
Coding Tip:

Occasionally, your Runtime may get disconnected and your local variables will be lost. If you find yourself getting NameErrors such as: name 'variable' is not defined, then you have two options.

  • Run all of the cells by clicking on 'Runtime' at the top of the notebook then click 'Run all', or
  • click on the current cell you are working on, then click 'Runtime' and 'Run before'.

Preprocessing the Dataset

Preprocessing a dataset is a critical step in the machine learning process. It involves preparing the raw data before feeding it into a machine learning algorithm. Preprocessing serves several important purposes, and we will explain and demonstrate one at a time.

  1. Dropping Features: First, we will drop features that we think will be uninformative for modeling. In this case, we will drop Artist, Track, Album, Album_type, Title, Channel, Licensed, official_video, and most_playedon.
    1. Features like Artist, Track (name), Album (name), and Title are usually unique values and, in general, they do not tell us much about how the song sounds. They will not help the program group the data.
    2. The other features, such as whether a song was licensed or which platform it was played on most, are also unlikely to correlate with the listener's musical tastes.

    We have provided the code to delete certain columns from our Pandas DataFrame—a two-dimensional and highly flexible data structure provided by the Pandas library in Python. Add in the names of the columns you want to delete, making sure they match exactly as they are written in the dataset. Then, run the code in the cell.

  2. Dropping NaN Values: NaN stands for Not a Number. NaN values typically indicate missing or incomplete data. When we remove these values, datasets become compatible with algorithms that require completely numerical data. Statistical analyses become more accurate, and the risk of biased predictions due to incomplete information is reduced.

    We have provided the code to drop NaN values. Run the code in the cell. Notice the shape of the DataFrame before and after dropping NaN values.

  3. Normalizing/Scaling Features: Since K-Means is a distance-based algorithm, it is crucial to normalize or scale the features to ensure that all features contribute equally to the distance calculations. If features are not scaled or normalized, those with larger numerical ranges can unfairly dominate the calculations.

    For instance, if Energy values range from 0 to 1 and Views values range from 0 to 8 billion, the larger range for Views could cause Views to overshadow the Energy feature in distance calculations.

    1. Normalization: This is the process of scaling all the values in a dataset to a similar range. The goal is to bring the values of different features or variables to a common scale so that they are directly comparable and do not introduce bias or distortion in the analysis. The technique we will be using is called Min-Max scaling. It will bring the values between 0 and 1.
    2. Scaling: In essence, scaling ensures that no single feature overwhelms the distance calculations. This enhances the algorithm's fairness in considering all features, resulting in more accurate predictions.

    As in the previous step, we have provided the code to normalize certain columns from our Pandas DataFrame. After using the describe() function, we can see that some features are already within the range of 0 to 1, including Danceability, Energy, Speechiness, Acousticness, Instrumentalness, Liveness, and Valence, so we can exclude those features from this step. We will be normalizing our other numerical variables, which in this case are Loudness, Tempo, Duration_min, Views, Likes, and Comments. Add these names to the list specified by the comment in the code. Then, run the code in the cell.
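
The three preprocessing steps above can be sketched on a tiny stand-in table. The values below are made up, only a handful of the dataset's column names are used, and the notebook's own code may be organized differently. Min-Max scaling is written out by hand here as (x - min) / (max - min).

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for the Spotify DataFrame (values are not real data).
df = pd.DataFrame({
    "Track": ["Song A", "Song B", "Song C", "Song D"],
    "Artist": ["W", "X", "Y", "Z"],
    "Energy": [0.8, 0.4, 0.6, 0.9],
    "Tempo": [120.0, 90.0, np.nan, 150.0],
    "Views": [1_000.0, 8_000_000.0, 50_000.0, 4_000_500.0],
})

# 1. Drop columns that do not describe how a song sounds.
features = df.drop(columns=["Track", "Artist"])

# 2. Drop rows with missing (NaN) values and compare the shapes.
print("before:", features.shape)
features = features.dropna()
print("after:", features.shape)

# 3. Min-Max scale the wide-range columns into the 0-1 interval.
to_scale = ["Tempo", "Views"]
col_min = features[to_scale].min()
col_max = features[to_scale].max()
features[to_scale] = (features[to_scale] - col_min) / (col_max - col_min)
print(features)
```

After scaling, Tempo and Views sit on the same 0-to-1 scale as Energy, so no single column dominates the distance calculations.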

Clustering the Data

There are various strategies to determine the optimal number of clusters. One of the techniques we'll explore is called the Elbow method. This method involves creating a plot of the inertia as a function of the number of clusters. Inertia is the sum of squared distances from each data point to the center of its cluster, so lower values mean tighter clusters. Inertia always falls as more clusters are added, so within this plot we look for the point where the decrease starts to slow down, resembling the bend of an elbow. This point indicates a potential optimal number of clusters for the given dataset.

We have provided the code for a function that works out the optimal number of clusters. Run the code in these cells and observe the graph. Where in the graph is there a big change in inertia? (Find the "elbow" part of the graph.)
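
As a rough illustration of what an elbow calculation involves, the sketch below fits K-Means for several values of K and records the inertia each time. It assumes scikit-learn is available and uses synthetic data with three obvious groups rather than the Spotify dataset, so the numbers will not match the notebook's graph.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data: three well-separated blobs (not the Spotify dataset).
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

inertias = []
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances from points to their nearest center.
    inertias.append(model.inertia_)

for k, inertia in zip(range(1, 9), inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

With this toy data the inertia drops sharply up to K = 3 and only slightly afterward, which is the elbow. Real data usually gives a softer bend, so choosing K involves some judgment.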

Applying K-Means Clustering

  1. We have provided the code to make a K-Means model with a specified number of clusters. Input the K value you chose from the previous step into the K-Means model as indicated by the comment.
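
A minimal sketch of this step, assuming scikit-learn's KMeans and a made-up table of already-scaled features. The column name Cluster and the value k = 2 are illustrative choices, not necessarily what the notebook uses.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up, already-scaled feature values (not real Spotify data).
songs = pd.DataFrame({
    "Track": ["A", "B", "C", "D", "E", "F"],
    "Energy": [0.10, 0.15, 0.12, 0.90, 0.85, 0.95],
    "Danceability": [0.20, 0.25, 0.22, 0.80, 0.90, 0.85],
})

k = 2  # the value read off the elbow plot (assumed here)
model = KMeans(n_clusters=k, n_init=10, random_state=0)
# fit_predict learns the cluster centers and returns a cluster number per row.
songs["Cluster"] = model.fit_predict(songs[["Energy", "Danceability"]])
print(songs)
```

The cluster numbers themselves are arbitrary labels; what matters is which songs share the same number.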

Visualizing the Model

  1. Before graphing, we must reduce the dimensionality of the data. Dimensionality refers to the number of features used to describe each item in the dataset. There are many methods for reducing dimensionality. We will use a technique called Principal Component Analysis (PCA).
    1. In the dataset, there are multiple features (dimensions) that describe each instance. In our Spotify dataset, we have features like Danceability, Energy, and Loudness. Visualizing with more than three dimensions is challenging because we cannot easily represent such high-dimensional spaces on a two-dimensional surface like a graph.
    2. PCA helps address this issue by transforming the original features into a new set of features called principal components. These principal components are a linear combination of the original features and capture most of the variability in the data. By using just the first few principal components, we can effectively reduce the dimensionality of the data while retaining as much meaningful information as possible.
    3. In the provided code, PCA is used to reduce the dataset to just two principal components. This means that instead of visualizing the data in its original high-dimensional space, we are now plotting the data points in a two-dimensional space defined by the two most important principal components. This makes it possible to create a clear and interpretable graph of the K-Means clustering.
  2. The graph generated by the provided code shows the clusters found by the K-Means model in this two-dimensional space. These clusters represent the different groups that the songs are categorized into.
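
The dimensionality reduction described above might look like the following sketch, assuming scikit-learn's PCA. The data is random stand-in numbers with five features per song, and plotting is left out to keep the example short.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 50 made-up "songs" with 5 features each (a stand-in for the real dataset).
X = rng.random((50, 5))

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)  # each row is now a point with two coordinates

print(X_2d.shape)
# Fraction of the data's total variance captured by each principal component.
print(pca.explained_variance_ratio_)
```

Each row of X_2d can then be drawn as one point on a scatter plot and colored by its cluster number. The explained variance ratio gives a sense of how much information the two-dimensional picture keeps.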

Creating Our Song Recommendation Function

  1. We have provided the code for the function find_track_index(track_name, df), which finds the index of a given track name in the DataFrame's Track column. If the track name is found, it returns the index; otherwise, it returns None to signify that the track name is absent in the DataFrame. Run the code in this cell.
  2. We have provided the code for the function find_song_recommendation(track_name, df), which takes a track name and a DataFrame containing song information and clusters. It first identifies the cluster to which the input track belongs, then selects songs from the same cluster, and finally generates and prints five song recommendations from that cluster, aiming to suggest similar songs. Run the code in this cell.
  3. Experiment with inputting different track names into the function. When you input the names of songs you like, does the recommender function also output songs you like? In a journal, keep track of how many of the five recommended songs you like. If the recommendations include songs you have not heard before, find them on Spotify or another streaming service and give them a listen.
    1. Repeat this experiment at least five times. Every time, record the proportion of songs that you like. For example, if you liked four of the five songs, write down either 4/5 or 0.8.
    2. Have other people give it a try, and record the proportion of songs that they liked.
    3. Record your results in a table like Table 1.
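| Listener        | Trial 1 | Trial 2 | Trial 3 | Trial 4 | Trial 5 |
|-----------------|---------|---------|---------|---------|---------|
| Listener 1 (me) |         |         |         |         |         |
| Listener 2      |         |         |         |         |         |
| Listener 3      |         |         |         |         |         |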

Table 1. Table for recording what proportion of the five songs each listener liked.
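
For readers curious about what a same-cluster recommender could look like, here is a hedged sketch. The function names find_track_index_sketch and recommend_same_cluster, the Cluster column, and the toy data are assumptions for illustration. The notebook's find_track_index and find_song_recommendation may be written differently, and this version returns a list instead of printing.

```python
import pandas as pd

def find_track_index_sketch(track_name, df):
    """Return the index label of the first row whose Track matches, else None."""
    matches = df.index[df["Track"] == track_name]
    return matches[0] if len(matches) > 0 else None

def recommend_same_cluster(track_name, df, n=5, seed=0):
    """Return up to n other tracks from the same cluster as track_name."""
    idx = find_track_index_sketch(track_name, df)
    if idx is None:
        return []  # track not in the DataFrame
    cluster = df.loc[idx, "Cluster"]
    # Keep songs in the same cluster, excluding the input song itself.
    same = df[(df["Cluster"] == cluster) & (df.index != idx)]
    picks = same.sample(n=min(n, len(same)), random_state=seed)
    return picks["Track"].tolist()

# Made-up songs that alternate between two clusters.
songs = pd.DataFrame({
    "Track": [f"Song {i}" for i in range(12)],
    "Cluster": [0, 1] * 6,
})
print(recommend_same_cluster("Song 0", songs))
```

The seed argument only makes the random picks repeatable; removing it would give different recommendations on each call.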

Creating Our Song Randomizer Function

  1. We have provided the code for the function find_random_song(track_name, df), which takes a track name and a DataFrame that contains song information and clusters. It identifies the cluster to which the input track belongs, then selects songs from different clusters, and finally generates and prints five random song recommendations that are distinct from the provided track. Run the code in this cell.
  2. Experiment with inputting different track names into the function. When you input the name of a song you like, the function should output songs that it thinks are very different from that song. Do you still see songs that you like?
    1. In your lab notebook, record the proportion of songs that you like in another table like Table 1.
    2. Have other people give it a try, and record the proportion of songs that they liked.
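
A comparison function that deliberately picks from other clusters could be sketched as follows. The name pick_from_other_clusters, the Cluster column, and the toy data are hypothetical, and the notebook's find_random_song may differ in its details.

```python
import pandas as pd

def pick_from_other_clusters(track_name, df, n=5, seed=0):
    """Return up to n tracks drawn from clusters other than track_name's."""
    rows = df[df["Track"] == track_name]
    if rows.empty:
        return []  # track not in the DataFrame
    cluster = rows["Cluster"].iloc[0]
    # Keep only songs whose cluster differs from the input song's cluster.
    others = df[df["Cluster"] != cluster]
    picks = others.sample(n=min(n, len(others)), random_state=seed)
    return picks["Track"].tolist()

# Made-up songs that alternate between two clusters.
songs = pd.DataFrame({
    "Track": [f"Song {i}" for i in range(12)],
    "Cluster": [0, 1] * 6,
})
print(pick_from_other_clusters("Song 0", songs))
```

Because these picks come from clusters the model considers different, they act as a baseline for judging whether the same-cluster recommendations are really better.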

Evaluating the Model

  1. We have provided the code for calculating the accuracies of the recommender and random song functions. Input the accuracy rate for each function in the list specified by the comment. There is no need to separate the trials or people.

    For instance, if your data looked like Table 2, you would input the values as follows: recommendations_accuracy = [0.8, 0.6, 0.2, 0.8, 1, 1, 1, 0.8, 0.6, 1].

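| Listener        | Trial 1 | Trial 2 | Trial 3 | Trial 4 | Trial 5 |
|-----------------|---------|---------|---------|---------|---------|
| Listener 1 (me) | 0.8     | 0.6     | 0.2     | 0.8     | 1       |
| Listener 2      | 1       | 1       | 0.8     | 0.6     | 1       |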

Table 2. Example data table.

  2. Compare the accuracies for both of the functions. Did the song recommendations perform significantly better than the random song generator?
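
One simple way to compare the two functions is to average each list of proportions. In the sketch below, the recommender values come from the Table 2 example, while random_accuracy is a hypothetical variable name filled with made-up numbers purely for illustration. The notebook's own evaluation code may compute this differently.

```python
from statistics import mean

# Proportions from Table 2 (both listeners, all trials, in one flat list).
recommendations_accuracy = [0.8, 0.6, 0.2, 0.8, 1, 1, 1, 0.8, 0.6, 1]
# Made-up example values for the random-song function (not real results).
random_accuracy = [0.2, 0.4, 0.2, 0.0, 0.4, 0.6, 0.2, 0.4, 0.2, 0.4]

rec_mean = mean(recommendations_accuracy)
rand_mean = mean(random_accuracy)
print(f"recommender: {rec_mean:.2f}")
print(f"random:      {rand_mean:.2f}")
print(f"difference:  {rec_mean - rand_mean:.2f}")
```

With only a few trials, a gap between the averages could be due to chance, so more listeners and more trials make the comparison more trustworthy.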

Experimenting and Improving

  1. If the song recommendation function was not as accurate as you would like, what do you think we could do to improve the model?
    1. What happens if we drop certain features? For example, we could drop Views since that feature tends to group songs based on popularity. A listener would not necessarily like a song just because it was popular. Which other features could we drop?
    2. Could we increase or decrease the number of clusters? What impact do you think this would have on what songs the model recommends?
  2. Repeat the steps of this project, this time applying the changes you think would improve the model. Can you increase its accuracy?
Variations

  • Find another dataset on Kaggle and perform this project again. How well does the model work on your new data?

Cite This Page

General citation information is provided here. Be sure to check the formatting, including capitalization, for the method you are using and update your citation, as needed.

MLA Style

Ngo, Tracey. "Build Personal Playlists with Machine Learning." Science Buddies, 1 May 2024, https://www.sciencebuddies.org/science-fair-projects/project-ideas/ArtificialIntelligence_p012/artificial-intelligence/K-Means-Spotify. Accessed 22 June 2026.

APA Style

Ngo, T. (2024, May 1). Build Personal Playlists with Machine Learning. Retrieved from https://www.sciencebuddies.org/science-fair-projects/project-ideas/ArtificialIntelligence_p012/artificial-intelligence/K-Means-Spotify


Last edit date: 2024-05-01