Build Personal Playlists with Machine Learning

Abstract

AI has the power to organize data into categories—even when it is not given clear instructions. In this project, we will create an unsupervised K-Means learning model that organizes songs into distinct groups based on their similarities. The best part? You don't need advanced coding skills; just bring an open mind and a desire to explore the fascinating intersection of music and technology. Are you ready to embark on this exciting journey?

Summary

Areas of Science

Artificial Intelligence

Difficulty

Method

Scientific Method

Time Required

Short (2-5 days)

Prerequisites

None

Material Availability

Readily available

Cost

Very Low (under $20)

Safety

No issues

Credits

Tracey Ngo, Science Buddies

Science Buddies is committed to creating content authored by scientists and educators. Learn more about our process and how we use AI.

https://www.youtube.com/watch?v=PnLz0L7LGhE

Objective

Create an unsupervised K-Means learning model that categorizes songs into a set number of groups, then evaluate the accuracy of the song recommendation function.

Introduction

Imagine it is after school and you are preparing for a study session. As you set up your study area, create a plan, and gather the necessary materials, you open a music streaming platform on your device. Initially, you contemplate playing your go-to study playlist, but you decide to explore the platform's personalized playlist recommendations instead.

This gets you thinking about the intricate process behind the platform's song suggestions. It's fascinating: With a vast catalog spanning millions of songs, the platform consistently recommends tracks that match your musical tastes. Music services like Spotify can do this with the help of artificial intelligence.

Artificial Intelligence (AI) is a branch of computer science focused on building tools that can solve problems and analyze information. Machine learning is a subdivision of AI. Its goal is to create tools that can learn and improve over time using data. Machine learning encompasses various techniques, including supervised learning, in which models are trained on labeled data to make predictions, and unsupervised learning, in which models uncover patterns and relationships within data without explicit labels.

K-Means is an unsupervised machine learning algorithm that groups similar data points together into clusters, with the goal of minimizing the differences within each cluster and maximizing the differences between clusters. It is commonly used for tasks like recognizing patterns and making recommendations.
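To make the idea concrete, here is a minimal sketch of the K-Means loop in plain Python. It is not the code from the project notebook: the function name k_means, the toy points, and the fixed starting centers are all assumptions chosen for illustration. The sketch repeats two steps, assigning each point to its nearest center and then moving each center to the average of its points, until nothing changes.

```python
import math


def k_means(points, centers, max_iters=100):
    """Toy K-Means: repeat 'assign to nearest center' and 'move centers'."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iters):
        # Assignment step: label each point with the index of its nearest center.
        labels = [
            min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points
        ]
        # Update step: move each center to the mean of the points assigned to it.
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:  # nothing moved, so the clusters are stable
            break
        centers = new_centers
    return labels, centers


# Hypothetical songs described by two invented values: (Energy, Danceability).
songs = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.25),
         (0.90, 0.80), (0.80, 0.90), (0.85, 0.95)]
labels, centers = k_means(songs, centers=[songs[0], songs[3]])
print(labels)
```

With these toy values, the three low-energy songs end up in one cluster and the three high-energy songs in the other. Real implementations usually pick the starting centers at random, which this sketch skips to keep the result repeatable.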

Compared with sorting data by hand, K-Means saves considerable time and effort. It can find complex patterns that would be hard for people to see, it applies the same rule to every data point, and it works with both small and large datasets. When the data changes, the model can be re-run to produce updated groups automatically. It is also good at handling data with many features.

However, K-Means has some limitations. Selecting the appropriate number of clusters (K) can be challenging and subjective. K-Means also struggles with datasets that contain outliers or features of varying importance. Despite these limitations, K-Means is a valuable tool when used carefully and in combination with preprocessing techniques that enhance its performance.

K-Means exemplifies the ability of machine learning algorithms to uncover patterns and make data-driven decisions. Companies and researchers are interested in K-Means for a variety of uses. Companies like Spotify or Netflix, for instance, may use K-Means to automatically generate recommendations for their users, enhancing user engagement and satisfaction.

Watch this video to learn more about K-Means:

https://www.youtube.com/watch?v=PJGSEttUzx8

In this project, we will give you the basic code for implementing the K-Means model. Your job is to use this code and see how well it can recommend songs that you will enjoy.

The dataset you will be using contains information about 24 features of songs on Spotify:

Artist: The artist's name
Track: The title of the track
Album: The name of the album
Album_type: The type of album (Album, Single, Compilation)
Danceability: A measure of how suitable a song is for dancing. It quantifies the rhythm, tempo, and beat strength of a track. Spotify uses machine learning algorithms to analyze these features and assign a danceability score to each song. The score typically ranges from 0 to 1, where higher values indicate more danceable tracks.
Energy: A measure of the intensity and activity level of a song. It reflects how dynamic and lively a track feels. Energy is calculated by analyzing its loudness and dynamic range. It is assigned a value from 0 to 1.
Loudness: The perceived volume or intensity of a song. Spotify quantifies Loudness using Loudness Units Full Scale (LUFS), a standardized unit of measurement used in audio engineering and music analysis. It takes into account human perception of loudness, making it suitable for music and audio content. The LUFS scale is designed to measure consistent loudness levels across different audio sources.
Speechiness: A measure of the extent to which a segment of audio or an entire song contains spoken words, vocalizations, or non-musical sounds, as opposed to instrumental music. It is a valuable metric for categorizing and understanding the nature of audio content. Speechiness is calculated by analyzing audio frames from a song and classifying them as speech or non-speech based on acoustic features, then calculating the proportion of frames classified as speech. The result is a Speechiness score ranging from 0 to 1, where higher values indicate a greater presence of spoken words or vocal content in the song compared to instrumental music.
Acousticness: A measure of the degree of acoustic instrumentation and arrangements in a song, distinguishing between music dominated by traditional acoustic instruments and that featuring electronic or synthesized sounds. It is calculated by dividing the song's acoustic frames by its non-acoustic (electronic) frames. The result is a numerical Acousticness score, typically ranging from 0 (highly electronic) to 1 (purely acoustic).
Instrumentalness: A numerical measure that quantifies instrumental content in a song, helping differentiate between purely instrumental tracks and those with vocals. To calculate Instrumentalness, the audio waveform is divided into frames, and machine learning algorithms or statistical models classify each frame as instrumental or non-instrumental based on acoustic features. The aggregated result is an Instrumentalness score typically ranging from 0 (vocal-heavy) to 1 (purely instrumental).
Liveness: A numerical measure that gauges the presence of live performance characteristics in a song, distinguishing between studio and live concert recordings. To calculate liveness, audio analysis considers features such as crowd noise, audience reactions, and instrumentation variations. Machine learning models classify audio segments as live or studio based on these features, resulting in a Liveness score typically ranging from 0 (studio recording) to 1 (live performance).
Valence: A numerical measure that quantifies the emotions or mood of a song, indicating whether it conveys positivity or negativity. It is determined through the analysis of various musical elements, including lyrics, melody, harmony, tempo, and instrumentation. Valence scores typically range from 0 (negative or sad) to 1 (positive or happy).
Tempo: A measure of the speed or pace at which a piece of music is performed. It is typically measured in beats per minute (BPM). It quantifies the rhythm and timing of a composition, with a higher BPM indicating a faster tempo and a lower BPM indicating a slower tempo. Tempo is calculated by detecting and counting the beats in music over a one-minute interval.
Duration_min: A measure of how long a song is in minutes
Title: The title of the music video
Channel: The name of the YouTube channel that published the music video
Views: The number of views a music video has on YouTube
Likes: The number of likes a music video has on YouTube
Comments: The number of comments a music video has on YouTube
Licensed: When a song is "licensed," it means that the rights to use that song for specific purposes or in certain contexts have been legally granted by the copyright holder (typically the songwriter, composer, or music publisher) to another party.
official_video: Whether or not a music video is official. An "official" music video is a video authorized and released by the copyright holders, such as the artist or record label. These videos are professionally produced, widely distributed through official channels, and subject to copyright protection, distinguishing them from unofficial or fan-made videos.
Stream: The number of times a track has been played on Spotify
EnergyLiveness: A custom feature for this dataset that combines the Energy and Liveness features
most_playedon: The platform that the track has been played the most on, either Spotify or YouTube

Taken together, these 24 features are intended to describe how a song sounds and its appeal to listeners. These factors are crucial for artists and music platforms in creating and recommending music that resonates with listeners.

Terms and Concepts

Artificial Intelligence (AI)
Machine Learning
Supervised Learning
Unsupervised Learning
K-Means
Noise
Normalization
Scaling
Dimensionality
Principal Component Analysis (PCA)
Cluster

Questions

What is the difference between supervised learning and unsupervised learning?
In simple terms, describe the main goal of the K-Means algorithm.
What are some limitations of the K-Means algorithm?
What is the main focus of this project involving the K-Means model?

Bibliography

StatQuest. (2018, May 23). StatQuest: K-means clustering. Retrieved September 12, 2023.
Kaggle. (2023). Spotify dataset. Retrieved September 12, 2023.
Data Science A-Z for Beginners and Advanced. Part 41 How to Choose the Number of Clusters. Retrieved September 12, 2023.
StatQuest. (2017, December 4). StatQuest: PCA main ideas in only 5 minutes!!!. Retrieved September 12, 2023.

Materials and Equipment

Laptop or desktop computer
Internet access
Lab notebook
Pen or pencil

Experimental Procedure

This project follows the Scientific Method. Review the steps before you begin.

Setting up the Google Colab Environment

You will need a Google account. If you do not have one, make one when prompted.
Download the spotify.ipynb file from Science Buddies.
Upload the file to Google Colaboratory (you will need to sign in to your Google account at this point or make an account).
Read the Troubleshooting Tips and How to Use This Notebook sections. Follow the instructions you find there.
Run the code block under Importing Libraries to bring in all of the functions.

Coding Tip:

Occasionally, your Runtime may get disconnected and your local variables will be lost. If you find yourself getting NameErrors such as: name 'variable' is not defined, then you have two options.

Run all of the cells by clicking on 'Runtime' at the top of the notebook then click 'Run all', or
click on the current cell you are working on, then click 'Runtime' and 'Run before'.

Preprocessing the Dataset

Preprocessing a dataset is a critical step in the machine learning process. It involves preparing the raw data before feeding it into a machine learning algorithm. Preprocessing serves several important purposes, and we will explain and demonstrate each one in turn.

Dropping Features: First, we will drop features that we think will be uninformative for modeling. In this case, we will drop Artist, Track, Album, Album_type, Title, Channel, Licensed, official_video, and most_playedon.
1. Features like Artist, Track (name), Album (name), and Title are usually unique values and, in general, they do not tell us much about how the song sounds. They will not help the program group the data.
2. The other features, such as whether a song was licensed or which platform it was played on most, are also unlikely to correlate with the listener's musical tastes.
We have provided the code to delete certain columns from our Pandas DataFrame—a two-dimensional and highly flexible data structure provided by the Pandas library in Python. Add in the names of the columns you want to delete, making sure they match exactly as they are written in the dataset. Then, run the code in the cell.
Dropping NaN Values: NaN stands for Not a Number. NaN values typically indicate missing or incomplete data. When we remove these values, datasets become compatible with algorithms that require completely numerical data. Statistical analyses become more accurate, and the risk of biased predictions due to incomplete information is reduced.
We have provided the code to drop NaN values. Run the code in the cell. Notice the shape of the DataFrame before and after dropping NaN values.
Normalizing/Scaling Features: Since K-Means is a distance-based algorithm, it is crucial to normalize or scale the features to ensure that all features contribute equally to the distance calculations. If features are not scaled or normalized, those with larger numerical ranges can unfairly dominate the calculations.
For instance, if Energy values range from 0 to 1 and Views values range from 0 to 8 billion, the larger range for Views could cause Views to overshadow the Energy feature in distance calculations.
1. Normalization: This is the process of scaling all the values in a dataset to a similar range. The goal is to bring the values of different features or variables to a common scale so that they are directly comparable and do not introduce bias or distortion in the analysis. The technique we will be using is called Min-Max scaling. It will bring the values between 0 and 1.
2. Scaling: In essence, scaling ensures that no single feature overwhelms the distance calculations. This enhances the algorithm's fairness in considering all features, resulting in more accurate predictions.
As in the previous step, we have provided the code to normalize certain columns of our Pandas DataFrame. Running the describe() function shows that some features are already within the range of 0 to 1, including Danceability, Energy, Speechiness, Acousticness, Instrumentalness, Liveness, and Valence, so we can exclude those features from this step. We will normalize the other numerical variables, which in this case are Loudness, Tempo, Duration_min, Views, Likes, Comments, and EnergyLiveness. Add these names to the list specified by the comment in the code. Then, run the code in the cell.
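The three preprocessing steps above can be sketched on a tiny made-up table. This is an illustration only, not the notebook's code: the values are invented, and the Min-Max formula is written out by hand so that each step is visible, whereas the notebook may use a library helper instead.

```python
import pandas as pd

# Tiny made-up table standing in for the Spotify data (values are invented).
df = pd.DataFrame({
    "Artist": ["A", "B", "C", "D"],
    "Track": ["Song 1", "Song 2", "Song 3", "Song 4"],
    "Energy": [0.2, 0.9, 0.5, 0.7],
    "Tempo": [80.0, 160.0, None, 120.0],
    "Views": [1_000, 8_000_000, 50_000, 4_000_500],
})

# 1. Drop columns that do not describe how a song sounds.
features = df.drop(columns=["Artist", "Track"])

# 2. Drop rows with missing (NaN) values and compare the shapes.
print("before:", features.shape)
features = features.dropna().copy()
print("after:", features.shape)

# 3. Min-Max scale the columns that are not already between 0 and 1:
#    scaled = (value - column minimum) / (column maximum - column minimum)
for column in ["Tempo", "Views"]:
    low, high = features[column].min(), features[column].max()
    features[column] = (features[column] - low) / (high - low)

print(features)
```

After scaling, Tempo and Views both run from 0 to 1, just like Energy, so no single column dominates the distance calculations.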

Clustering the Data

There are various strategies for determining the optimal number of clusters. One technique we will explore is the Elbow method. This method involves plotting the inertia as a function of the number of clusters. Inertia is the sum of the squared distances between each data point and the center of its assigned cluster, so lower inertia means tighter clusters. Because inertia generally keeps decreasing as clusters are added, we look for the point in the plot where the decrease starts to slow down, resembling the bend of an elbow. This point indicates a potential optimal number of clusters for the dataset. To learn more about the Elbow method and how to choose the number of clusters, see the Data Science A-Z entry in the Bibliography.

We have provided the code for a function that works out the optimal number of clusters. Run the code in these cells and observe the graph. Where in the graph is there a big change in inertia? (Find the "elbow" part of the graph.)
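As a rough sketch of what an elbow calculation can look like, the example below fits a scikit-learn KMeans model for several values of K and records the inertia of each. The data is made up (three well-separated blobs of points), and the variable names are assumptions; the notebook's own function may be written differently and will also draw the plot.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: three well-separated blobs of 30 points each.
rng = np.random.default_rng(0)
blob_centers = [(0, 0), (5, 5), (0, 5)]
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in blob_centers])

inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances from each point to its nearest center.
    inertias.append(model.inertia_)

for k, inertia in zip(range(1, 7), inertias):
    print(f"K={k}: inertia={inertia:.1f}")
```

On this toy data the inertia should fall steeply up to K=3 and only slightly after that, which is the elbow. Real song data is rarely this clean, so expect a softer bend that takes some judgment to read.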

Applying K-Means Clustering

We have provided the code to create a K-Means model with a specified number of clusters. Enter the K value you chose in the previous step into the K-Means model where indicated by the comment.
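For orientation, fitting a K-Means model with a chosen K might look like the sketch below. It assumes scikit-learn's KMeans and a column named Cluster for the labels; both the column name and the six invented songs are illustrative, not taken from the notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up, already-scaled features for six songs (values are invented).
features = pd.DataFrame({
    "Energy": [0.10, 0.15, 0.20, 0.80, 0.85, 0.90],
    "Danceability": [0.20, 0.25, 0.10, 0.90, 0.80, 0.95],
})

k = 2  # replace with the K you read off your own elbow plot
model = KMeans(n_clusters=k, n_init=10, random_state=0)

# fit_predict learns the cluster centers and returns one label per row.
features["Cluster"] = model.fit_predict(features[["Energy", "Danceability"]])
print(features)
```

The label numbers themselves carry no meaning. What matters is which songs share a label.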

Visualizing the Model

Before graphing, we must reduce the dimensionality of the data. Dimensionality refers to the number of features used to describe each item in the dataset. There are many methods for reducing dimensionality. We will use a technique called Principal Component Analysis (PCA). To learn more about PCA, see the StatQuest PCA video listed in the Bibliography.
1. In the dataset, there are multiple features (dimensions) that describe each instance. In our Spotify dataset, we have features like Danceability, Energy, and Loudness. Visualizing with more than three dimensions is challenging because we cannot easily represent such high-dimensional spaces on a two-dimensional surface like a graph.
2. PCA helps address this issue by transforming the original features into a new set of features called principal components. Each principal component is a linear combination of the original features, and the components are ordered so that the first few capture most of the variability in the data. By using just the first few principal components, we can effectively reduce the dimensionality of the data while retaining as much meaningful information as possible.
3. In the provided code, PCA is used to reduce the dataset to just two principal components. This means that instead of visualizing the data in its original high-dimensional space, we are now plotting the data points in a two-dimensional space defined by the two most important principal components. This makes it possible to create a clear and interpretable graph of the K-Means clustering.
The graph generated by the provided code shows the clusters found by the K-Means model in a two-dimensional space. These clusters represent the different groups that the songs are categorized into.
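The reduction to two principal components described above can be sketched with scikit-learn's PCA class. The data here is random and purely illustrative, and the plotting step is left out; in the notebook, the two resulting columns would be the x and y positions of each song on the graph.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data: 50 songs described by 6 numeric features.
rng = np.random.default_rng(0)
X = rng.random((50, 6))

pca = PCA(n_components=2)
coords = pca.fit_transform(X)  # one (x, y) pair per song

print(coords.shape)
# Share of the original variance kept by each of the two components.
print(pca.explained_variance_ratio_)
```

The explained variance ratio is a useful check: if the two values add up to a small fraction, the 2D graph hides much of the structure that K-Means actually used, and clusters may look more overlapped than they really are.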

Creating Our Song Recommendation Function

We have provided the code for the function find_track_index(track_name, df), which finds the index of a given track name in the DataFrame's Track column. If the track name is found, it returns the index; otherwise, it returns None to signify that the track name is absent in the DataFrame. Run the code in this cell.
We have provided the code for the function find_song_recommendation(track_name, df), which takes a track name and a DataFrame containing song information and clusters. It first identifies the cluster to which the input track belongs, then selects songs from the same cluster, and finally generates and prints five song recommendations from that cluster, aiming to suggest similar songs. Run the code in this cell.
Experiment with inputting different track names into the function. When you input the names of songs you like, does the recommender function also output songs you like? In your lab notebook, keep track of how many of the five recommended songs you like. If the recommendations include songs you have not heard before, find them on Spotify or another streaming service and give them a listen.
1. Repeat this experiment at least five times. Every time, record the proportion of songs that you like. For example, if you liked four of the five songs, write down either 4/5 or 0.8.
2. Have other people give it a try, and record the proportion of songs that they liked.
3. Record your results in a table like Table 1.

	Trial 1	Trial 2	Trial 3	Trial 4	Trial 5
Listener 1 (me)
Listener 2
Listener 3

Table 1. Table for recording what proportion of the five songs each listener liked.
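To show the idea behind a same-cluster recommendation, here is a small sketch. It is not the notebook's find_song_recommendation function: the name recommend_from_cluster, the n and seed parameters, and the Cluster column name are assumptions made for this example, and the catalog is invented.

```python
import pandas as pd


def recommend_from_cluster(track_name, df, n=5, seed=0):
    """Return up to n other tracks from the same cluster as track_name."""
    matches = df.index[df["Track"] == track_name]
    if len(matches) == 0:
        return None  # track not found
    cluster = df.loc[matches[0], "Cluster"]
    # Keep songs in the same cluster, excluding the input track itself.
    same = df[(df["Cluster"] == cluster) & (df["Track"] != track_name)]
    return same.sample(n=min(n, len(same)), random_state=seed)["Track"].tolist()


# Made-up catalog; track names and cluster labels are invented.
catalog = pd.DataFrame({
    "Track": ["Song A", "Song B", "Song C", "Song D", "Song E", "Song F"],
    "Cluster": [0, 0, 0, 1, 1, 0],
})
print(recommend_from_cluster("Song A", catalog, n=2))
print(recommend_from_cluster("Not in catalog", catalog))
```

Because the picks are sampled at random from the cluster, two songs in the same cluster can still sound quite different when the cluster is large. That is one reason the proportions you record may vary from trial to trial.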

Creating Our Song Randomizer Function

We have provided the code for the function find_random_song(track_name, df), which takes a track name and a DataFrame that contains song information and clusters. It identifies the cluster to which the input track belongs, then selects songs from different clusters, and finally generates and prints five random song recommendations that are distinct from the provided track. Run the code in this cell.
Experiment with inputting different track names into the function. When you input the name of a song you like, the function should output songs that it thinks are very different from that song. Do you still see songs that you like?
1. In your lab notebook, record the proportion of songs that you like in another table like Table 1.
2. Have other people give it a try, and record the proportion of songs that they liked.

Evaluating the Model

We have provided the code for calculating the accuracies of the recommender and random song functions. Input the accuracy rate for each function in the list specified by the comment. There is no need to separate the trials or people.
For instance, if your data looked like Table 2, you would input the values as follows: recommendations_accuracy = [0.8, 0.6, 0.2, 0.8, 1, 1, 1, 0.8, 0.6, 1].

	Trial 1	Trial 2	Trial 3	Trial 4	Trial 5
Listener 1 (me)	0.8	0.6	0.2	0.8	1
Listener 2	1	1	0.8	0.6	1

Table 2. Example data table.
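One simple way to turn the recorded proportions into a single accuracy figure is to average them. The sketch below does this for the Table 2 values; the numbers for the random song function are invented for illustration, and the notebook's own calculation may differ.

```python
from statistics import mean

# Proportions from Table 2, read row by row (Listener 1, then Listener 2).
recommendations_accuracy = [0.8, 0.6, 0.2, 0.8, 1, 1, 1, 0.8, 0.6, 1]

# Hypothetical proportions for the random song function (invented numbers).
random_accuracy = [0.4, 0.2, 0.6, 0.2, 0.4, 0.6, 0.4, 0.2, 0.4, 0.6]

rec_mean = mean(recommendations_accuracy)
rand_mean = mean(random_accuracy)
print(f"recommender: {rec_mean:.2f}")
print(f"random:      {rand_mean:.2f}")
print(f"difference:  {rec_mean - rand_mean:.2f}")
```

For the Table 2 values, the average is 0.78, meaning the listeners liked 78% of the recommended songs overall.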

Compare the accuracies for both of the functions. Did the song recommendations perform significantly better than the random song generator?

Experimenting and Improving

If the song recommendation function was not as accurate as you would like, what do you think we could do to improve the model?
1. What happens if we drop certain features? For example, we could drop Views since that feature tends to group songs based on popularity. A listener would not necessarily like a song just because it was popular. Which other features could we drop?
2. Could we increase or decrease the number of clusters? What impact do you think this would have on what songs the model recommends?
Repeat the steps of this project, this time applying the changes you think would improve the model. Can you increase its accuracy?

Ask an Expert

Do you have specific questions about your science project? Our team of volunteer scientists can help. Our Experts won't do the work for you, but they will make suggestions, offer guidance, and help you troubleshoot.

Variations

Find another dataset on Kaggle and perform this project again. How well does the model work on your new data?

Careers

If you like this project, you might enjoy exploring these related careers:

Data Scientist

Career Profile

Many aspects of people's daily lives can be summarized using data, from what is the most popular new video game to where people like to go for a summer vacation. Data scientists (sometimes called data analysts) are experts at organizing and analyzing large sets of data (often called "big data"). By doing this, data scientists make conclusions that help other people or companies. For example, data scientists could help a video game company make a more profitable video game based on players'…

Computer Software Engineer

Career Profile

Are you interested in developing cool video game software for computers? Would you like to learn how to make software run faster and more reliably on different kinds of computers and operating systems? Do you like to apply your computer science skills to solve problems? If so, then you might be interested in the career of a computer software engineer.

Computer Programmer

Career Profile

Computers are essential tools in the modern world, handling everything from traffic control, car welding, movie animation, shipping, aircraft design, and social networking to book publishing, business management, music mixing, health care, agriculture, and online shopping. Computer programmers are the people who write the instructions that tell computers what to do.

Cite This Page

General citation information is provided here. Be sure to check the formatting, including capitalization, for the method you are using and update your citation, as needed.

MLA Style

Ngo, Tracey. "Build Personal Playlists with Machine Learning." Science Buddies, 1 May 2024, https://www.sciencebuddies.org/science-fair-projects/project-ideas/ArtificialIntelligence_p012/artificial-intelligence/K-Means-Spotify. Accessed 1 Aug. 2026.

APA Style

Ngo, T. (2024, May 1). Build Personal Playlists with Machine Learning. Retrieved from https://www.sciencebuddies.org/science-fair-projects/project-ideas/ArtificialIntelligence_p012/artificial-intelligence/K-Means-Spotify

Last edit date: 2024-05-01
