Abstract
AI has the power to organize data into categories, even when it is not given labeled examples to learn from. In this project, we will create a K-Means unsupervised learning model that organizes songs into distinct groups based on their similarities. The best part? You don't need advanced coding skills; just bring an open mind and a desire to explore the fascinating intersection of music and technology. Are you ready to embark on this exciting journey?
Summary
None
Readily available
No issues
Objective
Create a K-Means unsupervised learning model that sorts songs into a set number of groups, then evaluate how accurately a recommendation function built on those groups suggests songs you like.
Introduction
Imagine it is after school and you are preparing for a study session. As you set up your study area, create a plan, and gather the necessary materials, you open a music streaming platform on your device. Initially, you contemplate playing your go-to study playlist, but you decide to explore the platform's personalized playlist recommendations instead.
This gets you thinking about the intricate process behind the platform's song suggestions. It's fascinating: With a vast catalog spanning millions of songs, the platform consistently recommends tracks that match your musical tastes. Music services like Spotify can do this with the help of artificial intelligence.
Artificial Intelligence (AI) is a branch of computer science focused on building tools that can solve problems and analyze information. Machine learning is a subdivision of AI. Its goal is to create tools that can learn and improve over time using data. Machine learning encompasses various techniques, including supervised learning, in which models are trained on labeled data to make predictions, and unsupervised learning, in which models uncover patterns and relationships within data without explicit labels.
K-Means is an unsupervised machine learning algorithm that groups similar data points together into clusters, with the goal of minimizing the differences within each cluster and maximizing the differences between clusters. It is commonly used for tasks like recognizing patterns and making recommendations.
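To make the idea of clustering concrete, here is a minimal sketch using the KMeans class from the scikit-learn library. The six "songs" and their two feature values are invented for illustration; they are not taken from the project dataset, and this is not the code from the project notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six made-up "songs", each described by two features
# (think of them as Energy and Danceability). Values are invented.
songs = np.array([
    [0.10, 0.20], [0.15, 0.25], [0.20, 0.10],   # quiet, less danceable
    [0.80, 0.90], [0.85, 0.80], [0.90, 0.95],   # energetic, danceable
])

# Ask K-Means for two clusters; random_state makes the run repeatable.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(songs)

print(labels)                  # cluster number assigned to each song
print(model.cluster_centers_)  # the "average song" of each cluster
```

The first three songs should land in one cluster and the last three in the other. Which cluster is called 0 and which is called 1 is arbitrary.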
K-Means is faster and less labor-intensive than sorting data by hand, and it can find patterns across many features at once that would be hard for people to see. Given the same data and starting conditions, it produces the same groups every time, and it works with both small and large datasets. When the data changes, the model can be re-fit quickly so the groups update automatically.
However, K-Means has some limitations. Selecting the appropriate number of clusters (K) can be challenging and subjective. K-Means also struggles with datasets that contain outliers or features of varying importance. Despite these limitations, K-Means is a valuable tool when used carefully and in combination with preprocessing techniques that enhance its performance.
K-Means exemplifies the ability of machine learning algorithms to uncover patterns and make data-driven decisions. Companies and researchers are interested in K-Means for a variety of uses. Companies like Spotify or Netflix, for instance, may use K-Means to automatically generate recommendations for their users, enhancing user engagement and satisfaction.
To learn more about K-Means, watch the StatQuest video on K-means clustering listed in the Bibliography.
In this project, we will give you the basic code for implementing the K-Means model. Your job is to use this code and see how well it can recommend songs that you will enjoy.
The dataset you will be using contains information about 24 features of songs on Spotify:
- Artist: The artist's name
- Track: The title of the track
- Album: The name of the album
- Album_type: The type of album (Album, Single, Compilation)
- Danceability: A measure of how suitable a song is for dancing. It quantifies the rhythm, tempo, and beat strength of a track. Spotify uses machine learning algorithms to analyze these features and assign a danceability score to each song. The score typically ranges from 0 to 1, where higher values indicate more danceable tracks.
- Energy: A measure of the intensity and activity level of a song. It reflects how dynamic and lively a track feels. Energy is calculated by analyzing its loudness and dynamic range. It is assigned a value from 0 to 1.
- Loudness: The overall loudness of a song. Spotify reports Loudness in decibels (dB), averaged across the entire track, which makes it possible to compare the relative loudness of different songs. Values typically range from -60 to 0 dB, with values closer to 0 indicating louder tracks.
- Speechiness: A measure of the extent to which a segment of audio or an entire song contains spoken words, vocalizations, or non-musical sounds, as opposed to instrumental music. It is a valuable metric for categorizing and understanding the nature of audio content. Speechiness is calculated by analyzing audio frames from a song and classifying them as speech or non-speech based on acoustic features, then calculating the proportion of frames classified as speech. The result is a Speechiness score ranging from 0 to 1, where higher values indicate a greater presence of spoken words or vocal content in the song compared to instrumental music.
- Acousticness: A measure of the degree of acoustic instrumentation and arrangements in a song, distinguishing between music dominated by traditional acoustic instruments and that featuring electronic or synthesized sounds. Spotify reports it as a confidence score from 0 to 1, where values near 0 suggest a highly electronic track and values near 1 indicate high confidence that the track is acoustic.
- Instrumentalness: A numerical measure that quantifies instrumental content in a song, helping differentiate between purely instrumental tracks and those with vocals. To calculate Instrumentalness, the audio waveform is divided into frames, and machine learning algorithms or statistical models classify each frame as instrumental or non-instrumental based on acoustic features. The aggregated result is an Instrumentalness score typically ranging from 0 (vocal-heavy) to 1 (purely instrumental).
- Liveness: A numerical measure that gauges the presence of live performance characteristics in a song, distinguishing between studio and live concert recordings. To calculate liveness, audio analysis considers features such as crowd noise, audience reactions, and instrumentation variations. Machine learning models classify audio segments as live or studio based on these features, resulting in a Liveness score typically ranging from 0 (studio recording) to 1 (live performance).
- Valence: A numerical measure that quantifies the emotions or mood of a song, indicating whether it conveys positivity or negativity. It is determined through the analysis of various musical elements, including lyrics, melody, harmony, tempo, and instrumentation. Valence scores typically range from 0 (negative or sad) to 1 (positive or happy).
- Tempo: A measure of the speed or pace at which a piece of music is performed. It is typically measured in beats per minute (BPM). It quantifies the rhythm and timing of a composition, with a higher BPM indicating a faster tempo and a lower BPM indicating a slower tempo. Tempo is calculated by detecting and counting the beats in music over a one-minute interval.
- Duration_min: A measure of how long a song is in minutes
- Title: The title of the music video
- Channel: The YouTube channel that published the music video
- Views: The number of views a music video has on YouTube
- Likes: The number of likes a music video has on YouTube
- Comments: The number of comments a music video has on YouTube
- Licensed: When a song is "licensed," it means that the rights to use that song for specific purposes or in certain contexts have been legally granted by the copyright holder (typically the songwriter, composer, or music publisher) to another party.
- official_video: Whether or not a music video is official. An "official" music video is a video authorized and released by the copyright holders, such as the artist or record label. These videos are professionally produced, widely distributed through official channels, and subject to copyright protection, distinguishing them from unofficial or fan-made videos.
- Stream: The number of times a track has been played on Spotify
- EnergyLiveness: A custom feature for this dataset that combines the Energy and Liveness features
- most_playedon: The platform that the track has been played the most on, either Spotify or YouTube
Taken together, these 24 features are intended to describe how a song sounds and its appeal to listeners. These factors are crucial for artists and music platforms in creating and recommending music that resonates with listeners.
Terms and Concepts
- Artificial Intelligence (AI)
- Machine Learning
- Supervised Learning
- Unsupervised Learning
- K-Means
- Noise
- Normalization
- Scaling
- Dimensionality
- Principal Component Analysis (PCA)
- Cluster
Questions
- What is the difference between supervised learning and unsupervised learning?
- In simple terms, describe the main goal of the K-Means algorithm.
- What are some limitations of the K-Means algorithm?
- What is the main focus of this project involving the K-Means model?
Bibliography
- StatQuest. (2018, May 23). StatQuest: K-means clustering. Retrieved September 12, 2023.
- Kaggle. (2023). Spotify dataset. Retrieved September 12, 2023.
- Data Science A-Z for Beginners and Advanced. Part 41 How to Choose the Number of Clusters. Retrieved September 12, 2023.
- StatQuest. (2017, December 4). StatQuest: PCA main ideas in only 5 minutes!!!. Retrieved September 12, 2023.
Materials and Equipment
- Laptop or desktop computer
- Internet access
- Lab notebook
- Pen or pencil
Experimental Procedure

Setting up the Google Colab Environment
- You will need a Google account to use Google Colaboratory (Colab). If you do not have one, create one when prompted.
- Download the spotify.ipynb file from Science Buddies.
- Upload the file to Google Colaboratory, signing in to your Google account when asked.
- Read the Troubleshooting Tips and How to Use This Notebook sections. Follow the instructions you find there.
- Run the code block under Importing Libraries to load all of the functions the notebook needs.
Occasionally, your Runtime may get disconnected, and your variables will be lost. If you start getting NameErrors such as "name 'variable' is not defined", you have two options:
- Run all of the cells by clicking 'Runtime' at the top of the notebook and then 'Run all', or
- Click on the cell you are currently working on, then click 'Runtime' and 'Run before'.
Preprocessing the Dataset
Preprocessing a dataset is a critical step in the machine learning process. It involves preparing the raw data before feeding it into a machine learning algorithm. Preprocessing serves several important purposes, and we will explain and demonstrate each one in turn.
- Dropping Features: First, we will drop features that we think will be uninformative for modeling. In this case, we will drop Artist, Track, Album, Album_type, Title, Channel, Licensed, official_video, and most_playedon.
- Features like Artist, Track (name), Album (name), and Title are usually unique values and, in general, they do not tell us much about how the song sounds. They will not help the program group the data.
- The other features, such as whether a song was licensed or which platform it was played on most, are also unlikely to correlate with the listener's musical tastes.
We have provided the code to delete columns from our Pandas DataFrame, a flexible two-dimensional data structure provided by the Pandas library in Python. Add the names of the columns you want to delete, making sure each one matches the column name in the dataset exactly. Then, run the code in the cell.
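If you are curious what dropping columns looks like in Pandas, here is a small sketch. The tiny DataFrame and the variable names (columns_to_drop, df_numeric) are made up for this example; the notebook's own code may be organized differently.

```python
import pandas as pd

# A tiny made-up DataFrame standing in for the Spotify data.
df = pd.DataFrame({
    "Artist": ["A", "B"],
    "Track": ["Song 1", "Song 2"],
    "Energy": [0.7, 0.4],
    "Views": [1200, 56000],
})

# Column names must match the dataset exactly, including capitalization.
columns_to_drop = ["Artist", "Track"]
df_numeric = df.drop(columns=columns_to_drop)

print(df_numeric.columns.tolist())  # ['Energy', 'Views']
```

If a name in the list does not match a column exactly (for example, "artist" instead of "Artist"), drop() raises a KeyError, which is a quick way to catch typos.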
- Dropping NaN Values: NaN stands for Not a Number. NaN values typically indicate missing or incomplete data. When we remove these values, datasets become compatible with algorithms that require completely numerical data. Statistical analyses become more accurate, and the risk of biased predictions due to incomplete information is reduced.
We have provided the code to drop NaN values. Run the code in the cell. Notice the shape of the DataFrame before and after dropping NaN values.
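As a rough illustration of what this step does, the sketch below builds a small made-up DataFrame with two missing values and removes the affected rows with the Pandas dropna() method. The data is invented; only the method call is the point.

```python
import numpy as np
import pandas as pd

# Made-up data: rows 1 and 2 each have one missing (NaN) value.
df = pd.DataFrame({
    "Energy": [0.7, np.nan, 0.4, 0.9],
    "Views": [1200.0, 56000.0, np.nan, 300.0],
})

print(df.shape)        # (4, 2)
df_clean = df.dropna() # keep only rows with no missing values
print(df_clean.shape)  # (2, 2)
```

By default, dropna() removes every row that contains at least one NaN, so the number of rows shrinks while the number of columns stays the same.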
- Normalizing/Scaling Features: Since K-Means is a distance-based algorithm, it is crucial to normalize or scale the features to ensure that all features contribute equally to the distance calculations. If features are not scaled or normalized, those with larger numerical ranges can unfairly dominate the calculations.
For instance, if Energy values range from 0 to 1 and Views values range from 0 to 8 billion, the larger range for Views could cause Views to overshadow the Energy feature in distance calculations.
- Normalization: This is the process of scaling all the values in a dataset to a similar range. The goal is to bring the values of different features or variables to a common scale so that they are directly comparable and do not introduce bias or distortion in the analysis. The technique we will be using is called Min-Max scaling. It will bring the values between 0 and 1.
- Scaling: In essence, scaling ensures that no single feature overwhelms the distance calculations. This enhances the algorithm's fairness in considering all features, resulting in more accurate predictions.
As in the previous step, we have provided the code to normalize certain columns of our Pandas DataFrame. The output of the describe() function shows that some features are already within the range of 0 to 1 (Danceability, Energy, Speechiness, Acousticness, Instrumentalness, Liveness, and Valence), so we can exclude those features from this step. We will normalize the remaining numerical features: Loudness, Tempo, Duration_min, Views, Likes, Comments, Stream, and EnergyLiveness. Add these names to the list specified by the comment in the code. Then, run the code in the cell.
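Min-Max scaling replaces each value x in a column with (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. The sketch below shows one way to do this with scikit-learn's MinMaxScaler on made-up numbers; the variable name columns_to_scale is an assumption, and the notebook may use a different approach.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up values. Energy is already between 0 and 1; the others are not.
df = pd.DataFrame({
    "Energy": [0.2, 0.5, 0.9],
    "Views": [1_000.0, 500_000.0, 8_000_000.0],
    "Tempo": [80.0, 120.0, 160.0],
})

columns_to_scale = ["Views", "Tempo"]

# MinMaxScaler rescales each column with (x - min) / (max - min).
scaler = MinMaxScaler()
df[columns_to_scale] = scaler.fit_transform(df[columns_to_scale])

# Every column should now have a minimum near 0 or above and a maximum of at most 1.
print(df.describe().loc[["min", "max"]])
```

After scaling, a difference in Views can no longer swamp a difference in Energy when K-Means measures the distance between two songs.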
Clustering the Data
There are various strategies for determining the optimal number of clusters. One technique we will explore is called the Elbow method. This method involves plotting the inertia as a function of the number of clusters. Inertia is the sum of the squared distances between each data point and the center of its assigned cluster, so lower inertia means tighter clusters. Because inertia tends to keep decreasing as clusters are added, we look for the point where the decrease starts to slow down, resembling the bend of an elbow. This point indicates a potential optimal number of clusters for the dataset. To learn more about the Elbow method and how to choose the number of clusters, see the "How to Choose the Number of Clusters" entry in the Bibliography.
We have provided the code for a function that plots inertia against the number of clusters to help you choose the optimal number. Run the code in these cells and observe the graph. Where in the graph does the decrease in inertia slow down sharply? (Find the "elbow" part of the graph.)
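Here is a hedged sketch of how the numbers behind an elbow plot can be computed. It uses made-up data with three obvious groups rather than the Spotify dataset, and it prints the inertia values instead of drawing the graph, so it is only an approximation of what the notebook's function does.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: three well-separated blobs of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
data = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

k_values = range(1, 8)
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    # inertia_ is the sum of squared distances from each point
    # to the center of the cluster it was assigned to.
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```

With this data, the inertia should fall steeply up to K = 3 and change very little afterward, so the elbow sits at 3. Real datasets rarely produce such a clean bend, which is why choosing K can be subjective.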
Applying K-Means Clustering
- We have provided the code to make a K-Means model with a specified number of clusters. Enter the K value you chose in the previous step into the K-Means model as indicated by the comment.
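The sketch below shows one plausible way to fit a K-Means model with a chosen K and store each song's cluster number in the DataFrame. The data is invented, and the column name Cluster is an assumption made for this example.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up, already-scaled features for six songs.
features = pd.DataFrame({
    "Energy":       [0.10, 0.15, 0.85, 0.90, 0.50, 0.55],
    "Danceability": [0.20, 0.25, 0.80, 0.90, 0.50, 0.45],
})

k = 3  # the value you read off the elbow plot (chosen by hand here)
model = KMeans(n_clusters=k, n_init=10, random_state=0)

# Fit on the feature columns only, then record each song's cluster.
features["Cluster"] = model.fit_predict(features[["Energy", "Danceability"]])

print(features)
print(features["Cluster"].value_counts())
```

Storing the cluster number next to each song makes it easy to look up later which songs ended up in the same group.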
Visualizing the Model
- Before graphing, we must reduce the dimensionality of the data. Dimensionality refers to the number of features used to describe each item in the dataset. There are many methods for reducing dimensionality. We will use a technique called Principal Component Analysis (PCA). To learn more about PCA, see the StatQuest PCA video listed in the Bibliography.
- In the dataset, there are multiple features (dimensions) that describe each instance. In our Spotify dataset, we have features like Danceability, Energy, and Loudness. Visualizing with more than three dimensions is challenging because we cannot easily represent such high-dimensional spaces on a two-dimensional surface like a graph.
- PCA helps address this issue by transforming the original features into a new set of features called principal components. Each principal component is a linear combination of the original features, and the first few components capture most of the variability in the data. By using just the first few principal components, we can effectively reduce the dimensionality of the data while retaining as much meaningful information as possible.
- In the provided code, PCA is used to reduce the dataset to just two principal components. This means that instead of visualizing the data in its original high-dimensional space, we are now plotting the data points in a two-dimensional space defined by the two most important principal components. This makes it possible to create a clear and interpretable graph of the K-Means clustering.
- The graph generated by the provided code shows the clusters found by the K-Means model in a two-dimensional space. These clusters represent the different groups that the songs are categorized into.
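As a rough sketch of the reduction step, the code below uses scikit-learn's PCA class to squeeze six made-up features down to two principal components. The helper plot_clusters is hypothetical (it is not from the notebook) and simply shows how the two components could be passed to a scatter plot.

```python
import numpy as np
from sklearn.decomposition import PCA

# 50 made-up songs with 6 features each (random values between 0 and 1).
rng = np.random.default_rng(0)
data = rng.random((50, 6))

pca = PCA(n_components=2)
coords = pca.fit_transform(data)  # one (x, y) pair per song

print(coords.shape)                   # (50, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component


def plot_clusters(coords, labels):
    """Hypothetical helper: scatter-plot the 2-D points, colored by cluster."""
    import matplotlib.pyplot as plt  # imported here so the rest runs without it
    plt.scatter(coords[:, 0], coords[:, 1], c=labels)
    plt.xlabel("Principal component 1")
    plt.ylabel("Principal component 2")
    plt.show()
```

Calling plot_clusters(coords, labels) with the cluster labels from a K-Means model would draw each song as a dot colored by its cluster. The explained variance ratio tells you how much of the original variation the two-dimensional picture keeps, so a low total is a sign that the graph hides a lot of detail.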
Creating Our Song Recommendation Function
- We have provided the code for the function find_track_index(track_name, df), which finds the index of a given track name in the DataFrame's Track column. If the track name is found, it returns the index; otherwise, it returns None to signify that the track name is absent in the DataFrame. Run the code in this cell.
- We have provided the code for the function find_song_recommendation(track_name, df), which takes a track name and a DataFrame containing song information and clusters. It first identifies the cluster to which the input track belongs, then selects songs from the same cluster, and finally generates and prints five song recommendations from that cluster, aiming to suggest similar songs. Run the code in this cell.
- Experiment with entering different track names into the function. When you enter the names of songs you like, does the recommender function also output songs you like? In your lab notebook, keep track of how many of the five recommended songs you like. If the recommendations include songs you have not heard before, find them on Spotify or another streaming service and give them a listen.
- Repeat this experiment at least five times. Every time, record the proportion of songs that you like. For example, if you liked four of the five songs, write down either 4/5 or 0.8.
- Have other people give it a try, and record the proportion of songs that they liked.
- Record your results in a table like Table 1.
| | Trial 1 | Trial 2 | Trial 3 | Trial 4 | Trial 5 |
|---|---|---|---|---|---|
| Listener 1 (me) | | | | | |
| Listener 2 | | | | | |
| Listener 3 | | | | | |
Table 1. Table for recording what proportion of the five songs each listener liked.
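If you want a feel for how the two functions in this section might work, here is a minimal sketch. It is not the notebook's code: the column name Cluster, the extra parameters n and seed, and the choice to return a list instead of printing are all assumptions made for this example.

```python
import pandas as pd


def find_track_index(track_name, df):
    """Return the index of the first row whose Track matches, else None."""
    matches = df.index[df["Track"] == track_name]
    return matches[0] if len(matches) > 0 else None


def find_song_recommendation(track_name, df, n=5, seed=None):
    """Return up to n other tracks from the same cluster as track_name."""
    idx = find_track_index(track_name, df)
    if idx is None:
        return []  # track not in the dataset
    cluster = df.loc[idx, "Cluster"]
    # Same cluster, but leave out the input track itself.
    same = df[(df["Cluster"] == cluster) & (df.index != idx)]
    if same.empty:
        return []
    picks = same.sample(n=min(n, len(same)), random_state=seed)
    return picks["Track"].tolist()


# Made-up tracks and cluster numbers for demonstration.
songs = pd.DataFrame({
    "Track":   ["A", "B", "C", "D", "E", "F"],
    "Cluster": [0, 0, 0, 1, 1, 0],
})
print(find_song_recommendation("A", songs, seed=0))
```

Here, track "A" is in cluster 0, so the recommendations can only come from "B", "C", and "F". The track name has to match exactly, including capitalization and punctuation, or the lookup returns None.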
Creating Our Song Randomizer Function
- We have provided the code for the function find_random_song(track_name, df), which takes a track name and a DataFrame that contains song information and clusters. It identifies the cluster to which the input track belongs, then selects songs from different clusters, and finally generates and prints five random song recommendations that are distinct from the provided track. Run the code in this cell.
- Experiment with inputting different track names into the function. When you input the name of a song you like, the function should output songs that it thinks are very different from that song. Do you still see songs that you like?
- In your lab notebook, record the proportion of songs that you like in another table like Table 1.
- Have other people give it a try, and record the proportion of songs that they liked.
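The randomizer described in this section can be sketched the same way, with one change: it picks from every cluster except the input track's own. As before, this is an illustration under assumptions (the Cluster column name, the n and seed parameters, and returning a list are all made up here), not the notebook's implementation.

```python
import pandas as pd


def find_random_song(track_name, df, n=5, seed=None):
    """Return up to n tracks drawn from clusters other than track_name's."""
    matches = df[df["Track"] == track_name]
    if matches.empty:
        return []  # track not in the dataset
    cluster = matches["Cluster"].iloc[0]
    # Keep only songs that K-Means placed in a different cluster.
    others = df[df["Cluster"] != cluster]
    if others.empty:
        return []
    picks = others.sample(n=min(n, len(others)), random_state=seed)
    return picks["Track"].tolist()


# Made-up tracks and cluster numbers for demonstration.
songs = pd.DataFrame({
    "Track":   ["A", "B", "C", "D", "E", "F"],
    "Cluster": [0, 0, 0, 1, 1, 0],
})
print(find_random_song("A", songs, seed=0))
```

Because "A" is in cluster 0, the only possible picks are "D" and "E". Comparing how often you like these songs with how often you like the recommended ones gives you a baseline for judging the recommender.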
Evaluating the Model
- We have provided the code for calculating the accuracies of the recommender and random song functions. Input the accuracy rate for each function in the list specified by the comment. There is no need to separate the trials or people.
For instance, if your data looked like Table 2, you would input the values as follows: recommendations_accuracy = [0.8, 0.6, 0.2, 0.8, 1, 1, 1, 0.8, 0.6, 1].
| | Trial 1 | Trial 2 | Trial 3 | Trial 4 | Trial 5 |
|---|---|---|---|---|---|
| Listener 1 (me) | 0.8 | 0.6 | 0.2 | 0.8 | 1 |
| Listener 2 | 1 | 1 | 0.8 | 0.6 | 1 |
Table 2. Example data table.
- Compare the accuracies for both of the functions. Did the song recommendations perform significantly better than the random song generator?
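One simple way to compare the two functions is to average each list of proportions. In the sketch below, recommendations_accuracy holds the values from Table 2, while the random_accuracy values are placeholders invented purely for illustration; replace both lists with your own results. The notebook's evaluation code may calculate more than this.

```python
from statistics import mean

# Values from Table 2 (two listeners, five trials each).
recommendations_accuracy = [0.8, 0.6, 0.2, 0.8, 1, 1, 1, 0.8, 0.6, 1]

# Placeholder values for the random-song function, invented for this example.
random_accuracy = [0.4, 0.2, 0.6, 0.2, 0.4, 0.4, 0.6, 0.2, 0.4, 0.2]

mean_recommended = mean(recommendations_accuracy)
mean_random = mean(random_accuracy)

print(f"Recommender: {mean_recommended:.2f}")
print(f"Random:      {mean_random:.2f}")
print(f"Difference:  {mean_recommended - mean_random:.2f}")
```

A large gap between the two averages suggests the clusters capture something about your taste. With only a handful of trials, a small gap could easily be due to chance, so more trials and more listeners make the comparison more trustworthy.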
Experimenting and Improving
- If the song recommendation function was not as accurate as you would like, what do you think we could do to improve the model?
- What happens if we drop certain features? For example, we could drop Views since that feature tends to group songs based on popularity. A listener would not necessarily like a song just because it was popular. Which other features could we drop?
- Could we increase or decrease the number of clusters? What impact do you think this would have on what songs the model recommends?
- Repeat the steps of this project, this time applying the changes you think would improve the model. Can you increase its accuracy?
Variations
- Find another dataset on Kaggle and perform this project again. How well does the model work on your new data?