Abstract
Given the right data, AI can be good at making predictions. In this project, you will create a K-Nearest Neighbors (KNN) machine learning model that tries to predict whether a passenger on the Titanic survived based on multiple factors such as age, sex, and fare price. This project requires little to no coding skill; instead, you will need an open mind and curiosity to learn. Ready to give it a try?
Summary
None
Readily available
No issues
Objective
Train a KNN machine learning model that can predict whether a given passenger from the Titanic survived.
Introduction
Imagine you are a passenger on the Titanic when the ship hits an iceberg and starts to sink. As people begin to crowd aboard the 20 lifeboats, parents still on the ship beg them to take their children. Elderly men and women tell those helping them to save themselves; they say they've already lived their own lives. Other passengers care only about themselves and ignore others in need.
In the end, of about 2,200 people on the ship, only 706 will have survived.
In the weeks after the disaster, as you pore over the lists of dead and survivors, it will become clear that there are patterns in the types of people who have lived. Many are children whose parents sacrificed themselves to save them. Others are able-bodied individuals who were simply able to escape faster than their fellow passengers. Various factors such as sex, age, and the number of family members they came with seem to have affected whether or not an individual survived.
When we look back on the disaster today, can we use artificial intelligence to find these patterns? Can we predict who survived and who did not?
Artificial Intelligence (AI) is a branch of computer science focused on building tools that can solve problems and analyze information. Machine learning is a subdivision of AI. Its goal is to create tools that can learn and improve over time using data. One of the techniques employed in machine learning is the K-Nearest Neighbors (KNN) algorithm, which involves making predictions based on the characteristics of nearby data points. KNN exemplifies how machine learning algorithms harness data to make informed decisions and refine their performance over time.
Compared with analyzing data by hand, KNN saves time and effort. It can pick up patterns that are hard for people to see, it applies the same rule to every prediction, and it needs no changes when new examples are added to the data. However, KNN has some limitations. It might not perform well when the data is contaminated by "noise" (random or irrelevant fluctuations that obscure true patterns). Because every prediction compares the new point against the stored examples, KNN can become slow on large datasets, and its performance can deteriorate in high-dimensional spaces (data with many features). The selection of the optimal number of neighbors (K) is not always straightforward, and because KNN gives every neighbor an equal vote, it can struggle with imbalanced data, where one group is much larger than the other.
KNN is a straightforward machine learning technique used for classification and regression tasks. For classification, the groups come from the labels in the data. In this case, we are interested in classifying passengers into two groups: those who survived and those who did not. To make a prediction, KNN finds the k examples in the data that are closest to the new passenger whose fate we are trying to predict. It then counts how many of those neighbors belong to each group and predicts the group with the most votes. In other words, we predict the fate of a passenger from the fates of the passengers who are most like them.
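The neighbor-voting idea described above can be sketched in a few lines of plain Python. This is only an illustration, not the code from the project notebook: the function name `knn_predict`, the two features, and the six passengers are all made up, and the feature values are assumed to be already scaled between 0 and 1.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, new_point, k=3):
    """Predict a label for new_point by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every training point
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest examples
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Count the labels among those neighbors and return the most common one
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up passengers described by two scaled features (hypothetical values)
train_points = [(0.10, 0.80), (0.20, 0.90), (0.15, 0.70),
                (0.80, 0.10), (0.90, 0.20), (0.70, 0.15)]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = survived, 0 = did not survive

prediction = knn_predict(train_points, train_labels, (0.25, 0.75), k=3)
print(prediction)
```

The new point sits next to the three passengers labeled 1, so all three of its nearest neighbors vote for that group.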
However, KNN needs labeled examples to learn from before it can make accurate predictions. To do this, you'll use part of the data to train the algorithm and the other part to test how well it performs. This process helps the algorithm learn from patterns in the training data and apply them to new, unseen data.
Companies and researchers are interested in KNN for a variety of uses. For instance, companies can use KNN to examine financial data and predict whether a particular investment is a good financial decision. Similarly, healthcare professionals can use KNN to analyze patient data to predict whether someone has a certain disease.
In this project, we will give you the basic code for implementing the KNN model. You will take that basic code and explore how changing the number of neighbors changes the model's accuracy when predicting whether a passenger from the Titanic survived.
Terms and Concepts
- Artificial Intelligence (AI)
- Machine Learning
- K-Nearest Neighbors (KNN)
- Classification
- Regression
- Noise
- Normalization
- Accuracy
- Decision Boundary
- Dimensionality
- Principal Component Analysis (PCA)
Questions
- What is the K-Nearest Neighbors (KNN) algorithm? How does it make predictions?
- What challenges can noisy data pose for the accuracy of the KNN algorithm?
- What is the main focus of this project involving the KNN model, and how will you explore its performance?
Bibliography
- Kaggle. (2012, September 28). Titanic - Machine Learning from Disaster. Retrieved August 9, 2023.
- StatQuest. (2023, February 12). One-Hot, Label, Target and K-Fold Target Encoding, Clearly Explained!!! Retrieved August 9, 2023.
- StatQuest. (2017, December 4). StatQuest: PCA main ideas in only 5 minutes!!! Retrieved August 9, 2023.
Materials and Equipment
- Laptop or desktop computer
- Internet access
Experimental Procedure

Setting up the Google Colab Environment
- You will need a Google account. If you do not have one, make one when prompted.
- Download the titanic.ipynb file from Science Buddies.
- Upload the file to Google Colaboratory. You will need to sign in to your Google account at this point.
- Read the Troubleshooting Tips and How to Use This Notebook sections. Follow the instructions you find there.
- Run the code block under Importing Libraries to bring in all of the functions the notebook uses.
Occasionally, your runtime may get disconnected, and your variables will be lost. If you get a NameError such as "name 'variable' is not defined", you have two options:
- Run all of the cells by clicking on 'Runtime' at the top of the notebook then click 'Run all', or
- click on the current cell you are working on, then click 'Runtime' and 'Run before'.
Loading the Dataset into Google Colab
- Load the dataset into the Google Colab notebook by running the cell under this section. Once the cell finishes running without errors, congratulations, the dataset has been loaded!
Preprocessing the Dataset
Preprocessing a dataset is a critical step in the machine learning process. It involves preparing the raw data before feeding it into a machine learning algorithm. Preprocessing serves several important purposes, and we will explain and demonstrate each one in turn.
- Dropping Features: First, we will drop features that we think will be uninformative for modeling. In this case, we will be dropping passenger ID, name, ticket number, cabin information, and port of embarkation. Here are the reasons for each:
- Passenger ID: The passenger ID is usually a unique identifier assigned to each passenger. It does not carry any inherent predictive value and is simply an arbitrary identifier. Including it as a feature could potentially introduce noise (random or irrelevant information that can disrupt the meaningful patterns or signals in data).
- Name: Similar to passenger ID, names are typically unique for each passenger and do not have a direct impact on survival prediction.
- Ticket Information: Similar to passenger ID and name, ticket numbers are unique identifiers and generally do not provide meaningful insights into a passenger's survival.
- Cabin Information: As we can see from the Cabin column, cabin information has a high number of missing values in the dataset, making it difficult to use effectively.
- Port of Embarkation: The port from which a passenger embarked is unlikely to have a significant impact on their survival. Survival on the Titanic is more likely to be influenced by factors such as passenger class, age, sex, and possibly fare price, which are more directly related to a passenger's circumstances during the disaster.
We have provided the code to delete certain columns from our Pandas DataFrame. Add in the names of the columns you want to delete, making sure they match exactly as they are written in the dataset. Then, run the code in the cell.
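As a rough sketch of what dropping columns looks like in pandas, the example below removes the five features listed above. It assumes the notebook keeps the column names from the Kaggle Titanic dataset; the three rows are made up for illustration, and the code in your notebook may be organized differently.

```python
import pandas as pd

# Tiny made-up table that uses the Kaggle Titanic column names
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Ticket": ["T-001", "T-002", "T-003"],
    "Fare": [7.25, 71.28, 7.93],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", "S"],
})

# Names must match the dataset exactly, including capitalization
columns_to_drop = ["PassengerId", "Name", "Ticket", "Cabin", "Embarked"]
df = df.drop(columns=columns_to_drop)

print(df.columns.tolist())
```

If a name in the list does not match a column exactly, pandas raises a KeyError, which is a useful hint that a name was mistyped.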
- Normalizing Features: In KNN, the algorithm relies on measuring distances between data points to make predictions. To ensure accurate results, it is important that all features contribute fairly to these distance calculations. If features are not scaled or normalized, those with larger numerical ranges can unfairly dominate the distances.
- For instance, if one feature spans ages from 0 to 100 and another spans fare price from 0 to 512, the fare price's larger range could overshadow the age feature in distance calculations.
- Normalization is the process of scaling all the values in a dataset to a similar range. The goal is to bring the values of different features or variables to a common scale so that they are directly comparable and do not introduce bias or distortion in the analysis. The technique we will be using is called Min-Max scaling (bringing values between 0 and 1).
- In essence, scaling ensures that no single feature overwhelms the distance calculations. This enhances the algorithm's fairness in considering all features, resulting in more accurate predictions.
As in the previous step, we have provided the code to normalize certain columns from our Pandas DataFrame. We will be normalizing our numerical variables, which in this case are Age, SibSp (# of siblings/spouses aboard the Titanic), Parch (# of parents/children aboard the Titanic), and Fare. Add these names to the list specified by the comment in the code. Then, run the code in the cell.
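Min-Max scaling replaces each value x with (x - min) / (max - min), where min and max are the smallest and largest values in that column. The sketch below applies this formula with pandas to a made-up table; the notebook may use a different helper to do the same job, so treat this only as an illustration of the arithmetic.

```python
import pandas as pd

# Made-up values for the four numerical columns
df = pd.DataFrame({
    "Age": [2.0, 30.0, 58.0],
    "SibSp": [0, 1, 4],
    "Parch": [0, 2, 1],
    "Fare": [0.0, 128.0, 512.0],
})

# Smallest and largest value in each column
col_min = df.min()
col_max = df.max()

# Min-Max scaling: every column now runs from 0 to 1
scaled = (df - col_min) / (col_max - col_min)

print(scaled)
```

After scaling, a fare of 128 becomes 0.25 and an age of 30 becomes 0.5, so neither feature dominates the distance calculation simply because its raw numbers are larger.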
- Encoding Categorical Variables: Encoding categorical variables is essential when working with machine learning algorithms, because most algorithms, including KNN, require numerical input. Categorical variables, which represent qualitative attributes like gender, can't be used directly in their original form because they lack a numerical representation that algorithms can process.
- Label Encoding: Label encoding involves converting categorical values into integers. Each category is assigned a unique integer value. See the StatQuest video on encoding listed in the Bibliography to learn more about label and one-hot encoding.
- In our case, the only variable that requires encoding is Sex. It is worth noting that Survived and Pclass are already label encoded, so no further preprocessing is needed for these features.
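A minimal sketch of label encoding the Sex column with pandas is shown here. The particular integers (0 for male, 1 for female) are an assumption made for this example; the notebook may assign them the other way around, and either choice works as long as it is applied consistently.

```python
import pandas as pd

# Made-up column of categorical values
df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Assumed mapping from category to integer (hypothetical choice)
sex_codes = {"male": 0, "female": 1}
df["Sex"] = df["Sex"].map(sex_codes)

print(df["Sex"].tolist())
```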
- Handling Missing Data: Real-world datasets often contain missing values for various reasons, such as data collection errors or incomplete records. Preprocessing involves strategies such as imputation (filling missing values with estimated values) or removal of incomplete rows/columns to ensure the dataset is complete and usable for modeling.
- Missing values in datasets often show up as NaN (Not a Number). KNN needs a value for every feature in order to measure distances, so removing rows or columns with NaN values leaves a complete dataset the algorithm can use, at the cost of discarding some data.
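The two strategies mentioned above, removal and imputation, can be sketched with pandas as follows. The table is made up, and filling with the column median is just one common imputation choice used here for illustration; it is not necessarily what the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up table with two missing values
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Survived": [0, 1, 1, 0],
})

# Count the missing values in each column
print(df.isna().sum())

# Option 1: remove every row that contains a NaN
dropped = df.dropna()

# Option 2: imputation, here filling each NaN with that column's median
imputed = df.fillna(df.median())

print(len(df), len(dropped), len(imputed))
```

Removal is simpler but shrinks the dataset, while imputation keeps every row but replaces unknown values with estimates.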
- Splitting the Training and Testing Data: Splitting data into training and testing sets is crucial in machine learning. When you do so, you can assess your model's performance on new, unseen data, ensuring its ability to generalize beyond training examples. Click the link to learn more about why we split datasets. We have provided the code to split the dataset into training and testing portions. Take note of how X and y look following this step, as well as the dimensions of X_train, X_test, y_train, and y_test. The numbers inside the parentheses represent the number of rows and the number of features in each set.
Coding Tip:
Following the standard coding conventions, X is commonly written in uppercase, while y is usually in lowercase.
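The splitting step might look like the sketch below, which uses scikit-learn's `train_test_split` on a made-up table of ten passengers. The 80/20 split and the `random_state` value are assumptions chosen for this example; the notebook may use different settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, already preprocessed table of ten passengers
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2, 3, 1, 2, 3],
    "Sex": [0, 1, 1, 1, 0, 0, 1, 0, 1, 0],
    "Age": [0.27, 0.47, 0.32, 0.43, 0.43, 0.35, 0.03, 0.67, 0.17, 0.25],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0, 1, 0],
})

X = df.drop(columns=["Survived"])  # features: everything except the label
y = df["Survived"]                 # label: what we want to predict

# Hold back 20% of the rows for testing (assumed proportion)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Here X holds a table with several columns, which is why its shape has two numbers (rows and features), while y is a single column of labels and its shape has only the number of rows.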
Training the Model
- We have provided the code to make a KNN classifier with a specified number of neighbors, where we have chosen 5 as the default number. This code trains the classifier using the training data. It then makes predictions on survival outcomes for the test data based on what it learns from the training data. Afterward, it calculates how accurate its guesses were.
- Accuracy is a measure used to evaluate the performance of a machine learning model. It represents the proportion of correctly predicted outcomes or labels compared to the total number of instances being evaluated. In simpler terms, accuracy tells you how often the model's predictions are correct.
- Mathematically, accuracy is calculated as: Accuracy = (number of correct predictions) / (total number of predictions)
- Accuracy of 1 means that the model is correct all of the time.
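To make the training and accuracy steps concrete, here is a small sketch using scikit-learn's `KNeighborsClassifier` with 5 neighbors. The data points are made up, and one test label is deliberately chosen to disagree with its neighbors so that the accuracy is not perfect. The notebook's own code may differ in its details.

```python
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data: two well-separated groups of five points each
X_train = [[0.10, 0.10], [0.20, 0.10], [0.10, 0.20], [0.20, 0.20], [0.15, 0.15],
           [0.80, 0.90], [0.90, 0.80], [0.80, 0.80], [0.90, 0.90], [0.85, 0.85]]
y_train = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Made-up test data; the last label deliberately disagrees with its neighbors
X_test = [[0.15, 0.10], [0.85, 0.90], [0.20, 0.15], [0.90, 0.85]]
y_test = [0, 1, 0, 0]

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)        # store the labeled training examples
y_pred = knn.predict(X_test)     # vote among the 5 nearest neighbors

# Accuracy by hand: correct predictions divided by total predictions
correct = sum(int(p == t) for p, t in zip(y_pred, y_test))
manual_accuracy = correct / len(y_test)

print(manual_accuracy, accuracy_score(y_test, y_pred))
```

The model gets three of the four test points right, so both the hand calculation and `accuracy_score` give 0.75.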
Visualizing the Model
- The graph generated by the provided code visualizes the decision boundary of the KNN classifier in a two-dimensional space. This decision boundary represents how the KNN classifier separates different classes based on features of the data.
- Before graphing, we must reduce the dimensionality of the data. Dimensionality refers to the number of different features used to describe each item in the dataset. There are many methods for reducing dimensionality, and we will use a technique called Principal Component Analysis (PCA). See the StatQuest video on PCA listed in the Bibliography to learn more.
- In the dataset, there are multiple features (dimensions) that describe each instance. In our Titanic dataset, we have features like age, fare, and number of siblings. Visualizing with more than three dimensions is challenging because we cannot easily represent such high-dimensional spaces on a two-dimensional surface like a graph.
- PCA helps address this issue by transforming the original features into a new set of features called principal components. Each principal component is a linear combination of the original features, and the first few components capture the largest share of the variability in the data. By using just the first few principal components, we can effectively reduce the dimensionality of the data while retaining as much meaningful information as possible.
- In the provided code, PCA is used to reduce the dataset to just two principal components. This means that instead of visualizing the data in its original high-dimensional space, we are now plotting the data points in a two-dimensional space defined by the two most important principal components. This makes it possible to create a clear and interpretable graph of the decision boundary of the KNN classifier.
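A minimal sketch of reducing data to two principal components with scikit-learn's `PCA` is shown below. The input is random made-up data with six features, used only to show the shapes involved; it does not reproduce the notebook's plot.

```python
import numpy as np
from sklearn.decomposition import PCA

# 50 made-up rows with 6 features each (random values, for illustration only)
rng = np.random.default_rng(0)
X = rng.random((50, 6))

# Keep only the two most important principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X.shape, "->", X_2d.shape)
# Share of the total variability captured by each of the two components
print(pca.explained_variance_ratio_)
```

Each row is now described by two numbers instead of six, so it can be drawn as a point on a flat graph. The explained variance ratio shows how much of the original variability those two numbers retain.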
Evaluating the Model
- Describe what the graph of the KNN decision boundary on the PCA-reduced data is showing.
- How does the graph help us understand the performance and behavior of the KNN classifier?
- Can you explain the relationship between the regions on the graph and the predictions made by the KNN model?
- Can you explain what the accuracy score means?
Experimenting and Improving
We may be able to improve our model's accuracy by experimenting with different numbers of neighbors. The choice of the number of neighbors, often denoted as k, can significantly impact the performance and behavior of the model.
- When k is too small: The model might give too much weight to a small number of neighbors and miss the overall trend of the data. For example, if you only ask two friends about their recommendations for a restaurant, and one friend had a bad experience, you might avoid that restaurant.
- When k is too big: The model could be heavily influenced by the majority class and not notice the nuances of the minority class. For example, if you ask a lot of friends about their restaurant recommendations, their mixed opinions would balance each other out. Even if a couple of friends had negative experiences, the majority could sway your decision toward trying that restaurant.
Duplicate the provided cell and change the k value, first to 1, then to 8, then to 20, running the cell each time. This will help you compare how accurately the KNN classifier predicts with different k values, including the default of 5. Which k value yields the highest accuracy?
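The same comparison can also be written as a loop. The sketch below is an illustration on synthetic data generated with scikit-learn's `make_classification`, not on the Titanic dataset, so the accuracies it prints say nothing about which k works best for this project. The variable names are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for the preprocessed Titanic features
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train one classifier per k value and record its accuracy on the test set
accuracy_by_k = {}
for k in [1, 5, 8, 20]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracy_by_k[k] = knn.score(X_test, y_test)

for k, accuracy in accuracy_by_k.items():
    print(f"k={k}: accuracy={accuracy:.3f}")

best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print("Highest accuracy at k =", best_k)
```

For a classifier, `score` returns the accuracy on the data passed to it, so each entry in `accuracy_by_k` is directly comparable to the accuracy printed by a single cell.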
Variations
- Try to come up with an automatic way to look at accuracy vs. number of neighbors.
- If you would like to try your hand at a more complex KNN project, try the Can AI Diagnose Breast Cancer? project.