
Can AI Predict Who Survived the Titanic's Sinking?

Abstract

Given the right data, AI can be good at making predictions. In this project, you will create a KNN machine learning model that tries to predict whether a passenger on the Titanic survived based on multiple factors such as age, sex, and fare price. This project requires little to no coding skill; instead, you will need an open mind and curiosity to learn. Ready to give it a try?

Summary

Areas of Science
Difficulty
Method
Time Required: Very Short (≤ 1 day)
Prerequisites: None
Material Availability: Readily available
Cost: Very Low (under $20)
Safety: No issues

Credits
Science Buddies is committed to creating content authored by scientists and educators. Learn more about our process and how we use AI.

Objective

Train a KNN machine learning model that can predict whether a given passenger on the Titanic survived.

Introduction

Imagine you are a passenger on the Titanic when the ship hits an iceberg and starts to sink. As people begin to crowd aboard the 20 lifeboats, parents still on the ship beg them to take their children. Elderly men and women tell those helping them to save themselves; they say they've already lived their own lives. Other passengers care only about themselves and ignore others in need.

In the end, of about 2,200 people on the ship, only 706 will have survived.

In the weeks after the disaster, as you pore over the lists of dead and survivors, it will become clear that there are patterns in the types of people who have lived. Many are children whose parents sacrificed themselves to save them. Others are able-bodied individuals who were simply able to escape faster than their fellow passengers. Various factors such as sex, age, and the number of family members they came with seem to have affected whether or not an individual survived.

When we look back on the disaster today, can we use artificial intelligence to find these patterns? Can we predict who survived and who did not?

Artificial Intelligence (AI) is a branch of computer science focused on building tools that can solve problems and analyze information. Machine learning is a subdivision of AI. Its goal is to create tools that can learn and improve over time using data. One of the techniques employed in machine learning is the K-Nearest Neighbors (KNN) algorithm, which involves making predictions based on the characteristics of nearby data points. KNN exemplifies how machine learning algorithms harness data to make informed decisions and refine their performance over time.

Compared with analyzing data by hand, KNN saves time and effort. It can find patterns that are hard for people to see, and it applies the same rule to every prediction. Because it simply stores its examples, it can also take new data into account without being rebuilt from scratch. However, KNN has some limitations. It can perform poorly when the data is contaminated by "noise," meaning random or irrelevant fluctuations that obscure true patterns. Because every prediction compares the new point against the stored examples, KNN becomes slow on large datasets, and its accuracy can deteriorate when there are many features (high-dimensional data). The selection of the optimal number of neighbors (K) is not always straightforward, and because KNN gives every neighbor an equal vote, it can struggle with imbalanced data, where one group greatly outnumbers the other.

KNN is a straightforward machine learning technique used for classification and regression tasks. For classification, every example in the data must already be labeled with the group it belongs to. In this case, there are two groups: passengers who survived and passengers who did not. To predict the fate of a new passenger, KNN finds the k most similar passengers in the data (the "nearest neighbors") and counts how many of them belong to each group. The group with the most neighbors becomes the prediction. In other words, we predict a passenger's fate from the fates of the passengers most like them.
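
The neighbor-counting idea described above can be sketched in a few lines of plain Python. This is only an illustration and not the code from the project notebook: the function name `knn_predict`, the six made-up passengers, and the choice of straight-line (Euclidean) distance are all assumptions.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k):
    """Predict a label for new_point by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every training point
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep only the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Count the labels among those neighbors and return the most common one
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up passengers described by (scaled age, scaled fare); 1 = survived, 0 = did not
train_points = [(0.10, 0.90), (0.20, 0.80), (0.15, 0.70),
                (0.80, 0.10), (0.90, 0.20), (0.70, 0.15)]
train_labels = [1, 1, 1, 0, 0, 0]

prediction = knn_predict(train_points, train_labels, new_point=(0.25, 0.75), k=3)
print(prediction)
```

The new passenger sits close to the three made-up survivors, so all three of its nearest neighbors vote for "survived."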

However, KNN needs to be trained before it can make accurate predictions. To do this, you'll use part of the data to train the algorithm and the other part to test how well it performs. This process helps the algorithm learn from patterns in the training data and apply them to new, unseen data.

Companies and researchers are interested in KNN for a variety of uses. For instance, companies can use KNN to examine financial data and predict whether a particular investment is a good financial decision. Similarly, healthcare professionals can use KNN to analyze patient data to predict whether someone has a certain disease.

In this project, we will give you the basic code for implementing the KNN model. You will take that basic code and explore how changing the number of neighbors changes the model's accuracy when predicting whether a passenger from the Titanic survived.

Experimental Procedure

This project follows the Scientific Method. Review the steps before you begin.

Setting up the Google Colab Environment

  1. You will need a Google account. If you do not have one, make one when prompted.
  2. Download the titanic.ipynb file from Science Buddies.
  3. Upload the file to Google Colaboratory (you will need to sign in to your Google account at this point or make an account).
  4. Read the Troubleshooting Tips and How to Use This Notebook sections. Follow the instructions you find there.
  5. Run the code block under Importing Libraries to bring in all of the functions the notebook needs.
Coding Tip:

Occasionally, your Runtime may get disconnected and your local variables will be lost. If you find yourself getting NameErrors such as: name 'variable' is not defined, then you have two options.

  • Run all of the cells by clicking on 'Runtime' at the top of the notebook then click 'Run all', or
  • click on the current cell you are working on, then click 'Runtime' and 'Run before'.

Loading the Dataset into Google Colab

  1. Load the dataset into the Google Colab notebook by running the cell under this section. Congratulations, the dataset has been successfully loaded!

Preprocessing the Dataset

Preprocessing a dataset is a critical step in the machine learning process. It involves preparing the raw data before feeding it into a machine learning algorithm. Preprocessing serves several important purposes, and we will explain and demonstrate them one at a time.

  1. Dropping Features: First, we will drop features that we think will be uninformative for modeling. In this case, we will be dropping passenger ID, name, ticket number, cabin information, and port of embarkation. Here are the reasons for each:
    1. Passenger ID: The passenger ID is usually a unique identifier assigned to each passenger. It does not carry any inherent predictive value and is simply an arbitrary identifier. Including it as a feature could potentially introduce noise (random or irrelevant information that can disrupt the meaningful patterns or signals in data).
    2. Name: Similar to passenger ID, names are typically unique for each passenger and do not have a direct impact on survival prediction.
    3. Ticket Information: Similar to passenger ID and name, ticket numbers are unique identifiers and generally do not provide meaningful insights into a passenger's survival.
    4. Cabin Information: As we can see from the Cabin column, cabin information has a high number of missing values in the dataset, making it difficult to use effectively.
    5. Port of Embarkation: The port from which a passenger embarked is unlikely to have a significant impact on their survival. Survival on the Titanic is more likely to be influenced by factors such as passenger class, age, sex, and possibly fare price, which are more directly related to a passenger's circumstances during the disaster.

    We have provided the code to delete certain columns from our Pandas DataFrame. Add in the names of the columns you want to delete, making sure they match exactly as they are written in the dataset. Then, run the code in the cell.

  2. Normalizing Features: In KNN, the algorithm relies on measuring distances between data points to make predictions. To ensure accurate results, it is important that all features contribute fairly to these distance calculations. If features are not scaled or normalized, those with larger numerical ranges can unfairly dominate the distances.
    1. For instance, if one feature spans ages from 0 to 100 and another spans fare price from 0 to 512, the fare price's larger range could overshadow the age feature in distance calculations.
    2. Normalization is the process of scaling all the values in a dataset to a similar range. The goal is to bring the values of different features or variables to a common scale so that they are directly comparable and do not introduce bias or distortion in the analysis. The technique we will be using is called Min-Max scaling (bringing values between 0 and 1).
    3. In essence, scaling ensures that no single feature overwhelms the distance calculations. This enhances the algorithm's fairness in considering all features, resulting in more accurate predictions.

    As in the previous step, we have provided the code to normalize certain columns of our Pandas DataFrame. We will be normalizing our numerical variables, which in this case are Age, SibSp (number of siblings/spouses aboard the Titanic), Parch (number of parents/children aboard the Titanic), and Fare. Add these names to the list specified by the comment in the code. Then, run the code in the cell.

  3. Encoding Categorical Variables: Encoding categorical variables is essential when working with machine learning algorithms, because most algorithms, including KNN, require numerical input. Categorical variables, which represent qualitative attributes like gender, can't be used directly in their original form because they lack a numerical representation that algorithms can process.
    1. Label Encoding: Label encoding involves converting categorical values into integers. Each category is assigned a unique integer value.
    2. In our case, the only variable that requires encoding is Sex. It is worth noting that Survived and Pclass are already label encoded, so no further preprocessing is needed for these features.
    We have provided the code to label encode our Sex feature. Run the code in the cell and pay attention to how that changes the values in our dataframe!
  4. Handling Missing Data: Real-world datasets often contain missing values for various reasons, such as data collection errors or incomplete records. Preprocessing involves strategies such as imputation (filling missing values with estimated values) or removal of incomplete rows/columns to ensure the dataset is complete and usable for modeling.
    1. Missing values in datasets often show up as NaN (Not a Number). KNN cannot compute a distance from a value that is missing, so removing the rows or columns that contain NaN values leaves a dataset the model can use.
    We have provided the code to drop NaN values. Run the code in the cell.
  5. Splitting the Training and Testing Data: Splitting data into training and testing sets is crucial in machine learning. It lets you assess your model's performance on new, unseen data, which shows how well the model generalizes beyond the training examples. We have provided the code to split the dataset into training and testing portions. Take note of how X and y look following this step, as well as the dimensions of X_train, X_test, y_train, and y_test. The numbers inside the parentheses represent the number of rows and the number of features in each set.
    Coding Tip:

    By convention, X is written in uppercase because it is a table (matrix) of features with many columns, while y is written in lowercase because it is a single column of labels.
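
The five preprocessing steps above can be sketched together on a tiny made-up table. This is a minimal illustration and not the notebook's own code. It makes several assumptions: the column names follow the widely used Kaggle layout of the Titanic data (PassengerId, Name, Ticket, and Embarked are not confirmed by this page), Sex is mapped to 0 and 1 by hand, and scikit-learn's `train_test_split` does the split. Min-Max scaling is written out directly as x' = (x - min) / (max - min).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows for illustration only; column names are assumed to follow the
# common Kaggle Titanic layout and may differ from the project notebook.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6, 7, 8],
    "Survived":    [0, 1, 1, 1, 0, 0, 0, 1],
    "Pclass":      [3, 1, 3, 1, 3, 3, 1, 2],
    "Name":        ["A", "B", "C", "D", "E", "F", "G", "H"],
    "Sex":         ["male", "female", "female", "female", "male", "male", "male", "female"],
    "Age":         [22.0, 38.0, 26.0, 35.0, 35.0, None, 54.0, 14.0],
    "SibSp":       [1, 1, 0, 1, 0, 0, 0, 1],
    "Parch":       [0, 0, 0, 0, 0, 0, 0, 2],
    "Ticket":      ["T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8"],
    "Fare":        [7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 51.86, 30.07],
    "Cabin":       [None, "C85", None, "C123", None, None, "E46", None],
    "Embarked":    ["S", "C", "S", "S", "S", "Q", "S", "C"],
})

# 1. Drop features that are unlikely to help the model
df = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin", "Embarked"])

# 2. Min-Max scale the numerical features so each one runs from 0 to 1
numeric_cols = ["Age", "SibSp", "Parch", "Fare"]
col_min = df[numeric_cols].min()
col_max = df[numeric_cols].max()
df[numeric_cols] = (df[numeric_cols] - col_min) / (col_max - col_min)

# 3. Label encode Sex as integers (this particular mapping is an assumption)
df["Sex"] = df["Sex"].map({"female": 0, "male": 1})

# 4. Drop rows that still contain missing values (here, the row with no Age)
df = df.dropna()

# 5. Separate the features (X) from the label (y), then split into train and test sets
X = df.drop(columns=["Survived"])
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)  # (rows, features) for each set
```

Fixing `random_state` makes the split repeatable, which helps when you later compare different settings on the same test set.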

Training the Model

  1. We have provided the code to make a KNN classifier with a specified number of neighbors, where we have chosen 5 as the default number. This code trains the classifier using the training data. It then makes predictions on survival outcomes for the test data based on what it learns from the training data. Afterward, it calculates how accurate its guesses were.
    1. Accuracy is a measure used to evaluate the performance of a machine learning model. It represents the proportion of correctly predicted outcomes or labels compared to the total number of instances in the dataset. In simpler terms, accuracy tells you how often the model's predictions are correct.
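      1. Mathematically, accuracy is calculated as: $\text{Accuracy} = \dfrac{\text{number of correct predictions}}{\text{total number of predictions}}$
      2. An accuracy of 1 means that the model is correct all of the time.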
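
As a rough sketch of the train, predict, and score sequence described above, the example below uses scikit-learn's `KNeighborsClassifier`. Whether the notebook uses this exact class is an assumption, and the data here is a synthetic stand-in generated by `make_classification`, not the Titanic dataset. The accuracy is computed twice, once with `accuracy_score` and once by hand, to show that it is simply the fraction of correct predictions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (NOT the Titanic dataset): 200 rows, 6 features, 2 classes
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create the classifier with 5 neighbors and give it the training examples
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict the label of every row in the test set
y_pred = knn.predict(X_test)

# Accuracy = correct predictions / total predictions
accuracy = accuracy_score(y_test, y_pred)
manual_accuracy = (y_pred == y_test).sum() / len(y_test)

print(f"Accuracy: {accuracy:.2f}")
```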

Visualizing the Model

  1. The graph generated by the provided code visualizes the decision boundary of the KNN classifier in a two-dimensional space. This decision boundary represents how the KNN classifier separates different classes based on features of the data.
  2. Before graphing, we must reduce the dimensionality of the data. Dimensionality refers to the number of different features used to describe each item in the dataset. There are many methods for reducing dimensionality, and we will use a technique called Principal Component Analysis (PCA).
    1. In the dataset, there are multiple features (dimensions) that describe each instance. In our Titanic dataset, we have features like age, fare, and number of siblings. Visualizing with more than three dimensions is challenging because we cannot easily represent such high-dimensional spaces on a two-dimensional surface like a graph.
    2. PCA helps address this issue by transforming the original features into a new set of features called principal components. Each principal component is a linear combination of the original features, and the components are ordered so that the first few capture as much of the variability in the data as possible. By using just the first few principal components, we can effectively reduce the dimensionality of the data while retaining as much meaningful information as possible.
    3. In the provided code, PCA is used to reduce the dataset to just two principal components. This means that instead of visualizing the data in its original high-dimensional space, we are now plotting the data points in a two-dimensional space defined by the two most important principal components. This makes it possible to create a clear and interpretable graph of the decision boundary of the KNN classifier.
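
The reduction to two principal components might look like the sketch below, which assumes scikit-learn's `PCA` class and uses random numbers as a stand-in for the scaled passenger features. The `explained_variance_ratio_` attribute reports what fraction of the data's variability each component retains.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for 100 passengers described by 6 scaled features
rng = np.random.default_rng(0)
X = rng.random((100, 6))

# Ask PCA for the two directions that capture the most variability
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # (100, 2): same rows, now only two columns
print(pca.explained_variance_ratio_)  # share of variability kept by each component
```

Each row of `X_2d` can then be plotted as a point on an ordinary two-dimensional graph.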

Evaluating the Model

  1. Describe what the graph of the KNN decision boundary on the PCA-reduced data is showing.
  2. How does the graph help us understand the performance and behavior of the KNN classifier?
  3. Can you explain the relationship between the regions on the graph and the predictions made by the KNN model?
  4. Can you explain what the accuracy score means?

Experimenting and Improving

We may be able to improve our model's accuracy by experimenting with different numbers of neighbors. The number of neighbors, often denoted as k, can significantly impact the performance and behavior of the model.

  1. When k is too small: The model might give too much weight to a small number of neighbors and miss the overall trend of the data. For example, if you only ask two friends about their recommendations for a restaurant, and one friend had a bad experience, you might avoid that restaurant.
  2. When k is too big: The model could be heavily influenced by the majority class and not notice the nuances of the minority class. For example, if you ask a lot of friends about their restaurant recommendations, their mixed opinions would balance each other out. Even if a couple of friends had negative experiences, the majority could sway your decision toward trying that restaurant.

Make three copies of the provided cell and set k to 1 in the first, 8 in the second, and 20 in the third. Run each one and compare the accuracies with the accuracy of the original cell (k = 5). Which k value yields the highest accuracy?
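
For reference, a comparison across k values could be sketched as follows. This again assumes scikit-learn and uses synthetic stand-in data, so the accuracies it prints say nothing about the Titanic results. The same train/test split is reused for every k so that the only thing changing between runs is the number of neighbors.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (NOT the Titanic dataset)
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train one classifier per k value and record its accuracy on the test set
accuracies = {}
for k in [1, 5, 8, 20]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracies[k] = knn.score(X_test, y_test)  # fraction of correct test predictions

for k, acc in accuracies.items():
    print(f"k = {k:2d}: accuracy = {acc:.3f}")
```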

Variations

  • Try to come up with an automatic way to look at accuracy vs. number of neighbors.
  • If you would like to try your hand at a more complex KNN project, try the Can AI Diagnose Breast Cancer? project.

Careers

If you like this project, you might enjoy exploring these related careers:

Career Profile
Many aspects of people's daily lives can be summarized using data, from what is the most popular new video game to where people like to go for a summer vacation. Data scientists (sometimes called data analysts) are experts at organizing and analyzing large sets of data (often called "big data"). By doing this, data scientists make conclusions that help other people or companies. For example, data scientists could help a video game company make a more profitable video game based on players'…
Career Profile
Are you interested in developing cool video game software for computers? Would you like to learn how to make software run faster and more reliably on different kinds of computers and operating systems? Do you like to apply your computer science skills to solve problems? If so, then you might be interested in the career of a computer software engineer.
Career Profile
Computers are essential tools in the modern world, handling everything from traffic control, car welding, movie animation, shipping, aircraft design, and social networking to book publishing, business management, music mixing, health care, agriculture, and online shopping. Computer programmers are the people who write the instructions that tell computers what to do.

Cite This Page

General citation information is provided here. Be sure to check the formatting, including capitalization, for the method you are using and update your citation, as needed.

MLA Style

Ngo, Tracey. "Can AI Predict Who Survived the Titanic's Sinking?" Science Buddies, 7 July 2025, https://www.sciencebuddies.org/science-fair-projects/project-ideas/ArtificialIntelligence_p009/artificial-intelligence/KNN-titanic-survivor. Accessed 7 June 2026.

APA Style

Ngo, T. (2025, July 7). Can AI Predict Who Survived the Titanic's Sinking? Retrieved from https://www.sciencebuddies.org/science-fair-projects/project-ideas/ArtificialIntelligence_p009/artificial-intelligence/KNN-titanic-survivor


Last edit date: 2025-07-07