Can AI Diagnose Breast Cancer?

Abstract

Getting started with machine learning is like unlocking a new world of possibilities, and the best part is that you don't need to be a computer genius to do it! In this project, you will create a K-Nearest Neighbors (KNN) machine learning model that can predict whether a patient has a benign tumor or malignant breast cancer based on the characteristics of the tumor cell nucleus, such as its radius, perimeter, area, and smoothness.

Summary

Areas of Science
Difficulty
Method
Time Required: Short (2-5 days)
Prerequisites: None
Material Availability: Readily available
Cost: Very Low (under $20)
Safety: No issues

Credits
Science Buddies is committed to creating content authored by scientists and educators. Learn more about our process and how we use AI.

Objective

Create a KNN machine learning model that can predict whether a patient has a malignant or benign breast tumor.

Introduction

Accurate and early diagnosis is crucial for effective treatment planning and patient outcomes, especially when it comes to a disease such as breast cancer. Diagnosis begins with determining whether a breast tumor (mass of abnormal cells) is benign or malignant.

Benign breast tumors may continue to grow, but they are not aggressive toward the tissue around them. They remain contained, and often doctors advise leaving them alone. In contrast, malignant breast tumors are cancerous. As they grow, they actively invade and damage the surrounding tissue. Over time they may become metastatic, which means that the cancerous cells have spread through the lymphatic system or blood and created secondary tumors in other parts of the body.

One of the methods used to diagnose whether a breast tumor is benign or malignant involves having a pathologist (a specialized doctor) look at biopsies (tissue samples) taken from the tumor using a fine needle.

The pathologist examines the tissue sample under a microscope, looking for physical changes in the cells' nuclei that indicate that the cells are cancerous. The pathologist may take measurements and make counts of what they see. In the end, they use their data to make a calculated diagnosis of benign or malignant.

As you can probably tell, this process takes time, and it has to be done by a highly trained expert. What if we could use technology to make it faster and easier?

In this science project, you will explore whether AI and machine learning can help accurately diagnose breast cancer. If so, their use could speed up the diagnosis process and bring diagnostic capabilities to geographic areas where cancer pathologists are not available.

Artificial intelligence (AI) is a branch of computer science focused on building tools that can solve problems and analyze information. Machine learning is a subfield of AI. Its goal is to create tools that can learn from data and improve as they are given more of it. One of the techniques employed in machine learning is the K-Nearest Neighbors (KNN) algorithm, which makes a prediction about a new data point based on the characteristics of the data points nearest to it.

KNN can save time and effort compared to analyzing data manually. It can find complex patterns that might be hard for people to see, and it is consistent: the same data always produces the same prediction. It needs no lengthy training step, so it adjusts quickly when new data is added, and it makes decisions automatically. It can also handle data described by many features.

KNN has some limits, too. For example, it might not work as well with data that has a lot of noise (random or irrelevant information that can hide true patterns). And because it must measure the distance from each new data point to every stored data point, it can be slow to use with large datasets. It is important to decide whether KNN is the right fit based on what you need to do with the data.

KNN is a straightforward machine learning technique used for classification and regression tasks. It predicts by comparing a new data point to a certain number (k) of its nearest neighbors. The algorithm then makes predictions based on how the majority of its neighbors are categorized (for classification) or by averaging its neighbors' values (for regression).
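The majority-vote idea described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's notebook code: the function name `knn_predict` and the six sample points are made up for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every training point
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep only the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among those neighbors is the prediction
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical, made-up points with two features each (not real patient data)
train_points = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
train_labels = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

prediction = knn_predict(train_points, train_labels, (2.8, 3.1), k=3)
print(prediction)
```

For regression, the last step would average the neighbors' values instead of counting votes.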

In this science project, we will guide you through training an AI model—using KNN—to diagnose breast tumor biopsies as benign or malignant. You will then test and optimize the model to achieve the best accuracy you can and evaluate how useful you think the model is.

The dataset you will be using comes from Dr. William H. Wolberg from the University of Wisconsin Hospitals in Madison. It is commonly called the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. It contains data about the nuclei of cells found in breast tumor samples.

Cancer involves abnormal cell growth. Cancerous cells divide rapidly, even when the cells are not fully ready. This can lead to changes in the number of chromosomes as well as other physical changes that can be seen in the nuclei of the cells (where the DNA is stored).

 Image Credit: Science Buddies

On the left, a circle contains a drawing of five normal cells, which are roughly round and have single, uniform nuclei. On the right, a circle contains a drawing of seven cancer cells, which are irregular in shape and size and have irregularly shaped nuclei. One cell even has two nuclei.


Figure 1. Normal breast tissue cells each have a single uniformly shaped and sized nucleus. In contrast, cancerous cells often have abnormally shaped and sized nuclei.

The WDBC dataset contains information about 10 features of the nuclei in the samples:

Taken together, these 10 features are intended to capture the regular, uniform shape of nuclei in normal breast tissue compared to the chaotic and irregular shapes of cancerous nuclei. The WDBC dataset includes a computed mean, standard error, and "worst" or "largest" (mean of the three largest values) for each feature for each image, giving 30 values per image.
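One way to see the feature names for yourself is to load the copy of the WDBC dataset that ships with scikit-learn. This is a sketch under an assumption: the project notebook may load its own file with different column names, so treat `load_breast_cancer` here as a stand-in.

```python
from sklearn.datasets import load_breast_cancer

# scikit-learn bundles a copy of the WDBC dataset (assumed here as a stand-in
# for the file the project notebook loads)
data = load_breast_cancer()
feature_names = list(data.feature_names)

# The first 10 columns are the per-image means; strip the "mean " prefix
# to get the names of the 10 underlying features
base_features = [name.replace("mean ", "", 1) for name in feature_names[:10]]

print(base_features)
print(data.data.shape)  # (number of samples, number of feature columns)
```

The remaining 20 columns hold the standard error and "worst" value of the same 10 features.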

Experimental Procedure

This project follows the Scientific Method. Review the steps before you begin.

Setting Up the Google Colab Environment

  1. You will need a Google account. If you do not have one, make one when prompted.
  2. Download the breastcancer.ipynb file from Science Buddies.
  3. Upload the file to Google Colaboratory, signing in to your Google account when prompted.
  4. Read the Troubleshooting Tips and How to Use This Notebook sections. Follow the instructions you find there.
  5. Run the code block under Importing Libraries to bring in all of the functions.

Warning:

Occasionally, your Runtime may get disconnected and your local variables will be lost. If you find yourself getting NameErrors such as: name 'variable' is not defined, then you have two options.

  • Run all of the cells by clicking 'Runtime' at the top of the notebook, then 'Run all', or
  • Click the cell you are currently working on, then click 'Runtime' and 'Run before'.

If your Runtime crashes, it is likely due to your computer using too much CPU. To fix this, you have some options:

  • Try closing some tabs or other programs on your computer.
  • Open 'Task Manager' on your computer and force quit some programs that you are not using.
  • Switch the Runtime to GPU instead of CPU by clicking on 'Runtime' -> 'Change Runtime type' -> 'T4 GPU'. Note: The GPU limit for the free version of Google Colab is 12 GB. If you run out of GPU time on Google Colab, you may need to wait to run it again.

Preprocessing the Dataset

Preprocessing a dataset is a critical step in the machine learning process. It involves preparing the raw data before feeding it into a machine learning algorithm. Preprocessing serves several important purposes, and we will explain and demonstrate them one at a time.

  1. Dropping Features: First, we will drop features that we think will be uninformative for modeling. In this case, we will be dropping the ID column. The patient ID is usually a unique identifier assigned to each patient. It doesn't carry any inherent predictive value and is simply an arbitrary identifier. We have provided the code to delete certain columns from our Pandas DataFrame—a two-dimensional and highly flexible data structure provided by the Pandas library in Python. Run the code and confirm that the ID column has been deleted from the dataframe.
  2. Normalizing/Scaling Features: In KNN, the algorithm relies on measuring distances between data points to make predictions. To ensure accurate results, it is important that all features contribute fairly to these distance calculations. If features are not scaled or normalized, those with larger numerical ranges can unfairly dominate the distances.

    For instance, if one feature spans ages from 0 to 100, and another spans salary from 10k to 50k, the salary's larger range could overshadow the age feature in distance calculations.

    1. Normalization: This is the process of scaling all the values in a dataset to a similar range. The goal is to bring the values of different features or variables to a common scale so that they are directly comparable and do not introduce bias or distortion in the analysis. There are several techniques used to normalize data, and we'll be using Min-Max scaling to bring the values between 0 and 1.
    2. Scaling: In essence, scaling ensures that no single feature overwhelms the distance calculations. This enhances the algorithm's fairness in considering all features, resulting in more accurate predictions.

    As with the previous step, we have provided the code to normalize certain columns from our Pandas DataFrame. Add in the names of the columns that we will be normalizing. We will be normalizing our numerical variables, which in our case would be everything but Diagnosis. Add in these names to the list specified by the comment in the code. Then, run the code in the cell.

  3. Encoding Categorical Variables: Encoding categorical variables is essential when working with machine learning algorithms, because most algorithms, including KNN, require numerical input. Categorical variables, which represent qualitative attributes like gender or group, cannot be used directly in their original form because they lack a numerical representation that algorithms can process.
    1. Label encoding: This involves converting categorical values into integers. Integers are whole numbers, both positive and negative, without any decimal or fraction components (for example, -1, 5, and 42). Each category is assigned a unique integer value.

      In our case, the only variable that requires encoding is Diagnosis. Because there are only two possible outcomes for diagnosis (B for benign and M for malignant), label encoding would assign 0 to benign and 1 to malignant.

    We have provided the code to label encode our Diagnosis feature. Run the code in the cell and pay attention to how that changes the values in our dataframe!

  4. Splitting the Training and Testing Data: Splitting data into training and testing sets is crucial in machine learning. By doing so, you assess your model's performance on new, unseen data, ensuring its ability to generalize beyond training examples.

    We have provided the code to split the dataset into training and testing portions. Take note of how X and y look following this step, as well as the dimensions of X_train, X_test, y_train, and y_test. The numbers inside the parentheses represent the number of rows and the number of features in each set.
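The first three preprocessing steps can be sketched on a tiny table. This is an illustration only: the four rows and the column names `radius_mean` and `area_mean` are made up, and the project notebook's own code and column names may differ.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Hypothetical, made-up rows for illustration only (not real patient data)
df = pd.DataFrame({
    "ID": [101, 102, 103, 104],
    "Diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [18.0, 12.0, 11.0, 20.0],
    "area_mean": [1000.0, 450.0, 400.0, 1200.0],
})

# 1. Drop the identifier column, which carries no predictive information
df = df.drop(columns=["ID"])

# 2. Min-Max scale every numerical column so its values fall between 0 and 1
numeric_columns = [col for col in df.columns if col != "Diagnosis"]
df[numeric_columns] = MinMaxScaler().fit_transform(df[numeric_columns])

# 3. Label encode the diagnosis; LabelEncoder sorts the labels, so B -> 0 and M -> 1
df["Diagnosis"] = LabelEncoder().fit_transform(df["Diagnosis"])

print(df)
```

After scaling, the smallest value in each numerical column becomes 0 and the largest becomes 1, so no column dominates the distance calculation simply because of its units.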

Coding Tip: Following the standard coding conventions, X is commonly written in uppercase, while y is usually in lowercase.
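A split that follows the X and y convention might look like the sketch below. Several choices here are assumptions rather than the notebook's settings: the data comes from scikit-learn's bundled copy of WDBC, 20% of the samples are held out for testing, and `random_state=42` is an arbitrary seed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = data.data        # feature matrix: one row per sample, one column per feature
y = 1 - data.target  # scikit-learn codes malignant as 0; flip so 1 = malignant, as in this project

# Hold out 20% of the samples for testing (an assumed ratio); stratify keeps the
# benign/malignant proportions similar in both portions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (rows, features) for each portion
print(y_train.shape, y_test.shape)  # labels have one value per row
```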

Training the Model

  1. We have provided the code to make a KNN classifier with a specified number of neighbors, where we have chosen 5 as the default number. This code then trains the classifier using the training data. Afterward, it calculates how accurate its guesses are.
    1. Accuracy is a measure used to evaluate the performance of a machine learning model. It represents the proportion of correctly predicted outcomes or labels compared to the total number of instances in the dataset. In simpler terms, accuracy tells you how often the model's predictions are correct.

      Mathematically, accuracy is calculated as:

      Accuracy = (TP + TN) / (TP + TN + FP + FN)

    2. Precision is another measurement used to evaluate the performance of a machine learning model, and it focuses on the accuracy of positive predictions by the model. It answers the question: Of the instances the model predicted as positive, how many are actually positive?

      Mathematically, precision is calculated as:

      Precision = TP / (TP + FP)

    3. Recall is yet another measurement used to evaluate the performance of a machine learning model. Recall measures the ability of the model to correctly identify all positive instances. It answers the question: Of all the actual positive instances, how many did the model predict correctly?

      Mathematically, recall is calculated as:

      Recall = TP / (TP + FN)

    In these formulas:

    • True Positives (TP) are the cases where the model correctly predicted the positive class
    • True Negatives (TN) are the cases where the model correctly predicted the negative class
    • False Positives (FP) are the cases where the model incorrectly predicted the positive class when it was actually negative
    • False Negatives (FN) are the cases where the model incorrectly predicted the negative class when it was actually positive
                        Actual Positive    Actual Negative
  Predicted Positive    True Positive      False Positive
  Predicted Negative    False Negative     True Negative

Table 1. Comparing the predicted result with the actual result, we can determine whether the prediction was a true positive, false positive, true negative, or false negative.
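The training step and the three metrics can be sketched together as follows. This is not the notebook's code: it assumes scikit-learn's bundled copy of the WDBC dataset, a 20% test split, and an arbitrary `random_state`. Malignant is treated as the positive class.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

data = load_breast_cancer()
X = MinMaxScaler().fit_transform(data.data)  # normalize, as in the preprocessing steps
y = 1 - data.target                          # flip so that 1 = malignant (positive class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train a KNN classifier with 5 neighbors and predict the held-out samples
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Note: scikit-learn's confusion_matrix puts actual classes in rows and predicted
# classes in columns, the transpose of Table 1
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

The hand-computed values should match scikit-learn's own `accuracy_score`, `precision_score`, and `recall_score`, which is a quick way to check that the counts were read from the matrix in the right order.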

Visualize the Model

  1. Before graphing, we must reduce the dimensionality of the data. Dimensionality refers to the number of different features used to describe each item in the dataset. There are many methods for reducing dimensionality, and here we will use a technique called Principal Component Analysis (PCA).

    1. In the dataset, there are multiple features (dimensions) that describe each instance. In our breast cancer dataset, we have features like radius, texture, perimeter, etc. Visualizing with more than three dimensions is challenging because we cannot easily represent such high-dimensional spaces on a two-dimensional surface like a graph.
    2. PCA helps address this issue by transforming the original features into a new set of features called principal components. These principal components are linear combinations of the original features and capture most of the variability in the data. By using just the first few principal components, we can effectively reduce the dimensionality of the data while retaining as much meaningful information as possible.
    3. In the provided code, PCA is used to reduce the dataset to just two principal components. This means that instead of visualizing the data in its original high-dimensional space, we're now plotting the data points in a two-dimensional space defined by the two most important principal components. This makes it possible to create a clear and interpretable graph of the decision boundary of the KNN classifier. A decision boundary is the dividing line or surface that separates different classes in a classification model.
  2. The graph generated by the provided code visualizes the KNN classifier's decision boundary in two-dimensional space. This decision boundary represents how the KNN classifier separates different classes based on the data features.
    1. The color blue is Class 0, corresponding to the samples that are benign, and the color orange is Class 1, corresponding to the samples that are malignant.
    2. The blue and orange background regions show the model's predictions. If a new point falls in the blue region, the model will predict that it is a benign tumor. If a new point falls in the orange region, the model will predict that it is malignant.
    3. If you see an orange point in the blue background, or a blue point in the orange background, the model has misclassified that point.
  3. We will be comparing different neighbor sizes using a loop that spans from 1 to 21 neighbors. We have the starter code under the graph of the decision boundary, and under each comment we have provided the pseudocode for each step in the loop. (Hint: The code will be very similar to the code in the Training the Model section.)
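The PCA reduction and the decision regions described in steps 1 and 2 can be sketched as below. It is an illustration under assumptions: the data is scikit-learn's bundled copy of WDBC, the grid resolution of 200 and the 0.5 margin are arbitrary, and the plotting itself is left as a comment.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

data = load_breast_cancer()
X = MinMaxScaler().fit_transform(data.data)
y = 1 - data.target  # 0 = benign, 1 = malignant

# Reduce the 30 feature columns to two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # share of the variance kept by each component

# Fit KNN in the two-dimensional PCA space
knn = KNeighborsClassifier(n_neighbors=5).fit(X_2d, y)

# Predict the class at every point of a grid covering the data;
# the edge between the 0 and 1 regions is the decision boundary
x_min, x_max = X_2d[:, 0].min() - 0.5, X_2d[:, 0].max() + 0.5
y_min, y_max = X_2d[:, 1].min() - 0.5, X_2d[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))
grid_pred = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# With matplotlib, plt.contourf(xx, yy, grid_pred) would shade the two regions
# and plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y) would overlay the samples
```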

Evaluating the Model

  1. Can you explain the difference between accuracy, precision, and recall?
    1. Which metric do you think is the most important in the case of diagnosing breast cancer patients, when there are high costs for false positives (a patient who has a benign tumor gets unnecessary treatments or interventions) and false negatives (a patient has a malignant tumor but is not diagnosed and therefore does not receive treatment)?
  2. Describe what the graph of the KNN decision boundary, drawn using PCA-reduced data, is showing.
  3. Can you explain the relationship between the regions on the graph and the predictions made by the KNN model?
  4. How does the graph help us understand the performance and behavior of the KNN classifier?
  5. Based on the graph illustrating the contrast in performance metrics across varying neighbor counts, which value of k seems to optimize accuracy, precision, and recall? What do you think is the reason this neighbor size is optimal?
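The comparison across neighbor counts can be tabulated with a loop like the one below. It is one possible sketch, not the notebook's solution, and it assumes scikit-learn's bundled copy of WDBC, a 20% test split, and an arbitrary `random_state`; your numbers will depend on your own split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

data = load_breast_cancer()
X = MinMaxScaler().fit_transform(data.data)
y = 1 - data.target  # 1 = malignant (positive class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

results = []
for k in range(1, 22):  # neighbor counts 1 through 21
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    results.append({
        "k": k,
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
    })

for row in results:
    print(f"k={row['k']:2d}  accuracy={row['accuracy']:.3f}  "
          f"precision={row['precision']:.3f}  recall={row['recall']:.3f}")
```

Odd values of k avoid tied votes between the two classes, which is one thing to keep in mind when comparing neighboring rows.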

Experimenting with Weighted KNN (Optional)

Weighted KNN is a variation of the K-Nearest Neighbors algorithm used for classification tasks in machine learning. In standard KNN, all neighboring data points have an equal say in the classification decision. However, in weighted KNN, different neighbors contribute differently to the decision-making process based on their proximity or other factors.

  1. Feature Selection: We can select a subset of features that we would like to assign more weight to. There are many feature selection techniques, but we will be using Univariate Feature Selection, which uses statistical tests to select the most relevant features.
    1. ANOVA (Analysis of Variance) F-statistic is a statistical test used to compare the means of two or more groups to determine whether there are statistically significant differences among them. In the context of feature selection in machine learning, the ANOVA F-statistic is often used to assess the relationship between individual features (variables) and a categorical target variable.

    We have provided the code to perform feature selection. Run these code cells, then complete the project as we did for regular KNN. How do regular KNN and weighted KNN compare?
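Univariate feature selection with the ANOVA F-statistic, followed by a distance-weighted KNN, can be sketched as follows. The choices here are assumptions, not the notebook's settings: keeping 10 features, using `weights="distance"` so that closer neighbors count more, and loading scikit-learn's bundled copy of WDBC.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

data = load_breast_cancer()
X = MinMaxScaler().fit_transform(data.data)
y = 1 - data.target  # 1 = malignant

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Score each feature with the ANOVA F-statistic and keep the 10 highest-scoring
# ones (10 is an arbitrary choice for this sketch)
selector = SelectKBest(score_func=f_classif, k=10)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)
selected_names = data.feature_names[selector.get_support()]
print(list(selected_names))

# weights="distance" makes closer neighbors count more in the vote
weighted_knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
weighted_knn.fit(X_train_sel, y_train)
weighted_accuracy = accuracy_score(y_test, weighted_knn.predict(X_test_sel))
print(f"weighted KNN accuracy={weighted_accuracy:.3f}")
```

Fitting the selector on the training portion only keeps the test samples from influencing which features are chosen.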

Global Goals

The United Nations Sustainable Development Goals (UNSDGs) are a blueprint to achieve a better and more sustainable future for all.

This project explores topics key to Good Health and Well-Being: Ensure healthy lives and promote well-being for all at all ages.

Variations

  • Try implementing weighted KNN and experiment with different distance metrics.

Cite This Page

General citation information is provided here. Be sure to check the formatting, including capitalization, for the method you are using and update your citation, as needed.

MLA Style

Ngo, Tracey. "Can AI Diagnose Breast Cancer?" Science Buddies, 21 July 2025, https://www.sciencebuddies.org/science-fair-projects/project-ideas/ArtificialIntelligence_p010/artificial-intelligence/KNN-breast-cancer. Accessed 14 June 2026.

APA Style

Ngo, T. (2025, July 21). Can AI Diagnose Breast Cancer? Retrieved from https://www.sciencebuddies.org/science-fair-projects/project-ideas/ArtificialIntelligence_p010/artificial-intelligence/KNN-breast-cancer


Last edit date: 2025-07-21