Abstract
Getting started with machine learning is like unlocking a new world of possibilities, and the best part is that you don't need to be a computer genius to do it! In this project, you will create a K-Nearest Neighbors (KNN) machine learning model that can predict whether a patient has a benign tumor or malignant breast cancer based on the characteristics of the tumor cell nucleus, such as its radius, perimeter, area, and smoothness.
Summary
None
Readily available
No issues
Objective
Create a KNN machine learning model that can predict whether a patient has a malignant or benign breast tumor.
Introduction
Accurate and early diagnosis is crucial for effective treatment planning and patient outcomes, especially when it comes to a disease such as breast cancer. Diagnosis begins with determining whether a breast tumor (mass of abnormal cells) is benign or malignant.
Benign breast tumors may continue to grow, but they are not aggressive toward the tissue around them. They remain contained, and often doctors advise leaving them alone. In contrast, malignant breast tumors are cancerous. As they grow, they actively invade and damage the surrounding tissue. Over time they may become metastatic, which means that the cancerous cells have spread through the lymphatic system or blood and created secondary tumors in other parts of the body.
One of the methods used to diagnose whether a breast tumor is benign or malignant involves having a pathologist (a specialized doctor) look at biopsies (tissue samples) taken from the tumor using a fine needle.
The pathologist examines the tissue sample under a microscope, looking for physical changes in the cells' nuclei that indicate that the cells are cancerous. The pathologist may take measurements and make counts of what they see. In the end, they use their data to make a calculated diagnosis of benign or malignant.
As you can probably tell, this process takes time, and it has to be done by a highly trained expert. What if we could use technology to make it faster and easier?
In this science project, you will explore whether AI and machine learning can help accurately diagnose breast cancer. If so, their use could speed up the diagnosis process and bring diagnostic capabilities to geographic areas where cancer pathologists are not available.
Artificial intelligence (AI) is a branch of computer science focused on building tools that can solve problems and analyze information. Machine learning is a subdivision of AI. Its goal is to create tools that can learn and improve over time using data. One of the techniques employed in machine learning is the K-Nearest Neighbors (KNN) algorithm, which involves making predictions based on the characteristics of nearby data points. KNN exemplifies how machine learning algorithms harness data to make informed decisions and refine their performance over time.
KNN can save time and effort compared to analyzing data manually. It can pick up patterns that are hard for people to see, and it is consistent: the same data always produces the same prediction. Because it has no separate training phase, it adapts quickly when new examples are added, and it makes decisions automatically. It can be applied to small or large datasets and to data with many features, within the limits described next.
KNN has some limits, too. For example, it might not work as well with data that has a lot of noise—random or irrelevant information that can hide true patterns. And because it makes complex calculations, it can be slow to use with large datasets. It is important to decide if KNN is the right fit based on what you need to do with the data.
KNN is a straightforward machine learning technique used for classification and regression tasks. It predicts by comparing a new data point to a certain number (k) of its nearest neighbors. The algorithm then makes predictions based on how the majority of its neighbors are categorized (for classification) or by averaging its neighbors' values (for regression).
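The majority-vote idea described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the code from the project notebook: the function name `knn_predict` and the six sample points are made up for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep only the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among those neighbors wins
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2-D points (not real tumor measurements)
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
labels = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

prediction = knn_predict(points, labels, query=(2.9, 3.1), k=3)
print(prediction)  # malignant
```

For regression, the same neighbor search would be followed by averaging the neighbors' values instead of counting votes.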
In this science project, we will guide you through training an AI model—using KNN—to diagnose breast tumor biopsies as benign or malignant. You will then test and optimize the model to achieve the best accuracy you can and evaluate how useful you think the model is.
The dataset you will be using comes from Dr. William H. Wolberg from the University of Wisconsin Hospitals in Madison. It is commonly called the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. It contains data about the nuclei of cells found in breast tumor samples.
Cancer involves abnormal cell growth. Cancerous cells divide rapidly, even when the cells are not fully ready. This can lead to changes in the number of chromosomes as well as other physical changes that can be seen in the nuclei of the cells (where the DNA is stored).

On the left, a circle contains a drawing of five normal cells, which are roughly round and have single, uniform nuclei. On the right, a circle contains a drawing of seven cancer cells, which are irregular in shape and size and have irregularly shaped nuclei. One cell even has two nuclei.
Figure 1. Normal breast tissue cells each have a single uniformly shaped and sized nucleus. In contrast, cancerous cells often have abnormally shaped and sized nuclei.
The WDBC dataset contains information about 10 features of the nuclei in the samples:
- Radius: the distance from the center of a nucleus to points on the perimeter of the nucleus
- Texture: the standard deviation of gray scale values of nuclei. The gray scale values can be used as a proxy for whether the nucleus has an abnormal distribution of features like chromatin, ribosomes, and nuclear pores.
- Perimeter: the measurement around the outside of the nucleus
- Area: the area of the nucleus
- Smoothness: the local variation in radius lengths around the outline of a nucleus
- Compactness: the perimeter squared, divided by the area of the nucleus
- Concavity: the severity of concave portions of the contour of the nucleus
- Concave points: the number of concave portions of the contour of the nucleus
- Symmetry: a measure of how uniform the nuclei are across the midpoint
- Fractal dimension: a measure of how misshapen the outline of the nucleus is
Taken together, these 10 features are intended to capture the regular, uniform shape of nuclei in normal breast tissue compared to the chaotic and irregular shapes of cancerous nuclei. For each image, the WDBC dataset includes three computed values for each feature: the mean, the standard error, and the "worst" or "largest" value (the mean of the three largest values).
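To get a feel for how a shape feature separates regular from irregular outlines, the compactness definition in the list above (perimeter squared divided by area) can be worked through for a few simple shapes. This is a small sketch under that definition only; the helper name `compactness` and the shapes are chosen for illustration and do not come from the dataset.

```python
import math

def compactness(perimeter, area):
    """Perimeter squared divided by area, as defined in the feature list."""
    return perimeter ** 2 / area

r = 2.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)  # equals 4 * pi for any radius
square = compactness(4 * 3.0, 3.0 ** 2)                  # side length 3
rectangle = compactness(2 * (1.0 + 4.0), 1.0 * 4.0)      # 1 x 4 rectangle

print(round(circle, 2), square, rectangle)  # 12.57 16.0 25.0
```

The circle gives the smallest value, and the value grows as the outline becomes more elongated, which is the kind of difference between regular and irregular nuclei that the feature is meant to pick up.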
Terms and Concepts
- Tumor
- Benign
- Malignant
- Metastatic
- Artificial intelligence (AI)
- Machine learning
- K-Nearest Neighbors (KNN)
- Noise
- Classification
- Regression
- Normalization
- Scaling
- Categorical variables
- Label encoding
- Accuracy
- Precision
- Recall
- Dimensionality
- Principal Component Analysis (PCA)
- Decision boundary
- Univariate Feature Selection
Questions
- What is the difference between benign and malignant tumors in terms of their behavior?
- How can AI and machine learning potentially benefit the field of medical diagnosis, like in breast cancer diagnosis?
- What is the K-Nearest Neighbors (KNN) algorithm? How does it make predictions?
- What are some strengths and limitations of the KNN algorithm?
- What features of the nuclei are included in the Wisconsin Diagnostic Breast Cancer dataset, and what do these features aim to capture?
- How would you evaluate the accuracy of the AI model's predictions, and why is accuracy important in this context?
Bibliography
- Kaggle. (1995, October 31). Breast Cancer Dataset. Retrieved August 9, 2023.
- StatQuest. (2023, February 12). One-Hot, Label, Target and K-Fold Target Encoding, Clearly Explained!!!. Retrieved August 9, 2023.
- ritvikmath. (2021, March 18). Should You Scale Your Data ??? : Data Science Concepts. Retrieved August 9, 2023.
- StatQuest. (2017, December 4). StatQuest: PCA main ideas in only 5 minutes!!!. Retrieved August 9, 2023.
Materials and Equipment
- Laptop or desktop computer
- Internet access
Experimental Procedure

Setting Up the Google Colab Environment
- You will need a Google account. If you do not have one, make one when prompted.
- Download the breastcancer.ipynb file from Science Buddies.
- Upload the file to Google Colaboratory (you will need to sign in to your Google account at this point or make an account).
- Read the Troubleshooting Tips and How to Use This Notebook sections. Follow the instructions you find there.
- Run the code block under Importing Libraries to bring in all of the functions the notebook uses.
Occasionally, your Runtime may get disconnected and your local variables will be lost. If you find yourself getting NameErrors such as: name 'variable' is not defined, then you have two options.
- Run all of the cells by clicking on 'Runtime' at the top of the notebook then click 'Run all', or
- click on the current cell you are working on, then click 'Runtime' and 'Run before'.
If your Runtime crashes, it is likely due to your computer using too much CPU. To fix this, you have some options:
- Try closing some tabs or other programs on your computer.
- Open 'Task Manager' on your computer and force quit some programs that you are not using.
- Switch the Runtime to GPU instead of CPU by clicking on 'Runtime' -> 'Change Runtime type' -> 'T4 GPU'. Note: The GPU limit for the free version of Google Colab is 12 GB. If you run out of GPU time on Google Colab, you may need to wait to run it again.
Preprocessing the Dataset
Preprocessing a dataset is a critical step in the machine learning process. It involves preparing the raw data before feeding it into a machine learning algorithm. Preprocessing serves several important purposes, and we will explain and demonstrate one at a time.
- Dropping Features: First, we will drop features that we think will be uninformative for modeling. In this case, we will be dropping the ID column. The patient ID is usually a unique identifier assigned to each patient. It doesn't carry any inherent predictive value and is simply an arbitrary identifier. We have provided the code to delete certain columns from our Pandas DataFrame—a two-dimensional and highly flexible data structure provided by the Pandas library in Python. Run the code and confirm that the ID column has been deleted from the dataframe.
- Normalizing/Scaling Features: In KNN, the algorithm relies on measuring distances between data points to make predictions. To ensure accurate results, it is important that all features contribute fairly to these distance calculations. If features are not scaled or normalized, those with larger numerical ranges can unfairly dominate the distances.
To learn more about scaling data, see the video on scaling listed in the Bibliography.
For instance, if one feature spans ages from 0 to 100, and another spans salary from 10k to 50k, the salary's larger range could overshadow the age feature in distance calculations.
- Normalization: This is the process of scaling all the values in a dataset to a similar range. The goal is to bring the values of different features or variables to a common scale so that they are directly comparable and do not introduce bias or distortion in the analysis. There are several techniques used to normalize data, and we'll be using Min-Max scaling to bring the values between 0 and 1.
- Scaling: In essence, scaling ensures that no single feature overwhelms the distance calculations. This enhances the algorithm's fairness in considering all features, resulting in more accurate predictions.
As with the previous step, we have provided the code to normalize certain columns from our Pandas DataFrame. Add in the names of the columns that we will be normalizing. We will be normalizing our numerical variables, which in our case would be everything but Diagnosis. Add in these names to the list specified by the comment in the code. Then, run the code in the cell.
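The effect of Min-Max scaling on distances can be seen with the age and salary example from above. The sketch below uses three hypothetical people and a hand-written helper, `min_max_scale`; the notebook's own code works on DataFrame columns and may be written differently.

```python
import math

def min_max_scale(values):
    """Rescale a list of numbers so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical people: person 0 is the one we compare the others against
ages = [25, 60, 30]                    # years
salaries = [50_000, 49_000, 10_000]    # dollars

raw = list(zip(ages, salaries))
scaled = list(zip(min_max_scale(ages), min_max_scale(salaries)))

# Distance from person 0 to persons 1 and 2, before and after scaling
raw_d = [math.dist(raw[0], p) for p in raw[1:]]
scaled_d = [math.dist(scaled[0], p) for p in scaled[1:]]

print([round(d, 1) for d in raw_d])     # salary differences dominate
print([round(d, 3) for d in scaled_d])  # age and salary now both count
```

Before scaling, person 1 looks far closer to person 0 than person 2 does, even though they differ by 35 years in age, because the salary gap dwarfs everything else. After scaling, the two distances are nearly equal.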
- Encoding Categorical Variables: Encoding categorical variables is essential when working with machine learning algorithms, because most algorithms, including KNN, require numerical input. Categorical variables, which represent qualitative attributes like gender or group, cannot be used directly in their original form because they lack a numerical representation that algorithms can process.
- Label encoding: This involves converting categorical values into integers. Integers are whole numbers, both positive and negative, without any decimal or fraction components (for example, -1, 5, and 42). Each category is assigned a unique integer value.
In our case, the only variable that requires encoding is Diagnosis. Because there are only two possible outcomes for diagnosis (B for benign and M for malignant), label encoding would assign 0 to benign and 1 to malignant.
We have provided the code to label encode our Diagnosis feature. Run the code in the cell and pay attention to how that changes the values in our dataframe!
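As a rough picture of what label encoding does, the following sketch assigns integers to the diagnosis letters in alphabetical order, which yields the B to 0 and M to 1 mapping described above. The short list of diagnoses is made up, and the notebook may use a library encoder rather than a hand-built dictionary.

```python
# Hypothetical diagnosis column: B = benign, M = malignant
diagnoses = ["M", "B", "B", "M", "B"]

# Sort the distinct categories and number them in order
classes = sorted(set(diagnoses))
mapping = {label: index for index, label in enumerate(classes)}

encoded = [mapping[d] for d in diagnoses]

print(mapping)  # {'B': 0, 'M': 1}
print(encoded)  # [1, 0, 0, 1, 0]
```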
- Splitting the Training and Testing Data: Splitting data into training and testing sets is crucial in machine learning. By doing so, you assess your model's performance on new, unseen data, ensuring its ability to generalize beyond training examples.
We have provided the code to split the dataset into training and testing portions. Take note of how X and y look following this step, as well as the dimensions of X_train, X_test, y_train, and y_test. The numbers inside the parentheses represent the number of rows and the number of features in each set.
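A split of this kind might look like the sketch below. It makes several assumptions that are not taken from the notebook: it loads the copy of the WDBC data bundled with scikit-learn instead of the project's own file, it holds out 20% of the samples for testing, and it fixes `random_state` so the split is repeatable.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = data.data        # 569 samples, 30 numeric features each
# scikit-learn's copy stores 0 = malignant and 1 = benign;
# flip it so that 1 = malignant, matching the encoding used in this project
y = 1 - data.target

# Assumed settings: 80/20 split, stratified so both sets keep the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (rows, features) for each set
print(y_train.shape, y_test.shape)  # one label per row
```

The first number in each shape is the number of rows and the second is the number of features; the label arrays have only one dimension because each row has a single diagnosis.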
Training the Model
- We have provided the code to make a KNN classifier with a specified number of neighbors, where we have chosen 5 as the default number. This code then trains the classifier using the training data. Afterward, it calculates how accurate its guesses are.
- Accuracy is a measure used to evaluate the performance of a machine learning model. It represents the proportion of correctly predicted outcomes or labels compared to the total number of instances in the dataset. In simpler terms, accuracy tells you how often the model's predictions are correct.
Mathematically, accuracy is calculated as:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
- Precision is another measurement used to evaluate the performance of a machine learning model, and it focuses on the accuracy of positive predictions by the model. It answers the question: Of the instances the model predicted as positive, how many are actually positive?
Mathematically, precision is calculated as:
$$\text{Precision} = \frac{TP}{TP + FP}$$
- Recall is yet another measurement used to evaluate the performance of a machine learning model. Recall measures the ability of the model to correctly identify all positive instances. It answers the question: Of all the actual positive instances, how many did the model predict correctly?
Mathematically, recall is calculated as:
$$\text{Recall} = \frac{TP}{TP + FN}$$
In these formulas:
- True Positives (TP) are the cases where the model correctly predicted the positive class
- True Negatives (TN) are the cases where the model correctly predicted the negative class
- False Positives (FP) are the cases where the model incorrectly predicted the positive class when it was actually negative
- False Negatives (FN) are the cases where the model incorrectly predicted the negative class when it was actually positive
| | Actual: Positive | Actual: Negative |
|---|---|---|
| Predicted: Positive | True Positive | False Positive |
| Predicted: Negative | False Negative | True Negative |
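The three metrics can be computed by hand from the four counts defined above. The sketch below does this for ten made-up predictions, with 1 standing for malignant (the positive class); the function name `classification_metrics` is illustrative and is not part of the notebook.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN and derive accuracy, precision, and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when nothing was predicted positive
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Guard against division by zero when there are no actual positives
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Made-up labels: 4 malignant (1) and 6 benign (0) samples
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)  # {'accuracy': 0.8, 'precision': 0.75, 'recall': 0.75}
```

Here one malignant sample was missed (a false negative) and one benign sample was flagged (a false positive), so accuracy is 8 out of 10 while precision and recall are each 3 out of 4.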
Visualize the Model
- Before graphing, we must reduce the dimensionality of the data. Dimensionality refers to the number of different features used to describe each item in the dataset. There are many methods for reducing dimensionality, and here we will use a technique called Principal Component Analysis (PCA).
To learn more about PCA, see the StatQuest video on PCA listed in the Bibliography.
- In the dataset, there are multiple features (dimensions) that describe each instance. In our breast cancer dataset, we have features like radius, texture, perimeter, etc. Visualizing with more than three dimensions is challenging because we cannot easily represent such high-dimensional spaces on a two-dimensional surface like a graph.
- PCA helps address this issue by transforming the original features into a new set of features called principal components. These principal components are a linear combination of the original features and capture most of the variability in the data. By using just the first few principal components, we can effectively reduce the dimensionality of the data while retaining as much meaningful information as possible.
- In the provided code, PCA is used to reduce the dataset to just two principal components. This means that instead of visualizing the data in its original high-dimensional space, we are now plotting the data points in a two-dimensional space defined by the two most important principal components. This makes it possible to create a clear and interpretable graph of the decision boundary of the KNN classifier. A decision boundary is the dividing line or surface that separates different classes in a classification model.
- The graph generated by the provided code visualizes the KNN classifier's decision boundary in two-dimensional space. This decision boundary represents how the KNN classifier separates different classes based on the data features.
- The color blue is Class 0, corresponding to the samples that are benign, and the color orange is Class 1, corresponding to the samples that are malignant.
- The blue and orange background regions show what the model will predict for a new point. If a new point falls in the blue region, the model will predict that it is a benign tumor. If it falls in the orange region, the model will predict that it is malignant.
- If you see an orange point in the blue background, or a blue point in an orange background, the model has misclassified those points.
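The colored background regions described above come from asking the classifier for a prediction at every point of a fine grid in the two-component space. The sketch below shows one way that could be done; it is an assumption-laden illustration, not the notebook's code. It uses scikit-learn's bundled copy of the dataset, scales and projects all samples together for simplicity, and picks a 100 by 100 grid arbitrarily.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

data = load_breast_cancer()
X = MinMaxScaler().fit_transform(data.data)
y = 1 - data.target  # flip scikit-learn's labels so 1 = malignant

# Project the 30 features onto the first two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()  # share of variance kept

# Train KNN on the two-dimensional data
knn = KNeighborsClassifier(n_neighbors=5).fit(X_2d, y)

# Build a grid covering the data and predict a class for every grid point
x_min, x_max = X_2d[:, 0].min() - 0.1, X_2d[:, 0].max() + 0.1
y_min, y_max = X_2d[:, 1].min() - 0.1, X_2d[:, 1].max() + 0.1
xx, yy = np.meshgrid(
    np.linspace(x_min, x_max, 100),
    np.linspace(y_min, y_max, 100),
)
grid_pred = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

print(f"Variance kept by two components: {explained:.2f}")
print(grid_pred.shape)
```

The `grid_pred` array holds a 0 or 1 for each grid cell. Drawing it as a filled contour plot (for example with Matplotlib's `contourf`) and overlaying the actual samples gives a picture like the one described above, where the line between the two colors is the decision boundary.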
- We will be comparing different neighbor sizes using a loop that spans from 1 to 21 neighbors. We have the starter code under the graph of the decision boundary, and under each comment we have provided the pseudocode for each step in the loop. (Hint: The code will be very similar to the code in the Training the Model section.)
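The comparison loop could take roughly the following shape. Treat it as a hedged sketch and not as the solution to the starter code: it assumes scikit-learn's bundled copy of the dataset, an 80/20 stratified split, and Min-Max scaling fitted on the training portion only, and the names `results` and `best_k` are invented for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

data = load_breast_cancer()
X = data.data
y = 1 - data.target  # flip scikit-learn's labels so 1 = malignant

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training data, then apply it to both sets
scaler = MinMaxScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

results = {}
for k in range(1, 22):  # k = 1, 2, ..., 21
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train_s, y_train)
    y_pred = model.predict(X_test_s)
    results[k] = (
        accuracy_score(y_test, y_pred),
        precision_score(y_test, y_pred),
        recall_score(y_test, y_pred),
    )

# Pick the k with the highest test accuracy (ties go to the smallest k)
best_k = max(results, key=lambda k: results[k][0])
print(best_k, results[best_k])
```

Plotting the three numbers in `results` against k gives the kind of graph referred to in the evaluation questions. Which k comes out best depends on the split, so the printed value should be read as one outcome and not as a general answer.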
Evaluating the Model
- Can you explain the difference between accuracy, precision, and recall?
- Which metric do you think is the most important in the case of diagnosing breast cancer patients, when there are high costs for false positives (a patient who has a benign tumor gets unnecessary treatments or interventions) and false negatives (a patient has a malignant tumor but is not diagnosed and therefore does not receive treatment)?
- Describe what the graph predicting the KNN decision boundary using PCA-reduced data is showing.
- Can you explain the relationship between the regions on the graph and the predictions made by the KNN model?
- How does the graph help us understand the performance and behavior of the KNN classifier?
- Based on the graph illustrating the contrast in performance metrics across varying neighbor counts, which value of k seems to optimize accuracy, precision, and recall? What do you think is the reason this neighbor size is optimal?
Experimenting with Weighted KNN (Optional)
Weighted KNN is a variation of the K-Nearest Neighbors algorithm used for classification tasks in machine learning. In standard KNN, all neighboring data points have an equal say in the classification decision. However, in weighted KNN, different neighbors contribute differently to the decision-making process based on their proximity or other factors.
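One common way to let closer neighbors count for more is to weight each vote by the inverse of its distance. The sketch below contrasts that with a plain majority vote on three made-up points; the function names and the inverse-distance rule are assumptions for illustration, since the text above does not say which weighting the notebook uses.

```python
import math
from collections import Counter, defaultdict

def nearest_neighbors(train_points, train_labels, query, k):
    """Return the k closest (distance, label) pairs to the query."""
    pairs = [(math.dist(p, query), label) for p, label in zip(train_points, train_labels)]
    return sorted(pairs, key=lambda pair: pair[0])[:k]

def majority_vote(neighbors):
    """Standard KNN: every neighbor gets one equal vote."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def distance_weighted_vote(neighbors):
    """Weighted KNN: each neighbor's vote is 1 / distance."""
    scores = defaultdict(float)
    for distance, label in neighbors:
        scores[label] += 1.0 / (distance + 1e-9)  # small constant avoids division by zero
    return max(scores, key=scores.get)

# Made-up 2-D points: one very close malignant sample, two farther benign ones
points = [(0.0, 0.0), (2.0, 0.0), (2.2, 0.0)]
labels = ["malignant", "benign", "benign"]

neighbors = nearest_neighbors(points, labels, query=(0.5, 0.0), k=3)
plain = majority_vote(neighbors)
weighted = distance_weighted_vote(neighbors)
print(plain, weighted)  # benign malignant
```

With equal votes, the two benign points outvote the single malignant one. With distance weighting, the malignant point wins because it is much closer to the query.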
- Feature Selection: We can select a subset of features that we would like to assign more weight to. There are many feature selection techniques, but we will be using Univariate Feature Selection, which uses statistical tests to select the most relevant features.
- ANOVA (Analysis of Variance) F-statistic is a statistical test used to compare the means of two or more groups to determine whether there are statistically significant differences among them. In the context of feature selection in machine learning, the ANOVA F-statistic is often used to assess the relationship between individual features (variables) and a categorical target variable.
We have provided the code to perform feature selection. Run these code cells, then complete the project as we did for regular KNN. How do regular KNN and weighted KNN compare?
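For orientation, univariate feature selection with the ANOVA F-statistic can be sketched with scikit-learn's `SelectKBest` and `f_classif`. Everything specific here is an assumption and may differ from the notebook: the bundled copy of the dataset, keeping the 10 highest-scoring features, and using `weights="distance"` in the classifier as the weighting scheme.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

data = load_breast_cancer()
X, y = data.data, 1 - data.target  # flip labels so 1 = malignant

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = MinMaxScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Score each feature against the diagnosis and keep the 10 best (assumed number)
selector = SelectKBest(score_func=f_classif, k=10).fit(X_train_s, y_train)
selected_names = data.feature_names[selector.get_support()]
X_train_sel = selector.transform(X_train_s)
X_test_sel = selector.transform(X_test_s)

# Regular KNN on all features vs. distance-weighted KNN on the selected ones
regular = KNeighborsClassifier(n_neighbors=5).fit(X_train_s, y_train)
weighted = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X_train_sel, y_train)

regular_acc = accuracy_score(y_test, regular.predict(X_test_s))
weighted_acc = accuracy_score(y_test, weighted.predict(X_test_sel))

print(list(selected_names))
print(f"regular: {regular_acc:.3f}  weighted: {weighted_acc:.3f}")
```

The two accuracies give a first answer to the comparison question, although a single split is a small basis for a conclusion; repeating the comparison across several values of k, as in the regular KNN section, gives a fuller picture.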
Global Goals
The United Nations Sustainable Development Goals (UNSDGs) are a blueprint to achieve a better and more sustainable future for all.
Variations
- Try implementing weighted KNN and experiment with different distance metrics.