Predicting Thyroid Cancer Recurrence with Machine Learning

Abstract

For cancer patients, remission–a period when the signs and symptoms of cancer are reduced or disappear–brings immense relief, but there is often a chance of recurrence, or the cancer coming back. Have you ever wondered how doctors can predict if cancer might come back in some patients? Thyroid cancer, a type of cancer affecting the thyroid gland, has a recurrence rate of about 5-30%. Depending on many factors, some patients may have a higher chance of thyroid cancer recurrence than others. In this project, you will prepare and encode the data for our Random Forest model and then use the model to determine the top three factors indicating thyroid cancer recurrence. You will then train another model using only the top three factors to see if the top three alone can achieve similar accuracy. What factors do you think will be most predictive of thyroid cancer recurrence?

Summary

Areas of Science

Artificial Intelligence
Human Biology & Health

Difficulty

Method

Engineering Design Process

Time Required

Short (2-5 days)

Prerequisites

None

Material Availability

Readily available

Cost

Very Low (under $20)

Safety

No issues

Credits

Tracey Ngo, Science Buddies

Additional subject matter expertise provided by Laura Ohl, PhD, Science Buddies.

Science Buddies is committed to creating content authored by scientists and educators. Learn more about our process and how we use AI.

https://www.youtube.com/watch?v=oe-oOsHqCk0

Objective

Preprocess thyroid cancer risk factor data to create a Random Forest model that can predict whether a patient will have thyroid cancer recurrence.

Introduction

Cancer can return even after successful treatment, making regular follow-up appointments crucial for early detection of recurrence. Early intervention can significantly improve outcomes if the cancer returns. This is particularly important for thyroid cancer, a type of cancer that originates in the thyroid gland, a small butterfly-shaped organ located at the base of the neck. The thyroid plays a vital role in regulating the body’s metabolism, heart rate, and temperature by producing essential hormones.

Thyroid cancer can develop in various forms, each with different levels of aggressiveness and prognosis. Several factors can increase the risk of recurrence, such as the patient’s age at the time of diagnosis, the initial stage of the cancer, the specific type of thyroid cancer, and much more.

Thyroid cancer is one of the more common endocrine cancers, affecting about 1 in 200 people during their lifetime. Although the prognosis for thyroid cancer is generally favorable, especially in cases detected early, the possibility of recurrence emphasizes the importance of ongoing monitoring and follow-up care.

However, testing for thyroid cancer recurrence requires careful consideration to ensure that patients in remission–meaning the absence of detectable cancer following treatment–receive the most appropriate follow-up care without overburdening them with unnecessary procedures. Regular follow-up testing is crucial for early detection of recurrence (when cancer returns after a period of remission), but it is equally important to choose the right tests and determine the appropriate intervals between them. This approach not only improves patient outcomes by catching recurrences early but also helps in tailoring follow-up care to the specific needs of each patient. To reduce unnecessary testing and focus on those most likely to benefit, it is essential to identify the factors most predictive of thyroid cancer recurrence. This is where machine learning and Random Forests come in. Machine learning models offer a robust method for predicting outcomes, potentially guiding more personalized and effective follow-up care.

Artificial Intelligence (AI) is a branch of computer science focused on the creation of tools that can solve problems and analyze information. Machine learning is a subdivision of AI. Its goal is to create tools that can learn and improve over time using data. In this project, we will dive into Decision Trees, one type of machine learning algorithm. Decision Trees mimic the human decision-making process by breaking down a problem into a series of sequential questions or decisions. Decision Trees often serve as powerful tools for classifying data and solving various problems.

Watch this video to learn more about Decision Trees. We recommend watching the video from 0:18 to 9:15:

https://www.youtube.com/watch?v=_L39rN6gz7Y

However, Decision Trees have some drawbacks, like overfitting (being too specific to training data) and being sensitive to small changes in the data. To address these problems, we can use Random Forests, which combine several Decision Trees to enhance prediction accuracy. Random Forests help reduce overfitting and improve model stability, making them a great option for predicting whether or not a patient will have recurring thyroid cancer.

Watch this video to learn more about Random Forests:

https://www.youtube.com/watch?v=J4Wdy0Wc_xQ

In this project, your task is to preprocess, or prepare, the data for use in our Random Forest model. Then, determine the top three factors indicating thyroid cancer recurrence and run the model again to see if these factors achieve similar accuracy.

Disclaimer: This information is provided for educational purposes only and is not intended as medical advice.

Terms and Concepts

Recurrence
Thyroid cancer
Thyroid gland
Remission
Artificial Intelligence (AI)
Machine learning
Decision Tree
Overfit
Random Forest
Preprocess
Categorical variable
Label encoding
One-hot encoding
Model Accuracy Score
Receiver Operating Characteristic (ROC)
Area Under the Curve (AUC)

Questions

What are some factors that may increase the risk of thyroid cancer recurrence?
How might the use of machine learning models like Random Forests improve the personalization of follow-up care for thyroid cancer patients?
What are some drawbacks of using Decision Trees, and how do Random Forests address these issues?
Why might it be beneficial to focus on the top three factors for predicting thyroid cancer recurrence instead of using all available tests?
What are potential ethical considerations when using machine learning in healthcare?

Bibliography

You can access the dataset here:

Borzooei, S. & Tarokhian, A. (2023). Differentiated Thyroid Cancer Recurrence [Dataset]. UCI Machine Learning Repository.

To learn more about Decision Trees and Random Forests:

Normalized Nerd. (2021, April). Random Forest Algorithm Clearly Explained!. YouTube. Retrieved August 19, 2024.
StatQuest with Josh Starmer. (2021, April). Decision and Classification Trees, Clearly Explained!!!. YouTube. Retrieved August 19, 2024.
StatQuest with Josh Starmer. (2018, February). StatQuest: Random Forests Part 1 - Building, Using and Evaluating. YouTube. Retrieved August 19, 2024.
StatQuest with Josh Starmer. (2020, January). StatQuest: Random Forests Part 2: Missing data and clustering. YouTube. Retrieved August 19, 2024.

Materials and Equipment

Computer with Internet access

Experimental Procedure

This project follows the Engineering Design Process. Confirm with your teacher if this is acceptable for your project, and review the steps before you begin.

Overview

In this project, you will work with a dataset containing information about thyroid cancer patients, including details of their diagnoses, patient characteristics, test results, and whether their cancer recurred. You will preprocess the data using label encoding and one-hot encoding (both of which will be explained later) to prepare it for analysis with a Random Forest model. Your task will also include identifying the most significant factors in predicting thyroid cancer recurrence and evaluating whether using only the top three factors can still yield high predictive accuracy. You will also explore how different combinations of factors impact the model's accuracy.

Setting Up the Google Colab Environment

You will need a Google account. If you do not have one, make one when prompted.
Download the thyroid_cancer_recurrence.ipynb file from Science Buddies. This is the code you will need to process your data.
Download the thyroid_data.csv file from Science Buddies.
1. Open the dataset to familiarize yourself with its structure by going to your Downloads and double-clicking on the file.
2. You will see that the very last column (column P) is called ‘Recurred,’ and that is what we will be trying to predict. This column shows whether or not that patient had thyroid cancer recurrence after 15 years.
3. In columns A-O, you will see other variables such as ‘Age’ and ‘Gender’, as well as the ‘Stage’ at which their cancer was diagnosed, how well they responded to treatment, and more. Each of these factors will be described in more detail as you preprocess the data.
Within your Google Drive, click on ‘My Drive,’ then create a new folder and rename it “Thyroid Cancer Recurrence.” Inside the folder, upload both the thyroid_cancer_recurrence.ipynb file and the thyroid_data.csv file.
Double-click on the thyroid_cancer_recurrence.ipynb file. This should automatically open in Google Colab.
1. Read the Troubleshooting Tips and How to Use This Notebook sections. Follow the instructions you find in those sections.
2. Run the block under Importing Libraries to ensure you have access to all the functions we will use for this project.

1. Process the Data with All Features

1.1 Loading the Data into a Pandas DataFrame

(Code Block 1A) Run this code block to make the files on your Google Drive available to use in the notebook.
(Code Block 1B) Run this code block to create a DataFrame, which is like a table that will be used to load and manipulate the data in the notebook. You will see the data from your .csv file populate in a table below the code block.
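
As a rough sketch of what loading a CSV into a DataFrame looks like, the snippet below reads a few made-up rows from an in-memory string rather than the real thyroid_data.csv file. The rows and the use of io.StringIO are assumptions for illustration only; the notebook's Code Block 1B reads the file from your Google Drive, and its exact code may differ.

```python
import io

import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
csv_text = """Age,Gender,Smoking,Recurred
27,F,No,No
62,M,Yes,Yes
45,F,No,No
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows of the table
print(df.shape)    # (number of rows, number of columns)
```

With the real file, you would pass the path to thyroid_data.csv in your Drive folder instead of the StringIO object.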

1.1.1 Label Encoding

When working with machine learning algorithms, it is important to convert categorical variables into numbers, since most algorithms operate on numbers rather than words. Categorical variables represent qualities like gender or group, and cannot be used directly because they lack a numerical form that algorithms can work with. Label encoding is a way to convert categorical variables into whole numbers, assigning each category a unique integer value (0, 1, 2, etc.).
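
To make the idea concrete, here is a minimal sketch of label encoding for yes/no columns. The three rows are made up, and the use of pandas' map with a dictionary is one possible approach, not necessarily the one used in the notebook.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Smoking": ["No", "Yes", "No"],
    "Recurred": ["No", "Yes", "Yes"],
})

yes_no_columns = ["Smoking", "Recurred"]

# Replace each "Yes"/"No" string with 1/0, one column at a time.
for column in yes_no_columns:
    df[column] = df[column].map({"No": 0, "Yes": 1})

print(df)
```

After this runs, both columns hold integers instead of text, which is the form a classifier can work with.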

(Code Block 1C) In this code block, you will specify the column names (features) from your .csv file that have “Yes” or “No” values: ‘Smoking,’ ‘Hx Smoking,’ ‘Hx Radiotherapy,’ and ‘Recurred.’ Inside the blue brackets under the comment #TODO, type in these column names. We have included the first column name for you. Be sure to use quotes around each column name and separate them using commas. Run the code and then check the output to see how the yes/no values were replaced with 1/0.
1. Note: By common coding conventions, 0 typically means ‘No’ and 1 typically means ‘Yes.’
2. If you get an error when you run the code block, double-check that you don’t have any typos and have used commas and quotes correctly.
3. Here is an example of how the code should look.

Image Credit: Tracey Ngo / Science Buddies

Figure 1. Example of how Code Block 1C should look after inputting the features that have "Yes" or "No" factors.

(Code Block 1D) In this code block, you will input the column names that have only two distinct values: ‘Gender,’ ‘Focality,’ and ‘M.’ Since these columns only contain two different values, it does not significantly impact our algorithm which value is assigned 0 or 1. Type the column names in the brackets under the #TODO comment, and check the output table to see how it has changed.
(Code Block 1E) In this code block, we have provided the code to find the unique values (distinct types in the data) in the ‘Physical Examination’ column. You will use this same code again later to find unique values for other columns as well. When you run this code, you will see five different types of thyroid conditions or goiters (abnormal enlargements of the thyroid gland) in the output. You will rank these from least to most severe in the next code block.
(Code Block 1F) In this code block, you will map each ‘Physical Examination’ condition to a number, ranking the conditions from least to most severe. Inside the curly braces (they look like this: { }), map each condition to a number (e.g., {‘Normal’: 0, ‘Diffuse goiter’: 1…}). Note that since left and right single nodular goiters have the same severity, you can assign them the same number. The conditions are listed below from least to most severe:
1. Normal: A normal thyroid gland indicates no visible abnormalities or nodules.
2. Diffuse goiter: A smooth thyroid gland with some swelling around the gland.
3. Single nodular goiter-left: A solitary nodule located on the left side of the thyroid gland. (This should be assigned the same value as Single Nodular Goiter Right.)
4. Single nodular goiter-right: A solitary nodule located on the right side of the thyroid gland. (This should be assigned the same value as Single Nodular Goiter Left.)
5. Multinodular goiter: The presence of multiple nodules within the thyroid gland.
(Code Block 1G) In this empty code block, refer to Code Block 1E, and use the code provided there as a model to help you write the code to find the unique values of the ‘Adenopathy’ column.
(Code Block 1H) In this code block, you will map each ‘Adenopathy’ condition to a number. Adenopathy is the swelling or enlargement of glandular tissue or lymph nodes. Rank each condition from least to most severe. Inside the curly braces, map each condition to a number, using the information below as a guide.
1. No: No enlarged lymph nodes present.
2. Right/Left: Enlarged lymph nodes on the right/left side of the neck.
3. Posterior: Enlarged lymph nodes in the posterior neck region.
4. Bilateral: Enlarged lymph nodes on both sides of the neck.
5. Extensive: Widespread enlarged lymph nodes across multiple regions of the neck.
(Code Block 1I) In this empty code block, refer to Code Block 1E and use the code provided there as a model to help you write the code to find the unique values of the ‘Pathology’ column.
(Code Block 1J) In this code block, you will map each ‘Pathology’ condition to a number. Rank each condition from least to most likely to cause thyroid cancer recurrence. Inside the curly braces, map each condition to a number, using the information below as a guide.
1. Micropapillary: A tumor with a histologic pattern of papillary cell groups, with a size of 1 cm or less. Micropapillary thyroid carcinomas generally have an excellent prognosis and a low risk of recurrence or metastasis.
2. Papillary: An irregular solid tumor mass with a histological pattern of malignant epithelial cells and follicular cell differentiation. Papillary thyroid carcinoma (PTC) is the most common type of thyroid cancer.
3. Follicular: A tumor containing malignant epithelial cells with invasive properties seen by histology. Follicular thyroid carcinoma (FTC) is less common than PTC but tends to be more aggressive.
4. Hurthle cell: A tumor that contains greater than 75% malignant Hurthle cells. Hurthle cell carcinoma (HCC) is a rare and lesser-known type of thyroid cancer.
(Code Block 1K) In this empty code block, refer to Code Block 1E and use the code provided there as a model to help you write the code to find the unique values of the ‘T’ column.
(Code Block 1L) In this code block, you will map each ‘T’ condition to a number. Rank each condition from least to most severe. Inside the curly braces, map each condition to a number. For this feature, rather than using the order we give you, research what ‘T’ stands for and learn about the severity of each state. Here is one resource to help you get started.
(Code Block 1M) In this empty code block, refer to Code Block 1E and use the code provided there as a model to help you write the code to find the unique values of the ‘N’ column.
(Code Block 1N) In this code block, you will map each ‘N’ condition to a number. Rank each condition from least to most severe. Inside the curly braces, map each condition to a number. Do some research to learn what ‘N’ stands for and find the severity of each state. Here is one resource to help you get started.
(Code Block 1O) In this empty code block, refer to Code Block 1E and use the code provided there as a model to help you write the code to find the unique values of the ‘Stage’ column.
(Code Block 1P) In this code block, you will map each ‘Stage’ condition to a number. Rank each condition from least to most severe. Inside the curly braces, map each condition to a number. Do some research to learn what ‘Stage’ means and find the severity of each state. Here is one resource to help you get started. Note that this feature uses Roman numerals (I=1, II=2, etc.).
(Code Block 1Q) In this empty code block, refer to Code Block 1E and use the code provided there as a model to help you write the code to find the unique values of the ‘Response’ column.
(Code Block 1R) In this code block, you will map each ‘Response’ condition to a number. Rank each condition from least to most severe. Inside the curly braces, map each condition to a number, using the information below as a guide.
1. Excellent: No indicators of detectable disease after treatment.
2. Indeterminate: An unclear result for the classification of a tumor (benign or malignant) after treatment.
3. Biochemical Incomplete: A lack of full response to treatment seen by continued thyroid dysfunction such as elevated thyroglobulin levels.
4. Structural Incomplete: A lack of full response to treatment seen by persistent tumor size, tumor growth, and/or lymph node swelling.
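
The two-step pattern used throughout this section (find the unique values, then map them to ranked numbers) can be sketched as follows. The rows are made up, and the category spellings in the real file may differ, which is exactly why it helps to print the unique values first. Using map with a dictionary is an assumption about the approach; the notebook's own code may look different.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Response": ["Excellent", "Structural Incomplete", "Indeterminate",
                 "Biochemical Incomplete", "Excellent"],
})

# Step 1: list the distinct categories in the column.
print(df["Response"].unique())

# Step 2: map each category to a number, from least to most severe.
response_order = {
    "Excellent": 0,
    "Indeterminate": 1,
    "Biochemical Incomplete": 2,
    "Structural Incomplete": 3,
}
df["Response"] = df["Response"].map(response_order)

# A category missing from the dictionary would become NaN, so check for that.
assert df["Response"].notna().all()
print(df)
```

If the final check fails, a category name in the dictionary probably does not match the spelling in the data.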

1.1.3 One-Hot Encoding

Sometimes, it makes more sense to use one-hot encoding instead of label encoding. If categorical data does not have an inherent order, using label encoding can mislead the model into assuming a false ordinal relationship. For example, in our label encoding earlier, it made sense to assign “Low,” “Intermediate,” and “High” to 0, 1, and 2. However, for our ‘Thyroid Function’ variable, hyperthyroidism, for example, is not necessarily better or worse than hypothyroidism. Therefore, one-hot encoding makes more sense for this variable. Check out this video for a quick explanation of one-hot encoding.

(Code Block 1S) This code block takes the ‘Thyroid Function’ column, applies one-hot encoding to it, and returns a NumPy array (a multi-dimensional array used for numerical data processing) where each original category is represented by a binary vector (a sequence of numbers consisting only of 0s and 1s). Run this code block.
(Code Block 1T) This code block creates a new DataFrame with the new one-hot encoded columns. Run this code block.
(Code Block 1U) This code block adds the one-hot encoded features to the original DataFrame. Run this code block.
(Code Block 1V) This code block removes the original ‘Thyroid Function’ column. Run this code block.
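
The four steps above (encode, build a new DataFrame, join it to the original, drop the text column) can be sketched as shown below. The rows are made up and the new column names are a hypothetical naming choice; the notebook's Code Blocks 1S-1V may differ in detail.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Age": [27, 62, 45],
    "Thyroid Function": ["Euthyroid", "Clinical Hyperthyroidism", "Euthyroid"],
})

# Encode the column; .toarray() turns the sparse result into a NumPy array.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["Thyroid Function"]]).toarray()

# Build a DataFrame with one 0/1 column per category.
column_names = [f"Thyroid Function_{c}" for c in encoder.categories_[0]]
encoded_df = pd.DataFrame(encoded, columns=column_names, index=df.index)

# Add the new columns, then drop the original text column.
df = pd.concat([df, encoded_df], axis=1)
df = df.drop(columns=["Thyroid Function"])

print(df)
```

Each row now has exactly one 1 among the new columns, so no category is treated as larger or smaller than another.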

1.1.4 Split to Train and Test

(Code Block 1W) Separating the Dataset into Inputs and Target: This code block separates the dataset into two parts:
1. ‘Inputs’ contains all the feature columns except ‘Recurred.’
2. ‘Target’ contains the ‘Recurred’ column, which tells us whether a patient had recurring thyroid cancer or not.
(Code Block 1X) Splitting the Training and Testing Data: Splitting data into training and testing sets is important in machine learning. It helps to see how well your model works on new data. Watch this video to learn more about why we split datasets. We have provided the code to split the dataset into training and testing parts. Pay attention to how X and y look after this step, and the sizes of X_train, X_test, y_train, and y_test. For example, if you see the X_train shape as (306, 20), that means there are 306 samples (patients) in the training data, each with 20 features (e.g. Age, Gender, Smoking, etc.).
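
Here is a small sketch of separating inputs from the target and then splitting into training and testing sets. The ten rows are made up, and the 20% test fraction and the random_state value are assumptions; the notebook may use different settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, already-encoded data: 10 patients, 2 features, and a 0/1 target.
df = pd.DataFrame({
    "Age": [27, 62, 45, 51, 33, 70, 29, 58, 41, 66],
    "Stage": [0, 3, 1, 2, 0, 4, 0, 2, 1, 3],
    "Recurred": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

X = df.drop(columns=["Recurred"])   # inputs: every column except the target
y = df["Recurred"]                  # target: what we want to predict

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)   # (rows, features) for each part
print(y_train.shape, y_test.shape)
```

The first number in each shape is the number of patients in that part, and the second number in the X shapes is the number of features.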

Coding Tip:

Following the standard coding conventions, X is commonly written in uppercase, while y is usually in lowercase.

1.2 Train the Model

(Code Block 1Y) We have provided the code to make a Random Forest classifier. Run this code.
(Code Block 1Z) This code trains the classifier using the training data you gave it. Run this code (this is like pressing play to let the computer do its job and learn from the examples we’ve given it).
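
Creating and training a Random Forest classifier with scikit-learn can be sketched as below. The training data here is randomly generated for illustration, and the n_estimators and random_state values are assumptions rather than the notebook's confirmed settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up training data: the label is 1 whenever the first feature exceeds 0.5.
rng = np.random.default_rng(0)
X_train = rng.random((200, 3))
y_train = (X_train[:, 0] > 0.5).astype(int)

# Create the classifier (100 trees is scikit-learn's default).
model = RandomForestClassifier(n_estimators=100, random_state=0)

# Train it on the examples.
model.fit(X_train, y_train)

print(len(model.estimators_))   # number of decision trees in the forest
```

After fit finishes, model.estimators_ holds the individual decision trees that vote on each prediction.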

1.3 Evaluate the Model

(Code Block 1AA) In the first code block, we will use a Model Accuracy Score (called ‘model.score(X_test, y_test)’ in the code) to figure out how accurate the model was.
1. The Model Accuracy Score will be a number from 0 to 1, which shows the fraction of patients in the testing data who were correctly predicted to have thyroid cancer recurrence or not.
2. For example, a score of 0.6 means there was a 60% accuracy rate. In other words, 6 out of 10 patients’ thyroid cancer recurrence was classified correctly.
(Code Block 1BB) In this code block, simply print out the testing data. We will compare these values to the values in the next code block.
(Code Block 1CC) In this code block, we can see the model’s predictions by looking at something called ‘y_pred.’ When we print ‘y_pred’ (i.e. display it on the screen), we can see the model’s predictions for each patient in the testing dataset. Comparing these values to the values in Code Block 1BB, can you see where the model predicted thyroid cancer recurrence incorrectly?
1. For example, if your y_test was [0, 0, 1, 0, 1] and your y_pred was [0, 0, 0, 0, 1], we can see that the middle patient (the third one) was predicted incorrectly. The patient actually had thyroid cancer recurrence, but the model thought the patient did not. We will visualize our model’s accuracy more clearly in the next steps.
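
As a worked example of how an accuracy score is computed, the sketch below compares a five-patient set of true outcomes with a set of predictions in which only the third prediction is wrong. The arrays are illustrative; for a scikit-learn classifier, model.score(X_test, y_test) reports this same fraction of correct predictions.

```python
import numpy as np

y_test = np.array([0, 0, 1, 0, 1])   # what actually happened
y_pred = np.array([0, 0, 0, 0, 1])   # what the model predicted

# Positions where the prediction and the truth disagree.
wrong = np.flatnonzero(y_test != y_pred)
print(wrong)      # index 2 is the third patient (counting starts at 0)

# Accuracy = correct predictions / total predictions.
accuracy = (y_test == y_pred).mean()
print(accuracy)   # 4 of 5 correct
```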

1.4 Visualize the Model

(Code Block 1DD) When you run this code block, it will create a picture or visualization of one of the decision trees in our Random Forest model. The characteristics are arranged from the most important for determining whether a patient has recurring thyroid cancer at the top to the least important for determining recurring thyroid cancer at the bottom. At the top of each box in the decision tree, you will see a characteristic and a number, such as Stage <= 0.5. This example means that a patient whose Stage value is less than or equal to 0.5 follows the left branch of the tree; otherwise, the patient follows the right branch.
1. There are two values you can change:
  1. On line 2 of the code block, you will see a variable called ‘tree_number.’ You can change this number to be between 0 and 99 to view the different decision trees in the Random Forest.
  2. On line 5 of the code block, you will see a variable called ‘max_depth.’ You can change this number to view more or less of the decision tree.
2. Explore some of these decision trees. What thyroid characteristics were most important for predicting whether a patient has recurring thyroid cancer?
(Code Block 1EE) This code calculates and sorts feature importances from the Random Forest model, then creates a horizontal bar plot to visualize each feature’s importance. This helps identify which features most influence the model’s predictions.
1. Which three features were most important in predicting thyroid cancer recurrence, and which three were least important? Why do you think this might be the case?
(Code Block 1FF - Optional Advanced Analysis) This code is used to evaluate the performance of our model using the ROC (Receiver Operating Characteristic) curve and the AUC (Area Under the Curve) score. The ROC and AUC evaluate a model’s performance across all possible classification thresholds. Watch this video to learn more about ROC and AUC. How was your model’s performance?
1. AUC = 1: Perfect classifier. The model perfectly separates patients with and without thyroid cancer recurrence.
2. AUC = 0.5: The model is as good as random guessing.
3. AUC < 0.5: Worse than random guessing, which is usually a sign that the model is incorrectly predicting the classes.
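
The three kinds of output described in this section (a view of one tree, ranked feature importances, and an AUC score) can be sketched on made-up data as follows. The data is generated so that only the hypothetical ‘Response’ column determines the label, and a text rendering via export_text stands in for the notebook's tree picture; the notebook's plotting code is not reproduced here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import export_text

# Made-up data: only "Response" actually determines the label.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Age": rng.integers(20, 80, size=300),
    "Stage": rng.integers(0, 5, size=300),
    "Response": rng.integers(0, 4, size=300),
})
y = (X["Response"] >= 2).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Text view of one tree in the forest, limited to its first levels.
tree_number = 0
print(export_text(model.estimators_[tree_number],
                  feature_names=list(X.columns), max_depth=2))

# Feature importances, sorted from most to least important.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# AUC uses the predicted probability of class 1, not the 0/1 predictions.
probabilities = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, probabilities)
print(auc)
```

The importances always sum to 1, so each value can be read as that feature's share of the model's total decision-making.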

2. Process the Data Again with Fewer Features

Now, we are going to create another model and train it using only the top three features to see if we can achieve similar results. Testing and storing data can be expensive and time-consuming, and we want to focus on the key predictors to save the patients' time and not overburden them with unnecessary tests.

2.1 Preprocess the Dataset

Note: If at this point you are encountering errors, try clicking on the cell that you are currently working on, then go to ‘Runtime’ -> ‘Run before.’

2.1.1 Select the Top 3 Columns

(Code Block 2A) Consulting the graph from Code Block 1EE, add the top three characteristics for predicting thyroid cancer recurrence by inserting the names into the double brackets under the #TODO comment. Also include the ‘Recurred’ column. Run this code.
1. e.g. df = df[['Adenopathy', 'Response', 'Stage', 'Recurred']]
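
To show what the double brackets do, here is a minimal sketch on a made-up, already-encoded table. The column choice mirrors the example above and is not a claim about which features will rank highest in your own results.

```python
import pandas as pd

# Made-up, already-encoded rows for illustration only.
df = pd.DataFrame({
    "Age": [27, 62, 45],
    "Adenopathy": [0, 2, 1],
    "Stage": [0, 3, 1],
    "Response": [0, 3, 1],
    "Recurred": [0, 1, 0],
})

# The outer brackets index the DataFrame; the inner brackets are a Python list
# of column names. The result keeps only those columns, in the order listed.
df = df[["Adenopathy", "Response", "Stage", "Recurred"]]

print(df.columns.tolist())
```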

2.1.2 Split to Train and Test

(Code Block 2B) As in section 1.1.4, we will separate the dataset into inputs and targets.
(Code Block 2C) Again, like in section 1.1.4, we will separate the dataset into train and test.

2.2 Train the Model

(Code Block 2D) As in section 1.2, we have provided the code to make a Random Forest classifier. Run this code.
(Code Block 2E) Again, like in section 1.2, we will train the classifier using the new training data you gave it. Run this code.

2.3 Evaluate the Model

(Code Block 2F) As in section 1.3, we will use a Model Accuracy Score to determine the accuracy of the model. Is the new model that trained on fewer features as accurate as the original one that trained on more features?
(Code Block 2G) In this code block, simply print out the testing data. We will compare these values to the values in the next code block.
(Code Block 2H) In this code block, we can see the model’s predictions by looking at y_pred. Comparing these values to the values in Code Block 2G, can you see where the model predicted thyroid cancer recurrence incorrectly?

2.4 Visualize the Model

(Code Block 2I) When you run this code block, it will create a picture or visualization of one of the decision trees in our Random Forest model.
1. There are two values you can change:
  1. On line 2 of the code block, you will see a variable called ‘tree_number.’ You can change this number to be between 0 and 99 to view the different decision trees in the Random Forest.
  2. On line 5 of the code block, you will see a variable called ‘max_depth.’ You can change this number to view more or less of the decision tree.
2. Explore some of these decision trees. What thyroid characteristic was most important for predicting whether a patient has recurring thyroid cancer?
(Code Block 2J) This code calculates and sorts feature importances from the Random Forest model, then creates a horizontal bar plot to visualize each feature’s importance.
1. What are the most and least important characteristics now? Is it different from your feature importance graph from Code Block 1EE? Why do you think so?
(Code Block 2K - Optional Advanced Analysis) This code is used to evaluate the performance of our model using the ROC (Receiver Operating Characteristic) curve and the AUC (Area Under the Curve) score. How is your model’s performance now? Is it comparable to when we made the model with all of the features?
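
The comparison at the heart of this section, a model trained on every feature versus one trained on a chosen subset, can be sketched as below. The data is randomly generated with three informative columns and three noise columns, and the helper name holdout_accuracy is hypothetical; results on the real dataset will differ.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Made-up data: three informative columns and three pure-noise columns.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.random((400, 6)),
                 columns=["a", "b", "c", "noise1", "noise2", "noise3"])
y = ((X["a"] + X["b"] + X["c"]) > 1.5).astype(int)


def holdout_accuracy(columns):
    """Train a Random Forest on the given columns and return test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X[columns], y, test_size=0.2, random_state=0
    )
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


full_score = holdout_accuracy(list(X.columns))
top3_score = holdout_accuracy(["a", "b", "c"])

print(f"all features: {full_score:.2f}")
print(f"top three:    {top3_score:.2f}")
```

Because the same random_state is used for both splits, both models are tested on the same patients, which keeps the comparison fair.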

3. Experiment with different combinations of features!

1. What happens to the model’s accuracy when you select the bottom three features?
2. What happens to the model’s accuracy when you randomly choose features?
3. What happens to the model’s accuracy when you only choose one feature? Two features?
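
One way to run these experiments systematically is to loop over feature combinations and record each model's accuracy. The sketch below does this for every one-feature and two-feature combination of a made-up dataset in which only the hypothetical ‘signal’ column matters; the column names and settings are assumptions for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Made-up data: "signal" determines the label; the other columns are noise.
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.random((300, 4)),
                 columns=["signal", "noise1", "noise2", "noise3"])
y = (X["signal"] > 0.5).astype(int)

results = {}
for size in (1, 2):
    for columns in combinations(X.columns, size):
        X_train, X_test, y_train, y_test = train_test_split(
            X[list(columns)], y, test_size=0.2, random_state=0
        )
        model = RandomForestClassifier(n_estimators=50, random_state=0)
        model.fit(X_train, y_train)
        results[columns] = model.score(X_test, y_test)

# Print the combinations from most to least accurate.
for columns, score in sorted(results.items(), key=lambda item: -item[1]):
    print(columns, round(score, 2))
```

With the real dataset, you would replace the made-up columns with your encoded feature names and keep ‘Recurred’ as the target.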

Ask an Expert

Do you have specific questions about your science project? Our team of volunteer scientists can help. Our Experts won't do the work for you, but they will make suggestions, offer guidance, and help you troubleshoot.

Global Goals

The United Nations Sustainable Development Goals (UNSDGs) are a blueprint to achieve a better and more sustainable future for all.

This project explores topics key to Good Health and Well-Being: Ensure healthy lives and promote well-being for all at all ages.

Variations

Replace the Random Forest classifier with another machine learning algorithm to see if one algorithm is better than another! (Try SVM or KNN!)
Use techniques like SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic examples of underrepresented classes in the dataset (for example, relatively few patients have Hurthle cell carcinoma).
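
For the first variation, swapping in a different algorithm mostly means changing one line, because scikit-learn classifiers share the same fit and score methods. The sketch below compares three classifiers on made-up data; the hyperparameters shown are defaults or arbitrary choices, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Made-up data for illustration only.
rng = np.random.default_rng(3)
X = rng.random((300, 3))
y = ((X[:, 0] + X[:, 1]) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# All three classes share the same fit/score interface.
models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(name, round(scores[name], 2))
```

Note that SVM and KNN are sensitive to feature scale, so on the real dataset you may want to research feature scaling before comparing them with the Random Forest.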

Careers

If you like this project, you might enjoy exploring these related careers:

Data Scientist

Career Profile

Many aspects of people's daily lives can be summarized using data, from what is the most popular new video game to where people like to go for a summer vacation. Data scientists (sometimes called data analysts) are experts at organizing and analyzing large sets of data (often called "big data"). By doing this, data scientists make conclusions that help other people or companies. For example, data scientists could help a video game company make a more profitable video game based on players'… Read more

Biochemical Engineer

Career Profile

A nice cool yogurt is the perfect snack. It comes in a variety of delicious flavors like peach, chocolate, and cherry and contains calcium, vitamins, and minerals that are good for you. Yogurt also contains live cultures that your body needs to maintain good health. How did all of those good things get into your yogurt? The answer is that a biochemical engineer helped to develop a recipe to make that yogurt a perfect snack for you. So many of the products that we use every day, from medicine… Read more

Bioinformatics Scientist

Career Profile

The human body can be viewed as a machine made up of complex processes. Scientists are working on figuring out how these processes work and on sequencing and correlating the sections of the genome that correspond to the individual processes. (The genome is an organism's complete set of genetic material.) In the course of doing so, they generate large amounts of data. So large, in fact, that to make sense of it, the data must be organized into databases and labeled. This is where bioinformatics… Read more

Pathologist

Career Profile

Do you enjoy solving mysteries? Getting to the end of a "who did it" mystery novel can be lots of fun! But are there mysteries in real life? You bet there are! A pathologist is a medical detective, and their job is to figure out the root cause of real-life medical puzzles. Pathologists work in a wide range of fields and can help diagnose types of cancer, find out what killed a person, and investigate how disease progresses on a molecular level. If you enjoy employing cool logic to solve… Read more

Epidemiologist

Career Profile

Do you like a good mystery? Well, an epidemiologist's job is all about solving mysteries—medical mysteries—but instead of figuring out "who done it" like a police detective would, they figure out "what caused it." They find relationships between a medical condition and things like human behavior, environmental toxins, genes, medical treatments, other diseases, and geographical location. For example, they ask questions like what causes multiple sclerosis? How can we prevent brain… Read more

Cite This Page

General citation information is provided here. Be sure to check the formatting, including capitalization, for the method you are using and update your citation, as needed.

MLA Style

Ngo, Tracey. "Predicting Thyroid Cancer Recurrence with Machine Learning." Science Buddies, 7 Oct. 2025, https://www.sciencebuddies.org/science-fair-projects/project-ideas/ArtificialIntelligence_p020/artificial-intelligence/thyroid_cancer. Accessed 17 July 2026.

APA Style

Ngo, T. (2025, October 7). Predicting Thyroid Cancer Recurrence with Machine Learning. Retrieved from https://www.sciencebuddies.org/science-fair-projects/project-ideas/ArtificialIntelligence_p020/artificial-intelligence/thyroid_cancer

Last edit date: 2025-10-07
