Abstract
For cancer patients, remission–a period when the signs and symptoms of cancer are reduced or disappear–brings immense relief, but there is often a chance of recurrence, or the cancer coming back. Have you ever wondered how doctors can predict if cancer might come back in some patients? Thyroid cancer, a type of cancer affecting the thyroid gland, has a recurrence rate of about 5-30%. Depending on many factors, some patients may have a higher chance of thyroid cancer recurrence than others. In this project, you will prepare and encode the data for our Random Forest model and then use the model to determine the top three factors indicating thyroid cancer recurrence. You will then train another model using only the top three factors to see if the top three alone can achieve similar accuracy. What factors do you think will be most predictive of thyroid cancer recurrence?
Summary
None
Readily available
No issues
Additional subject matter expertise provided by Laura Ohl, PhD, Science Buddies.
Objective
Preprocess thyroid cancer risk factor data to create a Random Forest model that can predict whether a patient will have thyroid cancer recurrence.
Introduction
Cancer can return even after successful treatment, making regular follow-up appointments crucial for early detection of recurrence. Early intervention can significantly improve outcomes if the cancer returns. This is particularly important for thyroid cancer, a type of cancer that originates in the thyroid gland, a small butterfly-shaped organ located at the base of the neck. The thyroid plays a vital role in regulating the body’s metabolism, heart rate, and temperature by producing essential hormones.
Thyroid cancer can develop in various forms, each with different levels of aggressiveness and prognosis. Several factors can increase the risk of recurrence, such as the patient’s age at the time of diagnosis, the initial stage of the cancer, the specific type of thyroid cancer, and much more.
Thyroid cancer is one of the more common endocrine cancers, affecting about 1 in 200 people during their lifetime. Although the prognosis for thyroid cancer is generally favorable, especially in cases detected early, the possibility of recurrence emphasizes the importance of ongoing monitoring and follow-up care.
However, testing for thyroid cancer recurrence requires careful consideration to ensure that patients in remission–meaning the absence of detectable cancer following treatment–receive the most appropriate follow-up care without overburdening them with unnecessary procedures. Regular follow-up testing is crucial for early detection of recurrence (when cancer returns after a period of remission), but it is equally important to choose the right tests and determine the appropriate intervals between them. This approach not only improves patient outcomes by catching recurrences early but also helps in tailoring follow-up care to the specific needs of each patient. To reduce unnecessary testing and focus on those most likely to benefit, it is essential to identify the factors most predictive of thyroid cancer recurrence. This is where machine learning and Random Forests come in. Machine learning models offer a robust method for predicting outcomes, potentially guiding more personalized and effective follow-up care.
Artificial Intelligence (AI) is a branch of computer science focused on the creation of tools that can solve problems and analyze information. Machine learning is a subdivision of AI. Its goal is to create tools that can learn and improve over time using data. In this project, we will dive into Decision Trees, one type of machine learning algorithm. Decision Trees mimic the human decision-making process by breaking down a problem into a series of sequential questions or decisions. Decision Trees often serve as powerful tools for classifying data and solving various problems.
Watch this video to learn more about Decision Trees. We recommend watching the video from 0:18 to 9:15:
However, Decision Trees have some drawbacks, like overfitting (being too specific to training data) and being sensitive to small changes in the data. To address these problems, we can use Random Forests, which combine several Decision Trees to enhance prediction accuracy. Random Forests help reduce overfitting and improve model stability, making them a great option for predicting whether or not a patient will have recurring thyroid cancer.
Watch this video to learn more about Random Forests:
In this project, your task is to preprocess, or prepare, the data for use in our Random Forest model. Then, determine the top three factors indicating thyroid cancer recurrence and run the model again to see if these factors achieve similar accuracy.
Disclaimer: This information is provided for educational purposes only and is not intended as medical advice.
Terms and Concepts
- Recurrence
- Thyroid cancer
- Thyroid gland
- Remission
- Artificial Intelligence (AI)
- Machine learning
- Decision Tree
- Overfit
- Random Forest
- Preprocess
- Categorical variable
- Label encoding
- One-hot encoding
- Model Accuracy Score
- Receiver Operating Characteristic (ROC)
- Area Under the Curve (AUC)
Questions
- What are some factors that may increase the risk of thyroid cancer recurrence?
- How might the use of machine learning models like Random Forests improve the personalization of follow-up care for thyroid cancer patients?
- What are some drawbacks of using Decision Trees, and how do Random Forests address these issues?
- Why might it be beneficial to focus on the top three factors for predicting thyroid cancer recurrence instead of using all available tests?
- What are potential ethical considerations when using machine learning in healthcare?
Bibliography
You can access the dataset here:
- Borzooei, S. & Tarokhian, A. (2023). Differentiated Thyroid Cancer Recurrence [Dataset]. UCI Machine Learning Repository.
To learn more about Decision Trees and Random Forests:
- Normalized Nerd. (2021, April). Random Forest Algorithm Clearly Explained!. YouTube. Retrieved August 19, 2024.
- StatQuest with Josh Starmer. (2021, April). Decision and Classification Trees, Clearly Explained!!!. YouTube. Retrieved August 19, 2024.
- StatQuest with Josh Starmer. (2018, February). StatQuest: Random Forests Part 1 - Building, Using and Evaluating. YouTube. Retrieved August 19, 2024.
- StatQuest with Josh Starmer. (2020, January). StatQuest: Random Forests Part 2: Missing data and clustering. YouTube. Retrieved August 19, 2024.
More on Thyroid Cancer:
- Cancer Research UK. (n.d.). Number stages for papillary and follicular thyroid cancer. Retrieved August 19, 2024.
- Cancer Research UK. (n.d.). TNM staging for thyroid cancer. Retrieved August 19, 2024.
- Na'ara S., Amit M., Fridman E., Gil Z. (2016, January). Contemporary Management of Recurrent Nodal Disease in Differentiated Thyroid Carcinoma. PubMed Central. Retrieved August 19, 2024.
To learn more about One-Hot Encoding and Label Encoding:
- StatQuest with Josh Starmer. (2023, February). One-Hot, Label, Target and K-Fold Target Encoding, Clearly Explained!!!. YouTube. Retrieved August 19, 2024.
- Turp, M. (2023, February). Quick explanation: One-hot encoding. YouTube. Retrieved August 19, 2024.
To learn more about train and test split:
- Galaxy Inferno Codes. (2022, June). Validation data: How it works and why you need it - Machine Learning Basics Explained. YouTube. Retrieved August 19, 2024.
- Turp, M. (2023, February). Why do we split data into train test and validation sets?. YouTube. Retrieved August 19, 2024.
To learn more about ROC and AUC:
- DATAtab. (2023, March). ROC Curve and AUC Value. YouTube. Retrieved August 19, 2024.
- StatQuest with Josh Starmer. (2019, July). ROC and AUC, Clearly Explained!. YouTube. Retrieved August 19, 2024.
Materials and Equipment
- Computer with Internet access
Experimental Procedure

Overview
In this project, you will work with a dataset containing information about thyroid cancer patients, including details of their diagnoses, patient characteristics, test results, and whether their cancer recurred. You will preprocess the data using label encoding and one-hot encoding (both of which will be explained later) to prepare it for analysis with a Random Forest model. Your task will also include identifying the most significant factors in predicting thyroid cancer recurrence and evaluating whether using only the top three factors can still yield high predictive accuracy. You will also explore how different combinations of factors impact the model's accuracy.
Setting Up the Google Colab Environment
-
You will need a Google account. If you do not have one, make one when prompted.
-
Download the thyroid_cancer_recurrence.ipynb file from Science Buddies. This is the code you will need to process your data.
-
Download the thyroid_data.csv file from Science Buddies.
-
Open the dataset to familiarize yourself with its structure by going to your Downloads and double-clicking on the file.
-
You will see that the very last column (column P) is called ‘Recurred,’ and that is what we will be trying to predict. This column shows whether or not that patient had thyroid cancer recurrence after 15 years.
-
In columns A-O, you will see other variables such as ‘Age’ and ‘Gender’, as well as the ‘Stage’ at which their cancer was diagnosed, how they responded to treatment, and more. Each of these factors will be described in more detail as you preprocess the data.
-
Within your Google Drive, click on ‘My Drive,’ then create a new folder and rename it “Thyroid Cancer Recurrence.” Inside the folder, upload both the thyroid_cancer_recurrence.ipynb file and the thyroid_data.csv file.
-
Double-click on the thyroid_cancer_recurrence.ipynb file. This should automatically open in Google Colab.
-
Read the Troubleshooting Tips and How to Use This Notebook sections. Follow the instructions you find in those sections.
-
Run the block under Importing Libraries to ensure you have access to all the functions we will use for this project.
1. Process the Data with All Features
1.1 Loading the Data into a Pandas DataFrame
-
(Code Block 1A) Run this code block to make the files on your Google Drive available to use in the notebook.
-
(Code Block 1B) Run this code block to create a DataFrame, which is like a table that will be used to load and manipulate the data in the notebook. You will see the data from your .csv file populate in a table below the code block.
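As a rough illustration of what loading a .csv file into a DataFrame looks like, here is a minimal sketch. It is not the notebook's own code: the three rows are invented, and an in-memory buffer stands in for the real thyroid_data.csv file so that the example runs anywhere.

```python
import io

import pandas as pd

# Stand-in for thyroid_data.csv: three made-up rows using a few of the
# column names mentioned in this project (the values are invented).
csv_text = """Age,Gender,Smoking,Recurred
27,F,No,No
62,M,Yes,Yes
41,F,No,No
"""

# In the notebook, the argument would be the path to the file on Google
# Drive rather than an in-memory buffer.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # the first rows of the table
print(df.shape)    # (number of rows, number of columns)
```

Checking the shape right after loading is a quick way to confirm that every patient and every column arrived as expected.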
1.1.1 Label Encoding
When working with machine learning algorithms, it is important to convert categorical variables into numbers, since most algorithms operate on numbers rather than words. Categorical variables represent qualities like gender or group, and they cannot be used directly because they lack a numerical form that algorithms can work with. Label encoding converts categorical variables into non-negative whole numbers, assigning each category a unique integer value (0, 1, 2, etc.).
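To make the idea concrete, here is a small sketch of label encoding for yes/no columns. It assumes a pandas DataFrame and uses a dictionary with `Series.map`; the notebook may do this differently, and the three rows are invented.

```python
import pandas as pd

# Toy data with two yes/no columns (values are invented).
df = pd.DataFrame({
    "Smoking": ["No", "Yes", "No"],
    "Recurred": ["No", "Yes", "Yes"],
})

# Columns whose values are only "Yes" or "No".
yes_no_columns = ["Smoking", "Recurred"]

# Replace each word with a number: 0 for "No", 1 for "Yes".
for column in yes_no_columns:
    df[column] = df[column].map({"No": 0, "Yes": 1})

print(df)
```

After this step, both columns hold integers, which is the form the model can work with.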
-
(Code Block 1C) In this code block, you will specify the column names (features) from your .csv file that have “Yes” or “No” values: ‘Smoking,’ ‘Hx Smoking,’ ‘Hx Radiotherapy,’ and ‘Recurred.’ Inside the blue brackets under the comment #TODO, type in these column names. We have included the first column name for you. Be sure to use quotes around each column name and separate them using commas. Run the code and then check the output to see how the yes/no values were replaced with 1/0.
-
Note: By common coding conventions, 0 typically means ‘No’ and 1 typically means ‘Yes.’
-
If you get an error when you run the code block, double-check that you don’t have any typos and have used commas and quotes correctly.
-
Here is an example of how the code should look.
-

Figure 1. Example of how Code Block 1C should look after inputting the features that have "Yes" or "No" factors.
-
(Code Block 1D) In this code block, you will input the column names that have only two distinct values: ‘Gender,’ ‘Focality,’ and ‘M.’ Since these columns only contain two different values, it does not significantly impact our algorithm which value is assigned 0 or 1. Type the column names in the brackets under the #TODO comment, and check the output table to see how it has changed.
-
(Code Block 1E) In this code block, we have provided the code to find the unique values (distinct types in the data) in the ‘Physical Examination’ column. You will use this same code again later to find unique values for other columns as well. When you run this code, you will see five different types of thyroid conditions or goiters (abnormal enlargements of the thyroid gland) in the output. You will rank these from least to most likely to indicate recurrence in the next code block.
-
(Code Block 1F) In this code block, you will map each ‘Physical Examination’ condition to a number, ranking the conditions from least to most severe. Inside the curly braces (they look like this: { }), map each condition to a number (e.g. {‘Normal’: 0, ‘Diffuse goiter’: 1…}). Note that since left and right single nodular goiters have the same severity, you can assign them the same number. The conditions are listed below from least to most severe:
-
Normal: A normal thyroid gland indicates no visible abnormalities or nodules.
-
Diffuse goiter: A smooth thyroid gland with some swelling around the gland.
-
Single nodular goiter-left: A solitary nodule located on the left side of the thyroid gland. (This should be assigned the same value as Single Nodular Goiter Right.)
-
Single nodular goiter-right: A solitary nodule located on the right side of the thyroid gland. (This should be assigned the same value as Single Nodular Goiter Left.)
-
Multinodular goiter: The presence of multiple nodules within the thyroid gland.
-
(Code Block 1G) In this empty code block, refer to Code Block 1E, and use the code provided there as a model to help you write the code to find the unique values of the ‘Adenopathy’ column.
-
(Code Block 1H) In this code block, you will map each ‘Adenopathy’ condition to a number. Adenopathy is the inflammation of glandular tissue or lymph nodes. Rank each condition from least to most severe. Inside the curly braces, map each condition to a number, using the information below as a guide.
-
No: No enlarged lymph nodes present.
-
Right/Left: Enlarged lymph nodes on the right/left side of the neck.
-
Posterior: Enlarged lymph nodes in the posterior neck region.
-
Bilateral: Enlarged lymph nodes on both sides of the neck.
-
Extensive: Enlarged lymph nodes spread widely across multiple regions of the neck.
- (Code Block 1I) In this empty code block, refer to Code Block 1E and use the code provided there as a model to help you write the code to find the unique values of the ‘Pathology’ column.
- (Code Block 1J) In this code block, you will map each ‘Pathology’ condition to a number. Rank each condition from least to most likely to cause thyroid cancer recurrence. Inside the curly braces, map each condition to a number, using the information below as a guide.
-
Micropapillary: A tumor with a histologic pattern of papillary cell groups, with a size of 1 cm or less. Micropapillary thyroid carcinomas generally have an excellent prognosis and a low risk of recurrence or metastasis.
-
Papillary: An irregular solid tumor mass with a histological pattern of malignant epithelial cells and follicular cell differentiation. Papillary thyroid carcinoma (PTC) is the most common type of thyroid cancer.
-
Follicular: A tumor containing malignant epithelial cells with invasive properties seen by histology. Follicular thyroid carcinoma (FTC) is less common than PTC but tends to be more aggressive.
-
Hurthle cell: A tumor that contains greater than 75% malignant Hurthle cells. Hurthle cell carcinoma (HCC) is a rare and lesser-known type of thyroid cancer.
- (Code Block 1K) In this empty code block, refer to Code Block 1E and use the code provided there as a model to help you write the code to find the unique values of the ‘T’ column.
- (Code Block 1L) In this code block, you will map each ‘T’ condition to a number. Rank each condition from least to most severe. Inside the curly braces, map each condition to a number. For this feature, rather than using the order we give you, research what ‘T’ stands for and learn about the severity of each state. Here is one resource to help you get started.
- (Code Block 1M) In this empty code block, refer to Code Block 1E and use the code provided there as a model to help you write the code to find the unique values of the ‘N’ column.
- (Code Block 1N) In this code block, you will map each ‘N’ condition to a number. Rank each condition from least to most severe. Inside the curly braces, map each condition to a number. Do some research to learn what ‘N’ stands for and find the severity of each state. Here is one resource to help you get started.
- (Code Block 1O) In this empty code block, refer to Code Block 1E and use the code provided there as a model to help you write the code to find the unique values of the ‘Stage’ column.
- (Code Block 1P) In this code block, you will map each ‘Stage’ condition to a number. Rank each condition from least to most severe. Inside the curly braces, map each condition to a number. Do some research to learn what ‘Stage’ means and find the severity of each state. Here is one resource to help you get started. Note that these features use Roman numerals (I=1, II=2, etc.).
- (Code Block 1Q) In this empty code block, refer to Code Block 1E and use the code provided there as a model to help you write the code to find the unique values of the ‘Response’ column.
- (Code Block 1R) In this code block, you will map each ‘Response’ condition to a number. Rank each condition from least to most severe. Inside the curly braces, map each condition to a number, using the information below as a guide.
- Excellent: No indicators of detectable disease after treatment.
- Indeterminate: An unclear result for the classification of a tumor (benign or malignant) after treatment.
- Biochemical Incomplete: A lack of full response to treatment seen by continued thyroid dysfunction such as elevated thyroglobulin levels.
- Structural Incomplete: A lack of full response to treatment seen by persistent tumor size, tumor growth, and/or lymph node swelling.
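The two-step pattern used throughout this section (list the unique values, then map each one to a rank) can be sketched as follows. This is an illustration rather than the notebook's code: the rows are invented, the category spellings are copied from the ‘Physical Examination’ list above and may differ slightly in the real file, and the ranking follows the order given for Code Block 1F.

```python
import pandas as pd

# Toy column containing the five 'Physical Examination' categories
# (spelling assumed from the list in the text).
df = pd.DataFrame({
    "Physical Examination": [
        "Normal",
        "Multinodular goiter",
        "Single nodular goiter-left",
        "Diffuse goiter",
        "Single nodular goiter-right",
        "Normal",
    ]
})

# Step 1: list the distinct categories so none is missed in the mapping.
print(df["Physical Examination"].unique())

# Step 2: map each category to its rank, from least (0) to most severe.
# Left and right single nodular goiters share the same rank.
severity = {
    "Normal": 0,
    "Diffuse goiter": 1,
    "Single nodular goiter-left": 2,
    "Single nodular goiter-right": 2,
    "Multinodular goiter": 3,
}
df["Physical Examination"] = df["Physical Examination"].map(severity)

# A category missing from the dictionary (for example, because of a typo)
# turns into NaN, so it is worth checking that none slipped through.
assert df["Physical Examination"].notna().all()
print(df)
```

The same pattern applies to the other ranked columns; only the column name and the dictionary change.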
1.1.3 One-Hot Encoding
Sometimes, it makes more sense to use one-hot encoding instead of label encoding. If categorical data does not have an inherent order, using label encoding can mislead the model into assuming a false ordinal relationship. For example, categories with a natural order, such as “Low,” “Intermediate,” and “High,” map sensibly to 0, 1, and 2, just as the severity rankings in our label encoding earlier did. However, in our ‘Thyroid Function’ variable, hyperthyroidism is not necessarily better or worse than hypothyroidism. Therefore, one-hot encoding makes more sense for this variable. Check out this video for a quick explanation of one-hot encoding.
-
(Code Block 1S) This code block takes the ‘Thyroid Function’ column, applies one-hot encoding to it, and returns a NumPy array (a multi-dimensional array used for numerical data processing) where each original category is represented by a binary vector (a sequence of numbers consisting only of 0s and 1s). Run this code block.
-
(Code Block 1T) This code block creates a new DataFrame with the new one-hot encoded columns. Run this code block.
- (Code Block 1U) This code block adds the one-hot encoded features to the original DataFrame. Run this code block.
- (Code Block 1V) This code block removes the original ‘Thyroid Function’ column. Run this code block.
1.1.4 Split to Train and Test
- (Code Block 1W) Separating the Dataset into Inputs and Target: This code block separates the dataset into two parts:
- ‘Inputs’ contains all the feature columns except ‘Recurred.’
- ‘Target’ contains the ‘Recurred’ column, which tells us whether a patient had recurring thyroid cancer or not.
- (Code Block 1X) Splitting the Training and Testing Data: Splitting data into training and testing sets is important in machine learning. It helps to see how well your model works on new data. Watch this video to learn more about why we split datasets. We have provided the code to split the dataset into training and testing parts. Pay attention to how X and y look after this step, and the sizes of X_train, X_test, y_train, and y_test. For example, if you see the X_train shape as (306, 20), that means there are 306 samples (patients) in the training data, each with 20 features (e.g. Age, Gender, Smoking, etc.).
Coding Tip:
Following the standard coding conventions, X is commonly written in uppercase, while y is usually in lowercase.
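Here is a minimal sketch of separating inputs from the target and then splitting into training and testing sets. It assumes scikit-learn's `train_test_split` with an 80/20 split and a fixed `random_state`; both values are assumptions for illustration, and the ten rows are invented.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up, already-encoded patients.
df = pd.DataFrame({
    "Age": [27, 62, 41, 35, 58, 49, 30, 71, 44, 52],
    "Stage": [0, 1, 0, 0, 2, 1, 0, 3, 0, 1],
    "Recurred": [0, 1, 0, 0, 1, 1, 0, 1, 0, 0],
})

# Inputs: every column except the one we want to predict.
X = df.drop(columns=["Recurred"])
# Target: the column we want to predict.
y = df["Recurred"]

# Hold out 20% of the patients for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
print(y_train.shape, y_test.shape)  # (8,) (2,)
```

The first number in each shape is the number of patients and the second is the number of features, which is how a shape such as (306, 20) should be read.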
1.2 Train the Model
- (Code Block 1Y) We have provided the code to make a Random Forest classifier. Run this code.
- (Code Block 1Z) This code trains the classifier using the training data you gave it. Run this code (this is like pressing play to let the computer do its job and learn from the examples we’ve given it).
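Creating and training a Random Forest can be sketched in a few lines. This example assumes scikit-learn's `RandomForestClassifier` with 100 trees and uses synthetic data from `make_classification` in place of the thyroid dataset; the parameter values are assumptions, not necessarily those in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded patient data: 200 samples, 6 features.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Create the classifier: a forest of 100 decision trees.
model = RandomForestClassifier(n_estimators=100, random_state=0)

# Train it: each tree learns from a random sample of the training data.
model.fit(X_train, y_train)

print(len(model.estimators_))  # number of trees in the trained forest
```

Fixing `random_state` makes the result repeatable, which helps when comparing one model against another.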
1.3 Evaluate the Model
- (Code Block 1AA) In this code block, we will use a Model Accuracy Score (called ‘model.score(X_test, y_test)’ in the code) to figure out how accurate the model was.
- The Model Accuracy Score will be a number from 0 to 1, which shows the fraction of patients in the testing data whose thyroid cancer recurrence status was predicted correctly.
- For example, a score of 0.6 means there was a 60% accuracy rate. In other words, 6 out of 10 patients’ thyroid cancer recurrence was classified correctly.
- (Code Block 1BB) In this code block, simply print out the testing data. We will compare these values to the values in the next code block.
- (Code Block 1CC) In this code block, we can see the model’s predictions by looking at something called ‘y_pred.’ When we print ‘y_pred’ (i.e. display it on the screen), we can see the model’s predictions for each patient in the testing dataset. Comparing these values to the values in Code Block 1BB, can you see where the model predicted thyroid cancer recurrence incorrectly?
- For example, if your y_test was [0, 0, 1, 0, 1] and your y_pred was [0, 0, 0, 0, 1], we can see that the middle patient (the third one) was predicted incorrectly. The patient actually had thyroid cancer recurrence, but the model thought the patient did not. We will visualize our model’s accuracy more clearly in the next steps.
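The accuracy score is simply the fraction of matching entries between the true labels and the predictions. The short sketch below works this out by hand for a five-patient example in which only the third patient is misclassified; the arrays are illustrative and NumPy is assumed.

```python
import numpy as np

# True outcomes and model predictions for five patients (illustrative).
y_test = np.array([0, 0, 1, 0, 1])
y_pred = np.array([0, 0, 0, 0, 1])

# True where the prediction matches the real outcome.
correct = y_test == y_pred

# Accuracy = number of correct predictions / number of patients = 4 / 5.
accuracy = correct.mean()

# Positions (counting from 0) where the model was wrong.
wrong_positions = np.flatnonzero(~correct)

print(accuracy)         # 0.8
print(wrong_positions)  # [2]
```

Position 2 is the third patient because Python counts from 0. This is the same quantity that model.score(X_test, y_test) reports for a classifier.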
1.4 Visualize the Model
- (Code Block 1DD) When you run this code block, it will create a picture or visualization of one of the decision trees in our Random Forest model. The characteristics are arranged from the most important for determining whether a patient has recurring thyroid cancer at the top to the least important for determining recurring thyroid cancer at the bottom. At the top of each box in the decision tree, you will see a characteristic and a number, such as Stage <= 0.5. This example means that if the Stage is less than or equal to 0.5, it will go to the left side of the tree; otherwise, it will go to the right side of the decision tree.
- There are two values you can change:
- On line 2 of the code block, you will see a variable called ‘tree_number.’ You can change this number to be between 0 and 99 to view the different decision trees in the Random Forest.
- On line 5 of the code block, you will see a variable called ‘max_depth.’ You can change this number to view more or less of the decision tree.
- Explore some of these decision trees. What thyroid characteristics were most important for predicting whether a patient has recurring thyroid cancer?
- (Code Block 1EE) This code calculates and sorts feature importances from the Random Forest model, then creates a horizontal bar plot to visualize each feature’s importance. This helps identify which features most influence the model’s predictions.
- What were the three most important and the three least important features for predicting thyroid cancer recurrence? Why do you think this might be the case?
- (Code Block 1FF - Optional Advanced Analysis) This code is used to evaluate the performance of our model using the ROC (Receiver Operating Characteristic) curve and the AUC (Area Under the Curve) score. The ROC and AUC evaluate a model’s performance across all possible classification thresholds. Watch this video to learn more about ROC and AUC. How was your model’s performance?
- AUC = 1: Perfect classifier. The model perfectly separates patients with and without thyroid cancer recurrence.
- AUC = 0.5: The model is as good as random guessing.
- AUC < 0.5: Worse than random guessing, which is usually a sign that the model is incorrectly predicting the classes.
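For readers who want to see how an AUC score is computed, here is a hedged sketch using scikit-learn's `roc_curve` and `roc_auc_score`. Synthetic data from `make_classification` stands in for the thyroid dataset, and the model settings are assumptions rather than the notebook's exact values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded patient data.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# The ROC curve needs a score for each patient, not just a 0/1 answer,
# so use the predicted probability of class 1 ("recurred").
y_score = model.predict_proba(X_test)[:, 1]

# False positive rate and true positive rate at each threshold.
fpr, tpr, thresholds = roc_curve(y_test, y_score)

# Area under that curve: 1 is perfect, 0.5 is random guessing.
auc = roc_auc_score(y_test, y_score)
print(round(auc, 3))
```

Plotting `fpr` against `tpr` gives the ROC curve itself; the closer the curve hugs the top-left corner, the larger the AUC.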
2. Process the Data Again with Fewer Features
Now, we are going to create another model and train it using only the top three features to see if we can achieve similar results. Testing and storing data can be expensive and time-consuming, and we want to focus on the key predictors to save the patients' time and not overburden them with unnecessary tests.
2.1 Preprocess the Dataset
Note: If at this point you are encountering errors, try clicking on the cell that you are currently working on, then go to ‘Runtime’ -> ‘Run before.’
2.1.1 Select the Top 3 Columns
- (Code Block 2A) Consulting the graph from Code Block 1EE, add the top three characteristics for predicting thyroid cancer recurrence by inserting the names into the double brackets under the #TODO comment. Also include the ‘Recurred’ column. Run this code.
- e.g. df = df[['Adenopathy', 'Response', 'Stage', 'Recurred']]
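Instead of reading the top three features off the bar plot, they can also be picked out in code. The sketch below is one way to do it, assuming a trained scikit-learn Random Forest and its `feature_importances_` attribute; the data is synthetic and the column names `feature_0` to `feature_5` are hypothetical placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the encoded dataset, with placeholder names.
X_array, y = make_classification(n_samples=200, n_features=6, random_state=0)
feature_names = [f"feature_{i}" for i in range(6)]
df = pd.DataFrame(X_array, columns=feature_names)
df["Recurred"] = y

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(df[feature_names], df["Recurred"])

# Pair each importance with its column name, then sort from high to low.
importances = pd.Series(model.feature_importances_, index=feature_names)
top_three = importances.sort_values(ascending=False).head(3).index.tolist()

# Keep only the top three features plus the target column.
df_top = df[top_three + ["Recurred"]]

print(top_three)
print(df_top.shape)  # (200, 4)
```

The importances sum to 1, so each value can be read as that feature's share of the forest's overall decision-making.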
2.1.2 Split to Train and Test
- (Code Block 2B) As in section 1.1.4, we will separate the dataset into inputs and targets.
- (Code Block 2C) Again, like in section 1.1.4, we will separate the dataset into train and test.
2.2 Train the Model
- (Code Block 2D) As in section 1.2, we have provided the code to make a Random Forest classifier. Run this code.
- (Code Block 2E) Again, like in section 1.2, we will train the classifier using the new training data you gave it. Run this code.
2.3 Evaluate the Model
- (Code Block 2F) As in section 1.3, we will use a Model Accuracy Score to determine the accuracy of the model. Is the new model, trained on fewer features, as accurate as the original one trained on more features?
- (Code Block 2G) In this code block, simply print out the testing data. We will compare these values to the values in the next code block.
- (Code Block 2H) In this code block, we can see the model’s predictions by looking at y_pred. Comparing these values to the values in Code Block 2G, can you see where the model predicted thyroid cancer recurrence incorrectly?
2.4 Visualize the Model
- (Code Block 2I) When you run this code block, it will create a picture or visualization of one of the decision trees in our Random Forest model.
- There are two values you can change:
- On line 2 of the code block, you will see a variable called ‘tree_number.’ You can change this number to be between 0 and 99 to view the different decision trees in the Random Forest.
- On line 5 of the code block, you will see a variable called ‘max_depth.’ You can change this number to view more or less of the decision tree.
- Explore some of these decision trees. What thyroid characteristic was most important for predicting whether a patient has recurring thyroid cancer?
- (Code Block 2J) This code calculates and sorts feature importances from the Random Forest model, then creates a horizontal bar plot to visualize each feature’s importance.
- What are the most and least important characteristics now? Is it different from your feature importance graph from Code Block 1EE? Why do you think so?
- (Code Block 2K - Optional Advanced Analysis) This code is used to evaluate the performance of our model using the ROC (Receiver Operating Characteristic) curve and the AUC (Area Under the Curve) score. How is your model’s performance now? Is it comparable to when we made the model with all of the features?
3. Experiment with different combinations of features!
- What happens to the model’s accuracy when you select the bottom three features?
- What happens to the model’s accuracy when you randomly choose features?
- What happens to the model’s accuracy when you only choose one feature? Two features?
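One way to run these experiments without copying code is to wrap the split, train, and score steps in a small helper function and call it with different feature lists. The sketch below does this on synthetic data; the helper name `accuracy_for`, the placeholder column names, and the chosen subsets are all hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded dataset, with placeholder names.
X_array, y = make_classification(n_samples=300, n_features=6, random_state=0)
feature_names = [f"feature_{i}" for i in range(6)]
X = pd.DataFrame(X_array, columns=feature_names)


def accuracy_for(features):
    """Train a Random Forest on the given columns and return test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X[features], y, test_size=0.2, random_state=0
    )
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


# Feature combinations to compare (swap in any lists you like).
subsets = {
    "all features": feature_names,
    "three features": feature_names[:3],
    "one feature": feature_names[:1],
}

results = {name: accuracy_for(columns) for name, columns in subsets.items()}
for name, accuracy in results.items():
    print(f"{name}: {accuracy:.2f}")
```

Because the split and the random seed are the same for every subset, any difference in accuracy comes from the features themselves.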
Variations
- Replace the Random Forest classifier with another machine learning algorithm to see if one algorithm is better than another! (Try SVM or KNN!)
- Use techniques like SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic examples of underrepresented classes in the dataset (especially since not many people have Hurthle cells!)
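As a starting point for the first variation, the sketch below trains three different scikit-learn classifiers on the same split and compares their accuracy. The data is synthetic and the models use default or assumed settings, so treat it as a template rather than a result about the thyroid dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the encoded patient data.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Every scikit-learn classifier shares the same fit/score interface,
# so swapping algorithms only means swapping the object.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.2f}")
```

KNN and SVM are sensitive to the scale of each feature, so on real data it may help to standardize the columns before comparing them with the Random Forest.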