Abstract
Understanding the different types of celestial objects in the universe is important for astronomers. It helps them study how these objects form and evolve, map the night sky, understand the structure of the Milky Way and other galaxies, and identify celestial bodies that might host habitable environments. In this project, you will create a boosted tree model to classify celestial objects based on their spectral characteristics.
Summary
None
Readily available
No issues
Objective
Build a boosted tree model to classify celestial objects based on important characteristics.
Introduction
Astronomers use advanced tools to study celestial objects and classify them into different types. But how do they determine which objects belong to which category? Is it based on their brightness, color, or size? Sometimes, classification is straightforward: a star might be identified by its temperature and luminosity, or a quasar by its extremely bright and distant nature. Other times, the distinctions are more subtle. For example, two objects might appear similar, but small differences in their spectra can reveal their true nature. In this project, we will focus on three main types: stars, galaxies, and quasars. See the Bibliography to learn more about each type.
- A star is a massive, luminous sphere of plasma held together by gravity, undergoing nuclear fusion in its core.
- A galaxy is a massive system of stars, stellar remnants, interstellar gas, dust, and dark matter bound together by gravity.
- A quasar (quasi-stellar object) is an extremely luminous and energetic active galactic nucleus (AGN) powered by a supermassive black hole at the center of a galaxy.

Figure 1. Examples of celestial objects: a star (left), a galaxy (center), and a quasar (right).
Astronomers and astrophysicists spend years analyzing celestial objects and their characteristics to better understand their classification. But with billions of objects in the observable universe, including stars, galaxies, and quasars, how can we make sense of such vast amounts of data?
For instance, how can scientists study the properties of thousands of celestial objects in a survey to identify their types? How can researchers classify objects from data gathered by telescopes without manually checking each one? This is where machine learning comes in.
Artificial Intelligence (AI) is a branch of computer science focused on the creation of tools that can solve problems and analyze information. Machine learning is a subdivision of AI. Its goal is to create tools that can learn and improve over time using data. In this project, we will dive into decision trees, one type of machine learning algorithm. Decision trees mimic the human decision-making process by breaking down a problem into a series of sequential questions or decisions. Decision trees often serve as powerful tools for classifying data and solving various problems.
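To make the idea of "a series of sequential questions" concrete, here is a minimal hand-written sketch of a decision tree in Python. The feature name `redshift` is one of the dataset's columns, but the function name and both thresholds are invented for illustration. A real decision tree learns its questions and thresholds from the training data.

```python
def classify_by_redshift(redshift):
    """Toy decision tree: at most two yes/no questions about one feature.

    The thresholds are made up for illustration only; they were not
    learned from the project dataset.
    """
    if redshift < 0.01:   # Question 1: is the object barely redshifted?
        return "STAR"
    if redshift < 1.0:    # Question 2: is the redshift moderate?
        return "GALAXY"
    return "QSO"          # Otherwise: the redshift is high


for value in (0.0001, 0.3, 2.5):
    print(value, "->", classify_by_redshift(value))
```

Each `if` is one node of the tree, and each `return` is a leaf. Real objects overlap far more than this sketch suggests, which is why the model in this project uses many features and many trees.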
Watch this video to learn more about decision trees. We recommend watching the video from 0:00 to 9:15. The second half of the video is optional and goes into more detail:
However, decision trees have some drawbacks, like overfitting (being too specific to training data) and being sensitive to small changes in the data. To address these problems, we can use boosted trees, which are an ensemble learning method that combines multiple weak models, often decision trees, to create a strong model. Each tree is trained to correct the errors of the previous one, improving the overall accuracy and performance of the model. This technique is particularly effective for tasks like classification, where complex patterns in the data need to be recognized.
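The phrase "each tree is trained to correct the errors of the previous one" can be sketched in a few lines of plain Python. This is a simplified illustration, not the code from the project notebook: it predicts a number instead of a class, uses one-question trees (often called stumps), and the function names and toy data are all hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the one-question tree (x < threshold?) that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def boost(xs, ys, n_trees=5, learning_rate=0.5):
    """Add stumps one at a time; return the mean squared error after each one."""
    predictions = [0.0] * len(xs)
    errors = []
    for _ in range(n_trees):
        # The residuals are the mistakes the current ensemble still makes.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        # Add a scaled-down version of the new stump's correction.
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
        errors.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return errors


xs = [1, 2, 3, 4, 5, 6]   # toy feature values
ys = [1, 1, 2, 4, 5, 5]   # toy target values
errors = boost(xs, ys)
for round_number, error in enumerate(errors, start=1):
    print(f"after tree {round_number}: mean squared error = {error:.3f}")
```

The printed error shrinks as trees are added, because every new stump is fitted to what the earlier stumps got wrong.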
Watch this video to learn more about boosted trees:
In this project, your task is to identify the features that best indicate a celestial object’s type and fine-tune the boosted tree model’s learning rate to optimize its performance.
Terms and Concepts
- Star
- Galaxy
- Quasar
- Artificial Intelligence (AI)
- Machine learning
- Decision tree
- Overfitting
- Boosted tree
- Categorical variable
- Label encoding
- Local Outlier Factor (LOF)
- Confusion matrix
- Classification report
- Precision
- Recall
- F1-score
- Support
- Learning rate
- Hyperparameter
- Parameter
Questions
- What characteristics do astronomers use to classify celestial objects?
- Why might classifying celestial objects be challenging based solely on their appearance?
- How do decision trees work?
- Why are boosted trees considered an improvement over regular decision trees?
- Why do you think it is important for scientists to classify celestial objects? What could we learn from this process?
Bibliography
The dataset we will be using in this project can be found here:
- fedesoriano. (2022, January). Stellar Classification Dataset - SDSS17. Kaggle. Retrieved January 27, 2025.
The code is based on this Kaggle user's code:
- Beyza Nur Nakkaş. (2022, February). Stellar Classification - 98.4% Acc 100%. Kaggle. Retrieved January 27, 2025.
To learn more about stars, galaxies, and quasars:
- HubbleSite. (n.d.). Galaxies. NASA. Retrieved January 27, 2025.
- Khan Academy. (n.d.). What is a Star?. Retrieved January 27, 2025.
- NASA. (n.d.). Quasars. Retrieved January 27, 2025.
- Wikipedia. (2025, January). Stellar classification. Retrieved January 27, 2025.
To learn more about decision trees and boosted trees:
- Econoscent. (2020, October). Visual Guide to Gradient Boosted Trees (xgboost). YouTube. Retrieved January 27, 2025.
- Normalized Nerd. (2021, January). Decision Tree Classification Clearly Explained!. YouTube. Retrieved January 27, 2025.
To learn more about learning rate:
- deeplizard. (2017, November). Learning Rate in a Neural Network explained. YouTube. Retrieved January 27, 2025.
- UncomplicatingTech. (2023, August). Effect of Learning Rate in Neural network model!. YouTube. Retrieved January 27, 2025.
To learn more about preprocessing methods: Local Outlier Factor (LOF), Synthetic Minority Oversampling Technique (SMOTE), and feature scaling:
- Data Magic (by Sunny Kusawa). (2022, March). SMOTE - Handle imbalanced dataset | Synthetic Minority Oversampling Technique | Machine Learning. YouTube. Retrieved January 27, 2025.
- LaBarr, Aric. (2023, December). What is the Local Outlier Factor. YouTube. Retrieved January 27, 2025.
- TheDataPost. (2020, May). Feature Scaling. YouTube. Retrieved January 27, 2025.
To learn more about why we split our data into train and test:
- deeplizard. (2017, November). Train, Test, & Validation Sets explained. YouTube. Retrieved January 27, 2025.
- Turp, Misra. (2023, February). Why do we split data into train test and validation sets?. YouTube. Retrieved January 27, 2025.
To learn about how to read a confusion matrix:
- V7. (2022, September). Confusion Matrix: How To Use It & Interpret Results [Examples]. Retrieved January 27, 2025.
To learn more about ROC and AUC:
- StatQuest by Josh Starmer. (2019, July). ROC and AUC, Clearly Explained!. YouTube. Retrieved January 27, 2025.
Materials and Equipment
- Computer with Internet access
Experimental Procedure

Overview
In this project, you will preprocess the data by eliminating data or columns deemed unnecessary by statistical analysis. You will also adjust parameters of a boosted tree model to achieve the highest accuracy in classifying celestial objects into categories (e.g., STAR, GALAXY, and QSO).
Setting Up the Google Colab Environment
- You will need a Google account. If you do not have one, make one when prompted.
- Download the stellar_classification.ipynb file from Science Buddies. This is the code you will need to process your data.
- Download the stellar_data.csv file from Science Buddies.
- Open the dataset to familiarize yourself with its structure by going to your Downloads and double-clicking on the file.
- You can learn more about each column by clicking this link here.
- Within your Google Drive, click on ‘My Drive,’ then create a new folder and rename it “stellar_classification.” Inside the folder, upload both the stellar_classification.ipynb file and the stellar_data.csv file.
- Double-click on the stellar_classification.ipynb file. This should automatically open in Google Colab.
- Read the Troubleshooting Tips and How to Use This Notebook sections. Follow the instructions you find in those sections.
- Run the block under Importing Libraries to ensure you have access to all the functions we will use for this project.
1. Loading the Data into a Pandas DataFrame
- (Code Block 1A) Run this code block to create a DataFrame, which is like a table that will be used to load and manipulate the data in the notebook. You will see the data from your .csv file populate in a table below the code block.
2. Preprocessing
2.1 Label Encoding
When working with machine learning algorithms, it is important to convert categorical variables into numbers since algorithms understand numbers better than words. Categorical variables describe qualities or group membership (for example, whether an object is a star, a galaxy, or a quasar) rather than quantities, so most algorithms cannot use them directly. Label encoding is a way to convert categorical variables into non-negative whole numbers, assigning each category a unique integer value (0, 1, 2, etc.).
- (Code Block 2A) This code block converts the labels in the “class” column (e.g., “GALAXY”, “STAR”, and “QSO”) into numbers (0, 1, 2) using a mapping dictionary. It ensures the data is ready for machine learning. Run this code block.
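As a rough sketch of what label encoding with a mapping dictionary looks like, consider the plain-Python example below. The variable names and the particular assignment of numbers to classes are assumptions. The notebook's dictionary may order them differently.

```python
# Assumed mapping; the notebook may assign the numbers differently.
class_to_number = {"GALAXY": 0, "STAR": 1, "QSO": 2}

labels = ["GALAXY", "QSO", "STAR", "GALAXY"]   # toy values for the "class" column
encoded = [class_to_number[label] for label in labels]
print(encoded)

# The reverse mapping turns numeric predictions back into readable names.
number_to_class = {number: name for name, number in class_to_number.items()}
decoded = [number_to_class[number] for number in encoded]
print(decoded)
```

With a pandas DataFrame (here a hypothetical `df`), the same dictionary can be applied to a whole column at once with `df["class"].map(class_to_number)`.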
2.2 Detecting Outliers
Outliers are data points that differ significantly from the majority of values in a dataset. Detecting and removing them is important because they can skew results, reduce model accuracy, and hide true patterns in the data. Outliers often result from errors, anomalies, or rare events and can mislead analysis if not addressed. Removing them helps make data cleaner and improves reliability, but it is important to check if they provide valuable insights before removing them.
In our case, outliers in celestial object data can arise from both natural and observational causes. Natural causes include unusual objects like rapidly rotating stars, active galaxies, or rare events such as supernovae and gamma-ray bursts. Observational causes include measurement errors, contamination from nearby objects, blending of light from multiple sources in crowded regions, and issues with data processing or telescope equipment. Sometimes, human mistakes, such as mismatching objects in catalogs or using incorrect models, can also create outliers. These need to be carefully analyzed to determine whether they represent real phenomena or just errors.
- (Code Block 2B) This code block uses the Local Outlier Factor (LOF) algorithm to detect outliers in a dataset. It creates a LOF model, fits it to the dataset, and predicts whether each data point is normal (1) or an outlier (-1). LOF identifies outliers by comparing the density of each point to its neighbors, making it useful for spotting anomalies in data. Click on this link to learn more about the LOF algorithm.
- (Code Block 2C) This code block detects outliers using scores from the LOF model. It stores these scores, sets a threshold (currently -1.5), and identifies points with scores below the threshold as outliers. Finally, it counts and prints the total number of outliers.
- (Code Block 2D) This code block removes rows identified as outliers from the LOF model. This ensures that the DataFrame is cleaned by removing outliers, leaving only inliers for further analysis or modeling. Run this code block.
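To see how LOF compares the density around a point with the density around its neighbors, here is a simplified plain-Python sketch. It is not the notebook's code: the function name, the toy points, and the choice of `k=2` neighbors are assumptions, ties between equally distant neighbors are broken arbitrarily, and duplicate points are not handled. Scores are returned as negative numbers so that the -1.5 threshold from Code Block 2C can be applied in the same way.

```python
import math


def negative_lof_scores(points, k=2):
    """Return one score per point: near -1 for inliers, far below -1 for outliers."""
    n = len(points)
    # For every point, keep its k nearest neighbours as (distance, index) pairs.
    neighbors = []
    for i in range(n):
        distances = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbors.append(distances[:k])
    # k-distance: the distance from a point to its k-th nearest neighbour.
    k_distance = [nearest[-1][0] for nearest in neighbors]

    def local_density(i):
        # Reachability distance: never smaller than the neighbour's own k-distance.
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        return k / sum(reach)

    density = [local_density(i) for i in range(n)]
    # LOF = average neighbour density divided by the point's own density.
    return [
        -(sum(density[j] for _, j in neighbors[i]) / k) / density[i]
        for i in range(n)
    ]


# Five points close together and one far away (toy data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
scores = negative_lof_scores(points)

threshold = -1.5
outlier_indices = [i for i, score in enumerate(scores) if score < threshold]
inliers = [p for i, p in enumerate(points) if i not in outlier_indices]
print("outliers:", [points[i] for i in outlier_indices])
print("kept:", inliers)
```

The isolated point sits in a much emptier region than its nearest neighbors do, so its score falls far below the threshold and it is the only point removed.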
2.3 Feature Selection
- (Code Block 2E) This code block generates a heatmap to help easily identify relationships between features in the dataset, such as strong positive correlations (closer to 1) or negative correlations (closer to -1).
- Which features are most correlated with one another? Which ones have the weakest correlation?
- (Code Block 2F) This code block calculates how strongly each numerical feature in the DataFrame is related to the ‘class’ column using correlations. It then sorts these correlations, showing which features are most positively or negatively related to ‘class.’ This helps in selecting important features and understanding the data.
- (Code Block 2G) This code block removes specific columns from the DataFrame. Inside the list under the #TODO comment, add the column names of the columns you want to drop. Here are some things to consider:
- Features with very low correlation (close to 0) have minimal impact on predicting class.
- Retain columns with higher correlations (close to 1), as they are more strongly related to the target (class).
- Consider keeping features with negative correlations if the absolute value of the correlation is significant. A negative correlation indicates an inverse relationship with the target variable (class), which can still be valuable for prediction.
- Drop columns with missing values or NaN correlation unless you can impute or determine their importance through other means.
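The guidelines above can be tried out on a small scale. The sketch below computes the Pearson correlation between each feature and the encoded class, then lists the features whose correlation is close to 0. The feature names, the toy numbers, and the 0.1 cutoff are all hypothetical and are only there to show the mechanics.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


class_codes = [0, 0, 1, 1, 2, 2]                   # encoded "class" column (toy)
features = {
    "feature_a": [0.1, 0.2, 1.0, 1.1, 2.0, 2.2],   # rises with the class code
    "feature_b": [5.0, 4.8, 3.1, 3.0, 1.2, 1.0],   # falls as the class code rises
    "feature_c": [3.0, 1.0, 2.0, 2.0, 1.0, 3.0],   # unrelated to the class code
}

correlations = {name: pearson(values, class_codes)
                for name, values in features.items()}

# Sort by absolute value so strong negative correlations are not overlooked.
for name, r in sorted(correlations.items(), key=lambda item: -abs(item[1])):
    print(f"{name}: {r:+.3f}")

cutoff = 0.1   # hypothetical cutoff for "close to 0"
to_drop = [name for name, r in correlations.items() if abs(r) < cutoff]
print("candidates to drop:", to_drop)
```

Keep in mind that the numbers assigned to the classes are arbitrary, so correlation with the encoded class is only a rough guide. Checking how accuracy changes when a feature is removed is a more direct test.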
2.4 Handling Imbalanced Data
Training on a balanced dataset is important because it ensures that the model learns equally from all classes and doesn't become biased toward the majority class. In an imbalanced dataset, the model may prioritize accuracy by focusing on the dominant class while ignoring the minority classes, leading to poor performance for underrepresented categories. A balanced dataset improves the model’s ability to generalize and make fair predictions across all classes. There are many ways to combat imbalanced data, and in this project, we will be using an oversampling method called SMOTE, which will be explained in more detail below.
- (Code Block 2H) This code block creates a count plot to show how many entries belong to each category in the ‘class’ column. Each bar represents a unique class, and its height shows the count of entries for that class. The plot helps you see if the dataset is balanced or has more entries in some categories than others.
- (Code Block 2I) This code block splits the data into input features (X) and the target variable (y). It removes the ‘class’ column to create X, which contains all the other data, and selects the ‘class’ column for y, which is the target for prediction.
- (Code Block 2J) This code block applies the SMOTE (Synthetic Minority Over-sampling Technique) algorithm to balance an imbalanced dataset by generating synthetic samples for the minority class. Click on this link to learn more about SMOTE.
- (Code Block 2K) This code block creates a count plot to show the distribution of target variable y (class labels) after resampling using the SMOTE technique. This plot helps confirm if the dataset is balanced after resampling.
- (Code Block 2L) This code block scales the input features (X) to standardize them, so they have a mean of 0 and a standard deviation of 1. This helps machine learning models work better and train faster by putting all features on the same scale. Click on this link to learn more about feature scaling.
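The core idea behind SMOTE (Code Block 2J) is to create each synthetic sample somewhere on the line between a minority-class sample and one of its nearest minority-class neighbors. The plain-Python sketch below illustrates only that idea. The function name, the toy points, and `k=2` are assumptions, and a library implementation handles many details that are left out here.

```python
import math
import random


def make_synthetic_point(minority, k, rng):
    """Create one new point between a minority sample and a nearby minority sample."""
    base = rng.choice(minority)
    others = [p for p in minority if p != base]
    others.sort(key=lambda p: math.dist(base, p))
    neighbor = rng.choice(others[:k])   # one of the k nearest minority neighbours
    gap = rng.random()                  # how far to move from base toward neighbor
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbor))


rng = random.Random(0)   # fixed seed so the sketch is repeatable
majority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.4, 0.4), (0.2, 0.5)]
minority = [(2.0, 2.0), (2.2, 2.1), (2.1, 2.4)]

# Keep adding synthetic minority points until both classes are the same size.
synthetic = []
while len(minority) + len(synthetic) < len(majority):
    synthetic.append(make_synthetic_point(minority, k=2, rng=rng))

print("majority count:", len(majority))
print("minority count after oversampling:", len(minority) + len(synthetic))
print("synthetic points:", synthetic)
```

Because every synthetic point lies between two real minority samples, the new points stay inside the region the minority class already occupies instead of being exact copies.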
3. Split to Train and Test
We will split our dataset into training and testing sets so that we can detect overfitting and ensure the model is evaluated on data it has never seen before. Click on this link to learn more about splitting our data into train and test.
- (Code Block 3A) We have provided the code to split the dataset into training and testing parts. Run this code block.
- (Code Block 3B) This code block prints out the sizes of X_train, y_train, X_test, and y_test. If you see the X_train size as (121668, 9), that means there are 121668 samples (celestial objects) in the training data, each with 9 features (e.g., redshift, plate, MJD, etc.).
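To illustrate what a train/test split does and how to read the resulting sizes, here is a small plain-Python sketch. The function name, the 20% test fraction, and the seed are assumptions. The notebook may use a different fraction and a library function instead.

```python
import random


def split_train_test(X, y, test_fraction=0.2, seed=42):
    """Shuffle the row indices, then set aside a fraction of the rows for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(X) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


X = [[i, i * 2] for i in range(10)]   # 10 toy samples with 2 features each
y = [i % 3 for i in range(10)]        # toy class labels

X_train, X_test, y_train, y_test = split_train_test(X, y)

# (number of samples, number of features), like the sizes printed in the notebook.
print("X_train size:", (len(X_train), len(X_train[0])))
print("X_test size:", (len(X_test), len(X_test[0])))
```

The rows are shuffled before splitting so that the test set is not simply the last rows of the file, and no row appears in both sets.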
4. Training the Model
- (Code Block 4A) We have provided the code to create and train a boosted tree classifier. The accuracy of the model will be printed below. Run this code block.
5. Evaluating the Model
- (Code Block 5A) This code block summarizes the model's predictions in a confusion matrix and displays it as a heatmap of correct and incorrect predictions for each class (GALAXY, STAR, QSO). This helps identify the model’s accuracy and errors. Click on this link to learn more about how to read a confusion matrix.
- Are there any classes the model struggles to classify correctly? Why might this be?
- (Code Block 5B) This code block displays a classification report to evaluate the performance of the model on the test dataset. The report includes key metrics for each class:
- Precision: Measures how many of the predicted positive instances are actually correct.
- Recall: Measures how many of the actual positive instances were correctly predicted.
- F1-score: The harmonic mean of precision and recall, balancing the two metrics.
- Support: The number of true instances for each class in the test set.
- (Code Block 5C) This code block evaluates and visualizes the ROC curves to show the balance between true positive and false positive rates, and the area under the curve (AUC) scores indicate the model’s performance for each class. Higher AUC scores mean better classification. Click on this link to learn more about ROC and AUC.
- (Code Block 5D) This code block shows how well the model predicts each class (GALAXY, STAR, QSO) using a bar chart. The chart compares actual labels to predictions and highlights mistakes.
- Which class has the most prediction errors? Why do you think this is the case?
- Are there any classes the model predicts accurately most of the time? What might contribute to this accuracy?
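The confusion matrix and the classification report metrics described above can be worked out by hand on a small example. The plain-Python sketch below uses hypothetical labels and function names. It is meant to show how the numbers are derived, not to reproduce the notebook's output.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[t][p] counts samples whose true class is t and predicted class is p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def class_report(matrix):
    """Compute precision, recall, F1-score, and support for each class."""
    report = {}
    for c in range(len(matrix)):
        true_positives = matrix[c][c]
        predicted_as_c = sum(row[c] for row in matrix)   # column total
        actually_c = sum(matrix[c])                      # row total (support)
        precision = true_positives / predicted_as_c if predicted_as_c else 0.0
        recall = true_positives / actually_c if actually_c else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[c] = {"precision": precision, "recall": recall,
                     "f1": f1, "support": actually_c}
    return report


# Toy labels: 0, 1, and 2 stand for the three encoded classes.
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 2, 1, 1, 1, 2, 2, 0]

matrix = confusion_matrix(y_true, y_pred, n_classes=3)
report = class_report(matrix)
accuracy = sum(matrix[c][c] for c in range(3)) / len(y_true)

for row in matrix:
    print(row)
for c, metrics in report.items():
    print(c, {name: round(value, 3) for name, value in metrics.items()})
print("accuracy:", accuracy)
```

Correct predictions sit on the diagonal of the matrix. Here one class 0 object was predicted as class 2 and one class 2 object was predicted as class 0, so those two classes lose precision and recall while class 1 is classified perfectly.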
6. Refine Your Model
Now, it is time to experiment and improve the model by adjusting various settings, such as the threshold for outliers, learning rate, which features to include or exclude, maximum depth of the trees, and the number of trees in the ensemble. Remember that accuracy can be found in Code Block 4A and that all code should be run again when adjusting parameters.
In a notebook or spreadsheet, create a table like the one below to keep track of each change and its effect on the model's accuracy:
| Outlier Threshold | Learning Rate | Max Depth | Number of Trees | Features Included | Accuracy |
|---|---|---|---|---|---|
| -1.5 | 0.1 | 5 | 50 | obj_ID, alpha, delta, run_ID, rerun_ID, cam_col, field_ID | 0.979 |
- Adjust the threshold for outliers (Code Block 2C)
- Try adjusting the threshold value to see how that affects the number of outliers. Remember that LOF assigns scores close to -1 for inliers. Scores significantly lower than -1 (e.g., below -1.5 or -2) often indicate outliers.
- A common starting point is -1.5, but it is important to adjust it based on the dataset and the desired level of sensitivity.
- Adjust the learning rate (Code Block 4A)
- The learning rate is a hyperparameter, which is a value set before training that controls the behavior of the training process and is not learned from the data. The learning rate determines the size of the steps the algorithm takes when adjusting the model’s parameters (values that do change with training). In a boosted tree model, it scales how much each new tree's correction contributes to the overall prediction. Click on this link for a quick explanation of learning rate.
- A learning rate that is too high might overshoot the optimal solution, fail to converge, or exhibit erratic behavior.
- A learning rate that is too low may take a long time to converge, so the model may need many more trees to reach the same accuracy, or it may get stuck in a local minimum.
- Final Step: Iterate and Track Your Results
- Throughout your experimentation, keep a detailed log of your adjustments and their corresponding results in the data table shown above.
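To build some intuition for the learning rate before you start experimenting, consider this idealized sketch. It assumes every new "tree" predicts the remaining error perfectly, which real trees do not, so that the only thing being varied is the step size. The function name and the target value are hypothetical.

```python
def remaining_error(rounds, learning_rate, target=10.0):
    """Distance from the target after a number of boosting-style update rounds."""
    prediction = 0.0
    for _ in range(rounds):
        residual = target - prediction
        # Idealized step: the new tree predicts the residual exactly,
        # and the learning rate decides how much of it is applied.
        prediction += learning_rate * residual
    return abs(target - prediction)


for rate in (0.1, 0.5, 1.0, 2.5):
    print(f"learning rate {rate}: error after 10 rounds = "
          f"{remaining_error(10, rate):.3f}")
```

In this sketch, a small rate such as 0.1 is still well short of the target after 10 rounds, a rate of 1.0 lands on it immediately, and a rate of 2.5 overshoots by more each round and drifts further away. With real data the picture is less tidy: small rates combined with more trees often generalize better, which is why it is worth recording both settings in your table.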
Variations
- Adjust the number of trees in the ensemble (Code Block 4A)
- In ensemble methods like boosted trees, the “number of trees” refers to how many individual decision trees are included in the ensemble.
- Having more trees typically improves performance because each new tree is trained to correct the errors left by the trees before it, and the trees' contributions are added together. However, adding too many trees increases computational cost and training time, and it can lead to overfitting.
- Fewer trees reduce computational cost but may result in a less accurate model if too few trees are used.
- Experiment with dataset features (Code Block 2G)
- Consider including more or fewer features from the dataset.
- Experiment with including all available features. Observe any changes in accuracy and training time.
- Next, aim to achieve high accuracy with the fewest features possible. Which features contribute most to classifying the celestial object types accurately? Reflect on why these specific features are significant.