Use DNA Sequencing to Trace the Blue Whale's Evolutionary Tree
Abstract
The first land animals took their tentative steps out of the ocean and onto solid ground around 365 million years ago. Over millions of years, these early ancestors developed into tetrapods, including amphibians, reptiles, dinosaurs, birds, and mammals. Then, around 50 million years ago, the reverse process occurred: the mammalian ancestor of today's whales returned to the ocean. In this genomics science fair project, you will use mitochondrial protein sequencing to trace the evolution of whales and identify their closest living relatives that still have four legs.
David Whyte, PhD, Science Buddies
Edited: Svenja Lohner, PhD, Science Buddies
In this genomics science fair project, you will trace the blue whale's family tree using genomic sequences in the GenBank database and the BLAST search tool.
The evolutionary story of the whales is dramatic. An ancient land mammal returned to the sea, about 50 million years ago, to become the forerunner of today's whales. In doing so, it lost its legs, and all of its internal organs became adapted to a marine existence. This ocean invasion by mammals is the reverse of what happened millions of years previously, when the first animals crawled out of the sea onto land.
Figure 1. The blue whale (Balaenoptera musculus) is a marine mammal belonging to the suborder of baleen whales (called Mysticeti). At up to 110 feet (about 34 meters) in length, it is believed to be the largest animal ever to have existed. (Wikipedia, 2008.)
What did the first whales look like, and what gave rise to them? For a long time, scientists could only speculate, for the oldest fossils anyone knew of had already assumed the basic appearance of whales. Charles Darwin speculated in the first edition of The Origin of Species that bears might be the precursors of whales: "In North America the black bear was seen ... swimming for hours with widely open mouth, thus catching, like a whale, insects in the water. Even in so extreme a case as this, if the supply of insects were constant, and if better adapted competitors did not already exist in the country, I can see no difficulty in a race of bears being rendered, by natural selection, more and more aquatic in their structure and habits, with larger and larger mouths, till a creature was produced as monstrous as a whale."
Today we have much better information about the evolution of whales than was available to Darwin. The new information comes primarily from two sources. The first source is an abundance of intermediate fossils that have been uncovered over the past two decades that allow paleontologists to trace the development of modern whales, step by step, back to their beginnings early in the Eocene epoch. This time period is often referred to as the dawn of the age of mammals, and lasted from about 55 million to 34 million years ago. The second source of new insights into whale evolution comes from analysis of DNA samples.
DNA acquires mutations over time. The longer two species have been separated, the greater the number of different mutations that will have accumulated in each species' genome. A phylogenetic tree can be built based on the accumulated changes between genes in different organisms: fewer genetic changes imply a relatively recent divergence. Based on pair-wise comparisons of a set of genes from different organisms, it is possible to construct a tree that reflects the genetic distances between the organisms. In the past, this sort of tree-building was done solely by comparing the bones, anatomy, behavior, etc., of many animals. With genomics tools, it can be done online from your computer, and it can include animals that don't even have bones, provided there is DNA sequence data available.
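To make the idea of pair-wise comparison concrete, here is a minimal Python sketch. It assumes the sequences are already aligned to the same length, and it uses three short, made-up fragments (not real cox1 data) with hypothetical species labels. The distance used is simply the fraction of positions that differ, which is one of several possible distance measures.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def distance_table(sequences):
    """Return {(name1, name2): distance} for every pair of sequences."""
    return {
        (n1, n2): p_distance(sequences[n1], sequences[n2])
        for n1, n2 in combinations(sorted(sequences), 2)
    }


# Hypothetical, made-up aligned fragments used only for illustration.
toy_sequences = {
    "species_A": "MFINRWLFST",
    "species_B": "MFINRWLFSA",
    "species_C": "MFTNRWMFAA",
}

for pair, dist in distance_table(toy_sequences).items():
    print(pair, dist)
```

In this toy example, species_A and species_B differ at only one of ten positions, so they would sit closer together on a tree than either does to species_C.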
The focus of this genomics science fair project is to explore the evolutionary tree of the whales using DNA sequences available in GenBank, a free database of sequences derived from hundreds of different organisms. Since new sequences are always being added, the tree you make will reflect the newest genomic information available.
Although this is a genomics science fair project, it is also about whales. To build the tree of the whale family, you will need to spend some time getting familiar with the scientific names and key features of about 25 whale and dolphin species. You will also learn about the blue whale's closest relatives that did not make the transition to a marine habitat.
The process you will use to generate an evolutionary tree for the whales is outlined as follows:
- Obtain the protein sequence for a whale mitochondrial protein.
- A mitochondrial sequence is very useful for studying evolution because it does not recombine and it accumulates mutations at a faster rate than chromosomal DNA.
- For the purposes of this science fair project, you will use the cytochrome c oxidase 1 protein (cox1) from the blue whale. This will be the query sequence.
- Other genes or other whales can be used as variations on this science fair project.
- Use the bioinformatics search tool, BLAST, to identify sequences from other organisms that are related to the query sequence. BLAST will find genes that are related to the query. The more closely related the genes are to each other, the closer the organisms they come from are to each other on the evolutionary tree.
- Use a simple tree-building tool to generate an evolutionary tree based on the BLAST output. There are more-sophisticated tools available, but the BLAST tool will suffice for this science fair project.
The key issues the tree will address are:
- What are the nearest relatives of the blue whale, determined by DNA analysis?
- What is the identity of the closest blue whale relative that still has four legs? (It is not a bear!)
The tools involved in this science fair project are simple to use and very powerful. The evolutionary trees that you make from your BLAST analysis will be based on molecular data for hundreds of species derived from samples collected from all over the globe. In the short time it takes to build a BLAST tree, you can see evidence for relationships between animals that would be difficult or impossible to obtain by traditional methods that are based on fossils.
Terms and Concepts
To start this genomics science fair project, learn about whale evolution using the Internet and your local library. Also, go to the National Center for Biotechnology Information (NCBI) website and explore the databases and tools available. You will be using BLAST to search the reference protein database.
- Eocene epoch
- Phylogenetic tree
- Pair-wise comparisons
- Evolutionary tree
- Cytochrome c oxidase 1 protein (cox1)
- Query sequence
- Accession number
- Base pair
- Mitochondrial DNA
- Mutation rate
Questions
- What are the key characteristics of mammals?
- What adaptations occurred in the whale's anatomy as it evolved from a land-based to a marine mammal?
- Make a timeline of mammalian evolution, focusing on cetaceans.
- Why use mitochondrial genomes for evolutionary analysis?
- What does the acronym BLAST stand for?
Bibliography
These websites discuss the current data and hypotheses about whale evolution:
- The Thewissen Lab. (2008). Whale Origins Research. Retrieved August 27, 2021.
- KQED. (2001). Whale evolution. Retrieved August 18, 2008.
- MarineBio.org. (2008). Blue Whales. Retrieved August 27, 2021.
- Wikipedia Contributors. (2008). Blue Whale. Retrieved August 18, 2008.
- Understanding Evolution. (n.d.). The evolution of whales. Retrieved August 27, 2021.
These websites are useful resources for understanding how DNA can be used to build evolutionary trees and the bioinformatics tools used to do this:
- National Center for Biotechnology Information. (2008). Basic Local Alignment Search Tool. Retrieved August 18, 2008.
- National Center for Biotechnology Information. (2008). Welcome to NCBI. Retrieved August 18, 2008.
- National Center for Biotechnology Information. (2004, April 1). Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources. Retrieved August 27, 2021.
- European Molecular Biology Laboratory's European Bioinformatics Institute. (2008). ClustalW2. Retrieved November 13, 2008.
Materials and Equipment
- Computer with high-speed Internet connection
- Lab notebook
Before you start this project, it might be helpful to familiarize yourself with the bioinformatics tools and websites that you are going to use.
To begin, obtain the query sequence, the cytochrome c oxidase subunit 1 protein, from the blue whale.
- Open the NCBI home page.
- In the "Search" box, make sure that "Nucleotide" is selected from the drop-down menu.
- Type in "blue whale mitochondrial complete genome" in the "Search for" box.
- The resulting page will list documents in the Nucleotide database that contain "blue whale" or its scientific name Balaenoptera musculus.
- If there is more than one result, look for the one that has the accession number NC_001601.1. This number is a unique GenBank identifier for this record.
- Click on the link to the record for the accession number NC_001601.1. Scan up and down the page for NC_001601.1. It has a wealth of information, including the names of the researchers who submitted the sequence, the date of submission, the sequence of the proteins coded for by the mitochondrial DNA, and the complete mitochondrial DNA sequence (at the bottom of the page).
- Note that the mitochondrial genome is 16,402 base pairs long and contains 13 protein-coding genes. You can find information for each gene in the "Features" section (labeled as "gene"). Right underneath the gene information, you will find information on the protein the gene encodes (labeled as "CDS"). You will use the protein sequence from one of these genes to establish the evolutionary tree.
- On the page in the "Features" section, look for the CDS that says /product="cytochrome c oxidase subunit I". This entry lists the protein sequence of cytochrome c oxidase subunit 1. The protein sequence is the string of capital letters right after /translation=. Alternatively, you can click on the protein_id link "NP_007058.1" to see the page describing the protein. There you will also find the protein sequence at the very bottom of the page. The protein sequence is the query that you will use for your search.
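If you save the record as text, the same lookup can be sketched in Python. This is only an illustration: it assumes the saved text follows the usual layout of the "Features" section (feature keys indented by five spaces, qualifiers such as /product and /translation on the lines below), and the record shown here is a made-up placeholder, not the real blue whale entry. The function name is hypothetical.

```python
import re


def extract_translation(flat_file_text, product_name):
    """Return the /translation sequence of the CDS whose /product matches, or None."""
    # Assumption: each feature starts with a key (e.g. "CDS") indented by five spaces.
    features = re.split(r"\n(?= {5}\S)", flat_file_text)
    for feature in features:
        if not feature.lstrip().startswith("CDS"):
            continue
        product = re.search(r'/product="([^"]*)"', feature)
        translation = re.search(r'/translation="([^"]*)"', feature)
        if product and translation and product.group(1) == product_name:
            # Long sequences wrap across lines, so remove all whitespace.
            return re.sub(r"\s+", "", translation.group(1))
    return None


# Made-up placeholder record (coordinates and sequences are NOT real data).
toy_record = '''     gene            1..60
                     /gene="COX1"
     CDS             1..60
                     /gene="COX1"
                     /product="cytochrome c oxidase subunit I"
                     /translation="MACDEFGHIK
                     LMNPQRSTVW"
     gene            61..90
                     /gene="COX2"
     CDS             61..90
                     /gene="COX2"
                     /product="cytochrome c oxidase subunit II"
                     /translation="MKKKKKKKKK"
'''

print(extract_translation(toy_record, "cytochrome c oxidase subunit I"))
```

The exact product name matters: searching for "subunit I" must not return the entry for "subunit II", which is why the sketch compares the whole product string.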
Now that you have the query sequence, open the BLAST page, following the instructions below. BLAST is a program that compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences.
- Open a new window or tab in your internet browser and go to the NCBI main page, and click on the BLAST link in the "Popular Resources" list on the right.
- There are several versions of BLAST. Since you want to use a protein sequence to search a protein database, click on "Protein BLAST" under the "Web BLAST" heading.
- Fill out the protein BLAST query form.
- Copy and paste the cytochrome c oxidase subunit I protein sequence into the box labeled "Enter accession number, gi, or FASTA sequence". You can also paste the protein accession number, NP_007058, into the box.
- In the "Job title" box, type "Blue whale cox1 protein."
- In the "Database" box, choose "Reference proteins (refseq_protein)" from the drop-down menu. The reference proteins are non-redundant, so there is just one cox1 sequence for each species.
- Under "Algorithm," click on "blastp" (which is protein vs. protein).
- Keep all the other search parameters including the Algorithm parameters at their default settings.
- Next to the BLAST button at the bottom of the page, check the box "show results in a new window."
- Click on BLAST to start the search.
- Be patient. It will take a minute or two for the BLAST search to finish. The results page will resemble the one in Figure 2.
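The web form above is all you need for this project. For the curious, the choices you made in the form can also be written down as a set of search parameters. The Python sketch below only builds and prints such a parameter string; it does not contact NCBI. The address and the parameter names (CMD, PROGRAM, DATABASE, QUERY) follow NCBI's URL API as we understand it, so treat them as assumptions and check the current NCBI documentation before sending any request. The function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed address of the NCBI BLAST URL API; verify against NCBI's documentation.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blastp_request(query, database="refseq_protein"):
    """Encode the form choices from the steps above as URL parameters."""
    params = {
        "CMD": "Put",          # submit a new search
        "PROGRAM": "blastp",   # protein query vs. protein database
        "DATABASE": database,  # "Reference proteins (refseq_protein)"
        "QUERY": query,        # accession number or protein sequence
    }
    return urlencode(params)


body = build_blastp_request("NP_007058")
print(BLAST_URL)
print(body)
```

Writing the parameters down this way is also a convenient record for your lab notebook of exactly which program, database, and query you used.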
Screenshot of the BLAST results page on the ncbi.nlm.nih.gov website. The results page in the BLAST tool on the NCBI webpage shows a list of protein sequences that match the search term. Results provide additional information such as the percentage match a result has to a specific query string.
Figure 2. BLAST results for cytochrome c oxidase subunit 1 protein from the blue whale. (NCBI, 2021.)
- On the top left of the BLAST results page you will find the summary section (blue in Figure 2), which provides information on different aspects of your search. On the top right there is a box that allows you to filter your results based on certain criteria (red in Figure 2). Below the top section, the BLAST results are shown (yellow in Figure 2). There are four different tabs called "Description," "Graphic Summary," "Alignments," and "Taxonomy." Each tab presents the search results in a different way.
- The "Description" tab contains a summary table of hits found by BLAST and is the default tab shown.
- The "Graphic Summary" tab shows a color key of the alignments. The color key shows the degree of similarity for the sequences.
- The "Alignments" tab contains the detailed pairwise alignments between query and database sequences.
- The "Taxonomy" tab provides details of the taxonomic distribution of matches BLAST found.
- Click on the "Alignments" tab and look at the result page, which should look similar to the one in Figure 3. The first result will be the query sequence against itself, with 100% identity.
- Note the "% identity" listed for each alignment. If the "% identity" between two species is 97%, then these two species differ by 3% in the protein sequence. Remember, the larger the % difference, the more distant they are in the family tree.
- Pick 30 BLAST results (10 from the top, 10 from the middle, and 10 from the bottom of the results list) and make a data table listing the scientific classification of each species (order, sub-order, family, sub-family, genus, species), its common name (you will need to look this up), the % identity, and the actual number of protein sequence changes (for example, "Identities = 493/512" means there are 19 changes between the sequences) in your lab notebook.
- In the "Description" tab, you can uncheck all the results by clicking on the "select all" box in the top left corner of the BLAST results table. Then you can check the individual sequences you want to select for your analysis and look at their alignments in the "Alignments" tab or get more information on the taxonomy of the individual species in the "Taxonomy" tab. Go through each of the results tabs to find the information you need.
- What do you notice about your data as you go further down the BLAST result page?
Figure 3. A detailed view of two aligned sequences as shown in the "Alignments" tab of the BLAST output page. (NCBI, 2021.)
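When filling in your data table, the number of changes and the % identity both come from the "Identities" line of each alignment. The short Python sketch below shows the arithmetic, using the "Identities = 493/512" example from the procedure. The function name is hypothetical, and the sketch assumes the line contains the text "Identities = X/Y".

```python
import re


def parse_identities(line):
    """Return (number of changes, percent identity) from a BLAST 'Identities' line."""
    match = re.search(r"Identities\s*=\s*(\d+)/(\d+)", line)
    if match is None:
        raise ValueError("no 'Identities = X/Y' found in line")
    identical = int(match.group(1))
    aligned = int(match.group(2))
    changes = aligned - identical
    percent_identity = 100 * identical / aligned
    return changes, percent_identity


changes, percent = parse_identities("Identities = 493/512")
print(changes, round(percent, 1))  # 19 96.3
```

The % difference you will need later is simply 100 minus the % identity.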
Your BLAST results provide the data needed to determine which species are most related to the blue whale. The next step is to visualize the data as a tree.
To generate a tree, click on the "Distance tree of results" link next to "Other reports" above the results table in the summary section of your BLAST report. This includes all of the hits from the BLAST search in the tree. To simplify the tree, you can also select a subset of sequences in the "Description" tab and then click on the link "Distance tree of results" within the results table. This tree will only include the hits from your selected sequences. The tree should look similar to the one in Figure 4.
- Use the drop-down list under "Sequence label" and select "taxonomic name" to simplify the names on the tree.
- Keep the other tree parameters at their default settings.
- The program has clustered some of the species together (leaves), which is indicated by a green triangle. Hover over the tip of each triangle with your cursor and expand each cluster of leaves to identify the individual members. You might need to zoom into the tree using the slider in the top toolbar of the tree to be able to read the taxonomic name of each leaf.
A line diagram shows the relationship of cox1 proteins from species related to the blue whale. Each line represents a branch within the tree. At the tips of the lines the species names are written.
Figure 4. Tree showing relationship of cox1 proteins from species related to the blue whale. (NCBI, 2021.)
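To get a feel for how a table of distances can turn into a tree, here is a small Python sketch of UPGMA, a simple textbook clustering method. It is not necessarily the method the BLAST tree view uses, and the taxon names and distances are invented for illustration. The sketch repeatedly joins the two closest groups and returns the tree as nested pairs, without branch lengths.

```python
from itertools import combinations


def upgma(distances):
    """Build a tree (nested tuples) from {(name1, name2): distance} for all leaf pairs."""
    d = {}      # distance between any two current clusters, stored in both orders
    sizes = {}  # number of leaves in each cluster
    for (a, b), value in distances.items():
        d[(a, b)] = d[(b, a)] = value
        sizes[a] = sizes[b] = 1
    clusters = sorted(sizes)
    while len(clusters) > 1:
        # Join the two clusters that are currently closest together.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[pair])
        merged = (a, b)
        sizes[merged] = sizes[a] + sizes[b]
        for other in clusters:
            if other in (a, b):
                continue
            # Distance to the new cluster: size-weighted average of the old distances.
            average = (d[(a, other)] * sizes[a] + d[(b, other)] * sizes[b]) / sizes[merged]
            d[(merged, other)] = d[(other, merged)] = average
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
    return clusters[0]


# Invented distances between four hypothetical taxa.
toy_distances = {
    ("taxon_1", "taxon_2"): 0.02,
    ("taxon_1", "taxon_3"): 0.10,
    ("taxon_2", "taxon_3"): 0.12,
    ("taxon_1", "taxon_4"): 0.20,
    ("taxon_2", "taxon_4"): 0.22,
    ("taxon_3", "taxon_4"): 0.18,
}

print(upgma(toy_distances))
```

Here taxon_1 and taxon_2 are joined first because they have the smallest distance, then taxon_3 joins that pair, and taxon_4 is the most distant branch.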
- Create a simplified distance tree with just the 30 sequences you selected for your data table. In the "Description" tab, select your 30 sequences and then click on "Distance tree of results" within the BLAST results table. Your tree will only include the 30 selected sequences.
- Compare the tree with all the BLAST results and your simplified tree. How are they similar or different? Which branches of the tree are missing? Which species are within the missing branches?
- Make a figure showing your simplified tree, with the data in your data table (% differences of the sequences) added.
- You might want to re-draw the tree, using a computer graphics program or by hand, to enlarge the text and to add pictures of the animals.
- Looking at the tree that includes the full set of your BLAST results, what are the blue whale's nearest "DNA cousins"? How does your result compare to traditional trees made prior to DNA analysis? What are the blue whale's nearest relatives that are not whales or dolphins? If you need help reading the phylogenetic tree, you can view the Phylogenetics and Reading Phylogenetic Trees or the Phylogenetic analysis of pathogens video.
- Based on your data, you are in a position to estimate the time of divergence of the blue whale from the other species listed in your BLAST table. If you assume that the blue whale (Balaenoptera musculus) split from the fin whale (Balaenoptera physalus) 10.3 million years ago (see Whale Origins, also referenced in the Bibliography), how many mutations occurred in the cox1 protein per million years? Use this data to determine the time of divergence for the other species in the BLAST table, assuming the mutation rate is constant.
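The divergence-time estimate in the last step is a two-line calculation, sketched below in Python. The 10.3 million years comes from the step above; the two change counts are hypothetical placeholders that you should replace with the numbers from your own data table. The sketch follows the simple assumption stated above, that changes accumulate at a constant rate, and the function names are hypothetical.

```python
def changes_per_million_years(changes, divergence_mya):
    """Rate of protein sequence changes, calibrated on a pair with a known split time."""
    return changes / divergence_mya


def estimate_divergence_mya(changes, rate):
    """Estimated time since divergence, assuming a constant rate of change."""
    if rate <= 0:
        raise ValueError("rate must be positive; the calibration pair needs at least one change")
    return changes / rate


CALIBRATION_MYA = 10.3   # blue whale / fin whale split, from the step above
calibration_changes = 5  # hypothetical count; use your own blue whale vs. fin whale value
other_changes = 19       # hypothetical count for another species in your table

rate = changes_per_million_years(calibration_changes, CALIBRATION_MYA)
estimate = estimate_divergence_mya(other_changes, rate)
print(round(rate, 3), "changes per million years")
print(round(estimate, 1), "million years since divergence")
```

If the calibration pair shows very few changes, the estimated rate is imprecise, so every date calculated from it should be treated as a rough estimate.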
Repeat the procedure with a different protein sequence from the blue whale mitochondrial genome.
- Do you get the same tree? Is the apparent mutation rate the same for different proteins?
- Repeat the procedure using the DNA sequence for the genes. This allows you to use genetic changes that occurred in the DNA, but that did not affect the protein sequence.
- Increase the number of BLAST hits to include species that are less related (increase "Max target sequences" under "Algorithm parameters").
- What is the % difference between blue whale and Homo sapiens' cox1 protein sequences?
- Living cetaceans are subdivided into two highly distinct suborders, Odontoceti (the echolocating toothed whales) and Mysticeti (the filter-feeding baleen whales). Are there specific mutations that distinguish the Odontoceti from the Mysticeti in your BLASTs?
- Download the protein sequences of the BLAST hits using the "Get selected sequences" button, then use tools at the European Bioinformatics Institute (EBI), such as ClustalW, to make an evolutionary tree. ClustalW compares every sequence to all of the others.
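For the variation that asks about mutations distinguishing Odontoceti from Mysticeti, one way to search is to look for alignment positions where every sequence in one group has the same amino acid and every sequence in the other group has a different one. The Python sketch below illustrates this with made-up aligned fragments; it assumes you have already aligned your downloaded sequences to the same length, and the function name is hypothetical.

```python
def diagnostic_positions(group_a, group_b):
    """Return (position, residue_a, residue_b) where each group is uniform but the groups differ.

    Positions are 1-based columns of the alignment.
    """
    length = len(group_a[0])
    if any(len(seq) != length for seq in group_a + group_b):
        raise ValueError("all sequences must be aligned to the same length")
    hits = []
    for i in range(length):
        residues_a = {seq[i] for seq in group_a}
        residues_b = {seq[i] for seq in group_b}
        if len(residues_a) == 1 and len(residues_b) == 1 and residues_a != residues_b:
            hits.append((i + 1, residues_a.pop(), residues_b.pop()))
    return hits


# Made-up aligned fragments standing in for the two suborders (not real data).
toothed_whales = ["MACDEFGHIK", "MACDEFGHLK"]
baleen_whales = ["MACNEFGHIK", "MACNEFGHIR"]

print(diagnostic_positions(toothed_whales, baleen_whales))  # [(4, 'D', 'N')]
```

Positions that vary within a group (columns 9 and 10 in this toy example) are skipped, because they cannot cleanly separate the two groups.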