Objective
The goal of this science project is to compare a protein's amino acid sequence in several animals to infer how closely related they are to one another. You have access to the same online genomics databases and software that scientists use every day. You can select a few animals you are interested in and use information you find online to identify a protein common to all of them. Then you can use an online software tool to line those proteins up and compare the sequences. The more similar the amino acid sequences of the protein, the more closely related the two species are likely to be! In doing this project, you will learn how to make use of the tremendous amount of DNA and protein sequence data that is available online and how to use publicly available software to analyze and study these sequences. You will also learn about using DNA or protein sequences to understand evolution and how different organisms are related to one another.
Introduction
[Note: The first part of this introduction is the same as for the Tree of Life - I (basic); however, you will find that the rest of this project idea is much more advanced.]
Think about or draw out your family tree, adding aunts, uncles, and cousins. (If you don't have siblings or cousins, just draw a big family tree from your imagination.) Based on your family tree, you can see that you are more closely related to your sister (or brother) than you are to your cousin; that is, there are fewer "branches" separating you from your sister than there are separating you from your cousin.
Now imagine that a biologist arrived at a big family reunion and had no idea who was a sister, cousin, aunt, uncle, etc. but tried to sort it out by how all of you look. Just based on how you look, would s/he be able to guess which of the two kids standing next to you is your sister and which is your cousin? In many families, the biologist may be able to make a pretty good guess based on your visible features (called your morphology), like number of arms/legs/eyes, hair color, nose shape, etc. (Notice that some of these morphological features are shared by all humans but that other features can be used to distinguish you from one another.) But this is not a failsafe approach to determining familial relationships -- some people look more like their cousin than their sister, right? With morphology alone, the best you can do is make a good guess.
So what is the best way to determine how related you are to one another (besides just asking -- but stick with me here)? The biologist would have to look at your DNA! You get half of your DNA from your mother and half from your father. Both of those "halves" are very similar to one another -- with one difference about every 1000 base pairs (but out of three billion total letters -- that's three million differences!). And your mother and father got their DNA from their parents and so on up the family tree. Your DNA should be MUCH more similar to your sister's than your cousin's because you and your sister both got your DNA from the same parents, whereas there are many more branches in the tree (and thus many more matings and DNA base pair differences entering the tree) between you and your cousin. That is, you are much more similar genetically to your sister because you have more recent common ancestors than you and your cousin.
Family Trees In Biology
So how does all of this apply to biology? For centuries, scientists have been trying to draw the family tree that reflects the history and evolution of all animals on the earth. This tree would show which species are more closely related to one another, like the case where you are "closer" to your sister on your family tree than you are to your cousin. For example, humans are more closely related to chimpanzees than to dolphins, so chimps and humans would have fewer branches between them on the "animal family tree."
How do scientists make this family tree? For many years, scientists relied on comparisons of morphological characteristics (like hair, teeth, limbs, fins, hearts, livers, eyes, etc.) to try to figure out who was more closely related to whom. These kinds of comparisons are often accurate, but as you saw in the example of a human family, these physical characteristics can sometimes be misleading. One sign of this is that different scientists have come up with different trees/relationships by using different sets of morphological information! So which tree is "right?"
To think about how to identify the "right" tree, we have to think about how these animals became different from one another throughout evolution. All heritable morphological changes (those changes that can be passed down to the next generation) are a result of changes (mutations) in an organism's DNA. A mutation can lead to a change in a protein sequence or a change in when, where or how much of the protein gets made. That's it! One or a couple of these changes can lead to a big difference in morphology and/or the way a single cell in the organism can function. So over billions of years of evolution, a slow accumulation of DNA sequence (and thus some protein sequence) changes has led to the existence of all of the earth's different species -- with some more closely related to one another than others. This whole process is called molecular evolution.
So, as we saw with the family reunion example, the best way to see how related two organisms are is to compare their DNA or protein sequences. (Remember that a protein's sequence is encoded in its gene's DNA - so the only way to get a protein sequence change is to get a change in the DNA that codes for it.) Those organisms with the most similar DNA/protein sequence are almost surely more closely related than those with less similar DNA/protein sequences.
Why didn't scientists use DNA sequences to build the trees 100 years ago? First, it has only been about 50 years since the discovery that DNA is actually the genetic material that gets passed on through generations. Second, DNA and protein sequencing technologies have only recently gotten efficient enough that DNA/protein sequence data is available from many different kinds of animals. With all of this new information, scientists are working hard to build the "true" animal family tree. And there have been cases where the tree built using DNA sequence data differs from those built using morphological data! (Can you explain for your project why DNA sequence is the "gold standard" for determining relatedness between animals?)
Note: Even though sequence comparison is the gold standard, it is not perfect. Sometimes comparisons of different proteins will yield different trees. Which one is right? Why might this happen?
Your Project
Your goal in this project is to locate an "old" evolutionary tree (what we've been calling the animal family tree) that was built based on morphological (or non-DNA sequence based) data. You may be able to find this kind of tree in an older biology textbook or possibly online. You can then use this tree to form a hypothesis about what you expect to find if you compare a particular protein's sequence between some species on the tree. Do you think you will confirm the relationships/relatedness implied by the old tree or do you think the sequence-based comparison will yield different relationships? Why did you form this hypothesis? (A really good project would follow this protocol for at least a few different proteins and see if you get the same results for each.)
First, you can find a protein in humans that interests you. (It is a little simpler to do protein comparisons for tree building than DNA sequences, but you are welcome to try either one.) Then you can go to a public database called "Genbank" where scientists from all over the world deposit DNA and protein sequences that they discover. From Genbank, you can find the human sequence of the protein you have chosen. Next, you can ask a computer program called BLAST to search all of the protein sequence databases available to find the most similar protein sequence in other organisms (the same gene or protein found in a different species is called a homolog). Finally, you can compare the amino acid sequences of the protein from the different species you found to draw your own version of the animal family tree (containing only the animals used in your study). This comparison is done by making an "alignment" of all of the proteins you are comparing and then "counting" the differences. Do the relationships shown in your tree agree with where your animals appear on the old tree? Can you think of an explanation for whatever outcome you get?
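The "align, then count the differences" step can be sketched in a few lines of Python. This is only a minimal illustration, not the method used by any particular alignment tool: the four sequences are made-up toy data named species_A to species_D, they are assumed to be already aligned (equal length), and the function names are hypothetical. The grouping rule used here, repeatedly joining the two groups with the smallest average number of differences, is one simple way to turn counts into a tree; real tree-building software uses more careful methods.

```python
from itertools import combinations

# Hypothetical, already-aligned toy sequences (made up for illustration;
# they are NOT real protein data). All must have the same length.
ALIGNED = {
    "species_A": "MGLSDGEWQL",
    "species_B": "MGLSDGEWQV",
    "species_C": "MGLTDAEWKV",
    "species_D": "MALTDAEFKV",
}


def count_differences(seq1, seq2):
    """Count the aligned positions at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    return sum(a != b for a, b in zip(seq1, seq2))


def build_tree(aligned):
    """Group sequences by repeatedly joining the two closest groups.

    Returns nested tuples, e.g. (("x", "y"), "z") means x and y are
    each other's closest relatives and z joins them afterwards.
    """
    # Difference count for every pair of named sequences.
    diffs = {
        frozenset((a, b)): count_differences(aligned[a], aligned[b])
        for a, b in combinations(aligned, 2)
    }
    # Each cluster is (tree so far, list of member names).
    clusters = [(name, [name]) for name in aligned]

    def average_distance(c1, c2):
        total = sum(diffs[frozenset((a, b))] for a in c1[1] for b in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average difference.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


print(build_tree(ALIGNED))
```

With this toy data, species_A and species_B differ at one position and species_C and species_D at two, so the sketch pairs A with B and C with D before joining the two pairs. Note that this simple count treats a gap character the same as any other mismatch.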
Terms, Concepts and Questions to Start Background Research
Bibliography
Background knowledge/info
Choosing a gene/protein to study
Finding the DNA/mRNA/protein sequence for that gene
Finding homologs in other species
Multiple sequence (DNA/RNA/protein) alignment tools
General DNA/Protein analysis tools
Tutorial On Which This Idea Is Based
Experimental Procedure
Tips for Searching Genbank:
* For making a better query, try asking for both the name of the protein and the organism from which you would like that sequence. (For example, try entering "myoglobin homo sapiens" to look for the human version of the myoglobin protein sequence.)
* When sorting through the results, many entry titles are followed by two Latin words in parentheses such as (Homo sapiens). These words represent the scientific name of the animal from which the sequence came. Some animals are used very often in genetic research and are called "model organisms." You are very likely to run across sequences from these organisms. See below for a table of some of the commonly used model organisms and their scientific names.
The genomes of many other animals are being sequenced (mainly) for sequence comparison studies such as the one you are performing. Here are a few of these types of organisms you might run across. If the name you find is not on this list, just do a web search and see if you can figure out to which animal the name refers!
* These queries are very literal and just look for your search words anywhere in the title or description of the sequence. So if you enter "myoglobin," you may get a hit to a protein described as a "myoglobin binding protein" or "similar to myoglobin," neither of which is what you are looking for. You can get around this by using the "Limits" option: Click on the word "Limits" below the text box where you enter your gene or protein name. In the new window, go to the pull down menu that says "All Fields" and change it to "Title." This way, your search term has to appear in the title of the database entry. Now, if you press "Go" to redo your search, you should see a checked box next to the term "Limits." You still may get multiple hits, so just keep scrolling through the results until you find the myoglobin protein itself.
* Some sequence entries represent fragments or pieces of a protein that have been important for one experiment or another. For this project, you need the whole sequence, so try to avoid using sequences that have the words partial, chain or subunit in them. If all of the entries seem to be incomplete, you may need to choose a different protein. Note: If you are doing a project that involves searching for DNA/mRNA sequences, some entries will be marked as either "partial cds" or "complete cds," where cds stands for "coding sequence." You should choose "complete cds" if such an entry can be found for your gene -- remember this point applies only to DNA/mRNA searches.
* Finally, what is the best way to refer to the "right" sequence when you find it? The first string of letters/numbers in the output of any search on NCBI is called the accession number. This may look something like NP_00555 or AAH14547. This is just a unique identifier for that sequence file in that database. So you could tell anyone else in the world you are using the sequence from accession number NP_00555 and they would be able to access the exact same record. If you want to get back to that file, you can just enter that accession number for your query and the proper file should come up every time.
* One other thing you might run into… What is a cDNA? A cDNA is just the sequence of a DNA string that would be COMPLEMENTARY (thus the "c") to an mRNA of interest. (RNA molecules are inherently unstable, so scientists often make a DNA copy of an RNA to make experiments easier.) So a cDNA sequence represents an mRNA found in a cell.
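The search tips above can be expressed as a small Python sketch. It only builds a query string and screens a list of entry titles; it does not contact any database. The [Title] and [Organism] field tags are an assumption about the search syntax (they stand in for the "Limits" menu described above), and the entry titles and function names are hypothetical examples invented for illustration.

```python
# Words that, per the tips above, suggest an entry is not the whole protein.
INCOMPLETE_WORDS = ("partial", "chain", "subunit")

# Phrases that match the search word but describe a different protein.
MISLEADING_PHRASES = ("binding protein", "similar to")


def build_query(protein, organism):
    """Combine a protein name and an organism into one search string.

    ASSUMPTION: the [Title] and [Organism] field tags are used here to
    mimic limiting the search to entry titles; check the database's
    help pages before relying on this exact syntax.
    """
    return f"{protein}[Title] AND {organism}[Organism]"


def looks_usable(title):
    """Return True if a title shows none of the warning words or phrases."""
    lowered = title.lower()
    flagged = INCOMPLETE_WORDS + MISLEADING_PHRASES
    return not any(phrase in lowered for phrase in flagged)


def keep_usable(titles):
    """Filter a list of entry titles down to the ones worth opening."""
    return [title for title in titles if looks_usable(title)]


# Hypothetical entry titles, invented for this example.
EXAMPLE_TITLES = [
    "myoglobin (Homo sapiens)",
    "myoglobin binding protein (Homo sapiens)",
    "similar to myoglobin (Mus musculus)",
    "myoglobin, partial (Homo sapiens)",
]

print(build_query("myoglobin", "homo sapiens"))
print(keep_usable(EXAMPLE_TITLES))
```

A word filter like this is only a first pass: it can wrongly discard a good entry whose title happens to contain one of the words, so it is still worth reading the titles yourself.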
Tips for Formatting Sequences:
* Are you looking at DNA or protein sequence? If the string of letters in your sequence is made up of a, t, g and c, then you are looking at a DNA or RNA sequence. If the string looks like "random" letters from the alphabet (like MQPLLG), then you are looking at a protein sequence. Each amino acid has been assigned a single letter code to represent it. (You can find this code online or in a textbook.)
* When copying your sequence of interest to save it or enter it into another program, there are a couple of ways you can do it. Whichever way you choose, it is probably easiest to "store" and manipulate the sequence in a word processing program. You can then save it as a "text" file so that it doesn't have any weird word processing formatting. (You may also be able to upload a text file into a DNA analysis program.)
* The first way to save the sequence is just using the string of sequence alone. From the accession number file, you can just copy and paste the letters onto a word processing document. Then just take out any of the numbers that were copied over. You don't need to worry about spaces and returns because the programs you will be using are trained to ignore them and only read the "letters" in the file. Then, you can just copy and paste this saved text into the program you are using.
* Most programs that analyze DNA/mRNA/protein sequences require them to be in a certain format. Thus, when you are saving the sequences that you plan to use in your later analysis, you should save them in the conventional format. The most common format is called "FASTA" format. FASTA files are structured with the ">" symbol followed by the name of the sequence (like myoglobin), then a return and the text of the sequence. Most of the programs you use will recognize a FASTA file and know that what comes after the > and before the first return is the "name" of the sequence. Here is an example of a FASTA format sequence:
>Myoglobin protein
mglsdgewql vlnvwgkvea dipghgqevl irlfkghpet lekfdkfkhl ksedemkase dlkkhgatvl talggilkkk ghheaeikpl aqshatkhki pvkylefise ciiqvlqskh pgdfgadaqg amnkalelfr kdmasnykel gfqg
* The simplest way to get your sequence into FASTA format is to use the FASTA format display option in Genbank. Once you're on the page containing the DNA or protein sequence that you want, you can get your sequence into this format by clicking on the pull down menu next to the box with the word "Display" in it and changing it from "default" to FASTA, then clicking the "Display" button. You can then either cut and paste the formatted sequence into a text file, or use the pull down menu next to the "Send To" button and change it to "File." If you then click on the button, it will download the sequence on your screen to the file name you specify.
* Note that you don't need to worry about spaces and returns in most sequence analysis programs because they ignore them and only read the "letters" in the file.
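The formatting tips above (strip the copied numbers and spaces, put a ">" name line first, tell DNA from protein by its letters) can be sketched in Python as follows. This is a minimal illustration rather than a full FASTA implementation: the function names are hypothetical, the 60-letter line width is an assumed convention, and treating "u" as a nucleotide letter (for RNA) is an addition to the a, t, g, c rule given above.

```python
def clean_sequence(raw):
    """Keep only the letters, dropping digits, spaces and line breaks."""
    return "".join(ch for ch in raw if ch.isalpha())


def to_fasta(name, raw_sequence, width=60):
    """Format a name and a raw copied sequence as one FASTA record.

    ASSUMPTION: 60 letters per line is used here as a common habit;
    the tips above do not require any particular line width.
    """
    seq = clean_sequence(raw_sequence)
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


def parse_fasta(text):
    """Return {name: sequence} for every record in FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Everything after ">" up to the line break is the name.
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += clean_sequence(line)
    return records


def guess_type(sequence):
    """Rough guess: only a/c/g/t/u letters means DNA or RNA."""
    letters = set(sequence.lower())
    if not letters:
        raise ValueError("empty sequence")
    return "DNA/RNA" if letters <= set("acgtu") else "protein"


# The start of the myoglobin example, as it might be copied with a
# leading position number still attached.
copied = "1 mglsdgewql vlnvwgkvea"
record = to_fasta("Myoglobin protein", copied)
print(record)
print(guess_type(parse_fasta(record)["Myoglobin protein"]))
```

The letter-based guess is only a heuristic: a very short protein fragment that happens to contain only the letters a, c, g and t would be mislabeled, so check the database entry when in doubt.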
Variations
Credits
| Author: | Shelley Force Aldred, Department of Genetics, Stanford University |
| Sponsor: | Molecular Sciences Institute (MSI), Berkeley, California |
| Management: | The Kenneth Lafferty Hess Family Charitable Foundation |
Last edit date: 2005-09-11 13:42:00
Copyright © 2002-2008 Kenneth Lafferty Hess Family Charitable Foundation. All rights reserved.
Reproduction of material from this website without written permission is strictly prohibited.