Abstract
The DNA in our cells contains our "blueprints," but it's the proteins in our cells that do most of the work. The Human Genome Project has allowed us to start reading the blueprints, but we still don't understand what most of the proteins do. This is a fairly advanced project that explores ways of identifying the function of unknown proteins.
Objective
The goal of this project is to learn how to uncover the functions of an unknown protein using computational methods. In doing so, you will learn about the connection between how a protein performs its job(s) and how its corresponding DNA sequence changes throughout evolution. You will also get a chance to explore computational methods for figuring out the specific functions of a protein, based only on its amino acid (protein) sequence. You will choose several proteins to study, and for each of these proteins, you will generate a hypothesis about which parts of the protein are likely to be most important to its function. This will be done by comparing the human version of the protein to other (non-human) versions of the protein. You will then use databases that identify signatures of common functional components of proteins and assess whether or not these analyses are consistent with your hypothesis.
Introduction
Proteins are the "building blocks of life" -- not only are many of the physical structures of cells made of proteins, but many of the tasks that are essential to life are carried out by proteins. Human beings have roughly 20,000 protein-coding genes, yet only a small fraction of the proteins they encode have been studied intensely enough to be well understood. The functions of most proteins -- found in humans and elsewhere -- remain largely unknown. The first step in figuring out the function of a protein is determining its amino acid sequence. The Human Genome Project (and other genome projects) has achieved this goal for a large number of proteins. Unfortunately, the amino acid sequence alone tells you very little about the function of a protein, and the next steps in figuring out a protein's function are much less clearly defined and very challenging. But because scientists stand to gain so much from understanding the functions of more proteins, a major focus of this generation's scientists has been, and continues to be, figuring out how to learn as much as possible about a protein from its amino acid sequence and its corresponding DNA sequence.
One of the first steps that scientists typically take in characterizing a newly discovered protein is to identify the parts of the protein that are essential to its function. This way, even if they aren't able to figure out what each of those parts does, they can at least focus their future studies on the most important parts of the protein. One way to identify important parts of a protein is to track which parts have remained relatively unchanged throughout evolution. If a random DNA mutation changes the amino acid that it codes for, the likelihood that this particular DNA change will be passed on to future generations depends on whether or not the changed amino acid destroys the function of the protein. If the change occurred in an important part of the protein and destroyed that protein's function, then the mutated protein is not likely to be passed on throughout evolution. Conversely, if the change occurred in an unimportant part of the protein and did not affect its function, then the changed sequence is much more likely to be passed on to future generations. Thus, when you compare DNA or protein sequences between two distantly related species, you would expect fewer differences in the "important" parts of the protein and an accumulation of more changes in less important parts. The term "conserved" is used to describe the DNA/RNA/amino acid sequences that remain relatively unchanged throughout evolution, and it is generally assumed that such conservation is an indication of functional importance.
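To make the idea of "conserved" positions concrete, here is a minimal Python sketch that scores each column of an alignment by how many sequences share the most common residue. It assumes the sequences have already been aligned (for example, with one of the multiple sequence alignment tools listed later), that gaps are written as "-", and that a simple "fraction identical" score is good enough for a first look. The function name column_conservation and the toy sequences are made up for illustration; real alignment tools use more sophisticated scoring.

```python
from collections import Counter

def column_conservation(aligned_seqs):
    """For each alignment column, return the fraction of sequences that
    share the most common residue. Gaps ("-") never count as a match."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*aligned_seqs):
        residues = [r.upper() for r in column if r != "-"]
        if not residues:
            scores.append(0.0)  # column made up entirely of gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores

# Hypothetical toy alignment (made-up sequences, not real proteins).
toy_alignment = [
    "MKLVHG-A",
    "MKLIHGTA",
    "MRLVHGSA",
    "MKLVHG-S",
]
scores = column_conservation(toy_alignment)
# 1-based positions where every sequence has the same residue.
conserved = [i + 1 for i, s in enumerate(scores) if s == 1.0]
print(conserved)
```

Columns with a score of 1.0 are identical in every sequence and are the first candidates for "important" regions; a run of several such columns in a row is more convincing than an isolated one.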
Scientists have learned a lot about the nature of proteins by comparing the same protein across many different species. One of the key findings has been that proteins are often modular -- that is, their major job is often accomplished by the combined action of several small functional units. These smaller functional units are called domains. Another key finding is that nature often recycles the same protein domain repeatedly in order to accomplish the same task in very different proteins. Because of this, it has been possible to build databases of sequences (signatures) associated with particular functions. These databases can in turn be used to help figure out the function of an unknown protein based only on its DNA, mRNA or amino acid (protein) sequence. By identifying the domains that a protein contains, it is often possible to figure out what the overall job of the protein is. For example, if a protein contains a DNA binding domain and an estrogen-response domain, it is likely that this protein's job is to control the levels of production of another protein in response to estrogen levels.
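As a rough illustration of how a sequence "signature" can be searched for, the Python sketch below translates a simple PROSITE-style pattern into a regular expression and reports where it matches. The example pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern (N, then any residue except P, then S or T, then any residue except P). The helper names and the test sequence are invented for this example, and the translator only handles the basic pattern syntax. Real domain databases generally rely on richer statistical models than a single pattern, so treat this only as a way to see the principle.

```python
import re

# One pattern element: x, a single letter, [ABC] or {ABC},
# optionally followed by a repeat count such as (2) or (2,4).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regex.
    Anchors and other extensions are not handled in this sketch."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            regex = "."                       # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"   # any residue except these
        else:
            regex = core                      # a letter or an [ABC] set
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)

def find_motif(sequence, pattern):
    """Return 1-based start positions of all matches, including overlaps."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [m.start() + 1 for m in regex.finditer(sequence.upper())]

# Hypothetical test sequence (made up for illustration).
hits = find_motif("MKNGSAANPTLNASG", "N-{P}-[ST]-{P}")
print(hits)
```

Short patterns like this one match by chance quite often, so a hit is a hint to follow up on, not proof that the site is functional.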
In this project you will use one set of tools to identify regions of a protein that are likely to be functionally important, and another set of tools to explore what those functions might be.
Keys:
Terms, Concepts and Questions to Start Background Research
Bibliography
Background knowledge/info
Choosing a gene/protein to study
Finding the DNA/mRNA/protein sequence for that gene
Finding homologs in other species
Multiple sequence (DNA/RNA/protein) alignment tools
Basic Protein Analysis Tools (protein domain detection)
Advanced Protein Analysis Tools (3D protein structure visualization software) *May require downloading and installing tools on your computer.
Additional resources
Experimental Procedure
Tips for Searching GenBank:
* For a better query, enter both the name of the protein and the organism from which you would like that sequence. (For example, try entering "myoglobin homo sapiens" to look for the human version of the myoglobin protein sequence.)
* When sorting through the results, many entry titles are followed by two Latin words in parentheses such as (Homo sapiens). These words represent the scientific name of the animal from which the sequence came. Some animals are used very often in genetic research and are called "model organisms." You are very likely to run across sequences from these organisms. See below for a table of some of the commonly used model organisms and their scientific names.
The genomes of many other animals are being sequenced (mainly) for sequence comparison studies such as the one you are performing. Here are a few of these types of organisms you might run across. If the name you find is not on this list, just do a web search and see if you can figure out to which animal the name refers!
* These queries are very literal and just look for your search words anywhere in the title or description of the sequence. So if you enter "myoglobin," you may get a hit to a protein described as a "myoglobin binding protein" or "similar to myoglobin," neither of which is what you are looking for. You can get around this by using the "Limits" option: Click on the word "Limits" below the text box where you enter your gene or protein name. In the new window, go to the pull down menu that says "All Fields" and change it to "Title." This way, your search term has to appear in the title of the database entry. Now, if you press "Go" to redo your search, you should see a checked box next to the term "Limits." You still may get multiple hits, so just keep scrolling through the results until you find the myoglobin protein itself.
* Some sequence entries represent fragments or pieces of a protein that have been important for one experiment or another. For this project, you need the whole sequence, so try to avoid using sequences that have the words partial, chain or subunit in them. If all of the entries seem to be incomplete, you may need to choose a different protein. Note: If you are doing a project that involves searching for DNA/mRNA sequences, some entries will be marked as either "partial cds" or "complete cds," where cds stands for "coding sequence." You should choose "complete cds" if such an entry can be found for your gene -- remember this point applies only to DNA/mRNA searches.
* Finally, what is the best way to refer to the "right" sequence when you find it? The first string of letters/numbers in the output of any search on NCBI is called the accession number. This may look something like NP_00555 or AAH14547. This is just a unique identifier for that sequence file in that database. So you could tell anyone else in the world you are using the sequence from accession number NP_00555 and they would be able to access the exact same record. If you want to get back to that file, you can just enter that accession number for your query and the proper file should come up every time.
* One other thing you might run into: What is a cDNA? A cDNA is just the sequence of a DNA string that would be COMPLEMENTARY (thus the "c") to an mRNA of interest. (RNA molecules are inherently unstable, so scientists often make a DNA copy of an RNA to make experiments easier.) So a cDNA sequence represents an mRNA found in a cell.
Tips for Formatting Sequences:
* Are you looking at DNA or protein sequence? If the string of letters in your sequence is made up of a, t, g and c, then you are looking at a DNA or RNA sequence. If the string looks like "random" letters from the alphabet (like MQPLLG), then you are looking at a protein sequence. Each amino acid has been assigned a single letter code to represent it. (You can find this code online or in a textbook.)
* When copying your sequence of interest to save it or enter it into another program, there are a couple of ways you can do it. Whichever way you choose, it is probably easiest to "store" and manipulate the sequence in a word processing program. You can then save it as a "text" file so that it doesn't have any weird word processing formatting. (You may also be able to upload a text file into a DNA analysis program.)
* The first way to save the sequence is just using the string of sequence alone. From the accession number file, you can just copy and paste the letters into a word processing document. Then just take out any of the numbers that were copied over. You don't need to worry about spaces and returns because the programs you will be using are designed to ignore them and only read the "letters" in the file. Then, you can just copy and paste this saved text into the program you are using.
* Most programs that analyze DNA/mRNA/protein sequences require them to be in a certain format. Thus, when you are saving the sequences that you plan to use in your later analysis, you should save them in the conventional format. The most common format is called "FASTA" format. FASTA files are structured with the ">" symbol followed by the name of the sequence (like myoglobin), then a return and the text of the sequence. Most of the programs you use will recognize a FASTA file and know that what comes after the > and before the first return is the "name" of the sequence. Here is an example of a FASTA format sequence:
>Myoglobin protein
mglsdgewql vlnvwgkvea dipghgqevl irlfkghpet lekfdkfkhl ksedemkase
dlkkhgatvl talggilkkk ghheaeikpl aqshatkhki pvkylefise ciiqvlqskh
pgdfgadaqg amnkalelfr kdmasnykel gfqg
* The simplest way to get your sequence into FASTA format is to use the FASTA format display option in GenBank. Once you're on the page containing the DNA or protein sequence that you want, you can get your sequence into this format by clicking on the pull down menu next to the box with the word "Display" in it and changing it from "default" to FASTA, then clicking the "Display" button. You can then either copy and paste the formatted sequence into a text file, or use the pull down menu next to the "Send To" button and change it to "File." If you then click on the button, it will download the sequence on your screen to the file name you specify.
* Note that you don't need to worry about spaces and returns in most sequence analysis programs because they ignore them and only read the "letters" in the file.
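If you would rather clean up and format sequences with a short script than by hand, the formatting tips above can be sketched in Python roughly as follows. This is only an illustration: the function names are made up, the 60-letter line width is a common convention rather than a requirement, and the DNA-versus-protein check is the same rule of thumb described in the tips (a short protein that happens to contain only the letters a, c, g and t would be misclassified).

```python
# Letters that can appear in a DNA or RNA sequence (N = unknown base).
NUCLEOTIDE_LETTERS = set("ACGTUN")

def clean_sequence(raw):
    """Keep only the letters, dropping the position numbers, spaces and
    returns that come along when copying from a database record."""
    return "".join(ch for ch in raw if ch.isalpha()).upper()

def guess_type(sequence):
    """Rule of thumb: only nucleotide letters means DNA/RNA, otherwise protein."""
    return "nucleotide" if set(sequence) <= NUCLEOTIDE_LETTERS else "protein"

def to_fasta(name, raw, width=60):
    """Return the sequence as FASTA text: a '>' name line, then the
    sequence split into lines of at most `width` letters."""
    seq = clean_sequence(raw)
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"

# The start of the myoglobin example, as it might look when pasted
# with position numbers and spaces still attached.
pasted = "1 mglsdgewql vlnvwgkvea dipghgqevl\n31 irlfkghpet"
print(guess_type(clean_sequence(pasted)))
print(to_fasta("Myoglobin protein (first 40 residues)", pasted))
```

Saving the returned text to a plain ".txt" or ".fasta" file gives you something you can upload or paste into most sequence analysis programs.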
Variations
Credits
Author: Chana Palmer, Department of Genetics, Stanford University
Sponsor: Molecular Sciences Institute (MSI), Berkeley, California
Management: The Kenneth Lafferty Hess Family Charitable Foundation
Last edit date: 2006-01-12 18:59:00
If you like this project, you might enjoy exploring careers in Genetics & Genomics.
Genetic Counselor: Many decisions regarding a person's health depend on knowing the patient's genetic risk of having a disease. Genetic counselors help assess those risks, explain them to patients, and counsel individuals and families about their options.