Bioinformatics - The Perfect Marriage of Computer Science & Medicine
|Time Required|Average (6-10 days)|
|Prerequisites|A good knowledge of basic concepts in genetics and good computer database searching skills|
|Material Availability|Readily available|
|Cost|Very Low (under $20)|
Abstract
Find out the real explanation for why your parents are so weird! Here is a science project that lets you explore the internet to find out why your "DNA blueprint" is so important to health and disease. In this science project you will use methods that bioinformatics and biotech scientists perform on a daily basis to decipher the human genome in their efforts to diagnose and treat genetic diseases.
Use publicly available web-based bioinformatics resources to search for a disease of interest and SNPs, single-nucleotide polymorphisms, associated with that disease.
Science Buddies would like to thank the following volunteers from Schering-Plough who contributed towards writing this project:
- Beth Basham, Ph.D.
- Melissa Bilardello, B.S.
- Sarah Bodary, Ph.D.
- Jamie Furneisen, M.S.
- Jennifer Louten, Ph.D.
- Sheela Mohan-Peterson, J.D.
- Venkataraman Sriram, Ph.D.
Edited by Sara Agee, Ph.D., and Teisha Rowland, Ph.D., Science Buddies
Last edit date: 2017-07-28
Biomedical Informatics is a broad discipline that encompasses bioinformatics and computational biology. Online bioinformatics resources, such as the database Online Mendelian Inheritance in Man, or OMIM, allow bioscience researchers to search up-to-date information on human genes, genetic traits and disorders. This project will take you through the step-by-step process of researching a specific disease of interest and how a single base change in one's DNA could be associated with that disease. This project should take approximately one week to complete.
Scientists are on a constant quest to improve the quality and length of human life. DNA, the blueprint of life, holds hidden clues for this quest. Identifying these clues is like "finding a needle in a haystack." The "haystack" for this project is the set of public bioinformatics databases, such as OMIM, which contain a multitude of genetic information, and the "needle" is the SNP (pronounced "snip"), or single-nucleotide polymorphism.
A single-nucleotide polymorphism, or SNP, is a small genetic change, or variation, that can occur within a person's DNA sequence. SNPs are the most frequent type of DNA variation found in the human population. These variations can be used to study and track inheritance in families. Although more than 99% of human DNA sequence is the same across the population, small variations in DNA sequence, such as SNPs, can have a major impact on how humans respond to disease, environmental factors, and medicines. Interestingly, SNPs are evolutionarily stable, meaning they do not change much from generation to generation. For these reasons, SNPs are of great interest and value for biomedical research, and SNP data increasingly influence the development of pharmaceutical products and medical diagnostics.
The cartoon depiction of a SNP in Figure 1, below, shows how DNA strand 1 differs from DNA strand 2 at a single base-pair location (a C/T polymorphism).
Figure 1. Here you can see a single nucleotide polymorphism, or SNP, that results in a small genetic change between sequence 1 and 2 (Wikipedia contributors, n.d.).
DNA, or deoxyribonucleic acid, supplies a set of instructions for each living organism. Nearly every cell in an organism contains a complete copy of its DNA. Genes are sets of nucleotide sequences encoded and stored in DNA. Each gene encodes a certain protein. DNA is transcribed into mRNA, messenger ribonucleic acid, and then translated into protein. Proteins are defined by amino acid sequences. A single amino acid is encoded by three nucleotides called a codon. There are 64 possible codons and only 20 amino acids, as shown in Figure 2, below. Since there are only 20 amino acids, multiple codons encode the same amino acid. This is known as degeneracy of the genetic code. Because of this degeneracy, some SNPs do not result in changes in the protein sequence. This is called a synonymous change. If a SNP results in a change in the protein sequence, this is termed a non-synonymous change. Finding single nucleotide changes in the human genome may be like "finding a needle in a haystack," but bioinformatics resources make it possible to do just that.
Figure 2. This codon table shows how the genetic code is converted into a sequence of amino acids that make up a protein (image courtesy of Schering-Plough).
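To make the idea of degeneracy concrete, here is a minimal Python sketch that builds the standard genetic code and labels a single-base change within one codon as synonymous or non-synonymous. The function name `classify_snp` and the example codon `GTT` are illustrative assumptions and are not taken from any of the databases used in this project.

```python
# Standard genetic code for DNA codons, built from the conventional
# T, C, A, G ordering. "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_snp(codon, position, new_base):
    """Return (old_aa, new_aa, label) for a one-base change in a codon.

    position is 0, 1, or 2 (the index of the changed base in the codon).
    """
    mutated = codon[:position] + new_base + codon[position + 1:]
    old_aa, new_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
    label = "synonymous" if old_aa == new_aa else "non-synonymous"
    return old_aa, new_aa, label


# 64 codons, but only 20 distinct amino acids (stop codons excluded).
print(len(CODON_TABLE))                        # 64
print(len(set(CODON_TABLE.values()) - {"*"}))  # 20

# Hypothetical example codon: GTT encodes valine (V).
print(classify_snp("GTT", 2, "C"))  # ('V', 'V', 'synonymous')
print(classify_snp("GTT", 0, "A"))  # ('V', 'I', 'non-synonymous')
```

In this sketch, a change in the third base of `GTT` leaves valine unchanged, while a change in the first base swaps valine for isoleucine, the same kind of amino acid change written as "Val11Ile" later in the procedure.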
Variations in the DNA sequences of humans can affect how humans develop diseases and respond to medicines. While most SNPs do not directly cause disease, they can help determine the likelihood that someone will develop a particular disease. Computational biology, the process of analyzing and interpreting data, is combined with bioinformatics in the technique called database mining: the process of generating hypotheses regarding the function or structure of a gene or protein of interest by identifying similar or dissimilar sequences in DNA. With the completion of the Human Genome Project in 2003, vast amounts of genomic data have been made available for database mining. The International HapMap Project was designed to provide researchers with the HapMap, a catalog of common genetic variants that occur in human beings, along with a description of the variants and where they are located in our DNA. This catalog provides information that researchers need to link genetic variants to the risk for specific illnesses.
How do scientists use computers to mine biological data and study the association between genetics and disease? In this science project, you will use the World Wide Web to access free bioinformatics resources to search for a disease of interest, identify SNP(s) associated with that disease, and make a hypothesis regarding the effect of the SNP(s). These public databases provide a vault of information that can be searched in many ways. We have provided one example; however, you may use your own method. With the availability of millions of SNPs, scientists now believe that exciting advances in medicine are in our near future. It is now your turn to mine databases for SNPs and, based on your research, make a hypothesis about their effect on the human phenotype.
Terms and Concepts
To do this type of experiment you should know what the following terms mean. Have an adult help you search the internet, or take you to your local library to find out more!
- Allele - Alleles are different forms of the same gene. Allelic variation in a gene arises through mutation of the DNA sequence defining the gene and may or may not be associated with trait variation (e.g., height, eye color).
- Amino acid - The molecules that make up a protein. Each codon encodes for a specific amino acid.
- Bioinformatics - The collection, classification, storage, and analysis of large volumes of biological information (e.g., genomic, metabolomic, proteomic) using computers.
- Codon - Three bases in a DNA or RNA sequence which specify a single amino acid.
- DNA (Deoxyribonucleic Acid) - DNA is the chemical that forms a basic molecular code for how a living being should operate. DNA is the biological heredity material passed down from parent to child. Four bases called adenine (A), guanine (G), cytosine (C), and thymine (T) constitute DNA. It is present in the nucleus of almost all cells in an organism.
- Exons - Exons are sequences of DNA that code information for protein synthesis that are transcribed to messenger RNA, which in turn are translated into at least a portion of a protein.
- Gene - DNA sequences that contain a code that can be translated into a particular protein. For example, CFTR gene has the information that is necessary for a cell to make the CFTR protein.
- Genome - The complete DNA sequence of an organism's chromosomes (e.g., the human genome).
- Genomics - The study of entire genomes, such as the human genome. Genomics explores not only the actions of single genes, but also the interactions of multiple genes with each other and with the environment.
- Genotype - People inherit one allele for a gene from each parent such that they have two copies of each gene. The pair of alleles defines a person's genotype. For a gene that has two alleles in the population (e.g., an A allele and a G allele), there are three possible genotypes—AA, AG, and GG.
- HapMap - A partnership of scientists and funding agencies from Canada, China, Japan, Nigeria, the United Kingdom, and the United States to develop a public resource that will help researchers find genes associated with human disease and response to pharmaceuticals. HapMap's goal is to ultimately develop a haplotype map of the human genome and identify haplotype blocks.
- Homozygous - A genotype in which the two copies of the gene that determine a particular trait are the same.
- Heterozygous - Possessing two different forms of a particular gene, one inherited from each parent.
- Introns - Introns are segments of a gene situated between exons that are removed before translation of messenger RNA and do not function in coding for proteins or protein fragments.
- Junk DNA/Non-Coding Region - A region of the genome where the DNA has no known function (i.e., it does not code for a protein, regulatory sequence, or other functional elements). These regions usually consist of repeating DNA sequences. The majority of the human genome has no known function; only 2 percent to 5 percent of the DNA sequence codes for genes.
- Locus - The position of a gene on a chromosome. This term is a classical genetic concept used to understand gene order, gene distance, and gene function before gene and genomic DNA sequences were known.
- mRNA - Messenger RNA, or a single-stranded molecule of ribonucleic acid that is transcribed from the DNA and then translated into protein.
- Mutation - A mutation is a change in a DNA sequence. If the mutation occurs during the development of an egg or a sperm (i.e., gametes), then it becomes a heritable mutation. If the mutation occurs in any other body cell (i.e., part of the soma), then it is called a somatic mutation and it is not heritable. Somatic mutations are a cause of cancer. Mutations can be of many different types: substitutions, deletions, or insertions. Mutations in the DNA can be synonymous, i.e., not changing the amino acid sequence of the translated protein, or non-synonymous, causing amino acid changes in the translated protein.
- Non-synonymous change - A SNP that results in a change in the protein sequence.
- RNA (Ribonucleic Acid) - A chemical found in the nucleus and cytoplasm of cells; it carries the protein-coding instructions of DNA in a form that the protein-building ribosomes of a cell can read. The chemical structure of RNA is similar to that of DNA: RNA also contains adenine (A), guanine (G), and cytosine (C), but instead of thymine (T), RNA contains uracil (U).
- SNPs (Single Nucleotide Polymorphisms) - Currently, there are estimated to be about 6 million positions in the human genome where a mutation occurred at a single nucleotide (A, T, C, or G) and both of its alleles are now present in more than 1 percent of the population. These SNPs are important for studies of genetic or genomic associations with disease because the alleles are common in the population.
- Synonymous change - A SNP that does not result in a change in the protein sequence. It is also known as a silent change.
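The genotype, homozygous, and heterozygous entries above can be illustrated with a short Python sketch. It lists the possible genotypes for a gene with two alleles and labels each one; the function names `possible_genotypes` and `zygosity` are hypothetical and chosen only for this illustration.

```python
from itertools import combinations_with_replacement


def possible_genotypes(alleles):
    """List every unordered pair of alleles, e.g. 'AG' -> ['AA', 'AG', 'GG']."""
    return ["".join(pair)
            for pair in combinations_with_replacement(sorted(alleles), 2)]


def zygosity(genotype):
    """Label a two-letter genotype as homozygous or heterozygous."""
    return "homozygous" if genotype[0] == genotype[1] else "heterozygous"


for genotype in possible_genotypes("AG"):
    print(genotype, zygosity(genotype))
# AA homozygous
# AG heterozygous
# GG homozygous
```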
This science project is based on research that provides often inconclusive but strongly correlative evidence associating SNPs with risk of disease. The notion is that, with the availability of information about the complete human genome, we would be able to predict the risk of an individual developing a disease or identify individuals with specific qualitative traits ('smart' genes, 'criminal' genes, 'intuition' genes, etc.). One outcome of such an advance would be personalized medicine, where it is possible to treat each individual with a custom-made drug or even perform preventive therapy. However, on the flip side, ethical concerns need to be addressed with respect to individual human rights (the debate raised by the movie Minority Report).
Here are some questions that you will be thinking about while doing this science project:
- What is FASTA format?
- If two genes are homologous, are they similar?
- What are different types of mutations and how do they affect protein function?
- What is the probability of a single base mutation affecting protein function?
Here are some useful resources on SNPs and genetics that may help you complete this project:
- University of Utah, Health Sciences. (n.d.). Making SNPs Make Sense. Learn.Genetics. Genetic Science Learning Center. Retrieved October 8, 2014, from http://learn.genetics.utah.edu/content/pharma/snips/
- Oak Ridge National Laboratory (ORNL). (February 2011). The Gene Gateway Workbook. U.S. Department of Energy Office of Biological and Environmental Research. Retrieved October 8, 2014, from http://www.ornl.gov/sci/techresources/Human_Genome/posters/chromosome/ggworkbook1.pdf
- Moult, Y., Melamud, Z. (2005). SNPs3D. Retrieved October 8, 2014, from http://www.snps3d.org/
Here are some useful textbooks with background information for you to review:
- Wood, E.J., Smith, C.A., Pickering, W.R. (eds), 1997. Life Chemistry and Molecular Biology, Portland, OR: Portland Press.
- Drlica, K., 1996. Understanding DNA and Gene Cloning: A Guide for the Curious, 3rd Ed., New York, NY: John Wiley & Sons.
Materials and Equipment
- Computer with Internet access
- Lab notebook
Searching for your disease
The OMIM database in NCBI is a catalog of human genes and genetic disorders authored and edited by Dr. Victor A. McKusick and his colleagues at Johns Hopkins and elsewhere, and developed for the World Wide Web by NCBI, the National Center for Biotechnology Information. The database contains textual information and references. It also contains copious links to MEDLINE and sequence records in the Entrez system, and links to additional related resources at NCBI and elsewhere.
- Search for the disease of your choice in the OMIM database (http://www.ncbi.nlm.nih.gov/omim), shown in Figure 3, below.
- For the purpose of simplifying directions, cystic fibrosis will be used as an example.
Figure 3. The OMIM database has information on human genes and genetic disorders.
- Clinical results associated with your disease will be listed, as shown in Figure 4, below. These results will include genes as well as descriptions of related medical conditions. Click on the different results to see what they are. Find a result that is a gene and continue on to step 3.
- Tip: Results that are genes will list a "HGNC Approved Gene Symbol" near the top of the webpage, as shown in Figure 5, below.
Figure 4. When you search for a disease in the OMIM database, you will get many clinical results associated with the disease.
- The gene webpage will have lots of information on the disease-related gene, as shown in Figure 5, below. In your lab notebook, write down the "HGNC Approved Gene Symbol" for your gene.
- For example, for the cystic fibrosis gene shown in Figure 5, below, this would be CFTR.
Figure 5. Clicking on a gene result will bring you to a webpage with lots of information on that specific gene.
- On the right side of the gene webpage, under "External Links for Entry," click on the "Gene Info" heading, highlighted in orange in Figure 6, below. Then click on "NCBI Gene" from the dropdown menu. This will take you to the NCBI Gene database, which has additional information on the gene, as shown in Figure 7, below.
Figure 6. Clicking on "Gene Info," highlighted in orange here, will show you other databases with additional information on the gene. Clicking on "NCBI Gene" in this dropdown menu will take you to the NCBI Gene database.
Figure 7. The NCBI Gene database, shown here, will have additional information on the gene you found using the OMIM database.
- On the right side of the gene webpage, under "Related information," scroll down until you find "SNP: GeneView" (as shown in Figure 8, below) and click on it. This will bring you to a webpage that lists all of the SNPs associated with the gene you chose above, as shown in Figure 9, below.
Figure 8. Clicking on "SNP: GeneView," highlighted in orange here, will take you to a webpage on the SNPs associated with the gene you chose.
Figure 9. The SNP database (dbSNP) will list all of the SNPs associated with a gene of interest.
- To navigate the large table of SNPs on this webpage (shown on the lower half of Figure 9, above), here is a guide for the different columns of information presented:
- Region: Location on gene where SNP is found.
- Chr. Position: Location on chromosome where SNP is found.
- mRNA Position: Coordinates for where the SNP is found in the mRNA.
- dbSNP rs# cluster id: Unique identifier in the database.
- Heterozygosity: Measure of genetic variation in a population.
- Validation: Other sources of information supporting this SNP.
- 3D: SNP mapped on a 3D structure
- Function: The type of mutation the SNP causes.
- dbSNP allele: The nucleotide allele observed at the SNP position.
- Protein residue: Amino acid change
- Codon pos: The position in the codon that is changed.
- Amino acid position: The position of the amino acid that is changed within the protein sequence.
- PubMed: Links to related scientific articles on PubMed.
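As a rough illustration of what the "Heterozygosity" column measures, the sketch below computes expected heterozygosity from allele frequencies using the common textbook formula (one minus the sum of squared allele frequencies). This formula and the function name `expected_heterozygosity` are assumptions made for illustration; the exact calculation used by dbSNP is not described in this project.

```python
def expected_heterozygosity(allele_frequencies):
    """Expected fraction of heterozygous individuals, assuming random mating.

    allele_frequencies: frequencies of every allele at one position;
    they must sum to 1.
    """
    if abs(sum(allele_frequencies) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_frequencies)


# Two equally common alleles give the highest value for a two-allele SNP.
print(expected_heterozygosity([0.5, 0.5]))  # 0.5

# A rare allele (10 percent) gives a lower value, close to 0.18.
print(round(expected_heterozygosity([0.9, 0.1]), 4))
```

A value near zero means almost everyone carries the same allele, whereas a value near 0.5 means both alleles of a two-allele SNP are common in the population.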
- Find a SNP that has its "Function" listed as "missense." Click on the "dbSNP rs# cluster id" given for that SNP, as shown circled in yellow in Figure 9, above.
- This will take you to a webpage with information on the SNP, such as the sequence it occurs in, the location of the mutation, and other resources, as shown in Figure 10, below. Scroll down to view and investigate all of the available information.
Figure 10. Clicking on a SNP "dbSNP rs# cluster id" will take you to a webpage with a lot of information on the SNP. Explore the webpage to find out more about the SNP.
- In your lab notebook, write down the SNP rsID. This should be at the top of the webpage, after "Reference SNP (refSNP) Cluster Report," as shown circled in green in Figure 10, above.
- For example, in Figure 10, above, the rsID is rs1800072.
- In your lab notebook, also write down the name that appears under "HGVS Names" that starts with "NP_000483," as shown circled in red in Figure 10, above.
- The right part of this name should include the amino acid change and the location of the change, for example, "Val11Ile."
- To see if your SNP has an impact on protein structure and function, go to the SNPs3D website and search for the gene associated with your disease of interest, as shown in Figure 11, below: www.snps3d.org.
Figure 11. The SNPs3D website has information on the impact of a SNP on protein structure and function, as well as other SNP-related information.
- You should see a gene webpage similar to Figure 12, below.
Figure 12. This is the SNPs3D webpage for a gene (CFTR, associated with cystic fibrosis).
- On the gene webpage, scroll down until you see a list of SNPs, as shown in Figure 13, below. Find your SNP of interest.
- You can find your SNP of interest by looking for its SNP rsID (under "snp_id") that you wrote down in your lab notebook.
- Alternatively, you can find your SNP by looking for the part of the SNP HGVS name you wrote down that has the amino acid change, but convert the three-letter amino acid name to its single-letter name. For example, if you had written "Val11Ile" then look for "V11I."
Figure 13. On the webpage for a gene on SNPs3D, scroll down to see a list of related SNPs. Each SNP will have numerical values that show how likely the SNP is to damage the protein. Negative red numbers indicate a likely damaging SNP, whereas blue numbers indicate a SNP that is unlikely to be damaging.
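The conversion from a three-letter amino acid change such as "Val11Ile" to the single-letter form "V11I" can also be done with a small script. The following Python sketch is one way to do it; the function names are hypothetical, and the regular expression assumes the simple "three letters, number, three letters" pattern shown in this project (optionally preceded by an accession and "p."), not every possible HGVS notation.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

CHANGE_PATTERN = re.compile(r"(?:p\.)?([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})")


def parse_protein_change(name):
    """Return (original_aa, position, new_aa) using one-letter codes."""
    change = name.split(":")[-1]  # drop an accession prefix, if present
    match = CHANGE_PATTERN.fullmatch(change)
    if match is None:
        raise ValueError(f"unrecognized protein change: {name!r}")
    original, position, new = match.groups()
    return THREE_TO_ONE[original], int(position), THREE_TO_ONE[new]


def to_single_letter(name):
    """Convert e.g. 'Val11Ile' or 'p.Val11Ile' to 'V11I'."""
    original, position, new = parse_protein_change(name)
    return f"{original}{position}{new}"


print(to_single_letter("Val11Ile"))      # V11I
print(parse_protein_change("Val11Ile"))  # ('V', 11, 'I')
```

The position returned by `parse_protein_change` is the amino acid number you will need later when locating the SNP within a protein domain.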
- After finding your SNP of interest, see whether it is predicted to affect the structure and function of the protein, as shown in Figure 13, above. Specifically, if a SNP has negative red numbers, that SNP is predicted to have a damaging effect on the protein, but if your SNP has blue numbers, it is not predicted to be damaging, and the SNP is likely harmless. In other words, the higher the number, the less likely the mutation is to be damaging.
- Click on red or blue numbers (under "svm profile" or the "svm structure") to learn more about what these values mean.
- For some SNPs, there are models available that show where the SNP is in the 3D structure of the protein. If under the "model frequency role" column your SNP of interest has a triangular-shaped symbol, as shown in Figure 13, above, circled in green, you can click on it to see the 3D structure. This will take you to a webpage like Figure 14, below.
Figure 14. When viewing the 3D model of a SNP in a protein, red residues indicate a likely damaging SNP, whereas yellow residues indicate a SNP that is unlikely to be damaging.
- Now try to establish a sequence-structure-function relationship for your SNP. First, search for the GENE in OMIM by repeating steps 1–3 above.
- On the OMIM gene page for your gene of interest, under "External Links," click on "Protein" (immediately above "Gene Info," shown in Figures 5 and 6, above). Then click on "UniProt" from the dropdown menu. This will take you to the UniProtKB database, which has additional information on the protein your gene of interest encodes for, as shown in Figure 15, below.
Figure 15. The UniProtKB database has additional information on a protein of interest.
- Click on the "Sequences" link on the left side of the UniProtKB webpage, as shown in Figure 15, above, highlighted in green. This will take you to the amino acid sequence for your protein of interest, as shown in Figure 16, below.
Figure 16. Clicking on "Sequences" on the UniProtKB webpage for a protein of interest will take you to the amino acid sequence of that protein, as shown here.
- Click on "FASTA" at the top of the "Sequences" section. Select the FASTA protein sequence using your mouse, as shown in Figure 17, below, and copy it.
Figure 17. The FASTA form of the protein sequence is a convenient form to use when comparing the sequence to other sequences.
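FASTA is a plain-text format in which each record starts with a header line beginning with ">" and is followed by one or more lines of sequence. The Python sketch below shows one simple way to read such text; the function name `parse_fasta` is hypothetical, and the record in the example is a made-up toy sequence, not a real UniProtKB entry.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy record for illustration only.
example = """>toy_protein example
MAAAVVVLLL
GGGSSSTTT
"""

for header, sequence in parse_fasta(example):
    print(header, len(sequence))
# toy_protein example 19
```

Because the sequence lines are joined into one string, amino acid number N is found at index N - 1 of the sequence.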
- Go to the SMART site, shown in Figure 18, below: http://smart.embl-heidelberg.de/smart/set_mode.cgi?NORMAL=1. Once there, paste the protein sequence into the "Protein sequence" box and click on "Sequence SMART."
Figure 18. The SMART website has information on the protein domains of different proteins.
- You should see a webpage that shows the domains of your protein of interest, as shown in Figure 19, below. Rolling your mouse over a domain will show you more information about the protein domain. Find the domain where your mutation is located. Hint: The name you wrote down in step 10, above, includes the sequence location of your SNP of interest. For example, if it said "Val11Ile" then the SNP is in amino acid 11.
- For help searching the literature, read the guide to Resources for Finding and Accessing Scientific Papers. You may also find you need some help from someone experienced with genetics or bioinformatics to read and understand the papers.
Figure 19. This SMART webpage shows the protein domains of the protein sequence that was searched.
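Once you have written down the start and end positions of each domain from the SMART results, checking which domain contains your SNP is a simple range comparison. The Python sketch below illustrates the idea; the domain names and coordinates are placeholders invented for this example and are not real SMART output for any protein.

```python
from typing import NamedTuple, Optional


class Domain(NamedTuple):
    name: str
    start: int  # first amino acid of the domain (1-based, inclusive)
    end: int    # last amino acid of the domain (1-based, inclusive)


def find_domain(domains, position) -> Optional[Domain]:
    """Return the first domain containing the amino acid position, if any."""
    for domain in domains:
        if domain.start <= position <= domain.end:
            return domain
    return None


# Placeholder domains: replace with the values you record from SMART.
domains = [Domain("domain_A", 1, 50), Domain("domain_B", 80, 200)]

print(find_domain(domains, 11))  # Domain(name='domain_A', start=1, end=50)
print(find_domain(domains, 60))  # None
```

A result of `None` would mean the SNP falls between annotated domains, which is itself worth noting in your lab notebook.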
- Read the description of the domain and the InterPro abstract, as shown in Figure 19, above, circled in red. Assuming that the SNP results in a mutation in this domain, what could be the biochemical effects of this mutation? How might these effects relate to the disease?
- You can search PubMed for articles on effects of the mutation, as shown in Figure 20, below: http://www.ncbi.nlm.nih.gov/pubmed/ Tip: Try searching for the name of the gene and the SNP, such as "CFTR, V11I."
Figure 20. PubMed is a searchable database of scientific publications.
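PubMed can also be searched from a script through NCBI's E-utilities service. The Python sketch below only builds the search URL for a gene and variant and does not contact the server. The E-utilities endpoint is not mentioned in this project and is included here as an assumption about one way to automate the search; the function name `pubmed_search_url` is hypothetical.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (an assumption; this project itself
# only uses the PubMed website).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(gene, variant, max_results=20):
    """Build an E-utilities URL that searches PubMed for a gene and variant."""
    params = {
        "db": "pubmed",
        "term": f"{gene} {variant}",
        "retmax": max_results,
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


print(pubmed_search_url("CFTR", "V11I"))
# https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=CFTR+V11I&retmax=20&retmode=json
```

Opening the printed URL in a browser should return a list of PubMed article identifiers matching the search terms, which you can then look up on the PubMed website.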
If you like this project, you might enjoy exploring these related careers:
Nanosystems Engineer: Imagine creating a new material, medicine, or electrical component that is too small to see. How would you design it? What could the new invention do? These are precisely the types of questions that nanosystems engineers answer every day. Nanosystems engineers design and build new technologies using the smallest building blocks, atoms, and molecules.
Genetic Counselor: Many decisions regarding a person's health depend on knowing the patient's genetic risk of having a disease. Genetic counselors help assess those risks, explain them to patients, and counsel individuals and families about their options.
- Variation 1: Environmental Factor - Gene Interaction:
Identify how certain environmental factors may affect genes and their association with diseases by using the Genetic Association Database (http://geneticassociationdb.nih.gov/). NOTE: This database is open-access and allows any user to input data. Use caution while using the data and only select data that has been endorsed by 'Gene Expert' or 'Disease Expert'.
- Click on the 'Environmental Factor Gene Interaction' link on the left menu of the website. On the top of the page, click on the link to see a complete list of environmental factors.
- Choose an environmental factor of interest (e.g., tobacco smoke) by clicking on it.
- You can see entries that describe gene association with specific diseases.
- Are you able to identify any SNPs in this category? Follow links to research more for each category.
- Variation 2: Multi-Species Association / Conserved SNPs:
Using the databases referenced in this project, try to identify gene mutations that are common to multiple species. If a mutation occurs in multiple species and can be matched with the same phenotype across those species, this lends support to your hypothesis. Highly conserved regions (across species) have an increased likelihood of being functionally important.
- For similar Science Buddies science project ideas that use SNPs and genetics, check out Drugs & Genetics: Why Do Some People Respond to Drugs Differently than Others?, A Prescription for Success: Drugs & Your Genetics, and Trace Your Ancient Ancestry Through DNA.