The Cancer Genome Anatomy Project
Abstract
Finding a cure for cancer is one of scientists' greatest challenges today. But first, they have to study and understand the disease. In this science project, you will explore the software available on the Cancer Genome Anatomy Project (CGAP) website and use bioinformatics tools to identify genes whose level of expression is higher in cancer tissue.
Objective
The goal of this science project is to use web-based bioinformatics tools to identify genes that are over-expressed in pancreatic cancer tissue. You will use the online tool called SAGE Genie, available at the website for the Cancer Genome Anatomy Project (CGAP). This project will give you an idea of how bioinformatics tools can be used to explore large datasets of biological information.
Credits
David Whyte, PhD, Science Buddies
Last edit date: 2013-03-15
Introduction
The Cancer Genome Anatomy Project (CGAP), a program of the National Cancer Institute (NCI), studies the molecular changes that occur when a normal cell is transformed into a cancer cell. Cancer is a term for diseases in which abnormal cells divide without control and can invade other tissues. Cancer cells can arise in many different parts of the body. The five deadliest types of cancer in the United States, and the number of deaths they cause per year (estimated for 2008), are listed below:
| Type of Cancer | Number of Deaths per Year in the U.S. |
| Lung | 162,000 |
| Colorectal | 50,000 |
| Breast | 40,000 |
| Pancreatic | 34,000 |
| Prostate | 30,000 |
A complete list can be found at the National Cancer Institute's website: Common Cancers.
Cancer researchers are interested in finding biological differences between cancerous tissue and normal tissue because these differences may be "Achilles' heels" that they can exploit to fight the cancer. For example, cancer cells often have patterns of gene expression that differ from their normal cell counterparts. Some genes are expressed at higher levels (over-expression), and others are expressed at lower levels, or not at all.
Genes that are over-expressed in the cancerous tissue are of particular interest because over-expression is a trait you would expect of a gene that is causing the cancer to grow. For example, if a gene codes for a protein that normally causes a cell to divide, then making more of this protein may lead to uncontrolled growth (HER2, for example). On the other hand, gene products that normally control or inhibit growth may be lost as a normal cell becomes a cancer cell (PTEN, for example). If we compare the cell to a car, over-expression of a growth-stimulating gene is analogous to jamming the accelerator down, whereas loss of an inhibitory gene is analogous to losing the ability to apply the brakes.
Figure 1 shows a theoretical "profile" of gene expression for eight genes in normal and cancerous tissues. It is possible that the over-expressed gene, #8, may be helping the development and growth of the tumor. There are many other criteria that cancer researchers look at to determine the precise role of a gene in cancer, but gene expression is a key factor, and it is one that you can explore without getting your lab coat on!
Figure 1. Shown here is theoretical gene expression in normal and cancerous tissues. Comparing the "profile" of normal versus cancerous tissues highlights genes like #8 that deserve further study because of their suspiciously high expression level.
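To make the idea of comparing expression profiles concrete, here is a minimal Python sketch. The tag counts are invented for illustration (they are not CGAP data), and the function names and the fold-change threshold of 2 are assumptions chosen for this example.

```python
# Hypothetical expression levels for eight genes (illustrative values, not CGAP data).
normal = {"gene1": 40, "gene2": 55, "gene3": 30, "gene4": 62,
          "gene5": 48, "gene6": 35, "gene7": 50, "gene8": 45}
cancer = {"gene1": 42, "gene2": 50, "gene3": 33, "gene4": 60,
          "gene5": 45, "gene6": 38, "gene7": 52, "gene8": 450}


def fold_changes(normal_counts, cancer_counts):
    """Return the cancer/normal expression ratio for each gene found in both profiles."""
    return {
        gene: cancer_counts[gene] / normal_counts[gene]
        for gene in normal_counts
        if gene in cancer_counts and normal_counts[gene] > 0
    }


def flag_over_expressed(changes, threshold=2.0):
    """Return the genes whose fold change meets or exceeds the threshold, sorted by name."""
    return sorted(gene for gene, ratio in changes.items() if ratio >= threshold)


changes = fold_changes(normal, cancer)
print(flag_over_expressed(changes))  # only gene8 stands out in this made-up profile
```

In this made-up profile, gene8 is expressed ten times more strongly in the cancerous tissue, which is the kind of difference that would prompt a closer look.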
In this genomics science fair project, you will use software tools to identify molecular-level differences between normal human tissue and cancerous tissue. Specifically, you will identify genes that are over-expressed in cancerous tissue derived from the pancreas. The pancreas is a gland, about 6 inches long, that lies behind the stomach. Its two main functions are 1) to produce juices that help digest food, and 2) to produce hormones, such as insulin and glucagon, which help control blood-sugar levels. The digestive juices are produced by exocrine pancreas cells and the hormones are produced by endocrine pancreas cells. About 95 percent of pancreatic cancers begin in exocrine cells.
The method used by CGAP to study gene expression is called SAGE, which stands for Serial Analysis of Gene Expression. This method allows researchers to compare the expression of thousands of genes in different tissue types. If you are interested in finding out more about this technology, the CGAP website has great information.
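The core bookkeeping behind a tag-counting method can be sketched in a few lines of Python. This is only an illustration of counting short sequence tags, not CGAP's actual pipeline; the function names, the per-million scaling, and the second tag sequence are assumptions made for this example.

```python
from collections import Counter


def count_tags(tags):
    """Count how many times each tag sequence appears (case-insensitive)."""
    return Counter(tag.upper() for tag in tags)


def tags_per_million(counts):
    """Scale raw counts so that libraries of different sizes can be compared."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n * 1_000_000 / total for tag, n in counts.items()}


# A tiny, made-up "library" of sequenced tags.
library = ["TCCCTTCTAT", "tcccttctat", "GATTACAGAT", "TCCCTTCTAT"]
counts = count_tags(library)
scaled = tags_per_million(counts)
print(counts["TCCCTTCTAT"], scaled["TCCCTTCTAT"])
```

The more often a tag shows up in a library, the more mRNA copies of the corresponding gene were present in the tissue sample.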
An excellent way to really explore gene expression is to use the bioinformatics tools on the CGAP website, which allow you to perform "virtual experiments" on real data sets of gene expression data. The goal of this project is to become familiar with a set of tools used to explore the cancer genome. Once you have followed the steps outlined below, you can use these tools to independently identify genes that are over-expressed in cancer.
Terms and Concepts
Before you perform the bioinformatics search using the tools at the CGAP website, make sure you have a good understanding of these terms and concepts:
- Bioinformatics
- Cancer
- Gene expression
- Exocrine pancreas cells
- Gene library
- Adenocarcinoma
- Genomics
Questions
- What does it mean to say a gene is "expressed" in a tissue?
- What types of genes would you expect to be over-expressed in cancer tissue?
Bibliography
- The CGAP website has many useful resources to help you learn about cancer and gene expression.
- The Cancer Genome Anatomy Project. (n.d.). CGAP Slide Tour. Retrieved March 27, 2008 from
http://cgap.nci.nih.gov/Info/concept
- The Cancer Genome Anatomy Project. (n.d.). Retrieved March 27, 2008 from http://cgap.nci.nih.gov/
- This paper describes the CGAP project. You don't need to read this to be able to use the tools on the site, but it is helpful if you want to learn about CGAP in depth.
Riggins, G. (2001). Genome and genetic resources from the Cancer Genome Anatomy Project. Human Molecular Genetics, Vol 10, No 7, 663-667. Retrieved March 27, 2008 from http://hmg.oxfordjournals.org/cgi/reprint/10/7/663.pdf
- This National Cancer Institute's webpage has information about the causes of cancer:
National Cancer Institute. (n.d.). What is Cancer? Retrieved March 27, 2008 from http://www.cancer.gov/cancertopics/what-is-cancer
- A list of the most common cancers and the annual deaths associated with each can be found at this website from the National Cancer Institute:
National Cancer Institute. (n.d.). Common Cancer Types. Retrieved March 27, 2008 from http://www.cancer.gov/cancertopics/commoncancers
Materials and Equipment
- High-speed Internet access
- Lab notebook
Experimental Procedure
- To start this science project, go to the CGAP website: http://cgap.nci.nih.gov/.
- Click on "SAGE Genie."
- Click on "The SAGE Digital Gene Expression Displayer."
- Select "Two SAGE library pools" and then click on "Submit."
- On the next page, select the following options:
- Select: "Short tags"
- List libraries by: "Tissue Type"
- For Pool A, select "Include," Tissue Type "Pancreas," Tissue histology "Normal."
- For Pool B, select "Include", Tissue Type "Pancreas", Tissue histology "Cancer."
- "Exclude Cell Lines" should be checked in Pool A and Pool B.
Note: Pools are one or more SAGE libraries combined.
- Click "Submit Query."
So what exactly did you just set up? The program is now set to compare gene expression in normal pancreatic tissue (Pool A) to gene expression in cancerous pancreatic tissue (Pool B). You will be looking for genes that are over-expressed in cancerous tissue, since they might be involved in advancing the cancer disease process. You want to compare normal vs. cancer in the same tissue type—in this case, the pancreas—so that you are not identifying genes whose differential expression is due to the fact that the tissues are from different organs.
(Under-expressed genes can also provide very important clues about what is happening in cancer cells, but for the purposes of this project, you'll focus on over-expression.)
- The next page gives you the opportunity to alter the statistical parameters for your gene search. Use the default selections (F = 2, Q = 0.1). (Increasing F, or decreasing Q, will increase the stringency of the search; the search will yield fewer genes, but their differential expression will be more statistically significant.) The "Chromosome" field should remain as "All."
- SAGE_Pancreas_adenocarcinoma_B_96-6252 should be check-marked as Pool B.
- SAGE_Pancreas_normal_B_1 should be check-marked as Pool A.
- SAGE_Pancreas_adenocarcinoma_B_91-16113 should be check-marked as Pool B.
In the next step, the software will identify genes that are differentially expressed in these two pools.
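Since a pool is one or more SAGE libraries combined, pooling can be pictured as adding up tag counts across libraries. The Python sketch below illustrates that idea only; the function name is an assumption, the second tag is invented, and the way the CLASP2 tag count is split between the two cancer libraries is hypothetical.

```python
from collections import Counter


def pool_libraries(libraries):
    """Combine several libraries (tag -> count mappings) into one pool by summing counts."""
    pool = Counter()
    for library in libraries:
        pool.update(library)  # Counter.update adds counts rather than replacing them
    return pool


# Hypothetical tag counts for the two cancer libraries that make up Pool B.
lib_96_6252 = {"TCCCTTCTAT": 1500, "AAAAAAAAAA": 10}
lib_91_16113 = {"TCCCTTCTAT": 2119}

pool_b = pool_libraries([lib_96_6252, lib_91_16113])
print(pool_b["TCCCTTCTAT"])
```

Pooling gives the comparison more data to work with than a single library would.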
- You will be required to input an e-mail address where the server can send you a notice when the comparison search is done. Typically the search takes only a few minutes. Once you've filled in your e-mail address, click "Submit Query" at the bottom of the page.
- The next page lists all of the genes that are differentially expressed. Here is what the column headings mean:
- Tag refers to the short nucleic acid sequence that identifies the gene. Each tag counted corresponds to one copy of the gene's product (mRNA), so the more tags there are in a library, the more copies of that mRNA there are.
- Gene or Accession refers to the gene's name, or if it is an unknown gene without a name, to its accession number.
- Libraries refers to the number of libraries the tag was found in. A library contains all of the gene products (mRNAs) expressed in a given tissue sample. If there is a "2" under column B, that gene was found in both of the tumor libraries.
- The Tags column represents the basic data: how many copies of the gene were found in each pool. (Remember that you chose "A" as the normal tissue sample and "B" as the cancerous tissue sample).
- Tag Odds A:B is the sequence odds ratio, a number that compares how often the tag occurs in Pool A relative to Pool B.
- NaN stands for "not a number" and occurs when the denominator of the odds ratio is 0 (i.e., there are no sequences of a gene in Pool B).
- Q measures the "false discovery rate," which is the probability that sampling error, rather than real biology, accounts for the fact that a tag has been observed to occur more frequently in one library pool than in a second library pool. Low values of Q mean the difference is more likely to be real, and not caused by random chance. Look for genes that have low values of Q when you pick genes for further analysis.
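The behavior of the Tag Odds A:B column, including the NaN case, can be sketched in Python. This sketch assumes the ratio compares the fraction of Pool A's tags that belong to a gene with the corresponding fraction in Pool B; the exact formula used by the CGAP software may differ, and the function name and the pool totals below are hypothetical.

```python
import math


def tag_odds_ratio(tags_a, total_a, tags_b, total_b):
    """Return (tags_a / total_a) / (tags_b / total_b), or NaN if Pool B has no tags for the gene.

    Assumed form of the ratio; the real CGAP calculation may differ.
    """
    if total_a <= 0 or total_b <= 0:
        raise ValueError("pool totals must be positive")
    if tags_b == 0:
        return math.nan  # the denominator would be zero
    return (tags_a / total_a) / (tags_b / total_b)


# A gene seen 10 times among 1,000 tags in Pool A and 40 times among 2,000 tags in Pool B.
print(tag_odds_ratio(10, 1000, 40, 2000))

# A gene with no tags at all in Pool B gives NaN.
print(tag_odds_ratio(399, 50000, 0, 60000))
```

A ratio below 1 means the tag is relatively more common in Pool B (the cancerous tissue in this project), and a ratio above 1 means it is relatively more common in Pool A.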
- Scroll to the bottom of the list to find the genes that are over-expressed in pancreatic cancer tissue (B>A, where B is the pool of genes from cancerous tissue and A is the pool of genes from normal tissue).
Take a look at the table of genes that you have generated.
Here is one entry, for the CEL gene, where B is not greater than A:
Table 1. CGAP data for the CEL gene. (CGAP website, n.d.)
CEL is expressed in the normal tissue library, but not in either of the cancer tissue libraries. The tag for this gene is found 399 times in the normal tissue library, and not at all in the cancer Pool B. This suggests that the gene is shut off when the tissue becomes cancerous.
But we want genes that are over-expressed in the cancer Pool B. Here is an example of a gene with high cancer-associated gene expression:
Table 2. CGAP data for the CLASP2 gene. (CGAP website, n.d.)
The CLASP2 gene is found in both cancer libraries, but not in the normal library. It was found 3619 times in the pool of genes from the cancerous tissue. This looks like a gene that is really over-expressed in pancreatic cancer.
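Picking out the B>A genes from a results table can also be expressed as a short Python filter. This is a sketch, not the CGAP software's method: it assumes F acts as a minimum fold difference and Q as a maximum false discovery rate, it compares raw tag counts without adjusting for pool size, and every Q value and the GENE_X and GENE_Y rows are hypothetical. Only the CEL and CLASP2 tag counts come from the tables above.

```python
def over_expressed_in_b(rows, min_fold=2.0, max_q=0.1):
    """Return names of genes with more tags in Pool B than Pool A that pass both cutoffs."""
    selected = []
    for row in rows:
        a, b = row["tags_a"], row["tags_b"]
        if b <= a or row["q"] > max_q:
            continue  # not over-expressed in B, or too likely to be sampling error
        if a == 0 or b / a >= min_fold:
            selected.append(row["gene"])
    return selected


rows = [
    {"gene": "CEL", "tags_a": 399, "tags_b": 0, "q": 0.0},      # Q value is hypothetical
    {"gene": "CLASP2", "tags_a": 0, "tags_b": 3619, "q": 0.0},  # Q value is hypothetical
    {"gene": "GENE_X", "tags_a": 20, "tags_b": 30, "q": 0.05},  # hypothetical row
    {"gene": "GENE_Y", "tags_a": 10, "tags_b": 50, "q": 0.5},   # hypothetical row
]

print(over_expressed_in_b(rows))
```

Raising min_fold or lowering max_q makes the filter stricter, which mirrors how increasing F or decreasing Q yields fewer genes on the website.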
- Click on the link under Tag that has the nucleic acid sequence TCCCTTCTAT. For CLASP2, you will get a small table like this:
Figure 2. CLASP2 SAGE data. (CGAP website, n.d.)
- Click on the picture of a body under SAGE Anatomic Viewer. You will see a table with data comparing expression of CLASP2 in a variety of normal and cancerous tissues. Below is a small section of this table showing pancreatic tissue, with red indicating over-expression. Note that CLASP2 over-expression is associated with pancreatic cancer, but not with other types of cancer.
Figure 3. CLASP2 is over-expressed in pancreatic cancer. Blue indicates average expression level, red indicates high expression level. (CGAP website, n.d.)
- Go back to Table 2 and click on the CLASP2 link. This link brings up the Gene Info page. This page has a wealth of information about the CLASP2 gene. Imagine how long it would take you to find this information by yourself! There has been tremendous progress in the last few years annotating all of the human genes and proteins.
- As a final example of what is on the CGAP site, click on the "Monochromatic SAGE/cDNA Virtual Northern" link under "Gene Expression Data" on the CLASP2 Gene Info page. (A northern is a type of mRNA blot used to measure gene expression levels. High expression levels will yield a dark band like the one on the Virtual Northern in Figure 4, below.)
This final page gives a visual look at the expression level of CLASP2 in normal and cancerous tissues. As you scroll down, note that CLASP2 is over-expressed in pancreatic cancerous tissue, compared to normal pancreatic tissue.
Figure 4. Graphical image of CLASP2 gene expression. Note that CLASP2 expression is very high in pancreatic cancerous tissue, but not in normal pancreatic tissue. (CGAP website, n.d.)
- Click on some of the other genes that are over-expressed in pancreatic cancer and follow the links to learn more about them.
- There are many resources on the CGAP website that were not dealt with in this project. Explore the site to learn more about gene expression and cancer.
- Can you hypothesize what functions the over-expressed genes might have in cancer cells? This is how some cancer research projects in university and biotechnology labs are launched!
Variations
- Now that you have had a tour of the CGAP website, repeat the procedure above for another type of cancer and identify one or more genes that are over-expressed in cancerous tissue.
- Use the annotation tools on the CGAP website to learn more about the cancer-associated genes you identified.
- Can you identify genes that are over- or under-expressed in more than one type of cancer? Can you make a case for why a particular gene is over-expressed? Put your ideas in the form of a hypothesis.
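If you repeat the search for several cancer types, one way to look for genes that turn up more than once is to tally them across your result lists. The Python sketch below shows one possible approach; the function name is an assumption, and the gene names and cancer types in the example are placeholders, not real results.

```python
from collections import Counter


def genes_in_multiple_cancers(over_expressed_by_cancer, min_types=2):
    """Return genes that appear in the over-expressed lists of at least min_types cancers."""
    counts = Counter()
    for genes in over_expressed_by_cancer.values():
        counts.update(set(genes))  # count each gene at most once per cancer type
    return sorted(gene for gene, n in counts.items() if n >= min_types)


# Placeholder results; replace these with the genes you record in your lab notebook.
results = {
    "pancreas": ["GENE_A", "GENE_B", "GENE_C"],
    "colon": ["GENE_B", "GENE_D"],
    "lung": ["GENE_C", "GENE_B"],
}

print(genes_in_multiple_cancers(results))
print(genes_in_multiple_cancers(results, min_types=3))
```

Genes that appear on several lists are good starting points for a hypothesis about functions that different cancers might share.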
Ask an Expert
The Ask an Expert Forum is intended to be a place where students can go to find answers to science questions that they have been unable to find using other resources. If you have specific questions about your science fair project or science fair, our team of volunteer scientists can help. Our Experts won't do the work for you, but they will make suggestions, offer guidance, and help you troubleshoot.