Question about bio informatics

kyhekm · Post by **kyhekm** » Wed Feb 17, 2010 11:38 pm

Hi

I'm trying to compare the genes that drive the regeneration process in the zebrafish heart with the genes that cause heart disease or heart failure.

I'm trying to find what they have in common and where they differ, and to predict the functions of other, unknown genes. To do this, I need to use websites like GenBank to compare the genes.

However, I'm not really familiar with using those websites. I looked at several projects on Science Buddies, like "Computational Exploration of Protein Function", "Bioinformatics - The Perfect Marriage of Computer Science & Medicine", etc. But I still don't get it 100%.

When I compare the genes, what in particular should I focus on? I heard that domains play an important role in a gene's function... (not sure tho..)

Please tell me how I can proceed with this project.

MelissaB · Post by **MelissaB** » Thu Feb 18, 2010 2:46 am

Hi,

This is outside my area of expertise, so hopefully another expert will chime in here. However, what I would be looking for if I were you is not the gene sequence, but the protein sequence. Mutations that do not affect the protein sequence are likely to have smaller (if any) effects on protein function.

The 'domain' you have heard of is also a protein domain: http://en.wikipedia.org/wiki/Protein_domain. Whether or not you can look at this will depend on what is known about the proteins that the genes you are interested in code for.

I would start with some simple BLAST searches with the protein sequences for the genes you are interested in. Do you need help with that, or are you already familiar with BLAST?

kyhekm · Post by **kyhekm** » Thu Feb 18, 2010 8:37 am

Hi

I know BLAST is all about comparing one particular gene's sequence to other gene or protein sequences and finding similar ones.
Does this sound right?

What I am confused about is whether I should focus on proteins or genes. My first idea was to compare 15 important genes in heart regeneration with other genes that are related to heart failure or heart problems, as I wrote previously. However, you told me it is better to focus on proteins.
Then should I compare those 15 genes to the proteins which cause heart failure or heart problems?

And is it a good idea to focus on domains? Or are there any other parts of a protein which play a significant role like domains do?

Best,
Ryan

MelissaB · Post by **MelissaB** » Fri Feb 19, 2010 3:32 am

Remember that genes code for proteins--so in a way, you are asking the same question. However, changes in the protein sequence are likely to be more important than single nucleotide changes as such, since many single nucleotide changes (the synonymous or "silent" ones) do not change the protein sequence and therefore do not affect the function of the protein.

Yes, BLAST allows you to compare gene sequences to other gene sequences or protein sequences to other protein sequences.

Active sites in proteins are the other thing you may want to focus on. However, as I said previously, you will need to do research on the genes you are interested in and their proteins. You may find that there is information available about domains and active sites, or that there is no information available, depending on whether or not someone has studied that particular protein. I suggest you start looking for papers on this as soon as possible. You can try Google Scholar, but you will probably have to go to a university or college library to actually get the papers, because you often only get abstracts on Google Scholar.

Good luck!

aelin · Post by **aelin** » Tue Feb 23, 2010 9:39 am

Hi Ryan,

Your project is very interesting and definitely very useful, and Melissa's post is definitely what you should be looking to at the moment. I just have a few small things to add on top of it.

1. Another reason why studying the protein sequence might be more interesting is all the forms of RNA/protein modification (starting with alternative mRNA splicing, all the way to post-translational modification). Often, the exact same DNA sequence encodes several different proteins due to these modifications, and they can have drastically different properties and shapes, and therefore different functions. Examining proteomics is one very important aspect of expressed differences between organisms. However, it would still be good to look for large differences between the DNA sequences. Like Melissa said, point mutations at the third position of a codon are often silent and don't affect the protein. A point mutation that does change an amino acid can still matter a lot (the classic example is sickle cell anemia), and a deletion or insertion that shifts the reading frame would be a big change.

2. The active sites of proteins are very important in terms of the functionality of the protein. If the active site is modified in some way between the two organisms, then it is likely that this will prevent or generate some functions.

3. What would be even more interesting to consider is the intracellular signaling that results from these proteins (this will most likely come out of the active site analysis, but it's good to do the analysis explicitly anyway). If you know that two of the proteins that you are interested in interact in some manner, and one of them is different in a mutant, then analyzing how the signaling is affected gives key insight into the resulting functionality.
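
To make the point in item 1 concrete, here is a minimal Python sketch that translates a short coding sequence and then applies a silent change, a missense change and a one-base deletion. The sequence is a made-up toy example and the function name `translate` is hypothetical; only the standard genetic code table is taken as given.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence; an incomplete final codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Toy coding sequence (illustrative only).
original = "ATGGTGCATCTGACTCCTGAGGAGAAG"

# Third-position change in the 7th codon: GAG -> GAA (still glutamate).
silent = original[:20] + "A" + original[21:]
# Second-position change in the same codon: GAG -> GTG (glutamate -> valine).
missense = original[:19] + "T" + original[20:]
# Deleting a single base shifts the reading frame for everything downstream.
frameshift = original[:10] + original[11:]

for name, seq in [("original", original), ("silent", silent),
                  ("missense", missense), ("frameshift", frameshift)]:
    print(f"{name:10s} {translate(seq)}")
```

The silent variant gives the same protein as the original, the missense variant differs at one amino acid, and the frameshift variant differs at every position after the deletion.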

Hope this helps!
Aaron Lin

carolinethorn · Post by **carolinethorn** » Wed Feb 24, 2010 6:26 pm

Hi Ryan,

This is a very interesting project. I do bioinformatics research so will try to keep an eye on how this is going and try to help if you get stuck with anything.

If I understand your first post, what you thought was that genes involved in zebrafish regeneration and heart disease/failure in humans might have something in common that you can identify as a feature and there might be other genes that have these features that previously were of unknown function. This seems a reasonable hypothesis.

I am now going to contradict the other experts' advice, even though all of their points are valid. While protein structures and domains are very important to the functioning of a protein, these are different depending on the exact job a protein is carrying out. So if all your zebrafish proteins were involved in the same kind of job (e.g., they all interact with myosin) they might have a similar motif in the structure. Same with the heart disease/failure proteins, if they were all doing the same kind of job. But I would think that is unlikely, so the type of signal you are looking for is less obvious and it might not be found in the protein sequence. However, if you use the DNA sequence including the upstream and downstream regulatory regions you might be more likely to find some sequence motif that is specific for genes expressed in heart or cardiovascular tissue.

So, how are you planning to compile your list of top genes for each category?

Best of luck,
Caroline

kyhekm · Post by **kyhekm** » Fri Feb 26, 2010 4:14 pm

First of all, thank you for answering my posts, you guys!!

Caroline, I got what you are trying to say but I have several questions I want to ask you.

1) For example, let's suppose that I am trying to compare two genes: Anxa5 (a gene which is upregulated in the regeneration process of zebrafish. Website: http://www.ncbi.nlm.nih.gov/nuccore/284 ... rt=genbank) and ACE (a gene which is found in heart failure. Website: http://www.ncbi.nlm.nih.gov/entrez/disp ... ?id=106180)

In this case, how can I find the upstream and downstream regulatory regions on the websites? I am not good at dealing with those websites so I need help so I can learn how to use them.

2) And I have other questions. Some people tell me to focus on active sites, and you told me it's good to focus on upstream and downstream regulatory regions. Can you list the things I could focus on, and which one is best to focus on?

3) "So, how are you planning to compile your list of top genes for each category?" You said this but I'm not sure what you mean by it. Are you asking me whether I have chosen genes to focus on? A guy I know is majoring in bioinformatics and he gave me some sources of genes that I can focus on. In other words, I have 15 zebrafish genes to focus on, but I haven't chosen the heart failure/disease genes yet. I am planning to look them up as soon as possible.

I was planning to do this project for my science fair but I think I can't finish it in 2 weeks (registration for the science fair is in 2 weeks), so I will just proceed and try to write a small research paper by the end of this semester. Please help so I can achieve my goal!

Thanks a lot,
Ryan

carolinethorn · Post by **carolinethorn** » Fri Feb 26, 2010 7:16 pm

Hi Ryan,

I'm glad you realise it's too ambitious to try and do everything in 2 weeks. I hope we can help you meet your goals of getting background information and beginning to learn about bioinformatic tools so you can write a paper by the end of the semester.

1. I was thinking you were going to try and look at all the sequences at once rather than a pairwise comparison. I think it would be hard to choose which genes to pair up to test and hard to see matches unless they are exact. If you look across all the sequences you might find something that is common to all, not exactly the same in each one but kind of similar.

There are some tricks to figuring out which GenBank sequences to use. Basically GenBank can have many sequences for the same gene; some have just the coding sequence, some have more. The first sequence you link to says it is an mRNA sequence, so it only has the transcribed, spliced sequence: no upstream regulatory regions or introns. It has links to the protein sequence, so if you decided to look more at proteins you could find it from here.
But when you are ready to get your gene sequences I can talk you through it.

2.
You could focus on active sites - the best way to do this might be to take the protein sequence of each of your zebrafish candidate genes in turn and search it against a database of human proteins (using BLASTP), find matches, particularly at the active site, and see if there is any evidence for the human protein's involvement in heart disease, i.e. to use your zebrafish list to identify the list of human gene candidates.
You could take the zebrafish list and a human candidate gene list and compare genomic sequences and maybe find a motif involved in heart expression.
You could take the zebrafish list and a human candidate gene list and find the functional annotations (called GO annotations, for Gene Ontology) and look for a pattern that is similar across all or differs between groups.
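
For the third option, once the annotation terms for each gene have been collected, the comparison itself is just set arithmetic. The following is a rough Python sketch under stated assumptions: the gene names and term labels are placeholders rather than real annotations, and `compare_groups` is a hypothetical helper, not part of any NCBI or GO tool.

```python
from collections import Counter


def term_counts(annotations):
    """Count how many genes in a group carry each annotation term."""
    counts = Counter()
    for terms in annotations.values():
        counts.update(set(terms))
    return counts


def compare_groups(group_a, group_b):
    """Return (shared terms, terms only in A, terms only in B), each sorted."""
    a, b = term_counts(group_a), term_counts(group_b)
    shared = sorted(set(a) & set(b))
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    return shared, only_a, only_b


# Placeholder data: replace with terms looked up for the real gene lists.
zebrafish_annotations = {
    "zf_gene_1": {"term_A", "term_B"},
    "zf_gene_2": {"term_B", "term_C"},
}
human_annotations = {
    "hs_gene_1": {"term_B", "term_D"},
    "hs_gene_2": {"term_C"},
}

shared, only_zf, only_hs = compare_groups(zebrafish_annotations, human_annotations)
print("shared:", shared)
print("zebrafish only:", only_zf)
print("human only:", only_hs)
```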

I can't tell you what to focus on, it's more about what aspect you find more interesting. The first option might be more straightforward in terms of techniques that more people can help with.

3. Yes, I was asking if you had a list or how you planned to make your list of genes. It's always good to make sure you know where you're starting from. So ask your friend why he chose these genes - is it from the literature or from expression experiment data or both?

Ok, this is a really long reply so I'll stop now. Hope it's answered some of your questions. Post back with more.

-Caroline

kyhekm · Post by **kyhekm** » Sat Aug 07, 2010 7:26 pm

Hello

I just want to check whether 'carolinethorn' is still here!!
I am trying to start my experiment again so I need your help.
If you are here, please leave a short comment so I can see.

Best,
Ryan

carolinethorn · Post by **carolinethorn** » Sun Aug 08, 2010 6:16 am

Yes, still here - how is it going?
-Caroline

kyhekm · Post by **kyhekm** » Fri Sep 10, 2010 1:24 pm

Hello,

I started my research again as an independent study course at my school.
As I said earlier, I've already made a list of 15 genes which are all found in zebrafish heart, retina, and fin regeneration.
I thought it was a good idea to use 15 genes that are common to all three regeneration processes.
And I also made a list of 15 genes which are involved in heart diseases. I got some of them from research papers and others from Google.

Now, I think I should start comparing the two groups of genes one to one. So, there are 15*15=225 possible pairs I need to check.
When I compare a pair of genes, is it right to compare the FASTA sequence of each gene, or is it right to compare the whole sequence of each gene?

And when I compare a pair of genes, is there any way to compare them other than doing it by hand?
I don't think I can compare two genes using BLAST since there is only one box to put a FASTA sequence in. Is this right?

Best,
Ryan

carolinethorn · Post by **carolinethorn** » Fri Sep 10, 2010 5:11 pm

Hi Ryan,

Sounds like things are getting started. Your math is right for how many pairs of comparisons you would need to do. You are also right that if you want to do pairwise comparisons that the usual BLAST set up will not do that. There are a lot of choices for algorithms to do pairwise comparisons. Here is a link that lists several resources for these types of comparisons.
http://molbiol-tools.ca/Alignments.htm
I have used ALIGN (the first one). It might be worth trying out a few to see which have the friendliest interfaces, or plan to compare the results. If you are handy at programming, some of the resources (like FASTA at UVA, http://fasta.bioch.virginia.edu/fasta_w ... down.shtml) allow you to download the pairwise alignment program and run it on your own computer so you could set it up to work on all your sequences.
Most of the programs use FASTA sequence format (confusing that there is the program with the same name as the format) but some allow you to do it by the accession number.
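
If you do go the programming route, listing all 225 comparisons is the easy part. Here is a small Python sketch; the gene names are placeholders standing in for the real lists, and `all_pairs` is just a hypothetical helper name.

```python
from itertools import product


def all_pairs(group_a, group_b):
    """Return every (a, b) combination, taking one gene from each group."""
    return list(product(group_a, group_b))


# Placeholder names only; substitute the real gene symbols.
zebrafish_genes = [f"zebrafish_gene_{i}" for i in range(1, 16)]
human_genes = [f"human_gene_{i}" for i in range(1, 16)]

pairs = all_pairs(zebrafish_genes, human_genes)
print(len(pairs))  # 15 * 15 = 225

# A consistent label per comparison makes saved results easy to find later.
labels = [f"{zf}__vs__{hs}" for zf, hs in pairs]
print(labels[0])
```

Each pair could then be handed to whichever alignment program you settle on, with the label used as the file name for the saved result.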
Post back as you encounter more questions,
best of luck,
Caroline

kyhekm · Post by **kyhekm** » Mon Sep 13, 2010 12:59 pm

Hi
I've looked through the website and I think ALIGN is the simplest and its results are the easiest to read. So I will go with that.
I have several questions though.

1. When I do this project, I often refer to this website ( https://www.sciencebuddies.org/science- ... t&from=TSW ) to learn how to do a bioinformatics project.
In step five on that website, it says, "Choose a SNP in an Exon that is missense. (Results in a change in amino acid.)"
So when I am comparing two genes, do I only need to compare genes which have a 'missense' function?
And which dbSNP entry is good to compare, among the many dbSNP entries?

2. I randomly chose two dbSNP entries which have a missense function, from MTHFD1L and ANXA5, and I got the result below from ALIGN.
>_ 201 nt vs.
>_ 401 nt
scoring matrix: , gap penalties: -12/-2
32.4% identity; Global alignment score: -288

What do scoring matrix, gap penalties and global alignment score mean?
And as you can see I got 32.4% identity. Is this a good percentage for identity? To use the data in the research paper, what percent identity should two genes have at least?
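
As a rough illustration of those three terms, here is a minimal Python sketch of a global (Needleman-Wunsch) alignment. It is a simplification and not what ALIGN itself runs: the "scoring matrix" is reduced to one score for a match and one for a mismatch, and there is a single gap penalty per gap position, whereas the two numbers in the output above presumably are separate penalties for opening and extending a gap. The function name and default scores are assumptions made for the example.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a single (linear) gap penalty.

    Returns (score, aligned_a, aligned_b, percent_identity).
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Stand-in for the scoring matrix: match or mismatch score.
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    i, j = n, m
    row_a, row_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    aligned_a = "".join(reversed(row_a))
    aligned_b = "".join(reversed(row_b))

    # Identity here = identical columns / alignment length (tools differ on this).
    identical = sum(x == y for x, y in zip(aligned_a, aligned_b))
    identity = 100.0 * identical / len(aligned_a) if aligned_a else 0.0
    return score[n][m], aligned_a, aligned_b, identity


score, top, bottom, identity = global_align("GATTACA", "GATACA")
print(top)
print(bottom)
print(f"score {score}, identity {identity:.1f}%")
```

The global alignment score is the sum over every column of the alignment, so it goes negative when mismatches and gaps outweigh matches. With only four letters, unrelated DNA sequences already agree at roughly a quarter of positions by chance, which is worth keeping in mind when reading a percent identity.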

Thank you,
Ryan

carolinethorn · Post by **carolinethorn** » Mon Sep 13, 2010 3:53 pm

Hi Ryan,
Good job for picking a method.
But I think we need to go over some core terminology so we are talking the same language. I also need to read the science buddies resource and what you've been doing. I don't have time to do that right now but will try and get to it later tonight or tomorrow morning.
-Caroline

carolinethorn · Post by **carolinethorn** » Tue Sep 14, 2010 6:50 am

Hi Ryan,

So I have read through the project idea website and the goal of that project is a little different from yours so we need to alter the protocol a little. The sample project is looking at points of variation found in the same gene but different people, SNPs. What I think you wanted to do was to look for overall similarities between genes involved in heart regeneration (fish) and those involved in heart disease (human). These kinds of similarities are called motifs, and they might suggest common functions, common regulation or a common evolutionary origin.

Rather than using the NCBI pages for OMIM/diseases you need to start from the database that catalogs all the genes, called Entrez Gene.
http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene
For each gene you want to test you will need to retrieve the DNA sequence. Search for the gene symbol from your collected list. Then select from the results the one that matches - make sure to look at the species to select Homo sapiens for your human genes, and Danio rerio for zebrafish.

On the gene page you need to scroll down to the section marked NCBI reference sequences (RefSeq). I think you should get the "genomic sequence". This is the DNA sequence of the whole gene. Click on the FASTA version of the genomic sequence. Then make a Word file and copy and paste the sequence into it starting from the > sign.
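
In case it helps to see what a FASTA file looks like to a program: each record is a header line starting with > followed by one or more lines of sequence. Below is a minimal Python sketch of reading that format; the record names and sequences are made up for illustration, and `parse_fasta` is a hypothetical helper rather than a function from any particular library.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Made-up example text, not real gene sequences.
example = """>example_gene_1 hypothetical description
ACGTAC
GTTAGC
>example_gene_2
ttgaca
"""

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq), seq)
```

Saving the sequences as plain text with the > line intact keeps them usable by the alignment programs later on.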

Doing this for 30 genes is going to take some time. Maybe it would be good to start with your top 3 from each group.
If you give me the genes you are going to try I can have a go too and then it will be easier to discuss the results.

-Caroline

kyhekm · Post by **kyhekm** » Wed Sep 15, 2010 1:30 pm

Hello

The three genes below are from heart disease:
MTHFD1L
PSRC1
MIA3

The three genes below are from zebrafish regeneration:
annexin A5
serine carboxypeptidase 1
myristoylated alanine rich protein kinase C substrate

I tried to look up PSRC1 and annexin A5 and found that PSRC1 has different types of genomic sequence, such as primary assembly, alternate assembly (Celera), and alternate assembly (HuRef). Among the three different types, which one should I focus on?
And when I searched for annexin A5 I couldn't find anxa5 but only anxa5b from Danio rerio. In this case, is it OK to use anxa5b? Or should I exclude this gene from my list?

Thank you,
Ryan

carolinethorn · Post by **carolinethorn** » Thu Sep 16, 2010 6:36 am

Hi Ryan,

I would try and pick all the reference sequences for your human genes from the same source if possible. So choose the "Genome Reference Consortium Human Build 37 (GRCh37), Primary_Assembly" where possible - if there isn't one from this source then make a note of it and which alternative source you used.

I looked through the page for annexin 5b (http://www.ncbi.nlm.nih.gov/gene/337132) and it says the protein name for this gene is annexin A5 so I think this is what you are looking for. I would surmise that the papers you selected this candidate from were talking about the protein. We often hit problems like this where the protein name and gene name are slightly different or there are a few different names used for the same thing. We try to have everyone use the same set of names or standard nomenclature, but often people like to use the names they always used in the past and this can be tough for doing bioinformatics.

hope the next step goes well, keep us posted,
Caroline

kyhekm · Post by **kyhekm** » Sat Sep 18, 2010 6:42 pm

Hi

Do you know other websites like ALIGN?
It works for short sequences but it seems like it doesn't work for long sequences.
I typed my e-mail address in there and it says my e-mail address doesn't exist.
Does it work with your e-mail? None of my e-mail addresses are working for ALIGN...

Best,
Ryan

deleted-71827 · Post by **deleted-71827** » Sat Sep 18, 2010 7:05 pm

Hey Ryan,
I know many scientists make extensive use of BLAST, so perhaps you can add that to your list of resources. It's a pretty useful data analysis tool used in bioinformatics pretty often, so hopefully that will help you out a little.

http://blast.ncbi.nlm.nih.gov/Blast.cgi

Best of luck!

carolinethorn · Post by **carolinethorn** » Sun Sep 19, 2010 5:31 am

Hi Ryan,

ALIGN should work for long sequences so I'll have a go myself to check and let you know.

-Caroline

carolinethorn · Post by **carolinethorn** » Sun Sep 19, 2010 11:06 am

Hi Ryan,

I did a test using the genomic sequences of PSRC1 (human) and ANXA5 (zebrafish) at the ALIGN Query website, which was the first link on the list of sequence comparison tools (http://xylian.igh.cnrs.fr/bin/align-guess.cgi). I too got error messages and it said my email was wrong or something. Something must be wrong with this and it's not worth wasting time figuring it out when there are other choices.

I had a go with the next link
LALIGN at http://www.ch.embnet.org/software/LALIGN_form.html
I chose local alignment and also tried global alignment. I used the default settings. This worked.

What's the difference between these two types of alignment? Local looks for small pieces in both sequences that look the same. Global tries to line up the whole lengths against each other. Global is good for looking at two sequences that are really similar - like the same gene from human and mouse, or two subtypes of the same receptor. But local alignment is better for finding motifs - so let's try and use a local alignment program.
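
To show what "local" means in practice, here is a minimal Python sketch of the Smith-Waterman idea: scores are never allowed to drop below zero, so the best-scoring short stretch stands out even when the flanking sequence is unrelated. The scores, the function name and the toy sequences are assumptions for the example, and this is not the exact procedure LALIGN or FASTA use.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment with a single (linear) gap penalty.

    Returns (best_score, aligned_piece_of_a, aligned_piece_of_b).
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # The 0 lets an alignment restart instead of carrying a bad prefix.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    i, j = best_pos
    row_a, row_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


# Toy sequences: unrelated flanks around one shared stretch.
best, piece_a, piece_b = local_align("CCCCGATTACACCCC", "TTGATTACATT")
print(best, piece_a, piece_b)
```

A global alignment of the same two toy sequences would be forced to line up the unrelated flanks as well, which is why the local version is the better fit for motif hunting.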

I also had a go with FASTA at this link http://fasta.bioch.virginia.edu/fasta_w ... rm=compare
This worked also. To me, the output style from this was better, and it also seemed to find shorter areas with closer matches than LALIGN at embnet did. I think those might be more identifiable as motifs.

Why don't you have a try with those two and see which you like best. Post back with your results.
-Caroline

kyhekm · Post by **kyhekm** » Sun Sep 19, 2010 11:31 pm

Hello

I tried both and the second never worked for me. I will try it again from somewhere else.
And I found the limit of the first website, which is LALIGN.
As you know, 'Genome Reference Consortium Human Build 37 (GRCh37), Primary_Assembly' is super long and the whole thing doesn't fit into the sequence box in LALIGN.

I hope ALIGN works but it seems like it is never going to work.
I will try to find other programs.

Best,
Ryan

carolinethorn · Post by **carolinethorn** » Mon Sep 20, 2010 5:14 am

Hi Ryan,

Maybe I wasn't clear enough with my instructions for retrieving the sequence for PSRC1. It should only be the small part of Genome Reference Consortium Human Build 37 (GRCh37), Primary_Assembly that corresponds to the PSRC1 gene - about 3279 bases long. Yes, the whole Genome Reference Consortium Human Build 37 (GRCh37), Primary_Assembly is the whole genome and would be way too long to align on a home computer!
I'll see if I can take some screen shots of what I did to get my PSRC1 sequence to help.
Later,
-Caroline

carolinethorn · Post by **carolinethorn** » Mon Sep 20, 2010 7:03 am

I'm having trouble getting the screenshots to upload as the file is too big. I'll ask the Science Buddies people for help and see if they can do it. Sorry I am not better at explaining in words!
-Caroline

amyC · Post by **amyC** » Mon Sep 20, 2010 8:19 am

Hi Ryan - I've attached a PDF from Expert carolinethorn.

Amy
ScienceBuddies

kyhekm · Post by **kyhekm** » Mon Sep 20, 2010 6:51 pm

Hello
Thanks for all your work.
So I don't need to copy whole human genes, just around 3000~4000 base pairs, right?
And how about the zebrafish genes? Do I need to use the same number of base pairs as I did for the human genes?

And as you mentioned in the PDF file, when I compare two different regions (one from pair A and the other from pair B), I need to see whether the two regions are at the same base pair number and whether they are similar or the same. Right?

And how do you track down whether the function of one of the regions is already known or not? Is there a website to figure this out?

Best,
Ryan

carolinethorn · Post by **carolinethorn** » Tue Sep 21, 2010 6:27 am

Hi Ryan,

You are welcome, I am happy to help a student who is enthusiastic and looking to learn.

Yes, I think using the whole gene sequence is the best idea - it just happens that PSRC1 is pretty short. Some of the other genes will be much longer but the alignment should still work. Since you are not trying to line up the whole human and fish sequences to each other it doesn't matter if they are different lengths; the software searches through both sequences for short stretches that are similar. So use the whole gene sequence for each gene.

To start with just keep copies of all your alignment results. You could save the html file of that page or copy and paste into another program but make sure you label which genes are being compared. I think a key skill in bioinformatics is being very methodical about keeping all of your files and labeling them well. And like all good scientists, keeping track of your methods. You could use the screen shots from my pdf to help with your methods write up but make sure you write what you did in your own words.

There are a few different ways we can look to find out whether the function of your putative motif is known. (The results that have short, almost identical aligned sequences are called putative or possible motifs.) One way is to see if that region of the gene has been annotated in the NCBI or ZFIN database; another is to search a motifs database. I know a few protein structure motif search databases but don't know if there are any that can do different kinds of motifs (protein, DNA regulation, DNA expression) all at the same time. I will ask some colleagues about the best online tools for this and get back to you.

best of luck,
Caroline

kyhekm · Post by **kyhekm** » Wed Sep 22, 2010 8:18 pm

Hello

I'm still a little bit confused about comparing two genes.
Let's suppose we are comparing MTHFD1L (human) and anxa5b (zebrafish).
As you can see on the page for MTHFD1L (http://www.ncbi.nlm.nih.gov/gene/25902), there are two genomic sequences in Primary Assembly: NC_000006.11 and NT_025741.15.
1)I think there is no difference in sequence between them. Is it ok to use either one to compare?

2)And when you compare it is good to use all sequences even though it is really long. Right?

3)When you look for putative motif, above what percent can I use? I mean when you run FASTA program, it shows percentage for each part. And I wonder above what percentage, I can use to find putative motif. About 80%? 90%?

If you can tell me later how to find putative motifs via the internet, that would be great!
I will keep working on comparing genes.

Best,
Ryan

kyhekm · Post by **kyhekm** » Wed Sep 22, 2010 9:01 pm

I forgot to write one more thing.

In my list of zebrafish genes, there are several unknown genes such as wu:fc60b09.
That one doesn't have any genomic sequences. In this case, should I just exclude it?
Or is there another way to find the sequence?

carolinethorn · Post by **carolinethorn** » Thu Sep 23, 2010 6:53 am

Hi Ryan,

Some answers...

>1)I think there is no difference in sequence between them. Is it ok to use either one to compare?
I think so but try and be consistent and keep a record.
>2)And when you compare it is good to use all sequences even though it is really long. Right?
Yes, use the whole gene.
>3)When you look for putative motif, above what percent can I use? I mean when you run FASTA program, it shows percentage for each part. And I wonder above what percentage, I can use to find putative motif. About 80%? 90%?
I don't know yet. We are kind of still figuring out our methods here. Maybe just save all the results for now and then figure this out later. I asked a colleague about motifs and they recommended a program called MEME. I haven't had a chance to look at it properly yet though.

And my questions for you...

1. Do you have your hypothesis yet?
2. Remind me again how you got your two sets of genes?
3. Do you think there will be motifs that are present in both sets of genes but are not exclusive to heart function? How would you control for this?
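
One way to think about question 3 is to add a control set of genes that have nothing to do with the heart, and discard any shared stretch that also turns up there. The Python sketch below shows the idea with exact short words (k-mers); it is a toy stand-in, not how MEME works, and the sequences, the word length of 6 and the function names are all assumptions made for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(sequences, k):
    """Return the k-mers that occur in every sequence of the list."""
    sets = [kmers(s, k) for s in sequences]
    return set.intersection(*sets) if sets else set()


def candidate_motifs(group_a, group_b, control, k):
    """k-mers found in every gene of both groups but in no control gene."""
    shared = shared_kmers(list(group_a) + list(group_b), k)
    in_control = set().union(*(kmers(s, k) for s in control))
    return sorted(shared - in_control)


# Made-up toy sequences, not real genes.
zebrafish = ["CCTATAAAGGGATAAGCC", "TTGATAAGACTATAAACA"]
human = ["AGTATAAATCGATAAGTG"]
control = ["GCTATAAAGCCGCGTTAC"]

print(sorted(shared_kmers(zebrafish + human, 6)))
print(candidate_motifs(zebrafish, human, control, 6))
```

In this toy data two words are shared by all three test sequences, but one of them also appears in the control, so only the other survives as a candidate.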

just some things to think of,
Caroline
