Protein Domain Identification

deleted-324986 · Post by **deleted-324986** » Thu Dec 03, 2015 6:21 pm

Hello,
I'm doing a project called "Computational Exploration of Protein Function", and I would like help with step 5 of the procedure.

It says to do a protein domain identification (using PROSITE at ExPASy or CDD at NCBI) to figure out the function of the domain that was found. I've already found a domain using T-COFFEE, but I don't know how to run the programs for protein domain identification.

How, and in what format, should I enter the domain into those programs? Also, are there any other websites available for domain identification?
Thank you for your help.

SciB · Post by **SciB** » Fri Dec 04, 2015 7:01 am

Hi,

You can probably identify your domain by simply doing a BLAST search: http://blast.ncbi.nlm.nih.gov/blastcgihelp.shtml
The protein domain sequence data can be entered in FASTA format (a > followed by a descriptor on the first line, then the single-letter amino acid sequence) or as just the single-letter amino acids.

The ExPASy-Prosite program also accepts the FASTA sequence format, so you can try that as well and compare the results: http://prosite.expasy.org/

Your domain's amino acid sequence may already be in FASTA format. Does it have a descriptor following > and then the amino acid sequence using the single-letter designations? If it is just the amino acids, you can still do the BLAST search with that on the NCBI site.
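
To check which of these two forms a piece of text is in before pasting it into a search box, a rough sketch like the following could help. The function name `classify_input` and the set of allowed letters (the 20 standard amino acids plus X for unknown) are assumptions made for illustration, not something BLAST or Prosite defines.

```python
# Rough check of whether text looks like FASTA, a bare sequence, or neither.
# Assumption: only the 20 standard amino acid letters plus X are accepted.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYX")


def is_protein(text):
    """True if text is non-empty and made only of allowed amino acid letters."""
    residues = "".join(text.split()).upper()
    return bool(residues) and set(residues) <= AMINO_ACIDS


def classify_input(text):
    """Return 'fasta', 'bare sequence', or 'unrecognized'."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines:
        return "unrecognized"
    if lines[0].startswith(">"):
        # FASTA: a descriptor line first, then the sequence on the lines below.
        return "fasta" if is_protein("".join(lines[1:])) else "unrecognized"
    return "bare sequence" if is_protein("".join(lines)) else "unrecognized"


if __name__ == "__main__":
    # The descriptor and sequence here are made-up examples.
    print(classify_input(">my_domain\nMKTAYIAKQR"))
    print(classify_input("MKTAYIAKQR"))
    print(classify_input("my_domain MKTAYIAKQR"))
```

A descriptor placed on the same line as the sequence without a leading > falls into the "unrecognized" case, because the descriptor characters are read as if they were residues.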

Let us know if you have more questions.

Sybee

deleted-324986 · Post by **deleted-324986** » Sun Dec 06, 2015 6:20 pm

hi,
Thank you for your advice. I'm still not sure how to use the ExPASy-Prosite program.
This is the domain I got from T-COFFEE:
(collagen)
gi|802665820|gb ---------------------------------[Pristionchus pacificus]
gi|466538|dbj|B SSGLQGDPGQTPTAEAVQVPPGPLGL-------[Homo sapiens]
gi|467517|dbj|B GQVRIWATYQTMLDKIREVPEGWLIFVAEREEL[Mus musculus]
How should I enter these domains into PROSITE for the program to run?
Thank you!

SciB · Post by **SciB** » Sun Dec 06, 2015 8:07 pm

Hi,

You just put the descriptor first and then the amino acid sequence. Omit the species name at the end. You can do copy/paste into the Prosite search box.

Your first hit has a descriptor but no amino acid sequence so can't be used, but the other two should give you some information about what the amino acid sequence corresponds to.
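
The same steps can be sketched in Python for the three alignment rows posted above. This is only an illustration: the function name `alignment_rows_to_fasta` is made up, and it assumes each row has the form `identifier aligned-sequence[species]` with `-` marking alignment gaps. It also writes a > in front of each descriptor, which FASTA format requires.

```python
import re

# Rows copied from the T-COFFEE output posted earlier in the thread.
ROWS = """\
gi|802665820|gb ---------------------------------[Pristionchus pacificus]
gi|466538|dbj|B SSGLQGDPGQTPTAEAVQVPPGPLGL-------[Homo sapiens]
gi|467517|dbj|B GQVRIWATYQTMLDKIREVPEGWLIFVAEREEL[Mus musculus]
"""


def alignment_rows_to_fasta(rows):
    """Turn 'identifier sequence[species]' rows into FASTA records."""
    records = []
    for row in rows.splitlines():
        parts = row.split(None, 1)  # identifier, then everything else
        if len(parts) != 2:
            continue
        descriptor, rest = parts
        rest = re.sub(r"\[[^\]]*\]\s*$", "", rest)  # omit the species name
        sequence = rest.replace("-", "").strip()    # drop alignment gaps
        if sequence:  # rows that are all gaps have no sequence to search
            records.append(f">{descriptor}\n{sequence}")
    return records


if __name__ == "__main__":
    print("\n".join(alignment_rows_to_fasta(ROWS)))
```

Only the human and mouse rows come out as records; the first row is skipped because nothing is left once the gaps are removed.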

Hope this works for you. I would also try the NCBI BLAST search to see if you get the same information.

Let us know how it comes out.

Sybee

deleted-324986 · Post by **deleted-324986** » Tue Dec 08, 2015 1:21 pm

hi
I've put in
gi|466538|dbj|B SSGLQGDPGQTPTAEAVQVPPGPLGL
for the PROSITE search box, but it didn't work.
Am I putting in the correct format?
thank you

SciB · Post by **SciB** » Tue Dec 08, 2015 7:04 pm

Sorry! I forgot to remind you to put a > before the descriptor. That tells the program that the entry is in FASTA format, with a descriptor followed by the amino acid sequence.

I tried the search in Prosite (http://prosite.expasy.org/scanprosite/) with the > but there were no hits. That means your sequence SSGLQGDPGQTPTAEAVQVPPGPLGL was not recognized by Prosite as a protein motif or domain--just a part of the sequence of the human collagen protein:

collagen [Homo sapiens] 1,678 aa protein.
Chromosome: X. Map: Xq22. Sex: male/female. Clone_lib: whole eye and kidney. Note: clones MS[6,17,29,71,98] and TM[27,29,30,46,51,52,53]. Accession: BAA04809.1. GI: 466538

Next I tried looking for your sequence in the NCBI Conserved Domain Database (CDD): http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi

This program did find some motifs but I don't know how important they are:

Conserved domains on [gi|466538|dbj|BAA04809.1|] collagen [Homo sapiens]

List of domain hits
| Name | Accession | Description | Interval | E-value |
| --- | --- | --- | --- | --- |
| C4 | pfam01413 | C-terminal tandem repeated domain in type 4 procollagen; Duplicated domain in C-terminus of ... | 1454-1561 | 2.17e-62 |
| C4 | pfam01413 | C-terminal tandem repeated domain in type 4 procollagen; Duplicated domain in C-terminus of ... | 1562-1676 | 3.00e-57 |
| Collagen | pfam01391 | Collagen triple helix repeat (20 copies); Members of this family belong to the collagen ... | 893-952 | 4.78e-04 |
| Collagen | pfam01391 | Collagen triple helix repeat (20 copies); Members of this family belong to the collagen ... | 1078-1135 | 5.84e-03 |
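
One rough way to judge how important the hits are is to look at the E-values: the smaller the E-value, the less likely the match arose by chance. The sketch below just filters and sorts the four hits listed above. The 0.01 cutoff is an assumed rule of thumb for illustration, not a value taken from CDD, and `rank_hits` is a made-up helper name.

```python
# Domain hits copied from the CDD results above:
# (name, accession, start, end, e_value)
HITS = [
    ("C4", "pfam01413", 1454, 1561, 2.17e-62),
    ("C4", "pfam01413", 1562, 1676, 3.00e-57),
    ("Collagen", "pfam01391", 893, 952, 4.78e-04),
    ("Collagen", "pfam01391", 1078, 1135, 5.84e-03),
]

E_VALUE_CUTOFF = 0.01  # assumed cutoff, chosen only for this example


def rank_hits(hits, cutoff=E_VALUE_CUTOFF):
    """Keep hits at or below the cutoff, smallest E-value first."""
    kept = [hit for hit in hits if hit[4] <= cutoff]
    return sorted(kept, key=lambda hit: hit[4])


if __name__ == "__main__":
    for name, accession, start, end, e_value in rank_hits(HITS):
        print(f"{name:<10}{accession:<11}{start}-{end:<6}{e_value:.2e}")
```

By this measure the two C4 hits are far stronger matches than the two collagen triple helix repeat hits, although all four pass the assumed cutoff.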

I remembered from biochem class that collagen has a triple helix structure--three molecules running parallel and cross-linked to each other: http://www.ncbi.nlm.nih.gov/books/NBK21582/
That's what makes collagen so strong. The linkages are between lysines on adjacent molecules. The usual motif is Gly-Pro-X or Gly-X-Hyp, where 'X' is any amino acid other than Gly or Pro. Glycine is a small amino acid that allows closer contact of the strands, which is why there is so much glycine in collagen. The prolines allow the strands to wind around each other in a helical form, sort of like DNA, only with three strands instead of two.
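
As a small illustration of that motif, the sketch below looks for Gly-Pro-X and Gly-X-Pro triplets in the fragment from this thread. It assumes hydroxyproline shows up as a plain P in the one-letter sequence, since the hydroxyl group is added after the protein is made. As a simplification, X is allowed to be any residue, and the function name `find_collagen_triplets` is made up for this example.

```python
import re

# The human fragment from the alignment posted earlier in the thread.
FRAGMENT = "SSGLQGDPGQTPTAEAVQVPPGPLGL"

# G-P-X or G-X-P. The lookahead lets overlapping triplets all be reported.
TRIPLET = re.compile(r"(?=(G(?:P.|.P)))")


def find_collagen_triplets(sequence):
    """Return (1-based position, triplet) for each Gly-Pro-X or Gly-X-Pro."""
    return [(m.start() + 1, m.group(1)) for m in TRIPLET.finditer(sequence)]


if __name__ == "__main__":
    for position, triplet in find_collagen_triplets(FRAGMENT):
        print(position, triplet)
```

On this fragment it finds GDP at residue 6 and GPL at residue 22, so the fragment contains a couple of collagen-like triplets but not a long run of them.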

Here's a pretty good description of collagen structure and synthesis: https://en.wikipedia.org/wiki/Collagen

If you need help with the names of the amino acids, here's a key: http://www.thelabrat.com/protocols/aminoacidtable.shtml

I hope this information helps your project. Let us know if you have more questions.

Sybee

deleted-324986 · Post by **deleted-324986** » Thu Dec 10, 2015 8:04 pm

Hi,
The domain I put in was straight from T-COFFEE, but what should I do to make PROSITE recognize it as a domain and not just a part of the sequence of a human protein?
Thank you

SciB · Post by **SciB** » Sat Dec 12, 2015 6:23 pm

Try searching Prosite again, but before you run the scan, uncheck the box that says to exclude those motifs that have a high likelihood of occurrence in the sequence. I tried this and got one hit--a site for addition of a myristyl group. This is a fatty acid that can be added to a protein to make it attach better to a lipid membrane in a cell. http://prosite.expasy.org/cgi-bin/prosi ... 39.scan.gz
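
For anyone curious how such a short motif gets matched, here is a sketch that converts a PROSITE-style pattern into a regular expression and scans the fragment. The pattern shown is the N-myristoylation site pattern as recalled from PROSITE entry PS00008, so treat it as an assumption and check it against the entry itself. The helper names are made up, the converter handles only the simple syntax used in this one pattern, and this is not how ScanProsite itself is implemented.

```python
import re

# The human fragment from the alignment posted earlier in the thread.
FRAGMENT = "SSGLQGDPGQTPTAEAVQVPPGPLGL"

# Assumed pattern for the N-myristoylation site (PROSITE PS00008).
MYRISTYL_PATTERN = "G-{EDRKHPFYW}-x(2)-[STAGCN]-{P}"


def prosite_to_regex(pattern):
    """Convert simple PROSITE syntax: x, x(n), [ABC], {ABC}, single letters."""
    parts = []
    for element in pattern.split("-"):
        repeat = ""
        match = re.fullmatch(r"(.+?)\((\d+)\)", element)
        if match:  # a repeat count such as x(2)
            element, repeat = match.group(1), "{" + match.group(2) + "}"
        if element == "x":  # any residue
            parts.append("." + repeat)
        elif element.startswith("{"):  # any residue except those listed
            parts.append("[^" + element[1:-1] + "]" + repeat)
        else:  # a literal residue or an [ABC] class
            parts.append(element + repeat)
    return "".join(parts)


def scan(sequence, pattern):
    """Return (1-based position, matched text), overlapping matches included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex(MYRISTYL_PATTERN))
    print(scan(FRAGMENT, MYRISTYL_PATTERN))
```

With this assumed pattern the fragment gives a single match, GQTPTA starting at residue 9. A pattern only six residues long will turn up by chance in many unrelated proteins, which is why Prosite leaves such motifs out unless that box is unchecked.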

Also, out of curiosity to see what would happen, I did a Prosite scan of the whole 1,678-amino-acid human collagen sequence with the exclusion box unchecked. I got a bunch of hits this time--glycine-rich regions, proline-rich regions, myristylation sites, etc. http://prosite.expasy.org/cgi-bin/prosite/PSScan.cgi

I think maybe your sequence is among those in the glycine- or proline-rich list.

Did you try a Prosite scan for the mouse sequence? Did it work? Does T-COFFEE tell you what the domain is?

Hope this helps.

Sybee
