This server is intended for the interactive prediction of the mapping between the members of
two families of interacting proteins (i.e. to predict which ligand
within one family interacts with which receptor within the other
family). The method builds on previous work (1, 2, 3) demonstrating that
interacting protein families tend to have similar phylogenetic trees.
The extension implemented in this server to predict the mapping between
the members of two families is based on the idea that the right mapping
would be the one which produces the highest similarity between the two
trees. Since it is impossible to explore all possible mappings, current
approaches use heuristics to avoid exploring the complete space of
solutions (3). As a result, these approaches
do not guarantee the best solution and can become trapped in a local
minimum. This is why it is important to interactively inspect the
proposed solution(s) and to try changes to the mapping in order to
improve it.
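The idea of scoring a candidate mapping can be sketched in a few lines of Python. This is only an illustrative sketch, not the server's code: the function names (pearson_r, mapping_score, best_mapping_exhaustive) and the representation of each family as a square distance matrix (a list of lists) are assumptions made here. The score is Pearson's correlation between corresponding pairwise distances, and the exhaustive search is only feasible for very small families, since n proteins per family give n! possible mappings.

```python
from itertools import permutations
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = sqrt(var_x * var_y)
    # Undefined when either list has zero variance; return 0.0 in that case.
    return cov / denom if denom > 0 else 0.0


def mapping_score(dist_a, dist_b, mapping):
    """Similarity of two distance matrices under a mapping.

    mapping[i] is the index of the protein in family B that is paired
    with protein i of family A (hypothetical representation).
    """
    n = len(mapping)
    xs, ys = [], []
    for i in range(n):
        for j in range(i + 1, n):
            xs.append(dist_a[i][j])
            ys.append(dist_b[mapping[i]][mapping[j]])
    return pearson_r(xs, ys)


def best_mapping_exhaustive(dist_a, dist_b):
    """Try all n! mappings; only practical for a handful of proteins."""
    n = len(dist_a)
    return max(permutations(range(n)),
               key=lambda m: mapping_score(dist_a, dist_b, m))
```

With 4 proteins per family this tries 24 mappings; with 20 proteins it would have to try 20! of them, which is why heuristic searches are used instead.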
Using this server involves two main steps:
i) Initial submission of a job. The user gives as input the two protein families
(either multiple sequence alignments or phylogenetic trees) and the
server produces an initial mapping with an associated set of scores.
The results are returned by email. The user can stop here and
analyse/process these raw results as they are, or
interactively inspect and change them in the second step.
ii) Interactive analysis of the results. In this stage, the user sends
the files obtained in the previous step to the server. A new interface is
presented where the user can visualize and interactively change the
proposed mapping.
Although submitting a job to TAG_TSEMA is a very simple and nearly automatic
process, you need to provide some basic information about your job,
such as:
- JOB NAME: Give a name to your job. This compulsory
information will help you identify the job you have submitted once it
finishes, as it will be included in the subject of forthcoming emails.
- EMAIL: Valid email address to which results (and any other
forthcoming messages regarding your job) will be sent as attachments.
NOTE FOR WINDOWS USERS: Please ensure that the returned results file is saved as *.tar.gz instead of the default *.tar.tar
- TREE/MSA for Family I/II: Path to the file with the
information on distances between the proteins within each family.
Please note that the input can be a Newick Tree and/or a Multiple
Sequence Alignment. If an MSA is submitted, the server will convert
it to a Newick Tree using ClustalW (5).
IMPORTANT NOTE: The biggest difference from our previous server, TSEMA, is that when submitting a job to TAG_TSEMA the user can group the input sequences into classes at will (e.g. subfamilies, organisms, known interacting groups...). Only members of the same class are mapped together during the automatic MonteCarlo heuristic search performed here. To assign a sequence to a class, the name of the sequence MUST have the form NAME_CLASS, for example CDK2_HUMAN, BRCA1_MYGROUP1... If no class is given to a sequence, the UNDEF (standing for Undefined) class is automatically assigned.
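The NAME_CLASS naming convention can be illustrated with a short Python sketch. The helper names (split_name_and_class, group_by_class) are hypothetical, and splitting on the last underscore is an assumption: the server's actual parsing rule for names that contain several underscores is not documented here.

```python
def split_name_and_class(sequence_name):
    """Split 'NAME_CLASS' into (name, class).

    Assumption: the class is whatever follows the LAST underscore.
    Names without a usable class part get the class 'UNDEF'.
    """
    name, sep, cls = sequence_name.rpartition("_")
    if not sep or not name or not cls:
        return sequence_name, "UNDEF"
    return name, cls


def group_by_class(sequence_names):
    """Group full sequence names by their class label."""
    groups = {}
    for seq in sequence_names:
        _, cls = split_name_and_class(seq)
        groups.setdefault(cls, []).append(seq)
    return groups
```

For example, group_by_class(["CDK2_HUMAN", "BRCA1_MYGROUP1", "P53"]) places the three sequences in the classes HUMAN, MYGROUP1 and UNDEF respectively. Checking names this way before submission may help to catch sequences that would otherwise fall into the UNDEF class unintentionally.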
Advanced users might want to change the default parameters such as:
- NUMBER OF MONTECARLO RUNS: The number of times a MonteCarlo heuristic search is performed (a million iterations each).
Please remember that this is a CPU-intensive process, so
change this parameter responsibly.
- SUBMITTED DATA TYPE: Although the detection of the submitted
data type is done automatically, you can also force the type in case you
receive unexpected errors regarding problems with formats.
- SCORING FUNCTION: The default scoring function for measuring
the similarity between the trees (distance matrices) is Pearson's T
Correlation Coefficient. However, you can also use Pearson's R or RMSD (Root Mean
Square Deviation) as alternative scoring functions. Take a look at the
references for more information on the methodology for measuring
similarities between phylogenetic trees (1, 2, 3).
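To give a feel for what a single MonteCarlo run does, here is a minimal Python sketch using RMSD as the scoring function. It is not the server's implementation: the swap move, the Metropolis acceptance rule, the temperature parameter, the default number of iterations and the function names (rmsd_score, monte_carlo_run) are all assumptions made for illustration, and classes are ignored. Families are assumed to be square distance matrices of the same size, with at least two proteins each.

```python
import math
import random


def rmsd_score(dist_a, dist_b, mapping):
    """RMSD between corresponding pairwise distances (lower is better)."""
    n = len(mapping)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            diff = dist_a[i][j] - dist_b[mapping[i]][mapping[j]]
            total += diff * diff
            pairs += 1
    return math.sqrt(total / pairs)


def monte_carlo_run(dist_a, dist_b, iterations=5000, temperature=1.0, seed=0):
    """One heuristic run: random swaps accepted with a Metropolis rule."""
    rng = random.Random(seed)
    n = len(dist_a)
    current = list(range(n))
    rng.shuffle(current)  # random starting mapping
    current_score = rmsd_score(dist_a, dist_b, current)
    best, best_score = list(current), current_score
    for _ in range(iterations):
        # Propose swapping the partners of two proteins of family A.
        i, j = rng.sample(range(n), 2)
        current[i], current[j] = current[j], current[i]
        new_score = rmsd_score(dist_a, dist_b, current)
        delta = new_score - current_score
        # Always accept improvements; accept worse mappings with a
        # probability that decreases with how much worse they are.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current_score = new_score
            if current_score < best_score:
                best, best_score = list(current), current_score
        else:
            # Reject the move: undo the swap.
            current[i], current[j] = current[j], current[i]
    return best, best_score
```

Repeating such a run with different seeds yields a stack of solutions, which is the kind of data the number of MonteCarlo runs parameter controls. Accepting some worse mappings is what allows a run to leave a local minimum, but no run is guaranteed to find the best mapping.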
Once your job has finished, the results will be returned
to you by email. Now you can either use these raw results, or submit
them to the analysis part of the server to visualize and interactively
change them.
As with the job submission, some information must be
provided, such as:
- ANALYSIS NAME: Give a name to your job. This compulsory
information will help you identify the job you have submitted once it
finishes, as it will be included in the subject of any forthcoming email
regarding it.
- EMAIL: Valid email address where messages regarding the job
will be sent.
- RESULTS FILE: You must provide the file that TAG_TSEMA sent to your email in the
previous step. It is highly recommended to submit your file compressed
as a *.tar.gz file to avoid network overloading. NOTE FOR WINDOWS USERS: Please ensure that the returned results file is saved as *.tar.gz instead of the default *.tar.tar
Advanced users might want to change the default parameters such as:
- TOLERANCE: This parameter controls the percentage of
solutions (ordered by score) that are rejected when calculating the
coincidence. Restrictive analyses are less noisy, but might be
incomplete in cases with high promiscuity.
- SCORING FUNCTION: The default scoring function is Pearson's
Correlation Coefficient. However, you can also use RMSD (Root Mean
Square Deviation) as an alternative scoring function (1, 2, 3).
Results from the server can be improved during a human-supervised
process.
In this interface, the user can interactively change some of the
pairings predicted by the automatic process described previously and
see how the scores change accordingly.
The "coincidence table" shows in how many of the MonteCarlo runs each
pair of proteins is linked in the final mappings. Pairs of proteins
with a high value in this table may or may not be linked in the final
overall best mapping reported. So, a good starting point could be to
force some of these pairings to see what this new mapping looks like in
the trees, which new correlation it produces, and how it affects the
rest of the scores.
Meaning of the scores:
Reliability: The Reliability score for a pair A-B (Rel_AB) indicates in how many of the 500 trees A is linked to B, as a percentage of the total number of pairings for A. Note that the total number of pairings for A may be different from 500 since in some solutions A might not be linked to any protein. For the same reason, Rel_AB could be different from Rel_BA, since the number of pairings for both proteins could be different. This is why there are two values of reliability for each pair. Rel_AB=100% would mean that in all the solutions where A was linked, it was linked to B. This score gives an idea of the consistency of a given link.
Segregation: The Segregation score (Seg) for a pair gives an idea of the difference between the Reliability of that pair and the next best reliability.
Seg_AB= (Rel_AB-2nd_best_Rel_A)*100/SUM_i(Rel_Ai)
If Rel_AB is not the highest Rel_Ai, the highest Rel_Ai is used instead of "2nd_best_Rel_A", and hence Seg_AB would be negative. A high value of this score indicates not only that the pair appears in many simulations, but also that it is far from the next most frequent pair for that protein. A negative value indicates that the pair is not the most frequent one for that protein. There is a color scale for these two scores, from red (worst) to blue (best).
Note that both scores can be applied not only to the coincidence matrix, but also to its transpose (the results are two-dimensional).
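The two scores can be reproduced from a stack of solutions with a short Python sketch. The data layout (each solution as a dictionary from a protein of family I to its partner in family II, with unpaired proteins simply omitted) and the function names are assumptions made for illustration; the formulas follow the definitions of Rel_AB and Seg_AB given above.

```python
from collections import Counter


def coincidence_counts(solutions):
    """Count in how many solutions each pair (a, b) is linked."""
    counts = Counter()
    for mapping in solutions:
        for a, b in mapping.items():
            counts[(a, b)] += 1
    return counts


def reliability_row(counts, a):
    """Rel_Ab (in %) for every partner b that protein a was linked to."""
    row = {b: c for (x, b), c in counts.items() if x == a}
    total = sum(row.values())  # total number of pairings for a
    return {b: 100.0 * c / total for b, c in row.items()}


def segregation(rel_row, b):
    """Seg_AB = (Rel_AB - reference) * 100 / SUM_i(Rel_Ai).

    The reference is the second best reliability when b is the best
    partner, and the best reliability otherwise (giving a negative value).
    """
    ranked = sorted(rel_row.values(), reverse=True)
    rel_ab = rel_row[b]
    if rel_ab == ranked[0]:
        reference = ranked[1] if len(ranked) > 1 else 0.0
    else:
        reference = ranked[0]
    return (rel_ab - reference) * 100.0 / sum(ranked)
```

If A1 is linked to B1 in three solutions, to B2 in one, and is unpaired in a fifth, then Rel for A1-B1 is 75% and Seg is 50, while A1-B2 gets 25% and -50. The values for the transpose (Rel_BA, Seg_BA) can be obtained by applying the same functions to counts whose pairs have been swapped.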
This GUI is divided into three different
areas:
1.- Mapping Correction: This area provides a graphical
interface to easily change the interacting partners. The GUI also
provides information about the reliability and the segregation (for
both FamilyI and FamilyII) of the different protein pairs, and a control panel
to recalculate the improved distance correlation or, if desired, to
undo changes.
2.- Tree Representation: A downloadable image shows the
mappings over a representation of both trees. A color scale shows the
reliability of the mapping taking into account the whole stack of
solutions.
3.- Correlation Plots: Two plots are shown in this area. The
first one shows the current and the previously calculated correlation, and
the second one the differences between the current mapping and the one
suggested by the server by default. Please note that a color scheme
indicates whether a specific position appears, remains, or disappears
from the plot.
This page contains some examples to help the user get familiar with the
server. For each one of the examples, the user can either take the
original multiple sequence alignments of the families (or trees) and
start the whole process from the beginning, or just take the precomputed
raw results and run the second part of the server only (interactive
analysis).
IMPORTANT: New users should start with this TUTORIAL to get familiar with TAG_TSEMA's interface and calculations.
Izarzugaza JM, Juan D, Pons C, Valencia A, Pazos F.
"TAG_TSEMA: interactive prediction of protein pairings between interacting families." In Press
1.- Izarzugaza JM, Juan D, Pons C, Ranea JA, Valencia A, Pazos F - "TSEMA: interactive prediction of protein pairings between interacting families", Nucleic Acids Res. 2006 Jul 1;34(Web Server issue):W315-9
2.- Goh, Bogan, Joachimiak, Walther and Cohen - "Coevolution of Proteins
with their Interaction Partners", J. Mol. Biol. 2000
3.- Pazos and Valencia - "Similarity of phylogenetic trees as indicator
of protein-protein interaction", Protein Eng. 2001
4.- Ramani and Marcotte - "Exploiting the Co-evolution of Interacting
Proteins to Discover Interaction Specificity", J. Mol. Biol. 2003
5.- Thompson, Higgins and Gibson - "CLUSTAL W: Improving the sensitivity
of progressive multiple sequence alignment through sequence weighting,
position-specific gap penalties and weight matrix choice", Nucleic
Acids Res. 1994
Jose Maria González-Izarzugaza (jmgonzalez(AT)cnio(DOT)es)
Spanish National Cancer Research Center - Centro Nacional de Investigaciones Oncológicas (CNIO)
Structural Bioinformatics Group - Grupo de Bioinformática Estructural
C/Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Phone: + (34) 917 328 000
Fax: + (34) 912 246 980