MS Protein Identification Strategies

Search Fundamentals: In the world of mass analysis the fundamental keys to identifying proteins are threefold: heterogeneity (natural or induced in silico), search constraints, and the ability to search an appropriate database with mass data.


Exploiting Heterogeneity: We cannot, and should not, identify a compound from a single mass. A single peak in a spectrum is rather nondescript and anonymous. Many things in the world have the same mass. I often think of mass spectrometry as if I had to view the world entirely through mass. Many things would have the same mass: a chair, a table, a monitor. If you were blind to all things but mass, these three things would be identical. However, if you could impart some heterogeneity by smashing the chair against the wall and measuring the mass of the pieces, you would say, "Aha! These are chair pieces! It was a chair!" One thing that we forgot to mention is that mass spectrometry is generally a destructive technique. (IonSource accepts no liability for lecturers who choose to smash chairs in their lectures to demonstrate the power of MS/MS.)

Heterogeneity, whether natural or induced, is a central theme in all identifications in science. As scientists our specialty is pattern recognition. Monoliths yield few patterns and hence few clues.

Heterogeneity is usually imparted to the unknown protein with an enzyme digestion. Trypsin is most commonly used. Trypsin cleaves proteins on the C-terminal side of the basic amino acid residues lysine and arginine, except when they are immediately followed by proline. In addition, trypsin sometimes has difficulty cleaving when the next residue is aspartic acid. Trypsin is used for the following reasons: it has a well-defined cleavage specificity (1), it is inexpensive, it is not finicky, and it yields peptides with an average length of about ten amino acid residues.
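The cleavage rule described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a production digestion tool: the function name `tryptic_digest`, the `missed_cleavages` parameter, and the example sequence are all invented for this sketch, and the occasional difficulty trypsin has before aspartic acid is not modeled.

```python
def tryptic_digest(protein, missed_cleavages=0):
    """In-silico digest: cut after K or R unless the next residue is P.

    Hypothetical helper for illustration only.
    """
    pieces, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Trailing piece: the C-terminus of the protein, which need not end in K/R.
        pieces.append(protein[start:])

    # Join neighbouring pieces to model sites that trypsin failed to cut.
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


# Made-up sequence; note that the K-R bond is cut but the R-P bond is not.
print(tryptic_digest("AAKRPGGRLLK"))  # ['AAK', 'RPGGR', 'LLK']
```

Allowing one missed cleavage (`tryptic_digest("AAKRPGGRLLK", 1)`) adds the two joined peptides `AAKRPGGR` and `RPGGRLLK` to the list.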

Sometimes chemical cleavages are performed, such as CNBr cleavage, and yes, sometimes multiple enzymes are employed. Still, trypsin by itself is used the majority of the time.

Inducing Gas Phase Heterogeneity: Further heterogeneity can be imparted once the peptide is ionized and enters the gas phase. Peptides can be fragmented in an ion trap, where they are excited and slammed into the damping gas, helium. Peptides can also be fragmented in a collision cell filled with argon in a triple quadrupole mass spectrometer or other hybrid tandem mass spectrometer. One thing that should be mentioned is that, generally, the larger the peptide the more difficult it is to fragment. A larger peptide can distribute the collisional energy more readily and may survive the car wreck with argon more easily than a smaller peptide could. This is another reason researchers favor trypsin: the peptides it produces are generally small enough to fragment yet large enough to yield significant sequence detail.
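To make the idea of gas-phase "pieces" concrete, the sketch below lists the fragment masses one might expect from a small peptide. It assumes the common case in which collisional fragmentation breaks backbone amide bonds to give b ions (N-terminal pieces) and y ions (C-terminal pieces); the text above does not name these ion types, so treat that as an assumption of this example. The function name `fragment_ladder` and the peptide `GAK` are hypothetical, and only singly charged, unmodified fragments are considered.

```python
# Monoisotopic residue masses in daltons (standard values, 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O, monoisotopic
PROTON = 1.007276   # mass of the charge-carrying proton


def fragment_ladder(peptide):
    """Return (b_ions, y_ions): singly charged m/z values, shortest fragment first."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]

    b_ions, running = [], 0.0
    for m in masses[:-1]:            # b1 .. b(n-1), built from the N-terminus
        running += m
        b_ions.append(running + PROTON)

    y_ions, running = [], 0.0
    for m in reversed(masses[1:]):   # y1 .. y(n-1), built from the C-terminus
        running += m
        y_ions.append(running + WATER + PROTON)

    return b_ions, y_ions


b_ions, y_ions = fragment_ladder("GAK")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i} = {b:.4f}   y{i} = {y:.4f}")
```

A single precursor mass becomes a ladder of masses whose spacings correspond to individual residues, which is the pattern a search engine can match against a database sequence.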

Search Constraints: Constraints are an important component of any search strategy. If you are a zoologist in the field and want to identify an animal, you need to shrink the universe of possibilities, and a few observations will help. For example, knowing that the animal you want to ID is a bird, and a black bird, weighing approximately 1 kg, with a black beak, will narrow the field considerably.

Enzyme Specificity Constraints: The most fundamental and most important constraint placed on a proteomics search is enzyme specificity. We know that every tryptic peptide will have a K or an R (lysine or arginine) at its carboxyl terminus, and we know that the N-1 amino acid, the residue immediately preceding the peptide in the protein sequence, will also be a K or an R. Of course this ignores the peptide from the amino terminus of the protein, which has no N-1 amino acid, and the peptide from the carboxyl terminus of the protein, which may or may not end in K or R. Even though the N- and C-terminal peptides do not carry the full constraint, they will still be useful in the match as long as the protein in the database has the same sequence.
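The constraint just described can be written as a small check on a candidate peptide within its parent protein. This is a sketch under assumptions: the helper names `is_cleavage_site` and `is_fully_tryptic`, the 0-based half-open `start`/`end` convention, and the example sequence are all invented here, and the proline exception mentioned earlier in the text is applied at both ends.

```python
def is_cleavage_site(protein, pos):
    """True if trypsin is expected to cut between protein[pos-1] and protein[pos]."""
    if pos <= 0 or pos >= len(protein):
        return False
    return protein[pos - 1] in "KR" and protein[pos] != "P"


def is_fully_tryptic(protein, start, end):
    """Check the peptide protein[start:end] against the enzyme constraint.

    The protein's own N- and C-termini are accepted without a K/R requirement.
    """
    n_ok = start == 0 or is_cleavage_site(protein, start)
    c_ok = end == len(protein) or is_cleavage_site(protein, end)
    return n_ok and c_ok


protein = "AAKRPGGRLLK"  # made-up sequence
print(is_fully_tryptic(protein, 3, 8))   # RPGGR: preceded by K, ends in R
print(is_fully_tryptic(protein, 5, 8))   # GGR: preceded by P, so not tryptic at its N-side
```

A search engine that enforces this rule only has to consider peptides that pass the check, which shrinks the list of candidates for every observed mass.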

Mass Constraints: Mass accuracy is the next most useful constraint. The tighter the mass constraint, the more accurate the match will be, up to a certain limit. One of the most common mistakes in a search is to set this constraint too narrowly, beyond the limits of the mass spectrometer (2).
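The trade-off in the mass constraint can be illustrated with a tolerance expressed in parts per million (ppm), one common way search engines accept it. The sketch below is hypothetical throughout: the function names, the candidate masses, and the assumed 5 ppm instrument error are made up to show the effect and do not come from any particular search program.

```python
def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def within_tolerance(observed, theoretical, tol_ppm):
    return abs(ppm_error(observed, theoretical)) <= tol_ppm


def matching_candidates(observed, candidates, tol_ppm):
    """Keep the theoretical masses that fall inside the tolerance window."""
    return [m for m in candidates if within_tolerance(observed, m, tol_ppm)]


# Assume the true peptide mass is 1000.000 Da and the instrument reads 5 ppm high.
observed = 1000.005
candidates = [999.950, 1000.000, 1000.030, 1000.500]  # invented database masses

for tol in (30, 10, 1):
    print(tol, "ppm ->", matching_candidates(observed, candidates, tol))
```

In this made-up case a 30 ppm window keeps an extra, wrong candidate, a 10 ppm window keeps only the correct one, and a 1 ppm window, tighter than the instrument's actual error, discards the correct answer entirely.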

Other Constraints Used in a Proteomic Search: Two other, less important constraints are intact protein molecular weight and isoelectric point. The isoelectric point can be estimated from a 1D or 2D gel. While these constraints can be useful, there are significant drawbacks. Not all search engines apply the intact mass constraint in the same way. Some have a sliding scale, meaning that all the peptides found for that protein must fit within a region that obeys the mass constraint. Most intact mass constraints limit the search to proteins whose theoretical mass, calculated strictly from the full database sequence, falls within a window. This can throw off the search, since many proteins in the database are larger than the mature protein due to uncleaved signal sequences or immature enzyme pre/pro sequences. Some proteins in vivo are heavily glycosylated or run aberrantly on an SDS-PAGE gel, which would also throw off the intact protein mass estimate. The isoelectric point in many instances cannot be determined from the database sequence, since many proteins in nature are post-translationally modified with glycosylations, phosphorylations, and deamidations, all of which can affect their charge characteristics.

Our Philosophy on the Species Constraint: We normally avoid placing a species constraint. We routinely search the UniProt or NCBInr protein database without a species constraint because protein databases are not complete, and it is possible to identify a homologous protein from another species even if that protein is not entered for your species of interest. Also, some database searching programs suffer when the database is too small, due to inadequate sampling statistics. Not all programs suffer from this shortfall. It would be ideal if the database contained every protein sequence expressed for a particular species; however, this day has not yet arrived. It is also helpful to know that the larger the database, the longer the search will take.

Other Constraints: Some search engines allow a myriad of other constraints derived from other observations, for example partial amino acid sequence or immonium ions. Many of these sequence or composition constraints are useful.

Database: As in a car race, where the fastest car with the best driver wins, "in the proteomics race she or he who has the most accurate, most complete database wins." If you cannot come up with an appropriate database, then the mass spectrometry data you collected will be useless in a database search. We will talk more about databases later.

Congratulations!: You have made it through all of the pontifications. Now that you know some of the rules, proceed to the techniques pages that follow and try some of the exercises; they're easy. Good luck, you will be a proteomics expert soon.


References:

  1. Olsen JV, Ong SE, Mann M. Trypsin cleaves exclusively C-terminal to arginine and lysine residues. Mol Cell Proteomics. 2004 Jun;3(6):608-14. Epub 2004 Mar 19.

  2. Search Optimization Presentation (.pdf presentation)






Copyright © 2004-2016 IonSource. All rights reserved.
Last updated:  Tuesday, January 19, 2016 02:48:19 PM