Note: At the time of this review (12/26/2006), Proteome Software was not an advertising client of IonSource.
Program Reviewed: Scaffold ®
Version Reviewed: Scaffold-01_06_03
Date Reviewed: 12/26/2006
Vendor: Proteome Software, Inc.
Web site: http://www.proteomesoftware.com
Use: Proteomics, Comparative Proteomics, Protein Identification
Cost: $5000 academic, $6000 institutional
Fully Functional Demo Available for Download: Yes
Download Site: http://www.proteomesoftware.com/Proteome_software_prod_Scaffold_download.html
Contact Person at Proteome Software: Mark Pitman, mark.pitman@proteomesoftware.com, 800-944-6027
Please Note: BioWorks ® is a registered trademark of Thermo Electron Corporation, Mascot ® is a registered trademark of Matrix Science Ltd., Scaffold ® is a registered trademark of Proteome Software, Inc., SEQUEST ® is a registered trademark of the University of Washington, X!Tandem was created by Craig and Beavis of the Manitoba Proteomics Center
Introduction:
Scaffold combines the search results from multiple protein sequence search engines to give more complete proteomic coverage, and it is also very useful for validating or corroborating results. The reader is referred to a presentation by Brian Searle on the advantages of combining multiple searches at the Proteome Software web site. As well as being able to import, combine, and display Mascot and Sequest searches, Scaffold comes bundled with the search engine X!Tandem. X!Tandem is the search engine used by The Global Proteome Machine, "theGPM", www.theGPM.org. The additional X!Tandem search results are used to further validate, and add coverage to, your initial or primary searches. Scaffold displays imported Sequest and Mascot results quickly; naturally, it takes longer when an additional X!Tandem search is done. To help in this regard, Scaffold can use up to four processors in an X!Tandem search, and it can also create, up front, a subset protein database from the initial searches that X!Tandem can search more quickly. The ability to use multiple processors on a single computer is included at no extra charge. Scaffold also incorporates algorithms for confident protein identification through statistical analysis. For more information, the reader is referred to a discussion of MS statistical analysis by Brian Searle at the Proteome Software web site. At IonSource, statistical validation of our Sequest results has increased our use of Sequest for database searching, which has been a welcome and important side benefit of using the program. Scaffold can combine multiple fractions from a MudPIT run and can display multiple experiments side by side, thus serving a comparative proteomics function. The program comes with very good documentation, and the Proteome Software web site is packed with valuable information and tutorials.
The presentation that follows is meant to give you a feel for how the program looks and for some of the basic workflows.
Installation:
Installation is easy, takes about 2 minutes, and uses a standard installation program. Scaffold is about 34 MB and downloads quickly over a broadband internet connection from the download site, http://www.proteomesoftware.com/. We have been using Scaffold for about a year and have noticed no conflict or interference with other programs. It has been a very stable program and seldom, if ever, crashes. The PS team is very proactive and has done much in the last year to improve the program and reduce the number of bugs. If you find a bug, be sure to notify them; they will be happy to help you.
Program Walk Through:
Step 1. We will walk you through an example of how data can be parsed, viewed, and validated with Scaffold. When you launch Scaffold, a wizard comes up to walk you through loading a new result or opening a previously saved analysis, see Figure 1 below. The wizard makes the program easy to use, even without reading the manual. Once you have been through the wizard a few times it is easy enough to import results without it. However, the wizard presents such a convenient and simple entry point that we still find it useful. Importing results into Scaffold is easy. You will need the Mascot .dat files or a Sequest folder containing .dta and .out files. You will not be able to import searches run on the public Mascot server, since you cannot get to the actual .dat file, which resides on a server somewhere over in England.
Figure 1.

Step 2. For this walk through example we will import results from a Sequest search. We will choose the "New" option as shown in Figure 1 above. The dialog shown below, in Figure 2, is the next screen that comes up. At this point you can name your sample if you wish. If we were analyzing a MudPIT run we would check the MudPIT box and Scaffold would combine the salt steps for us automatically. In the past we would manually combine concatenated dta files in preparation for a Mascot search, which was a very painful exercise. Now we search the MudPIT steps separately on Mascot or Sequest and combine them with Scaffold. For this example we will be importing gel spot MS data. After filling out the sample info, click next in Figure 2 to continue.
Figure 2.

Step 3. The next step is to queue the Sequest result files for loading. Press the "Queue Files For Loading" button, as shown in Figure 3, and browse to the Sequest searches. The import wizard makes every operation very simple. Also the iterative design of this wizard is pure genius.
Figure 3.

Step 4. Select the folders containing the Sequest searches and press "Add to Import Queue", as shown in Figure 4. If you want to add all of the files in a folder you can just click on the parent folder. Opening a folder containing 50,000 DTAs can take a long time, so choosing the parent folder might be a good option. We have heard that Scaffold can even handle zipped DTAs.
Figure 4.

Step 5. The next step is to select a FASTA sequence database. Either select a Scaffold formatted database from the pull down menu or add a new database by clicking on the "Add Database" button, both shown in Figure 5.
Figure 5.

Unfortunately, Scaffold does not yet handle nucleotide databases. You can select a preexisting Scaffold formatted protein database that has already been parsed, as mentioned above, or import a new FASTA sequence database. You should use the same database that the initial Sequest or Mascot search used. It is convenient to have Scaffold on the same computer as Sequest, so that you can just browse to the Sequest database folder. To add a new database to the Scaffold list, just point the parser to the folder containing your Sequest or Mascot databases. You can select the predefined definitions (IPI, Swiss-Prot, NCBI, etc.), which are very handy, or you can write your own regular expressions to parse the database header. See a screen shot of the parser in Figure 6. It is important that you get the parsing of the header right so that your results are displayed correctly in the Scaffold output. The parsing interface, shown below, is very easy to use. Some clues that you got the parsing wrong are that the results cannot display the protein names, or that you get an error saying the protein sequence is not in the database. We have found the Proteome Software staff very helpful when we have had the rare parsing difficulty. In recent versions of Scaffold, database parsing is almost transparent to the user. Database parsing and linking to database information and annotation is much easier than with any other program we have used.
Figure 6.
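To give a feel for what "writing your own regular expressions to parse the database header" involves, here is a minimal Python sketch. The patterns, the function names, and the sample headers are our own assumptions for illustration; they are not Scaffold's predefined definitions, and real IPI, Swiss-Prot, and NCBI headers vary between releases.

```python
import re

# Hypothetical header patterns, for illustration only; these are NOT
# Scaffold's predefined definitions. Each pattern captures an accession
# and a description from a FASTA header line.
HEADER_PATTERNS = {
    "ncbi_gi": re.compile(r"^>gi\|(?P<accession>\d+)\|\S*\s+(?P<description>.*)$"),
    "generic": re.compile(r"^>(?P<accession>\S+)\s*(?P<description>.*)$"),
}


def parse_header(line, style="generic"):
    """Return (accession, description), or None if the header does not match."""
    match = HEADER_PATTERNS[style].match(line.strip())
    if match is None:
        return None
    return match.group("accession"), match.group("description")


def unparsed_headers(lines, style="generic"):
    """List the header lines that the chosen pattern fails to parse."""
    return [ln for ln in lines
            if ln.startswith(">") and parse_header(ln, style) is None]
```

Running something like `unparsed_headers` over a database before loading it is one way to spot entries that would later show up without protein names, for example in a database stitched together from sources with different header formats.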

Step 6. The next step is to select a parameter set. (In recent versions this dialog, shown in Figures 7a and 7b, has been eliminated. Parameters are now directly imported from the primary searches, very cool. It is still important to use the same parameters when searching with Mascot and Sequest. We have left this section in the review for historical reasons and for the directions as they relate to the primary searches.) For example, with our ion trap data we always use a 2 u tolerance for the parent mass and a 0.8 u tolerance for the fragment masses, see Figures 7a and 7b. It is important that you reproduce the settings that you used in your initial search. If you are not certain what parameters you used, take a look at the Sequest parameter file that is generated with each Sequest search. You can usually find it in the folder that contains the .out files, and it can be opened with a common text editor such as Notepad. The default parameters for Sequest in Thermo's BioWorks are 1.5 u for the parent mass and 0.0 for the fragment masses. We change these to 2 and 0.8 respectively, just to be consistent throughout all of our searches. If you are planning on combining Sequest and Mascot searches, it is best to let Sequest generate the DTA files and then let Sequest and Mascot use exactly the same pool of DTA files.
Figure 7a
In the recent versions this dialog has been eliminated. Parameters are now directly imported from the primary searches.
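If you want to confirm the tolerances used in a primary search without opening the parameter file by hand, a small script can do it. The sketch below assumes the Sequest parameter file is plain text made of "name = value" lines and that the tolerance entries are called peptide_mass_tolerance and fragment_ion_tolerance; both the file layout and the key names are assumptions on our part, so check them against your own file.

```python
def read_params(text):
    """Parse 'name = value' lines, ignoring anything after ';' or '#'.

    Lines without '=' (blank lines, section headers) are skipped.
    """
    params = {}
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        params[name.strip()] = value.strip()
    return params


def tolerance_mismatches(params, expected, eps=1e-9):
    """Return {name: (found, wanted)} for entries that differ from `expected`."""
    mismatches = {}
    for name, wanted in expected.items():
        found = params.get(name)
        try:
            same = found is not None and abs(float(found) - wanted) <= eps
        except ValueError:
            same = False  # value present but not numeric
        if not same:
            mismatches[name] = (found, wanted)
    return mismatches


# Assumed key names; the 2 u and 0.8 u values are the ones we use in the text.
EXPECTED = {"peptide_mass_tolerance": 2.0, "fragment_ion_tolerance": 0.8}
```

An empty result from `tolerance_mismatches(read_params(text), EXPECTED)` means the file agrees with the tolerances you intended to use.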

The parameter dialog is simple, see Figure 7b below. Change each selection as appropriate for the experiment that you performed. The modifications are directly imported from Unimod. (Unimod is a public domain, protein modification database that accepts user defined modifications.) With Scaffold, the mass shift is shown right next to the name of the modification, pretty cool. Often, with other programs, modification acronyms can be confusing. Putting the modification mass next to the acronym often removes this ambiguity. To make multiple modification selections with Scaffold just hold the control key down and click to select multiple mods.
Figure 7b.
In the recent versions this dialog has been eliminated. Parameters are now directly imported from the primary searches.

Step 7. Press the "OK" or the "Load and Analyze" button, as shown in Figure 8 below, and Scaffold will analyze the data. If you have ticked the X!Tandem tick box, Scaffold will perform an X!Tandem search and validate all results using the built-in statistical algorithms. If you just import the Sequest results without an X!Tandem search, Scaffold will display the results fairly quickly, and the peptide statistical analysis is still performed. If you choose to do an X!Tandem search, which is recommended to exploit the multi-search power of Scaffold, the process will take a bit longer. Often we import the Sequest or Mascot results first, just to have a quick look, and then go back and re-import them with the X!Tandem search selected. Unfortunately, an X!Tandem search cannot be performed after you import and process the Mascot and Sequest results; you must choose to do an X!Tandem search at the time you import the primary data. Also, you cannot do an X!Tandem search on its own, unless you run X!Tandem from the command line and import the X!Tandem result into Scaffold. Under normal Scaffold operation, to perform an X!Tandem search you must also be importing primary searches from Mascot or Sequest. An additional benefit of performing an X!Tandem search with Scaffold is that you can look for additional mods not specified in the primary Sequest or Mascot search. Performing an X!Tandem search can add substantial data compilation time, depending on the size of the database searched. Ticking the "search subset database" box restricts X!Tandem to a subset database, making the additional search go much faster. The downside of a subset database search is that it cannot find proteins that were missed in the primary search, since it searches only the list of proteins already found there.
The upside of an X!Tandem subset database search is that you can improve coverage of the proteins already found, and you can also look for additional mods not specified in the primary searches.
Figure 8.

Step 8. Figure 9 below shows the result page that appeared for our 12 gel spots. The columns show the individual gel spots and the rows show the proteins, so you can see which protein was found in which gel spot.
Figure 9.

Figure 9 illustrates how the initial results are displayed. Personally, we prefer to display the number of unique peptides found for each protein, which we think is a better discriminator. For example, it may be possible to get to 100% probability for a protein with two identified peptides, but we know that the number of peptides found and the sequence coverage add important information. The number of unique peptides found for a protein hit is related to the protein's abundance, which can be related to its significance. To display the unique peptides for each hit, go to the "Display Options" pull down menu located above the "Probability Legend" and select "Number of Unique Peptides". The "Display Options" pull down menu is shown in Figure 9. Figure 10 shows the results page displaying unique peptides. The result squares are still color coded to give you the probability, so there is definitely more information in this representation of the data.
Figure 10.
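For readers who want to reproduce a "number of unique peptides" table from their own exported peptide lists, the counting can be sketched in a few lines of Python. We assume here that "unique" means distinct peptide sequences per protein and sample; Scaffold's own definition may treat charge states or modified forms differently, and the function and tuple layout below are hypothetical.

```python
from collections import defaultdict


def unique_peptide_counts(hits):
    """Count distinct peptide sequences per (protein, sample) pair.

    `hits` is an iterable of (protein, sample, peptide_sequence) tuples;
    repeated identifications of the same sequence are counted once.
    """
    seen = defaultdict(set)
    for protein, sample, peptide in hits:
        seen[(protein, sample)].add(peptide)
    return {key: len(peptides) for key, peptides in seen.items()}
```

Each key of the returned dictionary corresponds to one square in the samples view: one protein row in one gel spot column.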

There are really only two data output views that your clients will be interested in: the sample view, shown above, and the protein view, shown in Figure 11 below. Figure 11 shows all of the spots that have the first protein from the samples view in common. The upper left hand window in Figure 11 shows a cartoon of the coverage maps for each of the samples. We have clicked the top coverage map to display its peptides to the right and the detailed protein coverage below. In the peptide view, notice the Sequest scores to the right of each peptide. If we had folded in Mascot and X!Tandem searches, their scores would also be listed. Amino acid modifications are highlighted in green. As a new feature, Scaffold no longer highlights static mods automatically, which helps draw the eye to the more interesting and perhaps more significant variable modifications. You can turn on static mod highlighting by right clicking on the coverage map view and choosing to turn this function on. When we first observed this behavior we thought that the program was failing to highlight these residues. As it turns out, it is one of many unique and useful features that the programmers at PSI have added to Scaffold in the last year (2006).
Figure 11.

In the bottom frame of the Proteins view, notice the tabs. You can click on these tabs to see similar or homologous proteins in the database, you can see the spectrum view as shown in the figure below, and you can see the model error plot, that is the delta mass for the fragment ions, with the far right tab.
Figure 12.

As shown above in Figure 12, Scaffold does a really good job at displaying a spectral match. It also does a very good job at labeling the peaks and showing internal cleavages. Being able to identify internal cleavages is important if you have a peptide containing proline residues, and many of us do. The people at Proteome Software have really shown great attention to detail.
Step 9. The next step in fully exploring your data may be a second round search. For example, one search routine is to first search a big database like the NCBI nr (1,300,000 KB) with few modifications. We normally do an indexed search of the NCBI nr with a static modification on Cys, then do a second round search of a subset database of all the proteins found in the first search, using a broader selection of modifications like oxidation and phosphorylation. If you only have a computer with one or two processors, you need to be a little clever with big search jobs or they can take forever, especially if you search with too many variable modifications. A second round search of a smaller database helps in this respect. A single Sequest search of the entire NCBI nr database with all of your selected modifications could take days to complete on a single processor, so being able to create smaller subset databases can be very helpful.
Figure 13.

Searching a Subset Database
To fully explore your data in a second round search, Scaffold supplies several easy to use tools. As we previously mentioned, one of these tools is a subset database generator. Just go to the "export" pull down menu, as shown above in Figure 13, and choose "Subset Database"; the dialog shown below in Figure 14 appears. This tool makes a subset database of all of the proteins found in the primary search. You can affect the size of the exported database by changing the parameters, as shown below.
Figure 14.

Once the subset database is created you can then export all of the unidentified spectra. The dialog for exporting spectra is shown in Figure 15 below. You can export them as concatenated DTAs or in any of the other forms offered in the pull down menu in Figure 15. Part of the dialog is cut off in Figure 15, sorry; you can also export your data as Sequest MS2s. This is a cool little tool: you can convert Mascot files to DTAs or vice versa, or DTAs to concatenated DTAs.
Figure 15.

Once you have exported the unidentified spectra, and the subset database, the second round search can begin. Since the subset database is a lot smaller, usually ~1000 KB, a search with many variable modifications can go a lot faster. Once this re-search is done with Sequest or Mascot the results can be imported and merged with your first pass results. The utilities for a second round search are very powerful features.
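Conceptually, a subset database is just the original FASTA file filtered down to the proteins found in the first pass. The Python sketch below shows that idea only; it is not how Scaffold builds its export, and it assumes that the accession is the first whitespace-delimited token after the ">" in each header.

```python
def subset_fasta(fasta_text, accessions):
    """Keep only FASTA records whose accession is in `accessions`.

    The accession is assumed to be the first token of the header line.
    Sequence lines are kept or dropped together with their header.
    """
    wanted = set(accessions)
    kept = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            keep = bool(tokens) and tokens[0] in wanted
        if keep:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")
```

Because every record not on the list is dropped, a second round search against the result can only add peptides and modifications for proteins that were already found.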
We have heard a few negative comments about making subset databases as a tool for faster searching of modifications. We believe that the second round search strategy is totally valid and sound. Modifications like phosphorylation and acetylation are seldom found as a single peptide hit. If a modified peptide were the only identification from its protein, then you would of course miss it in the second round search. If you find a phosphorylated peptide, a somewhat rare event, most likely it will be accompanied by its non-phosphorylated relative, and at least a few other peptides from the protein. Since the stoichiometry is often skewed toward the non-phosphorylated peptide, if you do not find the non-phosphorylated peptide it is very unlikely you will find the phosphorylated one. One exception is the experiment where you have enriched for the phosphorylated peptide using a purification technique like IMAC, in which case you should include this mod in the first round search. A second round search is a valuable tool for increasing protein coverage and for finding those rare modifications that we are all so interested in.
Recent Additions to Scaffold, 2006:
1.) Statistics View
This view plots the discriminant score that is used in developing the statistical model. The reader is referred to the Proteome Software website for a discussion of scoring. The view also shows a scatter plot of correct and incorrect peptide hits; the lines in this plot are the confidence levels that you have set. One point of confusion that we had concerning the scatter plot is how a correctly assigned peptide (red dot, all the dots are peptides) gets below the confidence line. The explanation is that the peptide was correlated to a protein that is above the line, even though the peptide probability algorithm placed it below the line. We also asked how a peptide could be determined to be correct or incorrect, since all peptide matches are correlations to a supposedly unknown dataset. The answer is that a peptide is judged to be correct if it is correlated to a protein that is judged to be correct based on the protein probability calculation. There is still some question in our mind as to how the user is supposed to use this statistics view. Should they use it to judge whether the statistical algorithm is doing its job correctly? We will get more information on this view, and report back.
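To make the relationship between peptide and protein probabilities concrete, here is a small Python sketch. It uses the simplest form of the protein probability from the Nesvizhskii et al. paper listed in the references, namely the chance that at least one peptide assignment is correct when the peptide probabilities are treated as independent. The published model includes further adjustments, and we do not know exactly what Scaffold computes, so treat the threshold, the function names, and the labeling rule below as assumptions.

```python
def protein_probability(peptide_probs):
    """Probability that at least one peptide assignment is correct,
    treating the peptide probabilities as independent."""
    p_all_wrong = 1.0
    for p in peptide_probs:
        p_all_wrong *= (1.0 - p)
    return 1.0 - p_all_wrong


def label_peptides(peptide_probs, protein_threshold=0.95):
    """Label every peptide of a protein 'correct' when the protein itself
    passes the (hypothetical) threshold, regardless of the peptide's own score."""
    ok = protein_probability(peptide_probs) >= protein_threshold
    return ["correct" if ok else "incorrect" for _ in peptide_probs]
```

Under these assumptions, a peptide with a probability of only 0.2 is still labeled correct when it belongs to a protein that also has a 0.99 peptide, which is the situation of a red dot sitting below the confidence line. It also shows how two good peptides alone can push a protein close to 100%.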

2.) Publish View
The Publish view contains all of the information that you will need to write the methods section of your paper, and it is very nice to have this information archived with your data. It is also very handy if you have clients asking you for this information: you can point them to this section in the Scaffold data that you have already sent them. Wow, we wonder if they could add additional panes that would include our gel spot digest and LC/MS method.

3.) GO (Gene Ontology) Annotation
This is not a very good demo data set to showcase this feature; however, you can see that Scaffold lists biological process, cellular compartment, and molecular function. We will try to get a better dataset to demo this function.

4.) Report Formatting
Two new features are "starring" and "hiding" proteins. These two features, we imagine, were meant to address functions that we all perform as MS data disseminators. That is, we need to emphasize some hits and de-emphasize others, like keratin or trypsin. This helps to draw the data recipient's eye to results that we believe to be significant and away from others that we believe to be common contaminants (especially when there is a lot of keratin present).
Hiding Proteins:
Our philosophy at IonSource is to always show all of the data to the client or collaborator. We know all of the justifications for hiding hits: "The client will be confused by this data," or "This data will just open a can of worms if we show it to the client." Our philosophy is that we cannot be the god of MS that says whether a high probability result is significant to the client or not. For some of our clients, keratin has been the target and the hit that they were looking for; in this case deleting keratin would be a big mistake. We understand that some of our clients may have trouble interpreting results, and in these cases we generally talk and walk them through the data. Hiding data, in our opinion, is always a mistake. A big reason why we like Scaffold is that the reporting format can be understood with very little explanation. We are often amazed when we send an MS-naive client a Scaffold file and direct them to the Scaffold viewer download URL, and they then have no problem displaying and understanding their data, even if they are working on a Macintosh or Linux computer! Scaffold has greatly reduced false positives and redundant hits, and has greatly reduced our hand holding and consulting time.
Concluding Remarks:
Scaffold is a very good program for displaying proteomic data. Few programs currently have the ability to compare multiple runs side by side. In addition, Scaffold has resurrected our Sequest results. Even though Sequest is an excellent program, it was often difficult for us to explain the complex Sequest scoring scheme (Sp, DeltCn, XCorr, etc.) to collaborators. In the past we would just point collaborators to Mark Hail's "How to evaluate SEQUEST results" page to help them understand Sequest scoring, or we would hand them a pile of scientific papers from the Yates lab. Amazingly, Scaffold, along with its statistical evaluation, has brought our Sequest results back to life. Scaffold's probability calculation has added a greater degree of confidence to our results. If you have a version of Sequest that produces .out files, we would definitely recommend Scaffold to display and organize them. We really like the way it displays multiple gel results and MudPIT runs. It is so much easier to compare experimental to control samples in a MudPIT analysis using Scaffold. Currently Scaffold can import Mascot and Sequest results, as well as outputs from X!Tandem, and Proteome Software is working on importing other file formats. Scaffold takes care of what we all intrinsically knew: combining multiple search results is beneficial and will increase coverage. It has also given us an economical way of comparing multiple proteomic runs without buying into a 50-100K software package. Currently, if you want to share the Scaffold experience, you can send your collaborator your saved Scaffold result file and have them download the free Scaffold viewer. This is analogous to what Adobe has done with Acrobat and PDF files: the viewer is free, and if you want to create or edit a PDF document you need to buy the program. It is an important concept in a field where MS instrumentation is expensive and centralized and information needs to be disseminated.
As the popularity of Scaffold spreads, clients will begin to ask their mass spectrometrist, "Could you provide the data in Scaffold format?" Currently with Scaffold every table can be exported to Excel and every graphic can be exported as a jpeg file. The best thing about Scaffold is that it is a simple program that can do some pretty sophisticated proteomic operations, and at the same time it is relatively inexpensive.
For a more complete discussion of all topics covered in this review the reader is referred to the Proteome Software web site. They have an excellent proteomics tutorial section and do an excellent job at describing Protein Prophet and X!Tandem, and of course the Scaffold advantage of combining multiple searches.
Scaffold ® Tips and Tricks.
1.) Formatting a subset database for each second round Mascot search is easy with Scaffold but can be a bit of a hassle when you need to add and format a new database each time on your Mascot server. Here's the tip, try creating a generic "current folder" in the Mascot database folder with a generic name like NCBInrSubset. Then, each time you have a new NCBI nr subset database to search, just drop the new database in this folder and Mascot will automatically format it and you are ready to search.
2.) Many Scaffold functions can be performed over the network. However, it is helpful to have Scaffold on the same computer as Sequest or Mascot so that it is easy to share the same database folder and easy to perform second round searches. It just makes everything a little more convenient.
3.) To avoid result parsing difficulties remember to use the same mass tolerances in all searches.
4.) When doing a Sequest search the default parameters in BioWorks are set to 1.5 for the parent mass and 0.0 for the fragment masses. We change these in the initial Sequest search to 2 and 0.8 respectively to avoid any Scaffold parsing problems.
5.) Conglomerate sequence databases can cause difficulties if the entries do not maintain a common header format. This will result in trouble displaying results correctly, which is a result of not being able to parse multiple header formats in one "clugged" together database.
6.) Key commands, we love em, here is a list of some "secret" key commands you can use with Scaffold. Shh, don't spread these around. .....Back in the 1989 version of the API III SCIEX software, from SCIEX, there was a secret key command that would bring up a picture of Cheryl Tiegs in a swimsuit. Does anyone remember that command? Wait, let me turn off my cryo pump while you tell me.
7.) Here is an IonSource secret, try browsing the Scaffold Review directory or try browsing the general reviews directory, where you really shouldn't be!
8.) more to come.....
Scaffold: Features we would like to see added or improved
Scaffold needs to learn to handle nucleotide databases. (major****)
We would like to be able to do an upfront X!Tandem search with Scaffold. (major****) (Note: If you have Scaffold you can do an X!Tandem search of a data file if you run X!Tandem through the command line.)
Currently, columns cannot be rearranged once a Scaffold report is generated. The only way to order columns is to name the data files alphabetically, or numerically before the search. (major***) (PSI is working on this one.)
Scaffold should have a taxonomy pull down menu like every other search engine out there. Currently we need to make subset databases especially for Scaffold before we search. With smaller species specific databases Scaffold compiles data faster and performs X!Tandem searches faster. X!Tandem was really not meant to search the entire NCBInr protein database. For example if we used NCBInr, with taxonomy rodentia in the Mascot search, we now manually go and make a subset database for Scaffold. (major***)
We would like to see Scaffold be able to import Phenyx and OMSSA results. This will broaden their customer base. (major***) Scaffold now supports Phenyx!
Scaffold cannot fold in results from multiple enzyme digests, since the search parameters need to be the same for each search. One work around is to do a no enzyme search for each dataset in the primary searches, then combine them with Scaffold. It would be nice if Scaffold could be less strict when combining results, or incorporate a multiple enzyme import utility. (major***)
We would recommend eliminating the hide protein results feature. It would be better to have a de-emphasize feature or, and we just thought of this, a split results feature that splits the results into user defined significant and insignificant groups. (major*****) (The good people at Proteome Software have assured us that proteins can always be unhidden.)
It would also be nice to incorporate a hit note feature so the analyst could make a note to the client, like, "We don't think that this is the best hit because it could be carry-over from the previous run." (minor) (We have been told that there is a function, but have not tried it yet)
Rather than having to download the newest version of Scaffold every few weeks, it would be nice to have an update button on the Scaffold browser. (minor) (We have been told that they are working on this.)
Note:
We would like to thank the Scaffold programmers and sales people for being so responsive over the past year, and for addressing many of our issues promptly. We have given the PSI people a workout this year, and they have been very accommodating, and they seldom complain when the mistake turns out to be a user mistake. Make sure you talk to them if you have any issues with Scaffold.
References:
Craig R, Cortens JP, Beavis RC. TANDEM: matching proteins with tandem mass spectra. Bioinformatics. 2004 Jun 12;20(9):1466-7. Epub 2004 Feb 19.
Nesvizhskii AI, Keller A, Kolker E, Aebersold R. A statistical model for identifying proteins by tandem mass spectrometry. Anal Chem. 2003 Sep 1;75(17):4646-58.
Keller A, Nesvizhskii AI, Kolker E, Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal Chem. 2002 Oct 15;74(20):5383-92.
e-mail webmaster@ionsource.com with all inquiries
Copyright © 2005-2007, IonSource, LLC, All rights reserved.
Last updated: Tuesday, April 15, 2008 07:26:06 AM