Authors: Tuping Zhou, Nanjiang Shu and Sven Hovmöller
Structural Chemistry,
Contact: Nanjiang Shu nanjiang@struc.su.se
Updated 2009-08-11
Abbreviations:
PDB: Protein Data Bank
Q3: overall per-residue accuracy for three-state secondary structure prediction
S3: overall per-residue accuracy for three-state Shape String prediction
S8: overall per-residue accuracy for eight-state Shape String prediction
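As an illustration of how a per-residue accuracy such as Q3, S3 or S8 can be computed, here is a minimal sketch. It assumes that predicted and observed states are given as equal-length strings with one character per residue, and that the "overall" value pools all residues of all chains rather than averaging the per-chain scores. The function names and the state letters in the usage note are hypothetical and not taken from the original files.

```python
def count_matches(predicted: str, observed: str) -> int:
    """Count positions where the predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    return sum(p == o for p, o in zip(predicted, observed))


def per_residue_accuracy(predicted: str, observed: str) -> float:
    """Accuracy for a single chain: matching residues / chain length."""
    if not observed:
        raise ValueError("empty chain")
    return count_matches(predicted, observed) / len(observed)


def overall_accuracy(pairs) -> float:
    """Accuracy pooled over all residues of all chains.

    `pairs` is an iterable of (predicted, observed) strings.
    Assumption: pooling by residue, not averaging per-chain scores.
    """
    correct = 0
    total = 0
    for predicted, observed in pairs:
        correct += count_matches(predicted, observed)
        total += len(observed)
    if total == 0:
        raise ValueError("no residues to score")
    return correct / total
```

For example, with the hypothetical three-state strings `"HHEC"` (predicted) and `"HHCC"` (observed), three of four residues agree, so `per_residue_accuracy` returns 0.75. The same functions apply to eight-state strings, since only character equality is tested.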
1) Cutting at ≤30% sequence identity, including 5860 chains: chain id list, amino acid sequence file
2) Cutting at ≤25% sequence identity, including 4227 chains: chain id list, amino acid sequence file
3) Cutting at ≤20% sequence identity, including 3338 chains: chain id list, amino acid sequence file
We thank Dr. David Jones for providing us with the training set used to build the
weight files for PSIPRED version 2.61, which we needed to carry out the benchmark.
The training set contains 6598 protein chains and 1563587 amino acids, with an
average sequence length of 237 amino acids.
The chain id list is psipred2.61.training.idlist
and the sequence file is psipred2.61.training.fasta.
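The figures above are consistent with each other: 1563587 / 6598 ≈ 236.98, which rounds to 237. A minimal sketch for recomputing such statistics from a sequence file follows. It assumes the file is in standard FASTA format (header lines starting with `>`, sequence possibly wrapped over several lines); the function names are hypothetical.

```python
def fasta_lengths(lines):
    """Return {header: sequence length} from an iterable of FASTA lines."""
    lengths = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            lengths[header] = 0
        elif header is not None:
            # sequence lines may be wrapped; accumulate their lengths
            lengths[header] += len(line)
    return lengths


def summarize(lengths):
    """Return (number of chains, total residues, average chain length)."""
    n_chains = len(lengths)
    if n_chains == 0:
        raise ValueError("no sequences found")
    total = sum(lengths.values())
    return n_chains, total, total / n_chains
```

To apply it to a real file, one could pass an open file handle, for example `summarize(fasta_lengths(open("psipred2.61.training.fasta")))`.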
Note that many sequences in this training set have high sequence identity to
each other. For example, the sequence identity between chains 1JPTL and 1L7IL is
as high as 90%. When this training set is cut down to ≤30% sequence identity,
only 3644 chains remain. The chain id list for this non-redundant subset of the
training set can be found at psipred2.61.training.nr30.idlist.
The test set was constructed in the following steps.
Step 1: Obtain all PDB chains (as of June 10, 2009), cutting at ≤99% sequence identity, with resolution <2.5Å and R-value <0.3, using only X-ray structures.
This returned 21574 protein chains.
Step 2: From these 21574 chains, remove those with the same chain id as any chain in the training set.
This resulted in 15256 protein chains.
Step 3: Run blastpgp for all these 15256 chains against the training set and remove those chains with at least one significant hit from the training set.
The parameters for blastpgp are: E-value threshold = 1e-3 AND Iteration = 3.
The criteria for being a significant hit are: (Sequence identity >30% AND Alignment length >30 AND E-value <0.1) OR (Sequence identity >50% AND Alignment length >15 AND E-value <5).
This resulted in 3100 protein chains.
Step 4: Cut these 3100 protein chains down to ≤30% sequence identity.
This resulted in 2421 protein chains.
The chain id list of the test set can be found at test2421.idlist and the amino acid sequence file can be found at test2421.fasta.
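The significant-hit criteria of Step 3 can be written as a simple predicate. The sketch below assumes that the blastpgp output has already been parsed into one record per hit; the `Hit` class, its field names and the helper `keep_unrelated` are hypothetical and only illustrate the filtering logic, not the scripts actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    identity: float   # sequence identity in percent
    align_len: int    # alignment length in residues
    evalue: float     # E-value reported for the hit


def is_significant(hit: Hit) -> bool:
    """Apply the two alternative criteria from Step 3."""
    rule_a = hit.identity > 30 and hit.align_len > 30 and hit.evalue < 0.1
    rule_b = hit.identity > 50 and hit.align_len > 15 and hit.evalue < 5
    return rule_a or rule_b


def keep_unrelated(hits_by_chain):
    """Return chain ids that have no significant hit in the training set.

    `hits_by_chain` maps a chain id to the list of its hits against
    the training set (an empty list means no hits at all).
    """
    return [
        chain
        for chain, hits in hits_by_chain.items()
        if not any(is_significant(h) for h in hits)
    ]
```

Note that all thresholds are strict inequalities, so a hit with exactly 30% identity does not satisfy the first rule regardless of its length or E-value.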
1) For the dataset cutting at ≤30% sequence identity, including 5860 chains
a. Q3 for each chain and the overall Q3 can be found at nr30.Q3.list
b. S3 and S8 for each chain and the overall S3 and S8 can be found at nr30.S3S8.list
2) For the dataset cutting at ≤25% sequence identity, including 4227 chains
a. Q3 for each chain and the overall Q3 can be found at nr25.Q3.list
b. S3 and S8 for each chain and the overall S3 and S8 can be found at nr25.S3S8.list
3) For the dataset cutting at ≤20% sequence identity, including 3338 chains
a. Q3 for each chain and the overall Q3 can be found at nr20.Q3.list
b. S3 and S8 for each chain and the overall S3 and S8 can be found at nr20.S3S8.list
1) Prediction results by our method Frag1D
1. Q3 for each chain and the overall Q3 can be found at benchFrag1D.Q3.list
2) Prediction results by PSIPRED2.61
1. Q3 for each chain and the overall Q3 can be found at benchPSIPRED261.Q3.list
1) S3 and S8 for each chain and the overall S3 and S8 can be found at db1296.S3S8.list