Supplementary data for the paper “A novel method for accurate one-dimensional protein structure prediction based on fragment matching”

Authors: Tuping Zhou, Nanjiang Shu and Sven Hovmöller

Structural Chemistry, Stockholm University, SE-106 91, Stockholm, Sweden

Contact: Nanjiang Shu nanjiang@struc.su.se

Updated 2009-08-11

Abbreviations:

PDB: Protein Data Bank

Q3: overall per-residue accuracy for three-state secondary structure prediction

S3: overall per-residue accuracy for three-state Shape String prediction

S8: overall per-residue accuracy for eight-state Shape String prediction
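
As a minimal sketch of how an overall per-residue accuracy such as Q3, S3 or S8 can be computed, the Python snippet below pools the correctly predicted residues over all chains and divides by the total number of residues. The function name, the input layout and the example strings (H, E, C for helix, strand, coil) are illustrative assumptions and are not taken from the paper or from the result files listed here.

```python
def per_residue_accuracy(pairs):
    """Overall per-residue accuracy (in percent) over (predicted, observed) pairs.

    Residues are pooled across all chains, so longer chains contribute
    more residues to the overall value than shorter ones.
    """
    correct = 0
    total = 0
    for predicted, observed in pairs:
        if len(predicted) != len(observed):
            raise ValueError("predicted and observed strings differ in length")
        # Count positions where the predicted state equals the observed state.
        correct += sum(p == o for p, o in zip(predicted, observed))
        total += len(observed)
    if total == 0:
        raise ValueError("no residues to score")
    return 100.0 * correct / total


# Hypothetical three-state strings for two chains (assumed alphabet: H, E, C).
example = [("HHHHCCEEEE", "HHHCCCEEEE"), ("CCEEC", "CCEEC")]
print(f"Q3 = {per_residue_accuracy(example):.1f}%")  # 14 of 15 residues correct
```

The same function applies unchanged to S3 and S8 if the strings hold three-state or eight-state Shape String symbols instead.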

The list of PDB chains and amino acid sequences in FASTA format for three datasets cutting at ≤30%, ≤25% and ≤20% sequence identity, respectively.

1) Cutting at ≤30% sequence identity, including 5860 chains: chain id list, amino acid sequence file

2) Cutting at ≤25% sequence identity, including 4227 chains: chain id list, amino acid sequence file

3) Cutting at ≤20% sequence identity, including 3338 chains: chain id list, amino acid sequence file

The training set and test set for benchmarking with PSIPRED version 2.61

We thank Dr. David Jones for providing the training set used to build the weight files for PSIPRED version 2.61, which allowed us to carry out the benchmark.
The training set contains 6598 protein chains and 1563587 amino acids, with an average sequence length of 237 amino acids.
The chain id list is psipred2.61.training.idlist and the sequence file is psipred2.61.training.fasta.
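
The summary figures for a FASTA file (number of chains, number of amino acids, average length) can be recomputed with a short script. The following is a sketch only: the function names are hypothetical, the toy records are invented for illustration, and it assumes that every record header starts with ">" and is unique within the file.

```python
def fasta_lengths(lines):
    """Return a dict mapping each FASTA header to its sequence length."""
    lengths = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Start a new record; sequence lines that follow add to its length.
            header = line[1:]
            lengths[header] = 0
        elif header is not None:
            lengths[header] += len(line)
    return lengths


def summarize(lengths):
    """Return (number of chains, number of residues, average length)."""
    n_chains = len(lengths)
    n_residues = sum(lengths.values())
    return n_chains, n_residues, n_residues / n_chains


# Toy input; for a real file, pass an open file handle instead,
# e.g. open("psipred2.61.training.fasta").
toy = [">1ABCA", "MKT", "AYI", ">2XYZB", "GSHM"]
chains, residues, mean_len = summarize(fasta_lengths(toy))
print(chains, residues, mean_len)  # 2 10 5.0

# Consistency check of the figures quoted for the training set.
print(round(1563587 / 6598))  # 237
```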

Note that many sequences in this training set have high sequence identity to each other. For example, the sequence identity between chains 1JPTL and 1L7IL is as high as 90%.

When this training set is cut down to ≤30% sequence identity, only 3644 chains remain. The chain id list for this non-redundant subset of the training set can be found at psipred2.61.training.nr30.idlist.

The test set was constructed in the following steps.

1) Step 1: Obtain all PDB chains (as of June 10, 2009) from X-ray structures only, cutting at ≤99% sequence identity, with resolution <2.5 Å and R-value <0.3.
This returned 21574 protein chains.

2) Step 2: Out of the 21574 chains thus obtained, those sharing a chain id with any chain in the training set were removed.
This resulted in 15256 protein chains.

3) Step 3: Run blastpgp for all these 15256 chains against the training set; chains with at least one significant hit from the training set were removed.
The parameters for blastpgp are: E-value threshold = 1e-3 and number of iterations = 3.
The criteria for being a significant hit are: (Sequence identity >30% AND Alignment length >30 AND E-value <0.1)
OR (Sequence identity >50% AND Alignment length >15 AND E-value <5)
This resulted in 3100 protein chains.

4) Step 4: Cut these 3100 protein chains down to ≤30% sequence identity.
This resulted in 2421 protein chains.

The chain id list of the test set can be found at test2421.idlist and the amino acid sequence file can be found at test2421.fasta.
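
Steps 2 and 3 above can be sketched in Python as follows. This is an illustration under stated assumptions, not the script used to build the test set: the `Hit` record, its field names and the example ids are hypothetical, the parsing of blastpgp output is not shown, and the redundancy reduction of Step 4 is not covered. Only the two significance criteria are taken directly from the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One alignment of a candidate chain to a training-set chain (hypothetical record)."""
    query_id: str       # chain id of the candidate chain
    identity: float     # sequence identity in percent
    align_length: int   # alignment length in residues
    evalue: float


def is_significant(hit):
    """Apply the two criteria for a significant hit given in Step 3."""
    long_match = hit.identity > 30 and hit.align_length > 30 and hit.evalue < 0.1
    short_match = hit.identity > 50 and hit.align_length > 15 and hit.evalue < 5
    return long_match or short_match


def filter_candidates(candidate_ids, training_ids, hits):
    """Drop chains that share an id with the training set (Step 2) or
    have at least one significant hit against it (Step 3)."""
    remaining = [c for c in candidate_ids if c not in training_ids]
    flagged = {h.query_id for h in hits if is_significant(h)}
    return [c for c in remaining if c not in flagged]


# Invented example data.
candidates = ["1AAAA", "2BBBA", "3CCCA", "4DDDA"]
training = {"1AAAA"}
hits = [
    Hit("2BBBA", 35.0, 120, 1e-20),  # meets the first criterion
    Hit("3CCCA", 28.0, 200, 1e-5),   # identity too low for either criterion
    Hit("4DDDA", 60.0, 20, 2.0),     # meets the second criterion
]
print(filter_candidates(candidates, training, hits))  # ['3CCCA']
```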

Prediction results on protein chains cutting at ≤30% (5860 chains), ≤25% (4227 chains) and ≤20% (3338 chains) sequence identity, respectively.

1) For the dataset cutting at ≤30% sequence identity, including 5860 chains

a. Q3 for each chain and the overall Q3 can be found at nr30.Q3.list

b. S3 and S8 for each chain and the overall S3 and S8 can be found at nr30.S3S8.list

2) For the dataset cutting at ≤25% sequence identity, including 4227 chains

a. Q3 for each chain and the overall Q3 can be found at nr25.Q3.list

b. S3 and S8 for each chain and the overall S3 and S8 can be found at nr25.S3S8.list

3) For the dataset cutting at ≤20% sequence identity, including 3338 chains

a. Q3 for each chain and the overall Q3 can be found at nr20.Q3.list

b. S3 and S8 for each chain and the overall S3 and S8 can be found at nr20.S3S8.list

Prediction results for benchmarking with PSIPRED2.61

1) Prediction results by our method Frag1D

1. Q3 for each chain and the overall Q3 can be found at benchFrag1D.Q3.list

2) Prediction results by PSIPRED2.61

1. Q3 for each chain and the overall Q3 can be found at benchPSIPRED261.Q3.list

Prediction results on the dataset containing 1296 chains, which was used by Kuang et al. (2004) for Shape String prediction

1) S3 and S8 for each chain and the overall S3 and S8 can be found at db1296.S3S8.list

The lists of 50 randomly selected PDB chains and their corresponding templates for homology modelling can be found at modellerRand50.list