Ammm:Comparative Analysis of RNA And Mfold


Prediction of RNA Secondary Structure With Free-Energy Minimization

Reasons for Predicting the RNA Secondary Structure

Simpler problem

  • RNA secondary structure has a regular structure.
  • The secondary structure can be broken up into helices (paired) and loops (unpaired).

Kinetics of RNA Folding

  • Secondary structure forms very fast; transcription can be the rate-limiting step.
  • Tertiary structure forms at a slower rate.
  • Thus, breaking the structure prediction into two steps makes sense.

Free-Energies of Structural Motifs

Free-Energy of Overall Structure

  • An RNA secondary structure is composed of smaller structural motifs. The free-energies of these motifs are summed to determine the free-energy of the entire RNA secondary structure. The general classes of motifs are helices, hairpin loops, internal loops and multi-stem loops. As a general rule, mfold considers helices to be stabilizing, while the loop (unpaired) segments are destabilizing.

Structural Elements

  • Helices are composed of at least two base pairs, either Watson-Crick (AU, CG) or wobble (GU). Each two consecutive base pairs form a base pair stack. The number and composition of base pair stacks within a helix determine its free-energy.

  • Hairpin loops are U-turns at the end of a helix. A minimum loop length of three nucleotides is enforced. The free-energies of a handful of tri- and tetra-loops have been experimentally determined. For the vast majority of hairpin loops, however, the only factors that determine the free-energy are the hairpin flank and the length of the hairpin.

  • Internal loops are unpaired regions between two helices. For small internal loops (1x1 and 2x2), a free-energy value is specified for every possible combination. For a larger internal loop, mfold computes the free-energy based on the loop length and the internal loop flank.

  • Multi-stem loops are the junctions for three or more helices.
  • Pseudoknots are treated as part of the tertiary structure, and their prediction is not attempted by Mfold or most other secondary structure prediction programs. This has the advantage of greatly reducing the complexity of the problem.
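
The motif classes listed above can be illustrated with a short sketch that labels the loop closed by each base pair in a pseudoknot-free structure. This is not Mfold's code: the dot-bracket input notation and the function names `pair_table` and `classify_loops` are assumptions made for illustration, and bulges are grouped with internal loops for simplicity.

```python
def pair_table(structure):
    """Map every paired position to its partner in a dot-bracket string."""
    stack, partner = [], {}
    for pos, char in enumerate(structure):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            opening = stack.pop()
            partner[opening] = pos
            partner[pos] = opening
    return partner


def classify_loops(structure):
    """Label the loop closed by each base pair (i, j), with i < j."""
    partner = pair_table(structure)
    loops = []
    for i in sorted(partner):
        j = partner[i]
        if i > j:
            continue  # visit each pair once, from its 5' side
        # Collect the base pairs directly enclosed by (i, j).
        inner, k = [], i + 1
        while k < j:
            if k in partner:
                inner.append((k, partner[k]))
                k = partner[k] + 1  # jump over the enclosed helix
            else:
                k += 1
        if not inner:
            kind = "hairpin"
        elif len(inner) == 1:
            # Directly adjacent pairs form a base pair stack (helix).
            kind = "stack" if inner[0] == (i + 1, j - 1) else "internal"
        else:
            kind = "multi-stem"
        loops.append((i, j, kind))
    return loops
```

For example, `classify_loops("((...)(...))")` labels the outermost pair as closing a multi-stem loop, because it directly encloses two further helices.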

Complexity of Problem

# of Possible Helices vs # of Possible Structures


Molecule | Reference Sequence | GB # | Length | # Possible Helices | # Possible Sec. Str. | # Helices | xyplot | IISTR
tRNA | Saccharomyces cerevisiae (Phe) | K01553 | 76 | 37 | 2.5 * 10^19 | 4 | File:DpgardnTrna-helix.gif | File:DpgardnTrna2.str.gif
16S rRNA | Escherichia coli | J01695 | 1542 | 14684 | 4.3 * 10^393 | 58 | File:Dpgardn16S-helix.gif |



Mfold approach [1] [2] [3]

Recursion

  • Mfold employs a dynamic programming (recursive) approach. It finds the optimal secondary structure by determining the optimal structure for each part of the sequence.
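
The recursive idea can be sketched with a much simpler scoring scheme than the one Mfold uses. The example below maximizes the number of base pairs (a Nussinov-style recursion) instead of minimizing free-energy, so it only illustrates how the optimal answer for a subsequence is built from optimal answers for smaller subsequences. The names `PAIRS`, `MIN_LOOP` and `max_pairs` are assumptions made for illustration.

```python
# Allowed pairs: Watson-Crick plus GU wobble.
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3  # minimum number of unpaired bases in a hairpin loop


def max_pairs(seq):
    """Return the largest number of nested base pairs seq can form."""
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] holds the optimum for the subsequence seq[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j is unpaired
            for k in range(i, j - MIN_LOOP):  # case 2: base j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]
```

Each cell is filled from cells that cover shorter subsequences, which is why the table is processed in order of increasing span. An energy-based method follows the same pattern but replaces the "+ 1" with motif free-energies and takes a minimum.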

Very Short Sequence Example

  • Mfold's first step is to find all of the possible helices of length 2 or more within a sequence. This list can be used to speed up the search.
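
A minimal sketch of such a helix search follows. It assumes that a helix is a maximal run of consecutive stacked pairs, that only Watson-Crick and GU pairs are allowed, and that a hairpin loop needs at least three unpaired bases. The function name `find_helices` and its parameters are hypothetical, and Mfold's own enumeration may count helices differently.

```python
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def find_helices(seq, min_len=2, min_loop=3):
    """List maximal helices as (start_5p, start_3p, length), 0-based."""
    n = len(seq)

    def can_pair(i, j):
        return (0 <= i and j < n and j - i > min_loop
                and (seq[i], seq[j]) in PAIRS)

    helices = []
    for i in range(n):
        for j in range(i + min_loop + 1, n):
            if not can_pair(i, j):
                continue
            if can_pair(i - 1, j + 1):
                continue  # part of a helix that starts further out
            length = 0
            while can_pair(i + length, j - length):
                length += 1  # extend the helix inward
            if length >= min_len:
                helices.append((i, j, length))
    return helices


print(find_helices("GGGAAAUCC"))
# → [(0, 7, 2), (0, 8, 3), (1, 8, 2)]
```

Each tuple gives the outermost pair of a helix and the number of base pairs in it. The three helices found here are mutually exclusive alternatives, which is the kind of choice the folding recursion then has to resolve.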

To run Mfold

  1. Create a file in either FASTA or GENBANK format that contains the RNA sequence that will be folded.
  2. mfold SEQ=<file>
  3. Look at the result files: the .ct and .det files created by mfold. The .ct file contains the base pairs of the predicted structures. The .det file contains the free-energies of all the individual structural elements.
  • Depending on the size of the RNA sequence, Mfold will return multiple predicted foldings; the default number is 100. Mfold actually calls its folding program twice to generate these foldings. The first run saves the optimal folding for each part of the sequence to a .sav file. The second run reads the .sav file and performs a traceback to generate the multiple structures.
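
The base pairs in a .ct file can be read with a few lines of code. The sketch below assumes the commonly used connect-table layout: a header line that starts with the sequence length, followed by one line per nucleotide whose first, second and fifth fields are the index, the base and the index of the pairing partner (0 if unpaired). That layout is not described on this page, so check it against your own output. The function name `read_ct` is hypothetical.

```python
def read_ct(text):
    """Parse CT-formatted text into a list of (title, sequence, pairs)."""
    lines = [line for line in text.splitlines() if line.strip()]
    structures, pos = [], 0
    while pos < len(lines):
        header = lines[pos].split()
        length = int(header[0])  # number of nucleotide lines that follow
        title = " ".join(header[1:])
        bases, pairs = [], []
        for row in lines[pos + 1 : pos + 1 + length]:
            fields = row.split()
            index, base, partner = int(fields[0]), fields[1], int(fields[4])
            bases.append(base)
            if partner > index:  # record each base pair only once
                pairs.append((index, partner))
        structures.append((title, "".join(bases), pairs))
        pos += 1 + length  # move on to the next structure, if any
    return structures
```

Because a single file may hold several predicted foldings, the function returns one entry per structure. A call such as `read_ct(open("example.ct").read())`, where `example.ct` is a placeholder file name, would give the 1-based base pairs of each folding.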

Limitations of Mfold

Mfold Accuracy

  • What is the overall accuracy of Mfold?
rRNA | Archaea | Bacteria | Eukaryote
5S (~120 nts) | 0.72 | 0.73 | 0.71
16S (~1500 nts) | 0.59 | 0.49 | 0.34
23S (~2900 nts) | 0.57 | 0.51 | 0.43
  • Mfold's accuracy depends entirely on correct energetic parameters. Because each parameter must, in theory, be determined experimentally in vitro, the number of well-defined parameters is limited. Most structural free-energies are approximations that do not account for nucleotide composition. In practice, many energetic parameters are added in an ad hoc manner. Done well, this could increase Mfold's prediction accuracy, but that does not seem to be the case.

Energetic Parameters

  • Base pair stacks are treated as symmetrical. Also, only Watson-Crick and GU (WCGU) base pairs are allowed.

  • Hairpin loop energies are almost always calculated using the hairpin flank and length of the hairpin.

  • Internal loop energies are usually calculated using the internal loop flank energetics and the length of the internal loop.

  • More generally, the idea that loop regions are destabilizing does not seem to hold up when looking at the crystal structure. In the crystal structure, most of the nucleotides that are unpaired in the secondary structure are paired in the tertiary structure.

Basis of Comparative Analysis

  • For select RNA molecules, it is assumed that all members of that family (e.g., tRNA) will adopt the same secondary and tertiary structure fold. This is not dependent upon sequence similarity.
  • To find the common structure, multiple sequences from the same family must be aligned accurately. Then, a search for potential helices in the alignment that are common to all sequences in that data set is performed.

File:DpgardnCovariation.gif

  • Nucleotides that are involved in a conserved base pair will covary in order to maintain the correct secondary and tertiary structure.

File:DpgardnCovariation2.gif

  • With comparative analysis, the secondary structure of an RNA molecule can be correctly identified.
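
The covariation idea can be sketched as a search for pairs of alignment columns that always form an allowed base pair while both columns change between sequences. This is a deliberately simple test for illustration, not the procedure used in published comparative analyses. It assumes a gap-free alignment of equal-length sequences, and the name `covarying_columns` is hypothetical.

```python
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def covarying_columns(alignment, min_loop=3):
    """Return (col_a, col_b, observed_pairs) for covarying column pairs."""
    width = len(alignment[0])
    hits = []
    for a in range(width):
        for b in range(a + min_loop + 1, width):
            observed = {(seq[a], seq[b]) for seq in alignment}
            if not observed <= PAIRS:
                continue  # some sequence cannot pair at these columns
            left = {x for x, _ in observed}
            right = {y for _, y in observed}
            # Covariation: both columns vary, yet pairing is preserved.
            if len(left) > 1 and len(right) > 1:
                hits.append((a, b, sorted(observed)))
    return hits


alignment = ["GAAAAAC", "CAAAAAG", "AAAAAAU"]
print(covarying_columns(alignment))
# → [(0, 6, [('A', 'U'), ('C', 'G'), ('G', 'C')])]
```

Columns 0 and 6 differ in every sequence but always form a Watson-Crick pair, which is the signal a conserved base pair is expected to leave in an alignment.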

Using Comparative Analysis to Generate New Improved Energetics

Statistical Potentials

  • If we assume that frequency is correlated with stability, then known secondary structures can be used to generate new energetic parameters. Base pair stack frequencies are converted into statistical potentials (pseudo-energies) using the quantities defined below.

  • The indices i, j, k, l represent any of the nucleotides A, C, G, or U. N_ST(ij,kl) is the number of base-pair stacks composed of ij and kl, and N_ST is the total number of base-pair stacks. The delta function δ(ij,kl) is equal to 1 if ij = kl and 0 otherwise. P_i is the probability of occurrence of nucleotide i. The scaling factor is set to a ratio whose numerator is the minimum experimental base-pair stack energy and whose denominator is the minimum base-pair stack statistical energy. Any statistical potential that is higher than the corresponding maximum experimental value is set equal to the maximum experimental value.
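
The conversion from frequencies to pseudo-energies can be sketched as follows. The exact equation is not reproduced on this page, so the sketch assumes a standard log-odds form: the negative logarithm of the observed stack frequency divided by a reference frequency built from the individual nucleotide probabilities. The delta-function term is omitted, and a single maximum experimental value is used as the cap. Only the scaling and capping steps follow the description above directly. The function name `stack_potentials` and all of its parameters are hypothetical.

```python
import math


def stack_potentials(stack_counts, base_freq, exp_min, exp_max):
    """Turn base-pair stack counts into scaled pseudo-energies.

    stack_counts maps (ij, kl) pairs such as ("GC", "CG") to counts.
    base_freq maps each nucleotide to its probability of occurrence.
    exp_min and exp_max are the minimum and maximum experimental
    stack energies; both are supplied by the caller.
    """
    total = sum(stack_counts.values())
    raw = {}
    for (ij, kl), count in stack_counts.items():
        p_obs = count / total
        # Assumed reference state: four independent nucleotides.
        p_ref = math.prod(base_freq[base] for base in ij + kl)
        raw[(ij, kl)] = -math.log(p_obs / p_ref)
    # Scale so the most favourable stack matches the experimental minimum.
    # Assumes that stack has a negative raw value.
    scale = exp_min / min(raw.values())
    # Cap any potential that exceeds the experimental maximum.
    return {key: min(scale * value, exp_max) for key, value in raw.items()}
```

Because the scaling factor is a ratio of two energies, any constant prefactor in the logarithmic term (such as RT) cancels out, which is why none appears in the sketch.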

Crystal Structure

  • This was first attempted with base pair stacks by counting the frequency of WCGU stacks found in multiple RNA crystal structures. [4] The statistical potentials correlated quite well with the experimentally derived base pair stack free-energies.

Comparative Structure

  • This approach was modified to use the information from secondary structures defined with comparative analysis.[5] This increases the number of RNA molecules that can be used in generating statistical potentials, giving a more complete ensemble. After replacing some of the energetic parameters within Mfold (base pair stacks, hairpin flank, internal loop flank, small internal loops) with statistical potentials, improvements in prediction accuracy were seen.
  • As detailed before, the energetic parameters that Mfold uses are limited. Using comparative analysis, it becomes possible to systematically create a complete set of energetic parameters.

References

  1. Zuker, M.; Sankoff, D., RNA Secondary Structures and their Prediction, Bulletin of Mathematical Biology, 1984, 46, 591-621
  2. Zuker, M.; Mathews, D.H.; Turner, D.H., Algorithms and Thermodynamics for RNA Secondary Structure Prediction: A Practical Guide, 1999
  3. Zuker, M., Mfold web server for nucleic acid folding and hybridization prediction, Nucleic Acids Research, 2003, 31, 3406-15
  4. Dima, R.I.; Hyeon, C.; Thirumalai, D., Extracting stacking interaction parameters for RNA from the data set of native structures, Journal of Molecular Biology, 2005, 347, 53-69
  5. Wu, J.C.; Gardner, D.P.; Ozer, S.; Gutell, R.R.; Ren, P., Correlation of RNA secondary structure statistics with thermodynamic stability and applications to folding, Journal of Molecular Biology, 2009, 391, 769-83