Analytical Flory Random Coil

Usage examples and full code reference for AnalyticalFRC. For the underlying theory, see Analytical Flory Random Coil.

Quick start

Build an object from a sequence and read off ensemble-average dimensions:

from afrc import AnalyticalFRC

P = AnalyticalFRC('MASNDYTQQATQSYGAYPTQPGQGYSQQSSQPYGQQSYSGYSQSTDTSGYG')

P.get_mean_end_to_end_distance()        # mean Re (A)
P.get_mean_radius_of_gyration()         # mean Rg (A)
P.get_mean_hydrodynamic_radius()        # mean Rh (A), Kirkwood-Riseman

Pull out full probability distributions (each returns (distances, probabilities)):

re_r, re_p = P.get_end_to_end_distribution()
rg_r, rg_p = P.get_radius_of_gyration_distribution()

# distribution between two specific residues
d_r, d_p = P.get_interresidue_distance_distribution(4, 40)

Inter-residue and whole-chain maps:

P.get_internal_scaling()                # [|i-j|, mean distance] profile
P.get_distance_map()                    # n x n mean inter-residue distances
P.get_contact_map(15.0)                 # contact fractions at a 15 A threshold
P.get_pre_profile(0)                    # hypothetical PRE profile for a label at residue 0

Draw a size-matched sample (e.g. to compare against a simulation trajectory):

samples = P.sample_end_to_end_distribution(n=5000)

See also the demo/demo_AnalyticalFRC.ipynb notebook for a worked, plotted example.

Code reference

class afrc.AnalyticalFRC(seq, adaptable_P_res=False)

The AnalyticalFRC object is the main user-facing object that the AFRC package provides. All functionality is accessed through functions called from this object, and the object itself is instantiated with a single amino acid string. For all intents and purposes, one can think of an AnalyticalFRC object as holding one protein sequence and providing an interface to ask specific types of polymer questions.

from afrc import AnalyticalFRC
MyProtein = AnalyticalFRC('KFGGPRDQGSRHDSEQDNSDNNTIFVQGLG')

Note

Distributions and parameters are only calculated as requested, such that initializing an AnalyticalFRC object is a cheap operation. However, operations relating to intramolecular distances (get_distance_map(), get_internal_scaling() etc.) are more computationally expensive.

__init__(seq, adaptable_P_res=False)

Constructor for an AFRC object which can be queried to obtain various parameters and statistics.

Parameters:
  • seq (str) – Amino acid sequence for the protein of interest (case insensitive). If this is an invalid string it will raise an AFRCException.

  • adaptable_P_res (bool (default = False)) – Sets the resolution used for generating probability distributions. By default this is assigned a fixed value (0.05 Å). However, if this flag is set to True, a sequence-specific adaptable resolution is used, calculated as \(d_{max} / 500.00\) (where \(d_{max}\) reflects the contour length of the polypeptide and is defined as \(3.7n\), with \(n\) the number of residues).
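
As a rough illustration of the adaptable resolution described above, the following sketch computes the bin width from the sequence length. The helper name adaptable_resolution and its keyword arguments are hypothetical and are not part of the afrc API; only the constants 3.7 and 500 are taken from the parameter description.

```python
def adaptable_resolution(seq, residue_length=3.7, n_bins=500.0):
    """Hypothetical helper: bin width (in angstroms) computed as d_max / n_bins,
    where d_max = residue_length * n is the contour length of the chain."""
    n = len(seq)
    d_max = residue_length * n          # contour length in angstroms
    return d_max / n_bins

seq = 'KFGGPRDQGSRHDSEQDNSDNNTIFVQGLG'
p_res = adaptable_resolution(seq)
print(f"n = {len(seq)}, adaptable resolution = {p_res:.3f} A")
```

Under this definition the resolution scales linearly with chain length, so longer sequences get coarser bins but the same number of bins overall.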

get_contact_fraction(R1, R2, threshold)

Function that, given two residues (R1 and R2) and a distance threshold in angstroms (threshold), returns the fraction of the time the center-of-mass distance between R1 and R2 is < threshold.

Practically, if we set threshold = 5, this gives you the expected contact fraction for two residues, which is a useful normalization factor.

Parameters:
  • R1 (int) – First residue - must be between 0 and length of the polymer

  • R2 (int) – Second residue - must also be between 0 and length of the polymer

  • threshold (float) – A distance threshold in angstroms - can be a float or an int

Returns:

Returns a single value between 0 and 1 that reports on the fraction of the time residues R1 and R2 are closer than threshold angstroms apart.

Return type:

float
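
One way to picture a contact fraction is as the cumulative probability of the inter-residue distance distribution below the threshold. The sketch below assumes the pair distance follows the Gaussian-chain form \(P(r)\) used for the end-to-end distribution and integrates it numerically. The function names and the value of mean_sq are hypothetical, and this is not necessarily how afrc computes the value internally.

```python
import math

def pair_pdf(r, mean_sq):
    """Gaussian-chain P(r) for a pair with mean-squared distance mean_sq (A^2)."""
    norm = (3.0 / (2.0 * math.pi * mean_sq)) ** 1.5
    return 4.0 * math.pi * r * r * norm * math.exp(-3.0 * r * r / (2.0 * mean_sq))

def contact_fraction(mean_sq, threshold, dr=0.01):
    """Integrate P(r) from 0 to threshold with the midpoint rule."""
    n_steps = max(1, round(threshold / dr))
    width = threshold / n_steps
    return sum(pair_pdf((k + 0.5) * width, mean_sq) * width for k in range(n_steps))

mean_sq = 400.0                          # hypothetical <r^2> for one residue pair (A^2)
f5 = contact_fraction(mean_sq, 5.0)      # fraction of the time closer than 5 A
f15 = contact_fraction(mean_sq, 15.0)    # fraction of the time closer than 15 A
print(f"contact fraction at 5 A:  {f5:.4f}")
print(f"contact fraction at 15 A: {f15:.4f}")
```

Because the result is a cumulative probability, it always lies between 0 and 1 and grows monotonically with the threshold.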

get_contact_map(threshold, symmetric_map=False)

Function that returns a contact map for the protein: a square matrix in which each element is the contact fraction between two residues.

Parameters:
  • threshold (float) – A distance threshold in angstroms - can be a float or an int.

  • symmetric_map (bool (default = False)) – If True, a full [n x n] matrix is returned; if False, only the upper right triangle is returned.

Returns:

Returns a square matrix where each element is the contact fraction between two residues.

Return type:

np.ndarray
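
To illustrate the difference between the upper-triangle and symmetric forms, the sketch below mirrors an upper-right triangle onto the lower-left using plain Python lists. The function name symmetrize and the matrix values are made up for illustration; afrc itself returns an np.ndarray and handles this through symmetric_map.

```python
def symmetrize(upper):
    """Mirror the upper-right triangle of a square matrix onto the lower-left."""
    n = len(upper)
    full = [list(row) for row in upper]      # copy so the input is left untouched
    for i in range(n):
        for j in range(i + 1, n):
            full[j][i] = upper[i][j]
    return full

# Toy 3 x 3 upper-triangle map (values are invented for illustration only)
upper = [
    [1.0, 0.8, 0.3],
    [0.0, 1.0, 0.7],
    [0.0, 0.0, 1.0],
]
full = symmetrize(upper)
for row in full:
    print(row)
```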

get_distance_map(calculation_mode='scaling law', symmetric_map=False)

Returns the complete inter-residue distance map, an [n x n] upper-right triangle matrix that can be used as a reference set for constructing scaling maps.

Distances are in angstroms and are measured from the residue center of mass.

Parameters:
  • calculation_mode (string (default = 'scaling law')) –

    A selector which must be equal to one of a specific set of options:

    'distribution' - means the P(r) distribution is used to calculate average distances

    'scaling law' - means the derived scaling relationships are used to calculate the average distance

    If one of these is not provided then an AFRCException is raised.

  • symmetric_map (bool (default = False)) – If True, a full [n x n] matrix is returned; if False, only the upper right triangle is returned.

Returns:

An [n x n] square matrix (where n = length of the amino acid sequence) defining the inter-residue distances between every pair of residues.

Return type:

np.ndarray

get_end_to_end_distribution()

Defines the end-to-end distance (Re) distribution using the standard end-to-end model (as in [Rubinstein2003]).

\(P(r) = 4\pi r^2 \left( \frac{3}{2\pi \langle r^2 \rangle} \right)^{3/2} e^{-\frac{3 r^2}{2 \langle r^2 \rangle}}\)

Returns:

A two-element tuple of numpy arrays where the first is the distance (in Angstroms) and the second array is the probability of that distance.

Return type:

tuple of arrays
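
The expression for \(P(r)\) above can be evaluated on a grid to produce a (distances, probabilities) pair of the kind described. The sketch below is a standalone illustration using only the standard library. The function names, the value of mean_sq and the choice of bin midpoints are assumptions, and afrc's own binning may differ.

```python
import math

def end_to_end_pdf(r, mean_sq):
    """Gaussian-chain P(r) for a given mean-squared distance <r^2> (A^2)."""
    norm = (3.0 / (2.0 * math.pi * mean_sq)) ** 1.5
    return 4.0 * math.pi * r * r * norm * math.exp(-3.0 * r * r / (2.0 * mean_sq))

def build_distribution(mean_sq, r_max, dr=0.05):
    """Return (distances, probabilities) on bin midpoints; each probability is P(r) * dr."""
    n_bins = round(r_max / dr)
    distances = [(k + 0.5) * dr for k in range(n_bins)]
    probabilities = [end_to_end_pdf(r, mean_sq) * dr for r in distances]
    return distances, probabilities

mean_sq = 900.0                          # hypothetical <r^2> (rms distance of 30 A)
distances, probabilities = build_distribution(mean_sq, r_max=200.0)

total = sum(probabilities)               # close to 1 if the grid covers the tail
peak = distances[probabilities.index(max(probabilities))]
print(f"sum of probabilities   = {total:.4f}")
print(f"most probable distance = {peak:.2f} A")
```

Setting the derivative of \(P(r)\) to zero gives a most probable distance of \(\sqrt{2\langle r^2 \rangle / 3}\), which the grid maximum should reproduce to within one bin.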

get_internal_scaling(calculation_mode='scaling law')

Returns the internal scaling profile - a [2 by n] matrix that reports on the average distance between all residues that are n positions apart ( where n is | i - j | ).

Distances are in angstroms and are measured from the residue center of mass.

A linear log-log fit of this data gives a gradient of 0.5 (\(\nu^{app} = 0.5\)).

Returns:

A [2 x n] matrix (where n = length of the amino acid sequence). The first column is the set of | i-j | distances, and the second defines the average inter-residue distances between every pair of residues that are | i-j | residues apart in sequence space.

Return type:

np.ndarray
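
The log-log fit mentioned above can be reproduced with an ordinary least-squares line through (log |i-j|, log distance). The sketch below uses synthetic data generated from a \(R_0 N^{0.5}\) scaling law rather than output from afrc. The prefactor r0 and the helper name loglog_slope are hypothetical.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares gradient of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den

r0 = 5.5                                        # hypothetical prefactor in angstroms
separations = list(range(1, 51))                # |i - j| values
mean_distances = [r0 * s ** 0.5 for s in separations]

nu_app = loglog_slope(separations, mean_distances)
print(f"apparent scaling exponent = {nu_app:.3f}")
```

With real data, the same fit applied to the two columns of the internal scaling profile gives the apparent exponent \(\nu^{app}\).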

get_interresidue_distance_distribution(R1, R2)

Returns the distance distribution between a pair of residues on the chain.

Parameters:
  • R1 (int) – The first residue of the pair being investigated.

  • R2 (int) – The second residue of the pair being investigated.

Returns:

A two-element tuple (distances, probabilities) where the first array is the distance (in Angstroms) and the second is the corresponding probability.

Return type:

tuple of np.ndarray

get_mean_end_to_end_distance(calculation_mode='scaling law')

Returns the mean end-to-end distance (\(R_e\)). This value can be the absolute mean end-to-end distance or the root-mean-square end-to-end distance.

Parameters:

calculation_mode (string (keyword)) – calculation_mode defines the mode in which the average is calculated, and can be set to either ‘scaling law’ (default) or ‘distribution’. If ‘distribution’ is used then the complete Re distribution is used to calculate the expected value. If the ‘scaling law’ is used then the standard Re = R0 * N^{0.5} is used.

Returns:

Value equal to the average end-to-end distance (as defined by calculation_mode).

Return type:

float
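
The absolute mean and the root-mean-square value differ by a constant factor for the \(P(r)\) given under get_end_to_end_distribution(). As a worked check, assuming that 'distribution' returns the first moment of \(P(r)\) and that the scaling law describes the root-mean-square value, and writing \(a = 3 / (2\langle r^2 \rangle)\):

```latex
\langle r \rangle
  = \int_0^\infty r\,P(r)\,dr
  = 4\pi \left( \frac{a}{\pi} \right)^{3/2} \int_0^\infty r^3 e^{-a r^2}\,dr
  = 4\pi \left( \frac{a}{\pi} \right)^{3/2} \frac{1}{2a^2}
  = \frac{2}{\sqrt{\pi a}}
  = \sqrt{\frac{8}{3\pi}}\,\langle r^2 \rangle^{1/2}
  \approx 0.921\,\langle r^2 \rangle^{1/2}
```

Under these assumptions the two modes would be expected to differ by roughly 8%, with the absolute mean being the smaller value.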

get_mean_hydrodynamic_radius(calculation_mode='kirkwood-riseman')

Returns the average hydrodynamic radius, calculated either using the Kirkwood-Riseman equation or using the empirical Rg-to-Rh conversion scheme developed by Nygaard et al.

Parameters:

calculation_mode (string (keyword)) – Defines how the hydrodynamic radius should be calculated. Must be one of either “kirkwood-riseman” or “nygaard”.

Returns:

Value equal to the average hydrodynamic radius (in Angstroms).

Return type:

float

References

[1] Nygaard M, Kragelund BB, Papaleo E, Lindorff-Larsen K. An Efficient Method for Estimating the Hydrodynamic Radius of Disordered Protein Conformations. Biophys J. 2017;113: 550–557.

[2] Kirkwood, J. G., & Riseman, J. (1948). The Intrinsic Viscosities and Diffusion Constants of Flexible Macromolecules in Solution. The Journal of Chemical Physics, 16(6), 565–573.
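
As a minimal sketch of the Kirkwood-Riseman idea, the hydrodynamic radius can be approximated from the mean inverse inter-residue distance, \(1/R_h \approx N^{-2} \sum_{i \neq j} \langle 1/r_{ij} \rangle\). The code below assumes an ideal chain with \(\langle r_{ij}^2 \rangle = R_0^2 |i-j|\) and uses the Gaussian-chain result \(\langle 1/r \rangle = \sqrt{6 / (\pi \langle r^2 \rangle)}\). The function name and the prefactor r0 are hypothetical, and afrc's implementation may differ in detail.

```python
import math

def kirkwood_riseman_rh(n, r0):
    """Approximate Rh (angstroms) for an ideal chain of n residues.

    Assumes <r_ij^2> = r0^2 * |i - j|, with r0 a hypothetical prefactor.
    """
    inverse_sum = 0.0
    for sep in range(1, n):
        mean_sq = r0 * r0 * sep
        mean_inv_r = math.sqrt(6.0 / (math.pi * mean_sq))   # <1/r> for a Gaussian pair
        inverse_sum += 2 * (n - sep) * mean_inv_r           # ordered pairs at this separation
    return n * n / inverse_sum

rh_100 = kirkwood_riseman_rh(100, r0=5.5)
rh_400 = kirkwood_riseman_rh(400, r0=5.5)
print(f"Rh (n = 100) = {rh_100:.1f} A")
print(f"Rh (n = 400) = {rh_400:.1f} A")
```

For an ideal chain the result grows approximately as \(N^{0.5}\), so quadrupling the chain length roughly doubles the estimate.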

get_mean_interresidue_distance(R1, R2, calculation_mode='scaling law')

Returns the mean distance between a pair of residues on the chain.

Parameters:
  • R1 (int) – The first residue of the pair being investigated.

  • R2 (int) – The second residue of the pair being investigated.

  • calculation_mode (string (keyword)) – calculation_mode defines the mode in which the average is calculated, and can be set to either ‘scaling law’ (default) or ‘distribution’. If ‘distribution’ is used then the complete Re distribution is used to calculate the expected value. If the ‘scaling law’ is used then the standard Re = R0 * N^{0.5} is used.

Returns:

The mean distance (in Angstroms) between residues R1 and R2.

Return type:

float

get_mean_interresidue_radius_of_gyration(R1, R2, calculation_mode='scaling law')

Returns the mean radius of gyration (\(R_g\)) as calculated from the \(R_g\) distribution BETWEEN a pair of residues (i.e. the \(R_g\) distribution for an internal local region of the chain).

Parameters:

calculation_mode (string (keyword)) – calculation_mode defines the mode in which the average is calculated, and can be set to either ‘scaling law’ (default) or ‘distribution’. If ‘distribution’ is used then the complete Rg distribution is used to calculate the expected value. If the ‘scaling law’ is used then the standard Rg = R0 * N^{0.5} is used.

Returns:

Value equal to the mean radius of gyration.

Return type:

float

get_mean_radius_of_gyration(calculation_mode='distribution')

Returns the mean radius of gyration (\(R_g\)).

Parameters:

calculation_mode (str) – calculation_mode defines the mode in which the average is calculated, and can be set to either ‘distribution’ (default) or ‘scaling law’. If ‘distribution’ is used then the complete Rg distribution is used to calculate the expected value. If the ‘scaling law’ is used then the standard Rg = R0 * N^{0.5} is used.

Returns:

Value equal to the mean radius of gyration.

Return type:

float

get_pre_profile(label_position, tau_c=4, t_delay=12, R_2D=14, W_H=600000000, sample_size=10000)

Calculate the hypothetical paramagnetic relaxation enhancement (PRE) profile expected if a spin label were placed at position label_position. The only required input is the label position, but additional experimental parameters can be passed in as well.

It’s important to remember this method does not consider the explicit position of a spin label linker, but does provide a reference model as to the expected PRE profile if the chain behaved as an AFRC chain.

Parameters:
  • label_position (int) – Position along the chain that is labelled

  • tau_c (float) – tau_c is the effective correlation time, measured in nanoseconds, which is typically between 1 and 30. Default = 4

  • t_delay (float) – Total duration of the INEPT delays from the PRE experiment, as measured in ms. This will depend on the pulse sequence used, but is typically around 1-30 ms for HSQC. Default = 12

  • R_2D (float) – The transverse relaxation rate of the backbone amide protons in the diamagnetic form of the protein, measured in Hertz (i.e. ‘per second’). A value of around 10 might be expected. Default = 14

  • W_H (float) – The proton Larmor frequency, which is typically the “MHz” value associated with the magnet, given in Hz. For example, a 600 MHz magnet would use the value 600000000. Note that the proton Larmor (angular) frequency at 1 Tesla is 267530000 rad per second, i.e. the proton gyromagnetic ratio is 267530000 rad per second per Tesla. Default = 600000000

Returns:

Returns a 3-element list:

[0] - residue indices (starting at 0)

[1] - PRE profile (a value between 0 and 1)

[2] - PRE H1 relaxation profile (gamma)

Return type:

list

References

[1] Meng, W., Lyle, N., Luan, B., Raleigh, D.P., and Pappu, R.V. (2013). Experiments and simulations show how long-range contacts can form in expanded unfolded proteins with negligible secondary structure. Proc. Natl. Acad. Sci. U. S. A. 110, 2123-2128.

[2] Das, R.K., Huang, Y., Phillips, A.H., Kriwacki, R.W., and Pappu, R.V. (2016). Cryptic sequence features within the disordered protein p27Kip1 regulate cell cycle signaling. Proc. Natl. Acad. Sci. U. S. A. 113, 5616-5621.

[3] Peran, I., Holehouse, A. S., Carrico, I. S., Pappu, R. V., Bilsel, O., & Raleigh, D. P. (2019). Unfolded states under folding conditions accommodate sequence-specific conformational preferences with random coil-like dimensions. Proceedings of the National Academy of Sciences of the United States of America, 116(25), 12301–12310.

[4] Lalmansingh, J. M., Keeley, A. T., Ruff, K. M., Pappu, R. V., & Holehouse, A. S. (2023). SOURSOP: A Python package for the analysis of simulations of intrinsically disordered proteins. bioRxiv : The Preprint Server for Biology. https://doi.org/10.1101/2023.02.16.528879
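
To show how the experimental parameters enter a PRE calculation, the sketch below evaluates a commonly used form of the relation for a single label-to-proton distance: a relaxation enhancement gamma proportional to \(r^{-6}\), followed by the intensity ratio \(R_{2D} e^{-\gamma t} / (R_{2D} + \gamma)\). Everything here is an assumption made for illustration, including the constant K, the conversion of W_H from Hz to rad/s, and the function names. It evaluates one fixed distance rather than averaging over the AFRC distance distribution, and it is not taken from the afrc source.

```python
import math

K = 1.23e-32   # cm^6 s^-2; assumed constant for a nitroxide label acting on a proton

def pre_gamma(r_angstrom, tau_c_ns=4.0, w_h_hz=600e6):
    """Paramagnetic contribution to the transverse relaxation rate (per second)."""
    r_cm = r_angstrom * 1e-8
    tau_c = tau_c_ns * 1e-9
    omega = 2.0 * math.pi * w_h_hz            # assumption: convert Hz to rad/s
    spectral = 4.0 * tau_c + 3.0 * tau_c / (1.0 + (omega * tau_c) ** 2)
    return (K / r_cm ** 6) * spectral

def pre_intensity_ratio(r_angstrom, tau_c_ns=4.0, t_delay_ms=12.0,
                        r_2d=14.0, w_h_hz=600e6):
    """Paramagnetic/diamagnetic peak intensity ratio, a value between 0 and 1."""
    gamma = pre_gamma(r_angstrom, tau_c_ns, w_h_hz)
    t_delay = t_delay_ms * 1e-3
    return r_2d * math.exp(-gamma * t_delay) / (r_2d + gamma)

for r in (10.0, 15.0, 25.0, 40.0):
    print(f"r = {r:5.1f} A  ratio = {pre_intensity_ratio(r):.3f}")
```

The steep \(r^{-6}\) dependence means the ratio rises from near 0 to near 1 over a fairly narrow distance window, which is why the profile is sensitive to transient contacts.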

get_radius_of_gyration_distribution()

Defines the radius of gyration (\(R_g\)) distribution using equation (3) from [Lhuillier1988].

Returns:

A two-element tuple of numpy arrays where the first is the distance (in Angstroms) and the second array is the probability of that distance.

Return type:

tuple of arrays

sample_end_to_end_distribution(n=1000)

Subsamples from the end-to-end distance distribution to generate an uncorrelated ‘trajectory’ of points. Useful for creating a size-matched sample to compare with simulation data.

Parameters:

n (int) – Number of random values to sample (default = 1000)

Returns:

Returns an n-length array with n independent values (floats)

Return type:

np.ndarray
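
Conceptually, this kind of subsampling amounts to drawing independent values from a discrete (distances, probabilities) distribution. The sketch below does so with the standard library, using a Gaussian-chain distribution built on a grid as a stand-in. The function names, the seed argument and the value of mean_sq are hypothetical, and afrc may sample differently.

```python
import math
import random

def gaussian_chain_distribution(mean_sq, r_max, dr=0.05):
    """Build (distances, probabilities) for a Gaussian chain on bin midpoints."""
    norm = (3.0 / (2.0 * math.pi * mean_sq)) ** 1.5
    distances = [(k + 0.5) * dr for k in range(round(r_max / dr))]
    probabilities = [
        4.0 * math.pi * r * r * norm * math.exp(-3.0 * r * r / (2.0 * mean_sq)) * dr
        for r in distances
    ]
    return distances, probabilities

def sample_distribution(distances, probabilities, n=1000, seed=None):
    """Draw n independent distances, weighted by their probabilities."""
    rng = random.Random(seed)
    return rng.choices(distances, weights=probabilities, k=n)

distances, probabilities = gaussian_chain_distribution(mean_sq=900.0, r_max=200.0)
samples = sample_distribution(distances, probabilities, n=5000, seed=0)

sample_mean = sum(samples) / len(samples)
print(f"{len(samples)} samples, mean = {sample_mean:.2f} A")
```

Matching n to the number of frames in a simulation trajectory gives a reference sample with comparable statistical noise.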

sample_inter_residue_distance_distribution(R1, R2, n=1000)

Subsamples from the inter-residue distance distribution (between residues R1 and R2) to generate an uncorrelated ‘trajectory’ of points. Useful for creating a size-matched sample to compare with simulation data.

Parameters:
  • R1 (int) – The first residue of the pair being investigated.

  • R2 (int) – The second residue of the pair being investigated.

  • n (int) – Number of random values to sample (default = 1000)

Returns:

Returns an n-length array with n independent values (floats)

Return type:

np.ndarray

sample_radius_of_gyration_distribution(n=1000)

Subsamples from the \(R_g\) distribution to generate an uncorrelated ‘trajectory’ of points. Useful for creating a size-matched sample to compare with simulation data.

Parameters:

n (int) – Number of random values to sample (default = 1000)

Returns:

Returns an n-length array with n independent values (floats)

Return type:

np.ndarray