megnet.data.molecule module
Tools for creating graph inputs from molecule data
- class MolecularGraph(atom_features: Optional[List[str]] = None, bond_features: Optional[List[str]] = None, distance_converter: Optional[megnet.data.graph.Converter] = None, known_elements: Optional[List[str]] = None, max_ring_size: int = 9)
Bases: megnet.data.graph.StructureGraph
Class for generating the graph inputs from a molecule
Computes many different features for the atoms and bonds in a molecule and prepares them in a form compatible with MEGNet models. The convert() method takes an OpenBabel molecule and, besides computing features, also encodes them in a form compatible with machine learning. Namely, the convert method one-hot encodes categorical variables and concatenates the atomic features.
## Atomic Features
This class can compute the following features for each atom:
atomic_num: The atomic number
element: (categorical) Element identity. (Unlike atomic_num, element is one-hot-encoded)
chirality: (categorical) R, S, or not a chiral center (one-hot encoded)
formal_charge: Formal charge of the atom
ring_sizes: For rings with 9 or fewer atoms, how many unique rings of each size include this atom
hybridization: (categorical) Hybridization of atom: sp, sp2, sp3, sq. planar, trig, octahedral, or hydrogen
donor: (boolean) Whether the atom is a hydrogen bond donor
acceptor: (boolean) Whether the atom is a hydrogen bond acceptor
aromatic: (boolean) Whether the atom is part of an aromatic system
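The encoding step described above (one-hot encode the categorical features, then concatenate everything into one vector per atom) can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper names `one_hot` and `encode_atom`, the chirality labels, and the chosen feature order are all assumptions made for the example.

```python
from typing import Any, Dict, List, Sequence


def one_hot(value: str, categories: Sequence[str]) -> List[float]:
    """Return a vector with 1.0 at the position of `value` in `categories`."""
    if value not in categories:
        raise ValueError(f"Unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


def encode_atom(features: Dict[str, Any], known_elements: Sequence[str]) -> List[float]:
    """Concatenate numeric features with one-hot encoded categorical ones."""
    chirality_levels = ["R", "S", "none"]  # hypothetical labels, for illustration
    vector = [float(features["atomic_num"]), float(features["formal_charge"])]
    vector += one_hot(features["element"], known_elements)
    vector += one_hot(features["chirality"], chirality_levels)
    vector.append(1.0 if features["aromatic"] else 0.0)  # booleans become 0/1
    return vector


carbon = {"atomic_num": 6, "element": "C", "chirality": "none",
          "formal_charge": 0, "aromatic": True}
print(encode_atom(carbon, known_elements=["H", "C", "N", "O"]))
# [6.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
```

The length of the one-hot block for element depends on the list of known elements, which is why that list has to be fixed before any molecule is encoded.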
## Atom Pair Features
The class also computes features for each pair of atoms
bond_type: (categorical) Whether the pair are unbonded, or in a single, double, triple, or aromatic bond
same_ring: (boolean) Whether the atoms are in the same aromatic ring
graph_distance: Distance of shortest path between atoms on the bonding graph
spatial_distance: Euclidean distance between the atoms. By default, this distance is expanded into a vector of 20 different values computed using the GaussianDistance converter
- Parameters
atom_features ([str]) – List of atom features to compute
bond_features ([str]) – List of bond features to compute
distance_converter (Converter) – Tool used to expand each distance from a single scalar into an array of values
known_elements ([str]) – List of elements expected to be in dataset. Used only if the feature element is used to describe each atom
max_ring_size (int) – Maximum number of atoms in a ring
- convert(mol, state_attributes: Optional[List] = None, full_pair_matrix: bool = True) → Dict
Compute the representation for a molecule
- Parameters
mol (pybel.Molecule) – Molecule to generate features for
state_attributes (list) – State attributes. Uses average mass and number of bonds per atom as default
full_pair_matrix (bool) – Whether to generate info for all atom pairs, not just bonded ones
- Returns
Dictionary of features
- Return type
(dict)
- static create_bond_feature(mol, bid: int, eid: int) → Dict
Create information for a bond for a pair of atoms that are not actually bonded
- Parameters
mol (pybel.Molecule) – Molecule being featurized
bid (int) – Index of atom beginning of the bond
eid (int) – Index of atom at the end of the bond
- get_atom_feature(mol, atom) → Dict
Generate all features of a particular atom
- Parameters
mol (pybel.Molecule) – Molecule being evaluated
atom (pybel.Atom) – Specific atom being evaluated
- Returns
All features for that atom
- Return type
(dict)
- get_pair_feature(mol, bid: int, eid: int, full_pair_matrix: bool) → Optional[Dict]
Get the features for a certain bond
- Parameters
mol (pybel.Molecule) – Molecule being featurized
bid (int) – Index of atom beginning of the bond
eid (int) – Index of atom at the end of the bond
full_pair_matrix (bool) – Whether to compute features for every pair of atoms, even those that are not actually bonded
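To make the effect of full_pair_matrix concrete, here is a minimal sketch of how a pair-feature function with an Optional[Dict] return could be used to collect pairs. It assumes that pairs are visited as ordered index pairs in both directions and that unbonded pairs yield None when the full matrix is not requested. Neither assumption is confirmed by this page, and the names `pair_feature` and `collect_pairs` and the dictionary keys are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Optional, Set, Tuple


def pair_feature(bonded: Set[Tuple[int, int]], bid: int, eid: int,
                 full_pair_matrix: bool) -> Optional[Dict]:
    """Return a feature dict for the pair, or None if the pair is skipped."""
    is_bonded = (bid, eid) in bonded or (eid, bid) in bonded
    if not is_bonded and not full_pair_matrix:
        return None  # unbonded pairs are dropped unless the full matrix is requested
    return {"bid": bid, "eid": eid, "bonded": is_bonded}  # hypothetical keys


def collect_pairs(n_atoms: int, bonded: Set[Tuple[int, int]],
                  full_pair_matrix: bool) -> List[Dict]:
    """Visit every ordered pair of distinct atoms and keep the non-None results."""
    pairs = []
    for bid, eid in permutations(range(n_atoms), 2):
        feature = pair_feature(bonded, bid, eid, full_pair_matrix)
        if feature is not None:
            pairs.append(feature)
    return pairs


bonds = {(0, 1), (1, 2)}  # a three-atom chain: 0-1-2
print(len(collect_pairs(3, bonds, full_pair_matrix=True)))   # 6
print(len(collect_pairs(3, bonds, full_pair_matrix=False)))  # 4
```

With the full matrix, the number of pairs grows quadratically with the number of atoms, which is worth keeping in mind for larger molecules.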
- class MolecularGraphBatchGenerator(mols: List[str], targets: Optional[List[numpy.ndarray]] = None, converter: Optional[megnet.data.molecule.MolecularGraph] = None, molecule_format: str = 'xyz', batch_size: int = 128, shuffle: bool = True, n_jobs: int = 1)
Bases: megnet.data.graph.BaseGraphBatchGenerator
Generator that creates batches of molecular data by computing graph properties on demand
If your dataset is small enough that the descriptions of the whole dataset fit in memory, we recommend using megnet.data.graph.GraphBatchGenerator instead to avoid the computational cost of dynamically computing graphs.
- Parameters
mols ([str]) – List of the string representations of each molecule
targets ([ndarray]) – Properties of each molecule to be predicted
converter (MolecularGraph) – Converter used to generate graph features
molecule_format (str) – Format of each of the string representations in mols
batch_size (int) – Target size for each batch
shuffle (bool) – Whether to shuffle the training data after each epoch
n_jobs (int) – Number of worker threads (None to use all threads).
- create_cached_generator() → megnet.data.graph.GraphBatchGenerator
Generates features for all of the molecules and stores them in memory
- Returns
(GraphBatchGenerator) Graph generator that relies on having the graphs in memory
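The trade-off between computing graphs on demand and caching them can be sketched without the library. The example below is an illustration only: `featurize` is a hypothetical stand-in for the expensive graph conversion, and `on_demand_batches` and `cached_batches` are invented names, not MEGNet functions.

```python
import random
from typing import Dict, Iterator, List, Optional, Sequence


def featurize(mol: str) -> Dict[str, int]:
    """Hypothetical stand-in for an expensive molecule-to-graph conversion."""
    return {"n_chars": len(mol)}


def on_demand_batches(mols: Sequence[str], batch_size: int, shuffle: bool = True,
                      seed: Optional[int] = None) -> Iterator[List[Dict[str, int]]]:
    """Convert molecules batch by batch; nothing is kept between epochs."""
    order = list(range(len(mols)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [featurize(mols[i]) for i in order[start:start + batch_size]]


def cached_batches(mols: Sequence[str],
                   batch_size: int) -> Iterator[List[Dict[str, int]]]:
    """Convert every molecule once up front, then slice batches from memory."""
    graphs = [featurize(m) for m in mols]
    for start in range(0, len(graphs), batch_size):
        yield graphs[start:start + batch_size]


mols = ["C", "CC", "CCO"]
print(list(on_demand_batches(mols, batch_size=2, shuffle=False)))
# [[{'n_chars': 1}, {'n_chars': 2}], [{'n_chars': 3}]]
```

The on-demand version repeats the conversion every epoch but holds only one batch at a time, while the cached version pays the conversion cost once and keeps every graph in memory.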
- class SimpleMolGraph(nn_strategy: Union[str, pymatgen.analysis.local_env.NearNeighbors] = 'AllAtomPairs', atom_converter: Optional[megnet.data.graph.Converter] = None, bond_converter: Optional[megnet.data.graph.Converter] = None)
Bases: megnet.data.graph.StructureGraph
By default, all atom pairs are treated as bonds, and the distance between atoms is used as the bond feature. By default, the distance is expanded using a Gaussian expansion with centers at np.linspace(0, 4, 20) and a width of 0.5
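A Gaussian expansion with the centers and width quoted above can be sketched in plain Python. The functional form used here, exp(-(d - c)^2 / w^2), is an assumption about how the expansion is defined and is not confirmed by this page. The helpers `linspace` and `gaussian_expand` are hypothetical names written for this example, with `linspace` mimicking np.linspace.

```python
import math
from typing import List


def linspace(start: float, stop: float, num: int) -> List[float]:
    """Evenly spaced values from start to stop inclusive (like np.linspace)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def gaussian_expand(distance: float, centers: List[float], width: float) -> List[float]:
    """Expand one scalar distance into one Gaussian response per center.

    Assumed form: exp(-(d - c)^2 / w^2).
    """
    return [math.exp(-((distance - c) ** 2) / width ** 2) for c in centers]


centers = linspace(0.0, 4.0, 20)
expanded = gaussian_expand(1.5, centers, width=0.5)
print(len(expanded))  # 20
```

Each entry peaks when the distance matches its center, so the vector acts as a smooth, fixed-length encoding of a single scalar.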
- dijkstra_distance(bonds: List[List[int]]) → numpy.ndarray
Compute the graph distance using Dijkstra's algorithm
- Parameters
bonds (list of list) – Bonds given as pairs of atom indices; for example, [[0, 1], [1, 2]] means two bonds, one between atoms 0 and 1 and one between atoms 1 and 2
- Returns
full graph distance matrix
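As a rough illustration of the idea, the sketch below computes an all-pairs graph distance matrix from a bond list by running Dijkstra's algorithm from every atom, with each bond counted as one step. It is not the library's code: the function name `graph_distance_matrix` and the explicit `n_atoms` argument are assumptions made for the example, and it returns nested lists rather than a numpy array.

```python
import heapq
from typing import List


def graph_distance_matrix(bonds: List[List[int]], n_atoms: int) -> List[List[float]]:
    """All-pairs shortest path lengths on the bonding graph (unit edge weights)."""
    adjacency: List[List[int]] = [[] for _ in range(n_atoms)]
    for a, b in bonds:
        adjacency[a].append(b)  # bonds are undirected
        adjacency[b].append(a)

    matrix = []
    for source in range(n_atoms):
        dist: List[float] = [float("inf")] * n_atoms  # inf marks unreachable atoms
        dist[source] = 0
        heap = [(0, source)]
        while heap:
            d, node = heapq.heappop(heap)
            if d > dist[node]:
                continue  # stale queue entry, a shorter path was already found
            for neighbor in adjacency[node]:
                if d + 1 < dist[neighbor]:
                    dist[neighbor] = d + 1
                    heapq.heappush(heap, (d + 1, neighbor))
        matrix.append(dist)
    return matrix


print(graph_distance_matrix([[0, 1], [1, 2]], n_atoms=3))
# [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
```

Because every bond has the same weight, a breadth-first search would give the same result; Dijkstra's algorithm is shown only because it is the one named above.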