megnet.data.molecule module

Tools for creating graph inputs from molecule data

class MolecularGraph(atom_features: Optional[List[str]] = None, bond_features: Optional[List[str]] = None, distance_converter: Optional[megnet.data.graph.Converter] = None, known_elements: Optional[List[str]] = None, max_ring_size: int = 9)[source]

Bases: megnet.data.graph.StructureGraph

Class for generating the graph inputs from a molecule

Computes many different features for the atoms and bonds in a molecule, and prepares them in a form compatible with MEGNet models. The convert() method takes an OpenBabel molecule and, besides computing features, also encodes them in a form compatible with machine learning. Namely, the convert method one-hot encodes categorical variables and concatenates the atomic features.

## Atomic Features

This class can compute the following features for each atom

  • atomic_num: The atomic number

  • element: (categorical) Element identity. (Unlike atomic_num, element is one-hot-encoded)

  • chirality: (categorical) R, S, or not a Chiral center (one-hot encoded).

  • formal_charge: Formal charge of the atom

  • ring_sizes: For rings with 9 or fewer atoms, how many unique rings of each size include this atom

  • hybridization: (categorical) Hybridization of atom: sp, sp2, sp3, sq. planar, trig, octahedral, or hydrogen

  • donor: (boolean) Whether the atom is a hydrogen bond donor

  • acceptor: (boolean) Whether the atom is a hydrogen bond acceptor

  • aromatic: (boolean) Whether the atom is part of an aromatic system
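The encoding step described above (one-hot encoding categorical features, then concatenating everything into a single vector per atom) can be sketched in plain Python. This is only an illustration of the idea, not the library's code: the helper names `one_hot` and `encode_atom`, the choice of three features, and their order in the vector are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Union


def one_hot(value: str, categories: Sequence[str]) -> List[int]:
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"Unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


def encode_atom(
    features: Dict[str, Union[str, int, bool]],
    known_elements: Sequence[str],
) -> List[int]:
    """Concatenate a few atom features into one flat vector.

    Hypothetical helper: the feature order used here is an assumption.
    """
    vector: List[int] = []
    # Categorical feature: expanded to len(known_elements) entries.
    vector.extend(one_hot(str(features["element"]), known_elements))
    # Numeric feature: kept as a single entry.
    vector.append(int(features["formal_charge"]))
    # Boolean feature: stored as 0 or 1.
    vector.append(int(features["aromatic"]))
    return vector


carbon = {"element": "C", "formal_charge": 0, "aromatic": True}
encoded = encode_atom(carbon, known_elements=["H", "C", "N", "O"])
print(encoded)  # [0, 1, 0, 0, 0, 1]
```

The length of the encoded vector depends on the number of categories, which is why the list of known elements has to be fixed before the dataset is featurized.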

## Atom Pair Features

The class also computes features for each pair of atoms

  • bond_type: (categorical) Whether the pair are unbonded, or in a single, double, triple, or aromatic bond

  • same_ring: (boolean) Whether the atoms are in the same aromatic ring

  • graph_distance: Distance of shortest path between atoms on the bonding graph

  • spatial_distance: Euclidean distance between the atoms. By default, this distance is expanded into a vector of 20 different values computed using the GaussianDistance converter

Parameters
  • atom_features ([str]) – List of atom features to compute

  • bond_features ([str]) – List of bond features to compute

  • distance_converter (DistanceConverter) – Tool used to expand each distance from a single scalar into an array of values

  • known_elements ([str]) – List of elements expected to be in dataset. Used only if the feature element is used to describe each atom

  • max_ring_size (int) – Maximum number of atoms in a ring

convert(mol, state_attributes: Optional[List] = None, full_pair_matrix: bool = True) Dict[source]

Compute the representation for a molecule

Parameters
  • mol (pybel.Molecule) – Molecule to generate features for

  • state_attributes (list) – State attributes. Uses average mass and number of bonds per atom as default

  • full_pair_matrix (bool) – Whether to generate info for all atom pairs, not just bonded ones

Returns

Dictionary of features

Return type

(dict)
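The default state attributes mentioned for convert (average mass and number of bonds per atom) are simple ratios. The sketch below shows the arithmetic only. The function name `default_state_attributes`, its inputs, and the nested-list return shape are assumptions for illustration and are not taken from the library.

```python
from typing import List, Sequence


def default_state_attributes(
    atomic_masses: Sequence[float], n_bonds: int
) -> List[List[float]]:
    """Return [[average atomic mass, bonds per atom]] for one molecule.

    Hypothetical helper; the nested-list shape is an assumption.
    """
    n_atoms = len(atomic_masses)
    if n_atoms == 0:
        raise ValueError("A molecule needs at least one atom")
    average_mass = sum(atomic_masses) / n_atoms
    bonds_per_atom = n_bonds / n_atoms
    return [[average_mass, bonds_per_atom]]


# Water: one oxygen, two hydrogens, two O-H bonds.
water = default_state_attributes([15.999, 1.008, 1.008], n_bonds=2)
print(water)
```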

static create_bond_feature(mol, bid: int, eid: int) Dict[source]

Create information for a bond for a pair of atoms that are not actually bonded

Parameters
  • mol (pybel.Molecule) – Molecule being featurized

  • bid (int) – Index of the atom at the beginning of the bond

  • eid (int) – Index of atom at the end of the bond

get_atom_feature(mol, atom) Dict[source]

Generate all features of a particular atom

Parameters
  • mol (pybel.Molecule) – Molecule being evaluated

  • atom (pybel.Atom) – Specific atom being evaluated

Returns

All features for that atom

Return type

(dict)

get_pair_feature(mol, bid: int, eid: int, full_pair_matrix: bool) Optional[Dict][source]

Get the features for a certain bond

Parameters
  • mol (pybel.Molecule) – Molecule being featurized

  • bid (int) – Index of the atom at the beginning of the bond

  • eid (int) – Index of atom at the end of the bond

  • full_pair_matrix (bool) – Whether to compute features for every pair of atoms, even those that are not actually bonded

class MolecularGraphBatchGenerator(mols: List[str], targets: Optional[List[numpy.ndarray]] = None, converter: Optional[megnet.data.molecule.MolecularGraph] = None, molecule_format: str = 'xyz', batch_size: int = 128, shuffle: bool = True, n_jobs: int = 1)[source]

Bases: megnet.data.graph.BaseGraphBatchGenerator

Generator that creates batches of molecular data by computing graph properties on demand

If your dataset is small enough that the descriptions of the whole dataset fit in memory, we recommend using megnet.data.graph.GraphBatchGenerator instead to avoid the computational cost of dynamically computing graphs.

Parameters
  • mols ([str]) – List of the string representations of each molecule

  • targets ([ndarray]) – Properties of each molecule to be predicted

  • converter (MolecularGraph) – Converter used to generate graph features

  • molecule_format (str) – Format of each of the string representations in mols

  • batch_size (int) – Target size for each batch

  • shuffle (bool) – Whether to shuffle the training data after each epoch

  • n_jobs (int) – Number of worker threads (None to use all threads).
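The on-demand behavior described for this generator can be illustrated with a small pure-Python iterator: graphs are computed only for the molecules in the batch currently being requested, instead of being stored for the whole dataset. This is a sketch of the general pattern and not the library's implementation. `iter_batches`, its `seed` argument, and `toy_convert` are hypothetical names introduced for the example.

```python
import random
from typing import Any, Callable, Iterator, List, Optional, Sequence, Tuple


def iter_batches(
    mols: Sequence[str],
    targets: Sequence[Any],
    convert: Callable[[str], Any],
    batch_size: int = 128,
    shuffle: bool = True,
    seed: Optional[int] = None,  # hypothetical argument, for reproducibility
) -> Iterator[Tuple[List[Any], List[Any]]]:
    """Yield (graphs, targets) batches, converting molecules lazily."""
    order = list(range(len(mols)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        indices = order[start:start + batch_size]
        # Conversion happens here, once per batch, not up front.
        graphs = [convert(mols[i]) for i in indices]
        yield graphs, [targets[i] for i in indices]


def toy_convert(mol_string: str) -> int:
    """Stand-in for a real graph converter: returns the string length."""
    return len(mol_string)


batches = list(
    iter_batches(["a", "bb", "ccc"], [1.0, 2.0, 3.0], toy_convert,
                 batch_size=2, shuffle=False)
)
print(batches)  # [([1, 2], [1.0, 2.0]), ([3], [3.0])]
```

The trade-off is the one noted above: lazy conversion keeps memory use low but repeats the conversion work every epoch, whereas a cached generator pays that cost once.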

create_cached_generator() megnet.data.graph.GraphBatchGenerator[source]

Generates features for all of the molecules and stores them in memory

Returns

(GraphBatchGenerator) Graph generator that relies on having the graphs in memory

class SimpleMolGraph(nn_strategy: Union[str, pymatgen.analysis.local_env.NearNeighbors] = 'AllAtomPairs', atom_converter: Optional[megnet.data.graph.Converter] = None, bond_converter: Optional[megnet.data.graph.Converter] = None)[source]

Bases: megnet.data.graph.StructureGraph

Uses all atom pairs as bonds by default. The distance between atoms is used as the bond feature and, by default, is expanded using a Gaussian expansion with centers at np.linspace(0, 4, 20) and a width of 0.5
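To make the Gaussian expansion concrete, the following pure-Python sketch expands one scalar distance into 20 values using the centers and width quoted above. The functional form exp(-(d - c)^2 / width^2) is an assumption about how the expansion is defined, and `linspace` and `gaussian_expand` are hypothetical helpers written for this example rather than library functions.

```python
import math
from typing import List


def linspace(start: float, stop: float, num: int) -> List[float]:
    """Evenly spaced values from start to stop, inclusive (num >= 2)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def gaussian_expand(distance: float, centers: List[float], width: float) -> List[float]:
    """Expand a scalar distance into one Gaussian response per center.

    Assumed form: exp(-(distance - center)**2 / width**2).
    """
    return [math.exp(-((distance - c) ** 2) / width ** 2) for c in centers]


centers = linspace(0.0, 4.0, 20)
expanded = gaussian_expand(1.5, centers, width=0.5)
print(len(expanded))  # 20
```

Each entry is largest when the distance is close to that entry's center, so the vector acts as a smooth, fixed-length encoding of the distance.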

Parameters
  • nn_strategy (str) – NearNeighbor strategy

  • atom_converter (Converter) – atomic features converter object

  • bond_converter (Converter) – bond features converter object

dijkstra_distance(bonds: List[List[int]]) numpy.ndarray[source]

Compute the graph distance using Dijkstra's algorithm

Parameters

bonds (list of list) – Bonds as pairs of atom indices. For example, [[0, 1], [1, 2]] means two bonds, one between atoms 0 and 1 and one between atoms 1 and 2

Returns

full graph distance matrix
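As an illustration of what a full graph distance matrix contains, here is a minimal pure-Python sketch that runs Dijkstra's algorithm from every atom over an unweighted bond list. It is not the library's implementation: the function name `graph_distance_matrix` is hypothetical, it returns nested lists instead of a numpy.ndarray, it infers the atom count from the largest index in `bonds`, and it marks unreachable pairs with infinity, all of which are assumptions.

```python
import heapq
from typing import List


def graph_distance_matrix(bonds: List[List[int]]) -> List[List[float]]:
    """Shortest-path bond counts between all atom pairs.

    Hypothetical helper; every bond has weight 1.
    """
    # Assumption: atoms are numbered 0..max index appearing in `bonds`.
    n_atoms = max(max(pair) for pair in bonds) + 1 if bonds else 0
    neighbors: List[List[int]] = [[] for _ in range(n_atoms)]
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    matrix: List[List[float]] = []
    for source in range(n_atoms):
        dist = [float("inf")] * n_atoms
        dist[source] = 0.0
        queue = [(0.0, source)]
        while queue:
            d, node = heapq.heappop(queue)
            if d > dist[node]:
                continue  # stale queue entry, a shorter path was found
            for nxt in neighbors[node]:
                if d + 1.0 < dist[nxt]:
                    dist[nxt] = d + 1.0
                    heapq.heappush(queue, (d + 1.0, nxt))
        matrix.append(dist)
    return matrix


print(graph_distance_matrix([[0, 1], [1, 2]]))
# [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
```

For the example bond list from the parameter description, atoms 0 and 2 are two bonds apart because the only path between them runs through atom 1.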

mol_from_file(file_path: str, file_format: str = 'xyz')[source]
Parameters
  • file_path (str) –

  • file_format (str) – Any file format that Open Babel supports

mol_from_pymatgen(mol: pymatgen.core.structure.Molecule)[source]
Parameters

mol (Molecule) –

mol_from_smiles(smiles: str)[source]

Load a molecule object from a SMILES string

Parameters

smiles (str) – SMILES string

Returns

openbabel molecule