Tuesday, October 5, 2010

Linear v. Dynamic Search in Molecular Structures

In the field of cheminformatics, a common task is navigating through a database of molecules that could have thousands of entries.

One question is how to store the molecules themselves. A common approach in cheminformatics is the SMILES string, which encodes a molecule's atoms along with special characters that indicate structural properties such as bonds, branches, and rings. Previous tests in the Lewis Research Group have involved RDX adsorption (RDX is an explosive compound). The Lewis Research Group uses .xyz as its default file format, but an open-source file conversion tool, Open Babel, can convert between formats. The SMILES string for RDX is provided below, along with a picture for comparison.
RDX Molecule
O=N(=O)N1CN(N(=O)=O)CN(N(=O)=O)C1
SMILES strings can represent the same structure in different ways, so databases use a special form known as canonical SMILES to prevent duplicate entries. An advantage of SMILES strings is that they break down into comparable substructures.

There are clever indexing schemes for quickly searching structural properties of molecules. One is substructure keys, in which binary flags about structural properties are stored for each molecule. Another is using a hash table encoding that serves as a proximity filter.
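
To make the substructure-key idea concrete, here is a minimal sketch that stores each molecule's flags as bits in an integer and screens the database with a bitwise test. The key names and the hand-assigned flags are assumptions for illustration only; a real system would derive the flags from the structure itself.

```python
# Hypothetical key set: each bit records one structural property.
KEYS = ["nitro_group", "six_membered_ring", "aromatic_ring", "halogen"]
BIT = {name: 1 << i for i, name in enumerate(KEYS)}


def make_mask(properties):
    """Pack an iterable of property names into an integer bitmask."""
    mask = 0
    for name in properties:
        mask |= BIT[name]
    return mask


# Hand-assigned flags (an assumption for this sketch, not computed from structure).
database = {
    "RDX": make_mask(["nitro_group", "six_membered_ring"]),
    "benzene": make_mask(["six_membered_ring", "aromatic_ring"]),
    "chlorobenzene": make_mask(["six_membered_ring", "aromatic_ring", "halogen"]),
}


def screen(query_mask, db):
    """Return the names whose keys include every bit set in the query."""
    return [name for name, mask in db.items() if (mask & query_mask) == query_mask]


candidates = screen(make_mask(["six_membered_ring", "aromatic_ring"]), database)
```

The screen only rules molecules out: anything that passes still needs a full structural comparison, but the bitwise test is cheap enough to run over every entry first.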

There is also the area of molecular similarity and molecular diversity. This field seeks to find similar molecules by comparing derived attributes. For example, the molecular weight of a molecule is computationally easy to obtain. This attribute, and others, can be used as a search term to find similar molecules, or to look for correlation with more complex attributes such as molecular adsorption in repeating lattice structures.
Other features can also be noted and collected, providing a linear database for searching.
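
As a rough sketch of what such a linear database could look like, the code below computes molecular weight from a molecular formula and keeps the entries sorted so that a weight range can be looked up with a binary search. The compounds other than RDX, the rounded atomic weights, and the function names are assumptions chosen for illustration.

```python
import bisect
import re

# Standard atomic weights in g/mol, rounded to three decimals.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molecular_weight(formula):
    """Sum atomic weights for a simple formula such as 'C3H6N6O6'."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHT[symbol] * (int(count) if count else 1)
    return total


# Illustrative entries; only RDX comes from the discussion above.
formulas = {"RDX": "C3H6N6O6", "TNT": "C7H5N3O6", "HMX": "C4H8N8O8", "water": "H2O"}

# Sort once by weight so range queries do not need to scan every entry.
index = sorted((molecular_weight(f), name) for name, f in formulas.items())
weights = [w for w, _ in index]


def in_range(low, high):
    """Return the names of entries whose weight lies in [low, high]."""
    lo = bisect.bisect_left(weights, low)
    hi = bisect.bisect_right(weights, high)
    return [name for _, name in index[lo:hi]]


similar_to_rdx = in_range(220.0, 230.0)
```

Each additional attribute could get its own sorted index in the same way, at the cost of keeping the indexes in sync when entries change.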

Overall, the question is whether to use structural similarity analysis (dynamic) or molecular diversity measures (linear) for the system.
The answer, at the current moment, appears to be to use both, for two reasons.
The first is that we are not yet sure which variables correlate with performance in the desired chemical property, adsorption in a lattice structure, so having more features provides a greater chance of finding a correlation and producing more accurate estimates. The linear search could also be used to select candidates for the dynamic search, giving a hybrid approach in which the linear search acts as a preprocessing step.
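
One way the hybrid approach might look is sketched below: a cheap molecular-weight filter narrows the database, and only the survivors are ranked by a more expensive comparison. The Tanimoto score on SMILES character trigrams is a crude stand-in for a real graph-based structural comparison, and the entries, tolerance, and function names are assumptions.

```python
def trigrams(smiles):
    """Set of 3-character substrings; a crude stand-in for structural fragments."""
    return {smiles[i:i + 3] for i in range(len(smiles) - 2)}


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# name -> (molecular weight in g/mol, SMILES); illustrative entries only.
entries = {
    "RDX": (222.12, "O=N(=O)N1CN(N(=O)=O)CN(N(=O)=O)C1"),
    "HMX": (296.16, "O=N(=O)N1CN(N(=O)=O)CN(N(=O)=O)CN(N(=O)=O)C1"),
    "benzene": (78.11, "c1ccccc1"),
    "ethanol": (46.07, "CCO"),
}


def hybrid_search(query_weight, query_smiles, tolerance, db):
    """Stage 1: keep entries within the weight tolerance. Stage 2: rank by similarity."""
    shortlist = [
        (name, smiles)
        for name, (weight, smiles) in db.items()
        if abs(weight - query_weight) <= tolerance
    ]
    query = trigrams(query_smiles)
    scored = [(name, tanimoto(query, trigrams(s))) for name, s in shortlist]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranked = hybrid_search(222.12, entries["RDX"][1], 100.0, entries)
```

The tolerance controls the trade-off: a tight window saves work in the second stage but risks discarding structurally similar molecules whose weights differ.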

The second is that the system has the advantage of being domain-specific and thus has access to domain-specific algorithms and methods. The database in question will have many properties that need to be handled in relation to their environment. For example, entries in the database will likely have missing or blank features. The algorithm might be able to call a function to compute missing entries, a highly domain-specific solution that is not available to general-purpose algorithms.
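
A minimal sketch of that idea follows: when a feature is blank, the lookup calls a calculator for it and caches the result in the entry. The feature name, the calculator table, and the simplified atom-counting rule (which ignores bracket atoms and elements outside the common organic subset) are assumptions for illustration.

```python
import re

# Matches atom symbols from the common organic subset of SMILES (simplified).
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[cnops]")


def heavy_atom_count(entry):
    """Count non-hydrogen atoms written explicitly in the entry's SMILES string."""
    return len(ATOM_PATTERN.findall(entry["smiles"]))


# Hypothetical table mapping a feature name to a function that can compute it.
CALCULATORS = {"heavy_atom_count": heavy_atom_count}


def get_feature(entry, name):
    """Return a stored feature, computing and caching it if it is missing."""
    value = entry.get(name)
    if value is None:
        value = CALCULATORS[name](entry)
        entry[name] = value  # cache so later lookups skip the computation
    return value


rdx = {"smiles": "O=N(=O)N1CN(N(=O)=O)CN(N(=O)=O)C1", "heavy_atom_count": None}
count = get_feature(rdx, "heavy_atom_count")
```

A general-purpose algorithm would have to impute or drop the blank value; here the domain knowledge makes it possible to recover the exact value instead.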


