11

Are there any good data structures out there that can be used to represent a molecule?

I was thinking maybe I represent it as a Graph by making every atom a vertex, however, it's common for organic compounds to have lots of Carbons and Hydrogens. How would you number it? Is there a good way to represent molecules, but at the same time, have an efficient .contains() method?

One of the most basic uses for this would be to check whether a compound contains a carbonyl group, a benzylic hydrogen, or even a benzene ring.

scriptin

2 Answers

7

(Biochemistry graduate with 30 years software development experience)

Non-organic molecules are "relatively" simple. The interesting ones are the ones that can bond with themselves e.g. C, N, O, Si because you can get some really funky combinations. The Benzene ring is a very simple example. Some variations substitute a Nitrogen for one of the Carbons and it gets weird fast.

I'd start with an "atom" object with the various types of atom inheriting from it.

Each "atom" object would contain a list of atom objects to represent its bonds, so Nitrogen would have a list of fixed size 3 and could store links to three other atoms. A double bond could be represented as a duplicate entry.

Each atom would have rules embedded about what it can legally bond to and how.

That way you can build reasonably complicated molecules unambiguously, because bond 3 on Carbon #1 is linked to bond 1 on Hydrogen #2, and so on.
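To make that concrete, here is a minimal Python sketch of the idea. It is only an illustration under a few assumptions: the names `Atom`, `bond` and `has_carbonyl` are made up for this example, a small valence table stands in for the per-element subclasses, and the "rules" are reduced to a simple count of free bond slots.

```python
class Atom:
    """One atom with a fixed number of bond slots (its usual valence)."""

    # Assumed typical valences; real chemistry has exceptions (charges, radicals).
    VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}

    def __init__(self, element):
        if element not in self.VALENCE:
            raise ValueError(f"unsupported element: {element}")
        self.element = element
        # Neighbouring Atom objects; a double bond lists the same neighbour twice.
        self.bonds = []

    def free_slots(self):
        """Number of bond slots that are still unused."""
        return self.VALENCE[self.element] - len(self.bonds)


def bond(a, b, order=1):
    """Link two atoms, using `order` slots on each (2 = double bond)."""
    if a.free_slots() < order or b.free_slots() < order:
        raise ValueError("not enough free bond slots")
    for _ in range(order):
        a.bonds.append(b)
        b.bonds.append(a)


def has_carbonyl(atoms):
    """True if some carbon is double-bonded to an oxygen."""
    for atom in atoms:
        if atom.element != "C":
            continue
        for other in atom.bonds:
            if other.element == "O" and atom.bonds.count(other) == 2:
                return True
    return False


# Formaldehyde, H2C=O
c, o, h1, h2 = Atom("C"), Atom("O"), Atom("H"), Atom("H")
bond(c, o, order=2)
bond(c, h1)
bond(c, h2)
formaldehyde = [c, o, h1, h2]

# Methane, CH4
mc = Atom("C")
methane = [mc]
for _ in range(4):
    h = Atom("H")
    bond(mc, h)
    methane.append(h)
```

With this, `has_carbonyl(formaldehyde)` is true and `has_carbonyl(methane)` is false. Checks for larger patterns such as a benzene ring would need a proper subgraph search over the same links, which this sketch does not attempt.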

Hope that makes sense...

mcottle
4

The first temptation with modelling this is to use a quad-tree style data structure. Each carbon atom has four connections, each oxygen two and each hydrogen one. I don't think that this is the proper solution though.

I think that the proper solution has already been invented. The data structure to use is a string.

Think about this. Chemists have been modelling organic compounds for quite a long time now. If you show a chemist CH4, they will immediately recognise that as methane. Show them CH3CH2OH and they will recognise that as ethanol. They recognise this because they identify the CH3CH2 combination as an "eth" compound (meaning two carbon atoms) and the OH as an "anol" or alcohol group.

We also have a pre-existing methodology for searching and identifying substrings - regular expressions.

So to represent an organic compound programmatically, I would define a compound as containing a string which represents its chemical formula and a string defining its chemical name. It could have methods that identify which "special" properties the compound has.

An example class in C#:

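using System.Text.RegularExpressions;

public class OrganicCompound
{
    // Pattern that recognises a benzene ring in the formula string.
    private readonly Regex benzeneRingRegex;

    // NameCalculator is a helper you would write yourself; it derives the name from the formula.
    public OrganicCompound(string formula, NameCalculator nameCalculator, Regex benzeneRingRegex)
    {
        this.Formula = formula;
        this.Name = nameCalculator.CalculateName(formula);
        this.benzeneRingRegex = benzeneRingRegex;
    }

    public string Formula { get; private set; }

    public string Name { get; private set; }

    public bool HasBenzeneRing()
    {
        // Use the instance method: the static Regex.IsMatch overloads take a pattern string, not a Regex.
        return this.benzeneRingRegex.IsMatch(this.Formula);
    }
}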

Obviously you would need to write the NameCalculator class, which calculates the name from the formula. You would also need to create the regex which defines a benzene ring, and define extra regexes for each of the other groups you wish to search for.
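As a rough illustration of what "one regex per group" might look like, here is a small Python sketch. The pattern table and the `groups_in` function are hypothetical, and the patterns are toy ones: they only make sense for condensed formulas written in the conventional style (CH3CH2OH, CH3COOH, C6H5OH), and `C6H5` only spots a phenyl group written exactly that way, not rings in general.

```python
import re

# Illustrative patterns only; a chemist would need to supply and vet the real ones.
GROUP_PATTERNS = {
    # "OH" that is not the tail of a carboxyl group ("COOH").
    "hydroxyl": re.compile(r"(?<!CO)OH"),
    "carboxyl": re.compile(r"COOH"),
    # Phenyl substituent as it is usually written in condensed formulas.
    "phenyl": re.compile(r"C6H5"),
}


def groups_in(formula):
    """Return the names of all groups whose pattern occurs in the formula."""
    return {name for name, pattern in GROUP_PATTERNS.items()
            if pattern.search(formula)}


ethanol_groups = groups_in("CH3CH2OH")      # hydroxyl only
acetic_acid_groups = groups_in("CH3COOH")   # carboxyl only
phenol_groups = groups_in("C6H5OH")         # phenyl and hydroxyl
methane_groups = groups_in("CH4")           # nothing matches
```

The lookbehind in the hydroxyl pattern shows how quickly the patterns start to interact: without it, every carboxylic acid would also be reported as an alcohol.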

The advantage of modelling the compounds this way is that it uses the language of the end user's business domain. All you as the developer need to know are the strings to search for, which can easily be provided by either a textbook or a chemist.

If structural representations of these chemicals are required, I suggest looking into maintaining SMILES representations of the formula.

SMILES chemical formula representation
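To show why SMILES helps with structure, here is a rough Python sketch that turns a SMILES string into a list of atoms and bonds. It is not a real SMILES parser: the function name `parse_smiles` is made up for this example, and it assumes a small subset of the notation (single-letter element symbols, `=` and `#` bonds, parentheses for branches, single-digit ring closures). Bracket atoms, charges, stereochemistry and two-letter elements such as Cl are not handled, implicit hydrogens are not added, and aromatic bonds are simply recorded with order 1.

```python
def parse_smiles(smiles):
    """Parse a tiny SMILES subset into (atoms, bonds).

    atoms is a list of element symbols as written (lowercase = aromatic);
    bonds is a list of (atom_index, atom_index, order) tuples.
    """
    atoms = []
    bonds = []
    branch_stack = []   # atoms to return to when a ")" closes a branch
    ring_open = {}      # ring-closure digit -> index of the atom that opened it
    prev = None         # index of the atom the next atom will bond to
    order = 1           # order of the next bond

    for ch in smiles:
        if ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch == "=":
            order = 2
        elif ch == "#":
            order = 3
        elif ch.isdigit():
            if ch in ring_open:
                # Second occurrence of the digit closes the ring.
                bonds.append((ring_open.pop(ch), prev, order))
                order = 1
            else:
                ring_open[ch] = prev
        elif ch.isalpha():
            atoms.append(ch)
            index = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, index, order))
            prev = index
            order = 1
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, bonds


# Ethanol and dimethyl ether share the formula C2H6O but differ in structure.
ethanol = parse_smiles("CCO")
dimethyl_ether = parse_smiles("COC")
acetic_acid = parse_smiles("CC(=O)O")
benzene = parse_smiles("c1ccccc1")
```

Ethanol and dimethyl ether come out with the same elements but a different atom order and connectivity, which is exactly the isomer information a plain formula string loses.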

Stephen
  • How do you handle isomers this way? – May 07 '15 at 03:44
  • That's a great question. It turns out that this has already been thought about. I have added information on the SMILES chemical formula representation into the answer. – Stephen May 07 '15 at 04:34
  • there are various systematic name systems that could also be used depending on what properties you actually want to model – jk. May 07 '15 at 14:21