Friday 11 July 2014

Text parsing within a simple app

By profession, I am a qualified lecturer in Chemistry. I've been doing it for years, and one thing that is key to the subject is understanding concentration and the mole. To most people, a mole is a cute little creature that digs up people's lawns and that's about it. To a chemist, it's one of the fundamental concepts, and it causes more problems than enough.

After teaching one particular group, I decided to sit down and write a chemical formula calculator. I know there are lots of these on the market, but there was a problem: some charge, some don't and some aren't very good. Key to mine would be a simple-to-use interface.

The user would enter the formula and then the number of moles required. The program would do what it needed to and report back its answers. Easy enough.

As with any piece of software, the majority of the effort is in the design and understanding of the problem.

The problem


Take the formula CuSO4.5H2O - this is a compound called copper (II) sulfate pentahydrate. If you did science at school, you'll possibly have grown crystals with it as it has a very attractive deep blue colour.

A chemist would look at this and work out how much it weighs like this:

  1. There is a dot, this means that there are two parts to the formula
  2. On the left, there is a Cu, an S and an O followed by a 4. As there are no numbers before the Cu and the S this means there is only one of these.
  3. On the right there is an H followed by a 2 and then an O with nothing after it.
  4. Before the H though, there is a 5. This means that there are 5 lots of whatever follows, so there are now 10 lots of H and 5 lots of O
  5. Overall then, there is 1 Cu, 1 S, 9 O and 10 H.
  6. Find the mass of each element, multiply it by the number of it and add the lot together

This sort of calculation can be performed in a matter of moments by any chemistry student. In fact, a non-chemist can do it as long as they know what each element weighs; it's just simple maths.
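
To make the arithmetic concrete, here is a minimal sketch of step 6 for CuSO4.5H2O. The class and member names are made up for illustration and are not part of the calculator itself; the atom counts come from step 5 and the masses are the values used in this post's atomic mass table.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class HandCalculation
{
    // Atom counts for CuSO4.5H2O, as tallied in step 5.
    public static readonly Dictionary<string, int> Counts = new Dictionary<string, int>
    {
        { "Cu", 1 }, { "S", 1 }, { "O", 9 }, { "H", 10 }
    };

    // Atomic masses in g/mol, taken from this post's table.
    public static readonly Dictionary<string, double> Masses = new Dictionary<string, double>
    {
        { "Cu", 63.546 }, { "S", 32.06 }, { "O", 15.9994 }, { "H", 1.0079 }
    };

    // Multiply each mass by its count and add the lot together.
    public static double MolarMass()
    {
        double total = 0.0;
        foreach (KeyValuePair<string, int> entry in Counts)
            total += Masses[entry.Key] * entry.Value;
        return total;   // roughly 249.68 g/mol
    }
}
```

That is 63.546 + 32.06 + (9 x 15.9994) + (10 x 1.0079), which comes to about 249.68 g per mole.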

A program though has to handle things differently...

For a start, elements always start with a capital letter, but don't have to just have one letter. Brackets also have to be understood, and what happens if there is a dot at the end?

With a bit of logic and quite a bit of paper, the solution wasn't that hard...

Knowing the problem, the solution can be found. First off, let's define some data for the elements.

I'm doing this by first creating a string array for the element symbols and then a double array for the masses. I also have a List of type element, which will be used to store the elements found in the formula (it makes life simpler when doing comparisons - I did not use the list for the initial storage for reasons which should become clear later).

public class element
{
  public string el;
  public double num;           
  public element(string e, double n)
  {
     this.el = e;
     this.num = n;
  }
}

List<element> elem = new List<element>();
       
string[] elements = new string[] {"H","He",
  "Li","Be","B","C","N","O","F","Ne",
  "Na","Mg","Al","Si","P","S","Cl","Ar",
  "K","Ca","Sc","Ti","V","Cr","Mn","Fe","Co","Ni","Cu",
  "Zn","Ga","Ge","As","Se","Br","Kr",
  "Rb","Sr","Y","Zr","Nb","Mo","Tc","Ru","Rh","Pd","Ag",
  "Cd","In","Sn","Sb","Te","I","Xe",
  "Cs","Ba","La",
  "Ce","Pr","Nd","Pm","Sm","Eu","Gd","Tb","Dy","Ho","Er",
  "Tm","Yb","Lu",
  "Hf","Ta","W","Re","Os","Ir","Pt","Au","Hg","Tl","Pb",
  "Bi","Po","At","Rn",
  "Fr","Ra","Ac",
  "Th","Pa","U","Np","Pu","Am","Cm","Bk","Cf","Es","Fm",
  "Md","No","Lr",
  "Rf","Db","Sg","Bh","Hs","Mt", "Ds", "Rg"};
       
double [] atmass = new double[111] {1.0079,4.0026,
  6.941,9.01218,10.8,12.011,14.0067,15.9994,18.9984,20.179,
  22.9898,24.305,26.9815,28.0855,30.9738,32.06,35.453,39.948,
  39.0983,40.08,44.9559,47.88,50.9415,51.996,54.9380,55.847,
  58.9332,58.69,63.546,65.38,69.72,72.59,74.9216,78.96,79.904,83.8,
  85.4679,87.62,88.9059,91.22,92.9064,95.94,98,101.07,102.9055,
  106.42,107.868,112.41,114.82,118.69,121.75,127.6,126.9045,131.29,
  132.9054,137.33,138.9055,
  140.12,140.9077,144.24,145,150.36,151.96,157.25,158.9254,162.5,
  164.9304,167.26,168.9342,173.04,174.967,
  178.49,180.9479,183.85,186.207,190.2,192.22,195.08,196.9665,
  200.59,204.383,207.2,208.9804,209,210,222,
  223,226.0254,227.0278,
  232.0381,231.0359,238.0289,237.0482,244,243,247,247,251,252,
  257,258,259,260,
  261,262,263,264,265,266,267,268 };

 string [] numbers = new string[10] {"0","1","2","3",
                                     "4","5","6","7",
                                     "8","9"};


That's the elements and atomic weights all in. I've also added in another array which just stores numbers. These are to be used later on as well.
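
The two arrays are parallel: the symbol at a given index in one lines up with the mass at the same index in the other. The atommass routine used later is not shown in this post, so the following is only a sketch of how such a lookup might work. The class name, the shortened arrays and the choice to return 0 for an unknown symbol are all assumptions.

```csharp
using System;

// Hypothetical lookup over two parallel arrays, for illustration only.
public static class MassLookup
{
    // Shortened stand-ins for the full elements and atmass arrays.
    static readonly string[] Symbols = { "H", "C", "N", "O", "S", "Cu" };
    static readonly double[] Masses = { 1.0079, 12.011, 14.0067, 15.9994, 32.06, 63.546 };

    // Returns the mass stored at the same index as the symbol,
    // or 0 when the symbol is not in the list (an assumed convention).
    public static double AtomMass(string symbol)
    {
        int index = Array.IndexOf(Symbols, symbol);
        return index >= 0 ? Masses[index] : 0.0;
    }
}
```

Because the match is on the whole string, "Cu" never gets confused with "C".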

Test 1 - the text box

The first test that needs to be carried out is not on any sort of formula, but on the text entry box itself. There are a few tests that can be done here. The two simplest are to check if there is something actually in there and then if what's in there is valid (in other words, has the user entered an invalid element). These are two easy tasks. Other checks are also performed (for example missing braces)

void calculate(object sender, EventArgs e)
{
   if (formula.Text.Length == 0)
   {
      MessageBox.Show("Error : You haven't entered a formula",
                      "Formula error",
                      MessageBoxButtons.OK,MessageBoxIcon.Error);
      return;
   }


The simplest of errors - if the calculate button has been pressed, nothing can happen if there is nothing to work with

   if ((formula.Text.Contains("(") && !formula.Text.Contains(")"))||
        (formula.Text.Contains(")") && !formula.Text.Contains("(")))
   {
      // A regular C# string literal cannot span lines, so the message is joined with +
      MessageBox.Show("Error : You have a missing brace within your " +
                      "formula. Please recheck",
                      "Bracket error", MessageBoxButtons.OK,
                      MessageBoxIcon.Error);
      return;
   }


A quick check to ensure that the braces match up. This only checks to make sure there is a ( and a ), but not the number of them. It would be quite trivial to add in a number of braces check.
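
As a sketch of what that extra check might look like (the class and method names are made up, and this is not wired into the calculator above), a running depth counter catches both a mismatched count and a closing brace that arrives before its opening one:

```csharp
// Hypothetical helper, for illustration only.
public static class BracketCheck
{
    // True when every '(' has a later matching ')'.
    public static bool IsBalanced(string formula)
    {
        int depth = 0;
        foreach (char ch in formula)
        {
            if (ch == '(')
            {
                depth++;
            }
            else if (ch == ')')
            {
                depth--;
                if (depth < 0)
                    return false;   // a ')' appeared before its '('
            }
        }
        return depth == 0;          // anything left open is also an error
    }
}
```

With this, Ca(OH)2 and Al2(SO4)3 pass, while Ca(OH2 and )OH( are both rejected.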

   if (formula.Text.Length == 1)
   {
      // Only single-character formulas are checked against the element list here
      bool test = true;
      foreach (string element in elements)
      {
         if (formula.Text.Contains(element))
         {
            test = false;
            break;   // a match has been found, so stop looking
         }
      }
      if (test == true)
      {
         MessageBox.Show("Error : Your formula contains an " +
                         "unknown element",
                         "Unknown element", MessageBoxButtons.OK,
                         MessageBoxIcon.Error);
         return;
      }
    }


Let's make sure that the element entered actually exists, shall we... Note that, as written, this check only runs when the formula is a single character long.

    if (formula.Text.Contains("."))
    {
       int pos = formula.Text.IndexOf(".");
       // IndexOf is zero-based, so a trailing period sits at Length - 1
       if (pos == formula.Text.Length - 1)
       {
          MessageBox.Show("Error : You have a period followed by " +
                          "nothing",
                          "Period error", MessageBoxButtons.OK,
                          MessageBoxIcon.Error);
          return;
       }
     }


Okay, we have a dot in the formula. Let's do a quick sanity check - is there anything after it?

     string dupeform = formula.Text;

     if (dupeform.Contains("("))
        dupeform = dupeform.Remove(dupeform.IndexOf("("), 1);
     if (dupeform.Contains(")"))
        dupeform = dupeform.Remove(dupeform.IndexOf(")"), 1);
     if (dupeform.Contains("."))
        dupeform = dupeform.Remove(dupeform.IndexOf("."), 1);
     for (int a = 0; a < dupeform.Length; ++a)
     {
        foreach (string n in numbers)
        {
          if (dupeform.Contains(n))
             dupeform = dupeform.Remove(dupeform.IndexOf(n), 1);
        }
     }


This part may strike you as being a bit odd. I've made a duplicate copy of the validated text and have then stripped out the numbers, braces and periods, which leaves just the element symbols - it's just a sanity check. It's now clear why I created an array containing the digits.
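
One way such a sanity check could be finished off is to walk what is left and confirm that every symbol is a known element. The sketch below is an assumption about how that might be done rather than the calculator's actual code: the names are invented, the element set is cut down to a handful, and it skips digits, braces and periods as it goes instead of removing them first.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical validator, for illustration only.
public static class SymbolCheck
{
    // Shortened stand-in for the full list of element symbols.
    static readonly HashSet<string> Known =
        new HashSet<string> { "H", "O", "S", "Cu", "Ca", "Na", "Cl" };

    // Returns the first unrecognised symbol, or null if every symbol is known.
    public static string FirstUnknown(string formula)
    {
        int i = 0;
        while (i < formula.Length)
        {
            char ch = formula[i];
            if ((ch >= '0' && ch <= '9') || ch == '(' || ch == ')' || ch == '.')
            {
                i++;            // not part of a symbol, so skip it
                continue;
            }

            // A symbol is one character followed by any lower-case letters.
            int start = i;
            i++;
            while (i < formula.Length && char.IsLower(formula[i]))
                i++;

            string symbol = formula.Substring(start, i - start);
            if (!Known.Contains(symbol))
                return symbol;
        }
        return null;
    }
}
```

For CuSO4.5H2O this returns null, whereas Xx2O gives back "Xx" so the error message can name the offending symbol.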
   
   dupeform = formula.Text + "#";
   search(dupeform, false);
   results();
}


The duplicate is then reset to the original text with a # added on the end as an end-of-formula marker. This is passed to the search routine and the results method is then called.

The Search routine

Now that the code has been through the formula, it's time to do the legwork and find out what the mass of the compound is.

The search is split into two parts - the first divides up the formula, the second identifies the element.

The search is also a recursive search - it makes more sense to parse the first part of a formula and then parse the rest, instead of doubling up the code in order to do the same thing. I'm not a fan of recursive functions as the logic is not always that clear with them, but in this case it's a good idea to use one.

void search(string formula, bool dot)
{
   int mult1 = 1, mult2 = 1, s = 0, p = 0, bs = 0, be = 0, bn = 0;
   bool hasdot = formula.Contains(".");
   bool hasbrace = formula.Contains("(");
   int point = hasdot ? formula.IndexOf(".") + 1 : 0;
       
   if (hasbrace == true)
   {
      int k = 0;
      for (k = 0; k < formula.Length; ++k)
      {
         if (formula[k] == '(')
            bs = k;
         if (formula[k] == ')')
            be = k;
      }
      k = formula.IndexOf(")") + 1;
      if (formula[k] >= '0' && formula[k] <= '9')
      {
          int c = 0;
          while (formula[k + c] >= '0' && formula[k + c] <= '9')
             c++;
          bn = Int32.Parse(formula.Substring(k, c));
      }
      else
          bn = 1;
    }


Is there a number outside of a bracket? If there is, what is it? The start and end points of the braces are also found and stored here
           
    if (hasdot == true && dot == true)
    {
       // Second pass: start just after the dot, even if no number follows it
       s = point;
       if (formula[point] >= '0' && formula[point] <= '9')
       {
          int c = 0;
          while(formula[point + c] >= '0' &&
                formula[point + c] <= '9')
             c++;
          mult1 = Int32.Parse(formula.Substring(point, c));
          s = point + c;   // skip every digit of the multiplier, not just the first
       }
     }
     else
     {
        if (formula[0] >= '0' && formula[0] <= '9')
        {
           int c = 0;
           while(formula[c] >= '0' && formula[c] <= '9')
              c++;               
           mult1 = Convert.ToInt32(formula.Substring(0, c));
           s = c;   // skip every digit of the multiplier, not just the first
         }
     }


Here the number at the start is being found. If nothing is there, mult1 remains at 1 - the masses found later are multiplied by it.

The hard part now starts. First, it's vital to remember that an element can have more than one character, so let's set up some simple variables to help.
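
The same "count the digits, then parse them" pattern turns up several times in the search routine. As a sketch only (the helper and its name are not part of the original code), it could be pulled out into one small method that also reports how many characters it used, so the caller knows where to carry on from:

```csharp
// Hypothetical helper, for illustration only.
public static class NumberReader
{
    // Reads the run of digits beginning at 'start'.
    // Returns its value, or 1 when there is no digit there;
    // 'length' receives the number of digits consumed.
    public static int ReadCount(string text, int start, out int length)
    {
        length = 0;
        while (start + length < text.Length &&
               text[start + length] >= '0' && text[start + length] <= '9')
            length++;

        return length == 0 ? 1 : int.Parse(text.Substring(start, length));
    }
}
```

Calling it on "12H2O" at position 0 gives 12 with a length of 2, so parsing would resume at the H.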

string twoelem = "   "; // three spaces
double newmass = 0;
           
if (hasdot == true && dot == false)
   p = formula.IndexOf(".");
else
   p = formula.Length + 1;


The if clause really just sets the length of the string to be searched. The searching is all done with one large loop...

int loop = s;
{
   while (loop < p)
   {
      if (loop + 1 > p || formula[loop] == '#')
         break;

      if (loop == bs)
      {
         while (loop < be)
         {
            if (formula[loop + 1] >= 'a')
            {
               twoelem = formula.Substring(loop, 2);
               loop += 2;
            }
            else
            {
               twoelem = formula.Substring(loop, 1);
               loop++;
            }

            if (formula[loop] >= '0' && formula[loop] <= '9')
            {
               int c = 0;
               while (formula[loop + c] >= '0' &&
                      formula[loop + c] <= '9')
                  c++;
               if (twoelem == "")
               {
                  if (formula[loop + c + 1] < 'a')
                     twoelem = formula.Substring(loop + c, 1);
                  else
                     twoelem = formula.Substring(loop + c, 2);
               }
               mult2 = Int32.Parse(formula.Substring(loop, c));
               loop += c;
            }
            newmass = atommass(twoelem) * mult2 * bn;
            elem.Add(new element(twoelem, newmass));
            newmass = 0;
            mult2 = 1;
         }
      }
     
      if (formula[loop + 1] >= 'a')
      {     
         twoelem = formula.Substring(loop, 2);
         loop += 2;
      }
      else
      {
         twoelem = formula.Substring(loop, 1);
         loop++;
      }

      if (formula[loop] >= '0' && formula[loop] <= '9')
      {
         int c = 0;
         while (formula[loop + c] >= '0' &&
                formula[loop + c] <= '9')
            c++;
           
         if (twoelem == "")
         {
            if (formula[loop + c + 1] < 'a')
               twoelem = formula.Substring(loop + c, 1);
            else
               twoelem = formula.Substring(loop + c, 2);
         }
         mult2 = Int32.Parse(formula.Substring(loop, c));
         loop += c;
      }
      newmass = atommass(twoelem) * mult2 * mult1;
      elem.Add(new element(twoelem, newmass));
      newmass = 0;
      mult2 = 1;
   }
}


And finally, let's do the recursion...

if (dot == false && hasdot == true)
   search(formula, true);
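
To show the recursive idea on its own, here is a much-reduced sketch. It is not the routine above: it handles no braces, uses a cut-down mass table, throws on anything it doesn't recognise, and every name in it is invented for the example. It parses whatever sits before the first dot, then calls itself on the rest.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified parser, for illustration only.
public static class RecursiveSketch
{
    // Shortened stand-in for the full table of atomic masses.
    static readonly Dictionary<string, double> Masses = new Dictionary<string, double>
    {
        { "H", 1.0079 }, { "O", 15.9994 }, { "S", 32.06 }, { "Cu", 63.546 }
    };

    // Mass of a formula with optional dots and no braces.
    public static double Mass(string formula)
    {
        int dot = formula.IndexOf('.');
        if (dot >= 0)   // parse the first part, then recurse on the rest
            return Mass(formula.Substring(0, dot)) + Mass(formula.Substring(dot + 1));

        int i = 0;
        int multiplier = ReadCount(formula, ref i);   // leading number, e.g. the 5 in 5H2O
        double total = 0.0;
        while (i < formula.Length)
        {
            // One character, plus any lower-case letters that follow it.
            int start = i;
            i++;
            while (i < formula.Length && char.IsLower(formula[i]))
                i++;
            string symbol = formula.Substring(start, i - start);

            int count = ReadCount(formula, ref i);    // trailing number, default 1
            double mass;
            if (!Masses.TryGetValue(symbol, out mass))
                throw new ArgumentException("Unknown element: " + symbol);
            total += mass * count;
        }
        return total * multiplier;
    }

    // Reads digits at position i, advancing i past them; returns 1 if there are none.
    static int ReadCount(string text, ref int i)
    {
        int start = i;
        while (i < text.Length && text[i] >= '0' && text[i] <= '9')
            i++;
        return i == start ? 1 : int.Parse(text.Substring(start, i - start));
    }
}
```

For CuSO4.5H2O the first call splits at the dot, giving about 159.60 for CuSO4 and about 90.08 for 5H2O, which adds up to roughly 249.68.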
