Programming ≈ Fun

Written by Krešimir Bojčić

Breaking the Rules

Yesterday I subscribed to Rubies in the Rough from James Edward Gray II and read his article “Doing it Wrong”.

In his article he questions (along with some other rules) the rule of never using regular expressions for XML parsing.

As it turned out, it was a fortunate move: six dollars and one day later I came across an .xml file that needed to be parsed.

I have a confession to make: I’ve always hated XML parsers. This particular .xml file did not even play to XML’s strengths; the data inside was all messed up:

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>Exchange rates list</title>
<link>http://*******</link><description>Exchange rates list 12/13/2011</description><item>
      <guid isPermaLink="false">code: 978</guid>
      <title>EMU (EUR)</title>
      <description>
          Unit: 1<br />
          Buying: 1.95583  <br />
          Medium: 1.95583  <br />
          Selling: 1.95583 <br />
      </description>

  </item>   <item>
      <guid isPermaLink="false">code: 36</guid>
      <title>Australia (AUD)</title>
      <description>
          Unit: 1<br />
          Buying: 1.488813<br />
      ...

In the article he made a good point and illuminated the edge cases where you are not really parsing, but rather just hunting for some data.

That was encouraging enough to end up with this:

# Chaining gsub! is unsafe: it returns nil when nothing was replaced
data = data.delete("\n\t")
data =~ /Exchange rates list (\d+\/\d+\/\d+)/
  ...
data.scan(/code:\s(\d+).*?\((\w+)\).*?Unit:\s(\d+).*?Buying:\s(\d+\.?\d+).*?Medium:\s(\d+\.?\d+).*?Selling:\s(\d+\.?\d+)/) do |item|
  ...
end

Does it feel good? It sure does. Is it any worse than using an XML parser? No, I don’t think so. The data is safe and sound, just look how happy it looks:

"12/13/2011"
["978", "EUR", "1", "1.95583", "1.95583", "1.95583"]
["36", "AUD", "1", "1.488813", "1.492544", "1.496275"]
["124", "CAD", "1", "1.437051", "1.440653", "1.444255"]
["191", "HRK", "100", "26.021213", "26.086429", "26.151645"]
["203", "CZK", "1", "0.076274", "0.076465", "0.076656"]
...
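For anyone who wants to try the approach end to end, here is a minimal self-contained sketch. The names SAMPLE, ITEM and hunt_rates are made up for illustration, and the sample is trimmed from the feed above: the Medium and Selling lines for AUD are cut off in the excerpt, so I assume they follow the same layout as the EUR item, with the values taken from the output listing.

```ruby
# Trimmed sample feed. The AUD "Medium"/"Selling" lines are an assumption:
# the excerpt above cuts them off, so the layout is copied from the EUR item.
SAMPLE = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <rss version="2.0">
  <channel>
  <title>Exchange rates list</title>
  <link>http://*******</link><description>Exchange rates list 12/13/2011</description><item>
        <guid isPermaLink="false">code: 978</guid>
        <title>EMU (EUR)</title>
        <description>
            Unit: 1<br />
            Buying: 1.95583  <br />
            Medium: 1.95583  <br />
            Selling: 1.95583 <br />
        </description>
    </item>   <item>
        <guid isPermaLink="false">code: 36</guid>
        <title>Australia (AUD)</title>
        <description>
            Unit: 1<br />
            Buying: 1.488813<br />
            Medium: 1.492544<br />
            Selling: 1.496275<br />
        </description>
    </item>
  </channel>
  </rss>
XML

# Same pattern as in the post: one capture group per field of an item.
ITEM = /code:\s(\d+).*?\((\w+)\).*?Unit:\s(\d+).*?Buying:\s(\d+\.?\d+).*?Medium:\s(\d+\.?\d+).*?Selling:\s(\d+\.?\d+)/

# Hypothetical helper: returns the list date and one array of strings per currency.
def hunt_rates(xml)
  # "." does not match newlines by default, so flatten the document first.
  flat = xml.delete("\n\t")
  date = flat[/Exchange rates list (\d+\/\d+\/\d+)/, 1]
  [date, flat.scan(ITEM)]
end

date, rates = hunt_rates(SAMPLE)
p date
rates.each { |row| p row }
```

Run against this sample, it prints the date followed by the EUR and AUD rows, the same as the first three lines of the listing above. Since scan is given a pattern with capture groups, each match comes back as an array of the captured strings, so no extra bookkeeping is needed.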

Conclusion

I didn’t really break any rules. I just saw the whole problem more clearly because of his article. I figured I would need the regex anyhow, since the document was structured so unfortunately.

Other than recommending his articles, I can say that being pragmatic can get you solutions that lie on the other side of the rules fence.

The thing is that sometimes the other side doesn’t have to be a bad side.
