Pound sterling char in DEiXTo regular expressions

We were recently building a couple of wrappers for Jürgen (from Germany) when we ran into a strange issue. The pound sterling character we had used in a regular expression (required for extracting prices in the UK currency) was not recognized as a valid UTF-8 character by either the GUI or the CLE version of DEiXTo.

The solution was to replace “£” with “\xA3” in the regular expression. With that change, both XML parsers (MSXML for GUI DEiXTo and Perl's XML parser for DEiXTo CLE) worked fine, and the extraction ran flawlessly with 100% recall.
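The effect of the escape can be illustrated outside DEiXTo. The short Python sketch below is not DEiXTo code: the pattern `\xA3(\d+)` and the names `ESCAPED`, `LITERAL` and `first_amount` are made up for this illustration. It shows that the escaped pattern consists of plain ASCII characters only, yet matches the same text as the pattern containing a literal “£”. Assuming the original problem came from how the stored pattern was encoded (something the behaviour above suggests but does not prove), keeping the pattern pure ASCII sidesteps it.

```python
import re

# Escaped form: backslash, x, A, 3. The pattern text is pure ASCII and
# the regex engine resolves \xA3 to U+00A3 (the pound sign).
ESCAPED = r"\xA3(\d+)"

# Literal form: the pattern text itself contains the non-ASCII character.
LITERAL = "£(\\d+)"


def first_amount(pattern, text):
    """Return the first captured amount in text, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = "Price: £25"
    print("escaped is ASCII:", ESCAPED.isascii())
    print("literal is ASCII:", LITERAL.isascii())
    print(first_amount(ESCAPED, sample), first_amount(LITERAL, sample))
```

Both patterns find the same amount; only the escaped one can be stored and passed around without depending on the encoding of the file that holds it.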

By the way, here are a couple of useful regular expressions:

  • to get dates in xx/xx/xxxx format with either one or two digits for day and month use: “(\d{1,2}\/\d{1,2}\/\d{4})” (without the double quotes)
  • to get £ price data from 0.01 to 999,999,999.99 in this format, use: “\xA3(\d*,?\d*,?\d*\.?\d*)” (without the double quotes)
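To try the two expressions outside DEiXTo, here is a minimal Python sketch. The function names and the sample sentences are hypothetical, and Python's `re` module stands in for the regex engines that DEiXTo actually uses, so treat it as a quick sanity check rather than a reproduction of a wrapper. The price pattern is written with the first comma optional, so that amounts below 1,000 (such as £0.01) can match as well.

```python
import re

# Dates such as 3/11/2012 or 12/01/2013 (one or two digits for day and month).
DATE_RE = re.compile(r"(\d{1,2}\/\d{1,2}\/\d{4})")

# Pound prices; \xA3 is the pound sign, and both thousands separators
# as well as the decimal part are optional.
PRICE_RE = re.compile(r"\xA3(\d*,?\d*,?\d*\.?\d*)")


def extract_dates(text):
    """Return every date-like string found in text."""
    return DATE_RE.findall(text)


def extract_prices(text):
    """Return every amount that follows a pound sign, without the sign."""
    return PRICE_RE.findall(text)


if __name__ == "__main__":
    print(extract_dates("Posted 3/11/2012, updated 12/01/2013"))
    print(extract_prices("From £0.01 up to £999,999,999.99, typically £1,250.50"))
```

Note that the price pattern is deliberately permissive: every part after the pound sign is optional, so a bare “£” yields an empty capture, and the digits are not checked for proper grouping in threes. That is usually acceptable when the surrounding page structure has already narrowed the text down to a price field.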

Many thanks to my students Kostas Papaioannou and Vasilis Pallas for helping me find my way.

About Fotis Kokkoras

I am a lecturer @ University of Thessaly, Greece