« my latest TODO’s | Home | financial econometrics and a handy firefox add-on »

fast XML parsing with Ruby and extra large EC2 instances, and other convenient tricks for handling data

By stephen | December 24, 2008

I wanted to add this last post before delving into holiday festivities and some reading for the rest of the night. I have been doing a large amount of parsing of XML data recently. In finance, data sets can be quite large. XML is a very convenient way to store data, but it can be a pain to parse. Previously, I had written a parser in Perl using SAX. It was usable, but quite slow on my machine. The files I were parsing were larger than the amount of RAM I have, so I could not use XML::DOM. This module would attempt to load the entire data structure to memory, and it would need to swap to the local hard disk due to the insufficient amounts of RAM. Therefore, I would have to use SAX. This is a pain as SAX does not keep a record of the structure of the data.

In writing my most recent parser, I decided that rather than write a SAX parser, I would simply utilize a machine with more RAM. This is where Amazon’s EC2 saves the day. They allow you to pay per hour for compute time, and you have a selection of different computers to use priced according to the performance of the computer. The available computer instance types are shown here. I used the extra large instance, which was spec’d as follows.

I wanted to add this last post before delving into holiday festivities and some reading for the rest of the night. I have been doing a large amount of parsing of XML data recently. In finance, data sets can be quite large. XML is a very convenient way to store data, but it can be a pain to parse. Previously, I had written a parser in Perl using SAX. It was usable, but quite slow on my machine. The files I were parsing were larger than the amount of RAM I have, so I could not use XML::DOM. This module would attempt to load the entire data structure to memory, and it would need to swap to the local hard disk due to the insufficient amounts of RAM. Therefore, I would have to use SAX. This is a pain as SAX does not keep a record of the structure of the data.

In writing my most recent parser, I decided that rather than write a SAX parser, I would simply utilize a machine with more RAM. This is where Amazon’s EC2 saves the day. They allow you to pay per hour for compute time, and you have a selection of different computers to use priced according to the performance of the computer. The available computer instance types are shown here. I used the extra large instance, which was spec’d as follows.

Extra Large Instance

15 GB memory 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
1,690 GB instance storage (4×420 GB plus 10 GB root partition)
64-bit platform
I/O Performance: High
Price: $0.80 per instance hour

With 15 GB of RAM at my disposal, I could load the entire XML file into memory and write a simple DOM parser in Ruby. My OS of choice is Ubuntu, so I used Alestic’s 64 bit base install Ubuntu 8.10 ami (ami-0e5db967). Using ElasticFox, I instantiated an instance of this AMI. Within a minute, I am given the hostname/IP address of an extra large instance with a fresh copy of Ubuntu 8.10. I am also using elastic block storage (EBS) which is essentially a static storage volume that persists accross instantiations, even when you do not have instances running. On my EBS volume is an initialization script and a copy of the parser.

Configuring my new instance:

i0

Note the time on my system clock between this one and last one. Requests for instances are fulfilled very quickly.

i1

I ssh to my new instance, and the first thing I do is attach the EBS.

ebs elastic

attach EBS

I then run my initMachine script, which takes about 60 seconds and sets up this machine with a webserver, mysql, and emacs.

Init Machine:

#!/bin/bash

sudo aptitude update
sudo apt-get install apache2 php5-mysql libapache2-mod-php5 mysql-server
sudo apt-get install emacs
sudo apt-get install phpmyadmin
sudo apt-get install libmysql-ruby libxml-ruby

I can then run an extremely simple parser in Ruby, which I kept stored along with all my data on the EBS:

puts “SJB: Simple XML parser”
#require ‘net/http’
#require ‘rexml/document’
require ‘xml’
require ‘mysql’
my = Mysql::new(”localhost”, “root”,”password”,”db3″)

def fixer(element)
    if element != nil && element != “” then
       element =~ /(\w+)/
       return $1
    else
    return “”
    end
end

def fieldClean(element)
    if element == “Key” then
       return “Key_self”
    end
    return element
end

def answerClean(element)
    if element == “True” then
       return “1″
    end
    if element == “False” then
       return “0″
    end
    return element
end

doc = XML::Document.file(’Data.xml’)
doc_root = doc.root
i0 = 0
doc_root.find(’//Datas/Data’).each do |node|
# For each credit profile, look at all of the nodes
  valueHash = {}
   childs =    node.children
   childs.each do |child|
               childName = child.to_s
               childName =~ /<(\w+)>/
               sVal = fixer($1)
              # str =  sVal + “: ” +  child.content
#               valueHash[sVal] = child.content
                valueHash[sVal] = answerClean(child.content)
               #puts str
   end

#   puts valueHash
    # now, iterate over the hash, generating

    myKeyArray = []
    myValArray = []

    valueHash.each_key do |key|
    if key == “”
   puts “HIT IT”
  next
 end
    myKeyArray << key
    myValArray << valueHash[key]
    end

    puts myKeyArray.length
    puts myValArray.length

     valueHash.delete(nil)
     myStr = “insert into Data (internalKey, “

     # PUT IN THE FIELD NAMES
     myKeyArray.each do |item|
     item = fieldClean(item)
     myStr += my.quote(item) + “,”
     end
     myStr = myStr.chop + “) values (’0′, “

     # PUT IN THE FIELD VALUES
     myValArray.each do |item|
     myStr += “‘” + my.quote(item) + “‘,”
     end
     myStr = myStr.chop + “)”

     puts myStr
      my.query(myStr)
      puts i0
      i0 = i0 + 1
end

This parser will take a file called Data.xml and insert each entry into a MySQL database. Data.xml may look like this:

<Datas>
<Data>
<Key>ABC123</Key>
<name>Orange</name>
<color>Orange</color>
<shape>Sphere</shape>
</Data>
<Data>
<Key>DEF456</Key>
<name>Kiwi</name>
<color>Brown</color>
<shape>Ellipsoid</shape>
</Data>
….
</Datas>

The database (in this case, a produce database) would then be defined with the following fields

  • fieldName, fieldType, etc.
  • internalKey, integer, primary key, auto_increment
  • Key_self, string (probably varchar(100)). Note that my script renames fields named Key to Key_self. In mysql, Key is a reserved word.
  • name, string
  • color, string
  • shape, string

The parser is very general and should be easily adaptable to other XML files. The following is the screenshot with my parser running. Note how the process is taking ~2GB of memory, and 0 swap space is being used. I have used this parser to parse ~3GB files in approx 3-4 minutes. As long as it is possible to load the entire data structure in to memory without swapping to the hard disk, this parser will be extremely fast.

TOP

When this is all done, all your data is conveniently stored in a MySQL database. From here, you can generate CSV files. Note that earlier we installed phpmyadmin. Just visit http://public_dns_name_of_ec2_instance/phpmyadmin and it provides a web interface to downloading your CSV files. Alternatively, you can use this little snippet of script.

 mysql -B -u root -ppassword db3 -e ’select * from myTable ‘ | \
perl -F”\t” -lane ‘print join “,”, map {s/”/”"/g; /^[\d.]+$/ ? $_ : qq(”$_”)} @F ‘

Save that into a file and redirect the output to the csvfile: ./myScript > myTable.csv.

One last little bit of code. If you are trying to bring large datasets into STATA, you have probably noticed that you cannot always load the entire data set at once. This little snippet of Ruby I coded up will split a file into a specified number of files, replicating the first line for each file. This is useful for CSV files where you want to split a 400,000 line file into 4 files, but preserve the column headers for each file. Save the text as splitReplicateHeader.ry and run this on the console as: ruby splitReplicateHeader myBigFile.csv 4, where 4 is the amount of files that you want to end up with. In this case you will end up with files myBigFile.csv1.part, myBigFile.csv2.part, etc. I made the naming this way so you can transfer the files to another machines with scp *.part username@destbox:/my/data/directory.

 puts “SJB: splitReplicateHeader”
originalFile = ARGV[0]
intNewFileAmt = ARGV[1]

count = `wc -l < #{originalFile}`
fail “wc failed: #{$?}” if $? != 0
count.chomp!

puts “Splitting ” + originalFile + ” into ” + intNewFileAmt + ” parts”
puts “Lines: ” + count

Integer c = count.to_i
Integer n = intNewFileAmt.to_i

Integer intLinesPerFile = c / n
puts “Lines per file: ” + intLinesPerFile.to_s

f = File.open(originalFile)
firstLine = f.readline
for i in 1..(n-1)
thisFile = originalFile + i.to_s + “.part”

writer = File.new(thisFile, “w”)
writer.puts firstLine
for j in 1..intLinesPerFile
line = f.readline
writer.puts line
end
end

thisFile = originalFile + n.to_s + “.part”
writer = File.new(thisFile, “w”)
writer.puts firstLine

while(newline = f.readline)
writer.puts newline

Once you are done working with your EC2 instance, perform the ’sync’ command twice, then unmount your EBS. Yes, it is necessary to perform sync twice (just like how you need to run the latex command twice when modifying citations).

sync sync umount

Then you are free to destroy the instance. The entire process is very fast. I have used this in the What’s in a Picture project to parse the XML data from Prosper.com into a MySQL database. Once data is in a relational database, it is very easy to export to CSV, work with particular subsets of the data (by date range, field > certain amount, etc.). From there, it would be easy to dump the database on to the EBS. Then, whenever you need to work with the data in database for, reinstantiate an EC2 instance and reinitialize the database from your EBS. I will use this technique in my future work with intraday data sets, which get large very quickly. Also, it is nice to have this data in the cloud for resource intensive computations or simulations on the data. You could just as easily instantiate 10 or 50 computers on EC2 and distribute your computations. I wrote a simple Monte Carlo simulator a while ago that I ran on EC2 using 5 instances, and it was pretty cool.

Okay, thats all for tonight. Happy Holidays!

Topics: General, Mathematical Economics, Economists' Toolbox |

One Response to “fast XML parsing with Ruby and extra large EC2 instances, and other convenient tricks for handling data”

  1. You may want to look at vtd-xml as the latest option that overcomes issues of DOM and SAX

    http://vtd-xml.sf.net

    Posted by: anon_anon on November 21st, 2009 at 3:32 pm

Comments