Ruby: XML Parsing With SAX
SAX (Simple API for XML) is an event-driven approach to parsing XML.
It reads the XML sequentially and generates events as it meets tags and text, so to use SAX you have to write the code that handles those events. This is quite different from the DOM model, where the whole document is parsed and loaded into a tree. The SAX approach is harder to work with than the DOM one, so why should we use it? It depends. If you want to extract a few pieces of information from a big file, a SAX implementation is probably the better choice, because it avoids the overhead of loading the whole DOM up front.
The Ruby XML Library
Ruby ships with an XML parser called REXML, which supports both DOM and SAX style parsing, but it is very slow, so using libxml is highly advisable. It is a binding to the popular libxml2 library from the GNOME project and is released as a gem (libxml-ruby).
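To make the difference between the two models concrete, here is a minimal sketch using REXML, chosen only because it ships with Ruby and needs no extra install. The sample document and the TagCounter class are invented for illustration; the DOM half builds the whole tree before querying it, while the streaming half is notified tag by tag.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# A tiny document invented for this example.
xml_sample = '<feed><title>First</title><title>Second</title></feed>'

# DOM style: the whole tree is built first, then queried.
doc = REXML::Document.new(xml_sample)
dom_titles = REXML::XPath.match(doc, '//title').map(&:text)

# Event style: the listener is notified as each tag is read.
# TagCounter is a hypothetical name, not part of REXML.
class TagCounter
  # StreamListener supplies empty defaults for every event.
  include REXML::StreamListener

  attr_reader :count

  def initialize
    @count = 0
  end

  # Called by REXML every time an opening tag is found.
  def tag_start(name, _attributes)
    @count += 1 if name == 'title'
  end
end

counter = TagCounter.new
REXML::Document.parse_stream(xml_sample, counter)

puts dom_titles.inspect
puts counter.count
```

The streaming version never holds the full tree in memory, which is the property that matters for big files.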
The Ruby Implementation
First we need a handler to deal with the SAX events.
class Handler
  # Accept any callback the parser sends and do nothing with it.
  def method_missing(method_name, *attributes, &block)
  end
end
Libxml generates several events and expects to find the corresponding methods in the object assigned as handler. With method_missing we simply avoid an exception for every event we do not handle.
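The effect of the catch-all can be seen without any parser. The sketch below is only an illustration: the event names are examples of calls a parser might make, and respond_to_missing? is an addition of mine (not in the original handler) that keeps respond_to? consistent with method_missing.

```ruby
# A handler that silently accepts any callback.
class Handler
  def method_missing(method_name, *attributes, &block)
    # Intentionally empty: unknown events are ignored and nil is returned.
  end

  # Addition for this sketch: report that every method is "supported".
  def respond_to_missing?(method_name, include_private = false)
    true
  end
end

handler = Handler.new

# Example event names; any other name would be swallowed the same way.
result = handler.on_start_document
handler.on_start_element('title', {})

puts result.inspect

# Without method_missing the same call raises NoMethodError.
begin
  Object.new.on_start_document
  plain_object_raised = false
rescue NoMethodError
  plain_object_raised = true
end

puts plain_object_raised
```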
A More Useful Example
Let's extract the most recent headlines from a blog.
Download the feed:
curl http://feeds.feedburner.com/LucaGuidi > luca.xml
Now we need our custom SAX parser:
require 'rubygems'
require 'xml/libxml'
require 'handler'

class SaxParser
  # Wrap the libxml SAX parser and attach our handler.
  def initialize(xml)
    @parser = XML::SaxParser.new
    @parser.string = xml
    @parser.callbacks = Handler.new
  end

  # Run the parser and return what the handler collected.
  def parse
    @parser.parse
    @parser.callbacks.elements
  end
end
We have just wrapped the SAX parser from libxml and registered our Handler class as the callback handler.
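Now we are going to improve the handler to recognize and save the post titles:
class Handler
  attr_accessor :elements

  def initialize
    @elements = []
  end

  # Start collecting when a title element opens.
  def on_start_element(element, attributes)
    @print = true if element == 'title'
  end

  # Store the text only while the flag is set.
  def on_characters(characters = '')
    @elements << characters if @print
  end

  # Stop collecting when an element closes.
  def on_end_element(element)
    @print = false
  end

  def method_missing(method_name, *attributes, &block); end
end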
When the handler is instantiated, we create an internal array to store our results. When we find a title element, we set the print flag to true. While the flag is true, we store the character data in elements, and we set the flag back to false in the callback for the end of an element.
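The flag logic described above can be checked without a parser by calling the callbacks by hand. This is a minimal sketch under stated assumptions: TitleHandler is a hypothetical stand-in for the handler, and the event list is invented to mimic what a SAX parser would emit for a small item element.

```ruby
# Hypothetical stand-in for the handler, driven by hand.
class TitleHandler
  attr_reader :elements

  def initialize
    @elements = []
    @print = false
  end

  # Raise the flag when a title element opens.
  def on_start_element(element, _attributes)
    @print = true if element == 'title'
  end

  # Keep the text only while the flag is raised.
  def on_characters(characters = '')
    @elements << characters if @print
  end

  # Lower the flag whenever an element closes.
  def on_end_element(_element)
    @print = false
  end
end

handler = TitleHandler.new

# Simulated events for:
# <item><title>Hello SAX</title><link>http://example.com</link></item>
events = [
  [:on_start_element, 'item', {}],
  [:on_start_element, 'title', {}],
  [:on_characters, 'Hello SAX'],
  [:on_end_element, 'title'],
  [:on_start_element, 'link', {}],
  [:on_characters, 'http://example.com'],
  [:on_end_element, 'link'],
  [:on_end_element, 'item']
]

# Dispatch each event to the matching callback, as a parser would.
events.each { |name, *args| handler.public_send(name, *args) }

puts handler.elements.inspect
```

Only the text seen between the opening and closing title events ends up in elements; the link text is skipped because the flag is already down.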
Usage
We create a trivial script:
#!/usr/bin/env ruby
require 'sax_parser'

# Read the whole file given as first argument and print the titles.
xml = File.read(ARGV[0])
puts SaxParser.new(xml).parse
From the shell:
./parse luca.xml
Conclusion
SAX is less elegant and harder to use than DOM, but it can be very useful in certain cases, such as extracting a few values from a large file.