SAX (Simple API for XML) is an event-driven model for parsing XML.

A SAX parser reads the XML sequentially and generates events as it goes, so to use SAX you have to write the code that handles those events. This is quite different from the DOM model, where the whole document is parsed and loaded into a tree. The SAX approach takes more work than the DOM one, so why should we use it? It depends. If you want to extract specific information from a big file, SAX is probably the better choice, because you avoid the overhead of loading the whole DOM up front.
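To make the idea of "events" concrete, here is a minimal sketch that records the callbacks fired for a tiny document. It uses the stream parser from REXML only because it ships with Ruby; the EventLogger class and the sample XML are made up for illustration, and the callback names (tag_start, tag_end, text) belong to REXML, not to libxml.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: records every event it receives, in order.
class EventLogger
  include REXML::StreamListener

  attr_reader :events

  def initialize
    @events = []
  end

  # Called when an opening tag is read.
  def tag_start(name, attrs)
    @events << [:start, name]
  end

  # Called when a closing tag is read.
  def tag_end(name)
    @events << [:end, name]
  end

  # Called for character data between tags.
  def text(text)
    @events << [:text, text] unless text.strip.empty?
  end
end

xml = '<feed><title>Hello</title></feed>'
logger = EventLogger.new
REXML::Document.parse_stream(xml, logger)

logger.events.each { |event| p event }
```

No tree is ever built here: the listener sees each piece once, in document order, and keeps only what it chooses to keep.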

The Ruby XML Library

The Ruby standard library ships with an XML parser (both DOM and SAX) called REXML, but it's terribly slow, so it's highly advisable to use libxml instead. It's a binding to the popular GNOME library and it is released as a gem.

The Ruby Implementation

First of all we need a handler to deal with the SAX events.
class Handler
  def method_missing(method_name, *attributes, &block)
  end
end

libxml generates several events and expects to find the corresponding methods on the object assigned as the handler. With method_missing, any callback we have not defined is silently ignored instead of raising an exception.
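The effect of the empty method_missing can be seen without libxml at all. In the sketch below, fire_events is a hypothetical stand-in for the parser that simply calls one method per event, and StrictHandler is an illustrative class with no catch-all; the two event names are used only as examples.

```ruby
# The catch-all handler from the article: every unknown call is swallowed.
class Handler
  def method_missing(method_name, *attributes, &block)
  end
end

# Illustrative counterpart with no catch-all and no callbacks defined.
class StrictHandler
end

# Hypothetical stand-in for the parser: one method call per event.
def fire_events(handler)
  [:on_start_document, :on_end_document].each { |event| handler.send(event) }
  :ok
end

# With method_missing in place, undefined callbacks do nothing.
result = fire_events(Handler.new)

# Without it, the first undefined callback raises NoMethodError.
strict_error = begin
  fire_events(StrictHandler.new)
  nil
rescue NoMethodError => e
  e
end

puts result.inspect
puts strict_error.class
```

The trade-off is that a misspelled callback name is swallowed as well, so a typo shows up as missing results rather than as an error.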

A More Useful Example

Let's extract the most recent headlines of a blog.

Download the feed:
curl http://feeds.feedburner.com/LucaGuidi > luca.xml

Now we need our custom SAX parser:
require 'rubygems'
require 'xml/libxml'
require 'handler'

class SaxParser
  def initialize(xml)
    @parser = XML::SaxParser.new
    @parser.string = xml
    @parser.callbacks = Handler.new
  end

  def parse
    @parser.parse
    @parser.callbacks.elements
  end
end

We have just wrapped the SAX parser from libxml and registered our Handler class as the callback handler.

Now we are going to improve the handler to recognize and save the post titles:
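class Handler
  attr_accessor :elements

  def initialize
    @elements = []
  end

  # Start collecting text when a title element opens.
  def on_start_element(element, attributes)
    @print = true if element == 'title'
  end

  # Store character data only while inside a title element.
  def on_characters(characters = '')
    @elements << characters if @print
  end

  # Stop collecting when the element closes.
  def on_end_element(element)
    @print = false
  end

  # Ignore every other event.
  def method_missing(method_name, *attributes, &block)
  end
end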

When the handler is instantiated, we create an internal array to store our results. When we find a title element, we set the print flag to true. While the flag is true, we store the character data in elements; the end-of-element handler then sets the flag back to false.
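The flag logic described above can be traced by hand, without a parser, by calling the callbacks in the order a SAX parser would emit them. This is only a sketch: TitleHandler is a hypothetical name chosen to avoid clashing with Handler, the end-of-element callback is assumed to be called on_end_element, and the sample item is invented.

```ruby
# Same state machine as the handler in the article, under a hypothetical name.
class TitleHandler
  attr_reader :elements

  def initialize
    @elements = []
    @print = false
  end

  # Raise the flag when a title element opens.
  def on_start_element(element, attributes)
    @print = true if element == 'title'
  end

  # Keep the text only while the flag is raised.
  def on_characters(characters = '')
    @elements << characters if @print
  end

  # Lower the flag when any element closes.
  def on_end_element(element)
    @print = false
  end
end

# Events for: <item><title>First post</title><link>http://example.com/1</link></item>
handler = TitleHandler.new
handler.on_start_element('item', {})
handler.on_start_element('title', {})
handler.on_characters('First post')
handler.on_end_element('title')
handler.on_start_element('link', {})
handler.on_characters('http://example.com/1')
handler.on_end_element('link')
handler.on_end_element('item')

p handler.elements  # → ["First post"]
```

The text of the link element is dropped because the flag was lowered when title closed, which is exactly the filtering we want.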

Usage

We create a trivial script:
#!/usr/bin/env ruby
require 'sax_parser'

# Read the whole file named on the command line, then print the titles.
xml = File.read(ARGV[0])
puts SaxParser.new(xml).parse

From the shell:
./parse luca.xml

Conclusion

SAX is less elegant and less easy to use than DOM, but it can be very useful in certain cases, such as extracting a few values from a big file.