Parsing a huge OpenStreetMap JOSM file using Nokogiri

Background

While attending SSJS Devcamp I was having fun with MongoDB and CouchDB. To run any useful tests we first had to gather some data. The plan was to import data into MySQL, then migrate it to the NoSQL databases and run some tests/comparisons. After a quick brainstorm we decided to build a database of places/POIs. The greatest free geo repository I know of is OpenStreetMap, and they provide all their data for download. The whole repository is really huge (almost 17GiB of compressed data), so we settled for just the Poland data from a nice mirror.

Data

The format specification is described on the OSM wiki. It is pretty straightforward XML and contains mostly roads, but also cities, buildings, shops and a ton of other things. This makes even the Poland data pretty big (~150MiB compressed, ~2.5GiB of uncompressed XML). For us only the POI entries were important, so we had to go through the whole file and cherry-pick the interesting ones.
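For reference, a node entry in the OSM XML looks roughly like this (the id and tag values here are invented for illustration):

```xml
<node id="123456" lat="52.2297" lon="21.0122">
  <tag k="name" v="Pałac Kultury i Nauki"/>
  <tag k="tourism" v="attraction"/>
</node>
```

Each `node` carries its coordinates as attributes, and all the interesting metadata lives in child `<tag k="..." v="..."/>` elements.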

Parser

Because of the size of the XML, using a DOM parser on a laptop was out of the question due to memory usage. The obvious solution would be a SAX parser, but I consider that approach totally backwards[^1]. I prefer pull parsing, so I searched for a nice pull parser, preferably in Ruby. As it turns out, the excellent Nokogiri has one: XML::Reader.

Code

The import was broken into two phases. In phase one the interesting data was extracted from the XML and saved as JSON (it could be any other format, of course). In phase two it was read back and inserted into MySQL. This allowed writing the phase-two code while phase one was still running, and it was generally more error resistant: errors in phase two didn't force the whole import to be rerun.
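Phase two then only has to consume the intermediate JSON. A minimal sketch, assuming a hypothetical `pois` table with name/lat/lon columns; note that the phase-one file is a JSON array whose last element is an empty `{}`, written so the trailing comma before `]` stays valid JSON:

```ruby
require 'json'

# Yield each POI entry from the phase-one output file,
# skipping the trailing {} sentinel object.
def each_poi(path)
  JSON.parse(File.read(path)).each do |entry|
    next if entry.empty?
    yield entry
  end
end

# With the mysql2 gem the actual insert could become a prepared statement:
#   stmt = client.prepare("INSERT INTO pois (name, lat, lon) VALUES (?, ?, ?)")
#   each_poi("poi.json") { |e| stmt.execute(e["name"], e["lat"], e["lon"]) }
```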

Phase one code from GitHub/parse-osm.rb:

#!/usr/bin/env ruby
# Small script for scraping POIs from JOSM (http://wiki.openstreetmap.org/wiki/JOSM_file_format).
require 'rubygems'
require 'bundler/setup'
require 'nokogiri'
require 'json'
require 'ap'

class Parser
  # Tags in data that we ignore.
  IGNORED_TAGS = ["created_by", "source"]
  def initialize
    # Map storing attribute name mapped on count of nodes containing it.
    # It is helpful to see what tags should be taken into account in the first place
    # during importing.
    @popular_attributes = {}
    # number of nodes parsed
    @count = 0
    # number of total xml nodes went through
    @total_count = 0
    # number of entries considered useful as POI
    @included = 0
  end
  
  def parse(input, output)
    out = File.new(output, "w")
    begin
      out.write "[\n"
      # Nokogiri reader created.
      reader = Nokogiri::XML::Reader(File.new(input))
      while reader = parse_node(reader, out)
      end
    ensure
      out.write "{}\n]\n"
      out.close
      ap @popular_attributes.sort_by { |_k, v| v }.reverse
      STDERR.puts ""
      puts "\n#{@included} / #{@count} / #{@total_count}\t"
    end
  end

  def parse_node(r, out)
    # Search for 'node' tags because they contain the data (points). Other tags
    # are discarded.
    (r = r.read; progress) while r && r.name != 'node'
    # Stop processing if end of file
    return false unless r
    # Create entry to be enriched with 'tag' data
    entry = { :lat => r.attribute("lat"), :lon => r.attribute("lon") }
    # Required fields to create usable POI.
    req = ["name"]
    while (progress; r = r.read)
      # Next node found, so no more tags.
      break if r.name == 'node'
      # Only 'tag' are interesting.
      next unless r.name == 'tag'
      # Each tag has form of <tag k="key" v="value" />
      key = r.attribute "k"
      unless IGNORED_TAGS.include? key
        req.delete key
        entry[key] = r.attribute "v"
        @popular_attributes[key] ||= 0
        @popular_attributes[key] += 1
      end
    end
    # If all required tags were found.
    if req.size == 0
      @included += 1 
      out.write(entry.to_json)
      out.write(",\n")
    end
    progress(true)
    return r
  end

  # Progress info
  def progress(entry_found = false)
    @total_count += 1
    @count += 1 if entry_found
    limit = 10000
    if @total_count % limit == 0
      STDERR.print "."
      STDERR.print "\r#{@included} / #{@count} / #{@total_count}\t" if @total_count % (limit * 50) == 0
      STDERR.flush
    end
  end
end

if ARGV.size < 2
  puts "Usage: #{$PROGRAM_NAME} osm_file output_json"
  exit 1
end
Parser.new.parse ARGV[0], ARGV[1]

Parsing the whole file took a little over 250 seconds:

ruby parse-osm.rb poland.osm poi.json  247.94s user 3.07s system 99% cpu 4:13.08 total

Just copying the file takes more than half of that:

cp -i poland.osm del.me  0.03s user 3.80s system 2% cpu 2:32.33 total

So the overhead is not too big. While running, the script used less than 4MiB of RAM, which is also very acceptable. The result is 101 793 POI candidates[^2] that will be imported into the DB in phase two. That will be described in more detail in a following post, so stay tuned!

[^1]: I mean literally backwards. You, as a client, have to maintain state and react appropriately. It is just wrong. I guess this API was easy to implement, but the result is strange to use, and code using it ends up pretty mangled.

[^2]: Extracted from 14 122 097 JOSM 'node's and 89 522 705 XML nodes; that gives processing of more than 362 440 XML tags per second.

