One of the web sites I look after has its own custom application log file format. I use a line that begins with the letters PERF to store how long each page request takes – what I’d like to do is extract these lines from the (fairly large) log files and then do some operations on them. But I’d like to do it as fast as possible.

François Maillet has some ideas on efficient log processing, and suggests gzipping your log files to avoid having to wait for reading from disk. (See “What Your Computer Does While You Wait” by Gustavo Duarte for a good explanation of how slow the various different storage methods are).

I concatenated some files together and ended up with a 9.3GB whopper. Gzipping it down took a few minutes, but we were left with 402MB. There are 131,428,410 lines in it. God almighty.

Grepping through the plain text file on my old 2008 white Macbook (2.2GHz, 4GB RAM, 5,400rpm 320GB hard disk):

time grep -P '^PERF' logsall.txt | gzip -f > logsall.txt.gz
real    5m37.844s
user    1m2.698s
sys 0m10.160s

I ran BigTop at the same time to see what was going on, and it bore out François’ point – this is the disk utilisation graph – my disk isn’t that fast, then (it’s also really full):

Plain text - disk graph

and this is the CPU graph, just barely flickering over 15%:

Plain Text - CPU graph

Using the gzipped version, things were rather quicker:

time zcat logsall.gz.Z | grep -P '^PERF' | gzip -f > perfzipall.gz
real    0m54.164s
user    1m29.021s
sys 0m5.612s

‘Scuse the terrible x-scale on these graphs, but here’s the disk utilisation while grepping through the gzipped file:

Gzip - Disk graph

and here’s the CPU graph, mostly bouncing between 80%-90%:

Gzip - CPU

Hoping to get another speed-up, I installed parallel through Homebrew to try and make use of the other core in my Macbook, before realising that the grep through the gzipped file was already using both cores – zcat and grep are separate processes in the pipeline, so each one can sit on its own core, which was confirmed by watching the CPU monitor in Activity Monitor. I also tried a ramdisk and got no real speed-up on my system, but it might be different for you.

Anyway, either way, hooray for grep.

Thought I was going mad over the weekend – one minute the Posterous API was dragging back comments from my blog, and the next minute not.

It turned out that it wasn’t my rubbish Python that was to blame but an undocumented change to the Posterous API. Before it was:

/api/2/users/me/sites/primary/posts/44379151/comments

and now it’s:

/api/2/users/me/sites/primary/posts/postID/responses

…note “responses” rather than “comments” on the end, as documented in the Posterous API.

You get back a bunch of JSON looking like this:

[{
    "name":"stevewoodward",
    "created_at":"2011/02/27 12:32:33 -0800",
    "body":"just a quick test while i fiddle with the posterous api.",
    "post_id":44379151,
    "id":7387696,
    "comment_type":"comment"
}]
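
For what it’s worth, here’s a rough PHP/cURL sketch of hitting the new endpoint. The authentication details (HTTP basic auth plus an api_token parameter) are my reading of the Posterous docs rather than anything tested here, and the credentials are obviously placeholders:

<?php
// Sketch only: fetch the responses for a post from Posterous API 2.
// Credentials, token and the auth scheme below are placeholders/assumptions.
$postId   = 44379151;
$apiToken = 'YOUR_API_TOKEN';
$url      = "http://posterous.com/api/2/users/me/sites/primary/posts/$postId/responses"
          . "?api_token=$apiToken";

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);                // give me the body as a string
curl_setopt($ch, CURLOPT_USERPWD, 'me@example.com:password');  // HTTP basic auth (assumed)

$json = curl_exec($ch);
curl_close($ch);

foreach (json_decode($json, true) as $response) {
    echo $response['name'] . ': ' . $response['body'] . "\n";
}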

Went to the Coventry and Warwickshire Social Media Cafe the other day, and heard a great presentation by Alison Hook of Coventry City Council on their use of Facebook, which won the Digital and Social Media gold award at the LGComms Reputation Awards.

Here are my notes, but if you want it straight from the source, Steve Rumsby also recorded Ally’s presentation using Qik – it runs to about 27 minutes.

Starting out

  • Deliberately called it Coventry – (“who would become a fan of Coventry City Council? About 5 people”)
  • Also allows them to put out press releases from partner organisations (like the Herbert Art Gallery)
  • Started November 2009 as a pilot experiment, let’s see what happens
  • Started out putting out the same press releases as on the website
  • Got 500 fans in 2 months

The snow…

  • Then came the snow, had a bit of a warning that it was coming
  • Had this automated system for putting the information on the website for headteachers to log in and notify people, which was great until lots and lots of people tried to access it at the same time
  • Went to a manual system because of the load, and put the school closures stuff in. Thought, let’s see what happens if we put it on Twitter and Facebook
  • Ended up pushing people to Facebook, saving the server load on the main website
  • People were commenting on the closures and helping each other too
  • As well as getting official information from headteachers, also from parents
  • Some kids persuaded the local media that their school was shut, Ally & team checked – it wasn’t, and put that message out through Twitter and Facebook
  • Local media were watching Twitter and Facebook, and referencing it
  • Went from 500 to 2000 users the day after the snow, and over 10,000 fans after the snow
  • Ally would warn other councils; this was all done manually. If you do it, you’ve got to follow it through
  • People did really appreciate it, lots of nice messages

The stats

Ally showed the Facebook statistics page, which gives an impressive breakdown of the sort of people that are following your page.

Age range:

  • 13-17 years  =  23%
  • 18-24 years  = 29%

…so it’s seen as a good way to get to young people. The fan count currently goes up by about 40 a week. It’s mostly people from Coventry, with Birmingham second. Oddly, there are 52 fans in Slough and 400-ish in Maidenhead.

Questions

How are you handling bad news?

  • Good example – the bins [there was much criticism of the City Council's rubbish collections during and directly after the snow]
  • If there was swearing = deletion (people don’t seem to equate swearing with abusive language…)
  • Only rebut if factually untrue
  • Often the public will join in, and there’s an element of self-moderation
  • Nice turning point when someone said “i’m fed up with all the negativity”, and it’s been a lot more positive since then

When was it set up?

  • Set up November 2009. Went from zero to 500 fans without any publicity

How were you going to get fans if you didn’t get the snow?

  • Word-of-mouth will get the message round

Where do you advertise that you’ve got a Facebook page?

  • Facebook page is linked to from the council homepage
  • Put it in the staff magazine
  • Facebook itself helps (it’s designed to help spread groups around) – “so and so has become a fan of this”
  • Facebook page is mentioned in email signatures
  • No paid-for advertising

Have numbers increased steadily since then [the snow]?

  • About 40-50 a week
  • If you put too much stuff out in a week, people will unsubscribe
  • Sometimes the team tweak the press releases for the Facebook feed
  • Writing for the web – include lots of headings
  • People won’t always read the whole piece before commenting

What guidelines do you have for posting?

  • If people are spamming, the post will be deleted
  • Blatant adverts are deleted
  • They also have a standard disclaimer on the homepage (“We reserve the right to remove, without notice, any disruptive, offensive or abusive posts. This includes posts that contain swearing or libellous statements.”)

What are your internal social media rules?

I struggled to summarise this one, but essentially it’s all bound up with their usual ICT code of conduct stuff.

Final advice

…oh and use pages rather than groups.

Warwickshire County Council Election Map

This is a map of Warwickshire County Council’s election results for the 4th of June 2009, built as part of a mini-project to show off the possibilities of our open data before our Hack Warwickshire competition. (You really should enter, you know – you might win an iPad.)

Anyway, for comparison, here’s the result for 2005 – it’s easy to see that Labour lost share over the four years. It’s a strong indication of how Labour generally do well in cities and the Conservatives do better in rural areas – compare with the BBC map of the 2010 UK General Election results.

I really enjoyed putting this together, despite the stress of not knowing anything about this stuff at the start, and it wasn’t that difficult in the end.

The outlines of the electoral divisions are available from the Ordnance Survey as part of the recently opened-up Boundary Line product, and come as an ESRI Shapefile, a popular file format for geographic vector information.

The full dataset is 46MB in size, and contains polygons for all of the county councils in England, so I’ll need to chuck a load of that data away. Here’s the full dataset mapped out in open-source GIS tool Quantum GIS – Warwickshire is the thing in yellow in the middle.

I used the command line utility ogr2ogr from the Geospatial Data Abstraction Library (GDAL), and converted the .SHP file from eastings/northings into latitude/longitude:

ogr2ogr -s_srs EPSG:27700 -t_srs EPSG:4326 destination.shp source.shp

…and then converted the resulting file into a .KML:

ogr2ogr -f "KML" destination.kml source.shp

Each area, called a placemark in the KML, came out of the sausage machine looking something like this – only there was 108MB of it.

<Placemark>
  <Style>
    <LineStyle>
      <color>ff0000ff</color>
    </LineStyle>  
    <PolyStyle>
      <fill>0</fill>
    </PolyStyle>
  </Style>
  <ExtendedData>
    <SchemaData schemaUrl="#electoral_division_latlng">
      <SimpleData name="FID">1558</SimpleData>
    </SchemaData>
  </ExtendedData>
  <Polygon>
    <outerBoundaryIs>
      <LinearRing>
       <coordinates>-1.667473197956685,52.164132540593961 (and lots more....)</coordinates>
      </LinearRing>
    </outerBoundaryIs>
  </Polygon>
</Placemark>

At this point I came to the all-too-obvious realisation that the polygons really, really were just made out of a shitload of co-ordinates. I knew this would be the case, but I’d kind of expected it to be something a bit cleverer. I’m not sure what else I could’ve been expecting, really.

So all good so far, but I had no idea which electoral division was which. Alongside the .SHP file was a .DBF file (so dBase, right?) – this can be opened up in Excel. It took me a while to work it out, but the FID number in the KML corresponded to the row number (column C in the table below) of the area details.

DBF file displayed in Excel

Once I’d chopped out all the non-WCC electoral divisions from the KML file, I was left with about 3MB of data. I took this and ran it through a very simple PHP routine (leaning heavily on the SimpleXML library) to load the data into a MySQL database using the geometry field type. The original data was accurate to at least 16 decimal places, which is lovely and all, but overkill for Google Maps, so I took it down to six decimal places, which I think is accurate to within about 10cm. Which should be enough.
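
The routine itself isn’t worth publishing, but the gist of it was something along these lines – a sketch only, with made-up table and column names, and the Document/Folder nesting will need adjusting to whatever ogr2ogr actually spat out:

<?php
// Sketch only: walk the KML placemarks, trim the coordinates to six decimal
// places and write each polygon away as WKT. Table and column names are
// hypothetical, as is the Document/Folder nesting.
$kml = simplexml_load_file('warwickshire.kml');

$pdo  = new PDO('mysql:host=localhost;dbname=elections', 'user', 'pass');
$stmt = $pdo->prepare(
    'INSERT INTO divisions (fid, outline) VALUES (?, GeomFromText(?))'
);

foreach ($kml->Document->Folder->Placemark as $placemark) {
    $fid    = (string) $placemark->ExtendedData->SchemaData->SimpleData;
    $coords = trim((string) $placemark->Polygon->outerBoundaryIs->LinearRing->coordinates);

    $points = array();
    foreach (preg_split('/\s+/', $coords) as $pair) {
        list($lng, $lat) = explode(',', $pair);
        // Six decimal places is roughly 10cm - plenty for a Google Map.
        $points[] = round($lng, 6) . ' ' . round($lat, 6);
    }

    // GeomFromText is the MySQL spatial constructor (ST_GeomFromText in newer versions).
    $stmt->execute(array($fid, 'POLYGON((' . implode(',', $points) . '))'));
}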

From there I imported the names of the areas, and with a bit of fiddling, matched those up against the names of the electoral divisions I’d scraped from the original Notes election systems.

I’d had a warning previously that web browsers were a bit rubbish at dealing with the large amount of data it can take to map areas, and I was a bit concerned when I found that the full map of Warwickshire electoral divisions as supplied by the Ordnance Survey contains over 80,000 points. It sounded like a lot.

Some rooting around on Stack Overflow brought up a mention of the Douglas-Peucker algorithm, which is a method for simplifying lines. Developer Anthony Cartmell has written an implementation of the algorithm in PHP, and has a demo of it in action too. I used Anthony’s class to simplify the electoral division polygons. Here’s a screenshot showing an example of the line generalisation using Anthony’s class on the Aston Cantlow electoral division – I’ve chosen the colours badly, but the pink line has 813 points and the yellow line has 74 points.

Comparison between original and simplified lines, using the Aston Cantlow electoral division as an example
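
Anthony’s class did the real work here, but the core of the Douglas-Peucker idea is small enough to sketch – recursively keep the point that lies furthest from the straight line between the two endpoints, as long as it’s further away than some tolerance. The bigger the tolerance, the more points get thrown away. (This is just an illustration of the algorithm, not Anthony’s implementation.)

<?php
// Minimal Douglas-Peucker sketch (for illustration - not Anthony Cartmell's class).
// $points is an array of array(x, y) pairs; $tolerance is in the same units.
function douglas_peucker(array $points, $tolerance)
{
    $count = count($points);
    if ($count < 3) {
        return $points;
    }

    // Find the point with the greatest perpendicular distance from the
    // straight line joining the first and last points.
    $maxDist  = 0.0;
    $maxIndex = 0;
    for ($i = 1; $i < $count - 1; $i++) {
        $d = perpendicular_distance($points[$i], $points[0], $points[$count - 1]);
        if ($d > $maxDist) {
            $maxDist  = $d;
            $maxIndex = $i;
        }
    }

    // Everything in between is close enough to the straight line: drop it.
    if ($maxDist <= $tolerance) {
        return array($points[0], $points[$count - 1]);
    }

    // Otherwise simplify each half and stitch the results back together.
    $left  = douglas_peucker(array_slice($points, 0, $maxIndex + 1), $tolerance);
    $right = douglas_peucker(array_slice($points, $maxIndex), $tolerance);
    return array_merge(array_slice($left, 0, -1), $right);
}

function perpendicular_distance(array $p, array $a, array $b)
{
    list($x, $y)   = $p;
    list($x1, $y1) = $a;
    list($x2, $y2) = $b;

    $dx = $x2 - $x1;
    $dy = $y2 - $y1;
    if ($dx == 0 && $dy == 0) {
        return sqrt(pow($x - $x1, 2) + pow($y - $y1, 2));
    }
    // Standard point-to-line distance formula.
    return abs($dy * $x - $dx * $y + $x2 * $y1 - $y2 * $x1) / sqrt($dx * $dx + $dy * $dy);
}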

I also wrote a routine to export the whole dataset as a simplified KML file – this is at setting 3000, which results in a 176 KB KML file and 6647 points (just over 8% of the size of the full version):

Warwickshire Electoral Division boundary line, generalised

Here’s a link to the KML in Google Maps, with 6647 points – notice on the left that the areas aren’t labelled yet.

After all that, I did an export of the dataset at 99% of the points in the full version, resulting in a 1.8 MB KML (view on Google Map) – which still works quickly in Mac OS X Safari, and was actually OK in IE7 too. I was interested that IE could cope with this many points – it made me wonder if something was being done Google server side to smooth out the points at a particular zoom level.

I also wrote a variation on this to output back to the database rather than out as a KML file – this meant I could quickly experiment with different generalisations of the data to check performance.

Once I’d finally finished fiddling and settled on a generalisation I was happy with (around 8000 points for the entire set), I built up a single GPolygon using the Google Maps API on the results page map, and coloured it in with the winning party colour:

Alcester Electoral Division

For the main map it was just a matter of building up all the GPolygons for all the areas, and listeners to add pop-up bubbles when clicking on an area.

Showing the election pop-up bubble

…and we’re done. Performance presumably comes down to Javascript speed – it’s definitely fastest in the WebKit-based browsers, as you’d expect, with Firefox 3 being OK and IE7/8 being very slow. Chrome was actually usable with the full 80,000 points.

Surprisingly, Opera 10 throws a psychedelic fit, refusing to remove previous shapes as you zoom in, which is pretty but a bit rubbish.

All I have for mobile testing is my much loved 1st gen iPod Touch, which crawls but does work, including pinch zoom, interestingly. I’d like to see it on an iPhone 3GS.

Just to compare the experience, here’s the map running within Chrome 4 on a Windows 7 x64 VM on my Macbook.

And here’s the same map in Internet Explorer 8.

I had high hopes for the IE9 preview, which is said to boast much better Javascript performance and SVG, but in my testing it didn’t seem much quicker than IE8.

There are alternatives to building the polygons client-side, but with the experiments I’m doing right now, I’m trying to keep things as simple as possible, running outside of the corporate network using my cheap shared host and free web services to get a quick leg-up and show what can be done without spending a fortune. The shapes could be built server-side, which, given a big enough server – however big that is – would be more usable across a wider range of browsers. But this would need infrastructure putting in place, and I’m too cheap to get myself a VPS or dedicated server at the moment. Also, building the polygons client-side gives you a level of interactivity (…alright, they’re clickable) which has further possibilities.

So for now, this method works for simple shapes, but for presenting more complicated outlines to a general audience, I’m hoping the future will catch up with us and Microsoft release a blindingly quick version of IE9 which somehow automatically replaces all previous versions in a flash.

Yeah, that’s gonna happen.

But if not, I’ll get round to looking at GeoServer. I noted that KML files displayed quickly when viewed directly in Google Maps in IE7 – this could also be something to explore.

(I should say that this page uses Ordnance Survey data © Crown copyright and database right 2010.)

Update: Since this was written, Heroku have added the ability to push and pull specific tables, through version 0.3 of their Taps gem. More info in their post here: Supporting Large Data

I’m still enjoying using Heroku. For deploying Rails apps it seems almost too easy; it’ll be hard to accept anything that doesn’t work as smoothly.

Having praised them to death everywhere, I started swearing when I tried to upload table data for just a subset of the database – the command heroku db:push will overwrite the entire database.

There are references around to using the yaml_db plugin as a way to extract and reload the data from your database.

This forum post in the Heroku Google Group pretty much covers it – but here’s my written-up version, for the next time I need it. This is all using bash on OS X – there will be differences if you’re on Windows.

1. Install yaml_db as a plugin (a gem would require some hacking to make the rake tasks available to your app):

script/plugin install git://github.com/ludicast/yaml_db.git

2. Dump the database out as a YAML file

rake db:data:dump

3. Edit the yaml file in db/data.yml using a text editor, leaving only the tables you want to upload. Add/commit this to your local Git repo.

4. You could skip this bit if you were feeling brave, but I’d recommend you do a test with the data locally before attempting it on live. Back up your local database first, and delete the table data for the tables you want to upload (not the tables themselves), and run

rake db:data:load

5. Check that your local tables contain the data you expect.

6. Push your changes to live – including your db/data.yml file.

git push heroku master

7. Run your migration for setting up the new tables (if any):

heroku rake db:migrate

8. Now load in the data…

heroku rake db:data:load

At this point, your data should now have appeared in your live database. If you know of a better way than this, let me know in the comments.

After having read lots about Chris Taggart’s Open Election Data project, I thought it’d be fun – yes, really – to try and have a go at the Warwickshire County Council election data, which resides within a series of Lotus Notes databases on the WCC site. Here’s what I ended up with, Warwickshire Election Data.

Due to a lack of Notes developer resources, I wasn’t going to have the data exported out of the database itself – doing so would require setting up new views – so I went for the even more direct route and scraped the data for the council elections of 2001, 2005 and 2009 into a MySQL database. Actually, who am I kidding, I like scraping data, despite the mental torture. It has the feeling of something that you shouldn’t be able to do, and I like the idea of being able to turn flat data into something more linked up.

2005 and 2009 weren’t so bad, because all the information was in a view, and using the ?readviewentries switch in the Notes view URLs I had some sort of XML representation which I could process. Here’s an example of the view, from 2005:

WCC Election result view from Lotus Notes database, 2005

The name, party and votes received for a particular candidate were all in one XML element, but in a predictable format, so it wasn’t such a problem to parse that out. In the end it was just over an afternoon’s work to grab all of 2005 and 2009.
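
For the record, the parsing side looked roughly like this – a Domino ?readviewentries feed wraps each row in a viewentry with entrydata/text children, but the view URL, column positions and the exact text format below are made up for the sake of the example:

<?php
// Sketch only: pull the candidate rows out of a Domino ?readviewentries feed.
// The view URL, column positions and the text format are hypothetical.
$url = 'http://example.warwickshire.gov.uk/elections.nsf/resultsview?readviewentries';
$xml = simplexml_load_file($url);

foreach ($xml->viewentry as $entry) {
    // Each column of the view comes back as an <entrydata><text> element.
    $columns = array();
    foreach ($entry->entrydata as $data) {
        $columns[] = trim((string) $data->text);
    }

    // e.g. "J Smith (Conservative) 1234" - split name, party and votes out
    // with a regex (the real format was predictable, if not exactly this).
    if (preg_match('/^(.+?)\s+\((.+?)\)\s+(\d+)$/', $columns[1], $m)) {
        list(, $name, $party, $votes) = $m;
        echo "$name | $party | $votes\n";
    }
}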

2001 was trickier, because the information I needed was in a plain, old-school HTML page, as a table with no classes or IDs to identify the information within. It ended up taking me a couple of days (yes, that’s how slow I am) to write something that would reliably scan through the list of ward results, get the HTML for each ward result page, find the particular (unmarked) table within that markup, and scan through it to create the people, results, divisions and candidates. Here’s an example of the result for Arley:

Arley electoral division result - WCC 2001
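
The Simple HTML DOM Parser did the heavy lifting for this. The gist of the table walk was something like this – the URL, the table index and the cell order are all invented, because the real pages had nothing to identify the table by, which was the whole problem:

<?php
// Sketch of the 2001 scrape: the table has no id or class, so it's found
// by position. The URL, table index and cell order here are hypothetical.
include 'simple_html_dom.php';

$html = file_get_html('http://example.warwickshire.gov.uk/elections2001/arley.html');

// Say the results sat in the fourth table on the page.
$table = $html->find('table', 3);

foreach ($table->find('tr') as $i => $row) {
    if ($i === 0) {
        continue; // skip the header row
    }
    $cells = $row->find('td');
    if (count($cells) < 3) {
        continue; // ignore spacer/footer rows
    }
    $candidate = trim($cells[0]->plaintext);
    $party     = trim($cells[1]->plaintext);
    $votes     = (int) str_replace(',', '', $cells[2]->plaintext);

    echo "$candidate | $party | $votes\n";
}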

I was hoping to use the power (makes face) of relational data, so before adding a person to the database, I checked to see if they were already included – if they were, I returned the reference for that person and used that. This way, we can see where people might have stood for election – here’s a good example, Janet Alison Alty, who stood for election three times for the Green Party in different wards across the elections.
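
The find-or-create step was about as simple as it sounds – a sketch, with made-up table and column names:

<?php
// Sketch: return the id of an existing person, or insert and return the new id.
// Table and column names are illustrative only.
function find_or_create_person(PDO $pdo, $name)
{
    $select = $pdo->prepare('SELECT id FROM people WHERE name = ?');
    $select->execute(array($name));
    $id = $select->fetchColumn();

    if ($id !== false) {
        return (int) $id;   // already seen - reuse the existing reference
    }

    $insert = $pdo->prepare('INSERT INTO people (name) VALUES (?)');
    $insert->execute(array($name));
    return (int) $pdo->lastInsertId();
}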

The graphs are straight out of the Google documentation for their interactive charts – here’s an example of a 3-D pie chart from the Bedworth North 2005 election result page:

Example of a pie chart, this one is from the Bedworth North 2005 election result page

I really wanted to use a jQuery-based charting library (for example Flot, jqPlot) but on a quick glance they weren’t quite sharp enough looking for me, pretentious aesthete that I am. I need to come back to this in the future. I could quite happily graph the arse off this data, it’s just a question of spending the time working out the SQL queries to bring back something useful.

Something I’ve never tried before but heard mentioned in dispatches is scraping Google – I used a quick scrape of the first result for a search on the Warwickshire Web to get a list of pages describing the electoral divisions, which include the number of registered electors, wards and parishes, last election date and other stuff. As far as I can tell the result pages on the Warwickshire Web aren’t linked to these pages, so this is an extra relationship in this site. The link appears as “WCC page for (division name)” on each division page – here’s an example for the Admirals division.

I don’t know how impressed Google would ever be with this tactic, but as the volume was relatively low (less than a hundred queries) I didn’t think they would be too upset. When I was trying to get the query right, I messed it up slightly and ended up with a result from a well-known far-right forum discussing a particular election – reading the posts was a strange view into another frightening world.

This isn’t anywhere near done (…to death) – there’s lots more that could still be done with the data, and even more now that Ordnance Survey have released their boundary information, which covers electoral wards and divisions.

The whole thing took just over a week, in between chomping on biscuits and wiping a baby’s bum, using the usual PHP/MySQL, CodeIgniter, and the Simple DOM Parser library for scraping purposes, which I used before for my attempt to scrape/map the WCC news stories.

And the last thing – almost forgot – the election area is marked up as linked data as per the Open Election Data project, which feels good. I’m not sure if I should submit it, to be honest, although it’s tempting. I’ve checked it out using the W3C Semantic Web parser, and as far as my bleary eyes can tell, it looks OK. We’re missing some data that’s present on the example page from Lichfield DC – electorate, ballot papers issued, and number of spoiled ballots – but that might not be a problem.

Hopefully in the next week or two we’ll be making the raw election data available as a CSV file or something, from the Warwickshire Open Data site which will be launching soon, once we’ve added a few more datasets – there’s some interesting data from schools to come in the first batch. There’s an inevitable blog about all that at warwickshireopendata.wordpress.com.

It was a struggle, but we got there in the end. The background to this was that I tried to upgrade my Rails install from 2.3.3 to 2.3.5 using the Rails wiki Getting Started guide, and then started running into problems. Ooh look, here it is:

Library/Ruby/Site/1.8/rubygems.rb:777:in `report_activate_error': RubyGem version error: rack(1.0.0 not ~> 1.0.1) (Gem::LoadError)

…which meant I had the wrong version of Rack installed. To be fair, that’s bleedin’ obvious from the error message. How often do you get an error message that clear? For Rails 2.3.5, Rack 1.0.1 is the right one, so uninstall Rack, all versions:

[e@lemons ~/ruby/project]$ sudo gem uninstall rack

Select gem to uninstall:
 1. rack-1.1.0
 2. rack-1.0.0
 3. All versions
> 3
Remove executables:
    rackup

in addition to the gem? [Yn]  Y
Removing rackup
Successfully uninstalled rack-1.1.0

…and then installed Rack v1.0.1:

[e@lemons ~/ruby/project]$ sudo gem install rack -v 1.0.1
Successfully installed rack-1.0.1
1 gem installed
Installing ri documentation for rack-1.0.1...
Installing RDoc documentation for rack-1.0.1...

which all helped. But then I tried to do rake db:create using MySQL and got no joy. So I ended up…

  1. Installing XCode from the Snow Leopard DVD. Yes, even if you installed it under Leopard.
  2. Downloading the latest 64-bit MySQL DMG from the MySQL download area – yes, that’s 64-bit
  3. In Terminal… sudo gem update --system
  4. And then… sudo env ARCHFLAGS="-arch x86_64" gem install mysql -- --with-mysql-config=/usr/local/mysql/bin/mysql_config

And with this I could create my MySQL database using rake, and start it up as usual. All done by 11.30pm on a school night, ow.

I kept getting a segmentation fault from abstract_adapter.rb when trying to connect to my controller – some people have reported success rolling back to MySQL 5.0, but if you want to use 5.1, try using the libmysql.dll from InstantRails:

http://instantrails.rubyforge.org/svn/trunk/InstantRails-win/InstantRails/mysql/bin/libmySQL.dll

(via peterskim.org)

I’ve been fiddling with this little tagging experiment, which I’m pretentiously calling the Warwickshire News Mine, for a couple of weeks now. Essentially the plan was to scrape a bunch of news stories from the Warwickshire County Council website, and see if they could be tagged up automatically.

Screenshot of Warwickshire News Mine

Initially it was just meant to be an excuse to fiddle with the Google Maps API, but I started having a play with the online automatic tagging service OpenCalais, which ended up being the most satisfying thing about it. I’ve left all the tags and types produced by OpenCalais in so you can compare the tags against the content.

OpenCalais is actually pretty good, despite my previous churlish Twitter whinging. It seems a bit petty to pick holes given that it’s a free service provided kindly by Thomson Reuters, but well, I’m going to anyway…

Most of my problems with it are to do with the categorisation of the tags – for instance it seems to be pretty good at pulling out names, with some exceptions – for example, Lea Marston and Leek Wootton are places rather than people, and Warwickshire is tagged as a City rather than a Province or State, although it correctly works out that North Warwickshire fits into the latter category.

It could do better with working out synonyms – for instance anti-virus software and antivirus software are the same thing, and I remember seeing a couple of places where the plural and the singular are included as tags.

For some reason I was impressed that it knows that Come Dine With Me is a TV show, and that the communications team write so much about programming languages. The latter is one case where you would possibly post-process the tags found, in this case by chucking them away.
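
If I did post-process the tags, it probably wouldn’t be any cleverer than this – fold the obvious variants together and bin anything whose type is on a blocklist (the entries here are just examples):

<?php
// Sketch of a naive tag clean-up pass: fold obvious variants together and
// throw away tag types we don't care about (e.g. ProgrammingLanguage).
function clean_tags(array $tags, array $blockedTypes = array('ProgrammingLanguage'))
{
    $clean = array();
    foreach ($tags as $tag) {   // each $tag = array('type' => ..., 'name' => ...)
        if (in_array($tag['type'], $blockedTypes)) {
            continue;           // chuck it away
        }
        // "anti-virus software" and "antivirus software" become the same key,
        // and a single trailing "s" is stripped as a very rough plural merge.
        $key = strtolower(str_replace('-', '', $tag['name']));
        $key = preg_replace('/s$/', '', $key);
        $clean[$key] = $tag['name'];
    }
    return array_values($clean);
}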

OpenCalais doesn’t seem so hot on working out a more general keyword behind a story – the tagging on the story Civil partnerships and marriage increase didn’t pull out the words marriage or wedding as tags.

(Update: See comment from Tom Tague of OpenCalais for clarification on the way that OC works).

Screenshot of a story from the News Mine

For me the best thing to come out of the tagging was the “possibly related stories” sidebar in the news story page, which I added late on. When you open up a news story, it searches the database for the top 5 stories with the most tags matching that story, and mostly this works pretty well – possibly because of the robotic tagging consistency of OpenCalais.
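
The query behind that sidebar is nothing clever – roughly the following, with invented table names: count the shared tags for every other story and take the top five.

<?php
// Sketch of the "possibly related stories" lookup: stories sharing the most
// tags with the current one, best five first. Schema names are hypothetical.
function related_stories(PDO $pdo, $storyId, $limit = 5)
{
    $sql = '
        SELECT s.id, s.title, COUNT(*) AS shared_tags
        FROM story_tags st_current
        JOIN story_tags st_other ON st_other.tag_id = st_current.tag_id
                                AND st_other.story_id != st_current.story_id
        JOIN stories s          ON s.id = st_other.story_id
        WHERE st_current.story_id = ?
        GROUP BY s.id, s.title
        ORDER BY shared_tags DESC
        LIMIT ' . (int) $limit;

    $stmt = $pdo->prepare($sql);
    $stmt->execute(array($storyId));
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}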

On the technical front, the site is based on the usual PHP/MySQL combination, and I used the open-source CodeIgniter framework, with the Simple DOM Parser for scraping the news stories, and Dan Grossman’s Open Calais Tags library to send the main body text off to OpenCalais for processing. The elapsed time to process each piece of content was generally about 3-4 seconds, sometimes slightly shorter, sometimes longer (up to around 10 seconds). I had to run the routine several times to get results for all four thousand indexed stories – there was a memory leak somewhere along the way.

As for the thing that I initially started out to do, that’s pretty dull really – I used the Google Maps API to geocode a list of towns and villages in Warwickshire, as well as a few other places, and then ran the main text of the stories through a simple regular expression search to tag them up with places.
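
The place tagging really was that crude – the equivalent of this, with a hand-built list of places (heavily truncated here); geocoding the matched places against the Google Maps API happened separately:

<?php
// Sketch: tag a story with any Warwickshire place names found in its text.
// The place list is obviously truncated; word boundaries stop place names
// matching inside longer words.
$places = array('Warwick', 'Leamington Spa', 'Nuneaton', 'Rugby', 'Leek Wootton');

function places_in_text($text, array $places)
{
    $found = array();
    foreach ($places as $place) {
        if (preg_match('/\b' . preg_quote($place, '/') . '\b/i', $text)) {
            $found[] = $place;
        }
    }
    return $found;
}

// e.g. places_in_text($storyBody, $places) => array('Nuneaton', 'Rugby')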

There’s lots of improvements that could be made, but in the end it’s just a throwaway experiment. I’d like to improve the places tagging routine, which could be as simple as adding a few more places. The main thing would be to look into some way of fitting the tags around a pre-defined ontology. There’s no current method to suggest a list of categories and tags for OpenCalais to process content with, so it would have to be done after the results had been received.

I’ve been working on a little project to scrape and automatically tag a load of content, and I found I needed to convert some website addresses in the text content into links, since they weren’t wrapped in anchor tags.

I tried this routine, which is replicated in a few places, but I found that if a link ends with a full stop, it includes the full stop as part of the linked URL and breaks.

So stone me, it turns out that CodeIgniter – the PHP framework I’ve been using all along – has a function in its URL helper called auto_link for this very purpose. And it handles any full stops at the end of the URL.
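
For reference, using it from a CodeIgniter controller or view is just a case of loading the URL helper first:

<?php
// CodeIgniter: auto_link() lives in the URL helper.
$this->load->helper('url');

$text = 'More details at http://www.warwickshire.gov.uk. Comments welcome.';

// The second argument limits it to URLs (rather than also linking email
// addresses); the trailing full stop is left outside the anchor tag.
echo auto_link($text, 'url');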

I must think of an appropriate way to celebrate – possibly I’ll fill the dishwasher.