<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Made of String &#187; mapping</title>
	<atom:link href="http://madeofstring.co.uk/tag/mapping/feed/" rel="self" type="application/rss+xml" />
	<link>http://madeofstring.co.uk</link>
	<description>Still not a very good programmer despite all that tea</description>
	<lastBuildDate>Sun, 29 Jan 2012 21:29:09 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Automatically tagging news stories using OpenCalais</title>
		<link>http://madeofstring.co.uk/article/automatically-tagging-news-opencalais/</link>
		<comments>http://madeofstring.co.uk/article/automatically-tagging-news-opencalais/#comments</comments>
		<pubDate>Mon, 22 Feb 2010 22:07:38 +0000</pubDate>
		<dc:creator>Steve</dc:creator>
				<category><![CDATA[Article]]></category>
		<category><![CDATA[api]]></category>
		<category><![CDATA[googlemaps]]></category>
		<category><![CDATA[mapping]]></category>
		<category><![CDATA[opencalais]]></category>
		<category><![CDATA[tagging]]></category>

		<guid isPermaLink="false">http://madeofstring.co.uk/?p=43</guid>
		<description><![CDATA[I’ve been fiddling with this little tagging experiment, which I’m pretentiously calling the Warwickshire News Mine, for a couple of weeks now. Essentially the plan was to scrape a bunch of news stories from the Warwickshire County Council website, and see if they could be tagged up automatically.

Initially it was just meant to be an [...]]]></description>
			<content:encoded><![CDATA[<p>I’ve been fiddling with this little tagging experiment, which I’m pretentiously calling the <a href="http://madeofstring.co.uk/newsmap">Warwickshire News Mine</a>, for a couple of weeks now. Essentially the plan was to scrape a bunch of news stories from the Warwickshire County Council website, and see if they could be tagged up automatically.</p>
<p><a href="http://madeofstring.co.uk/newsmap"><img src="http://madeofstring.co.uk/wp-content/uploads/2010/02/newsMineScreenshot-499x332.png" alt="Screenshot of Warwickshire News Mine" title="newsMineScreenshot" width="499" height="332" class="alignnone size-medium wp-image-44" /></a></p>
<p>Initially it was just meant to be an excuse to fiddle with the Google Maps API, but I started having a play with the online automatic tagging service <a href="http://www.opencalais.com">OpenCalais</a>, which ended up being the most satisfying thing about it. I&#8217;ve left all the <a href="http://madeofstring.co.uk/newsmap/tagging">tags and types produced by OpenCalais</a> in so you can compare the tags against the content.</p>
<p>OpenCalais is actually pretty good, despite my previous churlish Twitter whinging. It seems a bit petty to pick holes given that it&#8217;s a free service provided kindly by Thomson Reuters, but well, I&#8217;m going to anyway&#8230;</p>
<p>Most of my problems with it are to do with the categorisation of the tags. It seems to be pretty good at pulling out <a href="http://madeofstring.co.uk/newsmap/entities/person">names</a>, with some exceptions: <a href="/newsmap/entities/person/lea-marston">Lea Marston</a> and <a href="/newsmap/entities/person/leek-wootton">Leek Wootton</a> are places rather than people. Similarly, <a href="http://madeofstring.co.uk/newsmap/entities/city/warwickshire">Warwickshire</a> is tagged as a City rather than a <a href="http://madeofstring.co.uk/newsmap/entities/province-or-state">Province or State</a>, although it correctly works out that <a href="http://madeofstring.co.uk/newsmap/entities/province-or-state/north-warwickshire">North Warwickshire</a> fits into the latter category.</p>
<p>It could do better at working out synonyms &#8211; for instance <a href="http://madeofstring.co.uk/newsmap/entities/industry-term/anti-virus-software">anti-virus software</a> and <a href="http://madeofstring.co.uk/newsmap/entities/industry-term/antivirus-software">antivirus software</a> are the same thing, and I remember seeing a couple of places where both the plural and the singular are included as separate tags.</p>
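<p>A minimal sketch of how near-duplicate tags like these might be merged after the fact. This is not something OpenCalais or the News Mine does; the function names and the crude hyphen and plural rules are assumptions made purely for illustration.</p>

```python
from collections import defaultdict


def normalise_tag(tag):
    """Reduce a tag to a crude comparison key (illustrative rules only)."""
    key = tag.lower().replace("-", "")   # "anti-virus" becomes "antivirus"
    key = " ".join(key.split())          # collapse repeated whitespace
    if key.endswith("s") and not key.endswith("ss"):
        key = key[:-1]                   # naive plural stripping
    return key


def group_synonyms(tags):
    """Group tags that share the same normalised key."""
    groups = defaultdict(list)
    for tag in tags:
        groups[normalise_tag(tag)].append(tag)
    return dict(groups)


if __name__ == "__main__":
    sample = ["anti-virus software", "antivirus software", "Schools", "school"]
    for key, variants in group_synonyms(sample).items():
        print(key, variants)
```

<p>The plural rule is deliberately naive and will mangle some words, so a real clean-up would probably want a proper stemmer or a hand-checked list.</p>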
<p>For some reason I was impressed that it knows that <a href="http://madeofstring.co.uk/newsmap/entities/tv-show/come-dine-with-me">Come Dine With Me</a> is a TV show, and that the communications team write so much about <a href="http://madeofstring.co.uk/newsmap/entities/programming-language">programming languages</a>. The latter is one case where you would possibly post-process the tags found, in this case by chucking them away.</p>
<p>OpenCalais doesn&#8217;t seem so hot on working out a more general keyword behind a story &#8211; the tagging on the story <a href="http://madeofstring.co.uk/newsmap/2010/02/09/civil-partnerships-and-marriage-increase">Civil partnerships and marriage increase</a> didn&#8217;t pull out the words <strong>marriage</strong> or <strong>wedding</strong> as tags.</p>
<p>(<strong>Update:</strong> See the <a href="http://madeofstring.co.uk/article/automatically-tagging-news-opencalais/#comment-3">comment</a> from Tom Tague of OpenCalais for clarification on the way that OpenCalais works.)</p>
<p><a href="http://madeofstring.co.uk/newsmap/newsstory/2006/03/16/pram-problem-for-alice-as-curtain-prepares-to-rise"><img src="http://madeofstring.co.uk/wp-content/uploads/2010/02/newsMineStory-500x197.png" alt="Screenshot of a story from the News Mine" title="newsMineStory" width="500" height="197" class="alignnone size-medium wp-image-53" /></a></p>
<p>For me the best thing to come out of the tagging was the &#8220;possibly related stories&#8221; sidebar in the news story page, which I added late on. When you open up a news story, it searches the database for the top 5 stories with the most tags matching that story, and mostly this works pretty well &#8211; possibly because of the robotic tagging consistency of OpenCalais.</p>
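<p>The tag-overlap ranking can be sketched roughly as follows. The real site does this with a database query from PHP; here a plain dictionary of story IDs to tag sets stands in for the database, and the function name and sample data are hypothetical.</p>

```python
def related_stories(story_id, tags_by_story, limit=5):
    """Return up to `limit` story IDs sharing the most tags with `story_id`."""
    target = tags_by_story[story_id]
    scored = []
    for other_id, tags in tags_by_story.items():
        if other_id == story_id:
            continue
        shared = len(target.intersection(tags))
        if shared:
            scored.append((shared, other_id))
    # Most shared tags first; ties broken by story ID for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other_id for _, other_id in scored[:limit]]


if __name__ == "__main__":
    # Made-up tags for four stories, keyed by a made-up story ID.
    stories = {
        1: {"Warwickshire", "marriage", "registrar"},
        2: {"Warwickshire", "registrar"},
        3: {"Warwickshire"},
        4: {"antivirus software"},
    }
    print(related_stories(1, stories))
```

<p>Stories with no tags in common are left out entirely, so a story with unusual tags may get fewer than five suggestions.</p>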
<p>On the technical front, the site is based on the usual PHP/MySQL combination, and I used the open-source <a href="http://www.codeigniter.com/">CodeIgniter</a> framework, with the <a href="http://simplehtmldom.sourceforge.net/">Simple DOM Parser</a> for scraping the news stories, and Dan Grossman&#8217;s <a href="http://www.dangrossman.info/open-calais-tags/">Open Calais Tags</a> library to send the main body text off to OpenCalais for processing. The elapsed time to process each piece of content was generally about 3-4 seconds, sometimes slightly shorter, sometimes longer (up to around 10 seconds). I had to run the routine several times to get results for all four thousand indexed stories &#8211; there was a memory leak somewhere along the way.</p>
<p>As for the thing that I initially started out to do, that&#8217;s pretty dull really &#8211; I used the Google Maps API to geocode a list of towns and villages in Warwickshire, as well as a few other places, and then ran the main text of the stories through a simple regular expression search to tag them up with <a href="http://madeofstring.co.uk/newsmap/places">places</a>.</p>
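<p>A rough sketch of what that regular expression search might look like, leaving the geocoding step out. The function names and the short place list are assumptions for illustration; the original routine was written in PHP and may well differ.</p>

```python
import re


def build_place_pattern(places):
    """Compile one case-insensitive pattern matching any known place name."""
    # Longest names first, so "North Warwickshire" is tried before "Warwickshire".
    ordered = sorted(places, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # Word boundaries stop "Warwick" matching inside "Warwickshire".
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def find_places(text, places):
    """Return the known places mentioned in `text`, in order of first mention."""
    pattern = build_place_pattern(places)
    canonical = {name.lower(): name for name in places}
    found = []
    for match in pattern.finditer(text):
        name = canonical[match.group(0).lower()]
        if name not in found:
            found.append(name)
    return found


if __name__ == "__main__":
    places = ["Warwick", "Warwickshire", "North Warwickshire", "Leek Wootton"]
    story = "Residents of Leek Wootton and north Warwickshire met in Warwick."
    print(find_places(story, places))
```

<p>Matching on exact names like this is cheap but blunt: it cannot tell a place from a person or organisation that happens to share its name.</p>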
<p>There are lots of improvements that could be made, but in the end it&#8217;s just a throwaway experiment. I&#8217;d like to improve the places tagging routine, which could be as simple as adding a few more places. The main thing would be to look into some way of fitting the tags around a pre-defined ontology. There&#8217;s no current method to suggest a list of categories and tags for OpenCalais to process content with, so it would have to be done after the results had been received.</p>
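<p>One possible shape for that post-processing step is sketched below: a hand-maintained table of corrections and unwanted types applied to the tags once they come back. The type labels, the table contents and the function name are all assumptions based on the examples mentioned earlier in this post, not part of OpenCalais itself.</p>

```python
# Tag types to throw away entirely (illustrative).
DISCARDED_TYPES = {"ProgrammingLanguage"}

# Hand-written corrections: (returned type, value) mapped to the preferred type.
TYPE_OVERRIDES = {
    ("Person", "Lea Marston"): "Place",
    ("Person", "Leek Wootton"): "Place",
    ("City", "Warwickshire"): "ProvinceOrState",
}


def fit_to_ontology(tags):
    """Apply local corrections to a list of (type, value) tag pairs."""
    fitted = []
    for tag_type, value in tags:
        if tag_type in DISCARDED_TYPES:
            continue
        tag_type = TYPE_OVERRIDES.get((tag_type, value), tag_type)
        fitted.append((tag_type, value))
    return fitted


if __name__ == "__main__":
    returned = [
        ("Person", "Lea Marston"),
        ("City", "Warwickshire"),
        ("ProgrammingLanguage", "English"),
        ("TVShow", "Come Dine With Me"),
    ]
    print(fit_to_ontology(returned))
```

<p>A lookup table like this only fixes the mistakes you have already spotted, so it would need topping up as new stories are tagged.</p>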
]]></content:encoded>
			<wfw:commentRss>http://madeofstring.co.uk/article/automatically-tagging-news-opencalais/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

