Computers, bikes and things I’d like to remember.

Back to reading

February 12th, 2010 Posted in General | No Comments »

Having put evenings of Uni study behind me, I can now devote time to reading again.

I still feel a small pang of guilt when I settle down to read a book - as though I ought to be doing something more useful - but I’m getting over it. I’m also caving in to the temptation to drop a few dollars here and there on inexpensive books when I see them.

Right now I’m reading Hugh and Jon’s “Practical Arduino”, Peter Doherty’s “The Beginner’s Guide to Winning the Nobel Prize”, Terry Pratchett’s “Making Money” and a couple of tech magazines that I’m trying to stretch out by reading no more than an article a day. In the wings I have four new books that I am resisting the temptation to start.

This weekend looks like it might be a wet one. Oh dear, I may be forced to sit indoors and read.

Time to stand up.

January 25th, 2010 Posted in General | No Comments »

If you know me, you’ll know that walking in a protest march or chaining myself to a tree is not my thing. In fact I have spent years supporting the idea of, “Don’t vote. It only encourages them.”

But the Australian government’s proposed Internet filtering legislation is so poorly conceived, so expensive, and so ineffectual, that I can no longer remain silent. It will fail to do what it purports to do (protect children). It will cost an unconscionable amount of burned money at a time when this country is sorely stretched for funds. It will adversely affect hundreds of thousands of Australians.

If this issue burns you, or even piques your interest, more here: http://www.internetblackout.com.au/

If you would like to register your disapproval, you can join the blackout protest and black out your website or your online avatar (facebook or other pic).

Click on michaelcarden.net to see what the web site protest pop-up looks like.

Linux Conf Au 2010

January 6th, 2010 Posted in Computing, General | 1 Comment »

LCA 2010 starts in just a few days (January 18) and I’m mighty jealous that I won’t be there.

I first attended LCA in 2005, not at all confident of having the geek skills needed to understand it all, and I had a ball. Simon Phipps from Sun, Eben Moglen from the SFLC and Tridge’s hilarious exposé of BitKeeper ‘hacking’ were just some of the highlights. I was hooked.

In 2006 I crossed the Tasman to Dunedin, in 2007 I was a speaker in Sydney, 2008 was a great week in Melbourne and in 2009 I walked up and down that hill at U-Tas several times a day. Huge experiences, each and every one.

And a year ago when I heard that 2010 would be in Wellington NZ, I rejoiced because I love New Zealand and I love Wellington. I started imagining my LCA 2010 experience.

Then reality intervened. Mostly in the shape of money, or the lack of it. Work had paid for four of my five previous LCA experiences, but this time around the work piggy bank was empty. I might have paid for myself but my very good friends at the Australian Tax Office had other plans and required me to start paying for my recent Uni education. Bye bye several thousand dollars.

But I can’t resist looking. I keep surfing the LCA web site and drooling over what I won’t see. And today my drooling paid off big time.

I was re-reading the details of the keynote speakers and I idly clicked the link to the blog of Gabriella Coleman from NYU. There I found a PDF preprint of her paper, “The Hacker Conference: A Ritual Condensation and Celebration of a Lifeworld.” It’s a wordy title, as most academic papers seem to have, but it’s a ripping read. It examines the hacker conference as a social phenomenon via Gabriella’s experiences at Debconfs and others.

See how many people you recognise from the photos in the paper. Then try not to tell me how much fun you’re having at LCA while I sit back in Canberra and cry into my beer.

Canberra Streetview update

December 17th, 2009 Posted in General | No Comments »

Streetview camera car

Last Saturday while we were picnicking at Black Mountain Peninsula, the Streetview camera car drove slowly in and out along the main access road. I had seen it briefly on the north side of Black Mountain the previous week, and then again just a few minutes ago zipping up the Tuggeranong Parkway.

Dad managed to grab this quick pic as it cruised past taking photos of us and of Michelle’s car.

Xena 5.0.0 released

December 16th, 2009 Posted in Computing, General | 1 Comment »

After a whole lot of interesting work behind the scenes, the team has pulled together and made the latest and greatest version of our Xena digital preservation software available.

The SourceForge download page offers source code, a Mac dmg installer, a Windows exe installer and packages for Sun’s JRE and OpenJDK for anyone.

I think that the most exciting new feature is our addition of the ability to create a searchable text version of anything that has a text representation. For PDF and DOC that is just an extraction, but for TIFF images of scanned text documents it involves integrating with an Optical Character Recognition engine. We’re using Google’s Tesseract which does a pretty good job of OCR but can be a bit fragile. We managed to find some image content that kills Tesseract with a segfault but it looks like the version in source control is better. Anyway, this is a useful step forward for our software.

In addition we have cut out all of the static jars from external projects that used to live in our source tree, reduced the number of libraries we depend on, added source for those we do depend on and reset our license to be GPL3.

All of this and more besides. Grab a copy, have a play and let us know what you think.

From PDF to TIFF to ASCII

November 27th, 2009 Posted in General | 1 Comment »

This is one of those “so I remember next time” blog posts.

Yesterday I was asked to help someone who wanted to turn some PDF files into text using OCR. These particular PDFs were made from TIFF files created by scanning lots of paper. The OCR software that I have to hand is Tesseract, Google’s free and open source OCR engine, and it likes its images to be monochrome TIFFs with a three letter TIF file name extension. So I needed to extract TIFF images from the PDFs at a high enough resolution for OCR to work, convert them from RGB colour to 1 bit TIFF and feed them to tesseract to extract some text. There must be a nicer way, but here’s how I eventually did it:

To extract the RGB TIFF data from the PDF as monochrome at a high resolution, I used the ‘convert’ command from the open source ImageMagick suite.

convert -monochrome -units PixelsPerInch -density 300x300 Navy_List-October-1905-1.pdf image%02d.tif

This results in 34 individual TIFF files, one for each page of a 34 page PDF. Then, to turn these into one big TIFF file with a three letter extension, I used the convert command again:

convert -adjoin image* bigtiff.tif

Finally, I used tesseract to OCR the resulting image file and extract the text into a file I called bigout.txt (tesseract adds the txt extension automatically).

tesseract bigtiff.tif bigout

The result is awful if the purpose is to read the text, but as the basis for a full text search of the documents, given the quality of the scanning, it’s actually pretty good.
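For next time, the three steps above can be wrapped in one small script. This is only a sketch under a few assumptions: the function names `run` and `pdf_to_text`, the `DRY_RUN` switch and the `-page` file name pattern are made up for illustration and are not part of ImageMagick or tesseract. It uses the same convert and tesseract options as the commands above, and a per-document prefix instead of `image*` so that stray files in the directory are less likely to be picked up by the glob.

```shell
#!/bin/sh
# Sketch: PDF -> monochrome TIFF pages -> one multi-page TIFF -> text.
# run, pdf_to_text and DRY_RUN are hypothetical names used for illustration.

run() {
    # With DRY_RUN=1 just print the command, otherwise execute it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

pdf_to_text() {
    pdf=$1    # input PDF
    base=$2   # base name for the intermediate TIFFs and the text output

    # 1. One monochrome 300 dpi TIFF per page of the PDF.
    run convert -monochrome -units PixelsPerInch -density 300x300 \
        "$pdf" "${base}-page%02d.tif" &&
    # 2. Join the page images into a single multi-page TIFF.
    run convert -adjoin "${base}"-page*.tif "${base}.tif" &&
    # 3. OCR the result; tesseract adds the .txt extension itself.
    run tesseract "${base}.tif" "$base"
}

# Only do the work when the script is given both arguments.
if [ "$#" -ge 2 ]; then
    pdf_to_text "$1" "$2"
fi
```

Saved under a name such as pdf2text.sh (also made up), running `DRY_RUN=1 sh pdf2text.sh Navy_List-October-1905-1.pdf navy` prints the three commands without touching any files, which is a cheap way to check the file names before starting a long conversion.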