Converting from HTML5 to Kindle Format

At the recent meeting of the book club I'm in, it was decided that we'd read To Kill a Mockingbird. A couple of us immediately reached for our Kindles to see if we could download it. It turns out that all you can get from Amazon's Kindle Store are study guides - everything but the book itself. I suspected this was because the book is out of copyright, so they can't easily sell an eBook of something you can essentially get for free, and that suspicion turned out to be correct.

But could I find a copy of the text in the right format? Well, I probably could, but I fancied having a go at creating it myself. Which I duly did.

Step 1: Find the Text.

It's easy to find the text of out-of-copyright books online. Looking for Jane Austen? I refer you to Pemberley.com. To Kill a Mockingbird I found at the Washington State School for the Blind (although it turns out that their text is incomplete - they make 26 chapters available but the book has 32).

Step 2: Create A Single File of the Text

As usual for books published as Web pages, each chapter has its own page. That's sensible, and it's how I'd do it, except that I want the whole thing in a single file that I can convert to Kindle format. So that's what I created. It's in HTML5, and each chapter is an <article> except the top matter, which is in a <header>. I also decided to add in some microdata using schema.org, some RDFa using basic Dublin Core properties, and I chucked in some Open Graph data as well for good measure.
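
For anyone who would rather script this step, here is a minimal sketch in Python of how such a file could be assembled. It is not necessarily how the file described above was built: the function name `build_book` and the sample data are hypothetical, each chapter is assumed to be available already as a well-formed HTML fragment, and the microdata, RDFa and Open Graph markup are left out to keep it short.

```python
from html import escape


def build_book(title, author, chapters):
    """Return one HTML5 document: a <header> for the top matter,
    then one <article> per chapter.

    `chapters` is a list of (heading, html_fragment) pairs. The
    fragments are assumed to be trusted, well-formed HTML and are
    inserted unchanged; only the title, author and headings are escaped.
    """
    parts = [
        "<!DOCTYPE html>",
        '<html lang="en">',
        "<head>",
        '<meta charset="utf-8">',
        f"<title>{escape(title)}</title>",
        "</head>",
        "<body>",
        "<header>",
        f"<h1>{escape(title)}</h1>",
        f"<p>{escape(author)}</p>",
        "</header>",
    ]
    for number, (heading, fragment) in enumerate(chapters, start=1):
        parts.append(f'<article id="chapter-{number}">')
        parts.append(f"<h2>{escape(heading)}</h2>")
        parts.append(fragment)
        parts.append("</article>")
    parts += ["</body>", "</html>"]
    return "\n".join(parts)


if __name__ == "__main__":
    # Hypothetical sample data, just to show the shape of the output.
    sample = [
        ("Chapter 1", "<p>First chapter text.</p>"),
        ("Chapter 2", "<p>Second chapter text.</p>"),
    ]
    print(build_book("Example Title", "Example Author", sample))
```

Giving each <article> an id is a choice made for this sketch; it makes it easy to link to individual chapters later.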

[Image: first edition cover, showing a simple design of a cartoonish, hand-drawn tree]

Step 3: Add in an image of the cover

It's not hard to find an image of the first edition cover of To Kill a Mockingbird.

Step 4: Convert to Kindle format

What is Kindle format? It's an eBook format known as .mobi (nothing at all to do with the top level domain of the same name). And there are various tools that let you convert documents into that format from common ones. I first used an open source tool called Auto Kindle eBook Converter which, for reasons I wasn't able to fathom, rendered the text in italics. I tried messing with the CSS but it made no difference, italics it was every time.

Let's try a different converter. Online Convert offers a free conversion service and, very conveniently, I could just paste in the URI of my page. That would be very good, and I'm sure I'll use this service again, but I found I still needed to mess around with the source code of the page a bit and doing that online was an added pain.

In the end I downloaded Calibre eBook Management, which looks like the Real McCoy in terms of extended functionality. It's another open source project too, so that feels right. If I use it a lot I promise I will make a donation.
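
Calibre also installs a command-line tool, ebook-convert, so the conversion can be scripted rather than done through the GUI. The following is a rough sketch of driving it from Python, assuming ebook-convert is on the PATH. The file names are hypothetical, and the helper names `build_convert_command` and `convert` are made up for this example.

```python
import shutil
import subprocess


def build_convert_command(source, target, title=None, authors=None, cover=None):
    """Build the argument list for Calibre's ebook-convert.

    The output format is chosen from the extension of `target`,
    so a name ending in .mobi produces a Kindle-readable file.
    """
    cmd = ["ebook-convert", source, target]
    if title:
        cmd += ["--title", title]
    if authors:
        cmd += ["--authors", authors]
    if cover:
        cmd += ["--cover", cover]
    return cmd


def convert(source, target, **metadata):
    """Run the conversion, failing loudly if Calibre is not installed."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; is Calibre installed?")
    subprocess.run(build_convert_command(source, target, **metadata), check=True)


if __name__ == "__main__":
    # Only prints the command; file names here are placeholders.
    print(build_convert_command("book.html", "book.mobi", cover="cover.jpg"))
```

Calling `convert("book.html", "book.mobi", cover="cover.jpg")` would then run the conversion itself, with the cover image from Step 3 attached.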

Step 5: Fiddle about a bit

The text was being converted nicely, no italics, so that was fine, but the frontispiece wasn't right. That's because I'm using <hn> elements for things like the publisher and author. Calibre puts a page break in before those so I had about 4 pages of titles - not what I wanted. Actually, in HTML5, if you have multiple headers like that the advice is to use an <hgroup> element, rather than a <header> which is what I tried. But the credit to the Washington State School for the Blind is not a heading and neither is the image — and all you can put in an <hgroup> are <hn> which, in this instance, I found restrictive. I could have used an <hgroup> within a <header> but that's getting silly.

Anyway, in the end I just changed the headers to paragraphs, did the conversion, and then put them back again. That seemed to work.
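
The swap there and back can be automated. Below is a minimal sketch, assuming the headings to be demoted are the <h2> to <h6> elements inside the <header> and that the markup is simple enough for regular expressions to cope with. The function names and the `data-heading` marker attribute are inventions for this example.

```python
import re

# The top matter: everything from <header ...> to </header>.
HEADER_BLOCK = re.compile(r"<header\b.*?</header>", re.DOTALL | re.IGNORECASE)
# A complete h2..h6 element, capturing its level, attributes and content.
HEADING = re.compile(r"<h([2-6])\b([^>]*)>(.*?)</h\1>", re.DOTALL | re.IGNORECASE)
# A paragraph previously marked by headings_to_paragraphs().
MARKED_P = re.compile(r'<p data-heading="([2-6])"([^>]*)>(.*?)</p>', re.DOTALL)


def headings_to_paragraphs(html):
    """Turn h2..h6 inside <header> into marked paragraphs.

    Headings elsewhere (e.g. chapter titles) are left alone.
    """
    def demote(block):
        return HEADING.sub(r'<p data-heading="\1"\2>\3</p>', block.group(0))
    return HEADER_BLOCK.sub(demote, html)


def paragraphs_to_headings(html):
    """Undo headings_to_paragraphs() using the data-heading marker."""
    return MARKED_P.sub(r"<h\1\2>\3</h\1>", html)
```

Run `headings_to_paragraphs` on the source before converting and `paragraphs_to_headings` afterwards to restore it. Calibre's ebook-convert also has a `--page-breaks-before` option that takes an XPath expression, which might make the edit unnecessary, though that has not been tested against this file.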

Step 6: Server configuration

The final step was to add a line to this site's server config (.htaccess) thus:

AddType application/x-mobipocket-ebook .mobi

This means that the .mobi file is served with the correct content type.
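
To confirm the server really is sending that type, you can look at the Content-Type header on a HEAD request. Here is a small sketch in Python; the function names are made up for this example and the URL in the usage note is a placeholder.

```python
from urllib.request import Request, urlopen

MOBI_TYPE = "application/x-mobipocket-ebook"


def is_mobi_type(content_type):
    """True if a Content-Type header value names the MOBI media type.

    Parameters such as "; charset=..." are ignored, as is letter case.
    """
    if not content_type:
        return False
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == MOBI_TYPE


def served_content_type(url):
    """Send a HEAD request and return the Content-Type header, or None."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=10) as response:
        return response.headers.get("Content-Type")
```

For example, `is_mobi_type(served_content_type("https://example.org/book.mobi"))` should come back True once the AddType line is in place.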

Step 7: Publish …

… and hope this really is out of copyright.

I offer you the first 26 chapters (of 32):

being the first of what I imagine will be a growing collection of HTML and eBook representations of out of copyright books on this site.

Next step is to set up content negotiation — all in good time.