Ebook/Epub/Docbook Braindump

(July 21 Update: For some reason WordPress is very flaky about handling lots of XML code, so I am no longer updating this post. I should add that I am now using the epub3 XSL instead of db2epub.py or Ant. Basically I am generating the XHTML files, manually copying over the images and CSS, and manually zipping them up. It seems to work, but I am just starting. I shall eventually make a longer document about epub3 + Docbook and link to it at the top. For now, though, what is below is a helpful reference. See the MobileRead e-production wiki for better information.)

(Sept 15 2011: I found a compatibility problem in using db2epub.py for producing ebooks, so my next priority is moving my workflow to Oxygen + Ant + Docbook. It’s quite a complicated task, so I’m basically dropping everything I’m doing to get it done. Rest assured that 1) when I learn it, I will do a write-up about it and 2) I recommend that Oxygen users try using Ant as well. Db2epub.py worked well, but with more complicated setups, I found that the script didn’t offer much flexibility. Oxygen + Ant created problems of its own, but I’ve decided that they are now worth solving. Read the rest of the article with this caveat in mind.)

I have been deeply involved in producing epub files. I keep coming across tiny bits of secret wisdom, and it occurs to me that I should keep a list. My toolchain for producing epubs is Docbook XML + customization layer for Docbook’s epub stylesheet + db2epub.py + epubcheck. Some of my tips may be specific to Docbook. (See also: my reference for technical writing, which contains a concise list of links about epub production.) I’ll be adding to this over time.

One important thing worth mentioning is that there is now an official branch of Docbook (called DocBook Publishers schema). It is brand new, so there are no stylesheets, but it looks like they will be based on the Docbook XSL stylesheets, so they won’t need to be written from scratch. (See: Mailing list.) See the bottom of this page for more info about this schema. I’m in the process of cleaning these remarks up for clarity and succinctness. If I decide that a remark is out-of-date or no longer applicable, I’m not going to delete it but put it at the bottom of its category with the label NO LONGER VALID before it.

See also: Docbook XSL Snapshot and Oxygen mailing list.

  1. I have developed some templates for inserting images in docbook. I created different roles for media objects (i.e., values of the role attribute on mediaobject) and used five roles: tall-image-right, tall-image-left, wide-image-float, wide-image-with-caption, wide-image-no-caption. In my XSL customization layer, I set up code to turn them into div class="tall-image-right" (etc.). Here is the code to do that. Note that this applies only to vanilla Docbook XSL, not the NS version.
    <xsl:template match="mediaobject[@role = 'wide-image-float']" mode="class.value">
      <xsl:value-of select="'wide-image-float'"/>
    </xsl:template>
    <xsl:template match="mediaobject[@role = 'wide-image-with-caption']" mode="class.value">
      <xsl:value-of select="'wide-image-with-caption'"/>
    </xsl:template>
    <xsl:template match="mediaobject[@role = 'wide-image-no-caption']" mode="class.value">
      <xsl:value-of select="'wide-image-no-caption'"/>
    </xsl:template>
    <xsl:template match="mediaobject[@role = 'tall-image-right']" mode="class.value">
      <xsl:value-of select="'tall-image-right'"/>
    </xsl:template>
    <xsl:template match="mediaobject[@role = 'tall-image-left']" mode="class.value">
      <xsl:value-of select="'tall-image-left'"/>
    </xsl:template>

    Next I declared css for all these.

     

      /* Even though I'm giving different css depending on the image div,
         here are defaults which apply to all captions. Note that I am not
         defaulting to text-align: center because captions for floated
         images don't look great when centered. */
      div.caption p {
        text-indent : 0em !important;
        font-style : italic;
        font-size : 0.8em;
        color : #157DEC;
      }

      /* div.tall-image-right floats the image to the right at half the
         text width, capped at 350px */
      div.tall-image-right {
        width : 50%;
        max-width : 350px !important;
        float : right;
        margin : 0 1em 0 1em !important;
        display : inline-block;
      }
      div.tall-image-right img {
        border : 1px;
        width : 100%;
        max-height : 50%;
        /* max-width: 350px;*/
        vertical-align : text-top;
      }

      /* div.tall-image-left is the same, floated to the left */
      div.tall-image-left {
        width : 50%;
        max-width : 350px !important;
        float : left;
        margin : 0 1em 0 1em !important;
        display : inline-block;
      }
      div.tall-image-left img {
        border : 1px;
        width : 100%;
        max-height : 50%;
        /* max-width: 350px;*/
        vertical-align : text-top;
      }

      /* rest of styles are declared separately above for all captions */
      div.tall-image-right div.caption p {
        margin : auto;
      }
      div.tall-image-left div.caption p {
        margin : auto;
      }

      /* div.wide-image-float is for images smaller than 350px which can be
         floated and still leave 35-40% of the screen for text. Probably
         should not include a caption here. div.wide-image-no-caption (see
         below) will display the image not floated with scalable width */
      div.wide-image-float {
        width : 60%;
        max-width : 350px !important;
        float : right;
        margin : 0 1em 0 1em !important;
        display : inline-block;
      }
      div.wide-image-float img {
        width : 100%;
        max-height : 50%;
        /* max-width: 350px;*/
        vertical-align : text-top;
      }

      /* div.wide-image-with-caption sets an image width of 550px and a
         caption width of 525px just to be safe */
      div.wide-image-with-caption {
        display : inline-block;
        padding : 10px 0em 10px 0em !important;
        vertical-align : middle;
        width : 100%;
      }
      div.wide-image-with-caption img {
        display : block;
        margin : 0 auto;
        width : 550px;
      }
      div.wide-image-with-caption div.caption p {
        text-align : center;
        margin : 0 auto;
        display : block;
        width : 525px;
      }

      /* div.wide-image-no-caption doesn't set an image width for maximum
         flexibility */
      div.wide-image-no-caption {
        padding : 10px 0em 10px 0em !important;
        vertical-align : middle;
        width : 100%;
      }
      div.wide-image-no-caption img {
        display : block;
        margin : 0 auto;
      }

    Some notes: 1) Nook Touch lets the user set narrow margins, thus defeating the wide-image-with-caption DIV css. (The solution is to narrow the width, but that just sucks because it limits how big your image can be.) 2) This CSS doesn’t work with kindlegen. I think I tested some other solutions to that, which I’ll document later. 3) In general, captions don’t look nice in e-ink readers, especially for floated images. Better to use very short captions if you can, or none at all.

  2. Zipping up your project into a zip/epub file is very tricky because the mimetype file has to be the first file in the zip and must be stored without compression. (See this thread I made about this issue). That actually is a good reason for using an automated tool to produce the epub rather than trying to do it manually.
  3. You actually don’t need to create a page with a TOC in it if you have created a proper epub file. The main reason you see this in epub files is that mobipocket requires it and publishers are too lazy to remove it from  their epub files.
  4. In general, hyperlinks on an e-ink device are cumbersome to use.  On touchscreen devices, they work ok, but the limiting factor is the width of your finger. That is why you shouldn’t have links too close to one another regardless. Four or five links maximum.
  5. Frankly, pdf files look so good in Ipad’s Goodreader that I no longer flinch at the very thought of using  pdf as an ebook format. If you are editing print-ready PDF copy on the ipad, Ipad’s iAnnotate app is great!  4/2011 Update: You can do annotation on ibooks too, but when you email the results, it only emails the notes you make – not the highlighted text or the specific location where you did the annotation. The key is copying the text into the note and then emailing it to yourself.
  6. One of the points made about the epub format is that each device’s implementation of the format will differ (and even if they don’t differ, their form factor imposes limitations on what is doable). This is basically true and imposes a constraint on layout for a device. This is painfully  true for device families consisting of a high end and low end device.
  7. The big thing awaiting us in the next epub specification is CSS Media Queries. This is a CSS 3 feature that lets you set conditional CSS depending on the device-width and device-height. This is badly needed! Right now when you make an epub, you end up making one for the software on  an e-ink sized device and another for the same software on an  ipad (and other bigger tablets). It is painful. Right now, CSS media queries are supported on all browsers except IE 8. But in the spring IE 9 will support CSS media queries and end this madness (for web browsing at least). It could take 2 years for ebook devices to support it. I can’t wait!
  8. Right now the Nook is capable of accepting embedded fonts, but they recommend against it. Mainly because of legal restrictions, but also because of bad device support, embedding fonts is not a good idea. (For now, you should do it only on your website, not your ebook).
  9. In my current ebook, I’m dealing with wide images and tall images. You need to create two test chapters: one for wide images, and another for tall images and make sure both render fine in the ebook reader. (I’ll describe how to do it in docbook below).
  10. Here are working examples of epub and mobipocket files which display endnotes. Impressive. Here’s the code that was used. I’m happy to report that Docbook already has both the semantics and the style templates to make endnotes. I feel 95% sure that you could customize these endnotes to work with Amazon.
  11. According to Tallent in his ninja podcast on images (near the end), when you lower the font size in ADE, the cover page will change to 2 columns, and readers will left-align the cover in this case. To solve this, you can add this non-standard css declaration to the html cover page of the epub: oeb-column-number: 1.
  12. July 2011 Update: This is not really relevant anymore. I will be posting tested code examples soon. Liz Castro’s book on epub formatting has a serious error about how to format images. In the Working with Images section Step 3, she recommends img{width: 100% !important;}.Here’s what she meant to say:
      img{ max-width: 100%; vertical-align: text-top; margin: 0 0 0 .5em !important; } 

    (I verified this by expanding the actual epub file of Castro’s book; her code in the epub file is right, but the code in the chapter is wrong). Generally I recommend Castro’s book still, but you should keep in mind that many of her tricks are specific to the ipad and (often) specific to Adobe InDesign. Some of her tips are crazy. For example, to circumvent the limitation of ibooks fonts on text, she suggests adding a rare element inside it and styling it with one of ipad’s other fonts. I like knowing about this trick, but it messes up the semantics (and it is undoable in Docbook). A better strategy would be to wait for ibooks to improve its font support. April 19 2011 Update: Liz Castro addresses this issue; apparently, the problem is that ibooks doesn’t support css properties for width for img while all the other epub implementations seem to. I haven’t had time to test her solutions or to offer my own; but I have a feeling some workarounds or solutions will congeal around this post.
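
    The zipping rule from item 2 (mimetype first, stored without compression) is easy to script. Here is a minimal Python sketch using only the standard zipfile module. It is not what db2epub.py does internally; the function name build_epub and the assumption that the unpacked book lives in one source directory are mine, for illustration only.

```python
import zipfile
from pathlib import Path

MIMETYPE = "application/epub+zip"


def build_epub(source_dir, epub_path):
    """Zip source_dir into epub_path with the mimetype entry first and uncompressed.

    Assumption: epub_path lies outside source_dir, otherwise the archive
    would try to include itself.
    """
    source = Path(source_dir)
    with zipfile.ZipFile(epub_path, "w") as archive:
        # The mimetype entry has to be the first entry and must be stored, not deflated.
        archive.writestr(zipfile.ZipInfo("mimetype"), MIMETYPE,
                         compress_type=zipfile.ZIP_STORED)
        for path in sorted(source.rglob("*")):
            if not path.is_file():
                continue
            arcname = path.relative_to(source).as_posix()
            if arcname == "mimetype":
                continue  # already written above
            archive.write(path, arcname, compress_type=zipfile.ZIP_DEFLATED)
    return epub_path
```

    Calling build_epub("book", "book.epub") on an unpacked book directory should give a file worth feeding to epubcheck; epubcheck remains the real judge of whether the result is valid.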

Testing

  1. When testing epubs in Nook for PC, there is no way to delete an epub file from within the application. To do this, go to C:\My Documents\My Barnes & Noble eBooks and delete the file.
  2. Adobe Digital Editions is an application for testing ebooks on a desktop. Many e-ink readers render epub files with the Adobe Adept parser, which can read both DRM-locked and DRM-free epub files. Adobe Adept is found on Nook, Sony, Bookeen and other e-ink devices, so any Adept bug is going to show up on all of those devices. Also, one needs to pay attention to Adobe’s guide to Best Practices for Designing Epub files (.epub), which explains some tips for improving rendering performance. For example, keep unused styles to a minimum, avoid pseudo-selectors, use ems, and use 0 default values for the most common elements.
  3. As far as I know, there is no way to add an epub file in Nook for ipad or a mobipocket file in Kindle for ipad.
  4. Initially at least, I am happy testing ebook html in my browser. To do this, you can add the Chrome Resolution Test to view the page in a 800×600 resolution. (You can also do this using the Web Developer plugin in Firefox, but I prefer to stay in a Webkit browser environment).  Note that if your css is optimized for this resolution, it can look like total crap in a normal browser viewport.
  5. One totally cool thing about Oxygen XML Author is that you can view and edit individual files in your epub file on the fly even though it is zipped up.

E-ink Style and Design Considerations

  1. The big challenge in e-ink readers is maximizing the use of space. Unlike normal book publishing (where white space gives the eyes a break), e-ink publishing is so compact that you want to minimize the number of page turns. That is not to say everything should look crowded; certainly the end of chapters will have white space; also images break up the monopoly of text on the screen. But if a reader has a sense that one page doesn’t offer enough information, then that drags the story.  (That is yet another reason why the Kindle format is a failed format; lack of support for floats makes it impossible to use  graphics without wasting space).
  2. On e-ink devices, putting white text with colored background is an effective way to handle contrasts. E-ink translates colors fairly evenly.
  3. For titles (which I make white text plus dark background), you need to make sure that there is at least one pixel at the top. It looks crappy if the title bar bleeds onto the top.
  4. I am relatively neutral about whether images should be allowed to bleed onto one side of the page. I don’t think it looks bad.
  5. As counterintuitive as this seems, it makes sense to start the text immediately after the title bar. (maybe a little space is ok). If the title bar has enough css padding,  that creates a kind of empty space for the reader’s eyes.
  6. Even though I generally try to put captions under every major graphic, on e-ink devices they clutter up the page.  Lose the captions unless absolutely necessary.
  7. For e-ink, sidebars and captions need to be italic to convey enough contrast.
  8. Ebook covers are tricky especially because they are viewed in substantially reduced sizes. In print books you can get away with cramming design into a cover; in ebooks, minimal-but-distinctive covers seem to work well. (Joel Friedlander has written in more detail about cover design issues for ebooks).
  9. Instead of using em dashes, I always use en dashes and separate them from the word with a single space. One effect of doing this is that you make it easier for the browser/ebook reader to justify the text.

Ipad Specific Stuff

I was hoping not to spend too much time talking about optimizing ebooks for ipad. Liz Castro’s book does that job fairly thoroughly, and besides, I didn’t want to dwell on one platform (even if it is a beautiful platform). I’m having enough problems figuring out the mobipocket stuff. It’s worth keeping in mind that even though ibooks is still dominant

  1. Here are some thoughts about the image dimensions & ratio needed to force a new page. But this wouldn’t be necessary if ibooks supported css page-break options. Arrgh!
  2. Liz Castro shows how  to do line-breaks in the latest version of iBooks. embedded font, tweet. blockquote?
  3. iBooks early support for builtin fonts was limited, but Liz Castro notes that it has improved considerably. She explains how to do it.  Here’s a list of available Apple fonts.
  4. Castro provided updates about viewing tables in Epub.
  5. The real question is whether you should do the hack to set text-align: left in iBooks. Castro reports the hack. People armed with InDesign are more tolerant of breaking semantics by including these hacks, but I worry about long-term compliance and code maintenance. My feeling is that we shouldn’t clutter HTML elements with these things and should instead support default behavior.
  6. NO LONGER VALID. According to Liz Castro, embedding fonts will work for Ipad if you put the css at the top. But the reader can’t change the font or the font size, leading her to wonder: what’s the point?

Mobipocket Production (or Docbook –> Epub –> Mobipocket)

Special contempt goes to the Mobipocket/Kindle format. Apparently Mobipocket never managed to update its css support. Most of its formatting is done through attributes on HTML elements, which went out of style way back in the last century. It is a seriously defective format. In general, I would say it is incompatible with the Docbook workflow process I have outlined; it might be easier to do everything in MS Word. I tried to avoid Joshua Tallent’s Kindle Formatting book as a matter of principle, but he does a good job of explaining how images are processed and all the craptastic workarounds that are needed as a result of the quirky way the format supports images and how it formats things in the html rather than in the CSS. Eventually I bought Tallent’s book and am generally happy with it (even though I was horrified at the hoops he had to go through). Tallent “cheats” with regular expressions, which is ok, I guess, but it’s still unwieldy. Liza Daly suggests making 2 epub files: one for mobipocket, one for everybody else. That makes a lot of sense if you merely have to swap CSS files. But oh, I forgot, it looks like you need to do profiling (i.e., conditional text & elements) when writing your docbook source. A little bit is ok, but the core problem is that content and presentation are not separated in Mobipocket.

Aug 1 2011 Update. I have resigned myself to the need to do some post-processing of HTML code after the epub is generated. Yes, I feel dirty doing it, but fundamentally docbook is not suited for this kind of manipulation. (Or maybe it is, but it doesn’t seem to be worth the effort.)

Kindle does not support any kind of float, be it for images or sidebar divs. That means every image has to use 100% of the screen width as a block regardless of how small or narrow it is. (If the image is smaller than half the screen size, it will not blow up to fill the screen width; still, it will leave a big empty space there, which is just as bad.) Also, Kindle does not support handling for widows and orphans. This really affects readability. Mobipocket really doesn’t support css classes, which is a royal pain in the neck.

Oxygen lets you open the epub archive and edit the html files. I have generally performed 2 post-processing actions in addition to the ones listed below on docbook/xsl.

  1. I have manually created the TOC. Kindle cannot process two levels of TOC, and css will not work at all. To do 2 levels, I put the second level inside a blockquote:
      <div><a href="structuralmatters3.html">My Chapter Title </a></div>
      <blockquote>
        <div><a href="structuralmatters3.html#paradigm-more">A Paradigm that's a little more Paradigmatic</a></div>
        <div><a href="structuralmatters3.html#points-of-view-concerning">Points of View Concerning Points of View</a></div>
        <div><a href="structuralmatters3.html#odbcc-pattern">The OBDCC Pattern</a></div>
        <div><a href="structuralmatters3.html#past-pouring">However you Begin, the Past Comes Pouring In </a></div>
        <div><a href="structuralmatters3.html#third-interlude">Third Interlude</a></div>
      </blockquote>
      <div><a href="chapter4.html">My Next Chapter Title </a></div>

    The most recent Kindle Formatting Guide says that this method of doing subchapters with CSS is also acceptable:

      <style>
        div.chapter { margin-left: 1em}
        div.subchapter { margin-left: 2em}
      </style>
      <div>Section 1</div>
      <div class="chapter">Chapter 1</div>
      <div class="chapter">Chapter 2</div>
      <div class="chapter">Chapter 3</div>
      <div class="subchapter">Subchapter 1</div>
      <div class="subchapter">Subchapter 2</div>
      <div class="chapter">Chapter 4</div>
      <div class="subchapter">Subchapter 1</div>
      <div>Section 2</div>

    The above code will be sufficient (thanks, Joshua Tallent!). There is no CSS for these things. The docbook-generated TOC often looks bad on the Kindle, especially if the chapter/section titles contain a lot of words. This is a case where it is probably easier to hand-code a TOC in bland generic HTML. After you generate the epub, use Oxygen to paste the bland generic HTML into the html file containing the TOC and then let Oxygen zip it back up. It’s ugly, but it follows my guiding philosophy, which is: don’t let Kindle’s limitations hijack your docbook production process!

  2. Adding custom attributes and values for Kindle Epubs. I’ve used the following substitutions on all my HTML files to create no-indent files where appropriate. It mainly needs to be applied to places where you are putting PARA tags and not wanting them to be indented automatically.
    • Change <p class="no-indent"> to <p style="text-indent:0"> in all instances.
    • Change <div class="caption"><p> to <div class="caption"><p style="text-indent:0"> in all instances.
    • Change <p class="pullquote"> to <p style="text-indent:0"> in all instances.

    Update: All is not lost! With the help of Bob Stayton, we have a way to add custom attributes or values for epubs destined for the Kindle. That eliminates the need to do these postprocessing steps.

    To hardcode width="50px" inside caption/para:

      <xsl:template match="caption/para" mode="class.attribute">
        <xsl:param name="width" select="local-name(.)"/>
        <xsl:attribute name="width">50px</xsl:attribute>
      </xsl:template>

    To add an attribute "width" with the value being determined by the value of role="" inside the para element of caption/para:

      <xsl:template match="caption/para" mode="class.attribute">
        <xsl:param name="width" select="local-name(.)"/>
        <xsl:attribute name="width">
          <xsl:value-of select="@role"/>
        </xsl:attribute>
      </xsl:template>
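
    For anyone who sticks with post-processing instead, the three substitutions listed in item 2 can be scripted rather than done by hand. This is only a sketch under assumptions: the names KINDLE_SUBSTITUTIONS and kindle_postprocess are made up for illustration, the caption div itself is assumed to stay unchanged, and the patterns assume the attributes appear exactly as written in the list (same quoting, no extra attributes).

```python
import re

# (pattern, replacement) pairs mirroring the manual substitutions in the list.
# Assumption: the generated HTML writes these tags exactly in this form.
KINDLE_SUBSTITUTIONS = [
    # <p class="no-indent"> and <p class="pullquote"> both become inline-styled.
    (re.compile(r'<p class="(?:no-indent|pullquote)">'),
     '<p style="text-indent:0">'),
    # The first <p> directly inside a caption div gets the inline style too.
    (re.compile(r'(<div class="caption">\s*)<p>'),
     r'\1<p style="text-indent:0">'),
]


def kindle_postprocess(html):
    """Apply the Kindle no-indent substitutions to one HTML string."""
    for pattern, replacement in KINDLE_SUBSTITUTIONS:
        html = pattern.sub(replacement, html)
    return html
```

    You would run this over each html file in the Kindle copy of the epub only, leaving the epub for other devices untouched.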

Here are some additional docbook tips in preparing epubs for processing by kindlegen.

  1. You need to use profiling (i.e. conditional code) in order to create a Kindle-friendly epub. (Read below.) That of course means that when the kindle format dies its slow & ugly death, you’ll still have these conditional statements to deal with. On the other hand, it probably won’t be too hard to remove this crap by other means, and your code base is still flexible enough to anticipate future shock. August 2011 Update: Now that I’ve given up and agreed to search/replace postprocessing on html/epub files, maybe profiling will no longer be necessary except for images (which you can manipulate through use.role.for.mediaobject and preferred.mediaobject.role).
  2. Update 3 March 2011 on images. For tall images on the Kindle 3, I think 150 pixels by 190 is the optimal size if you want the image to occupy about a third of the screen (and 40% of height). Users who set the 2nd pair of Aa’s on the font button will get 11 lines of text after the image (10 for the 2nd pair of Aa’s, 8 for the 3rd pair of Aa’s). For wide images I suspect that the height can be 190px. Update: Well, that is interesting: I had a wide image at 500×334 and it still rendered ok. Maybe Kindle is more adept at resizing images which are wider than tall. By the way, I am scaling the image itself, not the code for the image. That means that I am not specifying image dimensions in the docbook/html code. The basic problem here of course is: How do you make images for Kindle which are small enough that they don’t blow up to the full page on a Kindle device and large enough that they don’t look puny in the Kindle app for Ipad? So far, no one has been able to answer that basic question.
  3. I should point out something obvious about Docbook, Kindle and images. You can use parameters to set conditional images, so it probably is going to make sense to keep 2 sets of images: 1 for your non-Kindle epubs and 1 for your Kindle epubs (and one for your originals, I guess). Then when you generate an epub from docbook, docbook will pick up only the ones relevant to your choice of epub. See use.role.for.mediaobject and preferred.mediaobject.role.
  4. I’ve noticed that an epub generated by the XSL will have <spine toc="ncxtoc"> <itemref idref="cover" linear="yes"/>.

    If you allow linear to be “yes” when running kindlegen, that will produce a duplicate cover. This is hard to see in the Kindle Previewer, but obvious on the actual device. (In the Previewer, if you go to the Cover and click the next page, you will see the same Cover again; if linear="no", then after you click the arrow, it will go to the TOC.) There is a parameter called <xsl:param name="epub.cover.linear" select="0" />. Allegedly, zero is supposed to mean no, but there’s a bug in the epub/docbook.xsl where it automatically goes to “yes.” I fixed it by copying the whole <xsl:template name="opf.spine"> into my customization layer and hard-coding the value “no” into both parts of the test. As of XSL version 1.76.1, you need to do this hardcoding. (I would definitely expect it to be fixed in the next version.) So here’s what my customized template looks like. (Remember this is ONLY for Kindle Epubs!)

    <!-- FROM RJ For some reason $epub.cover.linear always in 1.76.1 returns yes. I had to hardcode it to no to prevent the cover from rendering twice in kindle -->
    <xsl:template name="opf.spine">
      <xsl:element namespace="http://www.idpf.org/2007/opf" name="spine">
        <xsl:attribute name="toc">
          <xsl:value-of select="$epub.ncx.toc.id"/>
        </xsl:attribute>
        <xsl:if test="/*/*[cover or contains(name(.), 'info')]//mediaobject[@role='cover' or ancestor::cover]">
          <xsl:element namespace="http://www.idpf.org/2007/opf" name="itemref">
            <xsl:attribute name="idref">
              <xsl:value-of select="$epub.cover.id"/>
            </xsl:attribute>
            <xsl:attribute name="linear">
              <xsl:choose>
                <xsl:when test="$epub.cover.linear != 0">
                  <xsl:text>no</xsl:text>
                </xsl:when>
                <xsl:otherwise>no</xsl:otherwise>
              </xsl:choose>
            </xsl:attribute>
          </xsl:element>
        </xsl:if>
        <xsl:if test="contains($toc.params, 'toc')">
          <xsl:element namespace="http://www.idpf.org/2007/opf" name="itemref">
            <xsl:attribute name="idref">
              <xsl:value-of select="$epub.html.toc.id"/>
            </xsl:attribute>
            <xsl:attribute name="linear">yes</xsl:attribute>
          </xsl:element>
        </xsl:if>
        <!-- TODO: be nice to have a idref="titlepage" here -->
        <xsl:choose>
          <xsl:when test="$root.is.a.chunk != '0'">
            <xsl:apply-templates select="/*" mode="opf.spine"/>
          </xsl:when>
          <xsl:otherwise>
            <xsl:apply-templates select="/*/*" mode="opf.spine"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:element>
    </xsl:template>
    <!-- FROM RJ: end $epub.cover.linear = no hack -->
  5. Here’s a discussion about Mobipocket’s requirement to have a TOC file (which is totally unnecessary in epub files).
  6. The Kindle Previewer does an ok job of emulating Kindle formatting, but I am finding lots of inconsistencies (like extra  blank pages).
  7. You can approximate pullquotes with right-justified tables or just separating the pullquotes with HR tags. But I think it defeats the purpose (except for larger devices).
  8. Here are some kindle css tweaks by Anthony Levings to add to the mobipocket css file for paragraphs to turn indenting off for the first paragraph and how to vary the spacing between the header and the first paragraph.
  9. You probably should strip the P tags which are automatically output inside LISTITEMS for Kindle output. (See the Docbook-to-epub  section below).
  10. Because of a Docbook bug, chapters and sections and subsections are all h1. This is bad for Kindle format because that is the only way to contrast headings. Kindle can only differentiate between h1, h2, h3 and not h1.chapter, h1.section, h1.subsection.  (See the Docbook-to-epub  section below to see how I fixed this with the addition of a single line).
  11. I’ve noticed that the Kindle manual includes a css declaration which puts extra space above headings: h1, h2, h3 {margin-top: 1em;} This sounds like a good way to create space between the previous section and the title of the next section.
  12. Indexing barely works in epub-to-Kindle. I’ve noticed that Kindle cannot understand or process hierarchy. In other words, it can process only terms inside the <primary> element but not the <secondary> element. To make it work for Kindle, you would only use terms inside the primary element. On the other hand, if you use only the primary element, you could make links to the specific location of the <indexterm> element rather than the section in which it belongs. I think Kindle is unable to process definition lists properly.
  13. Here are some idiosyncrasies for using docbook to generate epub files which will be processed by kindlegen: strange lists, modifying the TOC, using hr for sidebars, adding pagebreak code (specified here).
  14. (NO LONGER VALID) Sadly, I report that an epub generated by Docbook for submission to Kindle format has a major compatibility problem with sections. HTML generated by docbook puts sections inside a div tag, but Kindle 3 stumbles badly with these nested DIVs (especially ones containing lots of content). The result is that (with Kindlegen 1.1 at least) the Kindle will add lots of extra blank pages at the end of the chapter. For a simple chapter with only two or three sections, Kindle inserted up to 30 extra pages at the end. There are (as I see it) 2 solutions. The first (and easiest) solution is not to use sections at all and instead to use the BRIDGEHEAD element to separate each “section.” You can use the renderas attribute to specify the font style of the heading title. Then you can set the parameter bridgehead.in.toc to a nonzero value, so that the TOC includes all bridgeheads. Unfortunately the bridgehead titles will show up, but they will show up in the same order as chapters. The second solution is to use a Docbook-to-Docbook XSLT transform especially for Kindle output. In this Kindle transform, you remove sections and then insert bridgeheads where the first SECTION element ought to go. That TOC item is surrounded by a SPAN class=bridgehead tag, and I can modify the indenting of it (I confirmed that this works on the Kindle 3). As long as you don’t touch the original Docbook files (with the sections untouched), they should still render an acceptable epub for non-Kindle devices. Writing this transform has now become the top priority for me. I will post the solution when I hit upon one! March 13 2011 Update: Perhaps the problem is not as bad as I thought. I had an xsl command that was out-of-place.
  15. (MIGHT BE NO LONGER VALID) Sometimes images will blow up to fill the entire screen; how utterly stupid!  (That must mean that css width doesn’t work worth a damn). I’ve read somewhere that unless you specify the image dimensions in the img tag itself, the default is to blow up to the whole page.  I have yet tested this yet, but regardless, this kind of thing should never happen in an ebook.  Dec 1 Update. Tallent explains the method behind the insanity. Kindle automatically blows up any image to full size which is less than half the size of the screen. Specifically, if it’s less than 261×319 px on Kindle 1 or 260×311 px on Kindle 2, Kindle will  blow up the image.    See also Anthony Levings’ thoughts about Kindle + images. Update 2:  In the ebook ninja podcast (Nov 2010),  Tallent gives updated information about Kindle image. Hairy! Apparently Kindlegen 1.1 blows up the ebook file size  significantly, leading people to choose an earlier version of kindlegen instead.
  16. Might be no longer valid. HTML files prepared for conversion to Kindle format should not use UTF-8 encoding. Instead you should specify ISO-8859-1 encoding in an HTML meta tag. For the docbook-generated epub to produce a well-encoded epub file, set this parameter: <xsl:param name="chunker.output.encoding">ISO-8859-1</xsl:param> . April 2011 Update: I think this is no longer true, but I haven’t confirmed it with kindlegen 1.2 (the command line tool).
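
The Docbook-to-Docbook transform described in item 14 is only outlined in words there. A minimal, untested sketch of what such a pass could look like follows. It is an illustration and not a confirmed fix for the Kindle problem. It assumes non-namespaced Docbook source (as with the vanilla stylesheets), handles only the section element (not sect1 through sect5), and assumes each section starts with a plain title. Any sectioninfo, subtitle or titleabbrev is passed through unchanged and would need its own handling. Attributes on the section other than id (including any condition attribute) are dropped.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <!-- Identity template: copy everything that is not handled below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Unwrap each section: emit a bridgehead built from its title,
       then the rest of its content (nested sections are unwrapped too). -->
  <xsl:template match="section">
    <xsl:variable name="depth" select="count(ancestor-or-self::section)"/>
    <bridgehead>
      <!-- Keep the section id so existing links still have a target. -->
      <xsl:copy-of select="@id"/>
      <!-- renderas only allows sect1..sect5, so cap the depth at 5. -->
      <xsl:attribute name="renderas">
        <xsl:choose>
          <xsl:when test="$depth &gt; 5">sect5</xsl:when>
          <xsl:otherwise>
            <xsl:value-of select="concat('sect', $depth)"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:attribute>
      <xsl:apply-templates select="title/node()"/>
    </bridgehead>
    <xsl:apply-templates select="node()[not(self::title)]"/>
  </xsl:template>

</xsl:stylesheet>
```

Run it as a separate pass and feed its output to the epub stylesheet, so that the original files (with their sections) stay untouched for non-Kindle output.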

Docbook to Epub production

  1. According to Keith Fahlgren, the epub docbook.xsl works better when you use the vanilla stylesheets and not the NS-version.
  2. Db2epub.py will create a directory named after the xml file you fed it via the command line. So if your xml file is called index.xml, the epub will be called index.epub, at the same level as your xml files. The output HTML subdirectory will be named index. I usually test first with the html files inside the index directory. Here is a gotcha about testing this way:
    • If you change the css file in the docbook project, you need to delete the css file that was copied into the index directory before you run the script again. For some reason, the script is not smart enough to notice that the css file has been updated.
  3. The latest Docbook stylesheets have some HTML errors with the sidebar element. (Update: this problem is related to the fact that <div class="titlepage"/> doesn’t render correctly in XHTML on browsers. I haven’t checked on devices yet.) You can get around it by attaching a class/role to the para element (which you can style in CSS), but that means you can only put one paragraph inside the sidebar (and have no way of putting several paragraphs). I’ve gotten around it with this code: <para role="sidebar"> This is the first line of the sidebar. <?linebreak?>This is the second line. </para>. Then in your customization layer, you add this:
    <xsl:template match="processing-instruction('linebreak')"> <br /> </xsl:template> 

    This works with db2epub as well as Saxon.

  4. For the Docbook 1.76.1 stylesheet, you can’t apply class.values to tables or informaltables. The workaround is to specify  your custom class as the value of the tabstyle attribute. Strange, but it works.
  5. You can strip the TOC which appears in the main text (leaving it in the TOC listing found in the interface of the ebook software). To do this, add this to the xsl: <xsl:param name="generate.toc"></xsl:param> <!-- no space at all --> or use the value “nop”. Leaving out the TOC in epub files is a best practice, although if you want you can still leave it there. (Big publishers leave it in simply because they don’t want to customize their epub files after generating Kindles.)
  6. Might no longer be valid. Because of a bug, the copyright and legalnotice elements don’t render adequately in epub. For the time being, it’s best just to make the legalnotice into a separate chapter in the appendix. April 2011 update: I tried this again, and the stylesheet now seems to put the copyright element on the title page, so maybe this bug no longer applies. (Legalnotice might still have problems; have to confirm.)
  7. I have noticed that when you use the epub/docbook.xsl stylesheet, by default epub output shows h1 for both chapters and sections (which is bad). Sections are supposed to output h2, subsections h3, etc. With epub output, this problem is not so bad because you can create a css rule for div.chapter and div.section, but still that is not right. I think there is a bug in epub/docbook.xsl which causes this. To fix this, I copied the entire <xsl:template name="section.heading"> template from epub/docbook.xsl into my XSL customization layer and edited the hlevel variable:
     <xsl:variable name="hlevel">
       <xsl:choose>
         <!-- highest valid HTML H level is H6, so anything nested
              deeper than 5 levels down just becomes H6.
              Note from Robert: I added + 1 on the xsl:otherwise statement.
              The test is 5 rather than 6 so that $level + 1 never exceeds 6. -->
         <xsl:when test="$level &gt; 5">6</xsl:when>
         <xsl:otherwise>
           <xsl:value-of select="$level + 1"/>
         </xsl:otherwise>
       </xsl:choose>
     </xsl:variable>
  8. Profiling (that is, using conditional text in your source so that you have 2 different kinds of output) is possible with db2epub.py . It is simple really, but you need to do “two pass profiling.” The first pass will simply apply a special profiling xsl sheet called profiling/profile.xsl and include the profiling condition.  I just created an intermediary customization layer called rjprofile.xsl which read like this:
     <?xml version="1.0" encoding="UTF-8"?>
     <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                     xmlns="http://www.w3.org/1999/xhtml"
                     xmlns:d="http://docbook.org/ns/docbook"
                     exclude-result-prefixes="d"
                     version="1.0">
       <xsl:import href="../../../1latest/docbook-xsl-1.76.1/profiling/profile.xsl"/>
       <xsl:param name="profile.condition">not-kindle</xsl:param>
     </xsl:stylesheet>

    The first step is to use attributes to designate which elements need to be included or excluded in Kindle epubs:

     <para condition="kindle">This will appear only in Kindle.</para>
     <para condition="not-kindle">This will appear only in not-kindles.</para>
     <para condition="web-version">This will appear in the web-version of the ebook.</para>

    In Oxygen I create an intermediary output called rj-profiling-result.xml. Then I use this as the input for the standard epub docbook stylesheet. From the command line you will do this:

     db2epub.py rj-profiling-result.xml --xsl epub.xsl

    (You cannot do single-pass profiling with the db2epub script, but two-pass works fine, with the caveat about ?dbhtml-include; see below.) Profiling can help if you are delivering two different versions of the same output (like epub and kindle-epub). Unfortunately this means that profiling becomes mandatory for all ebooks and that one extra step becomes necessary. Note that you can use multiple conditions in your xsl file. For example, this line

     <xsl:param name="profile.condition">kindle;web-version</xsl:param>

    will cause two of the three paragraphs in the example above (the kindle and web-version ones) to appear in the output. In general, using profiling is cumbersome, but it at least provides flexibility. It is really mandatory if you want to accommodate Kindle customizations, but not for adding html code. Remember that to exclude elements marked condition="kindle", really the only way is to modify rjprofile.xsl (see above) by changing profile.condition from kindle to not-kindle (or whatever you designate). Conversely, if you want to designate content to be excluded from the kindle-customized epub, you should apply the attribute condition="not-kindle" to it.

  9. Docbook already has a parameter for swapping images depending on the platform. It is called use.role.for.mediaobject. This can be done without needing to enable profiling. So far, I have not had a need for this although I imagine if you want to use a reduced image for kindle, you could do it this way:
     <mediaobject>
       <imageobject role="html">
         <imagedata fileref="web-images/unchecking72.jpg"/>
       </imageobject>
       <imageobject role="kindle">
         <imagedata fileref="light-images/unchecking50.jpg"/>
       </imageobject>
       <imageobject role="fo">
         <imagedata fileref="fo-images/unchecking312.jpg" contentwidth="10cm"/>
       </imageobject>
       <textobject />
       <caption>
         <para>This is the caption.</para>
       </caption>
     </mediaobject>
  10. One quirk of Docbook is that it adds a P tag into every LI element.  Mobipocket especially does not like that. Here’s a solution proposed on the docbook-apps list to strip the P tags and just leave the LI tags.
     <!-- A para that is the only child of a listitem: output its content without the P wrapper -->
     <xsl:template match="orderedlist/listitem/para[count(preceding-sibling::*) = 0 and count(following-sibling::*) = 0]">
       <xsl:apply-templates/>
     </xsl:template>

     <xsl:template match="itemizedlist/listitem/para[count(preceding-sibling::*) = 0 and count(following-sibling::*) = 0]">
       <xsl:apply-templates/>
     </xsl:template>
  11. HTML indices for Docbook don’t use page numbers, but indicate the section title. That can be cumbersome. The solution is (for now) to give a titleabbrev element for every title and then prevent the TOC builder from using the titleabbrev in the TOC instead of the original title. Until there is a parameter to disable this, you can customize the template named ‘toc.line’ in html/autotoc.xsl to change
    <xsl:apply-templates mode="titleabbrev.markup" select="." /> 

    to

     <xsl:apply-templates select="." mode="title.markup"/> 

    Now that I look at this more closely, I  see that the epub stylesheet uses xhtml-1.1, so the proper file to edit would be xhtml-1_1/autotoc.xsl.

  12. (Sept 15 2011 Update. Bob Stayton fixed this bug and the fix is now in the snapshot. Read the glorious details!) HTML indices don’t work well when you use a SECONDARY element under an INDEXTERM element (i.e., most of the time). index.links.to.section determines whether the index link will go to the exact place in the section or simply to the top of the section. The default value for this parameter is 1 (which means the link will go to the section and not the exact location). Changing it to 0 will make the link go to the specific location in the text; unfortunately, it will also create duplicate links in the index: one to the value of the PRIMARY element, and the other to the value of the SECONDARY element. I ended up leaving this parameter at the default (i.e., 1) and indicating in the heading of the index that links go to the section, not the exact place: “Topics, listed by Chapter & Section”. Another option is to not use the SECONDARY element at all and change the parameter value to 0. It’s a tradeoff. (More about this bug). 4/1/2011 Update. When you convert from epub to Kindle, kindlegen massacres the index. It cannot process the secondary element, only the primary element; it can only deal with one level of index terms. I think it relates to the way Kindle handles definition lists.
  13. (XHTML Output only) This is not directly related to docbook epub, but it’s semi-related if you need to use Docbook to produce HTML websites. Basically, when you specify XHTML output, the page is served with the application/xhtml+xml file type, and that messes up the major browsers in several different ways. I noticed it when I realized that using empty <a id> tags caused the code not to validate. The solution suggested by the mailing list was relatively simple:
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                    xmlns:saxon="http://icl.com/saxon"
                    exclude-result-prefixes="saxon"
                    version="1.0">
      <xsl:import href=".../path/to/xhtml/chunk.xsl"/>
      <xsl:param name="chunker.output.method">saxon:xhtml</xsl:param>
      <xsl:param name="chunker.output.indent">yes</xsl:param>
      <xsl:param name="chunker.output.omit-xml-declaration">yes</xsl:param>
    </xsl:stylesheet>

    Even if you don’t use the Saxon parser, the other two parameters did the trick for me. By the way, I believe that db2epub did not have this problem because its default output type was html.

  14. Docbook has a method for choosing which elements to include in your title page (and by implication, in the top part of some major components).  If you use this method for HTML output, you need to manually add that the default namespace will be HTML/XHTML. (More details here). This is a bug.
  15. The xsl parser for db2epub can process conditional text (i.e., profiling), but it can only do two-pass profiling. (This is not a problem with Saxon.) That means having to run an intermediary step. Unfortunately, after the first of the two passes the relative xml base is lost, so ?dbhtml-include will not work unless you specify an absolute path. For example, suppose you want to insert some special code especially for the Kindle which you cannot produce with docbook. You would do it this way:
     <section condition="kindle"> <?dbhtml-include href="file:/I:/My%20Documents/xml/kindle/crappy-kindle-code1.html"?> </section> 

    That impacts the portability of your docbook project. On the other hand, because you are already using that intermediary step to produce profiling output, you could do a simple global search-and-replace on rj-profiling-result.xml to update the file path. Ideally, because of the difficulty of maintaining this file path, you should look for other ways to achieve the same result within docbook. On the other hand, Oxygen editor will soon have a docbook-to-epub script built into it, so hopefully that will use Saxon (and make single-pass profiling possible).

  16. Endnotes. Docbook uses the footnote tag to handle endnotes in html output. Endnotes work perfectly in epub output, with the caveat that they appear at the end of the chapter (or whatever you have chunked each file to be). I added the word “Notes” after the HR line.
     <!-- From RJ: I added one line of HTML code to this template in chunk-common.xsl
          so that Notes appears as h3 under the HR. Other than that, everything else is the same. -->
     <xsl:template name="process.footnotes">
       <xsl:variable name="footnotes" select=".//footnote"/>
       <xsl:variable name="fcount">
         <xsl:call-template name="count.footnotes.in.this.chunk">
           <xsl:with-param name="node" select="."/>
           <xsl:with-param name="footnotes" select="$footnotes"/>
         </xsl:call-template>
       </xsl:variable>

       <!--
       <xsl:message>
         <xsl:value-of select="name(.)"/>
         <xsl:text> fcount: </xsl:text>
         <xsl:value-of select="$fcount"/>
       </xsl:message>
       -->

       <!-- Only bother to do this if there's at least one non-table footnote -->
       <xsl:if test="$fcount &gt; 0">
         <div class="footnotes">
           <br/>
           <hr/>
           <h3>Notes</h3>
           <xsl:call-template name="process.footnotes.in.this.chunk">
             <xsl:with-param name="node" select="."/>
             <xsl:with-param name="footnotes" select="$footnotes"/>
           </xsl:call-template>
         </div>
       </xsl:if>

       <!-- FIXME: When chunking, only the annotations actually used in this chunk
            should be referenced. I don't think it does any harm to reference them
            all, but it adds unnecessary bloat to each chunk. -->
       <xsl:if test="$annotation.support != 0 and //annotation">
         <div class="annotation-list">
           <div class="annotation-nocss">
             <p>The following annotations are from this essay. You are seeing them here because your browser doesn&#8217;t support the user-interface techniques used to make them appear as &#8216;popups&#8217; on modern browsers.</p>
           </div>
           <xsl:apply-templates select="//annotation" mode="annotation-popup"/>
         </div>
       </xsl:if>
     </xsl:template>

    The CSS selector for styling the word Notes is: div.footnotes h3
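
Several of the tips above come down to setting one parameter, so they can live together in a single customization layer. The sketch below collects the parameters from items 5, 9 and 12. The import path is an assumption (modelled on the profiling example in item 8) and must point at your own copy of the stylesheets. The preferred.mediaobject.role parameter is not mentioned above: it is the standard Docbook XSL parameter that tells use.role.for.mediaobject which role to pick. I have not verified that the epub manifest ends up listing the same image that the XHTML output selects, so treat that part as untested.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <!-- Assumed location of the stock epub stylesheet; adjust to your install. -->
  <xsl:import href="../../../1latest/docbook-xsl-1.76.1/epub/docbook.xsl"/>

  <!-- Item 5: leave the TOC out of the main text (empty value, no space at all). -->
  <xsl:param name="generate.toc"></xsl:param>

  <!-- Item 9: choose among sibling imageobjects by their role attribute.
       With this value, <imageobject role="kindle"> wins when one is present. -->
  <xsl:param name="use.role.for.mediaobject" select="1"/>
  <xsl:param name="preferred.mediaobject.role">kindle</xsl:param>

  <!-- Item 12: index entries link to the top of the section (the default).
       Setting this to 0 links to the exact location instead. -->
  <xsl:param name="index.links.to.section" select="1"/>

</xsl:stylesheet>
```

Keeping a separate layer per target (one for Kindle, one for everything else) means the only difference between builds is which xsl file you pass with --xsl.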

EPUB/CSS Stuff

(Generally these notes will be about CSS as it relates to docbook-epub or docbook-epub-Kindle, although the code here should be generally applicable. Soon I will just include a link to a good CSS file.)

  1. Tagging paragraphs. At the start of every chapter in Docbook, you should tag your first paragraph with <para role="first-para">. When the epub/xhtml script runs, that will make the first paragraph <p class="first-para">, which you can then style with css. That allows you to style the first letter of the first line and also to style the first line itself.
  2. I have had problems getting good dropcap code even if I use the code from Liz Castro’s epub book.  I will post code which works later; suffice to say that it depends on a lot of  factors like font and screen real estate. It’s not that hard to get it to work with one font on one platform, but hard to get it to work for multiple platforms.  Here’s an explanation of dropcap CSS for web browsers.
  3. Turning off Hyphenation. I generally like hyphenation, and don’t think it’s good to turn it off when the ebook devices keep it turned on by default. But it’s a good idea to turn it off for certain elements like titles. Here’s a CSS rule to turn off hyphenation for h1, h2, etc.:

    h2.title, h1.title, h3.title {
      -webkit-hyphens: none !important;
      adobe-hyphenate: none;
      -moz-hyphens: none;
    }

    Elizabeth Castro provides background to this workaround.

  4. No indent after blockquotes. You can set indents globally in your css or the device can take care of it. But after a blockquote, there are two cases: a) if the paragraph is the continuation of the previous one, then you do not want an indent; b) if the paragraph is in fact a new paragraph, then you still want an indent. You need to hardcode whether an indent is necessary in this next paragraph. The easiest way to do this is to use <para role="no-indent"> and create a css rule for p.no-indent. In Docbook, a blockquote can still be a child of a paragraph, which means that you are allowed to continue the paragraph after the blockquote. Unfortunately the HTML output does not reflect this but inserts an extra div tag:
      <p>Now you will see a very interesting quote:</p>
      <div class="blockquote">
        <blockquote class="blockquote">
          <p>Here is the extra long quote consisting of several lines. Because this is only an example, this quote is short.</p>
        </blockquote>
      </div>
      <p class="no-indent">This paragraph is a continuation of the previous paragraph, so an indent should not be here. You can see that by the no-indent class on this P element.</p>

    I will be checking next about how mobipocket handles all this and what changes (if any) are necessary.
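
The first-para tagging in item 1 is done by hand above. If that gets tedious, the same role could be added in a preprocessing pass. The following is a minimal, untested sketch and not something I have confirmed on a real book. It assumes non-namespaced Docbook and assumes that the first paragraph is a direct child of the chapter; a chapter that opens with a section would need a different match pattern.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <!-- Identity template: copy everything through unchanged by default. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- The first para child of a chapter, if it has no role yet,
       gets role="first-para"; an existing role is left alone. -->
  <xsl:template match="chapter/para[1][not(@role)]">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:attribute name="role">first-para</xsl:attribute>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The no-indent role from item 4 cannot be automated the same way, because only the author knows whether the paragraph after a blockquote is a continuation.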

Book Marketing/Royalties/Promotion

  1. In November 2010 here are the royalty percentages for the online distributors: Amazon/DTP 70%, Barnes & Noble PubIt 65%, Apple ibookstore 70%,  Lulu 80%, Smashwords 85% minus Paypal transaction fees. Smashwords requires MS Word files to convert to other formats. It’s well worth losing the 5% to be able to upload an epub directly to Lulu rather than having Smashwords create one for you. At the same time, the kindlegen script has some quirks of its own.  If you can’t pass muster with PubIt or ibookstore, Smashwords can get you into B&N or ibookstore with substantially lower royalties. Not worth it, I say.

Docbook Future: Docbook Publishers’ Schema and Docbook 5.1

One exciting development is that there is now an official variation of Docbook called Docbook Publishers’ Schema which excludes many technical elements not used too often  and includes some elements which would be commonly used in mainstream publishing. That includes: speaker, line, poetry, dialogue, drama, linegroup. The new schema includes better Dublin core support.

Also, Docbook 5.1 is in draft. A cursory inspection of the Docbook 5.1 book shows that there is now support for topics through a feature called Docbook Assemblies.


Comments

10 responses to “Ebook/Epub/Docbook Braindump”

  1. jseliger

    Great post. A question you might not want to answer: Are you working on these files primarily for yourself or for others?

    A couple others that might be easier:

    How much time do you estimate you’ve spent learning how to convert files into reasonably attractive eBooks?

    If you’re doing this for others, are you eventually going to offer eBook technical consulting and the like?

  2. Robert Nagle

    Hi, Jake.

    I wanted to learn how to do everything through Docbook. There is a big learning curve at first, but the fact that it creates an epub file automatically while still letting you edit the source is a major reason for me. It is easier to maintain in the long run. Docbook is really powerful, but hard to grasp and configure. It can handle a lot of the advanced configurations which still aren’t there in current ebook readers.

    It took about a year to figure out how to get the hang of it. But now I’m at the point where it’s easy to create epub files. I have 90-95% of the css done too. By the way, in a month or two I plan to release a long tutorial about my process for creating epub files with docbook. Now that I’ve gone through the whole process, it’s really quite easy.

    This is mainly for myself, but I’ll be helping Jack Matthews to digitalize his books. I’ll probably be able to sell this service fairly cheaply as well.

    I spent an afternoon or two trying to figure out how to convert epub files to mobipocket. There’s a script to do that, but the result was awful. I basically decided that mobipocket required way too much effort to produce. It would not be my primary platform.

    I plan to add to this tip list over time.

  3. Diana Shannon

    you wrote: “As far as I know, there is no way to add an epub file in Nook for ipad or a mobipocket file in Kindle for ipad.”

    I load .mobi files in the Kindle app for the iPad via Phone Disk on my Mac. I change the connection root to Kindle and load files I’m testing into Documents > Ebooks. I haven’t figured out how to do this for iPad’s Nook app yet.

  4. Robert Nagle

    Diana, this is great information. I know about the Phone Disk application trick (there’s a Windows version, by the way). But I did not know you could do it with Kindle as well. Thanks!

  5. Marco Cevoli

    Hi,

    thanks a lot for your insights. I’m looking forward to seeing your docbook to epub tutorial! I’m an Italian translator and will be happy to translate it to Italian and publish it on our website, if you grant us permission.
    Kind regards

  6. Sharon Gallagher

    An excellent write up. Thank you. We are working on complex layouts with images using Sigil.

  7. Lars Vogel

    Thanks Robert this info is really helpful. Great that you keep this list.

  8. Russ White

    Robert,

    This information has been very valuable to me. Thank you so much for taking the time to capture it.

    I wonder if you know of anyone doing reverse conversion. What I mean is epub to docbook. It seems like it should be possible, but I can’t seem to find any automated way of doing it.

    Thanks!
    Russ

  9. Bill Hutchison

    Hi – thanks for this very interesting and substantial post. I feel I am asking rather an ignorant question, but here goes. Why would a business commit to using DocBook 5 instead of ePub? What are the arguments against originating all content products in XHTML 5/ePub as some people propose? Your insights would be much appreciated.

  10. Robert Nagle

    (By the way, although the information on that URL is still "right", some of it is out of date. One could legitimately argue that it’s no longer necessary to make ebook files for the older Kindle format and just stick to KF8. Also, I think producing epub3 files is a better long term strategy than producing epub).

    Advantages of Docbook

    • you can make global changes more easily (not just css, but moving/modifying certain blocks of code).
    • Docbook will auto-generate the TOC, indices, some tables as well as the epub-specific files (ncx, etc). Indices are really easy.
    • with docbook you have the ability through XSL to globally move/shift certain content into other places.
    • because you are single sourcing, you can run a Docbook XSL transformation to produce PDFs, websites, slides. You can even make separate transformations for different kinds of epub destinations. (this is becoming less of an issue because the differences between ebook platforms are shrinking).
    • there are lots of helpful parameters which you can add to your customizations to perform common tasks.
    • there is already a stylesheet to produce epub3 files (http://50.56.245.89/xsl-ns/epub3/README).

    (See the mailing list archives at https://lists.oasis-open.org/archives/docbook-apps/ and the snapshot builds.)

    BUT

    • learning curve is high
    • it is complicated to set up your environment
    • you need to be highly skilled in XSL to make customizations
    • debugging can be hard
    • sometimes very simple coding tasks can be hard (because in your customization you have to write an XSL transformation to select and transform the way you want to)
    • although Docbook claims it supports multimedia like audio and video, I tend to doubt it does so easily.
    • there aren’t a lot of eyeballs to spot bugs.

    Advantages of hard-coding HTML

    • easy to set up. can use any tool.
    • it’s easier to code audio and video into a book.

    BUT

    • need to become extremely familiar with epub standard so you can hand-code those epub-specific files and validate them.
    • have to use something like regular expressions to make global changes to the code.
    • You are not really "single sourcing." More than likely you have different sets of html for each different ebook file. That’s a mess.
    • Don’t have an easy way to roll it into a PDF.
    • Very hard to produce an index.

    There are pain points in both. With HTML, it’s using regular expressions (or some global search and replace mechanism) to make global changes. With Docbook, it’s writing the custom XSL to make customizations when the vanilla style sheets don’t work.  If you don’t know XSL somewhat well (enough to be able to make a customization file and add parameter values), you may need a while to get off and running.
