A new approach for splitting HTML outputs generated from DocBook XML source into individual files

By Jan Tošovský on Aug 24, 2017

Natural content splitting

When large documents are maintained, it is handy to split them into smaller parts. Such content is not only easier to navigate, it also brings other benefits, especially if the source files are versioned using systems like SVN or Git. If documents are edited by multiple users, there is a lower risk of conflicts when integrating changes back into the main branch. It is also easier to track particular changes in the version history.

Unfortunately, there is currently no way to preserve this natural splitting when generating a set of HTML pages.

Current chunking methods

When generating HTML outputs, the original content can be split (chunked) into individual output files using two basic methods:

  1. automatic chunking - in simple terms, all chapters and sections up to a specified depth produce separate HTML files

  2. manually controlled chunking - separate HTML files are produced based on a configuration file that describes the final structure, using IDs that match the IDs in the source document

While the latter method gives complete control over the splitting process, it requires hand editing and is therefore only practical for documents with a stable structure. This is the main reason why the first method is still preferred, even though its result is suboptimal. By default, the first section is kept together with its parent. This may be confusing, as first sections do not produce separate files while other sections do. However, if the first section were split off into its own file, the parent chunk could end up containing just the title.
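To make the first-section behaviour concrete, here is a minimal Python sketch of depth-based chunking. It is a toy model, not the DocBook XSL implementation: the function `chunk_ids`, its arguments `section_depth` and `first_sections` (loosely modelled on the stylesheet parameters `chunk.section.depth` and `chunk.first.sections`), and the sample document are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: one chapter with two top-level sections.
SAMPLE = """<book>
  <chapter id="ch1">
    <title>Intro</title>
    <section id="s1"><title>First</title></section>
    <section id="s2"><title>Second</title></section>
  </chapter>
</book>"""


def chunk_ids(root, section_depth=1, first_sections=False):
    """Return the ids of elements that start a new output file in this toy model."""
    chunks = []

    def walk(parent, depth):
        sections = [child for child in parent if child.tag == "section"]
        for index, section in enumerate(sections):
            is_first = index == 0
            # A section becomes a chunk if it is shallow enough and is either
            # not the first sibling or first sections are chunked explicitly.
            if depth <= section_depth and (first_sections or not is_first):
                chunks.append(section.get("id"))
            walk(section, depth + 1)

    for chapter in root.iter("chapter"):
        chunks.append(chapter.get("id"))  # every chapter is a chunk
        walk(chapter, 1)
    return chunks


root = ET.fromstring(SAMPLE)
print(chunk_ids(root))                       # ['ch1', 's2']
print(chunk_ids(root, first_sections=True))  # ['ch1', 's1', 's2']
```

In the first call, section `s1` stays inside the chapter's file while `s2` gets its own, which is the asymmetry described above. In the second call, every section is split off and the chapter chunk is left with only its title.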

XInclude-based chunking

Once a document is split naturally, why not reuse this division for chunking as well? The good news is that this is no longer a problem, provided you follow several simple rules and employ one handy tool.
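The rules and the tool themselves are described on the project pages and are not reproduced here. As a rough illustration of the underlying idea only, the following Python sketch lists the `xi:include` references in a master document and derives one output file name per included file. The function `planned_chunks`, the sample master document, and the naming scheme (source file stem plus `.html`) are assumptions, not the behaviour of the actual tool.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# Qualified name of the XInclude element in ElementTree notation.
XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"

# Hypothetical master document that pulls in two chapter files.
MASTER = """<book xmlns="http://docbook.org/ns/docbook"
      xmlns:xi="http://www.w3.org/2001/XInclude">
  <title>Manual</title>
  <xi:include href="chapters/intro.xml"/>
  <xi:include href="chapters/install.xml"/>
</book>"""


def planned_chunks(master_xml):
    """Map each included source file to an assumed HTML output name."""
    root = ET.fromstring(master_xml)
    plan = []
    for include in root.iter(XI_INCLUDE):
        href = include.get("href")
        if href is None:
            continue  # includes without href (xpointer only) are skipped here
        plan.append((href, PurePosixPath(href).stem + ".html"))
    return plan


for source, output in planned_chunks(MASTER):
    print(source, "->", output)
```

The point of the sketch is that the chunk boundaries come from the source layout itself, so no depth parameter or hand-maintained configuration file is needed.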

For further details, please see the project pages, which offer a more detailed description and sample data.