Project Home

Nabu: Design Document

Author: Martin Blais <blais@furius.ca>
Date: 2005-06-22

Abstract

Design documentation for the Nabu publishing system.

Contents

Components

The system consists in a few components, described in the next sections.

nabu1.png

Nabu system components.

Finder

The finder is the publisher client. Its responsibilities are:

  • find the files that have a id marker in them;
  • query the server to find CRCs (MD5 sums) for these files;
  • compare the CRCs with the files on disk;
  • send the files that need to be updated to the server.

You can find it in module publish.

Server (Handler)

The server receives files or document tree sent by the finder and processes them, stores them in a "sources upload database", and then runs the extractors on them which fill other tables in a database with various kinds of information.

The server has to be configured with a specific set of extractors, which the person who sets up the server decides, depending on what information they want extracted from the document trees.

Note that this server can be technically be instantiated locally (i.e. imported as a library from the finder), this could be used to fill a local database. Normally, we expect it to be set up over a network connection, using XML-RPC, with a simple CGI hook on a web server, or incorporated within some web application framework.

You can find it in module server.

Uploaded Sources Storage

This interface is used to implement storage of the uploaded source documents, information about them (time/date, user, original filename, etc.) and the document tree (stored as a pickled stream usually in a database blob). You can create concrete implementations of the Source Storage to store this data in your favourite format and place. There is a default implementation that uses a SQL DBAPI-2.0 connection to store in an SQL database.

You can find it in module sources.

Extractors and Extractor Storage

Various extractors can be implemented to fetch information from the document tree, and/or to modify the document tree before it is stored for later retrieval.

Each of these extractors is configured by the server setup with specific extractor storage objects associated with each of the extractors, which are used to abstract the storage mechanism.

Thus, code for an extractor generally includes: a docutils transform -derived class (the extractor itself), and a class derived from ExtractorStorage for this extractor.

You can find the base classes in module extract. The extractors are available in the nabu.extractors package.

Other Modules

Processing the Extractors

The process module contains some code to apply the transforms on an existing document tree.

Debugging the Contents

The contents module contains code to display a basic dump of the various contents of the uploaded sources storage via the web.

Association Between Extracted Stuff and Source Documents

A key aspect of the system design is that source documents that get uploaded to the server may be uploaded again later for reprocessing. This implies that we need to replace all the data that had previously been extracted from a document when it is sent a 2nd time for processing. The way we achieve this, is by making sure that the extractor storage object stores this association for each extracted piece of data by storing the unique id of the document along with it, and the extractor storage has a protocol to clear all data associated with a specific document.