This document describes issues in setting up and configuring a Nabu server handler.
The current server requires the clients to authenticate themselves. That is, when calling methods on the server handler, a username must be supplied. This can be most easily implemented using the HTTP Authentication mechanism well supported by web servers.
The server handler code is generic, but the client expects to be talking to an XML-RPC server. There are various way to implement this, depending on your web application framework. The simplest way is probably to create a CGI script to call the server handler. There is a convenience method in the server code that does this. See code below for the CGI script.
When creating the server handler, you must decide which extractors you will want to run after processing. The server expects a list of pairs, each of which consisting in the extractor class to instantiate, and an instance of a storage object used by that extractor to store the information it collects from the document.
Here is an example CGI script that
# stdlib imports import sys, os from os.path import dirname, join # add the nabu libraries to load path of our CGI script root = dirname(dirname(sys.argv[0])) sys.path.append(join(root, 'lib', 'python')) # dbapi import. import psycopg2 # nabu imports from nabu import server, sources from nabu.extractors import document, link, contact def main(): # connect to the PostgreSQL database params = { 'database': 'nabu', 'user': 'admin', 'passwd': 'admpassw', 'host': 'localhost', } module, conn = psycopg2, psycopg2.connect(**params) # make sure we're authenticated username = os.environ.get('REMOTE_USER', None) if username is None: print 'Content-type:', 'text/plain' print 'Status: 404 Document Not Found.' print return # create a storage space for the uploaded source data src_pp = sources.DBSourceStorage(module, conn) src = sources.PerUserSourceStorageProxy(src_pp) # configure this server with whatever transforms we want to apply transforms = ( (document.Extractor, document.Storage(module, conn)), (link.Extractor, link.Storage(module, conn)), (contact.Extractor, contact.Storage(module, conn)), ) server.xmlrpc_handler(src, transforms, username, allow_reset=1) if __name__ == '__main__': main()
The simple DBSourceStorage uploads database shares the unique ids between all users. This means that users can see and erase each other's content ids in the database. Clearing the database clears all documents, including other users'. This is a reasonable arrangment if the data is pushed from a common body of text files, perhaps controlled with a source-code management system.
There is also a proxy sources storage class provided, which prepends the usernames to the unique ids stored in the database. This effectively creates a separate document set for each user. This is an interesting setup if users are expected to contribute disjoint document sets to the pool of information in the Nabu database. However, this has important consequences if the users are pushing documents from a shared body of source texts: those documents will be stored twice in the database. You need to choose the policy which matches your intended use.
To view or debug the contents sent to the database, you can view the contents via a CGI script that is provided with the source distribution. This is only meant as a debugging interface, you should create a custom presentation layer any other way you like.