Nabu: A Text-Based Content Creation System

Author: Martin Blais <blais@furius.ca>
Date: 2005-04-25

Abstract

We describe a system that aims to let users add relevance to their content, and to make it possible to serve that content intelligently by providing semantically rich access to the data.

Introduction

We are considering a system that eases the management and publishing of various kinds of information personal to a single producer, and that is "as simple as possible".

By personal information, we mean:

Motivation / Angles

We present differing angles on various problems of information management.

Simplicity

One central value behind our views, and one that you must keep in mind when considering the proposition that we're about to make, is that of the importance of simplicity. There is great value in keeping things as simple as they need to be, because it allows the most flexible reuse of the information.

Many of the rewards of keeping design and data simple can be observed in the power of the UNIX tools and operating system, which are built upon simple but very powerful ideas: sockets and files that consist of generic streams of bytes, small tools that each perform one task really well, and a simple and generic way to connect those tools together (see "The Art of UNIX Programming", E. S. Raymond). This has made it possible to create complex tools without reinventing the small ones, and instead to improve the small tools in a generic way, which in turn allows them to be connected in yet more ways. Keeping things generic and as simple as possible is a potent idea.

This idea of designing systems to be as simple as they need to be is also prevalent in the practice of software development. Over the past ten years, we have seen development methodologies converge towards this idea. Extreme programming, agile methodologies, and the growing adoption of dynamic languages are direct expressions of the quest to get closer and closer to the essence of the problems we're trying to solve, while getting rid of unneeded complications. In many ways, software development is in the business of creating complexity. We are essentially recognizing that keeping our designs and data models as simple as possible is the most efficient way of controlling the growth of this complexity.

History

The history behind the creation of this project stems from its author's long-standing need to maintain personal information in a way that is most useful and that can be kept independent from specific software over long periods of time. The sections below outline some of the problems I have tackled in the past, and the partial solutions I arrived at before creating the Nabu extraction system. Nabu is meant to replace all these tricks and to allow me to extract, organize and selectively publish some of this information.

Maintaining an Address Book

I needed to maintain an address book. At the time (circa 1993), no decent software output a textual format that could be read to convert the data into other formats. Thus around 1997, I decided to transcribe all my physical address books into a text file and, following the advice of my supervisor at university at the time, used a paragraph-grep program to query it. This worked great for many years, except that there was no integration with my email programs. I could, however, grep and sed the address book file to generate a text file that could in turn be imported by various email systems. Over time, the one address book file grew into many, and new contact information moved gradually into the documents which provided context for it.

I think at some point I started using the LDAP LDIF format to store the files, but its naming was too long and annoying for adding entries with a text editor, so I just created my own simple format, which is a list of entries like this:

n: New Navarino Bakery & Pastry Shop
p: 514-279-7725
a: 5563, avenue du parc, Montréal, QC H2V 4H2
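
A minimal sketch of how entries in this format could be read, assuming that entries are separated by blank lines and that each line has the shape "key: value". The function name `parse_entries` is hypothetical and is not part of the original tools; the one-letter keys are simply kept as they appear in the file.

```python
def parse_entries(text):
    """Parse blank-line-separated entries of 'key: value' lines into dicts."""
    entries = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current entry, if there is one.
            if current:
                entries.append(current)
                current = {}
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # Skip lines that do not follow the 'key: value' shape.
        current[key.strip()] = value.strip()
    if current:
        entries.append(current)
    return entries


SAMPLE = """\
n: New Navarino Bakery & Pastry Shop
p: 514-279-7725
a: 5563, avenue du parc, Montréal, QC H2V 4H2
"""

for entry in parse_entries(SAMPLE):
    print(entry["n"], "/", entry["p"])
```

Splitting only on the first colon keeps values that themselves contain colons intact.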

Cross-Browser Bookmarks

Another issue is that of maintaining a set of bookmarks. One of the problems is that every few years a new browser comes out, and I end up moving to it. For example, I started using the web with Xmosaic, and eventually moved to Netscape. On Windows I had to use IE, then I switched to Konqueror on a Linux machine, then to Mozilla, which was very heavy, and finally to Firefox. Most of these browsers have slightly different bookmark storage formats which are not conveniently edited within emacs.

A more important problem is that of the organization of bookmarks. Adding all bookmarks to a linear list makes it nearly impossible to reuse them efficiently (it is very hard to find the bookmark that you're looking for). Tree structures help alleviate this problem to some extent, but add another problem: when you want to quickly add a bookmark (somehow, it always has to be quick), you have to choose a single most appropriate place to put it, and if you're not very careful with this you often have a hard time finding your bookmark again.

I found this problem really annoying, so I designed a very simple textual format for bookmarks, where I would enter a description, a URL, and a list of keywords. I wrote Tengis, a program that can read this format and quickly query the bookmarks by keyword. Unfortunately, I never quite got used to using my own software on top of the browser, and always ended up grepping the file within emacs.

Here is an example excerpt of a bookmarks file:

Babelfish
http://babelfish.altavista.com
search, languages, translation

Amazon
http://www.amazon.com
search, books, music

Abebooks
http://www.abebooks.com/
search, books
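
The kind of keyword query described above can be sketched as follows. This is not the Tengis code; it is a minimal illustration that assumes each record is exactly three lines (description, URL, comma-separated keywords) and that records are separated by blank lines. The names `Bookmark`, `parse_bookmarks` and `query` are hypothetical.

```python
import re
from typing import NamedTuple


class Bookmark(NamedTuple):
    description: str
    url: str
    keywords: frozenset


def parse_bookmarks(text):
    """Parse blank-line-separated records of description, URL, keywords."""
    bookmarks = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if len(lines) != 3:
            continue  # Ignore records that do not have exactly three lines.
        description, url, keyword_line = lines
        keywords = frozenset(
            k.strip() for k in keyword_line.split(",") if k.strip()
        )
        bookmarks.append(Bookmark(description, url, keywords))
    return bookmarks


def query(bookmarks, *keywords):
    """Return the bookmarks that carry every one of the given keywords."""
    wanted = set(keywords)
    return [b for b in bookmarks if wanted <= b.keywords]


SAMPLE = """\
Babelfish
http://babelfish.altavista.com
search, languages, translation

Amazon
http://www.amazon.com
search, books, music

Abebooks
http://www.abebooks.com/
search, books
"""

for bookmark in query(parse_bookmarks(SAMPLE), "search", "books"):
    print(bookmark.description, bookmark.url)
```

Because a bookmark carries a set of keywords rather than a single position in a tree, it can be found through any combination of them, which avoids having to choose one "most appropriate place" when adding it.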

Another problem is that various links end up being stored in documents, text files which I write when I accomplish some specific task. These do not make it to the global bookmarks file.

For convenience, I wrote a script that could convert this file into a tree structure and automatically generate bookmarks files for whatever browser I was using at the time.

Project Ideas, Mind-Mapping

Whenever I have an idea for a project, something that I find interesting enough, I document it. I would like to share these documents, but they change quite a bit over time, and they don't necessarily belong together in the presentation layer.

Task Notes

There is much information to be acquired when using computers. A good habit that I have acquired is to start a text file to jot down notes whenever I take on a task that is going to take a few hours. This helps keep my focus organized, and serves as a reference if I have to repeat that task in the future. It is also very useful to just send those instructions when someone asks me how I accomplished this task in the past. I also avoid wasting time when I need to make a new iteration of the same task: I can review my thoughts at the time, the decisions I made, etc.

Paper and Book Reviews

When you are surveying a lot of scientific papers, it is good to take notes on ideas and to summarize the crux of each paper that you read. This helps organize your thinking by forcing you to write and express your thoughts. I always wrote short 5 or 6 paragraph reviews of the papers that I read. These live in separate files and can sometimes be reused by friends when they ask me about specific subjects, when I point them to some paper or other.

Also, I like to take down quotes from the books that I read. Whenever I read a book, I mark interesting passages, and when I'm done with the reading, I take 30 minutes to copy these passages into text files. I sometimes like to draw from this body of quotations to add to my signature in email (although I must admit that I stopped using signatures altogether many years ago). In any case, I sometimes enjoy going back to those review files when I have an idea that relates to a book that I have read.

Key Themes

A key theme behind the problems described above is that the software that you use to manipulate your personal information or notes files is going to change. Therefore it is a bad idea to use closed formats like those produced by MS Word, or similar software, if you want to be able to maintain and use these documents for a long time.

I very much trust simple text files. They will always be readable, and interpretable, and they use little storage. In this context, docutils is an amazing tool because it allows you to extract meaningful structure from them, as long as you follow minimal conventions. One of the principal motivators behind this system is to provide the ability to maintain all sorts of personal information using simple text files. This is a key aspect.

Goal

Simply stated, our goal is the following:

To make it possible for users to create relevant content and allow building ways to serve it intelligently by providing semantically rich access to their data.

We want to make it possible to build services on top of the user's valuable resource: information. In order to do this, we have to make it possible for any user to build this meaningful source of their information, to add relevance to it. We want to:

  1. make it easy to enter the information in a way that allows an automated system to extract the meaningful chunks of data and associate them with pre-defined (and extensible) semantics.

    This may involve some form of simple markup (e.g. "create new document", "insert contact info", "insert bookmark"). Easy means simple. The interface and data format have to be simple, if not trivial;

  2. provide a service that will store this extracted information in a way that is accessible by various publishing services;

  3. create services that will offer creative views on this data.

    You can think of a blog interface, image galleries, a birthday notifier system, a system to sync your data store with your PDA, to serve your personal bookmarks as RSS feeds, to publish your travel log, to show your calendar of events, etc.

    These views would create value by providing convenient access and novelty on top of the user's data source. Each of these views would use as its basis the parsed data source, stored and accessed in an efficient manner (i.e. in a database).

Our aim is clearly NOT to:

We believe that relevance in information is the result of a certain amount of conscious effort on the part of the user, and that search technologies are inherently limited in the quality of the information they can provide, in terms of filtering and organizing the data that flows through a user's system. This is a key aspect of this document and of the scope of what we're trying to achieve. Search can help in organizing, but cannot organize for you. Better search can alleviate some of the need for organization, but we recognize that ultimately, to create high-quality content, a conscious effort has to be made.

Requirements

The problem is threefold:

  1. input and organization of the information: the process of creating, editing, entering and storing the information in the system;
  2. extracting semantic chunks from it: parsing the input data and extracting meaning from its various components, meaning "across" the main organizational structure of the input (for example, various input files may contain bookmarks, and these bookmarks should be accessible in a global list of bookmarks);
  3. publishing the information: making selected views of the information accessible over the network, with specialized interfaces.

Input

  • we will need to be able to edit the data offline; this is often necessary for people who work with laptops or who are on the road;
  • the data lives in "files", where files consist of logical and convenient units of organization of information for the user to input and edit, e.g.
    • my bookmarks file specific to my workplace;
    • a contacts/address list that relates to a specific trip;
    • a blog entry, perhaps with a snippet of code in it and some links/bookmarks;
    • book reviews, which contain quotes, a short public blurb and a link to the book (say, to amazon).
  • the information needs to have levels of disclosure, including the possibility of being hidden completely (i.e. not published nor extracted at all), and the possibility of being entirely public, and various levels in between. Each file, but also each entry must be able to specify its level of disclosure individually as well;
  • we want to have a system that is as simple as possible, therefore we will prefer text files that can be created in a normal editor, like emacs or vi, but that does not prevent the creation of client programs to generate these input files;
  • we assume that not all "data" a user produces and consumes is relevant: data that gets included in the system is subject to revision, and the user has made a minimal effort to clean it up and select it. The user has to "write" it, or somehow take a conscious step to request that certain information be included in the system;
  • optional: many people must be able to edit the data concurrently, or a single user must be able to work using various independent copies of the data.
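
The disclosure requirement above could be modelled as an ordered set of levels. The sketch below is only one possible reading: the intermediate level `FRIENDS` is hypothetical, and so is the rule that an entry's own level, when present, overrides the level of the file that contains it.

```python
from enum import IntEnum
from typing import Optional


class Disclosure(IntEnum):
    """Ordered from least to most disclosed; the middle level is hypothetical."""
    HIDDEN = 0   # not published nor extracted at all
    FRIENDS = 1  # hypothetical intermediate level
    PUBLIC = 2   # entirely public


def effective_level(file_level: Disclosure,
                    entry_level: Optional[Disclosure] = None) -> Disclosure:
    """Assumption: an entry's own level, when given, overrides its file's."""
    return file_level if entry_level is None else entry_level


def is_visible(level: Disclosure, audience: Disclosure) -> bool:
    """An entry is shown to an audience if it is disclosed at least that far."""
    return level != Disclosure.HIDDEN and level >= audience


# A public file that contains one entry marked as hidden.
level = effective_level(Disclosure.PUBLIC, Disclosure.HIDDEN)
print(level.name, is_visible(level, Disclosure.PUBLIC))
```

A stricter design would take the more restrictive of the file and entry levels; which rule is right depends on how the files are meant to be written.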

Extraction

  • we must extract meaningful chunks of information from the input files;
  • all information chunks that are extracted must be tracked to their input file, so that we can implement an incremental extraction algorithm that looks at the input and figures out:
    • which chunks are obsolete;
    • a list of only new chunks to be integrated.
  • the kinds of "chunks" must be extensible or generic, so that the system is very flexible;
  • input "files" may change location, therefore we should not rely on their filenames as unique identifiers for the chunks of information;
  • the extraction should make the data available in a data store (e.g. a SQL database), in a way that makes it possible to perform incremental updates, full updates, and in a way that makes it flexibly accessible to various publishing interfaces.
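
The incremental step in the list above can be sketched as a set comparison. This is an illustration under stated assumptions, not the actual extraction algorithm: we assume that the data store can return the set of chunk ids it already holds for one source, and that a chunk is identified by a fingerprint of its kind and content (SHA-1 here) rather than by the filename. The names `chunk_id` and `plan_update` are hypothetical.

```python
import hashlib


def chunk_id(kind, content):
    """Identify a chunk by its kind and content, not by where the file lives."""
    digest = hashlib.sha1()
    digest.update(kind.encode("utf-8"))
    digest.update(b"\0")
    digest.update(content.encode("utf-8"))
    return digest.hexdigest()


def plan_update(stored_ids, extracted):
    """Compare the stored chunk ids of one source with a fresh extraction.

    stored_ids: set of chunk ids already in the data store for this source.
    extracted: list of (kind, content) pairs produced by the new extraction.
    Returns (obsolete_ids, new_chunks).
    """
    fresh = {chunk_id(kind, content): (kind, content)
             for kind, content in extracted}
    obsolete = stored_ids - set(fresh)
    new_chunks = [chunk for cid, chunk in fresh.items()
                  if cid not in stored_ids]
    return obsolete, new_chunks


stored = {chunk_id("bookmark", "http://www.amazon.com")}
obsolete, new_chunks = plan_update(
    stored, [("bookmark", "http://www.abebooks.com/")]
)
print(len(obsolete), "obsolete;", len(new_chunks), "new")
```

Chunks whose fingerprint is unchanged appear in neither result, so only the difference has to be written to the data store.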

Publishing

  • we need to be able to provide the various chunks of information in various ways, and to organize them in various ways, for example:
    • a blog, organized by dates and/or categories;
    • a travel journal;
    • lists of bookmarks, served up as RSS so browsers can integrate them;
    • a gallery of images, by trip;
    • notes taken about a certain task, e.g. setting up incremental backups, setting up software on a particular laptop;
    • a preferred wine list, a reading list;
    • project ideas, essays;
  • a pluggable architecture should be developed to make it possible to render each type of info chunk with a specific rendering system. We should be able to extend the system so that a new type of entry can be rendered in the existing publishing system.
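
One simple way to obtain this kind of pluggability is a registry that maps each chunk type to a rendering function. The sketch below is an assumption about how this could look, not the design of the system; the chunk types, field names and the `renderer` decorator are all hypothetical.

```python
from html import escape

RENDERERS = {}


def renderer(kind):
    """Decorator that registers a rendering function for one chunk type."""
    def register(func):
        RENDERERS[kind] = func
        return func
    return register


@renderer("bookmark")
def render_bookmark(chunk):
    # Escape the values so that they are safe to embed in HTML.
    return '<a href="%s">%s</a>' % (
        escape(chunk["url"], quote=True), escape(chunk["description"]))


@renderer("contact")
def render_contact(chunk):
    return "%s (%s)" % (chunk["name"], chunk["phone"])


def render(kind, chunk):
    """Dispatch a chunk to the renderer registered for its type."""
    try:
        func = RENDERERS[kind]
    except KeyError:
        raise LookupError("no renderer registered for chunk type %r" % kind)
    return func(chunk)


print(render("bookmark",
             {"url": "http://www.abebooks.com/", "description": "Abebooks"}))
```

Supporting a new type of entry then only requires registering one more function; the dispatching code in `render` does not change.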

Conclusion

The key ideas driving our design are: