Author: Martin Blais <blais@furius.ca>
Date: 2005-04-25
Abstract
We describe a system that aims to make it possible for users to create relevance in their content, and to allow services to be built that serve it intelligently by providing semantically rich access to this data.
We are considering a system that eases the management and publishing of various kinds of information personal to a single producer, and that is "as simple as possible".
By personal information, we mean:
We present differing angles on various problems of information management:
Relevance of Information. Not all information that transits through a user's computer, or that is created by that user, is relevant. There is great value in getting rid of junk.
We believe that at this point, this task can only be performed manually: there is no artificial intelligence algorithm that can decide what is important for you. This is a key assumption in the motivation for this system: ultimately, the user is not just a producer of information, but is also an editor.
There is no search algorithm that will be able to automatically create value. We recognize that filtering technologies will play an important role in helping us become more efficient editors, but we do not believe that they will, by themselves, become able to create "the valuable" automatically anytime soon;
Disparate Storage and Inaccessible Source Data. Information is stored in various bits that each have an associated meaning--let us call them "chunks" for now. Different kinds of chunks are stored in different places: for example, all addresses are stored in an address book manager program, all email in a contacts list program. Contacts lists are stored in PDAs, and despite the availability of synchronization programs (you're lucky if they even work), ultimately the data lives in different places and there are multiple copies of it.
Also, these data stores use different storage methods, often specific to the data model that they choose, and readable only by the specific software that created them. This makes it difficult to access the data and to build independent services on top of it.
Services that allow you to enter some of that data online fall into the same trap: they store the data on their machines in a format that does not let you, the user, get it back. You create value by publishing your data using their system, but you do not have a way to get it back in a form suitable for reuse!
Data Entry. It is difficult to enter the information, for many reasons:
every time you need to enter a new type of information, you need to start a program specific to that information storage. These programs change over time, and this means that you must learn a plethora of programs, just to enter the data;
more importantly, many times you would want to mix different types of data together in one logical unit. For example, you might want to open a text file when you're researching a specific issue, say, all the information you can find about an illness that your doctor has just diagnosed. You would want to store links to URLs of interest, contact information for local specialists who may be able to help you, as well as text that you write yourself about the illness, or notes that you make on your condition, whatever.
Personally, whenever I embark on any substantial task, I create a new small document for it--in the form of a text file-- and jot down notes as I discover more and more aspects of the problem that I'm working on. I think many people do the same, or would do the same if they found a use for the document. For example, you might want to share some of this knowledge. Right now, there is no easy way to do that;
Publishing is Difficult. Publishing much of the information you accumulate is still very hard. There are many different systems out there which attempt to provide a way for you to publish certain types of data, in a specialized way. These require much customization, and a market for specialized services (such as Blogger) has emerged for specific uses of the data.
Would it not be great if we could build services on top of all your data, rather than attempting to solve one specific use of the data?
Also of note: online publishing systems store the data remotely, which makes it difficult to build client software to edit it efficiently. Making it possible to upload the data in a digested form once it has been authored locally has some potential.
One central value behind our views, and one that you must keep in mind when considering the proposition that we're about to make, is that of the importance of simplicity. There is great value in keeping things as simple as they need to be, because it allows the most flexible reuse of the information.
Many of the rewards of keeping design and data simple can be observed in the power of the UNIX tools and operating system, which is built upon simple but very powerful ideas: sockets and files that consist of generic streams of bytes, small tools that each perform one task really well, and a simple and generic way to connect those tools together (see "The Art of UNIX Programming", E. S. Raymond). This has made possible the creation of complex tools without having to reinvent the small ones, by instead improving those small tools in a generic way that allows more possibilities for connecting them in yet more different ways. Keeping things generic and as simple as possible is a potent idea.
This idea of designing systems as simple as they need be is also prevalent in the practice of software development. Over the past ten years, we have seen development methodologies converge towards this idea. Extreme programming, agile methodologies, and the growing adoption of dynamic languages are a direct expression of the quest to get closer and closer to the essence of the problems we're trying to solve while getting rid of unneeded complications. In many ways, software development is in the business of creating complexity. We are essentially recognizing that keeping our designs and data models as simple as possible is the most efficient way of controlling the growth of this complexity.
The history behind the creation of this project stems from a long-standing need of its author to maintain personal information in a way that is most useful and that can be kept independent from specific software, over long periods of time. The sections below outline some of the problems I have tackled in the past, and the partial solutions I came to before creating the Nabu extraction system. Nabu is meant to replace all these tricks and to allow me to extract, organize and selectively publish some of this information.
I needed to maintain an address book. At the time (circa 1993) no decent software output a textual format which could be read for converting the data into other formats. Thus around 1997, I decided to transcribe all my physical address books into a text file and, following the advice of my university supervisor at the time, used a paragraph-grep program to query it. This worked great for many years, except that there was no integration with my email programs. I could however grep and sed the address book file to generate a text file that could in turn be imported by various email systems. Over time, the one address book file grew into many, and new contact information moved gradually into the documents which provided context for them.
I think at some point I started using the LDAP LDIF format to store the files, but the attribute names were a bit too long or annoying for adding entries with a text editor, so I just created my own simple format, which looks like a list of entries like this:
    n: New Navarino Bakery & Pastry Shop
    p: 514-279-7725
    a: 5563, avenue du parc, Montréal, QC H2V 4H2
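A format like the one above is easy to read back with a few lines of code. The following is a minimal sketch in Python, assuming one "key: value" field per line and a blank line between entries (the exact layout is an assumption). The function names and the second sample entry are made up for illustration, and the query imitates the paragraph-grep approach mentioned earlier rather than reproducing any particular tool.

```python
import re


def parse_address_book(text):
    """Parse blank-line-separated entries of 'key: value' lines into dicts."""
    entries = []
    for paragraph in re.split(r"\n\s*\n", text):
        entry = {}
        for line in paragraph.splitlines():
            key, sep, value = line.partition(":")
            if sep and key.strip():
                entry[key.strip()] = value.strip()
        if entry:  # ignore empty paragraphs
            entries.append(entry)
    return entries


def paragraph_grep(entries, needle):
    """Return the entries in which any field contains `needle` (case-insensitive)."""
    needle = needle.lower()
    return [e for e in entries if any(needle in v.lower() for v in e.values())]


# The first entry comes from the text; the second is a made-up placeholder.
SAMPLE = """\
n: New Navarino Bakery & Pastry Shop
p: 514-279-7725
a: 5563, avenue du parc, Montréal, QC H2V 4H2

n: Example Person
p: 555-0100
"""

if __name__ == "__main__":
    for entry in paragraph_grep(parse_address_book(SAMPLE), "bakery"):
        print(entry["n"], entry["p"])
```

Because each entry is just a paragraph of text, the same file stays usable with plain grep and a text editor; the parser is only a convenience layered on top.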
Another issue is that of maintaining a set of bookmarks. One of the problems is that every few years a new browser comes out, and I end up moving to it. For example, I started using the web with Xmosaic, and eventually moved to Netscape. On Windows I eventually had to use IE, and eventually switched to Konqueror on a Linux machine, and then Mozilla, which was very heavy, so eventually to Firefox. Most of these browsers have slightly different bookmark storage formats which are not conveniently edited within emacs.
A more important problem is that of the organization of bookmarks. Adding all bookmarks to a linear list makes it nearly impossible to reuse them efficiently (it is very hard to find the bookmark that you're looking for). Tree structures help alleviate this problem to some extent, but add another problem: when you want to quickly add a bookmark (somehow, it always has to be quick), you have to choose a single most appropriate place to put it, and if you're not very careful with this you often have a hard time finding your bookmark again.
I found this problem really annoying, so I designed a very simple textual format for bookmarks, where I would enter a description, a URL, and a list of keywords. I wrote Tengis, a program that can read this format and can quickly query the bookmarks by keyword. Unfortunately, I never quite got used to using my own software on top of the browser, and I always end up grepping the file within emacs.
Here is an example excerpt of a bookmarks file:
    Babelfish
    http://babelfish.altavista.com
    search, languages, translation

    Amazon
    http://www.amazon.com
    search, books, music

    Abebooks
    http://www.abebooks.com/
    search, books
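To illustrate how little machinery such a format needs, here is a small Python sketch of reading it and querying by keyword. It assumes three lines per entry (description, URL, comma-separated keywords) with a blank line between entries. It is not the Tengis implementation, whose internals are not described here, and the names `Bookmark`, `parse_bookmarks` and `query` are hypothetical.

```python
import re
from typing import NamedTuple


class Bookmark(NamedTuple):
    description: str
    url: str
    keywords: frozenset


def parse_bookmarks(text):
    """Parse entries of three lines each: description, URL, keyword list."""
    bookmarks = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if len(lines) != 3:
            continue  # skip entries that do not follow the assumed layout
        description, url, keyword_line = lines
        keywords = frozenset(
            k.strip().lower() for k in keyword_line.split(",") if k.strip()
        )
        bookmarks.append(Bookmark(description, url, keywords))
    return bookmarks


def query(bookmarks, *keywords):
    """Return the bookmarks that carry all of the given keywords."""
    wanted = {k.lower() for k in keywords}
    return [b for b in bookmarks if wanted <= b.keywords]


SAMPLE = """\
Babelfish
http://babelfish.altavista.com
search, languages, translation

Amazon
http://www.amazon.com
search, books, music

Abebooks
http://www.abebooks.com/
search, books
"""

if __name__ == "__main__":
    for bookmark in query(parse_bookmarks(SAMPLE), "search", "books"):
        print(bookmark.description, bookmark.url)
```

Keywords sidestep the single-place problem of tree structures: a bookmark can be found under any of its keywords without deciding up front where it belongs.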
Another problem is that various links end up being stored in documents, text files which I write when I accomplish some specific task. These do not make it to the global bookmarks file.
For convenience, I wrote a script that could convert this file into a tree structure and automatically generate bookmarks files for whatever browser I'm using at the time.
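A conversion of this kind might look roughly like the sketch below. It makes two assumptions that do not come from the original script: each keyword becomes one top-level folder (so a bookmark appears under every one of its keywords), and the output uses the Netscape-style bookmark HTML layout that several browsers can import. Details of that layout differ between browsers, so treat the output as an approximation; the function names are hypothetical.

```python
from collections import defaultdict
from html import escape


def build_tree(bookmarks):
    """Group (description, url, keywords) tuples into {keyword: [(description, url)]}."""
    tree = defaultdict(list)
    for description, url, keywords in bookmarks:
        for keyword in keywords:
            tree[keyword].append((description, url))
    return dict(sorted(tree.items()))  # folders in alphabetical order


def render_netscape_html(tree):
    """Render the one-level tree as a Netscape-style bookmarks file (assumed layout)."""
    lines = [
        "<!DOCTYPE NETSCAPE-Bookmark-file-1>",
        "<TITLE>Bookmarks</TITLE>",
        "<H1>Bookmarks</H1>",
        "<DL><p>",
    ]
    for folder, items in tree.items():
        lines.append(f"    <DT><H3>{escape(folder)}</H3>")
        lines.append("    <DL><p>")
        for description, url in items:
            lines.append(
                f'        <DT><A HREF="{escape(url)}">{escape(description)}</A>'
            )
        lines.append("    </DL><p>")
    lines.append("</DL><p>")
    return "\n".join(lines) + "\n"


# Entries taken from the bookmarks excerpt in the text.
BOOKMARKS = [
    ("Babelfish", "http://babelfish.altavista.com", ["search", "languages", "translation"]),
    ("Amazon", "http://www.amazon.com", ["search", "books", "music"]),
    ("Abebooks", "http://www.abebooks.com/", ["search", "books"]),
]

if __name__ == "__main__":
    print(render_netscape_html(build_tree(BOOKMARKS)))
```

Since the text file remains the source of truth, the generated file can be thrown away and regenerated whenever the browser changes.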
Whenever I have an idea for a project, something that I find interesting enough, I document it. I would like to share these documents, but they change quite a bit over time, and they don't necessarily belong together for the presentation layer.
There is much information to be acquired when using computers. A good habit that I have acquired is to start a text file to jot notes whenever I take on a task that is going to take a few hours. This helps keep my focus organized, and serves as reference if I have to repeat that task in the future. It is also very useful to just send those instructions when someone asks me how I accomplished this task in the past. I also avoid wasting time when I need to make a new iteration of the same task-- I can review my thoughts at the time, the decisions I made, etc.
When you are surveying a lot of scientific papers, it is good to take notes on ideas and to summarize the crux of each paper that you read. This helps organize your thinking by forcing you to write and express your thoughts. I always wrote short 5 or 6 paragraph reviews of the papers that I read. These live in separate files and can sometimes be reused by friends when they ask me about specific subjects, when I point them to some paper or other.
Also, I like to take down quotes from the books that I read. Whenever I read a book, I mark down interesting passages, and when I'm done with the reading, I take 30 mins to copy these passages in text files. I sometimes like to feed from this body of quotations to add to my signature in email (although I must admit that I have eliminated using signatures at all for many years now). In any case, I sometimes enjoy going back to those review files when I'm having an idea that relates to a book that I have read.
A key theme behind the problems described above is that the software you use to manipulate your personal information or notes files is going to change. Therefore it is a bad idea to use closed formats like those produced by MS Word, or similar software, if you want to be able to maintain and use these documents for a long time.
I very much trust simple text files. They will always be readable, and interpretable, and they use little storage. In this context, docutils is an amazing tool because it allows you to extract meaningful structure from them, as long as you follow minimal conventions. One of the principal motivators behind this system is to provide the ability to maintain all sorts of personal information using simple text files. This is a key aspect.
Simply stated, our goal is the following:
To make it possible for users to create relevant content and allow building ways to serve it intelligently by providing semantically rich access to their data.
We want to make it possible to build services on top of the user's valuable resource: information. In order to do this, we have to make it possible for any user to build this meaningful source of their own information, and to add relevance to it. We want to:
make it easy to enter the information in a way that allows an automated system to extract the meaningful chunks of data and associate them with pre-defined (and extensible) semantics.
This may involve some form of simple markup (e.g. "create new document", "insert contact info", "insert bookmark"). Easy means simple. The interface and data format have to be simple, if not trivial;
provide a service that will store this extracted information in a way that is accessible by various publishing services;
create services that will offer creative views on this data.
You can think of a blog interface, image galleries, a birthday notifier system, a system to sync your data store with your PDA, to serve your personal bookmarks as RSS feeds, to publish your travel log, to show your calendar of events, etc.
These views would create value by providing convenient access and novelty on top of the user's data source. Each of these views would use as its basis the parsed data source, stored and accessed in an efficient manner (i.e. in a database).
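The three stages described above (extract, store, view) can be sketched end to end in a few lines. The Python below is only an illustration under stated assumptions, not the Nabu design: the "chunks" are plain URLs found with a regular expression, the store is an in-memory SQLite database, and the view is a sorted list of distinct links. The table layout and all function names are hypothetical.

```python
import re
import sqlite3

# Simplistic URL pattern, good enough for a sketch.
URL_RE = re.compile(r"https?://[^\s<>\"]+")


def extract_links(doc_id, text):
    """Extractor: yield (doc_id, url) for every URL found in a text document."""
    for match in URL_RE.finditer(text):
        # Drop punctuation that commonly trails a URL in running prose.
        yield doc_id, match.group(0).rstrip(".,;)")


def store_links(conn, rows):
    """Store: insert the extracted chunks, ignoring ones already present."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links "
        "(doc_id TEXT, url TEXT, UNIQUE (doc_id, url))"
    )
    conn.executemany("INSERT OR IGNORE INTO links VALUES (?, ?)", rows)
    conn.commit()


def links_view(conn):
    """View: the distinct stored links, sorted, ready for any presentation layer."""
    cursor = conn.execute("SELECT DISTINCT url FROM links ORDER BY url")
    return [row[0] for row in cursor]


NOTES = (
    "For translation, see http://babelfish.altavista.com, "
    "and for books try http://www.amazon.com."
)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_links(conn, extract_links("notes.txt", NOTES))
    for url in links_view(conn):
        print(url)
```

Keeping the stages separate means a new view (an RSS feed, a birthday notifier) only needs to query the store; it never has to parse the source files itself.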
Our aim is clearly NOT to:
We believe that relevance in information is the result of a certain amount of conscious effort on the part of the user, and that search technologies have an inherent limit in the quality of the information that they can provide, in terms of filtering and organizing the data that flows through a user's system. This is a key aspect of this document and of the scope of what we're trying to achieve. Search can help in organizing, but cannot organize for you. Better search can alleviate some of the need for organization, but we recognize that ultimately, to create high-quality content, a conscious effort has to be made.
The problem is threefold:
The key ideas driving our design are: