- restrict finder to certain extensions (default is all files); test with the .txt extension
For now, we think that the exclude option, combined with the fact that only certain files carry the marker, is sufficient for most usage. And the marker is customizable. Also, since we only read the header of each file, it is still really fast even when there are a lot of large binary files.
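Something along these lines for the header check; the ':Id:' marker and the 1024-byte window here are just examples, not the finder's actual defaults::

  import re

  marker_re = re.compile(b':Id:')  # the marker is customizable

  def has_marker(filename, window=1024):
      # Read only the head of the file, so even large binary files
      # stay cheap to test.
      with open(filename, 'rb') as f:
          return marker_re.search(f.read(window)) is not None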
We need to find out whether it is possible, from a CGI script, to tell the client that we are done and let it exit, while the script keeps running on the server. Note that there might be some options in the client for that purpose; find out.
Do this later. This could be interesting.
Try this in the headers from the Nabu server script::

  Connection: close
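A rough sketch of the idea, untested; whether the web server actually completes the response and releases the client while the child keeps running depends on the server::

  import os
  import sys
  import time

  sys.stdout.write('Content-Type: text/plain\r\n')
  sys.stdout.write('Connection: close\r\n')
  sys.stdout.write('Content-Length: 5\r\n')
  sys.stdout.write('\r\n')
  sys.stdout.write('done\n')
  sys.stdout.flush()

  # Close all the standard streams so the web server can finish the
  # response and let the client exit...
  os.close(0)
  os.close(1)
  os.close(2)

  # ...while this process keeps running on the server side.
  time.sleep(60)  # stand-in for the deferred processing work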
Print processing/pending queue (server must provide that information)
We need an interface for servers that implement deferred processing. We should be able to query the processing queue to find out whether our documents have been processed yet, so that we can then query for errors.
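A minimal sketch of what such a query could look like, assuming an XML-RPC style interface; the method name and the states are made up::

  from xmlrpc.server import SimpleXMLRPCServer

  # unid -> state, maintained by the deferred processor (not shown).
  queue = {'doc-0012': 'pending'}

  def process_status(unids):
      # Report 'pending', 'processing' or 'done' for each document id.
      return [queue.get(u, 'done') for u in unids]

  server = SimpleXMLRPCServer(('localhost', 8000))
  server.register_function(process_status)
  server.serve_forever()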
When pushing new documents to the server, emit a warning if there is a clash between documents from different users?
This should be in the server config and perhaps belongs to the domain of policy.
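If we do it, the check itself could be as simple as this; the document table and its columns here are hypothetical::

  def check_clash(conn, unid, username):
      # Warn if a document with the same unique id was already
      # uploaded by a different user.
      row = conn.execute('SELECT username FROM document WHERE unid = ?',
                         (unid,)).fetchone()
      if row is not None and row[0] != username:
          print('Warning: document %r clashes with one owned by %r'
                % (unid, row[0]))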
Store the length or the number of nodes of the source documents, so that we can estimate the size of the data or the time it will take to perform a specific task on it.
Do this later.
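Counting nodes is cheap once the doctree is parsed; a quick sketch with docutils (the filename is just an example)::

  from docutils.core import publish_doctree

  with open('document.txt') as f:
      doctree = publish_doctree(f.read())

  # A rough size metric: the total number of nodes in the tree.
  num_nodes = len(list(doctree.traverse()))
  print(num_nodes)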
Loose data models: automatically adding columns for entry "fields" that are not currently in the database makes the data model looser.
Do this later, when I have a good demonstration of nicely extracted stuff.
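A sketch of what growing the table could look like with SQLite; store_entry, the table layout, and the TEXT-columns-only policy are just illustration::

  import sqlite3

  def store_entry(conn, table, entry):
      # Columns the table currently has.
      existing = set(row[1] for row in
                     conn.execute('PRAGMA table_info(%s)' % table))
      # Grow the table with a TEXT column for every new field.
      # (Field names are assumed to be sanitized identifiers.)
      for field in entry:
          if field not in existing:
              conn.execute('ALTER TABLE %s ADD COLUMN %s TEXT'
                           % (table, field))
      names = sorted(entry)
      conn.execute('INSERT INTO %s (%s) VALUES (%s)'
                   % (table, ', '.join(names), ', '.join('?' * len(names))),
                   [entry[n] for n in names])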
Find a way for the entry forms to be "loosely coupled" with the data model on the server. This involves growing tables dynamically, depending on the contents found in the documents, i.e. field lists of unspecified types.
We can do this much later. The fixed extractors are very cool as it is.
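For reference, pulling an unspecified field list out of a doctree is easy with docutils; this is just an illustration, not the extractor API::

  from docutils import nodes
  from docutils.core import publish_doctree

  def extract_fields(doctree):
      # Collect name/value pairs from every field in the document,
      # without knowing the field names in advance.
      entry = {}
      for field in doctree.traverse(nodes.field):
          name, body = field.children
          entry[name.astext()] = body.astext()
      return entry

  entry = extract_fields(publish_doctree(':Author: me\n:Status: draft\n'))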
On the server, we could keep the doctrees both before and after the Nabu transforms, so that entire documents could be reprocessed after reconfiguring the server, without having to reparse or re-upload. However, we do not expect the transforms to apply significant changes to the documents, so this looks like a lot of data duplication for little gain. We will store only the processed doctrees for now.
Worst case, you can always clear all the data and reload it; it should not be that big a deal.
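A sketch of storing only the processed doctree as a pickle; the document table and its doctree column are hypothetical::

  import pickle

  def store_doctree(conn, unid, doctree):
      # The reporter and transformer hold unpicklable state; detach
      # them before serializing the tree.
      doctree.reporter = None
      doctree.transformer = None
      blob = pickle.dumps(doctree, pickle.HIGHEST_PROTOCOL)
      conn.execute('UPDATE document SET doctree = ? WHERE unid = ?',
                   (blob, unid))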