Summary of two personal emails from Seth Nickell to Alexander Prokudin, for an article in Computerra.
I know most about this so I'll talk about it most of course ;-)
“storage-store” provides a DBus service that allows fetching objects over the Free Desktop bus, getting their attributes, relating them to each other, running queries, etc. “storage-store” uses PostgreSQL to store the structured objects and perform queries. Because objects are accessed “live” rather than as “buffers”, changes are instantly propagated across the bus, so multiple applications or users can work on the same document and instantly see changes other people make. I'm currently working on architecture to tie storage-store into standard IM presence information so you will be able to see buddy icons of other people and what part of the document they are working on inside Storage applications.
I have a lot of user experience goals for Storage (or more accurately, for applications and desktops that use Storage). You can find information about most of them at http://www.gnome.org/~seth/blog/storage-speaking-notes and at http://www.gnome.org/~seth/storage. Though these goals are more important to me than document indexing, I will focus on document indexing in order to compare and contrast with the earlier systems.
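As a rough illustration of the “live” access model described above (and not Storage's actual code or DBus interface), the sketch below uses a toy in-process store where every attribute change is pushed to all subscribers. The names `LiveStore`, `subscribe`, `set_attribute` and `get_attribute` are hypothetical, and plain Python callbacks stand in for signals on the bus.

```python
from collections import defaultdict


class LiveStore:
    """Toy in-process stand-in for a store whose objects are accessed live.

    Hypothetical API: the real storage-store exposes objects over DBus;
    here ordinary callbacks play the role of change signals on the bus.
    """

    def __init__(self):
        # object id -> {attribute name: value}
        self._objects = defaultdict(dict)
        # object id -> list of callbacks interested in that object
        self._subscribers = defaultdict(list)

    def subscribe(self, object_id, callback):
        """Register callback(object_id, name, value) for changes to one object."""
        self._subscribers[object_id].append(callback)

    def set_attribute(self, object_id, name, value):
        """Change an attribute and immediately notify every subscriber."""
        self._objects[object_id][name] = value
        for callback in self._subscribers[object_id]:
            callback(object_id, name, value)

    def get_attribute(self, object_id, name):
        """Read the current value of an attribute."""
        return self._objects[object_id][name]
```

Two “applications” that subscribe to the same object id would both have their callbacks invoked on each `set_attribute` call, which is the behaviour that lets several users see each other's edits without reloading a buffer.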
libstorage-translators provides a framework for translators that can take structured object data in the store (metadata and the actual data itself) and translate it to and from byte streams (such as files). The goal is not to index files, but to provide a way to move files in and out of the store. So for example, if your friend sent you a PDF file by e-mail, you could drag that file into your local store and libstorage-translators will automatically decompose the information for placing in the store (and of course extract lots of metadata like album name, description, image width, etc. in the process). Currently I have only worked on the “importer” side of translators, not the “exporter”, so they are effectively like indexers. There are currently importers for: DocBook, HTML, any image format supported by gdk-pixbuf (JPEG, PNG, BMP, GIF, and several more obscure formats), PDF, text, and any format supported by GStreamer (MP3, OGG, AVI, MPEG2, etc). Importers can also create thumbnails for the data for convenient display later. Storage also includes a renderer system for displaying the relevant metadata etc. for different sorts of results to a query. A major drawback is that I don't have translators for common document formats like Gnumeric or OO.o at the moment.
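The importer side of a translator framework like the one described above can be pictured as a registry that maps a format to a function turning a byte stream into structured data. This is only a sketch under assumptions: the `IMPORTERS` table, the `importer` decorator, `import_bytes`, and the choice of MIME types as keys are all hypothetical and are not taken from libstorage-translators.

```python
# Hypothetical registry: MIME type -> function(bytes) -> dict of structured data.
IMPORTERS = {}


def importer(mime_type):
    """Decorator that registers an importer function for one MIME type."""
    def register(func):
        IMPORTERS[mime_type] = func
        return func
    return register


@importer("text/plain")
def import_text(data):
    """Decompose a plain-text byte stream into body text plus simple metadata."""
    text = data.decode("utf-8", errors="replace")
    return {
        "type": "text",
        "body": text,
        "word_count": len(text.split()),
    }


def import_bytes(mime_type, data):
    """Pick the importer for mime_type and run it on the byte stream."""
    try:
        func = IMPORTERS[mime_type]
    except KeyError:
        # Mirrors the drawback mentioned above: some formats have no translator.
        raise ValueError(f"no importer registered for {mime_type!r}") from None
    return func(data)
```

An “exporter” would be the inverse mapping, from a structured object back to bytes; it is left out here because the text says only the importer side exists so far.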
Queries can either be performed using an SQL-like format (slightly higher level than SQL but not much, it gets translated to SQL) or using
natural language queries. A large chunk of storage code is currently in its NL system which uses very sophisticated HPSG grammars and other
techniques to translate human language phrases into the SQL query format.
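To give a feel for what “a slightly higher-level format that gets translated to SQL” might look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the query shape (an object type plus attribute filters), the two-table schema, and the function names are invented, and Python's built-in `sqlite3` is swapped in for PostgreSQL so the example is self-contained.

```python
import sqlite3


def to_sql(object_type, filters):
    """Translate an object type plus {attribute: value} filters into SQL.

    Returns (sql, params); values are passed as parameters, never
    interpolated into the SQL string.
    """
    clauses = ["type = ?"]
    params = [object_type]
    for name, value in filters.items():
        clauses.append(
            "id IN (SELECT object_id FROM attributes WHERE name = ? AND value = ?)"
        )
        params.extend([name, value])
    sql = "SELECT id FROM objects WHERE " + " AND ".join(clauses) + " ORDER BY id"
    return sql, params


def demo_db():
    """Build a small in-memory database using the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE objects (id INTEGER PRIMARY KEY, type TEXT);
        CREATE TABLE attributes (object_id INTEGER, name TEXT, value TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO objects VALUES (?, ?)",
        [(1, "song"), (2, "song"), (3, "image")],
    )
    conn.executemany(
        "INSERT INTO attributes VALUES (?, ?, ?)",
        [
            (1, "artist", "John Lennon"),
            (1, "title", "Imagine"),
            (2, "artist", "Paul McCartney"),
            (2, "title", "Yesterday"),
            (3, "artist", "John Lennon"),
        ],
    )
    return conn
```

In this picture the natural-language system would sit one layer above: it would produce the arguments to something like `to_sql` rather than SQL text directly.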
A storage:/// VFS URI is provided which automatically invokes translators when files are dragged into the store. That means you can,
e.g. open a nautilus window to storage:/// and drag files in to add them to the store. It also provides query folders like Medusa. So for example
you can have a folder “spreadsheets” or “songs by John Lennon that don't have the word 'love' in them” that is live updated to contain objects matching those criteria.
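A query folder of the kind described above can be thought of as a saved predicate that is re-checked whenever the store changes. The sketch below is a simplified illustration, not how the storage:/// VFS module works: `Store`, `QueryFolder` and `lennon_without_love` are hypothetical names, objects are plain dicts, and only additions are handled (edits and deletions are ignored).

```python
class QueryFolder:
    """A named folder whose contents are the objects matching a predicate."""

    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate
        self.members = []

    def offer(self, obj):
        """Called for each object in the store; keep it if it matches."""
        if self.predicate(obj):
            self.members.append(obj)


class Store:
    """Toy store that keeps its query folders up to date as objects arrive."""

    def __init__(self):
        self.objects = []
        self.folders = []

    def add_folder(self, folder):
        """Attach a folder and fill it from the objects already stored."""
        self.folders.append(folder)
        for obj in self.objects:
            folder.offer(obj)

    def add_object(self, obj):
        """Store an object and offer it to every query folder."""
        self.objects.append(obj)
        for folder in self.folders:
            folder.offer(obj)


def lennon_without_love(obj):
    """Songs by John Lennon whose title does not contain the word 'love'."""
    return (
        obj.get("type") == "song"
        and obj.get("artist") == "John Lennon"
        and "love" not in obj.get("title", "").lower()
    )
```

Here the predicate checks only the title; matching against lyrics, as the example in the text might imply, would need the lyrics to be stored as an attribute too.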
> I'm also curious whether this NL system (the PET part of it, if I'm not mistaken) is
> smart (and modular) enough to handle grammars for languages from different
> families (like Russian, Chinese, Arabic etc.).
The base syntax parser for the NL system is a Head-driven Phrase Structure Grammar (HPSG) parser called PET. HPSG is a very general
“theory” that allows specification of sophisticated grammars for (almost?) any human language. Both PET & HPSG are from the formal linguistics community where the primary interest is in addressing and representing the whole scope of human language's syntax.
On a syntactic level, Storage should be able to use any grammar which is compatible with the LinGO grammar matrix (http://depts.washington.edu/uwcl/matrix/). You can think of the LinGO grammar matrix as being an “API” that HPSG grammars can implement. Many compatible grammars exist today for languages such as: German, English, Norwegian, Italian, Spanish, Japanese, Swedish and Modern Greek. There are probably many more I'm simply not aware of. Many of these are done by universities and are freely available, some are not.
In the broader context of HPSG grammars (not necessarily LinGO compatible, but in the same rough format, and many could be translated to it with some work), grammars are already available for many, many languages, including a number of Slavic languages.
So that is the situation with the syntactic grammar. Every language also needs an accompanying semantic grammar, which is specific to storage. Fortunately these are much *much* easier to write than syntactic grammars.
> When is it a good time to start working at sentence parsing rules for other languages?
I don't have a specific timeframe. I was going to do a simple implementation of either Spanish or Italian (languages I'm somewhat familiar with) for GUADEC a couple of weeks ago, but ended up getting sidetracked.
At the moment I'm primarily working on a re-architecture of the store aspect, so I'm deferring more NL work for a little while.