'Adding improved file search to KDE'

List:       kde-devel
Subject:    Adding improved file search to KDE
From:       Jonathan Gardner <jgardner () jonathangardner ! net>
Date:       2004-06-14 23:11:08
Message-ID: 200406141611.08762.jgardner () jonathangardner ! net

On the KDE usability lists, one of our illustrious imaginators wanted to 
push for better file searching. I've done a lot of work with PostgreSQL and 
I feel I have a good grasp on database topics and how they can tie into 
search.

Let me break this up into 2 parts. The first part is the "big picture". This 
is where we can be one day, in the not too distant future. The second 
part is the "next step". This is what we can do next to move towards the 
"big picture".

Files should exist independent of the actual filesystem. In other words, the 
filesystem should be one way of many to find a file.

We store thoughts by context. We store a ton of metadata for each memory or 
idea, and we access them by index on these metadata. We do a search, "Show 
me that time I went to Hawaii and we were on the beach last year" and we 
get back a good candidate result with an option to think harder and come up 
with something else, or to change our search parameters: "No, it was when I 
was with my wife."

Certain files shouldn't be readily accessible to the user: cache files and 
configuration files, for instance. Other files should score higher on the 
search results independent of the search parameters: Files that are 
modified frequently and/or recently. Files that are viewed often or 
recently. Files that are owned by the user or user's group. Other files 
should score lower: Files that are frequently passed up in favor of other 
results, files that are never accessed or modified, or files that are 
modified, but not by the user (like log files).

There should be a personality profile kept for each user. When I search for 
"log", I expect to see Apache and PostgreSQL log files. But my next-door 
neighbor may want to see what is happening in the logging industry.

The file search dialog will become a ubiquitous replacement for the file open 
dialog. Metadata like "project" or "event" will keep files grouped 
together. Users will never see the underlying filesystem, if any.

I am sure you can come up with many metadata attributes that I can't even 
imagine. I am sure each individual user will have their tastes as well. 
However, actually collecting the metadata is difficult. For this purpose, 
we will have to keep a running context so that when a new file is created 
or edited in that context, the context's metadata is transferred. For 
instance, when I am working on a particular project, the system should 
realize this. When I go to edit a file, it will add that file as part of 
the project, unless there is strong evidence not to. We certainly can't 
expect the user to type "This is the picture of me and Sally in Hawaii on 
North Shore" for every single picture as they download them. Maybe metadata 
can be sparse at first and eventually grow as context is added.

Disk space will be abundant in the future. It will be cheap to get terabyte 
drives, which will be more than enough to store a movie of most of your 
life. The remaining space will be used as a cache and as an index to every 
file on the disk. The system disk will also keep a running index of CDs and 
other media. Perhaps it will even index websites and FTP sites that the 
user has visited or will likely visit one day.

Ample processor power and memory will make compiling all this 
information into a meaningful and efficient index easy. Keeping it up to 
date won't be an issue. Combining all the known indexes plus the user's 
preferences and personality will yield superior search results 
instantaneously.

Now for talk of today. Right now, the only tool that does any file indexing 
is "locate". It does a terrible job at that. It only indexes file names. It 
also updates only once a night.

I propose we begin a project to build indexes on every running instance of 
KDE. At first, we will index the local filesystem, starting with the user's 
home directory. Later, we will add 
indexes of CDs that the user has accessed. Maybe in the future we can index 
sites the user has visited.

The only metadata we should store are the file's name, title (if 
applicable), permissions, ownership, and last access "score". Also, file 
type and category will be recorded.
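
To make the record above concrete, here is a rough sketch of a table that 
holds those fields. It uses Python's built-in sqlite3 module purely as a 
stand-in so the example runs without a server; it is not PostgreSQL. The 
table name, column names, the "path" key, and both helper functions are my 
own assumptions, not part of any existing KDE code.

```python
import sqlite3

# Hypothetical schema: one row per indexed file. Every name here is an
# illustration; the "path" key is an assumption added so rows are unique.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file_index (
    path         TEXT PRIMARY KEY,
    name         TEXT NOT NULL,
    title        TEXT,
    permissions  INTEGER NOT NULL,
    owner_uid    INTEGER NOT NULL,
    group_gid    INTEGER NOT NULL,
    access_score REAL NOT NULL DEFAULT 0,
    mime_type    TEXT,
    category     TEXT
);
CREATE INDEX IF NOT EXISTS file_index_name ON file_index (name);
"""


def open_index(database=":memory:"):
    """Open (or create) the index database and make sure the table exists."""
    conn = sqlite3.connect(database)
    conn.executescript(SCHEMA)
    return conn


def search_by_name(conn, fragment):
    """Return paths whose file name contains `fragment`, best score first."""
    rows = conn.execute(
        "SELECT path FROM file_index WHERE name LIKE ? "
        "ORDER BY access_score DESC, path",
        (f"%{fragment}%",),
    )
    return [row[0] for row in rows]
```

Ordering by the access score is the simplest possible ranking; a real 
search would presumably combine it with other signals.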

The access score is a decaying number. Every time the user accesses a file, 
its score increases. Every time the user modifies it, the score increases 
more. This way, frequently accessed items and recently accessed items will 
be given the highest score.
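
One possible way to get such a score is exponential decay with a fixed 
half-life, plus a bump on each event. The sketch below is only that: the 
half-life, the two weights, and the function names are assumptions I picked 
for illustration.

```python
HALF_LIFE_DAYS = 30.0  # assumption: an untouched file loses half its score in 30 days
ACCESS_WEIGHT = 1.0    # assumption: bump for a plain access
MODIFY_WEIGHT = 3.0    # assumption: larger bump for a modification


def decay(score, elapsed_days, half_life=HALF_LIFE_DAYS):
    """Score after `elapsed_days` with no activity."""
    return score * 0.5 ** (elapsed_days / half_life)


def bump(score, last_event_day, now_day, modified=False):
    """Decay the stored score up to `now_day`, then add the event's weight."""
    weight = MODIFY_WEIGHT if modified else ACCESS_WEIGHT
    return decay(score, now_day - last_event_day) + weight
```

Only the score and the time of the last event need to be stored per file, 
so the index never has to keep a full access history.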

The file "type" is its MIME type. This should be easy to determine. The 
"file" tool can assist. Category is a broader description. We'll have "user 
files", maybe broken up into "documents", "music", "movies"... We'll also 
have "system files": "log files", "configuration files", "data files", 
"cache", and "code files". Every file should fall into one of these broad 
categories.
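
As a rough illustration of type and category detection, the sketch below 
guesses the MIME type from the file extension with Python's standard 
mimetypes module, then maps path and type onto the categories named above. 
Note that this is not what the "file" tool does (it inspects file 
contents), and the path and suffix rules are guesses of mine, not a 
settled policy.

```python
import mimetypes
from pathlib import PurePosixPath


def guess_mime(path):
    """Guess a MIME type from the extension only (no content sniffing)."""
    mime, _encoding = mimetypes.guess_type(path)
    return mime or "application/octet-stream"


def categorize(path, mime):
    """Map a path and MIME type to one broad category (hypothetical rules)."""
    p = PurePosixPath(path)
    parts = set(p.parts)
    # System-file rules are checked first, so a text log is not a "document".
    if ".cache" in parts or "cache" in parts:
        return "cache"
    if p.suffix == ".log" or "log" in parts:
        return "log files"
    if p.suffix in {".conf", ".cfg", ".ini"} or "etc" in parts:
        return "configuration files"
    if p.suffix in {".c", ".cpp", ".h", ".py"}:
        return "code files"
    # User-file rules fall back on the MIME type.
    if mime.startswith("audio/"):
        return "music"
    if mime.startswith("video/"):
        return "movies"
    if mime.startswith("text/") or mime == "application/pdf":
        return "documents"
    return "data files"
```

The rule order matters: every file lands in exactly one category, and 
anything unrecognized falls through to "data files".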

Indexes will be maintained by a PostgreSQL database instance. At first, we 
will index only the files in the user's home directory. Then we can branch 
out and index every file everywhere. How do we keep an accurate and timely 
index? There are two ways:

(1) We get notified when a file is modified. There is no way to do this at 
the kernel level (yet!) but the KDE API allows us to put our own hooks in. 
Most applications will use the KDE API to load and store files.

(2) We browse the filesystem, keeping resource usage to a minimum, looking 
for modified files.
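
Option (2) could look something like the sketch below: walk a directory 
tree and compare each file's modification time with the one recorded on the 
previous pass. The function name and the in-memory dictionary are 
assumptions for illustration; a real indexer would keep the recorded times 
in the database and throttle itself between directories.

```python
import os


def scan_for_changes(root, known_mtimes):
    """Return files under `root` that are new or modified since the last scan.

    `known_mtimes` maps path -> last seen modification time and is updated
    in place, so a second call right after the first reports nothing.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                # The file vanished or is unreadable; skip it on this pass.
                continue
            if known_mtimes.get(path) != mtime:
                known_mtimes[path] = mtime
                changed.append(path)
    return changed
```

This sketch does not notice deleted files; that would need a second pass 
over the recorded paths.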

Any thoughts?

-- 
Jonathan Gardner
jgardner@jonathangardner.net
