On Tuesday 01 June 2004 06:46 pm, Jamethiel Knorth wrote:
> > From: Jonathan Gardner
> > Date: Tue, 1 Jun 2004 15:05:42 -0700
> [snip]
>
> An index of the entire filesystem cannot be maintained without wasting
> a ton of computer power. Google does it by having thousands of
> computers working constantly. Do not compare to Google.
>

We have two things working for us on the desktop of a typical user:
unused computing power and unused hard drive space. While Google indexes
billions of pages, we'll only be indexing tens of thousands of files. So
the comparison to Google is not too far off: they have thousands of
computers, but they also do roughly thousands of times more work than we
need to do for a typical desktop.

I think we need to start capitalizing on these extra resources. We need
to start preparing for the future, when we have THz processors, TB of
storage space that goes unused, and TB of unused RAM. In a small way, we
are already there.

> > Creating the index is the problem. A backend process should be
> > constantly running at a low priority. Initially, it indexes all the
> > files. Then, it begins to index files as they change. It always
> > keeps a fresh index of the most important files, and gets around to
> > less important files when it has the time.
>
> It is hard to do a good job tracking what changes without
> significantly slowing the system down.
>

Can't we have a kernel hook or module that will allow us to watch what
has changed? Perhaps it could keep a database of recent changes so we
can ask it, "What has changed since I last checked?" We do this in
databases without any problem. Maybe this whole system would have to run
at a lower level than typical applications, as it would need to be tied
so closely to the kernel.
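Something like this already half-exists in Linux as the dnotify hook
(the F_NOTIFY fcntl). Here is a rough Python sketch of the kind of thing
I mean -- the watch directory and the in-memory change log are only
placeholders, and a real daemon would watch every directory and keep the
log in a proper database:

    import fcntl
    import os
    import signal
    import time

    WATCH_DIR = "/home/user/documents"  # placeholder directory

    change_log = []  # list of (timestamp, directory) events

    def on_change(signum, frame):
        # dnotify tells us *that* the directory changed, not which
        # file; the indexer would rescan and diff against its index.
        change_log.append((time.time(), WATCH_DIR))

    signal.signal(signal.SIGIO, on_change)

    # Ask the kernel to send SIGIO whenever the directory changes.
    fd = os.open(WATCH_DIR, os.O_RDONLY)
    fcntl.fcntl(fd, fcntl.F_NOTIFY,
                fcntl.DN_MODIFY | fcntl.DN_CREATE | fcntl.DN_DELETE |
                fcntl.DN_RENAME | fcntl.DN_MULTISHOT)

    def changes_since(last_checked):
        # Answer "what has changed since I last checked?"
        return [e for e in change_log if e[0] > last_checked]

    while True:  # the daemon would do its indexing work here
        time.sleep(60)
        for stamp, directory in changes_since(time.time() - 60):
            print("%s changed at %s" % (directory, time.ctime(stamp)))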
> Although I did kinda make harsh criticism, you probably know more
> about this than I do. This is what I have recommended on the page [1],
> truncated to a more brief form for this e-mail.
>

I don't mind harshness if it is technical, so I take no offense at your
tone. I hope the feeling is mutual. (See the note on footnote [1]
below.)

> - Indexing -
>
> Have an index which tracks file locations, but also tracks how
> up-to-date it is and what its rate of change is. It doesn't need to
> all be at the same point. Update times are recorded for a directory in
> a non-recursive manner.
>

I don't see where this is going. Perhaps you want an index of what files
have changed when, and how often they change? Perhaps you also want to
know who changed them? Maybe how they were changed (appended, small
changes, or a complete rewrite)? Maybe we could track the history of a
file? For instance, "This file was copied from that file originally."
Or, "This file was originally at X but was later moved to Y on the 5th
of May."

> Whenever a directory is searched, it is added to the index again.
> Programs would be expected to also update the index whenever they list
> a directory. To optimize for speed, such updates wouldn't always need
> to be inserted into the index immediately. So, if Konqueror opens a
> directory, it sends the list of contents off to the list of things to
> add into the index.
>

I like it: opportunistic indexing. If the file is in the memory cache,
index it right away. We may even want to index it as it is read into the
cache. Again, this hints at plugins that work lower than typical
applications.

> The actual indexer is a daemon (hopefully run with higher privileges
> so that all users can share). When the daemon indexes a directory, it
> notes how much it changed since the last time, and how long between
> updates, and calculates the rate of change.
>
> - Lazy Checking -
>
> When updating, do updates according to what needs it most. This is
> determined according to rate-of-change and time-of-last-update.
>
> This would be done slowly and lazily. Whenever the system has extra
> resources, another directory would be checked, then the indexer would
> pause a moment, then check another.
>

I think I tried to put this into the "file importance" thing I
originally proposed. Some files are more important to index than others.
The system should keep an up-to-date index of the most important files
all the time, less important files some of the time, and the least
important files only if resources are available.

Here are some qualities of a file that would make it more important (a
rough scoring sketch follows these lists):

- Frequently changed
- Recently changed
- The type of file the user recently used or frequently uses
- Some type of document or song or image
- Non-spam email messages
- Something the user sees directly (MyDocuments) rather than something
  the system uses (/var/log/httpd)
- Help files
- A favorite, or near a favorite

Here are some qualities that would make a file less important:

- Log files
- System or user configuration files that are meant to be hidden from
  casual users
- Cache files
- Backup files
- A type the user never uses or only rarely uses
- Never or only rarely viewed by the user
- Rarely or never changed, and only changed by the system
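I imagine folding those qualities into a single number the indexer can
sort by. All of the weights and helper lists below are invented for
illustration:

    import os
    import time

    FAVORED_DIRS = ("/home/user/Documents",)         # placeholder
    BORING_SUFFIXES = (".log", ".bak", ".tmp", "~")  # logs, backups

    def importance(path, rate_of_change, opens_per_week):
        score = 0.0

        # Recently changed files float to the top.
        days_old = (time.time() - os.stat(path).st_mtime) / 86400.0
        score += max(0.0, 10.0 - days_old)

        # So do frequently changed and frequently used files.
        score += 5.0 * rate_of_change   # changes/day, from the index
        score += 5.0 * opens_per_week   # from usage tracking, if any

        # Things the user sees directly beat things the system uses.
        for d in FAVORED_DIRS:
            if path.startswith(d):
                score += 10.0
        if path.startswith("/var/log/"):
            score -= 10.0
        for suffix in BORING_SUFFIXES:
            if path.endswith(suffix):
                score -= 10.0

        return score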
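And the lazy-checking loop itself could be as dumb as this: whenever the
machine looks idle, rescan whichever directory needs it most, pause a
moment, then check another. The load-average test and the priority
formula are just stand-ins:

    import os
    import time

    def priority(last_update, rate_of_change):
        # Directories that change often and haven't been checked in a
        # while need attention first.
        return (time.time() - last_update) * (1.0 + rate_of_change)

    def machine_is_busy():
        return os.getloadavg()[0] > 0.5  # crude: 1-minute load average

    def lazy_update(index, reindex):
        # index:   {directory: (last_update, rate_of_change)}
        # reindex: callback that rescans a single directory
        while True:
            if not machine_is_busy():
                stalest = max(index, key=lambda d: priority(*index[d]))
                reindex(stalest)
                index[stalest] = (time.time(), index[stalest][1])
            time.sleep(5)  # pause a moment, then check another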
> - User Accessibility -
>
> To help people know what is causing thrashing, there would be a
> systray icon showing if it was doing updates, and allowing a window to
> be displayed showing how up-to-date various portions of the index are.
> This also would allow users to force updates and to set some
> directories to specific priorities, if they really wanted to. (Root
> privileges would be needed for some of those activities, if the
> indexer ran as root.)
>

No, I think this should be something inherent in the system. It should
run at a lower priority and be as unobtrusive as possible, and a casual
user should not even know about it. How long did it take you to figure
out that 'locate' has an index and that the index is updated every
night? When you did find out, did you modify the way it worked? This
system should be the same way. Power users may want to modify the
caching system, or disable it completely. System administrators should
have the options readily available, for deployed desktops and the like.

> - Usage Enhancements -
>
> Search speed can be improved by optimizing where plugins search and
> all. I won't go into that here.
>

Yes, this is pretty obvious. I believe the #1 reason search is rarely
used, or not used at all, is that it is so slow. This will make it much,
much faster, and provide better parameters and more relevant results.

> - Metadata -
>
> Metadata could be checked by having programs throw information to the
> indexer when they get it. The indexer would get metadata about all
> music if JuK was used. It would get information whenever an image was
> previewed. It would get information from the details and information
> views in Konqueror. RPM databases could keep it up-to-date.
>

I don't think we can reasonably rely on application programmers to
remember to throw metadata at the system. I doubt we can get application
programmers to even *agree* on what metadata to supply or what
parameters to supply it with. Therefore, the metadata should be
discovered outside of the application.

User-supplied metadata could be very useful. Exactly how to do this
without complicating the system is problematic, though. If we modified
the file-save dialog so that you could specify metadata for any file,
that would be a start. I can barely speculate what kind of metadata
people will want to use in the future, so we'll have to start simple.
Remember, we already have "favorites" and their ilk as metadata.

> - Removable/Remote Filesystems - (this isn't on the page yet)
>
> The index would keep track of removed media and remote filesystems for
> a while, but would keep them in separate indexes. This way, they
> wouldn't need to sit in memory. Depending on user requests, it could
> give more longevity to the indexes for some mediums. This could
> potentially be interworked with user tools to be very powerful (I
> would want it hooked into a CD cataloging program).
>

A wonderful idea! It could keep track of CDs you have loaded; then, when
searching, it would say, "That song/file/image/whatever is on your XYZ
CD," even if the CD isn't in the drive or hasn't been accessed for
months or even years. I don't think the indexing information for a 700MB
CD is going to amount to more than kilobytes, so storing it on the hard
drive won't be a problem.

> [1]
> http://localhost/designs/ubiquitous_searching/index.html#implementation
>

Not on my localhost... unfortunately. Could you provide a more public
URL?

One thing to wrap it all up: I'm really looking far into the future
here. I don't see a way to reasonably implement any of the above today.
Maybe you are arguing more for today and reasonable expectations. I
think you are right in that sense - we need to start somewhere. Shall we
discuss "today" and a way to get something working now that would be
useful and a step in the right direction? I think I will start coding up
an index table in PostgreSQL to start experimenting.

-- 
Jonathan Gardner
jgardner@jonathangardner.net

_______________________________________________
kde-usability mailing list
kde-usability@kde.org
https://mail.kde.org/mailman/listinfo/kde-usability