On Tuesday 01 June 2004 06:46 pm, Jamethiel Knorth wrote:
> > From: Jonathan Gardner
> > Date: Tue, 1 Jun 2004 15:05:42 -0700
> [snip]
>
> An index of the entire filesystem cannot be maintained without wasting
> a ton of computer power. Google does it by having thousands of
> computers working constantly. Do not compare to Google.
>

We have two things working for us on the desktop of a typical user:
unused computing power and unused hard drive space. While Google indexes
billions of pages, we'll only be indexing tens of thousands of files. So
the comparison to Google is not too far off: they have thousands of
computers, but they also do roughly thousands of times more work than we
need to do for a typical desktop.

I think we need to start capitalizing on these extra resources. We need
to start preparing for the future, when we have THz processors, TB of
storage space that goes unused, and TB of unused RAM. In a small way, we
are already there.

> > Creating the index is the problem. A backend process should be
> > constantly running at a low priority. Initially, it indexes all the
> > files. Then, it begins to index files as they change. It always
> > keeps a fresh index of the most important files, and gets around to
> > less important files when it has the time.
>
> It is hard to do a good job tracking what changes without
> significantly slowing the system down.
>

Can't we have a kernel hook or module that will allow us to watch what
has changed? Perhaps it could keep a database of recent changes so we
can ask it, "What has changed since I last checked?" We do this in
databases without any problem. Maybe this whole system would have to run
at a lower level than typical applications, as it would need to be tied
so closely to the kernel.
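Something like this already half-exists in Linux as the dnotify hook
(the F_NOTIFY fcntl). Here is a rough Python sketch of the kind of thing
I mean -- the watch directory and the in-memory change log are only
placeholders, and a real daemon would watch every directory and keep the
log in a proper database:

    import fcntl
    import os
    import signal
    import time

    WATCH_DIR = "/home/user/documents"  # placeholder directory

    change_log = []  # list of (timestamp, directory) events

    def on_change(signum, frame):
        # dnotify tells us *that* the directory changed, not which
        # file; the indexer would rescan and diff against its index.
        change_log.append((time.time(), WATCH_DIR))

    signal.signal(signal.SIGIO, on_change)

    # Ask the kernel to send SIGIO whenever the directory changes.
    fd = os.open(WATCH_DIR, os.O_RDONLY)
    fcntl.fcntl(fd, fcntl.F_NOTIFY,
                fcntl.DN_MODIFY | fcntl.DN_CREATE | fcntl.DN_DELETE |
                fcntl.DN_RENAME | fcntl.DN_MULTISHOT)

    def changes_since(last_checked):
        # Answer "what has changed since I last checked?"
        return [e for e in change_log if e[0] > last_checked]

    while True:  # the daemon would do its indexing work here
        time.sleep(60)
        for stamp, directory in changes_since(time.time() - 60):
            print("%s changed at %s" % (directory, time.ctime(stamp)))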
> Although I did kinda make harsh criticism, you probably know more
> about this than I do. This is what I have recommended on the page [1],
> truncated to a more brief form for this e-mail.
>

I don't mind harshness if it is technical, so I take no offense at your
tone. I hope the feeling is mutual. (See the note on footnote [1]
below.)

> - Indexing -
>
> Have an index which tracks file locations, but also tracks how
> up-to-date it is and what its rate of change is. It doesn't need to
> all be at the same point. Update times are recorded for a directory in
> a non-recursive manner.
>

I don't see where this is going. Perhaps you want an index of what files
have changed when, and how often they change? Perhaps you also want to
know who changed them? Maybe how they were changed (appended, small
changes, or a complete rewrite)? Maybe we could track the history of a
file? For instance, "This file was copied from that file originally."
Or, "This file was originally at X but was later moved to Y on the 5th
of May."

> Whenever a directory is searched, it is added to the index again.
> Programs would be expected to also update the index whenever they list
> a directory. To optimize for speed, such updates wouldn't always need
> to be inserted into the index immediately. So, if Konqueror opens a
> directory, it sends the list of contents off to the list of things to
> add into the index.
>

I like it: opportunistic indexing. If the file is in the memory cache,
index it right away. We may even want to index it as it is read into the
cache. Again, this hints at plugins that work lower than typical
applications.

> The actual indexer is a daemon (hopefully run with higher privileges
> so that all users can share). When the daemon indexes a directory, it
> notes how much it changed since the last time, and how long between
> updates, and calculates the rate of change.
>
> - Lazy Checking -
>
> When updating, do updates according to what needs it most. This is
> determined according to rate-of-change and time-of-last-update.
>
> This would be done slowly and lazily. Whenever the system has extra
> resources, another directory would be checked, then the indexer would
> pause a moment, then check another.
>

I think I tried to put this into the "file importance" thing I
originally proposed. Some files are more important to index than others.
The system should keep an up-to-date index of the most important files
all the time, less important files some of the time, and the least
important files only if resources are available.

Here are some qualities of a file that would make it more important (a
rough scoring sketch follows these lists):

- Frequently changed
- Recently changed
- The type of file the user recently used or frequently uses
- Some type of document or song or image
- Non-spam email messages
- Something the user sees directly (MyDocuments) rather than something
  the system uses (/var/log/httpd)
- Help files
- A favorite, or near a favorite

Here are some qualities that would make a file less important:

- Log files
- System or user configuration files that are meant to be hidden from
  casual users
- Cache files
- Backup files
- A type the user never uses or only rarely uses
- Never or only rarely viewed by the user
- Rarely or never changed, and only changed by the system
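I imagine folding those qualities into a single number the indexer can
sort by. All of the weights and helper lists below are invented for
illustration:

    import os
    import time

    FAVORED_DIRS = ("/home/user/Documents",)         # placeholder
    BORING_SUFFIXES = (".log", ".bak", ".tmp", "~")  # logs, backups

    def importance(path, rate_of_change, opens_per_week):
        score = 0.0

        # Recently changed files float to the top.
        days_old = (time.time() - os.stat(path).st_mtime) / 86400.0
        score += max(0.0, 10.0 - days_old)

        # So do frequently changed and frequently used files.
        score += 5.0 * rate_of_change   # changes/day, from the index
        score += 5.0 * opens_per_week   # from usage tracking, if any

        # Things the user sees directly beat things the system uses.
        for d in FAVORED_DIRS:
            if path.startswith(d):
                score += 10.0
        if path.startswith("/var/log/"):
            score -= 10.0
        for suffix in BORING_SUFFIXES:
            if path.endswith(suffix):
                score -= 10.0

        return score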
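And the lazy-checking loop itself could be as dumb as this: whenever the
machine looks idle, rescan whichever directory needs it most, pause a
moment, then check another. The load-average test and the priority
formula are just stand-ins:

    import os
    import time

    def priority(last_update, rate_of_change):
        # Directories that change often and haven't been checked in a
        # while need attention first.
        return (time.time() - last_update) * (1.0 + rate_of_change)

    def machine_is_busy():
        return os.getloadavg()[0] > 0.5  # crude: 1-minute load average

    def lazy_update(index, reindex):
        # index:   {directory: (last_update, rate_of_change)}
        # reindex: callback that rescans a single directory
        while True:
            if not machine_is_busy():
                stalest = max(index, key=lambda d: priority(*index[d]))
                reindex(stalest)
                index[stalest] = (time.time(), index[stalest][1])
            time.sleep(5)  # pause a moment, then check another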
> - User Accessibility -
>
> To help people know what is causing thrashing, there would be a
> systray icon showing if it was doing updates, and allowing a window to
> be displayed showing how up-to-date various portions of the index are.
> This also would allow users to force updates and to set some
> directories to specific priorities, if they really wanted to. (Root
> privileges would be needed for some of those activities, if the
> indexer ran as root.)
>

No, I think this should be something inherent in the system. It should
run at a lower priority and be as unobtrusive as possible, and a casual
user should not even know about it. How long did it take you to figure
out that 'locate' has an index and that the index is updated every
night? When you did find out, did you modify the way it worked? This
system should be the same way. Power users may want to modify the
caching system, or disable it completely. System administrators should
have the options readily available, for deployed desktops and the like.

> - Usage Enhancements -
>
> Search speed can be improved by optimizing where plugins search and
> all. I won't go into that here.
>

Yes, this is pretty obvious. I believe the #1 reason search is rarely
used, or not used at all, is that it is so slow. This will make it much,
much faster, and provide better parameters and more relevant results.

> - Metadata -
>
> Metadata could be checked by having programs throw information to the
> indexer when they get it. The indexer would get metadata about all
> music if JuK was used. It would get information whenever an image was
> previewed. It would get information from the details and information
> views in Konqueror. RPM databases could keep it up-to-date.
>

I don't think we can reasonably rely on application programmers to
remember to throw metadata at the system. I doubt we can get application
programmers to even *agree* on what metadata to supply or what
parameters to supply it with. Therefore, the metadata should be
discovered outside of the application.

User-supplied metadata could be very useful. Exactly how to do this
without complicating the system is problematic, though. If we modified
the file-save dialog so that you could specify metadata for any file,
that would be a start. I can barely speculate what kind of metadata
people will want to use in the future, so we'll have to start simple.
Remember, we already have "favorites" and their ilk as metadata.

> - Removable/Remote Filesystems - (this isn't on the page yet)
>
> The index would keep track of removed media and remote filesystems for
> a while, but would keep them in separate indexes. This way, they
> wouldn't need to sit in memory. Depending on user requests, it could
> give more longevity to the indexes for some mediums. This could
> potentially be interworked with user tools to be very powerful (I
> would want it hooked into a CD cataloging program).
>

A wonderful idea! It could keep track of CDs you have loaded; then, when
searching, it would say, "That song/file/image/whatever is on your XYZ
CD," even if the CD isn't in the drive or hasn't been accessed for
months or even years. I don't think the indexing information for a 700MB
CD is going to amount to more than kilobytes, so storing it on the hard
drive won't be a problem.

> [1]
> http://localhost/designs/ubiquitous_searching/index.html#implementation
>

Not on my localhost... unfortunately. Could you provide a more public
URL?

One thing to wrap it all up: I'm really looking far into the future
here. I don't see a way to reasonably implement any of the above today.
Maybe you are arguing more for today and reasonable expectations. I
think you are right in that sense - we need to start somewhere. Shall we
discuss "today" and a way to get something working now that would be
useful and a step in the right direction? I think I will start coding up
an index table in PostgreSQL to start experimenting.

-- 
Jonathan Gardner
jgardner@jonathangardner.net

_______________________________________________
kde-usability mailing list
kde-usability@kde.org
https://mail.kde.org/mailman/listinfo/kde-usability