From kde-usability  Wed Jun 09 18:38:00 2004
From: Manuel Amador <rudd-o () amautacorp ! com>
Date: Wed, 09 Jun 2004 18:38:00 +0000
To: kde-usability
Subject: Re: Easier Searching in KDE
Message-Id: <1086806279.30180.5.camel () localhost ! localdomain>
X-MARC-Message: https://marc.info/?l=kde-usability&m=108681043602458
MIME-Version: 1
Content-Type: multipart/mixed; boundary="--===============0820608578=="


--===============0820608578==
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature";
	boundary="=-/i3tLz0ktmKtC0B1j0i7"


--=-/i3tLz0ktmKtC0B1j0i7
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On Mon, 07-06-2004 at 18:11, Gustavo Sverzut Barbieri wrote:

> > I really don't like the way the Windows search works: It generates results
> > slowly, and you don't really have an idea of how much is left to search
> > over. I want results right away, like Google. I want the results catalogued
> > and listed by relevance and user preferences.
>
> Problem with Google caching is:
> 	- we don't have a cluster to be frequently updating indexes without
> impacting the computer usage
> 	- loss of sync between the cache and the real system. It really matters when
> you come to messing with your home dir, when you may create/delete/move files
> really fast.

I have done some calculations with some code. Assuming you have a file
indexing daemon which uses a notification system and extracts metadata
(and a small portion of data, say, for a full-text indexing database),
you'd be looking at one or two seconds of indexing per file, even with
the daemon running at the lowest priority (nice 19).
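
As a rough illustration of such a daemon, here is a minimal Python
sketch. Everything in it is an assumption made for the example: the
notification system is stood in for by a plain queue of changed paths,
the "small portion of data" is the first 4096 bytes, and the index is
an in-memory dict. It is not the code the timings above came from.

```python
import os
import queue

SNIPPET_BYTES = 4096  # assumed size of the "small portion of data" kept per file


def extract_record(path):
    """Return basic metadata plus a leading text snippet for one file."""
    st = os.stat(path)
    with open(path, "rb") as fh:
        snippet = fh.read(SNIPPET_BYTES).decode("utf-8", errors="replace")
    return {"path": path, "size": st.st_size, "mtime": st.st_mtime, "snippet": snippet}


def drain_notifications(events, index):
    """Index every path currently waiting in the notification queue.

    `events` is a hypothetical stand-in for the notification system:
    a queue.Queue holding the paths of changed files.
    Returns the number of files that were indexed.
    """
    indexed = 0
    while True:
        try:
            path = events.get_nowait()
        except queue.Empty:
            return indexed
        try:
            index[path] = extract_record(path)
            indexed += 1
        except OSError:
            # The file vanished or is unreadable; drop any stale entry.
            index.pop(path, None)


def run_daemon_once(events, index):
    """Lower our priority as far as possible, then process pending events."""
    os.nice(19)  # Unix only: adds 19 to the niceness, which the OS caps at its maximum
    return drain_notifications(events, index)
```

A real daemon would block on the notification source instead of
draining a queue once, and would persist the index to disk.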

That ain't slow.  Coupled with a register/notification
(observer/observable pattern for queries) system for a hypothetical
search daemon, you could add a word "Casaperrogato" to one of your
OpenOffice documents, and see it appear in a search window in approx. 10
seconds.
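
To make the register/notification idea concrete, here is a small Python
sketch of how a search window could subscribe to a query. The class and
method names (SearchDaemon, register, update) are hypothetical, and
matching is reduced to whole-word lookup for the sake of the example.

```python
class SearchDaemon:
    """Toy search daemon: observers register a word and get told of new matches."""

    def __init__(self):
        self._docs = {}       # path -> set of lowercase words in the document
        self._observers = {}  # query word -> list of callbacks

    def register(self, word, callback):
        """Subscribe callback(path) to future matches; return the current matches."""
        word = word.lower()
        self._observers.setdefault(word, []).append(callback)
        return sorted(p for p, words in self._docs.items() if word in words)

    def update(self, path, text):
        """Called by the indexer whenever a document's text has been (re)indexed."""
        new_words = set(text.lower().split())
        old_words = self._docs.get(path, set())
        self._docs[path] = new_words
        # Only words that were not in the document before trigger a notification.
        for word in new_words - old_words:
            for callback in self._observers.get(word, []):
                callback(path)
```

A search window would call register("Casaperrogato", show_hit) once,
display the returned list, and then receive show_hit(path) as soon as
the indexer calls update() for a document that newly contains the word.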

That's much more immediate than Windows search, and it is possible.

For the problem of moving files, ideally you'd store an MD5 hash of
each file's contents as a primary key in the index, so that files which
get moved around are recognized instead of reindexed.  Alternatively (or
primarily), the indexing daemon could attach an Extended Attribute
named, say, "File ID" to each file.  It would act like an object
identifier, so the search daemon would recognize the file everywhere on
your FS (even NFS), regardless of file contents.
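
The hash-as-primary-key variant can be sketched in a few lines of
Python. The HashKeyedIndex class and its return values are invented for
this illustration, and the extended-attribute variant is not shown.

```python
import hashlib
import os


def file_digest(path, chunk_size=65536):
    """Return the MD5 hex digest of a file's contents, read in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


class HashKeyedIndex:
    """Index keyed by content hash, so a moved file keeps its record."""

    def __init__(self):
        self.records = {}  # digest -> {"path": last known location}

    def observe(self, path):
        """Record a file; return "indexed", "moved" or "unchanged"."""
        digest = file_digest(path)
        record = self.records.get(digest)
        if record is None:
            # Unknown content: this is where the expensive indexing would run.
            self.records[digest] = {"path": os.path.abspath(path)}
            return "indexed"
        if record["path"] != os.path.abspath(path):
            # Same content at a new location: just update the path.
            record["path"] = os.path.abspath(path)
            return "moved"
        return "unchanged"
```

Note the limits of this sketch: two files with identical contents share
one record, and any edit changes the hash, which is where a stable
per-file ID would help.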

This could actually spawn a completely new filing paradigm, based on
objects and categories/contents, instead of subdirectories.

>
>
> > > 		- cache previous results, maybe they're used again soon since users
> > > often refine their search
> >
> > Unfortunately, this doesn't help the first search. The first search is
> > always more important than subsequent searches, in my opinion.
>
> I agree... but it's better than having it slow every time.
>
>
> > > 		- change kio_slaves to update a db every time a file is modified...
> > > with that we can have something fast for user and better sync'ed than
> > > slocate. The problem is other apps, like GNOME or OpenOffice.
> >
> > This may slow down the speed at which updates happen.
>
> Do you think so?
> I'm saying that when you save your files or delete them, the index must be
> updated. But that just helps KDE apps.
>
>
> > > 	All of them have problems. The real problem I see is with home dir...
> > > it's the part of the system that changes most and in short periods of
> > > time, probably the user changes and then search... that common case makes
> > > life difficult and should be optimized.
> >
> > I have been doing some work with Materialized Views in PostgreSQL. Here is
> > what I think will work with KDE.
> >
> > Google type search. Everything on disk is indexed. With an 80GB hard drive,
> > it's not a problem to have everything indexed in multiple ways. The search
> > should take a couple of seconds, max. Let the OS worry about in-memory
> > caching, etc... If there isn't enough room for the index, then we should
> > provide a weaker system like Windows where the search is done in real time,
> > only the most important files are indexed, and recent search results are
> > stored.
>
> Fair.
>
>
> > File Importance. Files in the home directory, files modified or viewed
> > frequently by the user, files in the favorites list, etc, are more
> > important than system files or log files or cache files. They should be
> > listed first and indexed first.
> >
> > Creating the index is the problem. A backend process should be constantly
> > running at a low priority. Initially, it indexes all the files. Then, it
> > begins to index files as they change. It always keeps a fresh index of the
> > most important files, and gets around to less important files when it has
> > the time.
>
> That doesn't work.
> 	- a process accessing the disk all the time will screw up OS disk caching.
> 	- IMHO, one may want to find the last file he/she accessed/saved/created
> more than anything else. So these will be the most important... and probably
> will not be indexed.
>
>
> > When a search is made, it is run completely against the index. Sure, the
> > index may be out-of-date, but the most important files will be indexed
> > almost as soon as they are modified, and the least important files may
> > never get indexed, so the results are usually relevant.
> >
> > Metadata is critical. How do we accumulate the metadata? How is it stored?
> > I don't have the slightest idea in this department. Already, we have
> > permissions, location, access/modification times, and MIME type that give
> > us good metadata. But I would like additional metadata specified by the
> > user, like "project" or "author" or "comments". Importance is another piece
> > of metadata that is useful.
> >
> > I don't think a real database (PostgreSQL or MySQL) will be appropriate for
> > the database part of the tool. The requirements for the index cache are
> > very different than what PostgreSQL provides. I do believe that we can
> > steal a lot of ideas from the database community on how to search vast
> > indexes efficiently. Perhaps early implementations will rely on a database,
> > but that should be temporary.
>
> I also don't think so.
-- 
	Manuel Amador
	Jefe de I+D                         +593 (9) 847-7372
	Amauta                     http://www.amautacorp.com/
	GNU Privacy Guard key ID: 0xC1033CAD at keyserver.net

--=-/i3tLz0ktmKtC0B1j0i7
Content-Type: application/pgp-signature; name=signature.asc
Content-Description: This part of the message is digitally signed

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)

iD8DBQBAx1kHWyznNMEDPK0RAr5zAJ0b/INRbu9N7yLKOAxgW2n4Us/teACfZmrm
mx4g/iYaJVuaDoyvFwN1O8g=
=yIYj
-----END PGP SIGNATURE-----

--=-/i3tLz0ktmKtC0B1j0i7--


--===============0820608578==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
kde-usability mailing list
kde-usability@kde.org
https://mail.kde.org/mailman/listinfo/kde-usability

--===============0820608578==--