[prev in list] [next in list] [prev in thread] [next in thread]
List: kfm-devel
Subject: Re: Ideas for improving sorting. Goal: under 1 second for 200.000 files.
From: Christoph Feck <christoph () maxiom ! de>
Date: 2012-10-25 12:09:31
Message-ID: 201210251409.31686.christoph () maxiom ! de
[Download RAW message or body]
On Thursday 25 October 2012 12:20:42 Mark wrote:
> So i would end up with a:
> struct SortEntry
> {
> QByteArray collatingKey;
> KFileItem *fileItem;
> };
>
> Where the collatingKey is meant to be what?
Let's start with comparing using strcmp. It works like this, omitting
details:
while (*s1 && *s2 && *s1 == *s2) {
++s1;
++s2;
}
return *s1 - *s2;
As you see, the code comparing each character is very fast, but does
not handle anything special, in particular it does not
- compare case insensitive
- sort german 'ä' after 'a' but before 'b' ("locale aware")
- sort 'a12' after 'a2' ("natural sorting")
The solution to the first inability is known to you: You convert each
string to lower case before comparing them. This converted string is
the actual "sort key", i.e. what you use for the sort's "lessThan"
functor:
'A' -> 'a'
'B' -> 'b'
'a' -> 'a'
Now if you sort ('A', 'B', 'a'), you get ('A', 'a', 'B') after looking
up the actual item which you stored within the "SortEntry".
This very same idea can be applied to tackle the second inability. You
convert each string to its locale-dependend "collating" string, often
just called a "sequence", because it ususally does contain non-
characters. Here are example keys that sort 'a' < 'ä' < 'b':
'a' -> 'a'
'b' -> 'b'
'ä' -> 'a' '\0377'
The appended 0xFF character ensures that 'ä' is always after 'a',
regardless of which other character follows. But it will never be
after 'b', because of the first 'a' in the sort key.
If you look up the Unicode collating algorithm, you will see that it
is much more complicated, but the basic idea is the same. It should
not bother you for the initial version; you can simply use glibc
function "strxfrm()" to get the collating sequence for your string
parts where you would call "localeAwareCompare" on.
Also the last inability can be solved in the same way. Simply
_prepend_ a code which states the magnitude (number of digits).
Example:
'a1234' -> 'a' \004' '123'
'a56' -> 'a' '\002' '56'
'a8' -> 'a' '\001' '8'
A combined algorithm could, for example, return these collating
sequences as sort keys:
'a12' -> 'a' '\002' '12'
'ä8' -> 'a' '\0377' '\001' '8'
'A6' -> 'a' \001' '6'
'Ä56' -> 'a' '\0377' '\002' '56'
If you sort by those keys, you get ('A6', 'a12', 'ä8', 'ä56'), which
might be what you want, but you could also want ('A6', 'ä8', 'a12',
'ä56'), which would require a different algorithm for generating the
sort key.
Christoph Feck (kdepepo)
[prev in list] [next in list] [prev in thread] [next in thread]
Configure |
About |
News |
Add a list |
Sponsored by KoreLogic