On Thursday 25 October 2012 12:20:42 Mark wrote:
> So i would end up with a:
>         struct SortEntry
>         {
>                 QByteArray collatingKey;
>                 KFileItem *fileItem;
>         };
>
> Where the collatingKey is meant to be what?

Let's start with comparing using strcmp. It works like this, omitting
details:

	while (*s1 && *s2 && *s1 == *s2) {
		++s1;
		++s2;
	}
	return *s1 - *s2;

As you can see, the code comparing each character is very fast, but it
does not handle anything special. In particular, it does not
- compare case-insensitively
- sort German 'ä' after 'a' but before 'b' ("locale aware")
- sort 'a12' after 'a2' ("natural sorting")

The solution to the first inability is known to you: You convert each
string to lower case before comparing them. This converted string is
the actual "sort key", i.e. what you use for the sort's "lessThan"
functor:

	'A' -> 'a'
	'B' -> 'b'
	'a' -> 'a'

Now if you sort ('A', 'B', 'a') by these keys with a stable sort, you
get ('A', 'a', 'B') after looking up the actual item which you stored
within the "SortEntry".
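
A minimal sketch of that idea, not the actual Dolphin code: I use
std::string in place of QByteArray and of the KFileItem pointer so it
compiles without Qt, and the helper names lowerCaseKey() and
sortedCaseInsensitive() are made up for this mail. The lowercasing is
ASCII-only.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Stand-in for the SortEntry from the quoted mail (assumption: plain
// std::string instead of QByteArray and KFileItem*).
struct SortEntry {
    std::string collatingKey; // what the sort compares
    std::string name;         // placeholder for the real file item
};

// Build the key once per item: here simply the lowercased name.
static std::string lowerCaseKey(const std::string &name)
{
    std::string key(name);
    std::transform(key.begin(), key.end(), key.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return key;
}

// Sort by the precomputed keys, then hand back the original names.
static std::vector<std::string> sortedCaseInsensitive(const std::vector<std::string> &names)
{
    std::vector<SortEntry> entries;
    for (const std::string &n : names)
        entries.push_back({lowerCaseKey(n), n});

    // The stable sort keeps items with equal keys in their input order.
    std::stable_sort(entries.begin(), entries.end(),
                     [](const SortEntry &a, const SortEntry &b) {
                         return a.collatingKey < b.collatingKey;
                     });

    std::vector<std::string> result;
    for (const SortEntry &e : entries)
        result.push_back(e.name);
    return result;
}
```

The point is that the (comparatively slow) conversion runs once per
item, while the comparison inside the sort stays a plain byte compare.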

This very same idea can be applied to tackle the second inability. You
convert each string to its locale-dependent "collating" string, often
just called a "sequence", because it usually contains bytes that are
not characters. Here are example keys that sort 'a' < 'ä' < 'b':

	'a' ->	'a'
	'b' -> 'b'
	'ä' -> 'a' '\377'

The appended 0xFF character ensures that 'ä' is always after 'a',
regardless of which other character follows. But it will never be
after 'b', because of the first 'a' in the sort key.

If you look up the Unicode Collation Algorithm, you will see that it
is much more complicated, but the basic idea is the same. It should
not bother you for the initial version; you can simply use the C
library (glibc) function "strxfrm()" to get the collating sequence for
those string parts on which you would otherwise call
"localeAwareCompare".
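
Roughly like this, as a sketch only: the wrapper name collatingKey()
is made up, and I return a std::string where the real code would
probably fill a QByteArray.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Returns the strxfrm() key of `text` for the current LC_COLLATE locale.
static std::string collatingKey(const std::string &text)
{
    // With a null buffer and size 0, strxfrm() only reports the length
    // the transformed string needs (without the terminating NUL).
    const std::size_t length = std::strxfrm(nullptr, text.c_str(), 0);

    std::string key(length + 1, '\0');           // room for the NUL
    std::strxfrm(&key[0], text.c_str(), key.size());
    key.resize(length);                          // drop the NUL again
    return key;
}
```

Comparing two such keys bytewise (strcmp) gives the same order as
strcoll() on the original strings. The keys depend on the locale, so
the application must have called setlocale() for LC_COLLATE (or
LC_ALL) first; in the default "C" locale you just get strcmp order.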

The last inability can be solved in the same way as well. Simply
_prepend_ to each run of digits a code which states its magnitude
(number of digits). Example:

	'a1234' -> 'a' '\004' '1234'
	'a56' -> 'a' '\002' '56'
	'a8' -> 'a' '\001' '8'
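
A sketch of such a key generator, under simplifying assumptions: the
name naturalKey() is made up, the magnitude is stored in a single
byte (so at most 255 digits per number), and leading zeros are not
treated specially.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Copies `text`, but puts a one-byte digit count in front of every
// run of digits, so that shorter numbers sort before longer ones.
static std::string naturalKey(const std::string &text)
{
    std::string key;
    std::size_t pos = 0;
    while (pos < text.size()) {
        // Find the end of the digit run starting at pos (may be empty).
        std::size_t end = pos;
        while (end < text.size() && std::isdigit(static_cast<unsigned char>(text[end])))
            ++end;

        if (end == pos) {
            key += text[pos++];                  // ordinary character
            continue;
        }

        const std::size_t digits = end - pos;
        assert(digits <= 255);                   // must fit into one byte here
        key += static_cast<char>(digits);        // the magnitude code
        key.append(text, pos, digits);           // then the digits themselves
        pos = end;
    }
    return key;
}
```

Numbers with the same digit count still compare correctly, because
for equal lengths the bytewise order of the digits is the numeric
order.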

A combined algorithm could, for example, return these collating
sequences as sort keys:

	'a12' -> 'a' '\002' '12'
	'ä8' -> 'a' '\377' '\001' '8'
	'A6' -> 'a' '\001' '6'
	'Ä56' -> 'a' '\377' '\002' '56'

If you sort by those keys, you get ('A6', 'a12', 'ä8', 'Ä56'), which
might be what you want, but you could also want ('A6', 'ä8', 'a12',
'Ä56'), which would require a different algorithm for generating the
sort key.
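
For illustration, here is a toy generator that produces exactly the
keys from the table above. It is only a sketch: combinedKey() and
sortedByCombinedKey() are made-up names, the input is assumed to be
UTF-8, and the only non-ASCII characters it knows are 'ä' and 'Ä'.
Real code would get the locale part from strxfrm() or similar.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy key: lowercase ASCII, 'ä'/'Ä' -> 'a' 0xFF, digit runs get a
// one-byte digit count prepended.
static std::string combinedKey(const std::string &text)
{
    std::string key;
    std::size_t pos = 0;
    while (pos < text.size()) {
        const unsigned char c = static_cast<unsigned char>(text[pos]);
        const unsigned char next = pos + 1 < text.size()
            ? static_cast<unsigned char>(text[pos + 1]) : 0;
        if (c == 0xC3 && (next == 0xA4 || next == 0x84)) {
            key += 'a';                          // UTF-8 'ä' or 'Ä'
            key += '\377';
            pos += 2;
        } else if (std::isdigit(c)) {
            std::size_t end = pos;
            while (end < text.size() && std::isdigit(static_cast<unsigned char>(text[end])))
                ++end;
            assert(end - pos <= 255);            // magnitude is one byte
            key += static_cast<char>(end - pos);
            key.append(text, pos, end - pos);
            pos = end;
        } else {
            key += static_cast<char>(std::tolower(c));
            ++pos;
        }
    }
    return key;
}

// Computes each key once, sorts by key, returns the names in order.
static std::vector<std::string> sortedByCombinedKey(const std::vector<std::string> &names)
{
    typedef std::pair<std::string, std::string> Entry; // (key, name)
    std::vector<Entry> entries;
    for (const std::string &n : names)
        entries.push_back(Entry(combinedKey(n), n));
    std::stable_sort(entries.begin(), entries.end(),
                     [](const Entry &a, const Entry &b) { return a.first < b.first; });
    std::vector<std::string> result;
    for (const Entry &e : entries)
        result.push_back(e.second);
    return result;
}
```

To get the other order, the generator would have to emit the number
part before the 0xFF marker, so that the number is compared first.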

Christoph Feck (kdepepo)