'Re: Ideas for improving sorting. Goal: under 1 second for 200.000 files.'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       kfm-devel
Subject:    Re: Ideas for improving sorting. Goal: under 1 second for 200.000 files.
From:       Mark <markg85 () gmail ! com>
Date:       2012-10-25 20:15:41
Message-ID: CAPd6JnHwZOQomnFr4RMJO-LE54QdBO9GO=coKZA_a=ZG3r9yXw () mail ! gmail ! com
[Download RAW message or body]

On Thu, Oct 25, 2012 at 8:24 PM, Mark <markg85@gmail.com> wrote:
> On Thu, Oct 25, 2012 at 2:09 PM, Christoph Feck <christoph@maxiom.de> wrote:
> > On Thursday 25 October 2012 12:20:42 Mark wrote:
> > > So i would end up with a:
> > > struct SortEntry
> > > {
> > > QByteArray collatingKey;
> > > KFileItem *fileItem;
> > > };
> > > 
> > > Where the collatingKey is meant to be what?
> > 
> > Let's start with comparing using strcmp. It works like this, omitting
> > details:
> > 
> > while (*s1 && *s2 && *s1 == *s2) {
> > ++s1;
> > ++s2;
> > }
> > return *s1 - *s2;
> > 
> > As you see, the code comparing each character is very fast, but does
> > not handle anything special, in particular it does not
> > - compare case insensitive
> > - sort german 'ä' after 'a' but before 'b' ("locale aware")
> > - sort 'a12' after 'a2' ("natural sorting")
> > 
> > The solution to the first inability is known to you: You convert each
> > string to lower case before comparing them. This converted string is
> > the actual "sort key", i.e. what you use for the sort's "lessThan"
> > functor:
> > 
> > 'A' -> 'a'
> > 'B' -> 'b'
> > 'a' -> 'a'
> > 
> > Now if you sort ('A', 'B', 'a'), you get ('A', 'a', 'B') after looking
> > up the actual item which you stored within the "SortEntry".
> 
> Till this point i fully understand it.
> > 
> > This very same idea can be applied to tackle the second inability. You
> > convert each string to its locale-dependend "collating" string, often
> > just called a "sequence", because it ususally does contain non-
> > characters. Here are example keys that sort 'a' < 'ä' < 'b':
> > 
> > 'a' ->  'a'
> > 'b' -> 'b'
> > 'ä' -> 'a' '\0377'
> > 
> > The appended 0xFF character ensures that 'ä' is always after 'a',
> > regardless of which other character follows. But it will never be
> > after 'b', because of the first 'a' in the sort key.
> 
> This is where things get odd... So, i should place 'ä' before 'a', but
> that means that i have to make a translation table somewhere from
> latin chars to alphabet chars.. Or am i misunderstanding something
> here?
> 
> There must be some automatic way to do this, right?
> 
> > 
> > If you look up the Unicode collating algorithm, you will see that it
> > is much more complicated, but the basic idea is the same. It should
> > not bother you for the initial version; you can simply use glibc
> > function "strxfrm()" to get the collating sequence for your string
> > parts where you would call "localeAwareCompare" on.
> > 
> > Also the last inability can be solved in the same way. Simply
> > _prepend_ a code which states the magnitude (number of digits).
> > Example:
> > 
> > 'a1234' -> 'a' \004' '123'
> > 'a56' -> 'a' '\002' '56'
> > 'a8' -> 'a' '\001' '8'
> > 
> > A combined algorithm could, for example, return these collating
> > sequences as sort keys:
> > 
> > 'a12' -> 'a' '\002' '12'
> > 'ä8' -> 'a' '\0377' '\001' '8'
> > 'A6' -> 'a' \001' '6'
> > 'Ä56' -> 'a' '\0377' '\002' '56'
> > 
> > If you sort by those keys, you get ('A6', 'a12', 'ä8', 'ä56'), which
> > might be what you want, but you could also want ('A6', 'ä8', 'a12',
> > 'ä56'), which would require a different algorithm for generating the
> > sort key.
> > 
> > Christoph Feck (kdepepo)
> 
> Lets skip the third issue for the time being, the locate aware stuff
> seems "slightly" more important then the natural part.
> 
> So how about this stuff in the implementations side?
> Am i right that i'm going to get 2 new parts:
> 
> 1. A part for building up the collation sequence. That part is outside
> the sorting algorithm.
> 2. The sorting algorithm becomes a LOT less code and simply does
> string compares based on the collation sequences. So no more difficult
> checks like are done now. Thus the sorting would indeed speed up
> massively, right?
> 
> For Todd and Christoph, thanks a lot for writing such long informative
> replies! I really appreciate that and it really helps me understand
> how to implement this. Though i obviously still have questions ;)
> 
> Cheers,
> Mark

I'm guessing it won't be accepted to use QCollater? ;-) (Qt 5.1 material)
http://qt.gitorious.org/qt/qtbase/blobs/1bf5865a4137c615e7b9b10fd157a32e2bf61274/src/corelib/tools/qcollator.cpp



[prev in list] [next in list] [prev in thread] [next in thread]
Configure | About | News | Add a list | Sponsored by KoreLogic