'Re: Ideas for improving sorting. Goal: under 1 second for 200.000 files.'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       kfm-devel
Subject:    Re: Ideas for improving sorting. Goal: under 1 second for 200.000 files.
From:       Frank Reininghaus <frank78ac () googlemail ! com>
Date:       2012-10-29 8:42:29
Message-ID: CAFoZWWgMSsfRMn6TDEL0iMocWvPnE0fJmrE-05OPBBp2xrSsnQ () mail ! gmail ! com
[Download RAW message or body]

Hi Mark,

2012/10/28 Mark:
[...]
> http://paste.kde.org/584534/
> 
> I already had the nagging feeling that the QChar->unicode() stuff was
> making it more complicated then it had to bee in the lessThan
> function. So i converted it all to ints. You can see that in the above
> paste.
> 
> Before sorting
> "0000.txt"
> "1a2b3a4.txt"
> "a1.1001.txt"
> "a10.txt"
> "a6.txt"
> "a3.txt"
> "a8.txt"
> "a1.txt"
> "a2.txt"
> "a9.txt"
> "a5.txt"
> "a4.txt"
> "a14.txt"
> "a11.txt"
> "a12.txt"
> "a1.2.txt"
> "a1.11.txt"
> "a7.txt"
> 
> After sorting
> "0000.txt"
> "1a2b3a4.txt"
> "a1.txt"
> "a1.2.txt"
> "a1.11.txt"
> "a1.1001.txt"
> "a2.txt"
> "a3.txt"
> "a4.txt"
> "a5.txt"
> "a6.txt"
> "a7.txt"
> "a8.txt"
> "a9.txt"
> "a10.txt"
> "a11.txt"
> "a12.txt"
> "a14.txt"

Looks like a reasonable sorting of these files, yes :-)

> Victory i guess :)
> Well, i actually think that a call for victory might be too soon. I
> still have to test this on dolphin itself. Which i'm going to do right
> now.

Another thing that might be worth checking is if your comparison
algorithm passes all unit tests:

https://projects.kde.org/projects/kde/kdelibs/repository/revisions/master/entry/kdecore/tests/kstringhandlertest.cpp#L52


There's some pretty tricky stuff in there, but any candidate for a new
comparison function must pass these, of course - we obviously don't
want regressions.

Another thing that might be worth trying is to take a very large set
of files (like all files on your hard drive) and check if they are
sorted equally using the existing naturalCompare() function and your
algorithm. If that is not the case, find out two files for which the
lessThan() results differ, add a new unit test, then try to fix it.

I haven't checked every detail of your code yet, but I'm not sure if
replacing digits by ':' in the collation sequence is a good idea. This
is only asking for trouble with file names which contain ':' and might
also cause other issues. Christoph's suggestion to prepend numbers
with their number of digits (which probably has to be done even
recursively to make sure that numbers with 10 digits are really
considered larger than a number with 9 digits) looks better to me.

Best regards,
Frank


[prev in list] [next in list] [prev in thread] [next in thread]
Configure | About | News | Add a list | Sponsored by KoreLogic