'Re: Ideas for improving sorting. Goal: under 1 second for 200.000 files.'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       kfm-devel
Subject:    Re: Ideas for improving sorting. Goal: under 1 second for 200.000 files.
From:       Mark <markg85 () gmail ! com>
Date:       2012-10-29 12:30:10
Message-ID: CAPd6JnGtwan3KXnET6jxVkiTmGntFFnum8sbZgw4kYPO_qoyAg () mail ! gmail ! com
[Download RAW message or body]

On Mon, Oct 29, 2012 at 9:42 AM, Frank Reininghaus
<frank78ac@googlemail.com> wrote:
> Hi Mark,
> 
> 2012/10/28 Mark:
> [...]
> > http://paste.kde.org/584534/
> > 
> > I already had the nagging feeling that the QChar->unicode() stuff was
> > making it more complicated then it had to bee in the lessThan
> > function. So i converted it all to ints. You can see that in the above
> > paste.
> > 
> > Before sorting
> > "0000.txt"
> > "1a2b3a4.txt"
> > "a1.1001.txt"
> > "a10.txt"
> > "a6.txt"
> > "a3.txt"
> > "a8.txt"
> > "a1.txt"
> > "a2.txt"
> > "a9.txt"
> > "a5.txt"
> > "a4.txt"
> > "a14.txt"
> > "a11.txt"
> > "a12.txt"
> > "a1.2.txt"
> > "a1.11.txt"
> > "a7.txt"
> > 
> > After sorting
> > "0000.txt"
> > "1a2b3a4.txt"
> > "a1.txt"
> > "a1.2.txt"
> > "a1.11.txt"
> > "a1.1001.txt"
> > "a2.txt"
> > "a3.txt"
> > "a4.txt"
> > "a5.txt"
> > "a6.txt"
> > "a7.txt"
> > "a8.txt"
> > "a9.txt"
> > "a10.txt"
> > "a11.txt"
> > "a12.txt"
> > "a14.txt"
> 
> Looks like a reasonable sorting of these files, yes :-)
> 
> > Victory i guess :)
> > Well, i actually think that a call for victory might be too soon. I
> > still have to test this on dolphin itself. Which i'm going to do right
> > now.
> 
> Another thing that might be worth checking is if your comparison
> algorithm passes all unit tests:
> 
> https://projects.kde.org/projects/kde/kdelibs/repository/revisions/master/entry/kdecore/tests/kstringhandlertest.cpp#L52
>  
> There's some pretty tricky stuff in there, but any candidate for a new
> comparison function must pass these, of course - we obviously don't
> want regressions.
> 
> Another thing that might be worth trying is to take a very large set
> of files (like all files on your hard drive) and check if they are
> sorted equally using the existing naturalCompare() function and your
> algorithm. If that is not the case, find out two files for which the
> lessThan() results differ, add a new unit test, then try to fix it.
> 
> I haven't checked every detail of your code yet, but I'm not sure if
> replacing digits by ':' in the collation sequence is a good idea. This
> is only asking for trouble with file names which contain ':' and might
> also cause other issues. Christoph's suggestion to prepend numbers
> with their number of digits (which probably has to be done even
> recursively to make sure that numbers with 10 digits are really
> considered larger than a number with 9 digits) looks better to me.
> 
> Best regards,
> Frank

Hi Frank,

Thanks a lot for your suggestions. I will certainly try to add the
tests in the mix. However, the tests are likely to be different. The
naturalCompare function returns an integer which is more then 0 and
1... My "alternative" is a lessThan replacement and only returns a
bool.

So obviously some tests are going to fail.. Is it even possible to
change the tests to work for this?

As for the numbers, i already have a different approach there :) The
algorithm right now is as follows:
- if the current value is a number and the next value isn't then the
collation key becomes the integer representation of a number (not the
unicode representation)
- if the current value AND the next value are both numbers then the
current value is 10 + the current number. The 10 makes sure that it's
sorted after the 9. This is recursive so a input of: a1000.txt will
become this (for the numbers) 1 = 11, 0 = 10, 0 = 10, 0 = 0.

You might wonder why i'm using the number representation and not the
unicode representation. The reasoning for that is putting a 10 after a
9 would mean that i have to pick the next unicode value for 1 which is
":". That works but makes it more complicated then needed and will
also give issues if filenames have an actual ":" in them. So i just
decided to numbers as they are. If i look at the unicode table [1]
then i'm safe. The highest possible number would be 19 in my algorithm
which is before any unicode character.

[1] http://www.tamasoft.co.jp/en/general-info/unicode.html


[prev in list] [next in list] [prev in thread] [next in thread]
Configure | About | News | Add a list | Sponsored by KoreLogic