From kfm-devel Mon Oct 29 12:30:10 2012 From: Mark Date: Mon, 29 Oct 2012 12:30:10 +0000 To: kfm-devel Subject: Re: Ideas for improving sorting. Goal: under 1 second for 200.000 files. Message-Id: X-MARC-Message: https://marc.info/?l=kfm-devel&m=135151384920018 On Mon, Oct 29, 2012 at 9:42 AM, Frank Reininghaus wrote: > Hi Mark, > > 2012/10/28 Mark: > [...] >> http://paste.kde.org/584534/ >> >> I already had the nagging feeling that the QChar->unicode() stuff was >> making it more complicated then it had to bee in the lessThan >> function. So i converted it all to ints. You can see that in the above >> paste. >> >> Before sorting >> "0000.txt" >> "1a2b3a4.txt" >> "a1.1001.txt" >> "a10.txt" >> "a6.txt" >> "a3.txt" >> "a8.txt" >> "a1.txt" >> "a2.txt" >> "a9.txt" >> "a5.txt" >> "a4.txt" >> "a14.txt" >> "a11.txt" >> "a12.txt" >> "a1.2.txt" >> "a1.11.txt" >> "a7.txt" >> >> After sorting >> "0000.txt" >> "1a2b3a4.txt" >> "a1.txt" >> "a1.2.txt" >> "a1.11.txt" >> "a1.1001.txt" >> "a2.txt" >> "a3.txt" >> "a4.txt" >> "a5.txt" >> "a6.txt" >> "a7.txt" >> "a8.txt" >> "a9.txt" >> "a10.txt" >> "a11.txt" >> "a12.txt" >> "a14.txt" > > Looks like a reasonable sorting of these files, yes :-) > >> Victory i guess :) >> Well, i actually think that a call for victory might be too soon. I >> still have to test this on dolphin itself. Which i'm going to do right >> now. > > Another thing that might be worth checking is if your comparison > algorithm passes all unit tests: > > https://projects.kde.org/projects/kde/kdelibs/repository/revisions/master/entry/kdecore/tests/kstringhandlertest.cpp#L52 > > There's some pretty tricky stuff in there, but any candidate for a new > comparison function must pass these, of course - we obviously don't > want regressions. > > Another thing that might be worth trying is to take a very large set > of files (like all files on your hard drive) and check if they are > sorted equally using the existing naturalCompare() function and your > algorithm. If that is not the case, find out two files for which the > lessThan() results differ, add a new unit test, then try to fix it. > > I haven't checked every detail of your code yet, but I'm not sure if > replacing digits by ':' in the collation sequence is a good idea. This > is only asking for trouble with file names which contain ':' and might > also cause other issues. Christoph's suggestion to prepend numbers > with their number of digits (which probably has to be done even > recursively to make sure that numbers with 10 digits are really > considered larger than a number with 9 digits) looks better to me. > > Best regards, > Frank Hi Frank, Thanks a lot for your suggestions. I will certainly try to add the tests in the mix. However, the tests are likely to be different. The naturalCompare function returns an integer which is more then 0 and 1... My "alternative" is a lessThan replacement and only returns a bool. So obviously some tests are going to fail.. Is it even possible to change the tests to work for this? As for the numbers, i already have a different approach there :) The algorithm right now is as follows: - if the current value is a number and the next value isn't then the collation key becomes the integer representation of a number (not the unicode representation) - if the current value AND the next value are both numbers then the current value is 10 + the current number. The 10 makes sure that it's sorted after the 9. This is recursive so a input of: a1000.txt will become this (for the numbers) 1 = 11, 0 = 10, 0 = 10, 0 = 0. You might wonder why i'm using the number representation and not the unicode representation. The reasoning for that is putting a 10 after a 9 would mean that i have to pick the next unicode value for 1 which is ":". That works but makes it more complicated then needed and will also give issues if filenames have an actual ":" in them. So i just decided to numbers as they are. If i look at the unicode table [1] then i'm safe. The highest possible number would be 19 in my algorithm which is before any unicode character. [1] http://www.tamasoft.co.jp/en/general-info/unicode.html