From kfm-devel Sun Oct 28 16:55:35 2012 From: Mark Date: Sun, 28 Oct 2012 16:55:35 +0000 To: kfm-devel Subject: Re: Ideas for improving sorting. Goal: under 1 second for 200.000 files. Message-Id: X-MARC-Message: https://marc.info/?l=kfm-devel&m=135144336826034 On Sun, Oct 28, 2012 at 1:39 AM, Mark wrote: > On Sat, Oct 27, 2012 at 10:14 PM, Mark wrote: >> On Fri, Oct 26, 2012 at 9:48 PM, Mark wrote: >>> On Fri, Oct 26, 2012 at 9:40 PM, Bernd Buschinski >>> wrote: >>>> On Friday 26 October 2012 21:21:39 Mark wrote: >>>>> >>>>> Oke, i got "something" working now but i kinda miss some more >>>>> knowledge in the sorting area. >>>>> >>>>> First the current implementation. I went for a structure like this: >>>>> struct KFileItem >>>>> { >>>>> QString collatingSequence; >>>>> QString filename; >>>>> .... >>>>> }; >>>>> >>>>> Note: this is just in a clean new (one file) project so don't bother the >>>>> names. >>>>> >>>>> filename is obviously the filename. collatingSequence is a copy of >>>>> filename. In that collatingSequence i'm tweaking the unicode values >>>>> and lower them for numbers. So the intention is to have numbers shown >>>>> before characters. >>>>> >>>>> So: >>>>> 000.txt (in unicode: 48, 48, 48, 46, 116, 120, 116) >>>>> a.txt (in unicode: 97, 46, 116, 120, 116) >>>>> ... >>>>> >>>>> And i tweaked that to look like this: (1 + num) >>>>> 000.txt (in unicode: 1, 1, 1, 46, 116, 120, 116) >>>>> a.txt (in unicode: 97, 46, 116, 120, 116) >>>>> ... >>>>> >>>>> >>>>> That part "seems" to be oke in theory but i'm getting stuck when >>>>> sorting. For this "proof of concept" i simply use qSort like so: >>>>> QList fileList; >>>>> >>>>> ... add a bunch of items in fileList ... >>>>> >>>>> qSort(fileList.begin(), fileList.end(), lessThan); >>>>> >>>>> lessThan is where i'm stuck. I simply don't understand yet how >>>>> lessThan should work. What conditions i should do and when i should >>>>> return false or true.. My current version looks like this: >>>>> bool intSort(const KFileItem& s1, const KFileItem& s2) >>>>> { >>>>> const QChar* sOne = s1.collatingSequence.unicode(); >>>>> const QChar* sTwo = s2.collatingSequence.unicode(); >>>>> >>>>> while(!sOne->isNull() && !sTwo->isNull()) >>>>> { >>>>> if(sTwo->unicode() < sOne->unicode()) >>>>> { >>>>> return false; >>>>> } >>>>> >>>>> ++sOne; >>>>> ++sTwo; >>>>> } >>>>> >>>>> return true; >>>>> } >>>>> >>>>> Which compiles and runs, but does a very strange sorting :p >>>>> If someone can explain what the intended implementation is for a lessThan..? >>>>> >>>>> Kind regards, >>>>> Mark >>>> >>>> isn't the if wrong? shouldn't this be "if a is less than b"? >>>> >>>> so >>>> if(sTwo->unicode() < sOne->unicode()) >>>> { >>>> return false; >>>> } >>>> >>>> should be >>>> >>>> if(sOne->unicode() < sTwo->unicode()) >>>> { >>>> return false; >>>> } >>>> >>>> or? (untested) >>>> >>>> >>>> >>>> all: >>>> >>>> bool intSort(const KFileItem& s1, const KFileItem& s2) >>>> { >>>> const QChar* sOne = s1.collatingSequence.unicode(); >>>> const QChar* sTwo = s2.collatingSequence.unicode(); >>>> >>>> while(!sOne->isNull() && !sTwo->isNull()) >>>> { >>>> if(sOne->unicode() < sTwo->unicode()) >>>> { >>>> return false; >>>> } >>>> >>>> ++sOne; >>>> ++sTwo; >>>> } >>>> >>>> return true; >>>> } >>>> >>>> >>>> Best regards >>>> Bernd >>>> >>> >>> Nope, i tried that but it doesn't improve sorting. >>> Here is the VERY HACKY single file project. >>> http://paste.kde.org/582536/ >>> >>> Make an empty Qt console project and paste this as the main.cpp and it >>> should just run. There is quite a bit of debug info in there and quite >>> a lot more commented out. >> >> I have more stuff working now. You can find the current code here: >> http://paste.kde.org/583586/ the same warnings apply as last time :) >> >> The sorting of just text works fine, but i'm having issues as soon as >> i ass numbers in the mix. >> If you sort by character then a 0 comes before a 9, just as it should be. >> In this example code i have the following names: >> "0000.txt" >> "a4.txt" >> "a1.1001.txt" >> "a10.txt" >> "a1.txt" >> "a1.2.txt" >> "a1.11.txt" >> >> and once the sorting is run over it the result becomes: >> "0000.txt" >> "a1.2.txt" >> "a1.11.txt" >> "a1.txt" >> "a1.1001.txt" >> "a4.txt" >> "a10.txt" >> >> So i'm getting close, but that "a1.txt" is giving me troubles now. >> Note, for the sorting i just use the names without the .txt since that >> should be done when sorting on extension anyway. >> >> The "trick" i'm doing right now is looking ahead in the string. If the >> current character is a number and the following character is also a >> number then i invert the current character. So if the current >> character is a 1 and the next character is a 0 then i replace the >> current character with 9 - current character. Thus far this seems to >> work quite well, but i haven't drawn this out to see if it will work >> everywhere. I'm doing this because i really like to avoid detecting a >> number in a string. That will obviously work, but will also post new >> issues like the size of an int, float or double.. I prefer the per >> character way if i can get that working. >> >> Again, that a1 is biting me now. Does anyone know a solution to get >> this order right? a1 should be above a1.1 and below 0000. >> >> Kind regards, >> Mark > > Little update: http://paste.kde.org/583670/ > Prevents leading zeros from screwing my results.. I still have the > same issue with a1.txt -_- That one is really getting on my nerves > now. Another little update. Before sorting "0000.txt" "1a2b3a4.txt" "a1.1001.txt" "a10.txt" "a1.txt" "a1.2.txt" "a1.11.txt" "a7.txt" After sorting "0000.txt" "1a2b3a4.txt" "a1.1001.txt" "a1.txt" "a1.2.txt" "a1.11.txt" "a7.txt" "a10.txt" Somehow i need to place a1.1001.txt below a1.11.txt ... I'm currently out of ideas on how to do that. The code: http://paste.kde.org/584462/ Once this "algorithm" work i will rework it to have just an integer array in KFileItem and an int arraySize. Then the actual sorting function (intSort) is even simpler and probably even faster.