From kfm-devel  Sun Oct 28 16:55:35 2012
From: Mark <markg85 () gmail ! com>
Date: Sun, 28 Oct 2012 16:55:35 +0000
To: kfm-devel
Subject: Re: Ideas for improving sorting. Goal: under 1 second for 200.000 files.
Message-Id: <CAPd6JnHpB1ZsPLmBf3posz2jdgztii5qniiTj2S_F_abK8tr6Q () mail ! gmail ! com>
X-MARC-Message: https://marc.info/?l=kfm-devel&m=135144336826034

On Sun, Oct 28, 2012 at 1:39 AM, Mark <markg85@gmail.com> wrote:
> On Sat, Oct 27, 2012 at 10:14 PM, Mark <markg85@gmail.com> wrote:
>> On Fri, Oct 26, 2012 at 9:48 PM, Mark <markg85@gmail.com> wrote:
>>> On Fri, Oct 26, 2012 at 9:40 PM, Bernd Buschinski
>>> <b.buschinski@googlemail.com> wrote:
>>>> On Friday 26 October 2012 21:21:39 Mark wrote:
>>>>>
>>>>> Oke, i got "something" working now but i kinda miss some more
>>>>> knowledge in the sorting area.
>>>>>
>>>>> First the current implementation. I went for a structure like this:
>>>>> struct KFileItem
>>>>> {
>>>>>     QString collatingSequence;
>>>>>     QString filename;
>>>>>     ....
>>>>> };
>>>>>
>>>>> Note: this is just in a clean new (one file) project so don't bother the
>>>>> names.
>>>>>
>>>>> filename is obviously the filename. collatingSequence is a copy of
>>>>> filename. In that collatingSequence i'm tweaking the unicode values
>>>>> and lower them for numbers. So the intention is to have numbers shown
>>>>> before characters.
>>>>>
>>>>> So:
>>>>> 000.txt (in unicode: 48, 48, 48, 46, 116, 120, 116)
>>>>> a.txt (in unicode: 97, 46, 116, 120, 116)
>>>>> ...
>>>>>
>>>>> And i tweaked that to look like this: (1 + num)
>>>>> 000.txt (in unicode: 1, 1, 1, 46, 116, 120, 116)
>>>>> a.txt (in unicode: 97, 46, 116, 120, 116)
>>>>> ...
>>>>>
>>>>>
>>>>> That part "seems" to be oke in theory but i'm getting stuck when
>>>>> sorting. For this "proof of concept" i simply use qSort like so:
>>>>> QList<KFileItem> fileList;
>>>>>
>>>>> ... add a bunch of items in fileList ...
>>>>>
>>>>> qSort(fileList.begin(), fileList.end(), lessThan);
>>>>>
>>>>> lessThan is where i'm stuck. I simply don't understand yet how
>>>>> lessThan should work. What conditions i should do and when i should
>>>>> return false or true.. My current version looks like this:
>>>>> bool intSort(const KFileItem& s1, const KFileItem& s2)
>>>>> {
>>>>>     const QChar* sOne = s1.collatingSequence.unicode();
>>>>>     const QChar* sTwo = s2.collatingSequence.unicode();
>>>>>
>>>>>     while(!sOne->isNull() && !sTwo->isNull())
>>>>>     {
>>>>>         if(sTwo->unicode() < sOne->unicode())
>>>>>         {
>>>>>             return false;
>>>>>         }
>>>>>
>>>>>         ++sOne;
>>>>>         ++sTwo;
>>>>>     }
>>>>>
>>>>>     return true;
>>>>> }
>>>>>
>>>>> Which compiles and runs, but does a very strange sorting :p
>>>>> If someone can explain what the intended implementation is for a lessThan..?
>>>>>
>>>>> Kind regards,
>>>>> Mark
>>>>
>>>> isn't the if wrong? shouldn't this be "if a is less than b"?
>>>>
>>>> so
>>>>         if(sTwo->unicode() < sOne->unicode())
>>>>         {
>>>>             return false;
>>>>         }
>>>>
>>>> should be
>>>>
>>>>         if(sOne->unicode() < sTwo->unicode())
>>>>         {
>>>>             return false;
>>>>         }
>>>>
>>>> or? (untested)
>>>>
>>>>
>>>>
>>>> all:
>>>>
>>>> bool intSort(const KFileItem& s1, const KFileItem& s2)
>>>> {
>>>>     const QChar* sOne = s1.collatingSequence.unicode();
>>>>     const QChar* sTwo = s2.collatingSequence.unicode();
>>>>
>>>>     while(!sOne->isNull() && !sTwo->isNull())
>>>>     {
>>>>         if(sOne->unicode() < sTwo->unicode())
>>>>         {
>>>>             return false;
>>>>         }
>>>>
>>>>         ++sOne;
>>>>         ++sTwo;
>>>>     }
>>>>
>>>>     return true;
>>>> }
>>>>
>>>>
>>>> Best regards
>>>> Bernd
>>>>
>>>
>>> Nope, i tried that but it doesn't improve sorting.
>>> Here is the VERY HACKY single file project.
>>> http://paste.kde.org/582536/
>>>
>>> Make an empty Qt console project and paste this as the main.cpp and it
>>> should just run. There is quite a bit of debug info in there and quite
>>> a lot more commented out.
>>
>> I have more stuff working now. You can find the current code here:
>> http://paste.kde.org/583586/ the same warnings apply as last time :)
>>
>> The sorting of just text works fine, but i'm having issues as soon as
>> i ass numbers in the mix.
>> If you sort by character then a 0 comes before a 9, just as it should be.
>> In this example code i have the following names:
>> "0000.txt"
>> "a4.txt"
>> "a1.1001.txt"
>> "a10.txt"
>> "a1.txt"
>> "a1.2.txt"
>> "a1.11.txt"
>>
>> and once the sorting is run over it the result becomes:
>> "0000.txt"
>> "a1.2.txt"
>> "a1.11.txt"
>> "a1.txt"
>> "a1.1001.txt"
>> "a4.txt"
>> "a10.txt"
>>
>> So i'm getting close, but that "a1.txt" is giving me troubles now.
>> Note, for the sorting i just use the names without the .txt since that
>> should be done when sorting on extension anyway.
>>
>> The "trick" i'm doing right now is looking ahead in the string. If the
>> current character is a number and the following character is also a
>> number then i invert the current character. So if the current
>> character is a 1 and the next character is a 0 then i replace the
>> current character with 9 - current character. Thus far this seems to
>> work quite well, but i haven't drawn this out to see if it will work
>> everywhere. I'm doing this because i really like to avoid detecting a
>> number in a string. That will obviously work, but will also post new
>> issues like the size of an int, float or  double.. I prefer the per
>> character way if i can get that working.
>>
>> Again, that a1 is biting me now. Does anyone know a solution to get
>> this order right? a1 should be above a1.1 and below 0000.
>>
>> Kind regards,
>> Mark
>
> Little update: http://paste.kde.org/583670/
> Prevents leading zeros from screwing my results.. I still have the
> same issue with a1.txt -_- That one is really getting on my nerves
> now.

Another little update.
Before sorting
"0000.txt"
"1a2b3a4.txt"
"a1.1001.txt"
"a10.txt"
"a1.txt"
"a1.2.txt"
"a1.11.txt"
"a7.txt"

After sorting
"0000.txt"
"1a2b3a4.txt"
"a1.1001.txt"
"a1.txt"
"a1.2.txt"
"a1.11.txt"
"a7.txt"
"a10.txt"


Somehow i need to place a1.1001.txt below a1.11.txt ... I'm currently
out of ideas on how to do that.
The code: http://paste.kde.org/584462/

Once this "algorithm" work i will rework it to have just an integer
array in KFileItem and an int arraySize. Then the actual sorting
function (intSort) is even simpler and probably even faster.