'Re: Ideas for improving sorting. Goal: under 1 second for 200.000 files.'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       kfm-devel
Subject:    Re: Ideas for improving sorting. Goal: under 1 second for 200.000 files.
From:       Mark <markg85 () gmail ! com>
Date:       2012-12-28 15:32:36
Message-ID: CAPd6JnH=dVUzHBMg6obPDnT_VpjVKk6afYsbqX6hOwtNngnOwA () mail ! gmail ! com
[Download RAW message or body]

On Fri, Dec 28, 2012 at 3:41 PM, Mark <markg85@gmail.com> wrote:
> On Fri, Dec 28, 2012 at 2:46 PM, Frank Reininghaus
> <frank78ac@googlemail.com> wrote:
>> Hi,
>>
>> 2012/10/24 Christoph Feck:
>>> On Wednesday 24 October 2012 01:13:40 Christoph Feck wrote:
>>>> Speed of comparing the sequences is that of a memcpy()
>>>
>>> Of course I meant memcmp().
>>>
>>>> So what you have to work on before doing anything else, is to ask
>>>> Frank, if he is okey with adding a sort key (QByteArray basically)
>>>> for each file item. Your code would have to make sure it is always
>>>> updated on file renames, locale changes, sort mode changes
>>>> (natural vs. simple) etc.
>>>
>>> Another idea is to not store this side information "permanently", but
>>> only create it temporarily when sorting. So you effectively only sort
>>> a list with side information.
>>>
>>>         struct SortEntry
>>>         {
>>>                 QByteArray collatingKey;
>>>                 KFileItem *fileItem;
>>>         };
>>
>> I have been thinking about sorting from time to time, and I had an
>> idea how to to store the sort key. It could be done either
>>
>> a) In the ItemData's "values" hash (as a QVariant with a key like
>> "naturalCollatingSequence"), or,
>>
>> b) in a new member "QByteArray naturalCollatingSequence" in ItemData
>> (if a QByteArray seems more suitable because QByteArrays can be
>> compared with < directly and the casting overhead for each comparison
>> seems to much).
>>
>> When comparing two ItemData, we would then first check if both already
>> have a collating sequence, and calculate and store them if that is not
>> the case. One could then just compare those sequences.
>>
>> When an item changes, the collating sequence would be deleted from the
>> hash (or set to an empty QByteArray, respectively).
>>
>> This would make sure that the collating sequence is calculated only
>> once for each item, and only if it is really needed.
>>
>> Best regards,
>> Frank
>
> Hi Frank,
>
> That's a nice idea and close to what i was experimenting with as well.
>
> The issue that i keep having all the time is building a collation
> sequence itself. For example, take the following "file names":
>
> 88.txt
> 100.txt
>
> What i'm doing right now is translating everything to numbers, and
> then just comparing the numbers. Another way would be to convert them
> all to numbers and then make a new long string that contains all the
> numbers. So i end up with:
>
> 88.txt
> 8 = 8
> 8 = 8
> . = <some number>
> t = <some number>
> x = <some number>
> t = <some number>
>
> 100.txt
> 1 = 1
> 0 = 0
> 0 = 0
> . = <some number>
> t = <some number>
> x = <some number>
> t = <some number>
>
> That works, but only till 9.. That's a pitfall i discovered early on.
> So then i continues and changed the algorithm to say: "if the next
> character is also a number then the current number gets incremented by
> 10" so that will end up with: (omitting the .txt)
> 88.txt
> 8 = 18
> 8 = 8
>
> 100.txt
> 1 = 11
> 0 = 10
> 0 = 0
>
> Doing that improved it greatly but it still gives unexpected results.
> So what i was about to try now is making the algorithm say: "increment
> the current number by the previous number. If there is no previous
> number, but there are next numbers, just add 10." so that would end up
> looking like this:
> 88.txt // stays the same
> 8 = 18
> 8 = 8
>
> 100.txt differs
> 1 = 11
> 0 = 10 (+11) = 21
> 0 = 0
>
> I still have to see if that one nails all the issues.
> Once i get that working then it's just a simple matter of putting all
> the numbers in one string again, so for the last example that would
> end up like so: "11210" as a string. But i'm not there - by far! It's
> very complicated to get it right and i fear that this last option also
> has some unexpected consequences.
>
> If you have any other idea to get this working, please do share. This
> issue is really nagging me and i want to get it fixed. It would be the
> fastest natural sorting algorithm that i know of if it works :)
>
> Cheers,
> Mark

Note: one way that will certainly fix it is prefixing with 0. So if we
take the 88 and 100 that would end up like this:
088
100

That will work with sorting, but all "collation sequences" will be as
long as the longest string which is probably not what we want to have.
[prev in list] [next in list] [prev in thread] [next in thread]
Configure | About | News | Add a list | Sponsored by KoreLogic