From kfm-devel Fri Dec 28 14:41:26 2012 From: Mark Date: Fri, 28 Dec 2012 14:41:26 +0000 To: kfm-devel Subject: Re: Ideas for improving sorting. Goal: under 1 second for 200.000 files. Message-Id: X-MARC-Message: https://marc.info/?l=kfm-devel&m=135670572315725 On Fri, Dec 28, 2012 at 2:46 PM, Frank Reininghaus wrote: > Hi, > > 2012/10/24 Christoph Feck: >> On Wednesday 24 October 2012 01:13:40 Christoph Feck wrote: >>> Speed of comparing the sequences is that of a memcpy() >> >> Of course I meant memcmp(). >> >>> So what you have to work on before doing anything else, is to ask >>> Frank, if he is okey with adding a sort key (QByteArray basically) >>> for each file item. Your code would have to make sure it is always >>> updated on file renames, locale changes, sort mode changes >>> (natural vs. simple) etc. >> >> Another idea is to not store this side information "permanently", but >> only create it temporarily when sorting. So you effectively only sort >> a list with side information. >> >> struct SortEntry >> { >> QByteArray collatingKey; >> KFileItem *fileItem; >> }; > > I have been thinking about sorting from time to time, and I had an > idea how to to store the sort key. It could be done either > > a) In the ItemData's "values" hash (as a QVariant with a key like > "naturalCollatingSequence"), or, > > b) in a new member "QByteArray naturalCollatingSequence" in ItemData > (if a QByteArray seems more suitable because QByteArrays can be > compared with < directly and the casting overhead for each comparison > seems to much). > > When comparing two ItemData, we would then first check if both already > have a collating sequence, and calculate and store them if that is not > the case. One could then just compare those sequences. > > When an item changes, the collating sequence would be deleted from the > hash (or set to an empty QByteArray, respectively). > > This would make sure that the collating sequence is calculated only > once for each item, and only if it is really needed. > > Best regards, > Frank Hi Frank, That's a nice idea and close to what i was experimenting with as well. The issue that i keep having all the time is building a collation sequence itself. For example, take the following "file names": 88.txt 100.txt What i'm doing right now is translating everything to numbers, and then just comparing the numbers. Another way would be to convert them all to numbers and then make a new long string that contains all the numbers. So i end up with: 88.txt 8 = 8 8 = 8 . = t = x = t = 100.txt 1 = 1 0 = 0 0 = 0 . = t = x = t = That works, but only till 9.. That's a pitfall i discovered early on. So then i continues and changed the algorithm to say: "if the next character is also a number then the current number gets incremented by 10" so that will end up with: (omitting the .txt) 88.txt 8 = 18 8 = 8 100.txt 1 = 11 0 = 10 0 = 0 Doing that improved it greatly but it still gives unexpected results. So what i was about to try now is making the algorithm say: "increment the current number by the previous number. If there is no previous number, but there are next numbers, just add 10." so that would end up looking like this: 88.txt // stays the same 8 = 18 8 = 8 100.txt differs 1 = 11 0 = 10 (+11) = 21 0 = 0 I still have to see if that one nails all the issues. Once i get that working then it's just a simple matter of putting all the numbers in one string again, so for the last example that would end up like so: "11210" as a string. But i'm not there - by far! It's very complicated to get it right and i fear that this last option also has some unexpected consequences. If you have any other idea to get this working, please do share. This issue is really nagging me and i want to get it fixed. It would be the fastest natural sorting algorithm that i know of if it works :) Cheers, Mark