List:       coreutils
Subject:    Re: [coreutils] added ability in sort to skip n number of lines for each file
From:       Pádraig Brady <P () draigBrady ! com>
Date:       2010-11-23 16:21:07
Message-ID: 4CEBE9F3.9060306 () draigBrady ! com

On 23/11/10 15:57, Jim Hester wrote:
> Below I have an updated proper patch, it is quite a bit larger than my
> first, but should address all of the concerns from Assaf and Pádraig.
> 
> My main motivation here is not just to make this common operation less
> annoying, it was mostly for increased performance.  I made a test
> dataset of 10 files with 3 header lines each and 500,000 lines to sort,
> then ran sort by using head and tail as Pádraig suggests, and then again
> using my implemented header skip on an 8 core machine.  Larger files
> seem to show similar speed up as well.  I believe this speedup comes
> from the fact that the multithreaded sort is trying to read from the
> buffer faster than tail can write to the buffer.
> 
>>time { (head -q -n 3 test[0-9] | head -n 3; tail -q -n+4 test[0-9] |
> ./sort -n ) > out2; }
> 
> real    0m51.660s
> user    2m0.324s
> sys     0m4.115s
> 
>>time ./sort -n -l 3 test[0-9] > out
> 
> real    0m31.834s
> user    2m17.775s
> sys     0m3.981s
>>diff out out2

The user time for the head; tail | sort run (2m0.3s)
is lower than for sort -l (2m17.8s), which suggests that
the first invocation was just waiting on disk?

Could you please repeat the test using precached data?
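
As a rough sketch of what I mean by precached: read every input once
before the timed run, so the data is already in the page cache. The
temporary directory, file sizes and the run_pipeline name below are
made up for illustration, and the stock `sort` stands in for the
patched ./sort.

```shell
# Illustrative only: small stand-in data set, not the 500,000-line one.
dir=$(mktemp -d)

# Three files, each with 3 header lines followed by numeric data.
for i in 0 1 2; do
  { printf 'header1\nheader2\nheader3\n'; seq 1000 -1 1; } > "$dir/test$i"
done

# Warm the page cache: read each input once and discard the data.
cat "$dir"/test[0-9] > /dev/null

# The head/tail pipeline from the quoted test (hypothetical wrapper name).
run_pipeline() {
  (head -q -n 3 "$dir"/test[0-9] | head -n 3
   tail -q -n +4 "$dir"/test[0-9] | sort -n) > "$dir/out2"
}

run_pipeline
```

Prefixing the run_pipeline call with `time` then gives numbers that are
not dominated by disk waits, provided the files fit in memory.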

Currently the threads in `sort` are passed data that is read
sequentially from the input files. Otherwise `sort` would have
to start worrying about device ids,
/sys/block/<blockdev>/queue/rotational etc.,
so as not to thrash disk heads. That kind of
logic is probably always best kept outside of `sort`.
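
For illustration, here is one way a wrapper script outside of `sort`
might look at that sysfs attribute on Linux. The list_rotational name
is hypothetical, and this is only a sketch of the check, not a
scheduling policy.

```shell
# Print each block device and the kernel's rotational flag
# (1 = spinning disk, 0 = non-rotational such as an SSD).
list_rotational() {
  for f in /sys/block/*/queue/rotational; do
    [ -r "$f" ] || continue        # glob matched nothing, or unreadable
    dev=${f#/sys/block/}           # strip the leading path
    dev=${dev%%/*}                 # keep only the device name
    printf '%s rotational=%s\n' "$dev" "$(cat "$f")"
  done
}

list_rotational
```

On systems without /sys/block the loop simply prints nothing.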

cheers,
Pádraig.

