
List:       solr-dev
Subject:    [jira] [Comment Edited] (LUCENE-4062) More fine-grained control over the packed integer implementation that is chosen
From:       "Toke Eskildsen (JIRA)" <jira () apache ! org>
Date:       2012-06-29 22:04:43
Message-ID: 1107762871.73699.1341007483871.JavaMail.jiratomcat () issues-vm
[Download RAW message or body]


    [ https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404234#comment-13404234 ]

Toke Eskildsen edited comment on LUCENE-4062 at 6/29/12 10:03 PM:
------------------------------------------------------------------

+1 for replacing the Packed64SingleBlock with the optimized version. It is consistently better.

I have updated the charts at http://ekot.dk/misc/packedints/ with i7 measurements. For the three non-i7-based Intel processors, Packed64SingleBlock seems clear-cut for the "Best possible performance without going all-out with Direct".

I tried running two tests in parallel on te-prime (see http://ekot.dk/misc/packedints/te-prime_parallel.html) and got very consistent results:
  * For set, the contiguous strategy is faster than or equal to padded for bpvs 3-12 and 16-21 (with bpv 1 and 2 being hard to measure). Since padded only supports bpv 1-10, 12, 16, 21 & 32, with non-conforming bpvs rounded up, this effectively means that there is zero gain in using padded over contiguous with regard to set on that machine.
  * For get, the picture is nearly the same, except for bpv 17-21, where contiguous is slower than padded (observed in process 2 as well as in the single-thread run). The difference is less than 10%, though. The same pattern, although noisier, can be seen on mars.

My preliminary conclusion for i7 processors is thus that Packed64Strategy is the right choice for "Best possible performance without going all-out with Direct". I am getting very curious about the untested AMD architecture now.

A by-product of the te-prime parallel test is that the amount of cache seems to matter little when it comes to selecting the most fitting implementation. Thank $deity for small favors.
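The contiguous-vs-padded distinction being benchmarked above can be sketched as follows. This is a hypothetical simplification for bpv = 21, not the actual Packed64/Packed64SingleBlock code from the patch; class and method names are illustrative only.

```java
// Rough sketch of the two layouts under discussion, for bpv = 21.
// "Contiguous" (Packed64-style): values are packed back to back, so a value
// may straddle two 64-bit blocks. "Padded" (Packed64SingleBlock-style): 3
// values per long with 64 - 3*21 = 1 wasted bit, so a get/set always
// touches exactly one block.
public class PackedLayoutSketch {
    static final int BPV = 21;
    static final long MASK = (1L << BPV) - 1;
    static final int PER_BLOCK = 64 / BPV; // 3 values per long

    static long getContiguous(long[] blocks, int index) {
        long bitPos = (long) index * BPV;
        int block = (int) (bitPos >>> 6); // bitPos / 64
        int shift = (int) (bitPos & 63);  // bitPos % 64
        long value = blocks[block] >>> shift;
        if (shift + BPV > 64) {           // value spans into the next block
            value |= blocks[block + 1] << (64 - shift);
        }
        return value & MASK;
    }

    static void setContiguous(long[] blocks, int index, long v) {
        long bitPos = (long) index * BPV;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        blocks[block] = (blocks[block] & ~(MASK << shift)) | (v << shift);
        if (shift + BPV > 64) {
            int done = 64 - shift;        // bits already written to the first block
            blocks[block + 1] = (blocks[block + 1] & ~(MASK >>> done)) | (v >>> done);
        }
    }

    static long getPadded(long[] blocks, int index) {
        int block = index / PER_BLOCK;
        int shift = (index % PER_BLOCK) * BPV;
        return (blocks[block] >>> shift) & MASK; // always a single block read
    }

    static void setPadded(long[] blocks, int index, long v) {
        int block = index / PER_BLOCK;
        int shift = (index % PER_BLOCK) * BPV;
        blocks[block] = (blocks[block] & ~(MASK << shift)) | (v << shift);
    }
}
```

The branch in the contiguous get/set is what the padded layout trades 64 % bpv bits per block to avoid.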
      was (Author: toke):
    +1 for replacing the Packed64SingleBlock with the optimized version. It is consistently better.

I have updated the charts at http://ekot.dk/misc/packedints/ with i7 measurements. For the three non-i7-based Intel processors, Packed64SingleBlock seems clear-cut for the "Best possible performance without going all-out with Direct".

I tried running two tests in parallel on te-prime (see http://ekot.dk/misc/packedints/te-prime_parallel.html) and got very consistent results:
  * For set, the contiguous strategy is faster than or equal to padded for bpvs 3-12 and 16-21 (with bpv 1 and 2 being hard to measure). Since padded only supports bpv 1-10, 12, 16, 21 & 32, with non-conforming bpvs rounded up, this effectively means that there is zero gain in using padded over contiguous with regard to set on that machine.
  * For get, the picture is the same, except for process 2, where 17-21 was slower for contiguous than for padded. Sigh. I should of course have started 3 processes so voting would have been easier. The same pattern, although noisier, can be seen on mars.

My preliminary conclusion for i7 processors is thus that Packed64Strategy is the right choice for "Best possible performance without going all-out with Direct". I am getting very curious about the untested AMD architecture now.

A by-product of the te-prime parallel test is that the amount of cache seems to matter little when it comes to selecting the most fitting implementation. Thank $deity for small favors.
> More fine-grained control over the packed integer implementation that is chosen
> -------------------------------------------------------------------------------
> 
> Key: LUCENE-4062
> URL: https://issues.apache.org/jira/browse/LUCENE-4062
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/other
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Labels: performance
> Fix For: 4.0, 5.0
> 
> Attachments: LUCENE-4062-2.patch, LUCENE-4062.patch, LUCENE-4062.patch,
> LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch,
> LUCENE-4062.patch, Packed64SingleBlock.java, Packed64Strategy.java,
> Packed64calc.java, PackedIntsBenchmark.java, PackedIntsBenchmark.java,
> PackedIntsBenchmark.java, measurements_te_graphs.pdf, measurements_te_i7.txt,
> measurements_te_p4.txt, measurements_te_xeon.txt
> 
> In order to save space, Lucene has two main PackedInts.Mutable implementations:
> one that is very fast and is based on a byte/short/integer/long array (Direct*),
> and another that packs bits in a memory-efficient manner (Packed*). The packed
> implementation tends to be much slower than the direct one, which discourages
> some Lucene components from using it. On the other hand, if you store 21-bit
> integers in a Direct32, you waste (32-21)/32 ≈ 34% of the space. If you are
> willing to trade some space for speed, you could store 3 of these 21-bit
> integers in a long, resulting in an overhead of 1/3 bit per value. One
> advantage of this approach is that you never need to read more than one block
> to read or write a value, so it can be significantly faster than Packed32 and
> Packed64, which always need to read/write two blocks in order to avoid costly
> branches. I ran some tests, and for 10000000 21-bit values, this implementation
> takes less than 2% more space and has 44% faster writes and 30% faster reads.
> The 12-bit version (5 values per block) shows the same performance improvement
> with a 6% memory overhead compared to the packed implementation. In order to
> select the best implementation for a given integer size, I wrote the
> {{PackedInts.getMutable(valueCount, bitsPerValue, acceptableOverheadPerValue)}}
> method. This method selects the fastest implementation that wastes less than
> {{acceptableOverheadPerValue}} bits per value. For example, if you accept an
> overhead of 20% ({{acceptableOverheadPerValue = 0.2f * bitsPerValue}}), which
> is pretty reasonable, here is what implementations would be selected:
> * 1: Packed64SingleBlock1
> * 2: Packed64SingleBlock2
> * 3: Packed64SingleBlock3
> * 4: Packed64SingleBlock4
> * 5: Packed64SingleBlock5
> * 6: Packed64SingleBlock6
> * 7: Direct8
> * 8: Direct8
> * 9: Packed64SingleBlock9
> * 10: Packed64SingleBlock10
> * 11: Packed64SingleBlock12
> * 12: Packed64SingleBlock12
> * 13: Packed64
> * 14: Direct16
> * 15: Direct16
> * 16: Direct16
> * 17: Packed64
> * 18: Packed64SingleBlock21
> * 19: Packed64SingleBlock21
> * 20: Packed64SingleBlock21
> * 21: Packed64SingleBlock21
> * 22: Packed64
> * 23: Packed64
> * 24: Packed64
> * 25: Packed64
> * 26: Packed64
> * 27: Direct32
> * 28: Direct32
> * 29: Direct32
> * 30: Direct32
> * 31: Direct32
> * 32: Direct32
> * 33: Packed64
> * 34: Packed64
> * 35: Packed64
> * 36: Packed64
> * 37: Packed64
> * 38: Packed64
> * 39: Packed64
> * 40: Packed64
> * 41: Packed64
> * 42: Packed64
> * 43: Packed64
> * 44: Packed64
> * 45: Packed64
> * 46: Packed64
> * 47: Packed64
> * 48: Packed64
> * 49: Packed64
> * 50: Packed64
> * 51: Packed64
> * 52: Packed64
> * 53: Packed64
> * 54: Direct64
> * 55: Direct64
> * 56: Direct64
> * 57: Direct64
> * 58: Direct64
> * 59: Direct64
> * 60: Direct64
> * 61: Direct64
> * 62: Direct64
> Under 32 bits per value, only 13, 17 and 22-26 bits per value would still
> choose the slower Packed64 implementation. Allowing a 50% overhead would
> prevent the packed implementation from being selected for bits per value under
> 32. Allowing an overhead of 32 bits per value would make sure that a Direct*
> implementation is always selected. Next steps would be to:
> * make Lucene components use this {{getMutable}} method and let users decide
> what trade-off suits them better,
> * write a Packed32SingleBlock implementation if necessary (I didn't do it
> because I have no 32-bit computer to test the performance improvements).
> I think this would allow more fine-grained control over the speed/space
> trade-off. What do you think?
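The selection rule described in the quoted issue can be sketched roughly as below. The candidate ordering (Direct fastest, then Packed64SingleBlock, then Packed64) and the overhead formula for the padded blocks are my assumptions for illustration, not the patch's actual code; the supported single-block bpv values are the ones stated above.

```java
// Hedged sketch: pick the fastest implementation whose wasted bits per value
// stay within the acceptable overhead. Returns the implementation name only.
public class SelectSketch {
    // bpv values Packed64SingleBlock is said to support (non-conforming
    // values are rounded up to the next supported one).
    static final int[] SINGLE_BLOCK_BPV = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 16, 21, 32};

    static String getMutable(int bitsPerValue, float acceptableOverheadPerValue) {
        // 1. Fastest: a Direct* implementation rounds bpv up to 8/16/32/64 bits.
        for (int directBits : new int[]{8, 16, 32, 64}) {
            if (bitsPerValue <= directBits
                    && directBits - bitsPerValue <= acceptableOverheadPerValue) {
                return "Direct" + directBits;
            }
        }
        // 2. Next: Packed64SingleBlock rounds bpv up to a supported value; the
        //    overhead per value is the rounding plus the per-block padding
        //    (64 % bpv unused bits shared by the 64 / bpv values in the block).
        for (int sb : SINGLE_BLOCK_BPV) {
            if (bitsPerValue <= sb) {
                float overhead = (sb - bitsPerValue) + (64f % sb) / (64 / sb);
                if (overhead <= acceptableOverheadPerValue) {
                    return "Packed64SingleBlock" + sb;
                }
            }
        }
        // 3. Fallback: fully packed, zero overhead, slowest.
        return "Packed64";
    }
}
```

Under these assumptions the sketch agrees with the spot checks I did against the list above for {{acceptableOverheadPerValue = 0.2f * bitsPerValue}}, e.g. 7 → Direct8, 11 → Packed64SingleBlock12, 13 → Packed64, 18 → Packed64SingleBlock21, 27 → Direct32.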

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


