'Re: [jira] [Commented] (LUCENE-10062) Explore using SORTED_NUMERIC doc values to encode taxonomy ord'


List:       lucene-dev
Subject:    Re: [jira] [Commented] (LUCENE-10062) Explore using SORTED_NUMERIC doc values to encode taxonomy ord
From:       Michael Sokolov <msokolov () gmail ! com>
Date:       2021-09-24 10:58:24
Message-ID: CAGUSZHB2Sj0DRbQc0+U+REgeac-1Cm2FquFawK=my3aB_VyQrQ () mail ! gmail ! com

Hard to read on the phone, but is that a 482% speed up I saw??!

On Thu, Sep 23, 2021, 1:28 PM Greg Miller (Jira) <jira@apache.org> wrote:

> 
> [
> https://issues.apache.org/jira/browse/LUCENE-10062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419349#comment-17419349
>  ]
> 
> Greg Miller commented on LUCENE-10062:
> --------------------------------------
> 
> I re-ran {{luceneutil}} benchmarks {{wikimedium10m}} since [~mikemccand]
> added new faceting tasks (thanks Mike!). Looks like there's a nice
> improvement on these new faceting tasks as well with this change (and no
> regressions anywhere else that I see).
> 
> I was waiting to iterate on my PR until I was able to run these new
> benchmarking tasks, but it seems like there's enough benefit to this change
> to pick it back up.
> 
> 
> {noformat}
>                             Task  QPS baseline      StdDev QPS candidate      StdDev                Pct diff p-value
>            HighTermDayOfYearSort         70.02     (13.7%)         68.45      (9.7%)   -2.2% ( -22% -   24%) 0.551
>                          MedTerm       1300.90      (5.5%)       1275.97      (6.7%)   -1.9% ( -13% -   10%) 0.324
>                         HighTerm       1953.46      (5.8%)       1925.79      (7.9%)   -1.4% ( -14% -   13%) 0.518
>             HighTermTitleBDVSort        122.35     (15.6%)        120.86     (14.9%)   -1.2% ( -27% -   34%) 0.801
>                       TermDTSort        133.47      (8.7%)        131.86      (7.4%)   -1.2% ( -15% -   16%) 0.637
>                          LowTerm       1636.13      (5.5%)       1622.34      (7.4%)   -0.8% ( -12% -   12%) 0.682
>                          Prefix3         25.69      (6.0%)         25.48      (6.3%)   -0.8% ( -12% -   12%) 0.676
>                      LowSpanNear        118.02      (2.1%)        117.31      (1.8%)   -0.6% (  -4% -    3%) 0.326
>                HighTermMonthSort        140.17      (9.8%)        139.47      (9.9%)   -0.5% ( -18% -   21%) 0.872
>                      AndHighHigh         49.17      (3.1%)         48.92      (2.7%)   -0.5% (  -6% -    5%) 0.584
>                     HighSpanNear         25.54      (2.7%)         25.41      (2.2%)   -0.5% (  -5% -    4%) 0.529
>                       AndHighLow        556.68      (5.8%)        554.80      (5.4%)   -0.3% ( -10% -   11%) 0.848
>        BrowseDayOfYearSSDVFacets         16.53      (2.5%)         16.47      (2.4%)   -0.3% (  -5% -    4%) 0.674
>                           IntNRQ         87.76      (2.0%)         87.49      (2.1%)   -0.3% (  -4% -    3%) 0.634
>                      MedSpanNear         31.11      (2.2%)         31.04      (1.6%)   -0.2% (  -3% -    3%) 0.714
>                     OrNotHighLow        765.10      (4.5%)        763.60      (5.4%)   -0.2% (  -9% -   10%) 0.901
>                        MedPhrase        160.05      (3.1%)        159.83      (2.9%)   -0.1% (  -5% -    6%) 0.885
>                 HighSloppyPhrase         27.67      (3.1%)         27.64      (3.0%)   -0.1% (  -6% -    6%) 0.915
>                        LowPhrase         61.12      (3.2%)         61.05      (3.2%)   -0.1% (  -6% -    6%) 0.921
>                        OrHighMed         71.85      (2.9%)         71.82      (2.1%)   -0.0% (  -4% -    5%) 0.963
>                       HighPhrase         29.40      (2.3%)         29.39      (2.8%)   -0.0% (  -5% -    5%) 0.971
>                           Fuzzy2         32.58      (4.3%)         32.57      (6.1%)   -0.0% (  -9% -   10%) 0.992
>              LowIntervalsOrdered        150.30      (1.9%)        150.28      (1.9%)   -0.0% (  -3% -    3%) 0.986
>                       AndHighMed        151.32      (3.9%)        151.31      (4.1%)   -0.0% (  -7% -    8%) 0.993
>                       OrHighHigh         23.90      (2.3%)         23.91      (1.9%)    0.0% (  -4% -    4%) 0.970
>                     OrHighNotLow        579.17      (5.1%)        579.35      (6.4%)    0.0% ( -10% -   12%) 0.986
>              MedIntervalsOrdered         86.93      (1.7%)         86.98      (1.9%)    0.1% (  -3% -    3%) 0.913
>                    OrHighNotHigh        536.17      (5.6%)        536.57      (6.6%)    0.1% ( -11% -   12%) 0.969
>                    OrNotHighHigh        787.07      (6.5%)        787.96      (8.1%)    0.1% ( -13% -   15%) 0.961
>                     OrNotHighMed        687.97      (4.7%)        688.77      (6.9%)    0.1% ( -10% -   12%) 0.950
>                  MedSloppyPhrase         68.62      (2.8%)         68.74      (2.7%)    0.2% (  -5% -    5%) 0.838
>                  LowSloppyPhrase        130.37      (2.6%)        130.62      (2.2%)    0.2% (  -4% -    5%) 0.797
>                        OrHighLow        440.44      (4.1%)        441.33      (4.1%)    0.2% (  -7% -    8%) 0.877
>                         Wildcard        122.01      (5.2%)        122.35      (5.3%)    0.3% (  -9% -   11%) 0.867
>             HighIntervalsOrdered         14.24      (2.2%)         14.34      (2.1%)    0.6% (  -3% -    5%) 0.350
>                          Respell         52.04      (2.2%)         52.48      (2.0%)    0.8% (  -3% -    5%) 0.209
>                     OrHighNotMed        674.76      (4.8%)        680.97      (8.0%)    0.9% ( -11% -   14%) 0.659
>                         PKLookup        153.45      (4.3%)        155.13      (3.8%)    1.1% (  -6% -    9%) 0.394
>                           Fuzzy1         56.57      (9.1%)         57.76      (6.7%)    2.1% ( -12% -   19%) 0.406
>            BrowseMonthSSDVFacets         19.59     (10.4%)         20.03      (6.7%)    2.3% ( -13% -   21%) 0.413
>         AndHighHighDayTaxoFacets         19.22      (1.6%)         22.13      (2.2%)   15.1% (  11% -   19%) 0.000
>          AndHighMedDayTaxoFacets         25.62      (1.5%)         29.93      (2.2%)   16.8% (  12% -   20%) 0.000
>             MedTermDayTaxoFacets         12.96      (2.2%)         18.99      (3.4%)   46.5% (  39% -   53%) 0.000
>           OrHighMedDayTaxoFacets          3.97      (2.0%)          5.81      (4.3%)   46.5% (  39% -   53%) 0.000
>            BrowseMonthTaxoFacets          2.59     (10.9%)         11.16     (35.8%)  330.4% ( 255% -  423%) 0.000
>             BrowseDateTaxoFacets          2.44      (9.7%)         13.12     (51.8%)  438.1% ( 343% -  553%) 0.000
>        BrowseDayOfYearTaxoFacets          2.44      (9.7%)         13.13     (51.7%)  438.2% ( 343% -  552%) 0.000
> {noformat}
> 
> 
> > Explore using SORTED_NUMERIC doc values to encode taxonomy ordinals for faceting
> > --------------------------------------------------------------------------------
> > 
> > Key: LUCENE-10062
> > URL: https://issues.apache.org/jira/browse/LUCENE-10062
> > Project: Lucene - Core
> > Issue Type: Improvement
> > Components: modules/facet
> > Reporter: Greg Miller
> > Assignee: Greg Miller
> > Priority: Minor
> > Time Spent: 1h 40m
> > Remaining Estimate: 0h
> > 
> > We currently encode taxonomy ordinals using varint style packing in a
> > binary doc values field. I suspect there have been a number of improvements
> > to SortedNumericDocValues since taxonomy faceting was first introduced, and
> > I plan to explore replacing the custom binary format we have today with a
> > SORTED_NUMERIC type dv field instead.
> > I'll report benchmark results and index size impact here.
> 
> 
> 
> --
> This message was sent by Atlassian Jira
> (v8.3.4#803005)
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
> For additional commands, e-mail: issues-help@lucene.apache.org
> 
> 




