List: lucene-dev
Subject: Re: [jira] [Commented] (LUCENE-10062) Explore using SORTED_NUMERIC doc values to encode taxonomy ord
From: Michael Sokolov <msokolov () gmail ! com>
Date: 2021-09-24 10:58:24
Message-ID: CAGUSZHB2Sj0DRbQc0+U+REgeac-1Cm2FquFawK=my3aB_VyQrQ () mail ! gmail ! com
Hard to read on the phone, but is that a 482% speed up I saw??!
On Thu, Sep 23, 2021, 1:28 PM Greg Miller (Jira) <jira@apache.org> wrote:
>
> [
> https://issues.apache.org/jira/browse/LUCENE-10062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419349#comment-17419349
> ]
>
> Greg Miller commented on LUCENE-10062:
> --------------------------------------
>
> I re-ran {{luceneutil}} benchmarks {{wikimedium10m}} since [~mikemccand]
> added new faceting tasks (thanks Mike!). Looks like there's a nice
> improvement on these new faceting tasks as well with this change (and no
> regressions anywhere else that I see).
>
> I was waiting to iterate on my PR until I was able to run these new
> benchmarking tasks, but it seems like there's enough benefit to this change
> to pick it back up.
>
>
> {noformat}
> Task                      QPS baseline   StdDev  QPS candidate   StdDev  Pct diff                 p-value
> HighTermDayOfYearSort            70.02  (13.7%)          68.45   (9.7%)     -2.2% ( -22% -  24%)   0.551
> MedTerm                        1300.90   (5.5%)        1275.97   (6.7%)     -1.9% ( -13% -  10%)   0.324
> HighTerm                       1953.46   (5.8%)        1925.79   (7.9%)     -1.4% ( -14% -  13%)   0.518
> HighTermTitleBDVSort            122.35  (15.6%)         120.86  (14.9%)     -1.2% ( -27% -  34%)   0.801
> TermDTSort                      133.47   (8.7%)         131.86   (7.4%)     -1.2% ( -15% -  16%)   0.637
> LowTerm                        1636.13   (5.5%)        1622.34   (7.4%)     -0.8% ( -12% -  12%)   0.682
> Prefix3                          25.69   (6.0%)          25.48   (6.3%)     -0.8% ( -12% -  12%)   0.676
> LowSpanNear                     118.02   (2.1%)         117.31   (1.8%)     -0.6% (  -4% -   3%)   0.326
> HighTermMonthSort               140.17   (9.8%)         139.47   (9.9%)     -0.5% ( -18% -  21%)   0.872
> AndHighHigh                      49.17   (3.1%)          48.92   (2.7%)     -0.5% (  -6% -   5%)   0.584
> HighSpanNear                     25.54   (2.7%)          25.41   (2.2%)     -0.5% (  -5% -   4%)   0.529
> AndHighLow                      556.68   (5.8%)         554.80   (5.4%)     -0.3% ( -10% -  11%)   0.848
> BrowseDayOfYearSSDVFacets        16.53   (2.5%)          16.47   (2.4%)     -0.3% (  -5% -   4%)   0.674
> IntNRQ                           87.76   (2.0%)          87.49   (2.1%)     -0.3% (  -4% -   3%)   0.634
> MedSpanNear                      31.11   (2.2%)          31.04   (1.6%)     -0.2% (  -3% -   3%)   0.714
> OrNotHighLow                    765.10   (4.5%)         763.60   (5.4%)     -0.2% (  -9% -  10%)   0.901
> MedPhrase                       160.05   (3.1%)         159.83   (2.9%)     -0.1% (  -5% -   6%)   0.885
> HighSloppyPhrase                 27.67   (3.1%)          27.64   (3.0%)     -0.1% (  -6% -   6%)   0.915
> LowPhrase                        61.12   (3.2%)          61.05   (3.2%)     -0.1% (  -6% -   6%)   0.921
> OrHighMed                        71.85   (2.9%)          71.82   (2.1%)     -0.0% (  -4% -   5%)   0.963
> HighPhrase                       29.40   (2.3%)          29.39   (2.8%)     -0.0% (  -5% -   5%)   0.971
> Fuzzy2                           32.58   (4.3%)          32.57   (6.1%)     -0.0% (  -9% -  10%)   0.992
> LowIntervalsOrdered             150.30   (1.9%)         150.28   (1.9%)     -0.0% (  -3% -   3%)   0.986
> AndHighMed                      151.32   (3.9%)         151.31   (4.1%)     -0.0% (  -7% -   8%)   0.993
> OrHighHigh                       23.90   (2.3%)          23.91   (1.9%)      0.0% (  -4% -   4%)   0.970
> OrHighNotLow                    579.17   (5.1%)         579.35   (6.4%)      0.0% ( -10% -  12%)   0.986
> MedIntervalsOrdered              86.93   (1.7%)          86.98   (1.9%)      0.1% (  -3% -   3%)   0.913
> OrHighNotHigh                   536.17   (5.6%)         536.57   (6.6%)      0.1% ( -11% -  12%)   0.969
> OrNotHighHigh                   787.07   (6.5%)         787.96   (8.1%)      0.1% ( -13% -  15%)   0.961
> OrNotHighMed                    687.97   (4.7%)         688.77   (6.9%)      0.1% ( -10% -  12%)   0.950
> MedSloppyPhrase                  68.62   (2.8%)          68.74   (2.7%)      0.2% (  -5% -   5%)   0.838
> LowSloppyPhrase                 130.37   (2.6%)         130.62   (2.2%)      0.2% (  -4% -   5%)   0.797
> OrHighLow                       440.44   (4.1%)         441.33   (4.1%)      0.2% (  -7% -   8%)   0.877
> Wildcard                        122.01   (5.2%)         122.35   (5.3%)      0.3% (  -9% -  11%)   0.867
> HighIntervalsOrdered             14.24   (2.2%)          14.34   (2.1%)      0.6% (  -3% -   5%)   0.350
> Respell                          52.04   (2.2%)          52.48   (2.0%)      0.8% (  -3% -   5%)   0.209
> OrHighNotMed                    674.76   (4.8%)         680.97   (8.0%)      0.9% ( -11% -  14%)   0.659
> PKLookup                        153.45   (4.3%)         155.13   (3.8%)      1.1% (  -6% -   9%)   0.394
> Fuzzy1                           56.57   (9.1%)          57.76   (6.7%)      2.1% ( -12% -  19%)   0.406
> BrowseMonthSSDVFacets            19.59  (10.4%)          20.03   (6.7%)      2.3% ( -13% -  21%)   0.413
> AndHighHighDayTaxoFacets         19.22   (1.6%)          22.13   (2.2%)     15.1% (  11% -  19%)   0.000
> AndHighMedDayTaxoFacets          25.62   (1.5%)          29.93   (2.2%)     16.8% (  12% -  20%)   0.000
> MedTermDayTaxoFacets             12.96   (2.2%)          18.99   (3.4%)     46.5% (  39% -  53%)   0.000
> OrHighMedDayTaxoFacets            3.97   (2.0%)           5.81   (4.3%)     46.5% (  39% -  53%)   0.000
> BrowseMonthTaxoFacets             2.59  (10.9%)          11.16  (35.8%)    330.4% ( 255% - 423%)   0.000
> BrowseDateTaxoFacets              2.44   (9.7%)          13.12  (51.8%)    438.1% ( 343% - 553%)   0.000
> BrowseDayOfYearTaxoFacets         2.44   (9.7%)          13.13  (51.7%)    438.2% ( 343% - 552%)   0.000
> {noformat}
>
>
> > Explore using SORTED_NUMERIC doc values to encode taxonomy ordinals for faceting
> >
> --------------------------------------------------------------------------------
> >
> > Key: LUCENE-10062
> > URL: https://issues.apache.org/jira/browse/LUCENE-10062
> > Project: Lucene - Core
> > Issue Type: Improvement
> > Components: modules/facet
> > Reporter: Greg Miller
> > Assignee: Greg Miller
> > Priority: Minor
> > Time Spent: 1h 40m
> > Remaining Estimate: 0h
> >
> > We currently encode taxonomy ordinals using varint style packing in a
> > binary doc values field. I suspect there have been a number of improvements
> > to SortedNumericDocValues since taxonomy faceting was first introduced, and
> > I plan to explore replacing the custom binary format we have today with a
> > SORTED_NUMERIC type dv field instead.
> > I'll report benchmark results and index size impact here.
>
>
>
> --
> This message was sent by Atlassian Jira
> (v8.3.4#803005)
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
> For additional commands, e-mail: issues-help@lucene.apache.org
>
>
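
For readers unfamiliar with the "varint style packing" mentioned in the issue description, here is a minimal Python sketch of the general idea: the ordinals for one document are sorted, delta-encoded, and each delta is written as a variable-length integer into a single byte string. This is an illustration only, not Lucene's actual encoder. The function names `encode_ordinals` and `decode_ordinals` and the low-bits-first byte order are assumptions made for this example.

```python
def encode_ordinals(ordinals):
    """Pack one document's ordinals into bytes (hypothetical helper).

    Ordinals are de-duplicated and sorted, then stored as deltas so that
    most values fit in a single byte.
    """
    out = bytearray()
    prev = 0
    for ordinal in sorted(set(ordinals)):
        delta = ordinal - prev
        prev = ordinal
        # Emit 7 bits at a time, low bits first (an assumption of this
        # sketch). A set high bit means "more bytes follow".
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)


def decode_ordinals(data):
    """Reverse encode_ordinals: walk the bytes and rebuild the ordinals."""
    ordinals = []
    prev = 0
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            # Continuation bit set: keep accumulating this delta.
            shift += 7
        else:
            # Last byte of this delta: add it to the running ordinal.
            prev += value
            ordinals.append(prev)
            value = 0
            shift = 0
    return ordinals


packed = encode_ordinals([3, 300, 5])
print(packed.hex())             # bytes for the deltas 3, 2, 295
print(decode_ordinals(packed))  # the sorted ordinals again
```

With a binary field like this, the facet code has to run a byte-by-byte decode loop for every document it counts. The proposal in the issue is to store the ordinals in a SORTED_NUMERIC doc values field instead, so the per-document values are read through the doc values API rather than a custom decoder.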