List:       solr-dev
Subject:    [jira] [Commented] (LUCENE-8374) Reduce reads for sparse DocValues
From:       "Adrien Grand (JIRA)" <jira () apache ! org>
Date:       2018-06-29 16:49:00
Message-ID: JIRA.13168646.1530110276000.42577.1530290940073 () Atlassian ! JIRA


    [ https://issues.apache.org/jira/browse/LUCENE-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16527935#comment-16527935 ]

Adrien Grand commented on LUCENE-8374:
--------------------------------------

bq. No temporarily switching back to a known stable sub-version of Solr if a new release within the same major version turns out to have severe problems.

This is something that is not guaranteed today, and we have actually broken it in many minor releases.

> Reduce reads for sparse DocValues
> ---------------------------------
> 
> Key: LUCENE-8374
> URL: https://issues.apache.org/jira/browse/LUCENE-8374
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Affects Versions: master (8.0), 7.3.1
> Reporter: Toke Eskildsen
> Priority: Major
> 
> The {{Lucene70DocValuesProducer}} has the internal classes
> {{SparseNumericDocValues}} and {{BaseSortedSetDocValues}} (sparse code path),
> which in turn use {{IndexedDISI}} to handle the docID -> value-ordinal lookup.
> The value-ordinal is the index of the docID in an abstract, tightly packed,
> monotonically increasing list of the docIDs that have values: if the docIDs
> with corresponding values are {{[0, 4, 1432]}}, their value-ordinals will be
> {{[0, 1, 2]}} (a tiny illustration follows below).
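> 
> As a purely illustrative sketch (the stored values shown are made up), the
> value-ordinal is just the position in the packed docID list, which in turn
> indexes the packed values:
> {code:java}
> // docIDs that have a value for the field, tightly packed and increasing.
> long[] docIDsWithValues = {0, 4, 1432};
> // The corresponding values, stored in the same order (hypothetical numbers).
> long[] packedValues = {42L, 87L, 13L};
> 
> // docID 4 sits at index 1 of the packed list, so its value-ordinal is 1
> // and its value is packedValues[1].
> int ordinalOfDoc4 = java.util.Arrays.binarySearch(docIDsWithValues, 4L);  // == 1
> long valueOfDoc4 = packedValues[ordinalOfDoc4];                           // == 87L
> {code}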
> 
> h2. Outer blocks
> The lookup structure of {{IndexedDISI}} consists of blocks of 2^16 values
> (65536), where each block can be either {{ALL}}, {{DENSE}} (2^12 to 2^16
> values) or {{SPARSE}} (< 2^12 values, ~6%). Consequently blocks vary quite a
> lot in size and in ordinal-resolving strategy. When a sparse numeric DocValue
> is needed, the code first locates the block containing the flag for the wanted
> docID. It does so by iterating blocks one-by-one until it reaches the needed
> one, where each iteration requires a lookup in the underlying {{IndexSlice}}.
> For a common memory-mapped index, this translates to either a cached request
> or a read operation. If a segment has 6M documents, the worst case is 91
> lookups. In our web archive, our segments have ~300M values: a worst case of
> 4577 lookups!
> 
> One obvious solution is to use a lookup table for blocks: a long[]-array with
> an entry for each block. For 6M documents, that is < 1KB and would allow
> direct jumping (a single lookup) in all instances. Unfortunately this lookup
> table cannot be generated up front when the writing of values is purely
> streaming. It can be appended to the end of the stream before it is closed,
> but without knowing the position of the lookup table, the reader cannot seek
> to it. One strategy for creating such a lookup table would be to generate it
> during reads and cache it for subsequent lookups (a rough sketch follows
> below). This does not fit directly into how {{IndexedDISI}} currently works
> (it is created anew for each invocation), but could probably be added with a
> little work. An advantage is that this does not change the underlying format
> and thus could be used with existing indexes.
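> 
> Purely as an illustration of the query-time caching idea (hypothetical names,
> not the actual {{IndexedDISI}} code), the block offsets could be collected
> lazily while iterating and reused by later instances for the same segment:
> {code:java}
> // Hypothetical per-segment cache of block start offsets.
> // blockOffsets[i] holds the slice offset of block i once it has been seen;
> // -1 means that block has not been reached yet.
> final class BlockJumpTable {
>   private final long[] blockOffsets;
> 
>   BlockJumpTable(int maxDoc) {
>     blockOffsets = new long[(maxDoc >>> 16) + 1];  // one entry per 2^16-doc block
>     java.util.Arrays.fill(blockOffsets, -1L);
>   }
> 
>   /** Remember where a block starts, as discovered during normal iteration. */
>   void record(int blockIndex, long offset) {
>     blockOffsets[blockIndex] = offset;
>   }
> 
>   /** Best known starting point for reaching the block that holds docID. */
>   long startOffsetFor(int docID) {
>     for (int block = docID >>> 16; block >= 0; block--) {
>       if (blockOffsets[block] != -1L) {
>         return blockOffsets[block];  // jump directly, skipping earlier blocks
>       }
>     }
>     return 0L;  // nothing cached yet: iterate from the start as today
>   }
> }
> {code}
> For 6M documents this is 92 long entries (< 1KB), matching the estimate above,
> and it works without any change to the index format.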
> 
> h2. The lookup structure inside each block
> If {{ALL}} of the 2^16 values are defined, the structure is empty and the
> ordinal is simply derived from the requested docID with some modulo and
> multiply math. Nothing to improve there. If the block is {{DENSE}} (2^12 to
> 2^16 values are defined), a bitmap is used and the number of set bits up to
> the wanted index (the docID relative to the block origin) is counted. That
> bitmap is a long[1024], meaning that the worst case is to look up and count
> the set bits of all 1024 longs!
> 
> One known solution to this is to use a
> [rank structure|https://en.wikipedia.org/wiki/Succinct_data_structure]. I
> [implemented it|https://github.com/tokee/lucene-solr/blob/solr5894/solr/core/src/java/org/apache/solr/search/sparse/count/plane/RankCache.java]
> for a related project, and with that, the rank overhead for a {{DENSE}} block
> would be long[32] and would ensure a maximum of 9 lookups (a sketch of the
> idea follows below). Building the rank structure is not trivial, and caching
> it (assuming all blocks are dense) for 6M documents would require 22 KB
> (3.17% overhead). It would be far better to generate the rank structure at
> index time and store it immediately before the bitset (this is possible with
> streaming, as each block is fully resolved before flushing), but of course
> that would require a change to the codec.
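> 
> To illustrate the rank idea only (simplified and with made-up granularity; not
> the layout used by RankCache or the codec): store a running popcount for every
> group of 8 longs, so a lookup reads one rank entry plus at most 8 longs
> instead of scanning up to 1024 longs:
> {code:java}
> // Illustrative rank structure over a DENSE block bitmap (long[1024]).
> // One cumulative set-bit count per group of 8 longs (512 bits).
> static int[] buildRank(long[] bitmap) {
>   int[] rank = new int[bitmap.length / 8];
>   int setBits = 0;
>   for (int i = 0; i < bitmap.length; i++) {
>     if (i % 8 == 0) {
>       rank[i / 8] = setBits;              // set bits before this group
>     }
>     setBits += Long.bitCount(bitmap[i]);
>   }
>   return rank;
> }
> 
> // Number of set bits in bitmap[0..bitIndex): the ordinal offset inside the block.
> static int rankLookup(long[] bitmap, int[] rank, int bitIndex) {
>   int word = bitIndex >>> 6;              // which long holds the wanted bit
>   int count = rank[word >>> 3];           // cumulative count at the group start
>   for (int i = (word >>> 3) << 3; i < word; i++) {
>     count += Long.bitCount(bitmap[i]);    // longs between group start and wanted long
>   }
>   count += Long.bitCount(bitmap[word] & ((1L << (bitIndex & 63)) - 1));
>   return count;                           // worst case: 1 rank read + 8 long reads
> }
> {code}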
> If the block is {{SPARSE}} (< 2^12 values, ~6%), the docIDs are simply stored
> as a list. As a comment in the code suggests, a binary search through these
> would be faster, although that would mean seeking backwards (see the sketch
> below). If that is not acceptable, I don't have any immediate idea for
> avoiding the full iteration.
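> 
> Purely to illustrate the difference (a simplified in-memory sketch; the real
> {{SPARSE}} list is read from the underlying slice, which is why a binary
> search would imply backwards seeks):
> {code:java}
> // Current strategy, conceptually: scan the SPARSE docID list until the target
> // is reached or passed. Worst case: every entry in the block is visited.
> static int ordinalByScan(int[] sparseDocIDs, int targetDocID) {
>   for (int ord = 0; ord < sparseDocIDs.length; ord++) {
>     if (sparseDocIDs[ord] >= targetDocID) {
>       return sparseDocIDs[ord] == targetDocID ? ord : -1;  // -1: no value
>     }
>   }
>   return -1;
> }
> 
> // Suggested alternative: binary search, at most ~12 probes for < 2^12 entries,
> // but the probes jump back and forth instead of moving strictly forward.
> static int ordinalByBinarySearch(int[] sparseDocIDs, int targetDocID) {
>   int ord = java.util.Arrays.binarySearch(sparseDocIDs, targetDocID);
>   return ord >= 0 ? ord : -1;
> }
> {code}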
> 
> I propose implementing query-time caching of both block jumps and inner-block
> lookups for {{DENSE}} (using rank) as a first improvement, and an index-time
> {{DENSE}} rank structure as a future improvement. As query-time caching is
> likely to be too costly for rapidly changing indexes, it should probably be an
> opt-in in solrconfig.xml.
> 
> h2. Some real-world observations
> This analysis was triggered by massive (10x) slowdown problems with both
> simple querying and large exports from our web archive index after upgrading
> from Solr 4.10 to 7.3.1. The query matching itself takes ½-2 seconds, but
> returning the top-10 documents takes 5-20 seconds (~50 non-stored DocValues
> fields), up from ½-2 seconds in total with Solr 4.10 (which had more of a mix
> of stored vs. DocValues fields, so the numbers might not be directly
> comparable). Measuring with VisualVM points to {{NIOFSIndexInput.readInternal}}
> as *the* hotspot. We ran some tests with simple queries on a single
> 307,171,504-document segment with different single-value DocValues fields in
> {{fl}} and got:
> ||Field||Type||Docs with value||Docs w/ value %||Speed in docs/sec||
> |url|String|307,171,504|100%|12,500|
> |content_type_ext|String|224,375,378|73%|360|
> |author|String|1,506,365|0.5%|1,100|
> |crawl_date|DatePoint|307,171,498|~100%|90|
> |content_text_length|IntPoint|285,800,212|93%|410|
> |content_length|IntPoint|307,016,816|99.9%|100|
> |crawl_year|IntPoint|307,171,498|~100%|14,500|
> |last_modified|DatePoint|6,835,065|2.2%|570|
> |source_file_offset|LongPoint|307,171,504|100%|28,000|
> Note how both url and source_file_offset are very fast and also have a value
> for _all_ documents. In contrast, content_type_ext is very slow and crawl_date
> is extremely slow; as they both have values for _nearly_ all documents, I
> presume they are using {{IndexedDISI#DENSE}}. last_modified is also quite slow
> and presumably uses {{IndexedDISI#SPARSE}}. The only mystery is crawl_year,
> which is also present in _nearly_ all documents but is very fast. I have no
> explanation for that one yet. I hope to take a stab at this around August
> 2018, but no promises.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

