List:       solr-user
Subject:    Re: Empty rows from /export?
From:       Erick Erickson <erickerickson () gmail ! com>
Date:       2019-05-31 22:08:38
Message-ID: D620914B-5826-4D56-80D9-3C8BC9D9FB51 () gmail ! com

docValues are indeed realized in Lucene. It's just that Lucene has no notion of
"schema". So when you define the schema, Solr carefully constructs the appropriate
low-level Lucene calls when a doc is indexed, to take care of all of the options
you've specified in the schema, things like stored, indexed, docValues etc.

Now we get to optimize. All Solr does is tell Lucene to mash together all the
segments and Lucene does its tricks. Lucene assumes it "knows" everything it needs to
know by what's already in the segments it's merging without reference to Solr's
schema. Therein lies the rub. If one segment has docValues for a field and another
segment doesn't, the result is "interesting". In general, Lucene can't reconstruct
the original data.
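
To make that concrete, here is a toy model in Python. It is NOT Lucene's merge
code; the segment layout and the names force_merge and export_ids are made up for
illustration, and real behavior on a mixed index may differ. It only shows the
principle: a merge copies what the segments already hold, so docs indexed before
the schema change still have no docValues afterwards.

```python
def force_merge(segments):
    """Toy merge: concatenate docs and copy whatever docValues already exist.

    Nothing here consults a schema, so missing docValues are never created.
    """
    merged = {"docs": [], "doc_values": {}}
    for seg in segments:
        base = len(merged["docs"])  # doc ids are renumbered in the merged segment
        merged["docs"].extend(seg["docs"])
        for local_id, value in seg.get("doc_values", {}).items():
            merged["doc_values"][base + local_id] = value
    return merged


def export_ids(segment):
    """Toy export: read the id from docValues only, one row per doc."""
    rows = []
    for doc_id in range(len(segment["docs"])):
        if doc_id in segment["doc_values"]:
            rows.append({"id": segment["doc_values"][doc_id]})
        else:
            rows.append({})  # no docValues entry, so the row comes back empty
    return rows


# Hypothetical index: three docs written before docValues were enabled, one after.
old_segment = {"docs": ["doc-1", "doc-2", "doc-3"]}
new_segment = {"docs": ["doc-4"], "doc_values": {0: "uuid-4"}}

print(export_ids(force_merge([old_segment, new_segment])))
```

In this toy model the first three rows are empty and only the last one carries an
id, which has the same shape as the empty rows reported at the start of the thread.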

From Robert Muir:
"I think the key issue here is Lucene is an index not a database. Because it is a
lossy index and does not retain all of the user's data, its not possible to safely
migrate some things automagically. In the norms case IndexWriter needs to re-analyze
the text ("re-index") and compute stats to get back the value, so it can be
re-encoded. The function is y = f(x) and if x is not available its not possible, so
lucene can't do it."

DocValues is a special case because all the data necessary to build docValues is
already in the index, i.e. the indexed data (assuming you originally put it in with
indexed=true). But it requires extra effort, thus the
UninvertDocValuesMergePolicyFactory.
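
A minimal sketch of what "uninverting" means, in Python. This is only the idea,
not how Solr or Lucene implement it; the function name and the sample postings
are hypothetical. An indexed field is stored as term -> doc ids, and docValues
are the opposite view, doc id -> value(s), so one can be rebuilt from the other
as long as the indexed terms are there.

```python
from collections import defaultdict


def uninvert(postings):
    """Rebuild a per-document column from an inverted index.

    postings maps each indexed term to the list of doc ids containing it.
    The result maps each doc id to the terms it contains, in term order.
    """
    column = defaultdict(list)
    for term, doc_ids in sorted(postings.items()):
        for doc_id in doc_ids:
            column[doc_id].append(term)
    return dict(column)


# Hypothetical postings for an indexed=true id field: one term per doc.
postings = {"uuid-a": [0], "uuid-b": [1], "uuid-c": [2]}
print(uninvert(postings))
```

Note this only works because the terms of an indexed string field are the
original values. For an analyzed text field the terms are tokens, not the
original text, which is the "lossy" case Robert describes above.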

> > I was curious if it
> > was safe to change the id field to docValues without reindexing

I'd be very reluctant. It's not something that's explicitly tested or supported, so
there are likely edge cases.

Best,
Erick

> On May 31, 2019, at 2:02 PM, David Hastings <hastings.recursive@gmail.com> wrote:
> 
> > Ah. So docValues are managed by Solr outside of Lucene. Interesting.
> 
> i was under the impression docValues are in lucene, and he is just saying
> that an optimize is not a re-index, its just taking the actual files that
> already exist in your index and arranging them and removing deletions, an
> optimize doesnt re-read the schema and re-index content
> 
> On Fri, May 31, 2019 at 1:59 PM Walter Underwood <wunder@wunderwood.org>
> wrote:
> 
> > Ah. So docValues are managed by Solr outside of Lucene. Interesting.
> > 
> > That actually answers a question I had not asked yet. I was curious if it
> > was safe to change the id field to docValues without reindexing if we never
> > sorted on it. It looks like fetching the value won't work until everything
> > is reindexed.
> > 
> > It seems like this would be a useful thing to have supported, migrating a
> > field to docValues.
> > 
> > wunder
> > Walter Underwood
> > wunder@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> > 
> > > On May 31, 2019, at 5:00 AM, Erick Erickson <erickerickson@gmail.com>
> > wrote:
> > > 
> > > bq. but I optimized all the cores, which should rewrite every segment as
> > docValues.
> > > 
> > > Not true. Optimize is a Lucene level force merge. Dealing with segments,
> > i.e. merging and the like, is a low-level Lucene operation and Lucene has
> > no notion of a schema. So a change you made to the schema is irrelevant to
> > merging.
> > > 
> > > You have to have something at the Solr level that does some magic for
> > this to work. Take a look at UninvertDocValuesMergePolicyFactory if you
> > have Solr 7.0 or later. WARNING: I haven't used that personally, and I do
> > not know what the behavior would be on an index that is "mixed", i.e. one
> > that already has segments with some docs having DV entries and some not.
> > > 
> > > Best,
> > > Erick
> > > 
> > > > On May 31, 2019, at 12:35 AM, Walter Underwood <wunder@wunderwood.org>
> > wrote:
> > > > 
> > > > That field was changed to docValues, but I optimized all the cores,
> > which should rewrite every segment as docValues.
> > > > 
> > > > wunder
> > > > Walter Underwood
> > > > wunder@wunderwood.org
> > > > http://observer.wunderwood.org/  (my blog)
> > > > 
> > > > > On May 30, 2019, at 7:37 PM, Erick Erickson <erickerickson@gmail.com>
> > wrote:
> > > > > 
> > > > > This is odd. The only reason I know of that would happen is if there
> > were no docValues for that field in those documents. By any chance were
> > docValues added to an existing index without totally reindexing into a new
> > collection?
> > > > > 
> > > > > What happens if you just query the collection rather than the
> > individual core? I'm thinking using a streaming expression as a check…..
> > > > > 
> > > > > > On May 30, 2019, at 6:41 PM, Walter Underwood <wunder@wunderwood.org>
> > wrote:
> > > > > > 
> > > > > > 3/4 of the documents I'm getting back from /export are empty. This
> > collection has four shards, so I'm querying the leader core on each shard
> > with /export. The results start like this:
> > > > > > 
> > > > > > 
> > {"numFound":912370,"docs":[{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},
> > 
> > > > > > 
> > > > > > The final 1/4 of the results have UUIDs (the ID type). The id field
> > is stored as docValues. This is the URL.
> > > > > > 
> > > > > > 
> > http://hostname:8983/solr/decks_shard1_replica1/export?q=id:*&distrib=false&shards=shard1&fl=id&sort=id+asc
> > 
> > > > > > 
> > > > > > Running 6.6.2, Solr Cloud. The total number of non-null ids from all
> > four shards is a bit less than 1/4 of the document count.
> > > > > > 
> > > > > > Any ideas about what is going on?
> > > > > > 
> > > > > > wunder
> > > > > > Walter Underwood
> > > > > > wunder@wunderwood.org
> > > > > > http://observer.wunderwood.org/  (my blog)
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> > 

