
List:       solr-user
Subject:    Re: How might one search for dupe IDs other than faceting on the ID field?
From:       "Jack Krupansky" <jack () basetechnology ! com>
Date:       2013-07-30 21:48:35
Message-ID: 8C32E29B09384D8B8F9436B931C756B4 () JackKrupansky

You could also try the terms component, which provides a very efficient
facet-like feature: counting the terms. You can set a minimum document
frequency of 2 (terms.mincount), so only the duplicates come back:

curl "http://localhost:8983/solr/terms?terms.fl=id&terms.mincount=2"
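For an index of this size it may help to page through the terms rather
than pull them in one response. The following is a minimal Python sketch
of that idea, not a tested recipe for any particular Solr version. It
assumes the /terms handler from the curl example above, the JSON response
writer with its flat [term, count, term, count, ...] list layout, and the
standard terms.limit, terms.sort, terms.lower and terms.lower.incl
parameters. The function names and the page size are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL, copied from the curl example; adjust host and core as needed.
SOLR_TERMS_URL = "http://localhost:8983/solr/terms"


def build_terms_url(field="id", mincount=2, limit=1000, lower=None,
                    base=SOLR_TERMS_URL):
    """Build a terms-component URL that returns only terms found in at
    least `mincount` documents."""
    params = [
        ("terms.fl", field),
        ("terms.mincount", mincount),
        ("terms.limit", limit),    # page size; the component returns 10 by default
        ("terms.sort", "index"),   # index order, so paging by terms.lower is stable
        ("wt", "json"),
    ]
    if lower is not None:
        # Resume strictly after the last term of the previous page.
        params += [("terms.lower", lower), ("terms.lower.incl", "false")]
    return base + "?" + urlencode(params)


def parse_terms(response, field="id"):
    """Turn the flat [term, count, term, count, ...] list into pairs."""
    flat = response.get("terms", {}).get(field, [])
    return list(zip(flat[0::2], flat[1::2]))


def iter_duplicate_ids(fetch, field="id", page_size=1000):
    """Yield (id, count) pairs page by page until an empty page comes back.
    `fetch` is any callable that maps a URL to a parsed JSON dict."""
    lower = None
    while True:
        url = build_terms_url(field, limit=page_size, lower=lower)
        page = parse_terms(fetch(url), field)
        if not page:
            return
        yield from page
        lower = page[-1][0]


def fetch_json(url):
    """Default fetcher: plain HTTP GET, body decoded as JSON."""
    with urlopen(url) as resp:
        return json.load(resp)
```

Calling iter_duplicate_ids(fetch_json) against a live server would then
yield the candidate duplicate IDs a page at a time. Treat the results as
candidates only: the counts are document frequencies from the term
dictionary, so they can include deleted or overwritten documents that
have not been merged away yet.
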

-- Jack Krupansky

-----Original Message----- 
From: Jack Krupansky
Sent: Tuesday, July 30, 2013 4:14 PM
To: solr-user@lucene.apache.org
Subject: Re: How might one search for dupe IDs other than faceting on the ID 
field?

The Solr SignatureUpdateProcessorFactory is designed to facilitate
deduplication. Is there any particular reason you did not use it?

See:
http://wiki.apache.org/solr/Deduplication

and

https://cwiki.apache.org/confluence/display/solr/De-Duplication

And I give a bunch of examples in my book.

-- Jack Krupansky

-----Original Message----- 
From: Dotan Cohen
Sent: Tuesday, July 30, 2013 2:16 PM
To: solr-user@lucene.apache.org
Subject: How might one search for dupe IDs other than faceting on the ID
field?

To search for duplicate IDs, I am running the following query:
select?q=*:*&facet=true&facet.field=id&rows=0

However, since upgrading from Solr 4.1 to Solr 4.3 I am receiving
OutOfMemoryError exceptions instead of the desired facet counts:

<response><lst name="error"><str name="msg">java.lang.OutOfMemoryError: Java heap space</str>
<str name="trace">java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
    at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
    at ...

Might there be a less resource-intensive way to get this information?
This is Solr 4.3 running on Ubuntu Server 12.04 in Jetty. The index
has over 100,000,000 small records, for a total of about 95 GiB of
disk space, with Solr running on its own disk. Actually, the 'disk'
is an Amazon Web Services EBS volume.

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com 
