
List:       lucene-user
Subject:    unlimited wildcard term expansion
From:       John Z <zjavier_1 () yahoo ! com>
Date:       2004-06-30 17:50:43
Message-ID: 20040630175043.76285.qmail () web51804 ! mail ! yahoo ! com


Hi,
 
I am trying to find a way to handle wildcard queries in Lucene without running out
of memory, and I have been having some problems with it.
I have modified parts of Lucene's search code to keep only about 1000 terms in
memory and write the rest of the terms to a file (this is done in the getQuery()
method of MultiTermQuery.java, PrefixQuery.java, etc.).
Then, when the scorer objects are created and scores are collected for each clause
in the score() method of BooleanScorer.java, all the clauses that are in memory are
processed first, and after that I continue reading from the file that I created
earlier. I read each term from the file, create a TermQuery for it, get the scorer
object from this TermQuery, and collect the score for it.
Finally, bucketTable.collectHits() is called to collect everything.
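
The spill-to-file step described above can be sketched roughly as follows. This is
a minimal, standalone illustration and not the actual change made to
MultiTermQuery or PrefixQuery: the class name SpillingTermList, its methods, and
the one-term-per-line file format are all hypothetical, and terms are plain strings
rather than Lucene Term objects.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: keeps the first maxInMemory terms in a list, spills the rest to a file. */
public class SpillingTermList {
    private final int maxInMemory;
    private final Path spillFile;
    private final List<String> inMemory = new ArrayList<>();
    private BufferedWriter spillWriter; // opened lazily on the first spilled term
    private int spilledCount = 0;

    public SpillingTermList(int maxInMemory, Path spillFile) {
        this.maxInMemory = maxInMemory;
        this.spillFile = spillFile;
    }

    /** Adds a term; assumes the term text contains no line breaks. */
    public void add(String term) throws IOException {
        if (inMemory.size() < maxInMemory) {
            inMemory.add(term);
            return;
        }
        if (spillWriter == null) {
            spillWriter = Files.newBufferedWriter(spillFile, StandardCharsets.UTF_8);
        }
        spillWriter.write(term);
        spillWriter.newLine();
        spilledCount++;
    }

    /** Flushes and closes the spill file; call once after the last add(). */
    public void finish() throws IOException {
        if (spillWriter != null) {
            spillWriter.close();
            spillWriter = null;
        }
    }

    public int inMemoryCount() { return inMemory.size(); }

    public int spilledCount() { return spilledCount; }

    /** Visits in-memory terms first, then streams the spilled terms one line at a time. */
    public void forEachTerm(Consumer<String> action) throws IOException {
        for (String term : inMemory) {
            action.accept(term);
        }
        if (spilledCount == 0) {
            return;
        }
        try (BufferedReader reader = Files.newBufferedReader(spillFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                action.accept(line);
            }
        }
    }
}
```

In this sketch the callback passed to forEachTerm stands in for the "create a
TermQuery, get its scorer, collect the score" step, so only one spilled term is
held in memory at a time.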
 
I tested my changes on small indexes, with about 2 terms in memory and about 2 or 3
terms in the file, and they worked fine.
However, when I tried this with a bigger index (> 1 million docs), with 1000 terms
in memory and 972 in the file, I got into an infinite loop in
bucketTable.collectHits(). I printed the doc in each bucket and noticed that about
halfway through the bucket list, the same 4 or 5 docs started repeating for the rest
of the list, and there was no null at the end of the list to terminate it.
I have looked everywhere and even tried increasing the bucket table size to the sum
of the number of terms in memory and the number of terms in the file, but that
still did not work.
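
To make the symptom concrete, here is a simplified model of how a linked bucket
list can lose its terminating null. It is only a guess at one possible cause, not a
confirmed diagnosis, and it is not Lucene's code: the class BucketCycleDemo and its
methods are hypothetical, and it assumes a table whose slots are indexed by
doc & MASK and whose buckets are pushed onto the list whenever a slot sees a new
doc. Under that assumption, collecting two docs that share a slot before the list
is drained re-links a bucket that is already in the list.

```java
/** Hypothetical, simplified model of a doc-indexed bucket table with a linked list of live buckets. */
public class BucketCycleDemo {
    public static final int SIZE = 1 << 10; // assumed table size
    public static final int MASK = SIZE - 1;

    static final class Bucket {
        int doc = -1;
        Bucket next;
    }

    private final Bucket[] buckets = new Bucket[SIZE];
    private Bucket first; // head of the list of buckets filled since the last drain

    /** Records a hit; a slot that sees a different doc is pushed onto the list again. */
    public void collect(int doc) {
        int slot = doc & MASK;
        Bucket b = buckets[slot];
        if (b == null) {
            buckets[slot] = b = new Bucket();
        }
        if (b.doc != doc) {
            b.doc = doc;
            b.next = first; // if b is already in the list, this creates a cycle
            first = b;
        }
    }

    /** Walks the list with a step limit; returns true only if the terminating null is reached. */
    public boolean listTerminates(int maxSteps) {
        int steps = 0;
        for (Bucket b = first; b != null; b = b.next) {
            if (++steps > maxSteps) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BucketCycleDemo sameWindow = new BucketCycleDemo();
        sameWindow.collect(5);
        sameWindow.collect(7);
        System.out.println(sameWindow.listTerminates(SIZE)); // prints "true"

        BucketCycleDemo mixedWindows = new BucketCycleDemo();
        mixedWindows.collect(5);        // slot 5
        mixedWindows.collect(7);        // slot 7, list is 7 -> 5
        mixedWindows.collect(5 + SIZE); // slot 5 again, list becomes 5 -> 7 -> 5 -> ...
        System.out.println(mixedWindows.listTerminates(SIZE)); // prints "false"
    }
}
```

If this model applies, the size of the table relative to the number of terms would
not matter; what matters is whether every scorer feeds the table docs from the same
range before the list is drained.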
I would really appreciate any suggestions/ideas/help on this.
 
Thanks.
Javier

		