'Re: POS Tagger'


List:       solr-user
Subject:    Re: POS Tagger
From:       Nicolas Paris <nicolas.paris () riseup ! net>
Date:       2019-10-25 18:43:21
Message-ID: 20191025184321.3iijc3wh2nombg2s () riseup ! net

Also, the OpenNLP POS tagger example in the Solr guide [1] uses the
TypeAsSynonymFilter to store the POS:

"Index the POS for each token as a synonym, after prefixing the POS with @"

I am not sure how to query the POS after such indexing, but this looks
like an interesting approach.

[1] http://lucene.apache.org/solr/guide/7_3/language-analysis.html#opennlp-part-of-speech-filter
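
To make the idea concrete, here is a minimal sketch in plain Python of what
I understand "POS as a synonym" to produce: each token keeps its position,
and the prefixed POS is added at that same position. This is not Solr code;
the function name pos_as_synonyms, the sample sentence and the tag values
are assumptions for illustration only.

```python
from typing import List, Tuple


def pos_as_synonyms(
    tagged: List[Tuple[str, str]], prefix: str = "@"
) -> List[Tuple[int, str]]:
    """Return (position, term) pairs for a tagged sentence.

    Each token is emitted at its own position, followed by its POS tag
    (with the prefix) at the SAME position, the way a synonym would be.
    """
    stream = []
    for position, (token, pos) in enumerate(tagged):
        stream.append((position, token))
        stream.append((position, prefix + pos))
    return stream


# Hypothetical tagger output: (token, POS) pairs.
tagged = [("IBM", "NNP"), ("builds", "VBZ"), ("servers", "NNS")]
stream = pos_as_synonyms(tagged)
for position, term in stream:
    print(position, term)
```

The "@" prefix is what keeps a tag such as "@NNP" from colliding with an
ordinary term in the same field.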

On Fri, Oct 25, 2019 at 06:25:36PM +0200, Nicolas Paris wrote:
> > Do you use the POS tagger at query time, or just at index time? 
> 
> I have the POS tagger pipeline ready, but nothing is done yet on the Solr
> part. Right now I am wondering how to use it and am still looking for a
> relevant implementation.
> 
> I guess having the POS information ready before indexing gives the
> flexibility to test multiple scenarios.
> 
> In the case of acronyms, one possible way is indeed to treat the user
> query as NOUNs and, on the index side, keep only the acronyms that
> are tagged as NOUN (i.e. detect acronyms within the text and look up
> their POS; remove any that are not a NOUN).
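
A minimal sketch of that index-side filter, in plain Python. The acronym
pattern (two or more capital letters), the function name noun_acronyms and
the tag name "NOUN" are assumptions for illustration; the tags would come
from whatever POS tagger runs upstream.

```python
import re
from typing import FrozenSet, List, Tuple

# Assumption: an acronym is a token made of two or more capital letters.
ACRONYM = re.compile(r"^[A-Z]{2,}$")


def noun_acronyms(
    tagged: List[Tuple[str, str]],
    noun_tags: FrozenSet[str] = frozenset({"NOUN"}),
) -> List[str]:
    """Keep only the acronym-looking tokens whose POS tag is a noun tag."""
    return [
        token
        for token, pos in tagged
        if ACRONYM.match(token) and pos in noun_tags
    ]


# Hypothetical tagger output: "IS" looks like an acronym but is a verb here.
tagged = [
    ("The", "DET"),
    ("POS", "NOUN"),
    ("tagger", "NOUN"),
    ("IS", "VERB"),
    ("slow", "ADJ"),
]
kept = noun_acronyms(tagged)
print(kept)
```

Passing a different noun_tags set is enough to adapt this to another tag
set.
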
> 
> Definitely, I prefer the pre-processing approach for this over creating
> dedicated Solr analysers, because my context is batch processing and
> because it simplifies testing and debugging, while offering a large
> range of NLP tools to work with.
> 
> On Fri, Oct 25, 2019 at 04:09:29PM +0000, Audrey Lorberfeld -
> Audrey.Lorberfeld@ibm.com wrote:
> > Nicolas,
> > 
> > Do you use the POS tagger at query time, or just at index time? 
> > 
> > We are thinking of using it to filter the tokens we will eventually
> > perform ML on. Basically, we have a bunch of acronyms in our corpus.
> > However, many departments use the same acronyms but expand those
> > acronyms to different things. Eventually, we are thinking of using ML
> > on our index to determine which expansion is meant by a particular
> > query according to the context we find in certain documents. However,
> > since we don't want to run ML on all tokens in a query, and since we
> > think that acronyms are usually the nouns in a multi-token query, we
> > want to only feed nouns to the ML model (TBD).
> >
> > Does that make sense? So, we'd want both an index-side POS tagger
> > (could be slow), and also a query-side POS tagger (must be fast).
> > -- 
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > Audrey.Lorberfeld@IBM.com
> > 
> > 
> > On 10/25/19, 11:57 AM, "Nicolas Paris" <nicolas.paris@riseup.net> wrote:
> > 
> > Also we are using the Stanford POS tagger for French. The processing
> > time is mitigated by the spark-corenlp package, which distributes the
> > processing over multiple nodes.
> > 
> > Also I am interested in the way you use POS information within Solr
> > queries or Solr fields.
> > 
> > Thanks,
> > On Fri, Oct 25, 2019 at 10:42:43AM -0400, David Hastings wrote:
> > > ah, yeah it's not the fastest, but it proved to be the best for my purposes.
> > > I use it to pre-process data before indexing, to add more metadata to the
> > > documents in separate field(s)
> > > 
> > > On Fri, Oct 25, 2019 at 10:40 AM Audrey Lorberfeld -
> > > Audrey.Lorberfeld@ibm.com <Audrey.Lorberfeld@ibm.com> wrote:
> > > 
> > > > No, I meant for part-of-speech tagging. But that's interesting that you
> > > > use StanfordNLP. I've read that it's very slow, so we are concerned that it
> > > > might not work for us at query-time. Do you use it at query-time, or just
> > > > index-time?
> > > > 
> > > > --
> > > > Audrey Lorberfeld
> > > > Data Scientist, w3 Search
> > > > IBM
> > > > Audrey.Lorberfeld@IBM.com
> > > > 
> > > > 
> > > > On 10/25/19, 10:30 AM, "David Hastings" <hastings.recursive@gmail.com>
> > > > wrote:
> > > > 
> > > > Do you mean for entity extraction?
> > > > I make a LOT of use of the Stanford NLP project, and get out the
> > > > entities and use them for different purposes in Solr
> > > > -Dave
> > > > 
> > > > On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
> > > > Audrey.Lorberfeld@ibm.com <Audrey.Lorberfeld@ibm.com> wrote:
> > > > 
> > > > > Hi All,
> > > > > 
> > > > > Does anyone use a POS tagger with their Solr instance other than
> > > > > OpenNLP's? We are considering OpenNLP, SpaCy, and Watson.
> > > > > 
> > > > > Thanks!
> > > > > 
> > > > > --
> > > > > Audrey Lorberfeld
> > > > > Data Scientist, w3 Search
> > > > > IBM
> > > > > Audrey.Lorberfeld@IBM.com
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > > 
> > 
> > -- 
> > nicolas
> > 
> > 
> 
> -- 
> nicolas
> 

-- 
nicolas

