List:       wikitech-l
Subject:    Re: [Wikitech-l] Some questions about Cirrussearch/Elasticsearch
From:       Erik Bernhardson <ebernhardson () wikimedia ! org>
Date:       2015-10-29 21:55:06
Message-ID: CAK4LxJfj++U7kVAnZE8HEzoEjCM8tZSebScpLrKtMiToCnxoqw () mail ! gmail ! com

On Thu, Oct 29, 2015 at 2:22 PM, Strainu <strainu10@gmail.com> wrote:

> Thanks for the response Erik, it's been very informative. I have a few
> follow up questions (inline)
>
> On 29 October 2015 17:56:25 EET, Erik Bernhardson <
> ebernhardson@wikimedia.org> wrote:
> >On Thu, Oct 29, 2015 at 8:47 AM, Strainu <strainu10@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> I've been reading the mw.org and wikitech pages on CirrusSearch (and
> >> the code) in the hope of understanding how the page content is
> >> transformed before being sent to ES and how it is stored in ES. I have
> >> a few questions:
> >>
> >> 1. Is the documentation available anywhere? I don't see it on
> >> https://doc.wikimedia.org/
> >>
> >>
> >Feature documentation is at
> >https://www.mediawiki.org/wiki/Help:CirrusSearch,
> >operational documentation is at
> >https://wikitech.wikimedia.org/wiki/Search
>
> I was referring to the code docs; they make it easier to follow the class
> hierarchy.
>

There is very little documentation of the code outside of the code
itself. The best you will find, which covers only a small portion of the
code, are the parts I wrote up in the Indexing
<https://wikitech.wikimedia.org/wiki/Search#Indexing> and Job queue
<https://wikitech.wikimedia.org/wiki/Search#Job_queue> sections of the
operational documentation. Feel free to stop by the #wikimedia-discovery
channel on freenode and ask questions; some of the developers on the team
might be able to point you in the right direction.

>
> >
> >> 2. What part of the whole ecosystem transforms the wikitext into
> >> indexable text? Where can I find it? It should be somewhere downstream
> >> from CirrusSearch\Updater::updateFromTitle(), but I can't figure out
> >> where exactly.
> >>
> >>
> >The documents are built using the classes in
> >
> https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/tree/master/includes/BuildDocument
>
> I see you use already-parsed text. I'm wondering if using the output of
> mwparserfromhell would work - I have some wikitext that is not in a mw
> database that I would like to index. I'm guessing I'll have to write some
> code, but the idea would be the same.
>
>
Correct, we use the output of the PHP wikitext parser for the initial
portion of the transformation. The easiest way to integrate with
CirrusSearch will be to reuse the MediaWiki parser. I've never played with
mwparserfromhell, but as with most software, with some effort you can tie
almost anything together :)


> >
> >
> >> If this transformation doesn't happen, from where is the searchable
> >> text obtained?
> >>
> >> 3. Where can I find the ES schema used for wikipages? Is it different
> >> for images/categories?
> >>
> >>
> >The ES schema is the same everywhere. The easiest way to see what the
> >data looks like is to request a dump for a particular page. This outputs
> >JSON; I use a Chrome extension called JsonView to make it look nice:
> >https://wikitech.wikimedia.org/wiki/Search?action=cirrusdump
>
> That is very cool indeed.
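
The same dump can also be fetched from a script. Below is a minimal Python
sketch, not anything that ships with CirrusSearch: the helper names
(cirrusdump_url, fetch_cirrusdump) and the User-Agent string are made up for
illustration, and the only thing taken from this thread is the
?action=cirrusdump query parameter on a normal page URL. The shape of the
returned JSON is not assumed here; the code just parses it.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def cirrusdump_url(base, title):
    """Build the page URL with ?action=cirrusdump appended.

    `base` is the wiki's article path prefix, e.g.
    "https://wikitech.wikimedia.org/wiki" (hypothetical parameter name).
    """
    page = quote(title.replace(" ", "_"))
    return f"{base.rstrip('/')}/{page}?{urlencode({'action': 'cirrusdump'})}"


def fetch_cirrusdump(base, title, timeout=10):
    """Fetch and parse the JSON dump for one page (performs a network call)."""
    # Illustrative User-Agent; replace with something identifying your script.
    req = Request(cirrusdump_url(base, title),
                  headers={"User-Agent": "cirrusdump-example/0.1"})
    with urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Example (needs network access):
#   doc = fetch_cirrusdump("https://wikitech.wikimedia.org/wiki", "Search")
#   print(json.dumps(doc, indent=2))
```

Pretty-printing with json.dumps(..., indent=2) gives roughly what a JSON
viewer extension shows in the browser.
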
>
> Thanks again,
>  Strainu
> >
> >
> >> Thanks,
> >>    Strainu
> >>
> >> _______________________________________________
> >> Wikitech-l mailing list
> >> Wikitech-l@lists.wikimedia.org
> >> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>