List: wikitech-l
Subject: Re: [Wikitech-l] TechCom topics 2020-11-04 (fixed)
From: Adam Baso <abaso () wikimedia ! org>
Date: 2020-11-18 14:45:54
Message-ID: CAB74=NpR0yMJa35VLywsWP+2GsFy0eQO+-5C31pUOt4m6Xw3Zg () mail ! gmail ! com
Dan Andreescu <dandreescu@wikimedia.org> wrote:
>> Maybe something exists already in Hadoop
>
> The page properties table is already loaded into Hadoop on a monthly basis
> (wmf_raw.mediawiki_page_props). I haven't played with it much, but Hive
> also has JSON-parsing goodies, so give it a shot and let me know if you get
> stuck. In general, data from the databases can be sqooped into Hadoop. We
> do this for large pipelines like edit history
> <https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Edit_data_loading> and
> it's very easy
> <https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/sqoop.py#L505>
> to add a table. We're looking at just replicating the whole db on a more
> frequent basis, but we have to do some groundwork first to allow
> incremental updates (see Apache Iceberg if you're interested).
>
>
Yes, I like that and all of the other wmf_raw goodies! I'll follow up
off-thread about accessing the parser cache DBs (they're defined in site.pp
and db-eqiad.php, but I don't think refinery.util currently covers them,
since they're not in the .dblist files).
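For anyone following along: Hive's get_json_object() is one of the
"JSON-parsing goodies" mentioned above, useful when a pp_value happens to
hold JSON. A rough Python stand-in for its behaviour (the sample rows and
property names here are hypothetical, not real page_props contents) might
look like:

```python
import json

def get_json_object(value, path):
    """Rough stand-in for Hive's get_json_object(value, '$.a.b').

    Returns None when the value is not JSON or the path is missing,
    mirroring Hive's NULL behaviour. Only supports simple dotted
    paths, not the full JSONPath syntax Hive accepts.
    """
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    try:
        obj = json.loads(value)
    except (ValueError, TypeError):
        return None  # non-JSON pp_value, e.g. a plain string like "Q42"
    for part in path.lstrip("$").strip(".").split("."):
        if not part:
            continue  # path was just "$": return the whole parsed value
        if isinstance(obj, dict) and part in obj:
            obj = obj[part]
        else:
            return None  # missing key: Hive would return NULL here
    return obj

# Hypothetical (page_id, pp_propname, pp_value) rows for illustration:
rows = [
    (1, "wikibase_item", "Q42"),
    (3, "some_json_prop", '{"key": "value", "nested": {"n": 1}}'),
]
for page_id, prop, value in rows:
    print(page_id, prop, get_json_object(value, "$.nested.n"))
```

In HiveQL the equivalent would be something like
`SELECT get_json_object(pp_value, '$.nested.n') FROM wmf_raw.mediawiki_page_props`,
with the usual snapshot partition filter added.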
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l