List: wikitech-l
Subject: Re: [Wikitech-l] TechCom topics 2020-11-04 (fixed)
From: Adam Baso <abaso () wikimedia ! org>
Date: 2020-11-18 14:45:54
Message-ID: CAB74=NpR0yMJa35VLywsWP+2GsFy0eQO+-5C31pUOt4m6Xw3Zg () mail ! gmail ! com
Dan Andreescu <dandreescu@wikimedia.org> wrote:
>> Maybe something exists already in Hadoop
>
> The page properties table is already loaded into Hadoop on a monthly basis
> (wmf_raw.mediawiki_page_props). I haven't played with it much, but Hive
> also has JSON-parsing goodies, so give it a shot and let me know if you get
> stuck. In general, data from the databases can be sqooped into Hadoop. We
> do this for large pipelines like edit history
> <https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Edit_data_loading> and
> it's very easy
> <https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/sqoop.py#L505>
> to add a table. We're looking at just replicating the whole db on a more
> frequent basis, but we have to do some groundwork first to allow
> incremental updates (see Apache Iceberg if you're interested).
>
>
Yes, I like that and all of the other wmf_raw goodies! I'll follow up
off-thread about accessing the parser cache DBs (they're defined in site.pp
and db-eqiad.php, but I don't think refinery.util currently covers them,
since they're not in the .dblist files).
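For anyone following along: Hive's get_json_object() is one of the
"JSON-parsing goodies" mentioned above, useful when a pp_value happens to
hold JSON. A rough Python stand-in for its behaviour (the sample rows and
property names here are hypothetical, not real page_props contents) might
look like:

```python
import json

def get_json_object(value, path):
    """Rough stand-in for Hive's get_json_object(value, '$.a.b').

    Returns None when the value is not JSON or the path is missing,
    mirroring Hive's NULL behaviour. Only supports simple dotted
    paths, not the full JSONPath syntax Hive accepts.
    """
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    try:
        obj = json.loads(value)
    except (ValueError, TypeError):
        return None  # non-JSON pp_value, e.g. a plain string like "Q42"
    for part in path.lstrip("$").strip(".").split("."):
        if not part:
            continue  # path was just "$": return the whole parsed value
        if isinstance(obj, dict) and part in obj:
            obj = obj[part]
        else:
            return None  # missing key: Hive would return NULL here
    return obj

# Hypothetical (page_id, pp_propname, pp_value) rows for illustration:
rows = [
    (1, "wikibase_item", "Q42"),
    (3, "some_json_prop", '{"key": "value", "nested": {"n": 1}}'),
]
for page_id, prop, value in rows:
    print(page_id, prop, get_json_object(value, "$.nested.n"))
```

In HiveQL the equivalent would be something like
`SELECT get_json_object(pp_value, '$.nested.n') FROM wmf_raw.mediawiki_page_props`,
with the usual snapshot partition filter added.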
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l