
List:       lucene-dev
Subject:    [jira] [Updated] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's M
From:       "Mark Miller (JIRA)" <jira () apache ! org>
Date:       2013-08-31 21:32:56
Message-ID: JIRA.12431129.1248261840966.65443.1377984776111 () arcas


     [ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated SOLR-1301:
------------------------------

    Attachment: SOLR-1301.patch

Here is a patch with my current progress.

This is a Solr contrib module that can build Solr indexes in HDFS via MapReduce. It builds upon the Solr support for reading and writing to HDFS.
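
As a sketch of that underlying HDFS support (assuming the HdfsDirectoryFactory that shipped with Solr's HDFS integration; the host, port, and paths below are placeholders), solrconfig.xml can point an index at HDFS roughly like so:

```xml
<!-- Store the Lucene index in HDFS instead of the local filesystem.
     Placeholder values: adjust the namenode URI and Hadoop conf dir. -->
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
</directoryFactory>
```

With the index already living in HDFS, a MapReduce job that writes shards there and a running cluster that reads them can share the same filesystem, which is what makes the merge-into-a-live-cluster step below practical.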

It supports a GoLive feature that allows merging into a running cluster as the final step of the MapReduce job.

There is fairly comprehensive help documentation as part of the MapReduceIndexerTool.

For ETL, Morphlines from the open-source Cloudera CDK is used: https://github.com/cloudera/cdk/tree/master/cdk-morphlines This is the same ETL library that the Solr integration with Apache Flume uses.
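
For readers unfamiliar with Morphlines: a morphline is a HOCON config file describing a chain of transformation commands applied to each record. A minimal configuration of the sort an indexing job might consume could look like the sketch below (the command names readCSV, sanitizeUnknownSolrFields, and loadSolr come from the CDK morphlines libraries; the id, column names, and SOLR_LOCATOR wiring are hypothetical examples):

```
morphlines : [
  {
    id : csvToSolr
    importCommands : ["com.cloudera.**"]
    commands : [
      # Parse each input line as CSV into named record fields
      { readCSV { separator : ",", columns : [id, name, price], trim : true } }
      # Drop any fields the Solr schema does not know about
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }
      # Hand the record to Solr for indexing
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]
```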

What I have recently done: updated to the latest code, fixed the 5x requirement that solr.xml now be present, converted the Maven build to Ivy+Ant, updated license files, fixed validation errors, integrated the tests fully into the test framework, and got the tests passing.

All tests are passing with this patch for me, but there are still a variety of issues to address:

* Run against both YARN *and* MR1 - the Maven build would run the unit tests against YARN or MR1 depending on the profile chosen on the command line; this patch currently runs against YARN only.

* The MiniYarnCluster used for unit tests is hard-coded to use the 'current-working-dir'/target path. This is a bad location, and one the test security policy forbids. For the moment, I've relaxed the Lucene tests policy file to allow reads/writes anywhere - this needs to be addressed before committing.

* We depend on some Morphline commands that depend on Solr - this could cause us problems in the future, and I think we want to own the code for these commands in Solr.

* There are thread leaks in the tests that should be looked into - some may be unavoidable, as in other Hadoop tests (while we wait for fixes from the Hadoop project).

* We need to sync up with the latest code from the Maven version - there have been some changes since this code was extracted.

There are a number of new contributors to this issue whom I will be sure to enumerate in CHANGES.

I'll add whatever I'm forgetting in a later comment.
                
> Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
> ---------------------------------------------------------------------------------
> 
> Key: SOLR-1301
> URL: https://issues.apache.org/jira/browse/SOLR-1301
> Project: Solr
> Issue Type: New Feature
> Reporter: Andrzej Bialecki 
> Assignee: Mark Miller
> Fix For: 4.5, 5.0
> 
> Attachments: commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar,
> hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar,
> hadoop.patch, log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch,
> SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
> SOLR-1301.patch, SolrRecordWriter.java
> 
> This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold:
> * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat
> * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network.
>
> Design
> ----------
> Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning a Hadoop (key, value) pair into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to the EmbeddedSolrServer. When a reduce task completes and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer.
>
> The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken. This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-NNNNN directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard.
>
> An example application is provided that processes large CSV files and uses this API. It uses custom CSV processing to avoid (de)serialization overhead.
>
> This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue; you should put it in contrib/hadoop/lib.
>
> Note: the development of this patch was sponsored by an anonymous contributor and approved for release under the Apache License.
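
At its core, the SolrDocumentConverter step in the quoted design is just a (key, value) → document mapping. The following minimal, self-contained Java sketch illustrates the idea only - it is not the actual Solr API: the class name, the CSV handling, and the plain Map standing in for SolrInputDocument are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative stand-in for a SolrDocumentConverter implementation:
// maps one Hadoop (key, value) pair -- here a CSV line -- to named fields.
class CsvDocumentConverter {
    private final String[] fieldNames;

    CsvDocumentConverter(String[] fieldNames) {
        this.fieldNames = fieldNames;
    }

    /** Turn one (key, csvLine) pair into an ordered field map. */
    Map<String, String> convert(long key, String csvLine) {
        // -1 keeps trailing empty cells instead of dropping them
        String[] cells = csvLine.split(",", -1);
        Map<String, String> doc = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.length && i < cells.length; i++) {
            doc.put(fieldNames[i], cells[i].trim());
        }
        return doc;
    }

    public static void main(String[] args) {
        CsvDocumentConverter conv =
            new CsvDocumentConverter(new String[] {"id", "name", "price"});
        Map<String, String> doc = conv.convert(0L, "42,Widget,9.99");
        System.out.println(doc); // prints {id=42, name=Widget, price=9.99}
    }
}
```

In the real module, each reducer's SolrRecordWriter would batch such documents and feed them to its EmbeddedSolrServer, so the conversion itself stays trivial and stateless.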

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


