'[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's'


List:       solr-dev
Subject:    [jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's
From:       "Mark Miller (JIRA)" <jira () apache ! org>
Date:       2013-08-31 21:38:52
Message-ID: JIRA.12431129.1248261840966.65454.1377985132417 () arcas


    [ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13755613#comment-13755613 ]

Mark Miller commented on SOLR-1301:
-----------------------------------

Another thing I have not looked at: the final jar that is created in the dist for the
MapReduceIndexerTool. It likely still needs tweaking.

> Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
> ---------------------------------------------------------------------------------
> 
> Key: SOLR-1301
> URL: https://issues.apache.org/jira/browse/SOLR-1301
> Project: Solr
> Issue Type: New Feature
> Reporter: Andrzej Bialecki 
> Assignee: Mark Miller
> Fix For: 4.5, 5.0
> 
> Attachments: commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar,
> hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar,
> hadoop.patch, log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch,
> SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch,
> SOLR-1301.patch, SolrRecordWriter.java
> 
> This patch contains a contrib module that provides distributed indexing (using
> Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold:
> * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat
> * avoid unnecessary export and (de)serialization of data maintained on HDFS.
>   SolrOutputFormat consumes data produced by reduce tasks directly, without storing
>   it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing
>   task is split into as many parts as there are reducers, and the data to be indexed
>   is not sent over the network.
> 
> Design
> ----------
> Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in
> turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an
> EmbeddedSolrServer, and it also instantiates an implementation of
> SolrDocumentConverter, which is responsible for turning a Hadoop (key, value) pair
> into a SolrInputDocument. This data is then added to a batch, which is periodically
> submitted to EmbeddedSolrServer. When the reduce task completes and the OutputFormat
> is closed, SolrRecordWriter calls commit() and optimize() on the
> EmbeddedSolrServer.
> 
> The API provides facilities to specify an arbitrary existing solr.home directory,
> from which the conf/ and lib/ files will be taken.
> 
> This process results in the creation of as many partial Solr home directories as
> there were reduce tasks. The output shards are placed in the output directory on the
> default filesystem (e.g. HDFS). Such part-NNNNN directories can be used to run N
> shard servers. Additionally, users can specify the number of reduce tasks, in
> particular 1 reduce task, in which case the output will consist of a single shard.
> 
> An example application is provided that processes large CSV files and uses this
> API. It uses custom CSV processing to avoid (de)serialization overhead.
> 
> This patch relies on hadoop-core-0.19.1.jar. I attached the jar to this issue; you
> should put it in contrib/hadoop/lib.
> 
> Note: the development of this patch was sponsored by an anonymous contributor and
> approved for release under Apache License.
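
The write path described in the quoted Design section (convert each pair, collect
documents into a batch, submit the batch periodically, then commit and optimize on
close) can be sketched roughly as follows. This is only an illustration of the
batching pattern in plain Java, not the code from the patch: Converter, IndexSink,
BatchingWriter and RecordingSink are hypothetical stand-ins for
SolrDocumentConverter, EmbeddedSolrServer and SolrRecordWriter, and the batch size
of 2 is an arbitrary value chosen for the demo.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BatchingWriterSketch {

    /** Hypothetical stand-in for SolrDocumentConverter: (key, value) -> document fields. */
    interface Converter<K, V> {
        Map<String, Object> convert(K key, V value);
    }

    /** Hypothetical stand-in for the few server calls the writer needs. */
    interface IndexSink {
        void add(List<Map<String, Object>> docs);
        void commit();
        void optimize();
    }

    /** Buffers converted documents and submits them to the sink in batches. */
    static final class BatchingWriter<K, V> implements AutoCloseable {
        private final IndexSink sink;
        private final Converter<K, V> converter;
        private final int batchSize;
        private final List<Map<String, Object>> batch = new ArrayList<>();

        BatchingWriter(IndexSink sink, Converter<K, V> converter, int batchSize) {
            if (batchSize < 1) {
                throw new IllegalArgumentException("batchSize must be >= 1");
            }
            this.sink = sink;
            this.converter = converter;
            this.batchSize = batchSize;
        }

        /** Converts one pair and submits the batch once it is full. */
        void write(K key, V value) {
            batch.add(converter.convert(key, value));
            if (batch.size() >= batchSize) {
                flush();
            }
        }

        private void flush() {
            if (batch.isEmpty()) {
                return;
            }
            sink.add(new ArrayList<>(batch)); // hand over a copy, then reuse the buffer
            batch.clear();
        }

        /** Mirrors the end of a reduce task: flush the remainder, commit, optimize. */
        @Override
        public void close() {
            flush();
            sink.commit();
            sink.optimize();
        }
    }

    /** In-memory sink that only records which calls were made, for the demo. */
    static final class RecordingSink implements IndexSink {
        final List<String> calls = new ArrayList<>();

        @Override public void add(List<Map<String, Object>> docs) { calls.add("add:" + docs.size()); }
        @Override public void commit() { calls.add("commit"); }
        @Override public void optimize() { calls.add("optimize"); }
    }

    public static void main(String[] args) {
        RecordingSink sink = new RecordingSink();
        Converter<String, String> converter = (k, v) -> {
            Map<String, Object> doc = new LinkedHashMap<>();
            doc.put("id", k);
            doc.put("text", v);
            return doc;
        };
        try (BatchingWriter<String, String> writer = new BatchingWriter<>(sink, converter, 2)) {
            writer.write("1", "first");
            writer.write("2", "second");
            writer.write("3", "third");
        }
        System.out.println(sink.calls); // [add:2, add:1, commit, optimize]
    }
}
```

With three records and a batch size of 2, the sink sees one full batch, one
partial batch flushed on close, and then a single commit and optimize. That matches
the "one commit per reduce task" behaviour described above.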


