List: hadoop-commits
Subject: [Lucene-hadoop Wiki] Update of "Hbase/RDF" by udanax
From: Apache Wiki <wikidiffs () apache ! org>
Date: 2007-12-31 10:20:29
Message-ID: 20071231102029.24882.65209 () eos ! apache ! org
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for \
change notification.
The following page has been changed by udanax:
http://wiki.apache.org/lucene-hadoop/Hbase/RDF
The comment on the change is:
Stopped.
------------------------------------------------------------------------------
+ deleted
- [[TableOfContents(4)]]
- ----
- == HbaseRDF, a Planet-Scale RDF Data Store ==
- We have started to think about storing and querying RDF data in Hbase, but we will \
begin the implementation only after careful investigation.
-
- We introduce an Hbase subsystem for RDF, called HbaseRDF, which uses Hbase + \
!MapReduce to store RDF data and execute queries (e.g., SPARQL) on \
them.
- We can store very sparse RDF data in a single Hbase table, with as many columns \
as the data needs. For example, we might make a row for each RDF subject and \
store all of its properties and their values as columns in that row.
- This removes the costly self-joins otherwise needed for queries about a single \
subject, so those queries are processed efficiently, although we still need \
self-joins to answer RDF path queries.
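-
- As a minimal sketch of the row-per-subject layout described above, the Python below models the table as an in-memory dictionary. It does not use the Hbase API; the sample triples and the helper names `load_rows` and `titles_by` are hypothetical, and each property is assumed to hold a single value.

```python
from collections import defaultdict

# Hypothetical sample triples (subject, property, object); not taken from the wiki page.
TRIPLES = [
    ("book1", "author", "Fox, Joe"),
    ("book1", "copyright", "2001"),
    ("book1", "title", "CDs of the Century"),
    ("book2", "author", "Smith, Ann"),
    ("book2", "title", "Sparse Tables"),
]


def load_rows(triples):
    """Group triples into one sparse row per subject: {subject: {property: value}}."""
    rows = defaultdict(dict)
    for subject, prop, obj in triples:
        # Simplifying assumption: a later value for the same property overwrites an earlier one.
        rows[subject][prop] = obj
    return dict(rows)


def titles_by(rows, author, year):
    """Answer a same-subject query with one pass over the rows and no self-join."""
    return sorted(
        cols["title"]
        for cols in rows.values()
        if cols.get("author") == author
        and cols.get("copyright") == year
        and "title" in cols
    )


rows = load_rows(TRIPLES)
result = titles_by(rows, "Fox, Joe", "2001")
```

- Every condition in the query is checked against the columns of one row, which is why no join is needed. A row simply omits the properties its subject lacks, so sparse data costs nothing extra in this model.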
-
- We can further accelerate query performance by using !MapReduce for
- parallel, distributed query processing.
-
- === Related projects ===
-
- * [:Hbase/HbaseShell: Hbase Shell] provides a command line tool in which we can \
manipulate tables in Hbase. We are also planning to use !HbaseShell to manipulate and \
query RDF data stored in Hbase.
- * [http://www.openrdf.org/forum/mvnforum/viewthread?thread=1423 A forum at \
Aduna/Sesame] would be interested in working with this group.
-
- === Initial Contributors ===
-
- * [:udanax:Edward Yoon] (R&D center, NHN corp.)
- * [:InchulSong: Inchul Song] (Database Lab, KAIST)
-
- ----
- == Some Ideas ==
- When we store RDF data in a single Hbase table and process queries on them, an \
important issue we have to consider is how to efficiently perform costly self-joins \
needed to process RDF path queries.
-
- To speed up these costly self-joins, it is natural to think about using
- the !MapReduce framework we already have. However, in the Sawzall paper from \
Google, the authors say that the !MapReduce framework is
- poorly suited to performing table joins.
- Joins are possible, but while we read one table in the map
- or reduce functions, we have to read the other tables on the fly, which
- makes the join processing less parallel.
-
- Yang et al. from Yahoo address this problem in a SIGMOD 2007 paper.
- The paper proposes Map-Reduce-Merge, an extended version of the !MapReduce \
framework
- that implements several relational operators, including joins. It extends
- the !MapReduce framework with an additional Merge phase to process \
relationships between datasets efficiently.
- See the Papers section below for more information. -- Thanks stack.
- (Edward is now implementing join operators using the !MapReduce framework.)
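-
- To make the self-join concrete, here is a minimal sketch of a two-step RDF path query computed as a plain reduce-side join. It is neither Map-Reduce-Merge nor the Hadoop API: the map, shuffle, and reduce phases are simulated in one Python process, and the triples, property names, and function names are all hypothetical.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sample data for the path query: ?x advisor ?y . ?y worksFor ?z
TRIPLES = [
    ("alice", "advisor", "bob"),
    ("carol", "advisor", "bob"),
    ("dave", "advisor", "erin"),
    ("bob", "worksFor", "kaist"),
    ("erin", "worksFor", "nhn"),
    ("frank", "worksFor", "nhn"),
]


def map_phase(triples, first_prop, second_prop):
    """Emit (join key, tagged value) pairs; the join key is the shared middle node ?y."""
    for s, p, o in triples:
        if p == first_prop:
            yield o, ("left", s)   # the object of the first step is the join key
        if p == second_prop:
            yield s, ("right", o)  # the subject of the second step is the join key


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each join key, pair every left value with every right value."""
    for key, values in groups.items():
        lefts = [v for tag, v in values if tag == "left"]
        rights = [v for tag, v in values if tag == "right"]
        for x, z in product(lefts, rights):
            yield x, key, z


def path_join(triples, first_prop, second_prop):
    return sorted(reduce_phase(shuffle(map_phase(triples, first_prop, second_prop))))


paths = path_join(TRIPLES, "advisor", "worksFor")
```

- Both sides of the join are read from the same triple set and routed by key, so no reducer has to look up another table on the fly. The cost is that every matching triple passes through the shuffle.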
-
- The problem is that !MapReduce jobs start with an initial delay, caused by
- the time spent assigning the computations to multiple machines. For small queries, this
- delay can exceed the time needed to compute the answer, and so hurts query response time. \
The parallelism obtained from !MapReduce therefore pays off mainly for queries over huge \
amounts of RDF data, which take a long time to process.
- We might consider selective parallelism, where
- users decide whether or not to use !MapReduce to process their queries, as in
- "select ... '''in parallel'''".
-
- Now that we have two sets of join algorithms, non-parallel versions and parallel \
versions with !MapReduceMerge,
- we are ready to attempt massively parallel query processing on very large amounts of \
RDF data.
- Currently, C-Store shows the best reported query performance on RDF data (see the Papers section).
- However, we believe that, armed with Hbase and !MapReduceMerge, we can do even better.
- ----
- == Resources ==
- * http://www.w3.org/TR/rdf-sparql-query/ - The SPARQL RDF Query Language, a \
candidate recommendation of W3C as of 14 June 2007.
- * A test suite for SPARQL can be found at \
http://www.w3.org/2001/sw/DataAccess/tests/r2. The web page provides test RDF data, \
SPARQL queries, and expected results.
- * [https://jena.svn.sourceforge.net/svnroot/jena/ARQ/trunk/Grammar/sparql.jj \
SPARQL Grammar in JavaCC] - from Jena ARQ
- * [http://esw.w3.org/topic/LargeTripleStores Large triple stores]
- * [http://web.mit.edu/dna/www/abadirdf.pdf Scalable Semantic Web Data Management \
Using Vertical Partitioning] A good summary of techniques for storing RDF \
in an RDBMS.
-
- == Architecture Sketch ==
-
- === HbaseRDF Data Loader ===
- HbaseRDF Data Loader (HDL) reads RDF data from a file and organizes the data
- into an Hbase table in such a way that efficient query processing is possible. In \
Hbase, we can store everything in a single table.
- The sparsity of RDF data is not a problem, because Hbase, which is
- a column-oriented store and adopts various compression techniques,
- is very good at dealing with nulls in the table.
-
- === HbaseRDF Query Processor ===
- HbaseRDF Query Processor (HQP) executes RDF queries on RDF data stored in an Hbase \
table.
- It translates RDF queries into Hbase API calls or !MapReduce jobs, then gathers the \
results and returns them
- to the user.
-
- Query processing steps are as follows:
-
- {{{
- SPARQL query -> Parse tree -> Logical operator tree
- -> Physical operator tree -> Execution
- }}}
-
- The implementation of each step may proceed as an individual issue.
-
- === HbaseRDF Data Materializer ===
- HbaseRDF Data Materializer (HDM) pre-computes RDF path queries and stores the \
results
- in an Hbase table. Later, HQP uses the materialized data to process
- RDF path queries efficiently.
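-
- The materializer idea can be sketched as follows. This is an illustration only, not the HDM design: the table is an in-memory dictionary rather than an Hbase table, each property is assumed to be single-valued, and the sample rows, the path name `advisor/worksFor`, and the function `materialize_path` are hypothetical.

```python
# Hypothetical row-per-subject data: {subject: {property: value}}
ROWS = {
    "alice": {"advisor": "bob"},
    "bob": {"worksFor": "kaist"},
    "dave": {"advisor": "zed"},  # "zed" has no row, so this path is incomplete
}


def materialize_path(rows, first_prop, second_prop):
    """Precompute start -> end for the two-step path first_prop/second_prop."""
    table = {}
    for subject, cols in rows.items():
        middle = cols.get(first_prop)
        # Follow the second step only if the middle node has a row of its own.
        end = rows.get(middle, {}).get(second_prop)
        if end is not None:
            table[subject] = end
    return table


# Materialized results are stored under the path they answer.
materialized = {"advisor/worksFor": materialize_path(ROWS, "advisor", "worksFor")}

# At query time the path query becomes a lookup instead of a self-join.
answer = materialized["advisor/worksFor"].get("alice")
```

- The trade-off is the usual one for materialized results: lookups are cheap, but the stored paths must be recomputed or updated whenever the underlying rows change.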
-
- === Hbase Shell Extension ===
-
- {{{
- Hbase > rdf;
-
- Hbase RDF version 0.0.1
- Type 'help;' for help.
-
- Hbase.RDF > SELECT ?title
- --> FROM rdf_table
- --> WHERE { ?book author "Fox, Joe"
- --> ?book copyright "2001"
- --> ?book title ?title }
-
- results here.
-
- Hbase.RDF > exit;
-
- Hbase >
- }}}
- ----
- == Alternatives For RDF Storage ==
- * A triples table stores RDF triples in a single table with three attributes: \
subject, property, and object.
- * A property table puts properties that are frequently queried together into a single table \
to reduce costly self-joins. Used in Jena and Oracle.
- * A decomposed storage model (DSM) keeps one table for each property, sorted by \
subject. Used in C-Store.
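
- The three layouts can be compared on the same data with a minimal sketch. The Python below uses plain lists and dictionaries, not the actual schemas of Jena, Oracle, or C-Store; the sample triples and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample triples (subject, property, object).
TRIPLES = [
    ("book1", "author", "Fox, Joe"),
    ("book1", "title", "CDs of the Century"),
    ("book2", "author", "Smith, Ann"),
    ("book2", "publisher", "Acme"),
]


def triples_table(triples):
    """One table with three attributes; every extra condition on a subject needs a self-join."""
    return [{"subject": s, "property": p, "object": o} for s, p, o in triples]


def property_table(triples, clustered_props):
    """One wide row per subject for the chosen properties; the rest stay as leftover triples."""
    wide = defaultdict(lambda: dict.fromkeys(clustered_props))  # missing values are None
    leftover = []
    for s, p, o in triples:
        if p in clustered_props:
            wide[s][p] = o
        else:
            leftover.append((s, p, o))
    return dict(wide), leftover


def decomposed_storage(triples):
    """One two-column (subject, object) table per property, sorted by subject."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return {p: sorted(pairs) for p, pairs in tables.items()}


flat = triples_table(TRIPLES)
wide, leftover = property_table(TRIPLES, ("author", "title"))
dsm = decomposed_storage(TRIPLES)
```

- In this sketch the property table stores an explicit null for the missing title of `book2`, while the decomposed model stores nothing for absent values and creates one table per property.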
- ----
- == Papers ==
-
- * OSDI 2004, ''!MapReduce: Simplified Data Processing on Large Clusters'', \
proposes a very simple but powerful and highly parallel data processing \
technique.
- * CIDR 2007, ''[http://db.lcs.mit.edu/projects/cstore/abadicidr07.pdf \
Column-Stores For Wide and Sparse Data]'', discusses the benefits of using C-Store to \
store RDF and XML data.
- * VLDB 2007, ''[http://db.lcs.mit.edu/projects/cstore/abadirdf.pdf Scalable \
Semantic Web Data Management Using Vertical Partitioning]'', proposes an efficient \
method for storing RDF data in table projections (i.e., columns) and executing queries on \
them.
- * SIGMOD 2007, ''Map-Reduce-Merge: Simplified Relational Data Processing on Large \
Clusters'', extends !MapReduce with a Merge phase to implement several relational \
operators.
-