'[Lucene-hadoop Wiki] Update of "Hbase/RDF" by udanax'


List:       hadoop-commits
Subject:    [Lucene-hadoop Wiki] Update of "Hbase/RDF" by udanax
From:       Apache Wiki <wikidiffs () apache ! org>
Date:       2007-12-31 10:20:29
Message-ID: 20071231102029.24882.65209 () eos ! apache ! org

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by udanax:
http://wiki.apache.org/lucene-hadoop/Hbase/RDF

The comment on the change is:
Stopped.

------------------------------------------------------------------------------
+ deleted
- [[TableOfContents(4)]]
- ----
- == HbaseRDF, a Planet-Scale RDF Data Store ==
  
- We have started to think about storing and querying RDF data in Hbase, but we will begin the implementation only after careful investigation.
- 
- We introduce an Hbase subsystem for RDF, called HbaseRDF, which uses Hbase and !MapReduce to store RDF data and execute queries (e.g., SPARQL) on it.
- We can store very sparse RDF data in a single Hbase table, with as many columns as the data needs. For example, we might make a row for each RDF subject and store all of its properties and their values as columns in that row.
- This layout avoids costly self-joins for queries that ask about several properties of the same subject, so such queries can be processed efficiently. We still need self-joins to answer RDF path queries.
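The row-per-subject layout described above can be sketched in plain Python. This is only an in-memory illustration, not Hbase API code: the function names and the sample triples are made up for the example, and a nested dictionary stands in for the Hbase table.

```python
from collections import defaultdict


def build_subject_table(triples):
    """Group (subject, property, object) triples into one row per subject.

    Each row maps a property name (the "column") to a list of values.
    """
    table = defaultdict(lambda: defaultdict(list))
    for subject, prop, obj in triples:
        table[subject][prop].append(obj)
    return table


def subjects_matching(table, conditions):
    """Return the subjects whose row satisfies every (property, value) condition.

    All conditions are checked against a single row, so no self-join is needed.
    """
    return sorted(
        subject
        for subject, row in table.items()
        if all(value in row.get(prop, []) for prop, value in conditions)
    )


# Made-up sample data, for illustration only.
triples = [
    ("book1", "author", "Fox, Joe"),
    ("book1", "copyright", "2001"),
    ("book1", "title", "Sample Title"),
    ("book2", "author", "Fox, Joe"),
    ("book2", "copyright", "1999"),
]

table = build_subject_table(triples)
matches = subjects_matching(table, [("author", "Fox, Joe"), ("copyright", "2001")])
```

Both conditions concern the same subject, so the lookup touches one row at a time. A path query, which follows a value from one row to another subject's row, would still need a join.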
- 
- We can further accelerate query performance by using !MapReduce for 
- parallel, distributed query processing. 
- 
- === Related projects ===
- 
-  * [:Hbase/HbaseShell: Hbase Shell] provides a command line tool with which we can manipulate tables in Hbase. We are also planning to use !HbaseShell to manipulate and query RDF data stored in Hbase.
-  * [http://www.openrdf.org/forum/mvnforum/viewthread?thread=1423 A forum at Aduna/Sesame] would be interested in working with this group.
-  
- === Initial Contributors ===
- 
-  * [:udanax:Edward Yoon] (R&D center, NHN corp.)
-  * [:InchulSong: Inchul Song] (Database Lab, KAIST) 
- 
- ----
- == Some Ideas ==
- When we store RDF data in a single Hbase table and process queries on it, an important issue is how to efficiently perform the costly self-joins needed to process RDF path queries.
- 
- To speed up these costly self-joins, it is natural to think about using the !MapReduce framework we already have. However, the authors of the Sawzall paper from Google say that the !MapReduce framework is not well suited to performing table joins.
- A join is possible, but while we are reading one table in the map or reduce functions, we have to read the other tables on the fly, which makes the join processing less parallel.
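To make the self-join for a path query concrete, here is a minimal sketch of a reduce-side join written as map, shuffle, and reduce steps in plain Python. It runs in a single process and does not use the Hadoop API or Map-Reduce-Merge; the function names and the sample triples are assumptions made for the example.

```python
from collections import defaultdict
from itertools import product


def map_phase(triples, first_prop, second_prop):
    """Emit (join key, tagged value) pairs for a two-step path.

    For ?x first_prop ?y . ?y second_prop ?z the join key is ?y.
    """
    for subject, prop, obj in triples:
        if prop == first_prop:
            yield obj, ("left", subject)
        if prop == second_prop:
            yield subject, ("right", obj)


def shuffle(pairs):
    """Group the mapped values by join key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine every left value with every right value that shares a key."""
    for key, values in groups.items():
        lefts = [v for tag, v in values if tag == "left"]
        rights = [v for tag, v in values if tag == "right"]
        for start, end in product(lefts, rights):
            yield start, key, end


def path_join(triples, first_prop, second_prop):
    pairs = map_phase(triples, first_prop, second_prop)
    return sorted(reduce_phase(shuffle(pairs)))


# Made-up sample data, for illustration only.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "worksAt", "acme"),
    ("carol", "knows", "dave"),
]

paths = path_join(triples, "knows", "worksAt")
```

Because both sides of the join come from the same input, one pass over the table feeds the map step. Joining two different tables this way would require tagging records by source table in the same manner.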
- 
- There is a paper on this subject by Yang et al. from Yahoo (SIGMOD 2007).
- The paper presents Map-Reduce-Merge, an extended version of the !MapReduce framework that implements several relational operators, including joins. It adds a Merge phase to the !MapReduce framework to process relationships between data sets efficiently.
- See the Papers section below for more information. -- Thanks stack.
- (Edward is now implementing join operators using the !MapReduce framework.)
- 
- The problem is that there is an initial delay in executing !MapReduce jobs, due to the time spent assigning the computations to multiple machines. This delay can be far longer than the query itself needs, which hurts query response time. So the parallelism obtained from !MapReduce pays off mainly for queries over huge amounts of RDF data, which take a long time to process.
- We might consider selective parallelism, where people can decide whether or not to use !MapReduce to process their queries, as in "select ... '''in parallel'''".
- 
- Now that we have two sets of join algorithms, non-parallel versions and parallel versions with !MapReduceMerge, we are ready to do some massively parallel query processing on tremendous amounts of RDF data.
- Currently, C-Store shows the best query performance on RDF data.
- However, we believe that we, armed with Hbase and !MapReduceMerge, can do even better.
- ----
- == Resources ==
-  * http://www.w3.org/TR/rdf-sparql-query/ - The SPARQL RDF Query Language, a W3C candidate recommendation as of 14 June 2007.
-  * A test suite for SPARQL can be found at http://www.w3.org/2001/sw/DataAccess/tests/r2. The web page provides test RDF data, SPARQL queries, and expected results.
-  * [https://jena.svn.sourceforge.net/svnroot/jena/ARQ/trunk/Grammar/sparql.jj SPARQL Grammar in JavaCC] - from Jena ARQ
-  * [http://esw.w3.org/topic/LargeTripleStores Large triple stores]
-  * [http://web.mit.edu/dna/www/abadirdf.pdf Scalable Semantic Web Data Management Using Vertical Partitioning] - A good summary of techniques for storing RDF in an RDBMS.
- 
- == Architecture Sketch ==
- 
- === HbaseRDF Data Loader ===
- The HbaseRDF Data Loader (HDL) reads RDF data from a file and organizes the data into an Hbase table in such a way that efficient query processing is possible. In Hbase, we can store everything in a single table.
- The sparsity of RDF data is not a problem, because Hbase, which is a column-based store and adopts various compression techniques, is very good at dealing with nulls in the table.
- 
- === HbaseRDF Query Processor ===
- The HbaseRDF Query Processor (HQP) executes RDF queries on RDF data stored in an Hbase table.
- It translates RDF queries into Hbase API calls or !MapReduce jobs, then gathers the results and returns them to the user.
- 
- Query processing steps are as follows:
- 
- {{{
- SPARQL query -> Parse tree -> Logical operator tree 
- -> Physical operator tree -> Execution
- }}}
- 
- Implementation of each step may proceed as a separate issue.
- 
- === HbaseRDF Data Materializer ===
- The HbaseRDF Data Materializer (HDM) pre-computes RDF path queries and stores the results in an Hbase table. Later, HQP uses the materialized data to process RDF path queries efficiently.
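The idea of pre-computing a path query and looking it up later can be sketched as follows. This is an in-memory illustration under assumed names: `materialize_path` and `lookup` are hypothetical helpers, a dictionary stands in for the Hbase table that would hold the materialized results, and the sample triples are made up.

```python
def materialize_path(triples, first_prop, second_prop):
    """Pre-compute, for each start subject, the ends of the two-step path
    start --first_prop--> middle --second_prop--> end.
    """
    # Index the second hop by its subject so the first hop can be extended.
    second_hop = {}
    for subject, prop, obj in triples:
        if prop == second_prop:
            second_hop.setdefault(subject, []).append(obj)

    materialized = {}
    for subject, prop, obj in triples:
        if prop == first_prop:
            for end in second_hop.get(obj, []):
                materialized.setdefault(subject, []).append(end)
    return materialized


def lookup(materialized, start):
    """Answer the path query for one start subject without any join."""
    return materialized.get(start, [])


# Made-up sample data, for illustration only.
triples = [
    ("paper1", "cites", "paper2"),
    ("paper2", "author", "Kim"),
    ("paper1", "cites", "paper3"),
    ("paper3", "author", "Lee"),
]

cited_authors = materialize_path(triples, "cites", "author")
```

The join cost is paid once at materialization time. The trade-off, not addressed in this sketch, is that the materialized table has to be refreshed when the underlying RDF data changes.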
- 
- === Hbase Shell Extension ===
- 
- {{{
- Hbase > rdf;
- 
- Hbase RDF version 0.0.1
- Type 'help;' for help.
- 
- Hbase.RDF > SELECT ?title
-         --> FROM rdf_table
-         --> WHERE { ?book author "Fox, Joe"
-         -->         ?book copyright "2001"
-         -->         ?book title ?title }
- 
- results here.
- 
- Hbase.RDF > exit;
- 
- Hbase > 
- }}}
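The WHERE clause in the shell session above is a set of triple patterns. A minimal sketch of how such patterns could be evaluated is shown here in plain Python. It skips parsing and the operator trees, works over an in-memory list of triples rather than an Hbase table, and treats any term starting with `?` as a variable. The function names and sample triples are assumptions made for the example.

```python
def match_pattern(pattern, triple, binding):
    """Try to extend a variable binding so that pattern matches triple.

    Returns the extended binding, or None if the triple does not match.
    """
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or check it against its earlier binding.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def evaluate(patterns, triples):
    """Evaluate the patterns one by one, carrying the bindings forward."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Made-up sample data, for illustration only.
triples = [
    ("book1", "author", "Fox, Joe"),
    ("book1", "copyright", "2001"),
    ("book1", "title", "Sample Title"),
    ("book2", "author", "Fox, Joe"),
    ("book2", "copyright", "1999"),
    ("book2", "title", "Other Title"),
]

patterns = [
    ("?book", "author", "Fox, Joe"),
    ("?book", "copyright", "2001"),
    ("?book", "title", "?title"),
]

titles = [binding["?title"] for binding in evaluate(patterns, triples)]
```

This nested-loop evaluation scans every triple for every pattern. With the row-per-subject layout, all three patterns share `?book`, so they could be checked against a single row instead.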
- ----
- == Alternatives For RDF Storage ==
-  * A triples table stores RDF triples in a single table with three attributes: subject, property, and object.
-  * A property table puts properties that are frequently queried together into a single table to reduce costly self-joins. Used in Jena and Oracle.
-  * A decomposed storage model (DSM) uses one table for each property, sorted by subject. Used in C-Store.
- ----
- == Papers ==
- 
-  * OSDI 2004, ''!MapReduce: Simplified Data Processing on Large Clusters'', proposes a very simple but powerful and highly parallel data processing technique.
-  * CIDR 2007, ''[http://db.lcs.mit.edu/projects/cstore/abadicidr07.pdf Column-Stores For Wide and Sparse Data]'', discusses the benefits of using C-Store to store RDF and XML data.
-  * VLDB 2007, ''[http://db.lcs.mit.edu/projects/cstore/abadirdf.pdf Scalable Semantic Web Data Management Using Vertical Partitioning]'', proposes an efficient method to store RDF data in table projections (i.e., columns) and execute queries on them.
-  * SIGMOD 2007, ''Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters'', describes a !MapReduce-based implementation of several relational operators.
- 

