
List:       hadoop-user
Subject:    Re: Which [open-souce] SQL engine atop Hadoop?
From:       Daniel Haviv <danielrulez () gmail ! com>
Date:       2015-01-27 9:13:29
Message-ID: DE9BDB62-9273-495B-917C-CA6ACF95A12B () gmail ! com

Can you elaborate on why you prefer Tajo?

Daniel

> On 27 Jan 2015, at 10:35, Azuryy Yu <azuryyyu@gmail.com> wrote:
> 
> You have listed almost all of the open-source MPP real-time SQL-on-Hadoop engines.
> 
> I prefer Tajo, which recently released version 0.9.0 and is still a work in
> progress toward 1.0.
> 
> > On Mon, Jan 26, 2015 at 10:19 PM, Samuel Marks <samuelmarks@gmail.com> wrote:
> > Since Hadoop came out, there have been various commercial and/or open-source
> > attempts to expose some compatibility with SQL.
> > 
> > I am seeking one which is good for low-latency querying and supports the most
> > common CRUD operations, including [the basics!] along these lines: CREATE TABLE,
> > INSERT INTO, SELECT * FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and
> > DROP TABLE.
> > 
> > I will be using them from Python; there does seem to be a Python JDBC wrapper.
> > Additionally, it needs to scale for both big and small data (starting on a
> > single-node "cluster").
> > 
> > Here is what I've found thus far:
> > 
> > Apache Hive (SQL-like, with interactive SQL thanks to the Stinger initiative)
> > Apache Drill (ANSI SQL support)
> > Apache Spark (Spark SQL, queries only, add data via Hive, RDD or Parquet)
> > Apache Phoenix (built atop Apache HBase, lacks full transaction support,
> >   relational operators and some built-in functions)
> > Presto from Facebook (can query Hive, Cassandra, relational DBs, etc. Doesn't
> >   seem to be designed for low-latency responses across small clusters, or to
> >   support UPDATE operations. It is optimized for data warehousing or analytics¹)
> > SQL-Hadoop via MapR community edition (seems to be a packaging of Hive,
> >   HP Vertica, SparkSQL, Drill and a native ODBC wrapper)
> > Apache Kylin from eBay (provides an SQL interface and multi-dimensional analysis
> >   [OLAP], "… offers ANSI SQL on Hadoop and supports most ANSI SQL query
> >   functions". It depends on HDFS, MapReduce, Hive and HBase, and seems targeted
> >   at very large data-sets, though it maintains low query latency)
> > Apache Tajo (ANSI/ISO SQL standard compliance with JDBC driver support
> >   [benchmarks against Hive and Impala])
> > Cascading's Lingual² ("Lingual provides JDBC Drivers, a SQL command shell, and a
> >   catalog manager for publishing files [or any resource] as schemas and tables.")
> > 
> > ¹ http://prestodb.io/docs/current/overview/use-cases.html
> > ² http://docs.cascading.org/lingual/1.0/#sql-support
> > 
> > Which, from this list or elsewhere, would you recommend, and why?
> > 
> > Thanks for all suggestions,
> > 
> > Samuel Marks
> > http://linkedin.com/in/samuelmarks
> 
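
For reference, the CRUD baseline Samuel lists can be pinned down as a small smoke test against any Python DB-API 2.0 connection. The sketch below is only an illustration: it uses the standard-library sqlite3 module as a stand-in (it is not one of the Hadoop engines), and the function name `crud_smoke_test`, the table `t`, and its columns are hypothetical. Pointing the same function at one of the engines above through a JDBC wrapper assumes that wrapper exposes a DB-API connection and that the engine accepts these statements, which several of the entries in the list reportedly do not (UPDATE in particular).

```python
import sqlite3


def crud_smoke_test(conn):
    """Run CREATE / INSERT / SELECT / UPDATE / DELETE / DROP on a DB-API connection.

    Returns the rows seen after the inserts, after the update, and after the delete.
    Table and column names are illustrative only.
    """
    cur = conn.cursor()
    cur.execute("CREATE TABLE t (id INTEGER, c1 INTEGER)")
    cur.execute("INSERT INTO t VALUES (1, 1)")
    cur.execute("INSERT INTO t VALUES (2, 1)")

    cur.execute("SELECT * FROM t ORDER BY id")
    after_insert = cur.fetchall()

    # Same shape as the UPDATE in the requirements: SET C1=2 WHERE <predicate>
    cur.execute("UPDATE t SET c1 = 2 WHERE id = 1")
    cur.execute("SELECT * FROM t ORDER BY id")
    after_update = cur.fetchall()

    cur.execute("DELETE FROM t WHERE id = 2")
    cur.execute("SELECT * FROM t ORDER BY id")
    after_delete = cur.fetchall()

    cur.execute("DROP TABLE t")
    conn.commit()
    return after_insert, after_update, after_delete


if __name__ == "__main__":
    # In-memory SQLite is a stand-in so the sketch runs anywhere.
    conn = sqlite3.connect(":memory:")
    print(crud_smoke_test(conn))
    conn.close()
```

An engine that rejects any of the six statements would raise from the corresponding `execute` call, which makes this a quick way to see where a candidate falls short of the stated baseline.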




