
List:       sqlite-users
Subject:    Re: [sqlite] SQLite vs. Oracle (parallelized)
From:       Allan Edwards <wallanedwards () gmail ! com>
Date:       2009-02-27 20:57:39
Message-ID: 7dad920902271257n2f15dbc1j94a2384f3ec83c3d () mail ! gmail ! com

Here is the reality.... distribute your data processing across as many
hard drives as you can.  The key is the number of drives you have
working in parallel; it does NOT matter which database you use.  If
you build your broker (the server process that talks to SQLite or
another database and controls the entire parallel processing)
properly, you can add or remove hard drives with a simple
configuration change.  Depending on how much data load you have, you
could also subnet banks of servers that run your SQLite data storage
to provide optimal network communication to your SQLite data
processing nodes.
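A minimal sketch of that broker idea (the paths and the hash-based routing are my assumptions, not a prescribed design): map each record key to a drive from a config list, so adding or removing a drive is just a config edit.

```python
import hashlib

# Hypothetical config: one SQLite file per physical drive.
# Adding or removing a drive is just an edit to this list.
SHARD_PATHS = [
    "/mnt/drive0/data.db",
    "/mnt/drive1/data.db",
    "/mnt/drive2/data.db",
]

def shard_for(key: str) -> str:
    """Route a record key to one shard file via a stable hash."""
    digest = hashlib.md5(key.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(SHARD_PATHS)
    return SHARD_PATHS[index]
```

A real broker would also rebalance existing data when the list changes; this only shows that the routing itself can live entirely in configuration.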

Trick question....

You set up 2 separate machines, SQLite on one and Oracle on the
other.  Both have the exact same make and model of hard drive and the
same OS.  Which one can write data to the hard drive faster?
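One way to measure the SQLite half of that question yourself (a rough sketch; the row count and single-transaction batching are assumptions you would tune for your own hardware):

```python
import os
import sqlite3
import time

DB = "bench.db"
if os.path.exists(DB):
    os.remove(DB)

conn = sqlite3.connect(DB)
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")

start = time.perf_counter()
with conn:  # one transaction: per-insert commits would dominate the timing
    conn.executemany(
        "INSERT INTO t (payload) VALUES (?)",
        (("row %d" % i,) for i in range(100_000)),
    )
elapsed = time.perf_counter() - start
print("inserted 100000 rows in %.2fs" % elapsed)
conn.close()
```

Run the equivalent batch insert against Oracle on the identical box and compare; the point of the trick question is that the drive, not the engine, sets the ceiling.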

....................................

I think you want to look into parallel data processing algorithms.  Of
course, to me that kind of thing is common sense, but I find most
people don't truly study their processor specifications as they
should, and therefore miss the low-level details that would inform
optimal decisions in their parallel algorithm design.

For example... most people say "I want to multi thread so my software
runs faster."  That is not properly said.  You multi thread to run at
the same processor speed but claim more of the processor bandwidth per
time slice.  So, for example, if your system has 4 threads to schedule
and each gets an equal 25% of the processing time, all 4 threads share
1 minute, which works out to 15 seconds per thread.  If your program
runs 2 of those threads to process data in that same minute, you have
doubled your throughput by claiming 50% of the schedule instead of
25%.  You did not speed up the computer; you claimed more processor
time per round-robin pass.
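That time-slice arithmetic can be written out directly (the 1-minute window and the 4-thread schedule are just the numbers from the example above):

```python
# Worked version of the time-slice example: a fixed scheduling window
# shared round-robin by a fixed number of runnable threads.
slice_total_s = 60          # one scheduling window: 1 minute
threads_scheduled = 4       # runnable threads sharing the CPU
per_thread_s = slice_total_s / threads_scheduled   # 15 s per thread

# A single-threaded program claims one slot; a 2-thread program claims two.
share_one_thread = 1 / threads_scheduled           # 25% of the window
share_two_threads = 2 / threads_scheduled          # 50% of the window
speedup = share_two_threads / share_one_thread
print(per_thread_s, speedup)  # → 15.0 2.0
```

The CPU never got faster; the program simply occupies twice as many slots in the same window.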

So the bottom line is: when you design your parallel algorithm, the
database itself is not all that important, but being able to configure
as many instances as you need quickly IS.  Knowing the speed of your
drives, knowing the speed of your network, and configuring and setting
up to scale out to more nodes (network, disks, computers) IS the key.
And of course, your broker architecture (the piece that controls the
show collectively) must be written and carefully tested so it loses as
little efficiency system-wide as possible while parceling work out to
the various processing nodes.

IF you utilize SQLite for such an endeavor, I do believe you would be
better off.  Why?  Because SQLite is so simple and has NO server
process, you control the software performance totally, above the data
storage below.  So in essence, in this way, SQLite IS a great choice.
You can xcopy the db to a node and roll!  With database "servers" you
will have to contend with the variability of a closed source base you
have no control over, one that takes up threads and processor time on
the machines on which it runs.  Less IS FOR SURE sometimes more.
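The "xcopy the db and roll" deployment really can be that literal, because a SQLite database is a single ordinary file.  A sketch (the temp-directory paths stand in for a master copy and a node's local drive):

```python
import os
import shutil
import sqlite3
import tempfile

# Stand-in paths (assumptions): in real life these would be a master
# copy and a processing node's local drive.
work = tempfile.mkdtemp()
master = os.path.join(work, "master.db")
node_copy = os.path.join(work, "node3.db")

# Build a tiny "master" database.
conn = sqlite3.connect(master)
conn.execute("CREATE TABLE facts (id INTEGER PRIMARY KEY, v TEXT)")
with conn:
    conn.execute("INSERT INTO facts (v) VALUES ('hello')")
conn.close()

# "Deployment" is a plain file copy: SQLite is one ordinary file.
shutil.copy(master, node_copy)

# The node opens its private copy immediately; no server to install.
conn = sqlite3.connect(node_copy)
rows = conn.execute("SELECT count(*) FROM facts").fetchone()[0]
conn.close()
print(rows)
```

Nothing to install on the node, no server process to manage; the copy is the whole deployment step.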

I have seen systems in the past that used Oracle, Informatica, etc. to
do massive data processing... I mean hundreds of millions of rows per
week.  The reality is that most of the teams that build this stuff buy
a massively expensive RISC-based system, when they could spend the
money on developing software that would "efficiently" scale a simpler
solution like SQLite across many nodes.  The money is spent mostly on
butt covering....  The technical decision was driven more by an
ignorant salesperson than by an in-the-know technology guy.

-A-


On Fri, Feb 27, 2009 at 2:35 PM,  <python@bdurham.com> wrote:
> Allan,
> 
> Thanks for your reply. I'm new to SQLite, but have used a similar list
> of databases (with an emphasis on server vs. client side databases)
> professionally for years. My background is designing and building
> enterprise BI, ETL, and data warehouse systems using databases like
> Oracle, DB2, SQL Server and ETL tools like Informatica, Ab Initio, BO
> Data Integrator, etc.
> 
> My goal is to be able to offer our customers cost effective, high
> performance alternatives to traditional commercial ETL solutions. We
> recently re-implemented a mid-sized Oracle/Informatica based ETL system
> in Python using in memory data structures (dicts) and improved the
> performance by a factor of 20x. We're now looking to expand this
> approach to SQLite for environments where the volume of data is greater
> than available memory.
> 
> > When you talk about performance comparisons your question really needs
> > to be squared up to what the database is used for. Are you on a server,
> > workstation, or embedded? How large is the database? How are the
> > internal algos of the database engine lined up? What is the memory
> > footprint? How did you configure what is configurable in the database?
> 
> I'm interested in exploring whether or not SQLite can be used as an ETL
> tool with large data sets (~80+ Gb). In this context, SQLite would be
> run on 64-bit Intel servers with lots of RAM (16-64 Gb). The data would
> be stored/processed on server local SCSI drives vs. located on a SAN.
> File access would be via a single process per SQLite database. The
> interface language would most likely be Python.
> 
> > The reality is I BET and I would love to know myself Sqlite is just as
> > fast on inserts to the same hard drive as oracle.
> 
> I would love to see such a benchmark as well.
> 
> > Are you some kind of political Oracle covering bigot?
> 
> LOL. No. I try to be as database agnostic as possible.
> 
> > Or, do you want another chance to post something that states you are
> > trying to find the best data storage solution to deliver sincere value
> > to your client in terms of a database choice?
> 
> Subtle point here ... I'm not looking for a long term data storage
> solution - I'm exploring the possibility of using SQLite as a way to
> validate, transform, and pre-aggregate raw data that would in turn be
> exported to raw text files and imported (in final form) to a customer's
> data warehouse.
> 
> > Sometimes that choice is Sqlite, but in some cases it is Oracle
> > instead.
> 
> Agreed.
> 
> Regards,
> Malcolm
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> 



-- 
W Allan Edwards
214-289-2959

