List:       sqlite-users
Subject:    Re: [sqlite] transparent compression implementations for SQLite?
From:       Simon Slavin <slavins@bigfraud.org>
Date:       2011-12-31 21:03:19
Message-ID: 68AB554E-BB88-4FC4-ABE9-E9680A6F26B4@bigfraud.org


On 31 Dec 2011, at 8:37pm, Ivan Shmakov wrote:

>>>>>> Simon Slavin <slavins@bigfraud.org> writes:
>>>>>> On 31 Dec 2011, at 4:56pm, Ivan Shmakov wrote:
>

>>> The integers could be up to 32 bits long, but I deem them likely
>>> to “cluster”, like, e.g.: 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 101, 101,
>>> 102, 102, 102, 103, 103, 768, 768, etc.  My guess is that such
>>> sequences should be quite compressible, but the point is that
>>> there'll only be a few such numbers per row, and millions of such
>>> rows in the table.
>

>> Thing is, an integer like 103 takes two bytes to store,
>

> 	Those were given as examples only.  The real values are going to
> 	use up to 28 bits (roughly, 4 octets) or so.
>

>> and one of those bytes is the byte that indicates what kind of value
>> is being stored.  So it's really only taking one byte per value.
>

> 	Is it?  I mean, if I know for sure that I'd only ever need, say,
> 	eight 32-bit integers per row, and pack them into a single BLOB
> 	instead of having them stored as proper INTEGERs, won't I be
> 	able to save a few bytes per row?  (For a performance penalty
> 	when accessing these fields.  Also, I'd have to forget about
> 	indexes on these, other than a UNIQUE one on the whole
> 	BLOB-coded tuple, I guess.)
>

> 	(With a few functions to unpack these values back out of BLOB's,
> 	I'd even be able to CREATE a suitable VIEW.  Or maybe a VIRTUAL
> 	TABLE.)

So yes, it depends on whether you need to do searches or sorting on
these columns.  If you don't care about the individual fields then all
you need is a function which turns eight integers into the shortest
BLOB it can.
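Something like this, perhaps (an untested Python sketch; it packs
fixed-width for simplicity, so a varint encoding would get closer to
"shortest", and the table and column names are made up):

import sqlite3
import struct

def pack8(values):
    # Eight unsigned 32-bit integers -> one 32-byte little-endian BLOB.
    return struct.pack("<8I", *values)

def unpack8(blob):
    # And back again: 32-byte BLOB -> tuple of eight integers.
    return struct.unpack("<8I", blob)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (tuple8 BLOB UNIQUE)")  # made-up schema
db.execute("INSERT INTO t VALUES (?)",
           (pack8((1, 1, 2, 3, 101, 102, 103, 768)),))
print(unpack8(db.execute("SELECT tuple8 FROM t").fetchone()[0]))

The UNIQUE constraint there is the whole-tuple index you mentioned;
anything finer-grained means unpacking inside your queries.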

>> And you can't improve on this by compressing individual values, only
>> by compressing at the table level.  And if you're compressing at
>> table level, making any changes is rather slow.
>

> 	For now, I'm interested mostly in INSERTs.  (And SELECTs, of
> 	course.)

Well, you obviously realise your trade-offs here.  You can compress at
the field level with minimal trade-off.  You can compress at the row
level, which slows down changes and queries, but not much.  You can
compress at the column or table levels, which slow down changes a
great deal.  And because of the way SQLite works internally you can
compress at the page level.
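Field-level compression, for instance, is just a codec wrapped around
each value on the way in and out (another Python sketch, zlib picked
arbitrarily; note that for very short values the codec's own header
can make them larger, so measure on your real data):

import sqlite3
import zlib

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (data BLOB)")

# Repetitive, clustered data like your example should compress well.
payload = b"1,1,1,1,2,2,2,3,3,3,101,101,102,102,103,103,768,768" * 20
db.execute("INSERT INTO t VALUES (?)", (zlib.compress(payload),))

stored = db.execute("SELECT data FROM t").fetchone()[0]
assert zlib.decompress(stored) == payload
print(len(payload), "->", len(stored), "bytes")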

So you're really trying to figure out which trade-off is best for you.
It all depends on how big your databases are and whether they live on
just one machine or need to be sent places frequently.  But disk space
is so cheap these days it's cheaper for me to go out and buy a
Terabyte disk than to spend a week writing clever compression code.

>>> [snip] is there a way to determine the filesystem space occupied by
>>> a particular table, or index, in SQLite?  It now seems to me that
>>> the problem is different to what I'd initially presumed.
>

>> Not inside SQLite's API functions.  You can do it by reading the file
>> format and multiplying the page size by how many pages are used.
>

> 	ACK, thanks.  Do I understand it correctly that the
> 	sqlite3_analyzer (suggested elsewhere in this thread) does this
> 	very thing?

Yes.  It prints out lots of information about your database including
what you asked for.  But it's a stand-alone app, not part of the
SQLite API.
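(If the figure for the whole database file is enough, as opposed to a
single table or index, you can get it from inside SQLite with two
pragmas; Python sketch, filename made up:)

import sqlite3

db = sqlite3.connect("mydata.db")  # made-up filename
page_size = db.execute("PRAGMA page_size").fetchone()[0]
page_count = db.execute("PRAGMA page_count").fetchone()[0]
print("database file:", page_size * page_count, "bytes")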

Simon.

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users