'Re: KSpread data model (~format storage)'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       koffice-devel
Subject:    Re: KSpread data model (~format storage)
From:       Sébastien_de_Menten_de_Horne <sdementen () skynet ! be>
Date:       2005-05-17 17:50:53
Message-ID: 200505171950.53407.sdementen () skynet ! be
[Download RAW message or body]

On Tuesday 17 May 2005 09:32, Tomas Mecir wrote:
> Hi there !
>
> On 5/17/05, Sébastien de Menten de Horne <sdementen@skynet.be> wrote:
> > Hi,
> > Well, after this small introduction presenting my motivation, let's start
> > with my point: kspread internal data mode (or format storage?)
> > I read with interest the kspread/DESIGN.html in order to understand the
> > internals of kspread with still a certain level of abstraction (BTW, is
> > there more documentation of this kind for kspread  (ie at the same level
> > of abstraction)? )
>
> Hm, there isn't really much documentation ... And that which exists is
> at lower abstraction level (describes classes and methods). Also, the
> design.html describes mostly planned features - ie. the new storage or
> manipulators are not in yet.
>

Too sad to miss a "big picture" document for kspread. This document could be 
great for FOSDEM or other events with "coding sprint" or even "design sprint"
Are there other threads (like the "KSpread Format Storage - the next 
generation" one) I am not aware that discuss kspread design ?

> > In fact, if I take the example for form storage in DESIGN:
> >   Range    | Formatting Piece
> >   Column B | Bold on
> >   Row 2    | Italics on
> >   A1:C5    | Yellow background
>
> I personally wasn't even quite sure whether this would work well
> enough, although Ariya was pretty confident that it would - as I've
> said, it's all in the thoughts phase as of now.

I share Ariya confidence :-)
>
> > I think it can be conceptually extended to data as
> >   Range    | Content/Content
> >   Column B | 1
> >   Row 2    | "hello world"
> >   A1:C5    | data_block[0]
> >   D1:D5    | =sum(R[-3,0]:R[-1,0])
> >
> > Now, I explain the special meaning of data_block[0] and
> > =sum(R[-3,0]:R[-1,0]). * data_block[0] is a block of data of size 5x3
> > stored in memory at data_block[0]. This has the advantage of a very
> > efficient representation as all element in data_block[0] shares the same
> > type (double, float, string,...) * =sum(R[-3,0]:R[-1,0]) is a formula
> > expressed in relative position where R[-3,0] means Relative cell with
> > offset -3 in column and offset 0 in row. Again the representation is
> > terse. It is also possible to specify absolute cells with a A[1,1]:A[4,1]
> > notation.
>
> Hmmm ... Well as far as I can see this, there are two different scenarios:
> 1. the cells share data
> 2. the cells share a formula
> As for first case, I am rather sceptical. I mean, how many
> spreadsheets have exactly the same values in a whole block of cells ?
> Probably not many ... Your statistical data, for instance, would hold
> a different value in each cell, right ? Hence this approach would lead
> to memory usage being even higher than currently, due to all the extra
> structures.

Well,
> >   Column B | 1
> >   Row 2    | "hello world"
examples are more for "fun". I don't see a real use of those cases but I 
showed them to illustrate the generality of the solution.
However,
> >   A1:C5    | data_block[0]
example is more interesting in my point of view. Just to be really clear about 
the meaning of this syntax, I explain it again. This structure says that 
value in range A1:C5 are mapped to values in data_block[0] array, i.e.
A1 = data_block[0][0][0]
A2 = data_block[0][1][0]
A3 = data_block[0][2][0]
...
B1 = data_block[0][0][1]
B2 = data_block[0][0][2]
...
C5 = data_block[0][2][4]
The first [0] index is the index in big vector of data_block that stores all 
the data of a file or worksheet. Well, in my discussion, I think it added 
noise, sorry :-(

So, for instance, statistical data would be stored in one big double** array. 
The advantage is that data type is the same for all elements in the same 
data_block (memory saving and cpu saving as do not need type check on each 
element => this is what I mean by parallel processing like operations)

> The second case is more interesting, though. I can well imagine the
> same formula being stored in hundreds of cells, and then several
> interesting things could be done to speed things up. But then, having
> a depth tree to store this doesn't sound like the best idea ... It
> might be better to simply have a container for all the formulas in a
> sheet (or in a document, but I'd prefer per-sheet things), keeping
> only some sort of index in the cell itself. Interesting ... Although
> different from your idea, but oh well ;)
> What do you think ?

The depth tree (implemented as a quad-tree or any other efficient algorithm) 
approach sound like a very nice way to have a lot of structured knowledge 
about the data in the worksheet.

>
> > The benefit of this approach are multiples:
> >  * possibility of applying functions to block of data.
> > "=sum(R[-3,0]:R[-1,0])" can be computed on all the range D1:D5 in
> > parallel instead of cell per cell
>
> Only if D1..D5 hold the same value, otherwise we end up hving to do
> exactly the same thing that we do now - as explained above.

Well, in parallel, I mean that the evaluator (bytecode interpretation) can 
iterate internally on the data block.
For instance, let's say we want: "=R[-1,0] + exp(R[-2,0] / R[-3,0])"
The evaluator will parse the formula once for all and then apply it in 
parallel, ie do (in pseudo language):

for i in range:
  tmp1[i] = R[-1,0][i]
for i in range:
  tmp2[i] = R[-2,0][i] / R[-3,0][i]
for i in range:
  tmp3[i] = exp(tmp2[i])
for i in range:
  tmp4[i] = tmp1[i] / tmp3[i]

instead of

for i in range:
  tmp1 = R[-1,0][i]
  tmp2 = R[-2,0][i] / R[-3,0][i]
  tmp3 = exp(tmp2)
  tmp4[i] = tmp1 / tmp3

Where in those loop, I presume that the most expensive part is interpreting 
what to compute in tmp1, tmp2, tmp3 and tmp4 more than really do the 
computations.
Hence in parallel, we have 4 evaluation of what to compute and in series, we 
have 4 * length of range evaluation of what to compute.
Hence, it is more about avoiding the code interpretation cost.

An other way to see this is just asking for the "formula engine virtual 
machine" to allow operations (like bytecodes for Function, Add, Sub, Mul, Div 
and Pow, Neg) on vectors instead of just scalars. In fact, the same idea as 
putting MMX instructions in a CPU.

>
> >  * operations on sheet like insertion of row/columns can be done by
> > updating the range information as well as the range definition in
> > formulas
>
> Yes, that's one of the nice consequences of this - formulas are
> separate, thus this is easier.
>
> >  * importing results of simulations (hundred of scenarios with thousands
> > of statistics in each scenario). Here the data_block idead would save a
> > lot of memory.
>
> Would it really ? Each cell would hold a DIFFERENT value, right ?

Maybe I was not clear enough in my explanation (see above). 
Memory saving come from avoiding to store the value type in the cell as the 
full data_block shares the same type (int, double, time, formula).
Is it a big improvment in memory use ? I don't know, it would be necessary to 
look at the precise memory consumption of cell value and cell metadata (type 
info at least)

>
> >  * doing identical computations on each scenario or on each statistics.
> > Again, formulas were the same for a lot of cells and computing those
> > formulas in block could help.
>
> This would only really help reduce memory needed to store the formulas
> - we cannot really compute in blocks, as the DATA are not the same for
> many cells.
>
see above the parallel execution of bytecode in the function engine virtual 
machine :-)

> > Now, a last remark about a topic from this DESIGN.html document: the
> > dependency manager. Why is there a dependency manager per sheet ?
>
> Because it was the easiest solution at that time :D And I didn't feel
> like adding inter-file dependencies. Actually, I still don't quite
> understand how are these supposed to work - I mena, what if document A
> depends on document B, and then, you only open document B and change
> it ? Doc. A would know nothing about the change. Then if you close B,
> open A, it could only be updated by auto-opening B again - I don't
> like this...
>

Totally agree, it may make no sense at all ! However, I know that sometimes in 
Excel (shame on me :-) ), I would have been happy to be able to link data 
from different opened workbooks.
The auto-opening of the workbook may be a clean solution (first asking the 
user if he wants to, for sure)

> >  * this idea is valuable to pursue (I may code a prototype in python to
> > test it more carefully for corner cases).
>
> Heh, it is valuable :) Although as you can see, I kinda twisted the
> idea to look completely different :DDD
>

I try to play (in python) with the idea and let you know if I come to 
something.
_______________________________________________
koffice-devel mailing list
koffice-devel@kde.org
https://mail.kde.org/mailman/listinfo/koffice-devel

[prev in list] [next in list] [prev in thread] [next in thread]