'KSpread data model (~format storage)'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       koffice-devel
Subject:    KSpread data model (~format storage)
From:       Sébastien_de_Menten_de_Horne <sdementen () skynet ! be>
Date:       2005-05-16 22:01:05
Message-ID: 200505170001.06127.sdementen () skynet ! be
[Download RAW message or body]

Hi,

As a long term user and fan of KDE, I follow with interest the evolution of 
koffice and more specifically of kspread. I think kspread may be used very 
easily as a GUI for some king of applications (finance, simple linear 
programming, basic statistic, ) through kpart or DCOP. This GUI offers (or 
will offer) efficient tabular data (input/output), charts, plenty of already 
implemented functions (sort, sum, ...), easy way to load/save configurations 
(stored in worksheets). I developed applications in python with an excel GUI 
on windows and I must say it was a tremendous time saver ! But on linux, I 
think it could be done more easily with the current stack of technologies.

Well, after this small introduction presenting my motivation, let's start with 
my point: kspread internal data mode (or format storage?)
I read with interest the kspread/DESIGN.html in order to understand the 
internals of kspread with still a certain level of abstraction (BTW, is there 
more documentation of this kind for kspread  (ie at the same level of 
abstraction)? )

I have some specific and general comments on it.

Now, what I have understood basically is that kspread stores information on 
each cell on a cell per cell basis for its content/value/formula (i.e. create 
a cell object when it is non empty with this information) and in another way 
(ala "raytracing"/depth-buffer) for format storage.
Before reading this file, I had an idea for a efficient spreadsheet data model 
that I wanted to test in python. After my reading, I may say that it was more 
in the spirit of the format storage depicted in DESIGN.html but for content.

In fact, if I take the example for form storage in DESIGN:
  Range    | Formatting Piece
  Column B | Bold on
  Row 2    | Italics on
  A1:C5    | Yellow background
I think it can be conceptually extended to data as
  Range    | Content/Content
  Column B | 1                       
  Row 2    | "hello world"           
  A1:C5    | data_block[0]           
  D1:D5    | =sum(R[-3,0]:R[-1,0])   

This would mean quite naturally:
Column B has values 1 in each cell
Row 2 has values "hello world" in each cell (overwriting old B2=1)
A1:C5 has values taken from a block stored in data_block[0]
D1:D5 has values =sum(R[-3,0]:R[-1,0]) in each cell

Now, I explain the special meaning of data_block[0] and =sum(R[-3,0]:R[-1,0]).
 * data_block[0] is a block of data of size 5x3 stored in memory at 
data_block[0]. This has the advantage of a very efficient representation as 
all element in data_block[0] shares the same type (double, float, string,...)
 * =sum(R[-3,0]:R[-1,0]) is a formula expressed in relative position where 
R[-3,0] means Relative cell with offset -3 in column and offset 0 in row. 
Again the representation is terse. It is also possible to specify absolute 
cells with a A[1,1]:A[4,1] notation.

The benefit of this approach are multiples:
 * efficient memory usage
 * possibility of applying functions to block of data. "=sum(R[-3,0]:R[-1,0])" 
can be computed on all the range D1:D5 in parallel instead of cell per cell
 * same algorithm for display as the one used for "Format storage", 
possibility to use quad-tree for efficient implementation
 * operations on sheet like insertion of row/columns can be done by updating 
the range information as well as the range definition in formulas
 * dependencies computation can be computed again in parallel by looking at 
those range/value pairs

Finally, building this value table can be done similarly to format storage by 
using manipulators (sequence fill, copy range, etc). Well, I did not get 
precisely what is behind manipulators so I may be wrong here.

While this approach may not be worth for small worksheets, I think for big 
worksheets it could help a lot. An example of a big worksheet appearing in my 
application where excel was the GUI was:
 * importing results of simulations (hundred of scenarios with thousands of 
statistics in each scenario). Here the data_block idead would save a lot of 
memory.
 * doing identical computations on each scenario or on each statistics. Again, 
formulas were the same for a lot of cells and computing those formulas in 
block could help.

Now, a last remark about a topic from this DESIGN.html document: the 
dependency manager. Why is there a dependency manager per sheet ? 
I mean why not:
 a) a dependency manager for a kspread session. This would allow having 
dependencies from one file to another (not only between sheets from the same 
file)
 b) a dependency manager for a file. It would take care of all dependencies in 
the same file (sheets and chart)
 c) a dependency manager for each cells (this was the first approach with 
dependencies stored in each cell)
 d) a dependency manager for each block of data (this is related to the 
previous discussion if we describe the sheet in a "Format storage" way)

Ok, some alternatives in this question are too generic (a and b), already 
tried (c) or hypothetic (d) but my point is more about the arbitrary (at 
least to me :-) ) decision of one dependency manager per sheet.

Enough for now, could you be so kind to tell me if:
 * I was intelligible in this email :-) I may try to explain again my idea if 
it was too obscurve/confusing ...
 * this idea has already been beaten to death thousand of times. If it was, 
point me to some litterature on the topic
 * this idea is valuable to pursue (I may code a prototype in python to test 
it more carefully for corner cases).

To finish, thank you, thank you and thank you all for this amazing desktop !!!
Keep up this good work.

Sebastien

PS: could you reply to my address as I am not subscribed to koffice-devel ? TX
_______________________________________________
koffice-devel mailing list
koffice-devel@kde.org
https://mail.kde.org/mailman/listinfo/koffice-devel
[prev in list] [next in list] [prev in thread] [next in thread]