[prev in list] [next in list] [prev in thread] [next in thread] 

List:       gnuplot-info-beta
Subject:    [Fwd: volunteering for gnuplot work?]
From:       Lars Hecking <lhecking () users ! sourceforge ! net>
Date:       2005-12-01 9:50:52
Message-ID: 20051201095052.GB20124 () cypress ! com
[Download RAW message or body]

----- Forwarded message from Thomas Mattison <mattison@phas.ubc.ca> -----

To: lhecking@users.sourceforge.net, broeker@users.sourceforge.net,
	cgaylord@users.sourceforge.net
From: Thomas Mattison <mattison@phas.ubc.ca>
Subject: volunteering for gnuplot work?
Date: Wed, 30 Nov 2005 13:21:24 -0800

Hi

I'm a physics professor, and I use gnuplot for teaching and my own 
work. There are some annoyances about the fitting aspects of gnuplot 
that I would like to volunteer to fix.

The major annoyance is that the errors from fits are not done in what I 
consider to be the correct way.  There exists a definition for the 
error of a fit that is independent of the goodness of the fit, although 
of course it depends on the function, the distribution of the data, and 
the error bars on the data.  Gnuplot fits do not report this result.  
Internally, gnuplot does calculate this error, but it then rescales it 
by the goodness of the fit.

If the function fits the data perfectly at every point, the error 
returned by gnuplot is zero.  This is clearly nonsense.

Consider the simplest possible case of a single data point with a y 
value and error, and fitting a single-parameter function whose value is 
just the parameter.  The value of the fit parameter should be the y 
value, and the error on the fit parameter should be the error on the 
data point.  A gnuplot fit returns the right parameter value, but since 
the goodness of fit is perfect, gnuplot says the error on the fit 
parameter is zero.

The rescaling that gnuplot does to the fit errors is equivalent to 
calculating a single scale factor on the data point errors from the 
residuals of the fit, and using the rescaled data errors in the fit.

There is some value to this rescaling in many common circumstances.  
The standard formula for fit parameter errors requires that valid data 
errors be input to the fit.  If the user does not have them, the 
gnuplot algorithm is equivalent to computing a uniform value for the 
data errors from the residuals of the fit.  I don't have a better 
suggestion for what to do in such cases.

My proposed solution would be to report both versions of the error: the 
conventional error that is independent of the goodness of fit, and the 
rescaled error that gnuplot presently reports.  I would also explicitly 
report the factor by which the data errors have been effectively 
rescaled.  If the user does not include errors in the fit, I would make 
both answers be the present gnuplot answer.  In this case, the above 
error-rescaling factor is in fact the effective error that gnuplot has 
calculated for the data points.  If the data were plotted with this 
constant error value, then they should agree with the fitted function 
within this error about 2/3 of the time.

A feature of this fix is that it would change the format of the 
parameter and error report from the fits.  I don't have as good an idea 
as you folks do about whether users would care about format changes.  
Perhaps a backward-compatibility switch could be put in, so the 
old-format report could still be available.

Another annoyance is the lack of a nice way to record fit error values 
(it would also be nice to record parameter errors from multiple similar 
fits to different data sets).

I would add a feature: create a parameter-log file that would contain 
gnuplot-readable fit summary information: chisquare, parm-A, 
normal-errorA, rescaled-error-A, parm-B, regular-errorB, 
rescaled-error-B, ...  Each fit would append to the end of the file, 
similar to the present fit log file.  I would precede each line with a 
copy of the fit command that produced the fit (with a # in front so 
gnuplot would consider it to be a comment).  Probably I would also have 
a commented line giving the time of the fit.  It would also be possible 
to create a comment line containing headers to show which column means 
what for the fit, using the user's names for the fit variables.

An example of using this feature is an experiment students do in my 
lab.  They fit the current-voltage relationship for a diode, which is 
I(V) = I0 * (exp(V/V0) - 1).  They do this with the diode dunked in 
liquid nitrogen, dry ice and alcohol, ice water, and boiling water.  
The I0 and V0 parameters depend on temperature, and I want the students 
to plot and fit the parameters vs temperature.  The students could take 
this new parameter-log file, add a temperature column to the file with 
a text editor, then plot and fit from this file.

I also have some minor annoyances that I think are worth fixing.  I 
would change the default FIT_START_LAMBDA to be 0.01 as recommended by 
Numerical Recipes rather than the method used now  (I tell my students 
to do this and it frequently helps).  I would make the default value 
for uninitialized variables in fits to be 0.01 rather than 1e-30 (the 
numerical derivatives algorithm tends to break with such a tiny 
default).  I would fix the numerical derivatives algorithm so it would 
not break if the initial value of a parameter is zero.

There is a more major feature addition that would be nice.  Frequently, 
data has errors not only on the y-variable, but also on the x-variable. 
 Gnuplot (nicely) handles plotting data with both x and y errors, but 
it doesn't know how to fit such data.  The rigorously correct way 
involves fitting for adjustments to all the x-variable values, as well 
as the parameters.  The naive way to do this produces impractically big 
matrices when there is a lot of data, even if the fit function is 
simple (the computation time goes as data-points cubed instead of 
proportional to points).  There is a more efficient but still rigorous 
way to do it (goes as parameters squared times points), and an even 
more efficient way (just a few times longer than the conventional fit, 
with not much extra coding required).

Very few commonly available programs that do fits deal correctly with 
errors in x variables as well as y variables.  I would be willing to 
try adding this to gnuplot.

Note that there is a user-interface consistency issue.

The following lines do analogous things.

plot     'file' using 1:2:3 with err
fit f(x) 'file' using 1:2:3          via a,b,c

But the next two lines don't do analogous things

plot       'file' using 1:2:3:4 with xyerr
fit f(x,y) 'file' using 1:2:3:4             via a,b,c

The plot command uses columns 3 and 4 for x and y errors, and the fit 
command uses columns 3 and 4 for z and z errors.

One solution would be to make the interpretation of the columns in a 
fit command depend on the number of variables in the function.  In this 
case,

fit f(x  ) 'file' using 1:2:3:4             via a,b,c
fit f(x,y) 'file' using 1:2:3:4             via a,b,c

would do different things.  The first one would be y vs x with x and y 
errors,
the second would be z vs (x,y) with z errors.

Another solution would be to allow fit commands to "with" options like 
plot commands.  If the "fit with" options were absent, gnuplot would do 
what it does now.  If the "fit with" options were present, they would 
override the current default.

I have my own c-language fitting package for nonlinear fits using 
numerical derivatives that has all the capability of the gnuplot 
fitting internals.  It would probably be easier for me to add xy error 
fitting by replacing the fitting guts of gnuplot with my own code than 
to reverse-engineer what's there now well enough to add what would be 
required (yes I have looked at it).  I would be willing to do bug-fixes 
and maintenance to the fitting stuff afterwards.

So what do you think of the possibility of me getting involved in this?


                                        Cheers

Prof. Thomas Mattison, Dept. of Physics & Astronomy, Univ. of British 
Columbia

Present Address: Stanford Linear Accelerator Center
2575 Sand Hill Road, Menlo Park, CA, 94025
Building 48 (Research Office Building), Mail Station MS35
Office:  ROB-231   Phone: 650-926-5342   Fax: 650-926-8522


----- End forwarded message -----


-------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc. Do you grep through log files
for problems?  Stop!  Download the new AJAX search engine that makes
searching your log files as easy as surfing the  web.  DOWNLOAD SPLUNK!
http://ads.osdn.com/?ad_id=7637&alloc_id=16865&op=click
_______________________________________________
gnuplot-beta mailing list
gnuplot-beta@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/gnuplot-beta
[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic