[prev in list] [next in list] [prev in thread] [next in thread] 

List:       bacula-bugs
Subject:    [Bacula-bugs] [bacula 0001666]: Bscan recreates jobmedia records
From:       Mantis Bug Tracker <nobody () baculabugs ! unixathome ! org>
Date:       2011-03-29 19:22:33
Message-ID: 093f89c6fb6f79aba2460f655bdb35a5 () bugs ! bacula ! org
[Download RAW message or body]


A NOTE has been added to this issue. 
====================================================================== 
http://bugs.bacula.org/view.php?id=1666 
====================================================================== 
Reported By:                craigmiskell
Assigned To:                kern
====================================================================== 
Project:                    bacula
Issue ID:                   1666
Category:                   bscan
Reproducibility:            always
Severity:                   minor
Priority:                   normal
Status:                     assigned
====================================================================== 
Date Submitted:             2010-11-22 21:08 GMT
Last Modified:              2011-03-29 19:22 GMT
====================================================================== 
Summary:                    Bscan recreates jobmedia records unnecessarily, and
only creates one
Description: 
When bscan is run on a volume with a job which has hit it's File Retention
period, but not the Job Retention period, bscan will create a single new
jobmedia record, even though there are existing jobmedia records in the
database.  Further, this record covers the entire section of the media where the
job is, rather than inserting one record per on-media 'file', as is done by the
backup job.    

Consequently, when running a restore from files on that job, there will be
entries in the bsr as many times as there are jobmedia records that cover the
required fileindexes (2, after a backup and a single bscan; one added for each
subsequent bscan).  The restore will complete, but often using the newly
inserted jobmedia records rather than the more efficient ones from the original
backup, which can be very slow due to having to scan the whole tape rather than
fast forwarding.  The final log for the restore also reports: "warning file
count mismatch", and the expected file count differs from the restored file
count, which can be concerning to some operators.

The attached patch creates jobmedia records in the database equivalent to the
original backup, but only if they don't already exist.  In creating this patch,
I saw what appears to be some oddness in the block numbers inserted into these
records by the original backup, which I'll raise in another ticket.  As arrogant
as it sounds, I think the values inserted by this patch are more correct than
the originals.  I think that worries me slightly; either there's a true bug in
the original record creation, or I don't understand what I'm doing.

The patch has been reasonably well tested in some simple cases involving:
a) single job on one volume
b) three jobs (two clients) on one volume
varying from a File Retention purge, missing JobMedia records, through to a full
Purge of the original volume.  
All jobs were small (~30GB or less).  I'm not in a position to easily test a
backup that spans volumes, although I've attempted to handle that correctly and
hope that it will work. 

Please let me know if you require a contributor agreement be signed for this
patch; it could well be too large to just accept, although is still just a patch
and is well contained.  I'm happy either way.

Steps to Reproduce: 
Perform a backup.
Allow that backup to have it's File records pruned by File Retention, but not
the Job records. (or just delete the job specific records from the File table
for that job; I believe that's equivalent)
bscan the media
Look in the jobmedia table for that jobid to see the single new additional
record.  Prepare a restore and see that the bsr has multiple entries for the
same file, one with limited VolAddr values, one with expansive values.

Additional Information: 
The misleading status message, from restoring a single file, after having run
bscan twice (records got nuked by File Retention again after the first bscan):

19-Nov 10:17 ausv01-dir JobId 553: Bacula ausv01-dir 5.0.0 (26Jan10): 19-Nov-201
> 0 10:17:39
>   Build OS:               i486-pc-linux-gnu debian 4.0
>   JobId:                  553
>   Job:                    RestoreFiles.2010-11-19_09.30.55_11
>   Restore Client:         ausv01
>   Start time:             19-Nov-2010 09:30:57
>   End time:               19-Nov-2010 10:17:39
>   Files Expected:         3
>   Files Restored:         1
>   Bytes Restored:         127,488
>   Rate:                   0.0 KB/s
>   FD Errors:              0
>   FD termination status:  OK
>   SD termination status:  OK
>   Termination:            Restore OK -- warning file count mismatch
>
====================================================================== 

---------------------------------------------------------------------- 
 (0005670) craigmiskell (reporter) - 2010-11-22 21:26
 http://bugs.bacula.org/view.php?id=1666#c5670 
---------------------------------------------------------------------- 
New version of the patch uploaded; who knew you needed semicolons at the end of
*every* line :)

(Got messed up in copying the patch from my existing dev code to my git
repository; I'm fixing my processes now :)) 

---------------------------------------------------------------------- 
 (0005775) kern (administrator) - 2011-02-25 10:00
 http://bugs.bacula.org/view.php?id=1666#c5775 
---------------------------------------------------------------------- 
After looking at this, I have the following comments:

1. Yes, something this size will require an FLA
2. Yes, I agree, bscan could use a bit of work. It is really meant to be a 
   last resort tool, so I haven't put much time into it.
3. You have a good point that bscan should see if there are already JobMedia
   records and not add new ones in that case.
4. Your variable names are very hard to read because they do not follow either
   of our two conventions to separate parts with _ or by using judicious 
   capitalization.
5. You have a number of globals such as prevmediafileindex, jobmedialastindex,
   ... This is doomed to failure, because bscan must simulate multiple 
   simultaneous jobs, thus these variables must be "local" to the job, which
   is normally done by having them in the jcr.
6. It is *very* dangerous to attempt to compute indexes and blocks outsize the
   core code that does it correctly at least in writes. For reads these indexes
   and block numbers are not used much so they may need some tuning.  Unless it
   is totally impossible to do this with our core routines (read_record), I 
   wouldn't seriously consider including code that attempts to compute those
   values differently.
7. db_get_num_job_media... seems to me so specialized that it is far better
   to do it via a special SQL statement as we do for all such cases it
   Bacula core code.  Otherwise we will end up with too many SQL API calls
   that are one-shot only.
8. I am slightly undecided on the following: I question the wisdom of
   writing a lot of code for properly restoring all the JobMedia records. 
   Having one big one is certainly inefficient, but it gets the job done,
   and the number of people using bscan should really be minimal -- just
   increase your retention periods, the extra disk space is negligible these
   days.
9. I see that you have put a lot of time and effort into this so please don't
   get discouraged by the above comments.

Bottom line: 
- We would appreciate a patch that just corrects the problem of
doubling up the JobMedia records.  
- For the rest, you will need to convince me as the code looks like it would 
be more work to maintain than it is worth, and given the static variables 
I don't think it will work for anyone who used multiple simultaneous 
backup jobs. 

---------------------------------------------------------------------- 
 (0005780) craigmiskell (reporter) - 2011-02-28 22:51
 http://bugs.bacula.org/view.php?id=1666#c5780 
---------------------------------------------------------------------- 
Thanks for the feedback, I appreciate it's constructive nature.  

Speaking philosophically to start with:  bscan isn't quite as "last resort" for
us as you consider it.  For largely local political reasons, the additional disk
space requirement for upping the retention time isn't negligible or low cost. 
It sucks, but it's true for us.  When restoring off a bscan'd volume/job, we've
already read the tape through once, taking X hours (3 or 4 is not unheard of on
full LTO-4 tapes).  Then on the inevitable restore, we have, on average, read
half the tape (taking X/2) hours to get to the bit we're interested in.
Sometimes more.  As opposed to having all the JobMedia records and
fast-forwarding to the correct block and reading it in really quickly.  

If this takes more than a working day (and with 3-4 hours per operation, it's
not unlikely), this can impact on the ability to run the next nights backups (or
make our people stay late etc.). Plus it's quicker/better for the user who's
waiting for some probably important data.  So I'm really rather keen on getting
a patch together that is acceptable.  

I have a bunch of other stuff that's slightly higher in my queue first though,
so I'd appreciate it if you could keep this bug open for a week or two.

Then I'll submit:
1) A patch that just avoids the doubling up of records.  I *think* that can be
pulled out as separate, and it's well worth getting in while we argue^Wdiscuss
the merits of the more substantial change:
2) A better patch that addresses your various concerns.  This is what will take
me some time; I've obviously missed some available existing code (block
calculations), mainly due to lack of general familiarity with the entire code
base.  I'll hunt around, and may have some more questions; where's the best
forum for asking them? 

---------------------------------------------------------------------- 
 (0005782) craigmiskell (reporter) - 2011-03-02 23:01
 http://bugs.bacula.org/view.php?id=1666#c5782 
---------------------------------------------------------------------- 
Ok, I've been looking at the code a bit more, and I'm getting what you mean
about the variables in the JCR now (multiple jobs on a tape). It appears to me
that I'm going to have to check whether there are existing job media records on
a per-job basis, around the time that we find the SOS_LABEL, and then store
that, presumably in the JCR.  

So, is it ok to add bscan specific fields to the JCR?  My gut instinct is to add
a pointer to a structure, defined and owned by bscan, to keep the JCR clear of
extraneous cruftage.  It's not so bad for this one field, but the other code I
hope to get in would need a bunch of fields that are *very* bscan specific.  I'm
interested in your opinion on this approach. 

---------------------------------------------------------------------- 
 (0005785) kern (administrator) - 2011-03-03 07:12
 http://bugs.bacula.org/view.php?id=1666#c5785 
---------------------------------------------------------------------- 
I suggest the following things to improve your chances of success:

1. If possible use git and format-patch.  Otherwise a diff -u is OK.
2. Try to keep patches small.  It is better to submit a number of small patches
   than one large one.
3. For development purposes, add variables if needed to the JCR to simplify
   development and testing,  but I believe that the JCR already has enough 
   variables to do what you want.
   We can discuss the details once you have some working code.
4. Be *very* careful to write the minimum amount of code in bscan.  If you try
   to put all the code necessary tracking code to create JobMedia records in
   bscan it will probably not be approved.  The proper way, is to make minimum
   changes to the read-record code to track where you are and add an appropriate
   callback when and EOF is read so that JobMedia records can be created.  I.e.
   try to simulate the write_block_to_dev code in the read-record code. 

---------------------------------------------------------------------- 
 (0005789) craigmiskell (reporter) - 2011-03-06 22:57
 http://bugs.bacula.org/view.php?id=1666#c5789 
---------------------------------------------------------------------- 
New patch is just uploaded.  This addresses just the "duplicate jobmedia
records" issue.  Per your previous suggestion, I've just added a horribly named
arbitrary bool to the JCR to keep track of this new state.  That will need to
change, but to what, I don't know (something local to bscan, but not a single
static var?)

Let me know what you think; if I'm on the right track/style with this code I'll
have a better chance of getting the next patch right. 

---------------------------------------------------------------------- 
 (0005794) kern (administrator) - 2011-03-07 16:32
 http://bugs.bacula.org/view.php?id=1666#c5794 
---------------------------------------------------------------------- 
OK, I have applied your patch, with one change. I moved the new jcr bool down to
the Storage daemon section.  I have pushed updated code to the bacula.org git
repo, and it is in Branch-5.1, which will be the next release in about a month.

Have you signed and sent in an FLA?  If not, please do so.  This patch is large
enough that I really need one.

For the next time, please pay attention to small details such as:
1. Not adding white space at the end of lines.  Gig complains when applying the
patch.  Not serious, ...
2. We always put a space after if but not after (.  So instead of:
    if( xxx === yyy )    write
    if (xxx == yyy)

Many thanks for the patch.  Let me know what you plan for the next one :-) 

---------------------------------------------------------------------- 
 (0005817) craigmiskell (reporter) - 2011-03-29 19:22
 http://bugs.bacula.org/view.php?id=1666#c5817 
---------------------------------------------------------------------- 
Hi,
Due to an upcoming change in employment, I'm highly unlikely to get any time to
spend on any more patches.  Given that we've solved the only actual bug and the
remaining work is only an enhancement, it might be worth closing this bug,
unless someone else wants to take it up. 

Issue History 
Date Modified    Username       Field                    Change               
====================================================================== 
2010-11-22 21:08 craigmiskell   New Issue                                    
2010-11-22 21:08 craigmiskell   File Added:
0001-Patch-to-make-bscan-do-better-with-jobmedia-records.patch                  
 
2010-11-22 21:25 craigmiskell   File Added:
0001-Patch-to-make-bscan-do-better-with-jobmedia-records.patch2                 
  
2010-11-22 21:26 craigmiskell   Note Added: 0005670                          
2010-11-23 08:49 mnalis         Issue Monitored: mnalis                      
2010-12-30 11:31 thrasherbcn    Issue Monitored: thrasherbcn                    
2011-01-18 21:56 slords         Issue Monitored: slords                      
2011-02-25 10:00 kern           Note Added: 0005775                          
2011-02-25 10:00 kern           Status                   new => feedback     
2011-02-28 22:51 craigmiskell   Note Added: 0005780                          
2011-02-28 22:51 craigmiskell   Status                   feedback => new     
2011-03-02 23:01 craigmiskell   Note Added: 0005782                          
2011-03-03 07:12 kern           Note Added: 0005785                          
2011-03-03 07:12 kern           Assigned To               => kern            
2011-03-03 07:12 kern           Status                   new => feedback     
2011-03-06 22:53 craigmiskell   File Added:
0001-Fix-for-duplicate-jobmedia-records-per-bug-1666.patch                    
2011-03-06 22:57 craigmiskell   Note Added: 0005789                          
2011-03-06 22:57 craigmiskell   Status                   feedback => assigned
2011-03-07 16:32 kern           Note Added: 0005794                          
2011-03-07 16:32 kern           Status                   assigned => feedback
2011-03-29 19:22 craigmiskell   Note Added: 0005817                          
2011-03-29 19:22 craigmiskell   Status                   feedback => assigned
======================================================================


------------------------------------------------------------------------------
Enable your software for Intel(R) Active Management Technology to meet the
growing manageability and security demands of your customers. Businesses
are taking advantage of Intel(R) vPro (TM) technology - will your software 
be a part of the solution? Download the Intel(R) Manageability Checker 
today! http://p.sf.net/sfu/intel-dev2devmar
_______________________________________________
Bacula-bugs mailing list
Bacula-bugs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-bugs
[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic