'[Bacula-bugs] [bacula 0001032]: Mount of a new tape fails after'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       bacula-bugs
Subject:    [Bacula-bugs] [bacula 0001032]: Mount of a new tape fails after
From:       bacula-bugs () lists ! sourceforge ! net
Date:       2008-01-25 9:37:48
Message-ID: 46d859a33b339ade80f8c33f298eeea9 () bugs ! bacula ! org
[Download RAW message or body]


The following issue has been CLOSED 
====================================================================== 
http://bugs.bacula.org/view.php?id=1032 
====================================================================== 
Reported By:                neuvti
Assigned To:                
====================================================================== 
Project:                    bacula
Issue ID:                   1032
Category:                   Storage Daemon
Reproducibility:            always
Severity:                   major
Priority:                   normal
Status:                     closed
Resolution:                 fixed
Fixed in Version:           2.2.8
====================================================================== 
Date Submitted:             01-02-2008 11:05 UTC
Last Modified:              01-25-2008 09:37 UTC
====================================================================== 
Summary:                    Mount of a new tape fails after "max volume
duration" of the previous tape expired
Description: 
When the "Max Volume Duration" parameter (in my case, set to 35 days)
expires, and the autochanger changes a new tape into the tape drive, the
mount of the new tape fails. 


====================================================================== 

---------------------------------------------------------------------- 
 kern - 01-02-08 11:22  
---------------------------------------------------------------------- 
Please include the full job output report for the job that failed. 

---------------------------------------------------------------------- 
 neuvti - 01-02-08 11:27  
---------------------------------------------------------------------- 
Is this information enough:
http://marc.info/?l=bacula-users&m=119399828527068&w=2 

---------------------------------------------------------------------- 
 neuvti - 01-02-08 11:40  
---------------------------------------------------------------------- 
Attached file "Bacula-tape_exp_fail-log.txt" contains log output from the
last succesful job before tape expiration, and for the failed jobs of the
night when tape expired. 

---------------------------------------------------------------------- 
 kern - 01-02-08 13:42  
---------------------------------------------------------------------- 
Both outputs were fine and showed me what I wanted to see.

It is possible this is a bug, but much more likely your mtx script is not
correctly configured.  

First, we *strongly* recommend against running with "Always Open = no".

Please try the following test and report the results:

1. Stop Bacula or at least the SD.  You could also simply unmount the
   drive, but must remember to mount it when done testing.

2. Ensure a tape is loaded in the drive.

3. Do "mt -f /dev/nstN  status"
   where /dev/nstN is your tape drive.

4. Attach the output of the mt command to the bug report or copy and
paste
   it into a bug note.

5. Upload your mtx-changer script to the bug report. 

---------------------------------------------------------------------- 
 neuvti - 01-02-08 14:39  
---------------------------------------------------------------------- 
About AlwaysOpen=No... I have some bad experiencies about tapes getting
corrupted due to improper system shutdown. Yes, this should never be done
but... so I do prefer having the tapes unmounted while they are not in
use.

*** mt -f ***

[root@dogbert ~]# mt -f /dev/nst0  status
SCSI 2 tape drive:
File number=95, block number=0, partition=0.
Tape block size 0 bytes. Density code 0x82 (DLT 20GB uncompressed).
Soft error count since last status=0
General status bits on (81010000):
 EOF ONLINE IM_REP_EN
[root@dogbert ~]# 

---------------------------------------------------------------------- 
 kern - 01-02-08 15:29  
---------------------------------------------------------------------- 
Thanks. Everything with mt and your mtx-changer script looks OK, so it is
looking a lot more like a bug. I'll try to reproduce it here ... 

---------------------------------------------------------------------- 
 kern - 01-03-08 14:18  
---------------------------------------------------------------------- 
I've tried to duplicate this condition here, and for me every time, it
works perfectly. 

As a consequence, I think that it is a configuration problem with your
mtx-changer script.  A couple of suggestions:

1. Verify that the script that you sent me is *really* the script that the
SD is using -- maybe there is a subtle difference in the path or something
...

2. If the script is correct, then enable the "sleep" command at roughly
line 180 in the mtx-changer script. I recommend making it sleep 60
seconds.

If the problem still persists and you can isolate it in a way that I could
reproduce it, please reopen the bug and provide the new information. 

---------------------------------------------------------------------- 
 kern - 01-03-08 14:26  
---------------------------------------------------------------------- 
I've re-opened this, because after additional testing with more jobs
accessing the drive, something seems to be wrong.  The failure I am getting
is not the same as the failure you see, but until I understand this, I will
keep the bug open. 

---------------------------------------------------------------------- 
 kern - 01-04-08 20:20  
---------------------------------------------------------------------- 
I've pretty much verified (a few more tests) that this is a kernel or scsi
driver problem in the CentOS 5.1 kernel.  The problem does not exist in the
CentOS 5.0 kernel.  

The problem looks *very* much like one I saw about a year ago on SuSE, and
that particular problem could lead to system crashes.  On one of my
machines, it totally destroyed the hard disk. 

I would recommend that you take this bug very SERIOUSLY.

The standard CentOS 5.1 kernel is:
   kernel-2.6.18-53.1.4.el5
and it fails my tape regression tests in various different ways, some
identical to what you report.

The prior CentOS 5.0 kernel is:
   kernel-2.6.18-8.1.15.el5
and it seems to work perfectly fine (not all the tests are finished yet).
I am not sure what the consequences of running this kernel are. It
installed on my system with a bunch of errors -- mainly missing userids or
something, but it seems to run perfectly well.

As soon as I completely verify this, I will send out a general message to
the Bacula users, and I'll see if I can report it to Red Hat or CentOS.

 

---------------------------------------------------------------------- 
 kern - 01-04-08 20:34  
---------------------------------------------------------------------- 
I would appreciate if you would check your /var/log/messages file for scsi
errors reported at the time you have the problems with Bacula.  This is an
example of the kinds of errors I am seeing:

Jan  4 17:59:54 rufus kernel: st0: Async command already active.
Jan  4 17:59:59 rufus kernel: st: append_to_buffer offset overflow.
Jan  4 17:59:59 rufus kernel: st: append_to_buffer offset overflow.
Jan  4 18:01:27 rufus kernel: st0: Async command already active.
Jan  4 18:01:32 rufus kernel: st: append_to_buffer offset overflow.
Jan  4 18:01:32 rufus kernel: st: append_to_buffer offset overflow.
Jan  4 18:14:44 rufus kernel: st0: Async command already active.
Jan  4 18:14:49 rufus kernel: st: append_to_buffer offset overflow.
Jan  4 18:14:49 rufus kernel: st: append_to_buffer offset overflow.
Jan  4 18:16:18 rufus kernel: st0: Async command already active.
Jan  4 18:16:23 rufus kernel: st: append_to_buffer offset overflow.
Jan  4 18:16:23 rufus kernel: st: append_to_buffer offset overflow.
Jan  4 18:23:48 rufus kernel: st0: Async command already active.
Jan  4 18:23:53 rufus kernel: st: append_to_buffer offset overflow.
Jan  4 18:23:53 rufus kernel: st: append_to_buffer offset overflow.
Jan  4 18:25:22 rufus kernel: st0: Async command already active.
Jan  4 18:25:27 rufus kernel: st: append_to_buffer offset overflow.
Jan  4 18:25:27 rufus kernel: st: append_to_buffer offset overflow.
Jan  4 18:31:18 rufus kernel: st0: Async command already active.
Jan  4 18:31:23 rufus kernel: st: append_to_buffer offset overflow.
Jan  4 18:31:23 rufus kernel: st: append_to_buffer offset overflow.
Jan  4 18:42:43 rufus kernel: st0: Async command already active.
Jan  4 18:42:48 rufus kernel: st: append_to_buffer offset overflow.
Jan  4 18:42:48 rufus kernel: st: append_to_buffer offset overflow.
Jan  4 18:48:59 rufus kernel: st0: Async command already active.
Jan  4 18:49:04 rufus kernel: st: append_to_buffer offset overflow.
Jan  4 18:49:04 rufus kernel: st: append_to_buffer offset overflow.
Jan  4 18:51:18 rufus kernel: st0: Async command already active.
Jan  4 18:51:23 rufus kernel: st: append_to_buffer offset overflow.
Jan  4 18:51:23 rufus kernel: st: append_to_buffer offset overflow.
Jan  4 20:29:59 rufus kernel: st0: Async command already active.
Jan  4 20:48:45 rufus kernel: st0: Async command already active.
Jan  4 20:51:25 rufus kernel: st0: Async command already active. 

---------------------------------------------------------------------- 
 neuvti - 01-08-08 09:12  
---------------------------------------------------------------------- 
There were no kernel (nor other) messages in /var/messages file from the
time I last time met this problem. At that time, I was running
bacula-2.2.6, and CentOS was already upgraded to 5.1 and it had
kernel-2.6.18-53.

After that, I have upgraded to bacula-2.2.7 and I have not met yet the
situation when tape duration would expire in normal operation. My guess is
that expiration will happen within a few days (unless I won't touch the
volume duration settings to make it happen sooner). I'm a little bit
cautious about touching the configuration too much, since this is a system
in real production use.

My configuration is obviously relatively safe, it does not allow
concurrent jobs (your note mentions "more jobs accessing the drive").
Btw, you may have noticed this, but in the 2nd link given in "Additional
Information" section
(http://marc.info/?l=bacula-users&m=119403257915537&w=2) Arno Lehmann
mentioned he had met a similar problem which disappeared in an unresolved
way when upgrading from 2.2.4 to 2.2.5

Yesterday, I ran a lot of tests similar to my regular setup, with a
temporary test pool, with volume duration set from 15 mins to 2-5 hours,
with longest times the backups run really contained considerable amount of
data too. In those tests, I was not able to reproduce the problem that by
now had appeared in production use for several months every time the tape
duration expires. This makes me think if something has changed from 2.2.6
to 2.2.7 to fix this in an unplanned way. The only differences in these
tests were shorter volume durations (order of an hour instead of a month),
another pool, and the fact that all the jobs were started manually and not
scheduled. I'll try to test this again with scheduled jobs, but I propably
won't be able to do it before Monday the 7th.

(update on Tue 8th): Now I've had a scheduled test job running since
yesterday, making backups of about 1.2 GB in 5 min intervals, and tapes
expiring after 30 mins, 4 tapes in a pool. No errors so far... Next thing
I'll do is to see if the problem reappears when a tape in my original conf
next time expires within a week.

 

---------------------------------------------------------------------- 
 neuvti - 01-14-08 08:45  
---------------------------------------------------------------------- 
Recently, a tape expired in a regular (ie. not for testing purposes) backup
job. This resulted in a similar error situation that I've experienced every
time a tape expires in pruduction runs during the past several months.

File Bacula-tape_exp_fail-log-2.txt contains logs from
- a job that was run after my test jobs last week, this production job (id
587) requires a tape change since the currently loaded tape belongs to a
wrong pool. Here the tape change/mount works nicely. Immediatedly
thereafter, two more jobs are run, using the same tape, with no problems.
- on 12th of January a job 590 is a similar one like the first one of the
three jobs above, and it needs a tape from the same pool. However, now the
old tape expires, and another tape from the same pool is needed. This time
the tape is changed, and the mount fails. Job 591 was scheduled for the
same time, with similar timeout, so it gets canceled with no try. Job 592
(catalog backup) is tried thereafter, the correct tape is already in the
drive, but it fails too.

By now, I haven't touched the system manually in any way (just read the
messages from bconsole) so if there is something special I could check
(like tape drive status) after the failed jobs above, it is still possible.
Unfortunately, I can't keep the system without backups for a long time so
at latest on tomorrow (Tue) I'll need to manually solve the situation
again. 

---------------------------------------------------------------------- 
 kern - 01-17-08 19:50  
---------------------------------------------------------------------- 
Please try installing the BETA 2.2.8 (be prepared to back it out if you
have problems). There is a small chance that it will fix this bug. 

Feedback would be appreciated, but please be sure to specify *exactly*
what version/patches you are running.

I believe that I have a complete fix for this problem, but it represents a
rather big change in the reservations system, so I consider it too risky to
release.  I recommend trying 2.2.8 first, then if you *really* want to try
the new code, it is in the trunk SVN. 

---------------------------------------------------------------------- 
 neuvti - 01-24-08 08:35  
---------------------------------------------------------------------- 
I made some experiments to build a 2.2.8-beta rpm but unfortunately it goes
beyond my capabilities.

Since my only autochanger is in a production system, and I have a kind of
policy requiring installed software to use rpm packages in those systems,
this testing looks difficult right now.

However, I've done some extensive testing with 2.2.7 and I'm still unable
to reproduce the problem with any artifical setup, although it appears
every time the tapes expire in real life. So propably the next time I'll
met this problem will be in mid-February. Unless I don't find any way to
have 2.2.8 installed by then, I'll propably add a really significant wait
time to my current mtx-changer script to check out that possible source of
problem. 

---------------------------------------------------------------------- 
 kern - 01-25-08 09:37  
---------------------------------------------------------------------- 
I'm closing this bug because I think there is a good chance that it is
fixed in 2.2.8, though 2.2.8 is not a complete solution. I understand your
problem with rpms -- I hope to officially release 2.2.8 in the near future,
and thus the rpms will follow.  If 2.2.8 does not fix your problem (rather
long cycle), please either open a new bug report or re-open this one.

Yes, this bug is very hard to reproduce in an artificial environment. I
think I have done so, and if it does reproduce your problem (not sure)
2.2.8 will not correct the problem, but I do have code in the trunk that
does correct my test case designed to duplicate your problem, so hopefully
in the not to far future we can put that into production (perhaps 2.2.9 or
2.2.10 ...). 

Issue History 
Date Modified   Username       Field                    Change               
====================================================================== 
01-02-08 11:05  neuvti         New Issue                                    
01-02-08 11:05  neuvti         File Added: bacula-dir.conf                    
01-02-08 11:22  kern           Note Added: 0003046                          
01-02-08 11:22  kern           Status                   new => feedback     
01-02-08 11:23  neuvti         File Added: bacula-sd.conf                    
01-02-08 11:27  neuvti         Note Added: 0003047                          
01-02-08 11:40  neuvti         File Added: Bacula-tape_exp_fail-log.txt         
          
01-02-08 11:40  neuvti         Note Added: 0003048                          
01-02-08 13:42  kern           Note Added: 0003049                          
01-02-08 14:39  neuvti         Note Added: 0003050                          
01-02-08 14:40  neuvti         File Added: mtx-changer                      
01-02-08 15:29  kern           Note Added: 0003051                          
01-03-08 14:18  kern           Note Added: 0003055                          
01-03-08 14:18  kern           Status                   feedback => closed  
01-03-08 14:18  kern           Resolution               open => unable to
duplicate
01-03-08 14:26  kern           Note Added: 0003057                          
01-03-08 14:26  kern           Status                   closed => acknowledged
01-03-08 14:26  kern           Resolution               unable to duplicate =>
reopened
01-04-08 20:19  kern           Note Added: 0003058                          
01-04-08 20:19  kern           Status                   acknowledged => feedback
01-04-08 20:19  kern           Resolution               reopened => open    
01-04-08 20:20  kern           Note Edited: 0003058                         
01-04-08 20:34  kern           Note Added: 0003059                          
01-05-08 10:14  neuvti         Note Added: 0003060                          
01-08-08 09:12  neuvti         Note Edited: 0003060                         
01-14-08 08:20  neuvti         File Added: Bacula-tape_exp_fail-log-2.txt       
            
01-14-08 08:45  neuvti         Note Added: 0003071                          
01-17-08 19:50  kern           Note Added: 0003082                          
01-24-08 08:35  neuvti         Note Added: 0003087                          
01-25-08 09:37  kern           Note Added: 0003095                          
01-25-08 09:37  kern           Status                   feedback => closed  
01-25-08 09:37  kern           Resolution               open => fixed       
01-25-08 09:37  kern           Fixed in Version          => 2.2.8           
======================================================================


-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
Bacula-bugs mailing list
Bacula-bugs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-bugs
[prev in list] [next in list] [prev in thread] [next in thread]
Configure | About | News | Add a list | Sponsored by KoreLogic