[prev in list] [next in list] [prev in thread] [next in thread] 

List:       bacula-bugs
Subject:    [Bacula-bugs] [bacula 0001596]: job fails with "Fatal error:
From:       Mantis Bug Tracker <nobody () baculabugs ! unixathome ! org>
Date:       2011-03-07 14:15:29
Message-ID: 213a2f84237456aa6a9df831e8dff7af () bugs ! bacula ! org
[Download RAW message or body]


The following issue has been REOPENED. 
====================================================================== 
http://bugs.bacula.org/view.php?id=1596 
====================================================================== 
Reported By:                mnalis
Assigned To:                ebollengier
====================================================================== 
Project:                    bacula
Issue ID:                   1596
Category:                   Storage Daemon
Reproducibility:            unable to reproduce
Severity:                   minor
Priority:                   normal
Status:                     feedback
====================================================================== 
Date Submitted:             2010-06-15 15:08 GMT
Last Modified:              2011-03-07 14:15 GMT
====================================================================== 
Summary:                    job fails with "Fatal error: askdir.c:339 NULL
Volume name. This shouldn't happen!!!"
Description: 
During automated nightly (mostly incremental) backup of about 150 hosts, one of
the jobs failed with "Fatal Error". Attached is full job error.

Steps to Reproduce: 
I was unable to reproduce it; this is the first time I've seen it. It is
standard bacula backup run which worked mostly the same for weeks (and months in
previous versions).

Additional Information: 
Unless it starts happen more often, it is not a problem for us, but I still
report it in case it is interesting to developers due to "This shouldn't
happen!!!" text...

config files if needed are mostly the same as conf_20100422.tgz in closed
bug#1553 (altohugh a few hosts have been added and same schedule times changed).
I can produce fresh ones with little effort if they are needed.
====================================================================== 

---------------------------------------------------------------------- 
 (0005467) mnalis (reporter) - 2010-06-28 11:00
 http://bugs.bacula.org/view.php?id=1596#c5467 
---------------------------------------------------------------------- 
just FYI, it did happen again (when running differental backups, so somewhat
bigger load that usual):


2010-06-27 02:35:03
birdun-dir
Start Backup JobId 43299, Job=kbczg.2010-06-27_02.35.00_39
2010-06-27 02:36:27
birdun-dir
There are no more Jobs associated with Volume "A00067L4". Marking it purged.


New Pool is: Scratch


All records pruned from Volume "A00067L4"; marking it "Purged"
2010-06-27 02:37:09
birdun-dir
Recycled volume "A00053L4"


Using Volume "A00053L4" from 'Scratch' pool.


Using Device "drv6_lto4"
2010-06-27 02:37:42
birdun-sd
Fatal error: askdir.c:339 NULL Volume name. This shouldn't happen!!!


Spooling data ...


Job write elapsed time = 00:00:01, Transfer rate = 0  Bytes/second
2010-06-27 02:39:13
kbczg-fd
Fatal error: backup.c:1019 Network send error to SD. ERR=Connection reset by
peer


Let me know if I can provide more information. 

---------------------------------------------------------------------- 
 (0005502) kern (administrator) - 2010-07-19 16:46
 http://bugs.bacula.org/view.php?id=1596#c5502 
---------------------------------------------------------------------- 
Without knowing more about what happened we cannot do anything about this.  Did
you cancel a job (or in particular that job)?

It looks like you deleted parts of the job report.

Clearly something went wrong with that job, but as I say, without more
information there isn't much we can do. 

---------------------------------------------------------------------- 
 (0005509) mnalis (reporter) - 2010-07-20 12:43
 http://bugs.bacula.org/view.php?id=1596#c5509 
---------------------------------------------------------------------- 
No, we didn't manually cancel the job; nor was it to be automatically canceled
due to time restrictions (in both cases, although different clients, job was
scheduled on 2:35, and failed 4-5 minutes later, and they had only "Max Run
Sched Time = 19800" specified).

So in both cases it was completely unattended nightly run. I cannot guarantee
there weren't any network issues, although I do not think it likely (none of the
dozen other servers in the same network running at the same time exhibited
similar behavior).

I've attached bigger excerpt from bacula log file for second time it happened
(job 43299) in null_volume_case2_full.txt, does that look complete? 

Browsing through the logs, I find one more instance (attached in
null_volume_case3_39066_full.txt), with similar characteristics (failing just a
few minutes after start and having "NULL Volume name" error, also unattended run
at 02:35, and also a third different host). 

Client bacula-fd versions were also different: 2.4.4 and 5.0.1.

we have IBM3584 tape changer library with barcodes and several tape drives
(4xLTO2, 4xLTO4) and due to concurrency there are usually 2-6 appendable volumes
in each pool, and bacula automatically gets more from Scratch (where expired
tapes are being put) if needed. Scratch pool is also well stocked (about 10 LTO4
and 30 LTO2 purged tapes ATM). No new tapes were being introduced into the
library for months, so all of them in Scratch are all previously used and purged
tapes.

that's it about the generals, let me know if you're after some specific
information, or if I can help by trying some patch which puts out more debug
info in such case or in some other way. 

---------------------------------------------------------------------- 
 (0005510) kern (administrator) - 2010-07-20 12:54
 http://bugs.bacula.org/view.php?id=1596#c5510 
---------------------------------------------------------------------- 
For Jobs that fail, are your FD, Dir, and SD all running the same version?  If
not what are the versions? 

---------------------------------------------------------------------- 
 (0005511) mnalis (reporter) - 2010-07-20 12:59
 http://bugs.bacula.org/view.php?id=1596#c5511 
---------------------------------------------------------------------- 
No, Dir and SD (running on the same host, birdun) are same version of course
(GIT Branch-5.0 as of 20100525), and failing clients were:

flower-fd - 2.4.4 (from default debian lenny package)
leo-fd - 2.4.4 (from default debian lenny package)
kbczg-fd - 5.0.1 (from debian lenny backports) 

---------------------------------------------------------------------- 
 (0005512) kern (administrator) - 2010-07-20 13:18
 http://bugs.bacula.org/view.php?id=1596#c5512 
---------------------------------------------------------------------- 
OK, some more questions:

Do you have "Prefer Mounted Volumes = no" set in your bacula-dir.conf?

Would you look and see if a volume is being recycled within a minute or two of
all the failures you are getting?

I recommend you move your 2.4.4 FDs up to version 5.0.x. This doesn't seem to be
the problem, but there are some significant differences in the way the Dir, SD,
and FD communicate between version 2.4.4 and 5.0.x 

---------------------------------------------------------------------- 
 (0005513) mnalis (reporter) - 2010-07-20 14:04
 http://bugs.bacula.org/view.php?id=1596#c5513 
---------------------------------------------------------------------- 
We have explicit "Prefer Mounted Volumes = yes" for all jobs (as we were warned
that "No" might cause deadlocks, we instead achieved needed tape parallelization
by limiting SD concurrent jobs to few jobs per LTO tape).

I can see volume being recycled only on one of the three cases (those for
27-Jun-2010 2:36-2:39 -- those are included in "null_volume_case2_full.txt" file
attached for full info):

birdun:/var/log/bacula# egrep -i '^(27-Jun|01-Jun|15-Jun).*(recycl|purg|prun)'
log.* | egrep -v 'bckp_mycat|(End auto prune\.|Begin pruning Jobs\.|Begin
pruning Jobs older than 5 years \.|No Jobs found to prune\.|No Files found to
prune\.)$'

log.1:27-Jun 02:36 birdun-dir JobId 43299: There are no more Jobs associated
with Volume "A00067L4". Marking it purged.
log.1:27-Jun 02:36 birdun-dir JobId 43299: All records pruned from Volume
"A00067L4"; marking it "Purged"
log.1:27-Jun 02:37 birdun-dir JobId 43299: Recycled volume "A00053L4"
log.1:27-Jun 02:39 birdun-sd JobId 43275: Recycled volume "A00053L4" on device
"drv6_lto4" (/dev/tape/by-label/lto4_2), all previous data lost.
log.1:27-Jun 03:03 birdun-dir JobId 43335: Recycled volume "A00054L4"
log.1:27-Jun 03:03 birdun-sd JobId 43335: Recycled volume "A00054L4" on device
"drv4_lto4" (/dev/tape/by-label/lto4_0), all previous data lost.

birdun:/var/log/bacula# 


As for the recommendation for upgrading, since this bug was filed, we did
actually upgrade all our FDs to 5.0.x (mostly 5.0.2), so the (eventual) new bug
reports will be all from 5.0.x series. 

---------------------------------------------------------------------- 
 (0005554) kern (administrator) - 2010-08-08 16:13
 http://bugs.bacula.org/view.php?id=1596#c5554 
---------------------------------------------------------------------- 
I believe this is fixed in 5.0.3.  If not please supply new information. 

---------------------------------------------------------------------- 
 (0005750) mnalis (reporter) - 2011-02-15 16:07
 http://bugs.bacula.org/view.php?id=1596#c5750 
---------------------------------------------------------------------- 
Id did happen again; however it seems output only went to console which did
/etc/init.d/bacula* start, a not to bacula log file too, so I cannot provide
relevant parts of log file...

This time, bacula-sd (and bacula-director) were version 5.0.3 + updates from GIT
Branch-5.0 as of 2010-11-22, and clients which were running at a time (which is
sometime on 13th or 14th Feb 2011, as it didn't have timestamping I cannot be
more precise) should all be 5.0.2 (although few are 5.0.0, 5.0.1 and 5.0.3)

I've now setup bacula-sd and bacula-dir to run under supervise, so any output to
stdout/stderr would be timestamped and captured to log file.

(just to let you know that it appears not to be solved in 5.0.3 as suggested) 

---------------------------------------------------------------------- 
 (0005753) mnalis (reporter) - 2011-02-17 12:32
 http://bugs.bacula.org/view.php?id=1596#c5753 
---------------------------------------------------------------------- 
I've log some more info, those are all lines that were outputed to stdout/stderr
that day:

2011-02-17 02:05:26.468099500 birdun-sd: append.c:273 NULL Volume name. This
shouldn't happen!!!
2011-02-17 02:05:26.468123500 birdun-sd: append.c:274-0 NULL Volume name. This
shouldn't happen!!!
2011-02-17 02:05:26.468124500 birdun-sd: append.c:289 NULL Volume name. This
shouldn't happen!!!
2011-02-17 02:05:26.468124500 birdun-sd: append.c:290-0 NULL Volume name. This
shouldn't happen!!!
2011-02-17 02:05:56.638464500 birdun-sd: append.c:273 NULL Volume name. This
shouldn't happen!!!
2011-02-17 02:05:56.638477500 birdun-sd: append.c:274-0 NULL Volume name. This
shouldn't happen!!!
2011-02-17 02:05:56.638544500 birdun-sd: append.c:289 NULL Volume name. This
shouldn't happen!!!
2011-02-17 02:05:56.638549500 birdun-sd: append.c:290-0 NULL Volume name. This
shouldn't happen!!!

At 2:00 there is a batch of new jobs (mostly windows clients) being run at the
same time, so that may have triggered it. Attached is full bacula log for those
times (and some time before/after).

note that bacula log file seems inteleaved, sometimes in the middle of the line
-- that is just the way bacula wrote it. It might be good idea to make bacula
log line buffered instead of fully buffered (using setvbuf(3) or similar)

Let me know if I can provide more info, or if you have patch which might log
more relevant information when bug occurs. 

---------------------------------------------------------------------- 
 (0005754) ebollengier (administrator) - 2011-02-18 07:56
 http://bugs.bacula.org/view.php?id=1596#c5754 
---------------------------------------------------------------------- 
You new log file doesn't print any "NULL Volume.." message, this is not very
helpful, it looks to be fixed.

If you still have this problem, you can debug it yourself by dumping the call
stack where this message is printed (it can help). The following code can print
the stack in stdout. I will close this issue waiting for new information.

#include <cxxabi.h>
#include <execinfo.h>
void trace()
{
   const size_t max_depth = 100;
   size_t stack_depth;
   void *stack_addrs[max_depth];
   char **stack_strings;
   
   stack_depth = backtrace(stack_addrs, max_depth);
   stack_strings = backtrace_symbols(stack_addrs, stack_depth);
   
   for (size_t i = 0; i < stack_depth; i++) {
      size_t sz = 200; // just a guess, template names will go much wider
      char *function = (char *)malloc(sz);
      char *begin = 0, *end = 0;
      // find the parentheses and address offset surrounding the mangled name
      for (char *j = stack_strings[i]; *j; ++j) {
         if (*j == '(') {
            begin = j;
         }
         else if (*j == '+') {
            end = j;
         }
      }
      if (begin && end) {
         *begin++ = '\0';
         *end = '\0';
         // found our mangled name, now in [begin, end)
         
         int status;
         char *ret = abi::__cxa_demangle(begin, function, &sz, &status);
         if (ret) {
            // return value may be a realloc() of the input
            function = ret;
         }
         else {
            // demangling failed, just pretend it's a C function with no args
            strncpy(function, begin, sz);
            strncat(function, "()", sz);
            function[sz-1] = '\0';
         }
         printf("    %s:%s\n", stack_strings[i], function);
      }
      else
      {
         // didn't find the mangled name, just print the whole line
         printf("    %s\n", stack_strings[i]);
      }
      free(function);
   }
   free(stack_strings); // malloc()ed by backtrace_symbols
} 

---------------------------------------------------------------------- 
 (0005790) mnalis (reporter) - 2011-03-07 14:15
 http://bugs.bacula.org/view.php?id=1596#c5790 
---------------------------------------------------------------------- 
It happened again. Few notices first:

- it might not be the exactly same bug as the one originally reported, although
much of the message "NULL Volume name. This shouldn't happen!!!" is the same.
There appears to be multiple sources of that message, the one fixed was in
askdir.c, and this one seems to be in append.c

- it also explains why the message does NOT go in bacula log file, but only gets
printed to stdout (askdir.c used Jmsg(), and append.c only uses Dmsg() and
Pmsg())

- I followed your instructions, and added trace() function to the append.c file
and calls to it after the message get printed. However, that leads to bacula-sd
crashing, although it did produce some more output (don't know how usable,
though -- hope it helps)

So attached in NULL_Volume_20110307.tgz are:

- append.c - my modified version of append.c with your trace() code added (so
you can track line numbers and such). The rest of the code is standard bacula
5.0.3 + latest GIT Branch-5.0 as of 20110301)
- bacula-sd.output -- stdout/stderr of the bacula-sd showing "NULL volume" error
- bacula_log - the bacula log file around the times when error occured
- bacula.2919.traceback & birdun-sd.2919.bactrace  -- files produced when
trace() caused bacula-sd to crash 

Issue History 
Date Modified    Username       Field                    Change               
====================================================================== 
2010-06-15 15:08 mnalis         New Issue                                    
2010-06-15 15:08 mnalis         File Added: null_volume_name.txt                
   
2010-06-15 15:09 mnalis         Issue Monitored: mnalis                      
2010-06-28 11:00 mnalis         Note Added: 0005467                          
2010-07-19 16:46 kern           Note Added: 0005502                          
2010-07-19 16:46 kern           Status                   new => feedback     
2010-07-20 12:43 mnalis         Note Added: 0005509                          
2010-07-20 12:51 mnalis         File Added: null_volume_case2_full.txt          
         
2010-07-20 12:51 mnalis         File Added: null_volume_case3_39066_full.txt    
               
2010-07-20 12:54 kern           Note Added: 0005510                          
2010-07-20 12:59 mnalis         Note Added: 0005511                          
2010-07-20 13:18 kern           Note Added: 0005512                          
2010-07-20 14:04 mnalis         Note Added: 0005513                          
2010-08-08 16:13 kern           Note Added: 0005554                          
2010-08-08 16:13 kern           Status                   feedback => resolved
2010-08-08 16:13 kern           Resolution               open => fixed       
2010-08-08 16:13 kern           Status                   resolved => closed  
2011-02-15 16:07 mnalis         Note Added: 0005750                          
2011-02-15 16:07 mnalis         Status                   closed => feedback  
2011-02-15 16:07 mnalis         Resolution               fixed => reopened   
2011-02-17 12:32 mnalis         Note Added: 0005753                          
2011-02-17 12:32 mnalis         Status                   feedback => new     
2011-02-17 12:45 mnalis         File Added: bacula_log_NULL_Volume_20100217.txt 
                  
2011-02-18 07:56 ebollengier    Note Added: 0005754                          
2011-02-18 07:56 ebollengier    Status                   new => closed       
2011-02-18 07:56 ebollengier    Assigned To               => ebollengier     
2011-02-18 07:56 ebollengier    Resolution               reopened => unable to
reproduce
2011-03-07 14:15 mnalis         Note Added: 0005790                          
2011-03-07 14:15 mnalis         Status                   closed => feedback  
2011-03-07 14:15 mnalis         Resolution               unable to reproduce =>
reopened
======================================================================


------------------------------------------------------------------------------
What You Don't Know About Data Connectivity CAN Hurt You
This paper provides an overview of data connectivity, details
its effect on application quality, and explores various alternative
solutions. http://p.sf.net/sfu/progress-d2d
_______________________________________________
Bacula-bugs mailing list
Bacula-bugs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-bugs
[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic