
List:       bacula-users
Subject:    [Bacula-users] large file system with lustre,
From:       Gauthier DELERCE <gauthier () delerce ! fr>
Date:       2008-03-31 16:00:11
Message-ID: 47F10A8B.4010906 () delerce ! fr

Hello,

I've been setting up a new installation for the past two weeks and would like to
share my setup in order to get your feedback and, I hope, advice on a better
strategy. I also had a backup crash that I can't explain.

First, my setup:
The Director, named BACULA, is a virtualized Debian Etch machine (32-bit) with
256 MB of RAM, running MySQL 5 with a customized my.cnf (based on my-large.cnf,
with a few values divided by two). Packages: bacula-common 2.2.8-4~bpo40+1,
bacula-console 2.2.8-4~bpo40+1, bacula-director-common 2.2.8-4~bpo40+1,
bacula-director-mysql 2.2.8-4~bpo40+1.

The SD and FD run on AZURITE: Scientific Linux 5 (64-bit), 16 GB of memory, two
quad-core Xeons, with packages bacula-mtx-2.2.8-2 and bacula-mysql-2.2.8-2.
Azurite is a Lustre client; the filesystem to back up is 20 TB, currently
holding only 10 TB but more than 10 million files.

The library is an Overland NEO 2000 with an FCO3 card and two HP Ultrium 960
LTO-3 drives (to be exchanged for LTO-4 tomorrow), also connected via FC. The
FCO3 card and the two LTO-3 drives go through an FC switch to the Emulex HBA on
azurite. The HBA link runs at 4 Gb/s and both LTO-3 drives at 2 Gb/s.

Lustre filesystem performance is impressive when we work with large files and
several clients, but in our case there are millions of small files (source
code) to back up, and read performance drops to around 10 MB/s for most folders.

I divided the whole filesystem into three jobs and created two pools for data
spooling (on the same Lustre FS), and I'm facing two problems: low read
performance while spooling, then low performance during despooling (65 MB/s
without compression, 40 MB/s with compression).
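To separate the two effects, it can help to benchmark the spool area in each direction independently, outside Bacula. A minimal sketch (a throwaway mktemp directory stands in for the real spool paths, which you would substitute; sizes are placeholders):

```shell
# Stand-in spool area; on the real host this would be e.g.
# /mnt/lustre/bacula/spool/spool1 (adjust before use).
SPOOL=$(mktemp -d)

# Write direction (spooling): how fast the spool area absorbs a stream.
dd if=/dev/zero of="$SPOOL/testfile" bs=1M count=64 2>/dev/null

# Read direction (despooling): stream the same file back out.
READ_STATUS=$(dd if="$SPOOL/testfile" of=/dev/null bs=1M 2>/dev/null && echo "read ok")
echo "$READ_STATUS"

rm -rf "$SPOOL"
```

Dropping the `2>/dev/null` (or prefixing `time`) shows dd's own throughput figure for each direction, which tells you whether the 65 MB/s ceiling is the spool read or the tape write.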

I've attached my dir and sd configuration files in case someone sees something wrong in them.

As an example of the performance, see the error message at the end of this
email and also this output of the current job status:
Running Jobs:
JobId 23 Job lustreArgile.2008-03-28_22.33.33 is running.
    Backup Job started: 28-Mar-08 22:32
    Files=3,403,844 Bytes=4,195,243,945,301 Bytes/sec=17,631,075 Errors=0
    Files Examined=3,403,844
    Processing file: /mnt/lustre/home/argile/somedata........
    SDReadSeqNo=5 fd=12
Director connected at: 31-Mar-08 17:38
====

Here is a small sample of FS performance in the spool folder:
************************
[root@azurite test]# dd if=/dev/zero of=10G bs=10M count=1000
1000+0 records in
1000+0 records out
10485760000 bytes (10 GB) copied, 43.1407 seconds, 243 MB/s
[root@azurite test]# date&&cp 10G 10G.1&&sync && date
Mon Mar 31 17:32:32 CEST 2008
Mon Mar 31 17:33:19 CEST 2008
[root@azurite test]# ls -lh
total 20G
-rw-r--r-- 1 root root 9.8G Mar 31 17:32 10G
-rw-r--r-- 1 root root 9.8G Mar 31 17:33 10G.1
************************
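The numbers above show that large-file streaming is healthy (243 MB/s from dd, and the cp moved about 10.5 GB in 47 s, roughly 220 MB/s of writes while simultaneously reading), so the slow spooling is more likely the many-small-files access pattern. A sketch for measuring that side separately; in practice you would point it at one of the real source-code folders instead of the synthetic tree built here (paths and file counts are placeholders):

```shell
# Synthetic tree of small files (a stand-in for the real source-code folders).
TESTDIR=$(mktemp -d)
i=1
while [ $i -le 200 ]; do
  echo "data $i" > "$TESTDIR/f$i"
  i=$((i + 1))
done
FILE_COUNT=$(ls "$TESTDIR" | wc -l)

# Piping tar to wc forces every file to be opened and read, approximating
# the FD's per-file open/read/close pattern; on Lustre this is typically
# where throughput collapses, independent of raw streaming bandwidth.
SCAN_BYTES=$(tar cf - -C "$TESTDIR" . | wc -c)
echo "files=$FILE_COUNT bytes=$SCAN_BYTES"

rm -rf "$TESTDIR"
```

Running this under `time` on a real folder gives a files-per-second figure to compare against the ~10 MB/s you observe.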

I also had a job crash and I'm still wondering why. Here are a few lines from
the log:

28-mar 00:44 bacula-dir JobId 16: No prior Full backup Job record found.
28-mar 00:44 bacula-dir JobId 16: No prior or suitable Full backup found in catalog. Doing FULL backup.
28-mar 00:45 bacula-dir JobId 16: Start Backup JobId 16, Job=lustre.2008-03-28_00.44.05
28-mar 00:45 bacula-dir JobId 16: Using Device "Drive-1"
28-mar 00:44 azurite-sd JobId 16: Spooling data ...
....
30-mar 01:16 azurite-sd JobId 16: Spooling data again ...
30-mar 03:12 azurite JobId 16: Fatal error: backup.c:1051 Network send error to SD. ERR=Connection reset by peer
30-mar 03:12 azurite JobId 16: Error: bsock.c:306 Write error sending 11 bytes to Storage daemon:azurite.andra.fr:9103: ERR=Connection reset by peer
30-mar 03:14 bacula-dir JobId 16: Error: Bacula bacula-dir 2.2.8 (26Jan08): 30-mar-2008 03:14:12
  Build OS:               i486-pc-linux-gnu debian 4.0
  JobId:                  16
  Job:                    lustre.2008-03-28_00.44.05
  Backup Level:           Full (upgraded from Incremental)
  Client:                 "azurite-fd" 2.2.8 (26Jan08) x86_64-redhat-linux-gnu,redhat
  FileSet:                "lustre" 2008-03-28 00:44:58
  Pool:                   "Default" (From Job resource)
  Storage:                "Autochanger" (From Job resource)
  Scheduled time:         28-mar-2008 00:44:57
  Start time:             28-mar-2008 00:45:00
  End time:               30-mar-2008 03:14:12
  Elapsed time:           2 days 1 hour 29 mins 12 secs
  Priority:               10
  FD Files Written:       5,562,241
  SD Files Written:       0
  FD Bytes Written:       1,963,406,238,087 (1.963 TB)
  SD Bytes Written:       0 (0 B)
  Rate:                   11021.0 KB/s
  Software Compression:   None
  VSS:                    no
  Storage Encryption:     no
  Volume name(s):         KN9884L3|KN9897L3|KN9891L3
  Volume Session Id:      6
  Volume Session Time:    1206660491
  Last Volume Bytes:      431,657,017,344 (431.6 GB)
  Non-fatal FD errors:    1
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Error
  Termination:            *** Backup Error ***


I also saw another error, related to the library, in the log of a different
backup which is still running:

30-mar 12:24 azurite-sd JobId 23: End of medium on Volume "KN9892L3" Bytes=889,312,693,248 Blocks=13,785,228 at 30-Mar-2008 12:24.
30-mar 12:24 azurite-sd JobId 23: 3307 Issuing autochanger "unload slot 7, drive 1" command.
30-mar 12:24 azurite-sd JobId 23: 3307 Issuing autochanger "unload slot 6, drive 0" command.
30-mar 12:25 azurite-sd JobId 23: 3304 Issuing autochanger "load slot 6, drive 1" command.
30-mar 12:26 azurite-sd JobId 23: 3305 Autochanger "load slot 6, drive 1", status is OK.
30-mar 12:26 azurite-sd JobId 23: 3301 Issuing autochanger "loaded? drive 1" command.
30-mar 12:26 azurite-sd JobId 23: 3991 Bad autochanger "loaded? drive 1" command: ERR=Child exited with code 1. Results=mtx: Request Sense: Long Report=yes
mtx: Request Sense: Valid Residual=no
mtx: Request Sense: Error Code=70 (Current)
mtx: Request Sense: Sense Key=Not Ready
mtx: Request Sense: FileMark=no
mtx: Request Sense: EOM=no
mtx: Request Sense: ILI=no
mtx: Request Sense: Additional Sense Code = 04
mtx: Request Sense: Additional Sense Qualifier = 00
mtx: Request Sense: BPV=no
mtx: Request Sense: Error in CDB=no
mtx: Request Sense: SKSV=no
READ ELEMENT STATUS Command Failed

30-mar 12:26 azurite-sd JobId 23: Volume "KN9891L3" previously written, moving to end of data.
30-mar 12:27 azurite-sd JobId 23: Error: Bacula cannot write on tape Volume "KN9891L3" because: The number of files mismatch! Volume=432 Catalog=431
30-mar 12:27 azurite-sd JobId 23: Marking Volume "KN9891L3" in Error in Catalog.
30-mar 12:29 bacula-dir JobId 23: Using Volume "KN9898L3" from 'Scratch' pool.
30-mar 12:27 azurite-sd JobId 23: 3301 Issuing autochanger "loaded? drive 1" command.
30-mar 12:27 azurite-sd JobId 23: 3302 Autochanger "loaded? drive 1", result is Slot 6.
30-mar 12:27 azurite-sd JobId 23: 3307 Issuing autochanger "unload slot 6, drive 1" command.
30-mar 12:27 azurite-sd JobId 23: 3304 Issuing autochanger "load slot 8, drive 1" command.
30-mar 12:28 azurite-sd JobId 23: 3305 Autochanger "load slot 8, drive 1", status is OK.
30-mar 12:28 azurite-sd JobId 23: 3301 Issuing autochanger "loaded? drive 1" command.
30-mar 12:28 azurite-sd JobId 23: 3991 Bad autochanger "loaded? drive 1" command: ERR=Child exited with code 1. Results=mtx: Request Sense: Long Report=yes
mtx: Request Sense: Valid Residual=no
mtx: Request Sense: Error Code=70 (Current)
mtx: Request Sense: Sense Key=Not Ready
mtx: Request Sense: FileMark=no
mtx: Request Sense: EOM=no
mtx: Request Sense: ILI=no
mtx: Request Sense: Additional Sense Code = 04
mtx: Request Sense: Additional Sense Qualifier = 00
mtx: Request Sense: BPV=no
mtx: Request Sense: Error in CDB=no
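For these mtx failures, "Sense Key=Not Ready" with ASC=04 usually means the changer was queried while the robotics were still moving. One way to reproduce the query outside Bacula, sketched from the standard mtx CLI and the mtx-changer arguments in this setup's sd config (hardware-dependent, so only meaningful on azurite, and not while a job holds the library):

```shell
# Requires the real changer at /dev/neo.
mtx -f /dev/neo status                                 # full inventory of drives and slots
/etc/bacula/mtx-changer /dev/neo loaded 6 /dev/nst1 1  # what Bacula runs for 'loaded? drive 1'
mtx -f /dev/neo unload 6 1                             # manual unload: slot 6 from drive 1
```

If the loaded? query fails the same way right after a load but succeeds a little later, lengthening the wait/sleep in the mtx-changer script is the usual first thing to try.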



Thanks for any comments or help (and of course for this nice piece of software :-) ).

Gauthier



 


["bacula-dir.conf" (text/plain)]

#
# Default Bacula Director Configuration file
#
#  The only thing that MUST be changed is to add one or more
#   file or directory names in the Include directive of the
#   FileSet resource.
#
#  For Bacula release 2.2.8 (26 January 2008) -- debian 4.0
#
#  You might also want to change the default email address
#   from root to your address.  See the "mail" and "operator"
#   directives in the Messages resource.
#

Director {                            # define myself
  Name = bacula-dir
  DIRport = 9101                # where we listen for UA connections
  QueryFile = "/etc/bacula/scripts/query.sql"
  WorkingDirectory = "/var/lib/bacula"
  PidDirectory = "/var/run/bacula"
  Maximum Concurrent Jobs = 10
  Password = "***"         # Console password
  Messages = Daemon
  DirAddress = 172.20.0.75
}

JobDefs {
  Name = "lustre"
  Type = Backup
  Level = Incremental
  Client = azurite-fd 
#  FileSet = "lustre1"
  Schedule = "WeeklyCycle"
  Storage = Autochanger
  Messages = Standard
  Pool = Default
  Priority = 10
  SpoolData = yes
  Maximum Concurrent Jobs = 1
  Prefer Mounted Volumes = no
}
JobDefs {
  Name = "DefaultJob"
  Type = Backup
  Level = Incremental
  Client = azurite-fd
  FileSet = "Full Set"
  Schedule = "WeeklyCycle"
  Storage = Autochanger
  #Storage = File_1
  Messages = Standard
  Pool = Default
  Priority = 10
}
JobDefs {
  Name = "Azurite12"
  Type = Backup
  Level = Incremental
  FileSet = "FullAzurite12"
  Schedule = "WeeklyCycle"
  Storage = Autochanger
  Messages = Standard
  Pool = Default
  Priority = 10
}

#
# Define the main nightly save backup jobs
#
Job {
  Name = "lustreAlliances"
  JobDefs = "lustre"
  Write Bootstrap = "/var/lib/bacula/Client1.bsr"
  FileSet = "lustreAlliances"
}
Job {
  Name = "lustreArgile"
  JobDefs = "lustre"
  Write Bootstrap = "/var/lib/bacula/Client1.bsr"
  FileSet = "lustreArgile"
}
Job {
  Name = "lustre"
  JobDefs = "lustre"
  Write Bootstrap = "/var/lib/bacula/Client1.bsr"
  FileSet = "lustre"
}
Job {
  Name = "azurite1"
  Client = azurite1-fd
  JobDefs = "Azurite12"
  Write Bootstrap = "/var/lib/bacula/%c_%n.bsr"
}
Job {
  Name = "azurite2"
  Client = azurite2-fd
  JobDefs = "Azurite12"
  Write Bootstrap = "/var/lib/bacula/%c_%n.bsr"
}
# Backup the catalog database (after the nightly save)
Job {
  Name = "BackupCatalog"
  JobDefs = "DefaultJob"
  Level = Full
  FileSet="Catalog"
  Schedule = "WeeklyCycleAfterBackup"
  # This creates an ASCII copy of the catalog
  RunBeforeJob = "/etc/bacula/scripts/make_catalog_backup bacula bacula bacula"
  # This deletes the copy of the catalog
  RunAfterJob  = "/etc/bacula/scripts/delete_catalog_backup"
  Write Bootstrap = "/var/lib/bacula/BackupCatalog.bsr"
  Priority = 11                   # run after main backup
}

#
# Standard Restore template, to be changed by Console program
#  Only one such job is needed for all Jobs/Clients/Storage ...
#
Job {
  Name = "RestoreFiles"
  Type = Restore
  Client=azurite-fd                 
  FileSet="Full Set"                  
  Storage = Autochanger                      
  Pool = Default
  Messages = Standard
  Where = /mnt/lustre/bacula/restore
}


# List of files to be backed up
FileSet {
  Name = "Full Set"
  Include {
    Options {
      signature = MD5
    }
#    
#  Put your list of files here, preceded by 'File =', one per line
#    or include an external list with:
#
#    File = <file-name
#
#  Note: / backs up everything on the root partition.
#    if you have other partitons such as /usr or /home
#    you will probably want to add them too.
#
#  By default this is defined to point to the Bacula build
#    directory to give a reasonable FileSet to backup to
#    disk storage during initial testing.
#
    File = /home/alfie/Debian/backport/bacula/2.2.8-4/blub/bacula-2.2.8/debian/tmp-build-sqlite
  }

#
# If you backup the root directory, the following two excluded
#   files can be useful
#
  Exclude {
    File = /proc
    File = /tmp
    File = /.journal
    File = /.fsck
  }
}
FileSet {
  Name = "lustreAlliances"
  Include {
    Options {
#       compression=gzip
      signature = MD5
    }
    File = /mnt/lustre/home/alliance
    File = /mnt/lustre/Alliances
  }
}
FileSet {
  Name = "lustreArgile"
  Include {
    Options {
      signature = MD5
#       compression=gzip
    }
    File = /mnt/lustre/home/argile
  }
}
FileSet {
  Name = "lustre"
  Include {
    Options {
      signature = MD5
#       compression=gzip
    }
    File = /mnt/lustre
  }
  Exclude {
    File = /mnt/lustre/home/alliance
    File = /mnt/lustre/home/argile
    File = /mnt/lustre/VTL
    File = /mnt/lustre/bacula
  }
}

FileSet {
  Name = "FullAzurite12"
  Include {
    Options {
      signature = MD5
    }
    File = /boot
    File = /scratch
    File = /tmp
    File = /var
    File = /c/apps
    File = /c/data
    File = /c/batch
    File = /
  }

#
# If you backup the root directory, the following two excluded
#   files can be useful
#
  Exclude {
    File = /proc
    File = /.journal
    File = /.fsck
    File = /mnt
    File = /apps
  }
}
# When to do the backups, full backup on first sunday of the month,
#  differential (i.e. incremental since full) every other sunday,
#  and incremental backups other days
Schedule {
  Name = "WeeklyCycle"
  Run = Full 1st sun at 23:05
  Run = Differential 2nd-5th sun at 23:05
  Run = Incremental mon-sat at 23:05
}

# This schedule does the catalog. It starts after the WeeklyCycle
Schedule {
  Name = "WeeklyCycleAfterBackup"
  Run = Full sun-sat at 23:10
}

# This is the backup of the catalog
FileSet {
  Name = "Catalog"
  Include {
    Options {
      signature = MD5
    }
    File = /var/lib/bacula/bacula.sql
  }
}

# Client (File Services) to backup
Client {
  Name = azurite-fd
  Address = azurite.andra.fr
  FDPort = 9102
  Catalog = MyCatalog
  Password = "bacula-azurite-fd"          # password for FileDaemon
  File Retention = 30 days            # 30 days
  Job Retention = 6 months            # six months
  AutoPrune = yes                     # Prune expired Jobs/Files
  Maximum Concurrent Jobs = 8
}
Client {
  Name = azurite1-fd
  Address = azurite1.andra.fr
  FDPort = 9102
  Catalog = MyCatalog
  Password = "bacula-azurite1"          # password for FileDaemon
  File Retention = 30 days            # 30 days
  Job Retention = 6 months            # six months
  AutoPrune = yes                     # Prune expired Jobs/Files
  Maximum Concurrent Jobs = 4
}
Client {
  Name = azurite2-fd
  Address = azurite2.andra.fr
  FDPort = 9102
  Catalog = MyCatalog
  Password = "bacula-azurite2"          # password for FileDaemon
  File Retention = 30 days            # 30 days
  Job Retention = 6 months            # six months
  AutoPrune = yes                     # Prune expired Jobs/Files
  Maximum Concurrent Jobs = 4
}

# Definition of LTO-3 tape storage device
Storage {
  Name = Autochanger
  Address = azurite.andra.fr
  SDPort = 9103
  Password = "bacula-azurite-sd"
  Device = Autochanger                      # must be same as Device in Storage
  Media Type = LTO-3                   # must be same as MediaType in Storage
  Autochanger = yes                   # enable for autochanger device
  Maximum Concurrent Jobs = 16
}
Storage {
  Name = LTO1
  Address = azurite.andra.fr
  SDPort = 9103
  Password = "bacula-azurite-sd"
  Device = Drive-1                      # must be same as Device in Storage
  Media Type = LTO-3                   # must be same as MediaType in Storage
  Autochanger = no                   # enable for autochanger device
}
Storage {
  Name = LTO2
  Address = azurite.andra.fr
  SDPort = 9103
  Password = "bacula-azurite-sd"
  Device = Drive-2                      # must be same as Device in Storage
  Media Type = LTO-3                   # must be same as MediaType in Storage
  Autochanger = no                   # enable for autochanger device
}


# Generic catalog service
Catalog {
  Name = MyCatalog
  dbname = bacula; DB Address = ""; user = bacula; password = "bacula"
}

# Reasonable message delivery -- send most everything to email address
#  and to the console
Messages {
  Name = Standard
#
# NOTE! If you send to two email or more email addresses, you will need
#  to replace the %r in the from field (-f part) with a single valid
#  email address in both the mailcommand and the operatorcommand.
#  What this does is, it sets the email address that emails would display
#  in the FROM field, which is by default the same email as they're being
#  sent to.  However, if you send email to more than one address, then
#  you'll have to set the FROM address manually, to a single address. 
#  for example, a 'no-reply@mydomain.com', is better since that tends to
#  tell (most) people that its coming from an automated source.

#
 mailcommand = "/usr/lib/bacula/bsmtp -h geaster.andra.fr -f \"\(Bacula\) \<%r\>\" -s \"Bacula: %t %e of %c %l\" %r"
 operatorcommand = "/usr/lib/bacula/bsmtp -h geaster.andra.fr -f \"\(Bacula\) \<%r\>\" -s \"Bacula: Intervention needed for %j\" %r"
 mail = gauthier@delerce.fr = all, !skipped
  operator = gauthier@delerce.fr = mount

  console = all, !skipped, !saved
#
# WARNING! the following will create a file that you must cycle from
#          time to time as it will grow indefinitely. However, it will
#          also keep all your messages if they scroll off the console.
#
  append = "/var/lib/bacula/log" = all, !skipped
}


#
# Message delivery for daemon messages (no job).
Messages {
  Name = Daemon
  mailcommand = "/usr/lib/bacula/bsmtp -h localhost -f \"\(Bacula\) \<%r\>\" -s \"Bacula daemon message\" %r"
  mail = root@localhost = all, !skipped
  console = all, !skipped, !saved
  append = "/var/lib/bacula/log" = all, !skipped
}



    
# Default pool definition
Pool {
  Name = Default
  Pool Type = Backup
  Recycle = yes                       # Bacula can automatically recycle Volumes
  AutoPrune = yes                     # Prune expired volumes
  Volume Retention = 365 days         # one year
}

# Scratch pool definition
Pool {
  Name = Scratch
  Pool Type = Backup
}

#
# Restricted console used by tray-monitor to get the status of the director
#
Console {
  Name = bacula.andra.fr-mon
  Password = "simple"
  CommandACL = status, .status
}


["bacula-sd.conf" (text/plain)]

#
# Default Bacula Storage Daemon Configuration file
#
#  For Bacula release 2.2.8 (26 January 2008) -- redhat 
#
# You may need to change the name of your tape drive
#   on the "Archive Device" directive in the Device
#   resource.  If you change the Name and/or the 
#   "Media Type" in the Device resource, please ensure
#   that dird.conf has corresponding changes.
#

Storage {                             # definition of myself
  Name = azurite-sd
  SDPort = 9103                  # Director's port      
  WorkingDirectory = "/var/lib/bacula"
  Pid Directory = "/var/run"
  Maximum Concurrent Jobs = 20
}

#
# List Directors who are permitted to contact Storage daemon
#
Director {
  Name = bacula-dir
  Password = "bacula-azurite-sd"
}

#
# Restricted Director, used by tray-monitor to get the
#   status of the storage daemon
#
Director {
  Name = bacula-mon
  Password = "simple"
  Monitor = yes
}

#
# An autochanger device with two drives
#
Autochanger {
  Name = Autochanger
  Device = Drive-1
  Device = Drive-2
  Changer Command = "/etc/bacula/mtx-changer %c %o %S %a %d"
  Changer Device = /dev/neo
}

Device {
  Name = Drive-1                      #
  Drive Index = 0
  Media Type = LTO-3
  Archive Device = /dev/nst1
  AutomaticMount = yes;               # when device opened, read it
  AlwaysOpen = yes;
  RemovableMedia = yes;
  RandomAccess = no;
  AutoChanger = yes
  SpoolDirectory = /mnt/lustre/bacula/spool/spool1
  Autoselect = yes
  #Maximum Job Spool Size = 100g
  Maximum Job Spool Size = 100000000
  # Enable the Alert command only if you have the mtx package loaded
#  Alert Command = "sh -c 'tapeinfo -f %c |grep TapeAlert|cat'"
#  #
#  # Enable the Alert command only if you have the mtx package loaded
#  # Note, apparently on some systems, tapeinfo resets the SCSI controller
#  #  thus if you turn this on, make sure it does not reset your SCSI 
#  #  controller.  I have never had any problems, and smartctl does
#  #  not seem to cause such problems.
#  #
#  Alert Command = "sh -c 'tapeinfo -f %c |grep TapeAlert|cat'"
#  If you have smartctl, enable this, it has more info than tapeinfo 
#  Alert Command = "sh -c 'smartctl -H -l error %c'"  
}
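One detail in this Device resource may matter for the despool rate: `Maximum Job Spool Size = 100000000` is a bare byte count, about 95 MB, while the commented-out line just above suggests 100 GB was intended. A spool that small forces the job to alternate between spooling and despooling constantly. If 100 GB was the goal, Bacula accepts size suffixes; a sketch of the intended directive (check free space on the spool filesystem first):

```
Device {
  Name = Drive-1
  ...
  # "100G" = 100 gigabytes; the bare "100000000" is read as bytes (~95 MB)
  Maximum Job Spool Size = 100G
}
```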

Device {
  Name = Drive-2                      #
  Drive Index = 1
  Media Type = LTO-3
  Archive Device = /dev/nst0
  AutomaticMount = yes;               # when device opened, read it
  AlwaysOpen = yes;
  RemovableMedia = yes;
  RandomAccess = no;
  AutoChanger = yes
  SpoolDirectory = /mnt/lustre/bacula/spool/spool2
  Autoselect = yes
  Maximum Job Spool Size = 100000000
  # Enable the Alert command only if you have the mtx package loaded
#  Alert Command = "sh -c 'tapeinfo -f %c |grep TapeAlert|cat'"
#  If you have smartctl, enable this, it has more info than tapeinfo 
#  Alert Command = "sh -c 'smartctl -H -l error %c'"  
}

#
Messages {
  Name = Standard
  director = bacula-dir = all
}



_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

