List: bacula-users
Subject: [Bacula-users] large file system with lustre,
From: Gauthier DELERCE <gauthier () delerce ! fr>
Date: 2008-03-31 16:00:11
Message-ID: 47F10A8B.4010906 () delerce ! fr
Hello,
I've been setting up a new installation for the past two weeks, and I would like to share my setup to get your feedback and, hopefully, some advice on a better strategy.
I also had a job crash that I can't explain.
First, my setup:
The Director, named BACULA, is a Debian Etch virtualized machine (32-bit) with 256 MB of RAM and MySQL 5 with a customized my.cnf (derived from my-large.cnf with a few values divided by two). Packages: bacula-common 2.2.8-4~bpo40+1, bacula-console 2.2.8-4~bpo40+1, bacula-director-common 2.2.8-4~bpo40+1, bacula-director-mysql 2.2.8-4~bpo40+1.
The SD and FD run on AZURITE: Scientific Linux 5 (64-bit), 16 GB of memory, two quad-core Xeons. Packages: bacula-mtx-2.2.8-2 and bacula-mysql-2.2.8-2.
Azurite is a Lustre client; the file system to back up is 20 TB, currently holding only 10 TB but more than 10 million files.
The library is an Overland NEO 2000 with an FCO3 card and two HP Ultrium 960 LTO-3 drives (to be exchanged for LTO-4 tomorrow), also connected via FC. The FCO3 card and the two LTO-3 drives are connected through an FC switch to the Emulex HBA on azurite. The HBA link runs at 4 Gb/s and both LTO-3 drives at 2 Gb/s.
Lustre file system performance is impressive when working with large files and several clients, but in our case there are millions of small files (source code) to back up, and read performance can drop to 10 MB/s for most folders.
I divided the whole file system into three jobs and created two pools for data spooling (on the same Lustre FS), and I'm facing two problems: low read performance while spooling, then low performance during despooling (65 MB/s without compression, 40 MB/s with compression).
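Back-of-the-envelope: since each job first spools from Lustre and then despools the same bytes to tape, the two phases largely serialize, so the effective job rate is bounded by the harmonic combination of the two rates. A small sketch of that arithmetic, assuming the 10 MB/s read and 65 MB/s despool figures above and no overlap between phases:

```python
# Rough model: spool then despool the same bytes, phases not overlapping.
read_rate = 10e6      # Lustre small-file read rate, bytes/s (measured above)
despool_rate = 65e6   # tape despool rate without compression, bytes/s

size = 100e9  # any spool cycle size; it cancels out of the effective rate
effective = size / (size / read_rate + size / despool_rate)
print(f"effective job rate: {effective / 1e6:.2f} MB/s")  # ~8.67 MB/s
```

The running jobs show somewhat more than this floor, presumably because many folders read faster than 10 MB/s and the two drives overlap across concurrent jobs.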
I attached my dir and sd configuration files in case someone sees something wrong in them.
As an example of the performance, you can see the error message at the end of this email and also this output of the current status:
Running Jobs:
JobId 23 Job lustreArgile.2008-03-28_22.33.33 is running.
Backup Job started: 28-Mar-08 22:32
Files=3,403,844 Bytes=4,195,243,945,301 Bytes/sec=17,631,075 Errors=0
Files Examined=3,403,844
Processing file: /mnt/lustre/home/argile/somedata........
SDReadSeqNo=5 fd=12
Director connected at: 31-Mar-08 17:38
====
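As a cross-check, the reported Bytes/sec is roughly consistent with the wall-clock elapsed time (timestamps taken from the status output above):

```python
from datetime import datetime

start = datetime(2008, 3, 28, 22, 32)   # "Backup Job started: 28-Mar-08 22:32"
status = datetime(2008, 3, 31, 17, 38)  # "Director connected at: 31-Mar-08 17:38"
bytes_done = 4_195_243_945_301          # Bytes= from the status line

elapsed = (status - start).total_seconds()  # 241,560 s
avg = bytes_done / elapsed
print(f"average: {avg / 1e6:.1f} MB/s")  # ~17.4 MB/s, in line with Bytes/sec=17,631,075
```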
Here is a small sample of the file system performance in the spool folder:
************************
[root@azurite test]# dd if=/dev/zero of=10G bs=10M count=1000
1000+0 records in
1000+0 records out
10485760000 bytes (10 GB) copied, 43.1407 seconds, 243 MB/s
[root@azurite test]# date&&cp 10G 10G.1&&sync && date
Mon Mar 31 17:32:32 CEST 2008
Mon Mar 31 17:33:19 CEST 2008
[root@azurite test]# ls -lh
total 20G
-rw-r--r-- 1 root root 9.8G Mar 31 17:32 10G
-rw-r--r-- 1 root root 9.8G Mar 31 17:33 10G.1
************************
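Since cp both reads and writes through the client, the 47 seconds between the two date stamps imply a healthy large-file rate in each direction; a quick check of that arithmetic:

```python
size = 10_485_760_000   # bytes in the 10G test file
elapsed = 47            # seconds, 17:32:32 -> 17:33:19

one_way = size / elapsed  # read rate == write rate in a straight copy
print(f"{one_way / 1e6:.0f} MB/s each way, ~{2 * one_way / 1e6:.0f} MB/s aggregate")
```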
I also had a job crash and I'm still wondering why; here are a few lines from the log:
28-mar 00:44 bacula-dir JobId 16: No prior Full backup Job record found.
28-mar 00:44 bacula-dir JobId 16: No prior or suitable Full backup found in catalog. Doing FULL backup.
28-mar 00:45 bacula-dir JobId 16: Start Backup JobId 16, Job=lustre.2008-03-28_00.44.05
28-mar 00:45 bacula-dir JobId 16: Using Device "Drive-1"
28-mar 00:44 azurite-sd JobId 16: Spooling data ...
....
30-mar 01:16 azurite-sd JobId 16: Spooling data again ...
30-mar 03:12 azurite JobId 16: Fatal error: backup.c:1051 Network send error to SD. ERR=Connection reset by peer
30-mar 03:12 azurite JobId 16: Error: bsock.c:306 Write error sending 11 bytes to Storage daemon:azurite.andra.fr:9103: ERR=Connection reset by peer
30-mar 03:14 bacula-dir JobId 16: Error: Bacula bacula-dir 2.2.8 (26Jan08): 30-mar-2008 03:14:12
Build OS: i486-pc-linux-gnu debian 4.0
JobId: 16
Job: lustre.2008-03-28_00.44.05
Backup Level: Full (upgraded from Incremental)
Client: "azurite-fd" 2.2.8 (26Jan08) x86_64-redhat-linux-gnu,redhat
FileSet: "lustre" 2008-03-28 00:44:58
Pool: "Default" (From Job resource)
Storage: "Autochanger" (From Job resource)
Scheduled time: 28-mar-2008 00:44:57
Start time: 28-mar-2008 00:45:00
End time: 30-mar-2008 03:14:12
Elapsed time: 2 days 1 hour 29 mins 12 secs
Priority: 10
FD Files Written: 5,562,241
SD Files Written: 0
FD Bytes Written: 1,963,406,238,087 (1.963 TB)
SD Bytes Written: 0 (0 B)
Rate: 11021.0 KB/s
Software Compression: None
VSS: no
Storage Encryption: no
Volume name(s): KN9884L3|KN9897L3|KN9891L3
Volume Session Id: 6
Volume Session Time: 1206660491
Last Volume Bytes: 431,657,017,344 (431.6 GB)
Non-fatal FD errors: 1
SD Errors: 0
FD termination status: Error
SD termination status: Error
Termination: *** Backup Error ***
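The reported Rate line checks out against the elapsed time, so the job averaged about 11 MB/s over its whole two-day run rather than stalling near the end; quick arithmetic:

```python
fd_bytes = 1_963_406_238_087                   # FD Bytes Written from the report
elapsed = 2 * 86400 + 1 * 3600 + 29 * 60 + 12  # 2 days 1 hour 29 mins 12 secs

rate_kb = fd_bytes / elapsed / 1000
print(f"{rate_kb:.1f} KB/s")  # ~11021.0 KB/s, matching the job report
```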
I also saw another error related to the library in the log, for a different backup which is still running:
30-mar 12:24 azurite-sd JobId 23: End of medium on Volume "KN9892L3" Bytes=889,312,693,248 Blocks=13,785,228 at 30-Mar-2008 12:24.
30-mar 12:24 azurite-sd JobId 23: 3307 Issuing autochanger "unload slot 7, drive 1" command.
30-mar 12:24 azurite-sd JobId 23: 3307 Issuing autochanger "unload slot 6, drive 0" command.
30-mar 12:25 azurite-sd JobId 23: 3304 Issuing autochanger "load slot 6, drive 1" command.
30-mar 12:26 azurite-sd JobId 23: 3305 Autochanger "load slot 6, drive 1", status is OK.
30-mar 12:26 azurite-sd JobId 23: 3301 Issuing autochanger "loaded? drive 1" command.
30-mar 12:26 azurite-sd JobId 23: 3991 Bad autochanger "loaded? drive 1" command: ERR=Child exited with code 1. Results=mtx: Request Sense: Long Report=yes
mtx: Request Sense: Valid Residual=no
mtx: Request Sense: Error Code=70 (Current)
mtx: Request Sense: Sense Key=Not Ready
mtx: Request Sense: FileMark=no
mtx: Request Sense: EOM=no
mtx: Request Sense: ILI=no
mtx: Request Sense: Additional Sense Code = 04
mtx: Request Sense: Additional Sense Qualifier = 00
mtx: Request Sense: BPV=no
mtx: Request Sense: Error in CDB=no
mtx: Request Sense: SKSV=no
READ ELEMENT STATUS Command Failed
30-mar 12:26 azurite-sd JobId 23: Volume "KN9891L3" previously written, moving to end of data.
30-mar 12:27 azurite-sd JobId 23: Error: Bacula cannot write on tape Volume "KN9891L3" because: The number of files mismatch! Volume=432 Catalog=431
30-mar 12:27 azurite-sd JobId 23: Marking Volume "KN9891L3" in Error in Catalog.
30-mar 12:29 bacula-dir JobId 23: Using Volume "KN9898L3" from 'Scratch' pool.
30-mar 12:27 azurite-sd JobId 23: 3301 Issuing autochanger "loaded? drive 1" command.
30-mar 12:27 azurite-sd JobId 23: 3302 Autochanger "loaded? drive 1", result is Slot 6.
30-mar 12:27 azurite-sd JobId 23: 3307 Issuing autochanger "unload slot 6, drive 1" command.
30-mar 12:27 azurite-sd JobId 23: 3304 Issuing autochanger "load slot 8, drive 1" command.
30-mar 12:28 azurite-sd JobId 23: 3305 Autochanger "load slot 8, drive 1", status is OK.
30-mar 12:28 azurite-sd JobId 23: 3301 Issuing autochanger "loaded? drive 1" command.
30-mar 12:28 azurite-sd JobId 23: 3991 Bad autochanger "loaded? drive 1" command: ERR=Child exited with code 1. Results=mtx: Request Sense: Long Report=yes
mtx: Request Sense: Valid Residual=no
mtx: Request Sense: Error Code=70 (Current)
mtx: Request Sense: Sense Key=Not Ready
mtx: Request Sense: FileMark=no
mtx: Request Sense: EOM=no
mtx: Request Sense: ILI=no
mtx: Request Sense: Additional Sense Code = 04
mtx: Request Sense: Additional Sense Qualifier = 00
mtx: Request Sense: BPV=no
mtx: Request Sense: Error in CDB=no
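For what it's worth, the failing step can be reproduced outside Bacula by invoking mtx-changer by hand with the same argument order as the Changer Command (%c %o %S %a %d). This sketch only builds and prints the command lines (a dry run, nothing is executed); device paths and slot numbers are the ones from my configuration and would need adjusting:

```python
# Dry run: print (do not execute) the mtx-changer calls Bacula issues,
# in Changer Command argument order %c %o %S %a %d
# (changer device, operation, slot, archive device, drive index).
CHANGER = "/dev/neo"   # Changer Device from bacula-sd.conf
DRIVE = "/dev/nst1"    # Archive Device of Drive-1
INDEX = 0              # Drive Index of Drive-1

def changer_cmd(op, slot):
    return f"/etc/bacula/mtx-changer {CHANGER} {op} {slot} {DRIVE} {INDEX}"

for op, slot in [("loaded", 6), ("unload", 6), ("load", 6)]:
    print(changer_cmd(op, slot))
```

Running the printed "loaded" command by hand right after a load should show whether the changer simply needs more time to become ready (the Sense Key=Not Ready above) before READ ELEMENT STATUS succeeds.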
Thanks for any comments or help (and of course for this nice piece of software :-) )
Gauthier
["bacula-dir.conf" (text/plain)]
#
# Default Bacula Director Configuration file
#
# The only thing that MUST be changed is to add one or more
# file or directory names in the Include directive of the
# FileSet resource.
#
# For Bacula release 2.2.8 (26 January 2008) -- debian 4.0
#
# You might also want to change the default email address
# from root to your address. See the "mail" and "operator"
# directives in the Messages resource.
#
Director { # define myself
Name = bacula-dir
DIRport = 9101 # where we listen for UA connections
QueryFile = "/etc/bacula/scripts/query.sql"
WorkingDirectory = "/var/lib/bacula"
PidDirectory = "/var/run/bacula"
Maximum Concurrent Jobs = 10
Password = "***" # Console password
Messages = Daemon
DirAddress = 172.20.0.75
}
JobDefs {
Name = "lustre"
Type = Backup
Level = Incremental
Client = azurite-fd
# FileSet = "lustre1"
Schedule = "WeeklyCycle"
Storage = Autochanger
Messages = Standard
Pool = Default
Priority = 10
SpoolData = yes
Maximum Concurrent Jobs = 1
Prefer Mounted Volumes = no
}
JobDefs {
Name = "DefaultJob"
Type = Backup
Level = Incremental
Client = azurite-fd
FileSet = "Full Set"
Schedule = "WeeklyCycle"
Storage = Autochanger
#Storage = File_1
Messages = Standard
Pool = Default
Priority = 10
}
JobDefs {
Name = "Azurite12"
Type = Backup
Level = Incremental
FileSet = "FullAzurite12"
Schedule = "WeeklyCycle"
Storage = Autochanger
Messages = Standard
Pool = Default
Priority = 10
}
#
# Define the main nightly save backup jobs
#
Job {
Name = "lustreAlliances"
JobDefs = "lustre"
Write Bootstrap = "/var/lib/bacula/Client1.bsr"
FileSet = "lustreAlliances"
}
Job {
Name = "lustreArgile"
JobDefs = "lustre"
Write Bootstrap = "/var/lib/bacula/Client1.bsr"
FileSet = "lustreArgile"
}
Job {
Name = "lustre"
JobDefs = "lustre"
Write Bootstrap = "/var/lib/bacula/Client1.bsr"
FileSet = "lustre"
}
Job {
Name = "azurite1"
Client = azurite1-fd
JobDefs = "Azurite12"
Write Bootstrap = "/var/lib/bacula/%c_%n.bsr"
}
Job {
Name = "azurite2"
Client = azurite2-fd
JobDefs = "Azurite12"
Write Bootstrap = "/var/lib/bacula/%c_%n.bsr"
}
# Backup the catalog database (after the nightly save)
Job {
Name = "BackupCatalog"
JobDefs = "DefaultJob"
Level = Full
FileSet="Catalog"
Schedule = "WeeklyCycleAfterBackup"
# This creates an ASCII copy of the catalog
RunBeforeJob = "/etc/bacula/scripts/make_catalog_backup bacula bacula bacula"
# This deletes the copy of the catalog
RunAfterJob = "/etc/bacula/scripts/delete_catalog_backup"
Write Bootstrap = "/var/lib/bacula/BackupCatalog.bsr"
Priority = 11 # run after main backup
}
#
# Standard Restore template, to be changed by Console program
# Only one such job is needed for all Jobs/Clients/Storage ...
#
Job {
Name = "RestoreFiles"
Type = Restore
Client=azurite-fd
FileSet="Full Set"
Storage = Autochanger
Pool = Default
Messages = Standard
Where = /mnt/lustre/bacula/restore
}
# List of files to be backed up
FileSet {
Name = "Full Set"
Include {
Options {
signature = MD5
}
#
# Put your list of files here, preceded by 'File =', one per line
# or include an external list with:
#
# File = <file-name
#
# Note: / backs up everything on the root partition.
# if you have other partitions such as /usr or /home
# you will probably want to add them too.
#
# By default this is defined to point to the Bacula build
# directory to give a reasonable FileSet to backup to
# disk storage during initial testing.
#
File = /home/alfie/Debian/backport/bacula/2.2.8-4/blub/bacula-2.2.8/debian/tmp-build-sqlite
}
#
# If you back up the root directory, the following
# excluded entries can be useful
#
Exclude {
File = /proc
File = /tmp
File = /.journal
File = /.fsck
}
}
FileSet {
Name = "lustreAlliances"
Include {
Options {
# compression=gzip
signature = MD5
}
File = /mnt/lustre/home/alliance
File = /mnt/lustre/Alliances
}
}
FileSet {
Name = "lustreArgile"
Include {
Options {
signature = MD5
# compression=gzip
}
File = /mnt/lustre/home/argile
}
}
FileSet {
Name = "lustre"
Include {
Options {
signature = MD5
# compression=gzip
}
File = /mnt/lustre
}
Exclude {
File = /mnt/lustre/home/alliance
File = /mnt/lustre/home/argile
File = /mnt/lustre/VTL
File = /mnt/lustre/bacula
}
}
FileSet {
Name = "FullAzurite12"
Include {
Options {
signature = MD5
}
File = /boot
File = /scratch
File = /tmp
File = /var
File = /c/apps
File = /c/data
File = /c/batch
File = /
}
#
# If you back up the root directory, the following
# excluded entries can be useful
#
Exclude {
File = /proc
File = /.journal
File = /.fsck
File = /mnt
File = /apps
}
}
# When to do the backups, full backup on first sunday of the month,
# differential (i.e. incremental since full) every other sunday,
# and incremental backups other days
Schedule {
Name = "WeeklyCycle"
Run = Full 1st sun at 23:05
Run = Differential 2nd-5th sun at 23:05
Run = Incremental mon-sat at 23:05
}
# This schedule does the catalog. It starts after the WeeklyCycle
Schedule {
Name = "WeeklyCycleAfterBackup"
Run = Full sun-sat at 23:10
}
# This is the backup of the catalog
FileSet {
Name = "Catalog"
Include {
Options {
signature = MD5
}
File = /var/lib/bacula/bacula.sql
}
}
# Client (File Services) to backup
Client {
Name = azurite-fd
Address = azurite.andra.fr
FDPort = 9102
Catalog = MyCatalog
Password = "bacula-azurite-fd" # password for FileDaemon
File Retention = 30 days # 30 days
Job Retention = 6 months # six months
AutoPrune = yes # Prune expired Jobs/Files
Maximum Concurrent Jobs = 8
}
Client {
Name = azurite1-fd
Address = azurite1.andra.fr
FDPort = 9102
Catalog = MyCatalog
Password = "bacula-azurite1" # password for FileDaemon
File Retention = 30 days # 30 days
Job Retention = 6 months # six months
AutoPrune = yes # Prune expired Jobs/Files
Maximum Concurrent Jobs = 4
}
Client {
Name = azurite2-fd
Address = azurite2.andra.fr
FDPort = 9102
Catalog = MyCatalog
Password = "bacula-azurite2" # password for FileDaemon
File Retention = 30 days # 30 days
Job Retention = 6 months # six months
AutoPrune = yes # Prune expired Jobs/Files
Maximum Concurrent Jobs = 4
}
# Definition of LTO-3 tape storage device
Storage {
Name = Autochanger
Address = azurite.andra.fr
SDPort = 9103
Password = "bacula-azurite-sd"
Device = Autochanger # must be same as Device in Storage
Media Type = LTO-3 # must be same as MediaType in Storage
Autochanger = yes # enable for autochanger device
Maximum Concurrent Jobs = 16
}
Storage {
Name = LTO1
Address = azurite.andra.fr
SDPort = 9103
Password = "bacula-azurite-sd"
Device = Drive-1 # must be same as Device in Storage
Media Type = LTO-3 # must be same as MediaType in Storage
Autochanger = no # enable for autochanger device
}
Storage {
Name = LTO2
Address = azurite.andra.fr
SDPort = 9103
Password = "bacula-azurite-sd"
Device = Drive-2 # must be same as Device in Storage
Media Type = LTO-3 # must be same as MediaType in Storage
Autochanger = no # enable for autochanger device
}
# Generic catalog service
Catalog {
Name = MyCatalog
dbname = bacula; DB Address = ""; user = bacula; password = "bacula"
}
# Reasonable message delivery -- send most everything to email address
# and to the console
Messages {
Name = Standard
#
# NOTE! If you send to two email or more email addresses, you will need
# to replace the %r in the from field (-f part) with a single valid
# email address in both the mailcommand and the operatorcommand.
# What this does is, it sets the email address that emails would display
# in the FROM field, which is by default the same email as they're being
# sent to. However, if you send email to more than one address, then
# you'll have to set the FROM address manually, to a single address.
# for example, a 'no-reply@mydomain.com', is better since that tends to
# tell (most) people that its coming from an automated source.
#
mailcommand = "/usr/lib/bacula/bsmtp -h geaster.andra.fr -f \"\(Bacula\) \<%r\>\" -s \"Bacula: %t %e of %c %l\" %r"
operatorcommand = "/usr/lib/bacula/bsmtp -h geaster.andra.fr -f \"\(Bacula\) \<%r\>\" -s \"Bacula: Intervention needed for %j\" %r"
mail = gauthier@delerce.fr = all, !skipped
operator = gauthier@delerce.fr = mount
console = all, !skipped, !saved
#
# WARNING! the following will create a file that you must cycle from
# time to time as it will grow indefinitely. However, it will
# also keep all your messages if they scroll off the console.
#
append = "/var/lib/bacula/log" = all, !skipped
}
#
# Message delivery for daemon messages (no job).
Messages {
Name = Daemon
mailcommand = "/usr/lib/bacula/bsmtp -h localhost -f \"\(Bacula\) \<%r\>\" -s \"Bacula daemon message\" %r"
mail = root@localhost = all, !skipped
console = all, !skipped, !saved
append = "/var/lib/bacula/log" = all, !skipped
}
# Default pool definition
Pool {
Name = Default
Pool Type = Backup
Recycle = yes # Bacula can automatically recycle Volumes
AutoPrune = yes # Prune expired volumes
Volume Retention = 365 days # one year
}
# Scratch pool definition
Pool {
Name = Scratch
Pool Type = Backup
}
#
# Restricted console used by tray-monitor to get the status of the director
#
Console {
Name = bacula.andra.fr-mon
Password = "simple"
CommandACL = status, .status
}
["bacula-sd.conf" (text/plain)]
#
# Default Bacula Storage Daemon Configuration file
#
# For Bacula release 2.2.8 (26 January 2008) -- redhat
#
# You may need to change the name of your tape drive
# on the "Archive Device" directive in the Device
# resource. If you change the Name and/or the
# "Media Type" in the Device resource, please ensure
# that dird.conf has corresponding changes.
#
Storage { # definition of myself
Name = azurite-sd
SDPort = 9103 # SD listening port
WorkingDirectory = "/var/lib/bacula"
Pid Directory = "/var/run"
Maximum Concurrent Jobs = 20
}
#
# List Directors who are permitted to contact Storage daemon
#
Director {
Name = bacula-dir
Password = "bacula-azurite-sd"
}
#
# Restricted Director, used by tray-monitor to get the
# status of the storage daemon
#
Director {
Name = bacula-mon
Password = "simple"
Monitor = yes
}
#
# An autochanger device with two drives
#
Autochanger {
Name = Autochanger
Device = Drive-1
Device = Drive-2
Changer Command = "/etc/bacula/mtx-changer %c %o %S %a %d"
Changer Device = /dev/neo
}
Device {
Name = Drive-1 #
Drive Index = 0
Media Type = LTO-3
Archive Device = /dev/nst1
AutomaticMount = yes; # when device opened, read it
AlwaysOpen = yes;
RemovableMedia = yes;
RandomAccess = no;
AutoChanger = yes
SpoolDirectory = /mnt/lustre/bacula/spool/spool1
Autoselect = yes
#Maximum Job Spool Size = 100g
Maximum Job Spool Size = 100000000
# Enable the Alert command only if you have the mtx package loaded
# Note, apparently on some systems, tapeinfo resets the SCSI controller
# thus if you turn this on, make sure it does not reset your SCSI
# controller. I have never had any problems, and smartctl does
# not seem to cause such problems.
# Alert Command = "sh -c 'tapeinfo -f %c |grep TapeAlert|cat'"
# If you have smartctl, enable this, it has more info than tapeinfo
# Alert Command = "sh -c 'smartctl -H -l error %c'"
}
Device {
Name = Drive-2 #
Drive Index = 1
Media Type = LTO-3
Archive Device = /dev/nst0
AutomaticMount = yes; # when device opened, read it
AlwaysOpen = yes;
RemovableMedia = yes;
RandomAccess = no;
AutoChanger = yes
SpoolDirectory = /mnt/lustre/bacula/spool/spool2
Autoselect = yes
Maximum Job Spool Size = 100000000
# Enable the Alert command only if you have the mtx package loaded
# Alert Command = "sh -c 'tapeinfo -f %c |grep TapeAlert|cat'"
# If you have smartctl, enable this, it has more info than tapeinfo
# Alert Command = "sh -c 'smartctl -H -l error %c'"
}
#
Messages {
Name = Standard
director = bacula-dir = all
}