List: lustre-discuss
Subject: [lustre-discuss] MDT not mounting after tunefs.lustre changes on ZFS volumes
From: Bob Torgerson <rltorgerson () alaska ! edu>
Date: 2019-05-19 22:05:30
Message-ID: CAJPG0pWxT4VxB01mfSWhumzi=quHzherer+D+X82=ks8_-D00w () mail ! gmail ! com
Hello,
This is for a Lustre 2.10.3 file system with a single MDS and three OSSes.
The MDS has a separate MGT and MDT both mounted on it, and each OSS has five
OSTs that do not fail over between hosts. We use ZFS as the backend for each
of the Lustre targets.
Here is the layout of the ZFS pool digdug-meta on our MDS server containing
both the MGT and MDT:
NAME                       USED  AVAIL  REFER  MOUNTPOINT
digdug-meta                268G   453G    96K  /digdug-meta
digdug-meta/lustre2-mdt0   266G   453G   266G  /digdug-meta/lustre2-mdt0
digdug-meta/mgs           4.10M   453G  4.10M  /digdug-meta/mgs
Yesterday, while attempting to add a new MDS server to act as a failover
node for the MGT and MDT, I stopped the file system and unmounted all of the
targets on the MDS (MGT and MDT) and OSSes. The new MDS server is
192.168.2.13@o2ib1 and the current MDS server is 192.168.2.14@o2ib1. After
that, I ran the following commands on the MGT and MDT:
# tunefs.lustre --verbose --writeconf --erase-params
--servicenode=192.168.2.13@o2ib1 --servicenode=192.168.2.14@o2ib1
digdug-meta/mgs
# tunefs.lustre --verbose --writeconf --erase-params
--mgsnode=192.168.2.13@o2ib1 --mgsnode=192.168.2.14@o2ib1
--servicenode=192.168.2.13@o2ib1 --servicenode=192.168.2.14@o2ib1
digdug-meta/lustre2-mdt0
I ran a similar tunefs.lustre command on each of the OSTs, following this pattern:
# tunefs.lustre --verbose --writeconf --erase-params
--mgsnode=192.168.2.13@o2ib1 --mgsnode=192.168.2.14@o2ib1
--servicenode=<OSS NID> digdug-ost#/lustre2
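For reference, the parameters these commands wrote can be re-read from a target without changing anything (a sketch; assuming tunefs.lustre's read-only --dryrun flag behaves the same way in 2.10.x):

```shell
# Print the target's on-disk configuration (fsname, index, parameters)
# without writing anything; dataset name as in the commands above.
tunefs.lustre --dryrun digdug-meta/lustre2-mdt0
```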
After making those changes, I started the MGT and MDT on the original MDS,
which initially worked fine; I then started all of the OSTs and even
mounted a client. But when I tried to bring up the MGT and MDT on the new
MDS node 192.168.2.13@o2ib1, it didn't work. I decided to bring the MGT and
MDT back up on the original MDS and figure it out later, but now I can't
get the MDT to mount on the original MDS either. I'm getting the following
errors when trying to mount the MDT after the MGT has been mounted:
May 19 13:53:09 mds02 systemd: Starting SYSV: Part of the lustre file
system....
May 19 13:53:09 mds02 lustre: Mounting digdug-meta/mgs on
/mnt/lustre/local/MGS
May 19 13:53:09 mds02 lustre: mount.lustre: according to /etc/mtab
digdug-meta/mgs is already mounted on /mnt/lustre/local/MGS
May 19 13:53:11 mds02 lustre: Mounting digdug-meta/lustre2-mdt0 on
/mnt/lustre/local/lustre2-MDT0000
May 19 13:53:11 mds02 kernel: Lustre: MGS: Logs for fs lustre2 were removed
by user request. All servers must be restarted in order to regenerate the
logs: rc = 0
May 19 13:53:12 mds02 kernel: LustreError:
14135:0:(llog_osd.c:262:llog_osd_read_header()) lustre2-MDT0000-osd: bad
log lustre2-MDT0000 [0xa:0x7b:0x0] header magic: 0x0 (expected 0x10645539)
May 19 13:53:12 mds02 kernel: LustreError:
14135:0:(llog_osd.c:262:llog_osd_read_header()) Skipped 1 previous similar
message
May 19 13:53:12 mds02 kernel: LustreError:
14135:0:(mgc_request.c:1897:mgc_llog_local_copy()) MGC192.168.2.14@o2ib1:
failed to copy remote log lustre2-MDT0000: rc = -5
May 19 13:53:12 mds02 kernel: LustreError: 13a-8: Failed to get MGS log
lustre2-MDT0000 and no local copy.
May 19 13:53:12 mds02 kernel: LustreError: 15c-8: MGC192.168.2.14@o2ib1:
The configuration from log 'lustre2-MDT0000' failed (-2). This may be the
result of communication errors between this node and the MGS, a bad
configuration, or other errors. See the syslog for more information.
May 19 13:53:12 mds02 kernel: LustreError:
14135:0:(obd_mount_server.c:1373:server_start_targets()) failed to start
server lustre2-MDT0000: -2
May 19 13:53:12 mds02 kernel: LustreError:
14135:0:(obd_mount_server.c:1866:server_fill_super()) Unable to start
targets: -2
May 19 13:53:12 mds02 kernel: LustreError:
14135:0:(obd_mount_server.c:1576:server_put_super()) no obd lustre2-MDT0000
May 19 13:53:12 mds02 kernel: Lustre: server umount lustre2-MDT0000 complete
May 19 13:53:12 mds02 kernel: LustreError:
14135:0:(obd_mount.c:1506:lustre_fill_super()) Unable to mount (-2)
May 19 13:53:12 mds02 lustre: mount.lustre: mount digdug-meta/lustre2-mdt0
at /mnt/lustre/local/lustre2-MDT0000 failed: No such file or directory
May 19 13:53:12 mds02 lustre: Is the MGS specification correct?
May 19 13:53:12 mds02 lustre: Is the filesystem name correct?
May 19 13:53:12 mds02 lustre: If upgrading, is the copied client log valid?
(see upgrade docs)
May 19 13:53:13 mds02 systemd: lustre.service: control process exited,
code=exited status=2
May 19 13:53:13 mds02 systemd: Failed to start SYSV: Part of the lustre
file system..
May 19 13:53:13 mds02 systemd: Unit lustre.service entered failed state.
May 19 13:53:13 mds02 systemd: lustre.service failed.
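For context, the mounts the init script attempts in the log above correspond to commands of this shape (a sketch reconstructed from the log's datasets and mountpoints; the actual invocation may include additional options):

```shell
# MGS first, then the MDT; both are ZFS-backed Lustre targets.
mount -t lustre digdug-meta/mgs /mnt/lustre/local/MGS
mount -t lustre digdug-meta/lustre2-mdt0 /mnt/lustre/local/lustre2-MDT0000
```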
This morning we also discovered that the ZFS pool containing the MGT and
MDT has a permanent error that may be affecting our ability to mount the
MDT:
# zpool status -v digdug-meta
  pool: digdug-meta
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        digdug-meta                 ONLINE       0     0    70
          mirror-0                  ONLINE       0     0   141
            scsi-35000c5003017156b  ONLINE       0     0   141
            scsi-35000c500301715e7  ONLINE       0     0   141
            scsi-35000c5003017158b  ONLINE       0     0   141
            scsi-35000c500301716a3  ONLINE       0     0   141
          mirror-1                  ONLINE       0     0     1
            scsi-35000c5003017155f  ONLINE       0     0     1
            scsi-35000c500301715a7  ONLINE       0     0     1
            scsi-35000c5003017159b  ONLINE       0     0     1
            scsi-35000c5003017158f  ONLINE       0     0     1

errors: Permanent errors have been detected in the following files:

        digdug-meta/lustre2-mdt0:/oi.10/0xa:0x7b:0x0
I'm not sure what my next steps would be to recover this file system, if
that is possible at all, and would greatly appreciate any help from this
group.
Thank you in advance,
Bob Torgerson
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org