
List:       lustre-discuss
Subject:    Re: [lustre-discuss] How to eliminate zombie OSTs
From:       Alejandro Sierra via lustre-discuss <lustre-discuss@lists.lustre.org>
Date:       2023-08-10 21:59:33
Message-ID: CAP13dT2+Wh=yji4t=YpSDmAFAwT8p8cTwoxrOMFfZAXmKYx7og@mail.gmail.com

Following https://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#lustremaint.remove_ost,
today I did the following, with apparently no difference:

lctl set_param osp.lustre-OST0018-osc-MDT0000.max_create_count=0
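
To double-check that this actually took effect on the MDS, I assume reading the
same parameter back with get_param should now show 0, e.g.:

# lctl get_param osp.lustre-OST0018-osc-MDT0000.max_create_count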

but I also did

lctl --device 30 deactivate

and now the 10 zombie OSTs appear as IN, not UP

# lctl dl|grep OST|grep IN
 18 IN osp lustre-OST000a-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 19 IN osp lustre-OST000b-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 20 IN osp lustre-OST000c-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 21 IN osp lustre-OST000d-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 22 IN osp lustre-OST000e-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 28 IN osp lustre-OST0014-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 29 IN osp lustre-OST0015-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 30 IN osp lustre-OST0016-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 31 IN osp lustre-OST0017-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 32 IN osp lustre-OST0018-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
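
If I understand the manual and Chris's note about set_param -P correctly, the
permanent form of this deactivation (run on the MGS, which here is the same
node as MDT0000) would be something along these lines, repeated for each of
the stale indices:

# lctl set_param -P osp.lustre-OST0018-osc-MDT0000.active=0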

I also deactivated the OSTs on the client with

# lctl set_param osc.lustre-OST000b-*.active=0
osc.lustre-OST000b-osc-ffff979fbcc8b800.active=0
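
The same command applies to the other stale targets; a small loop over the ten
indices listed above (taken from the lctl dl output) would do the same thing,
e.g.:

# for idx in 000a 000b 000c 000d 000e 0014 0015 0016 0017 0018; do lctl set_param "osc.lustre-OST${idx}-*.active=0"; done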

but I still get errors for them on the client:

# lfs check osts|grep error
lfs check: error: check 'lustre-OST000a-osc-ffff979fbcc8b800': Cannot allocate memory (12)
lfs check: error: check 'lustre-OST000b-osc-ffff979fbcc8b800': Cannot allocate memory (12)
lfs check: error: check 'lustre-OST000c-osc-ffff979fbcc8b800': Cannot allocate memory (12)
lfs check: error: check 'lustre-OST000d-osc-ffff979fbcc8b800': Cannot allocate memory (12)
lfs check: error: check 'lustre-OST000e-osc-ffff979fbcc8b800': Cannot allocate memory (12)
lfs check: error: check 'lustre-OST0014-osc-ffff979fbcc8b800': Cannot allocate memory (12)
lfs check: error: check 'lustre-OST0015-osc-ffff979fbcc8b800': Resource temporarily unavailable (11)
lfs check: error: check 'lustre-OST0016-osc-ffff979fbcc8b800': Resource temporarily unavailable (11)
lfs check: error: check 'lustre-OST0017-osc-ffff979fbcc8b800': Resource temporarily unavailable (11)
lfs check: error: check 'lustre-OST0018-osc-ffff979fbcc8b800': Resource temporarily unavailable (11)

I will keep reading the reference, but if you have any suggestions, I
would appreciate them.
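
For what it's worth, one way to check whether the old targets are still
recorded in the client configuration log on the MGS (a sketch, assuming the
filesystem name is "lustre" and that this is run on the MGS node) seems to be
something like:

# lctl --device MGS llog_print lustre-client | grep OST0018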

On Wed, Aug 9, 2023 at 11:08 AM, Horn, Chris (chris.horn@hpe.com) wrote:
> 
> The error message is stating that '-P' is not a valid option to the conf_param
> command. You may be thinking of lctl set_param -P ...
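> 
> For example (a sketch, assuming the default fsname "lustre" and a single
> MDT0000), the two spellings that do exist would be roughly:
> 
>   lctl conf_param lustre-OST0015.osc.active=0                  (on the MGS)
>   lctl set_param -P osp.lustre-OST0015-osc-MDT0000.active=0    (on the MGS)
> 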
> Did you follow the documented procedure for removing an OST from the filesystem
> when you "adjust[ed] the configuration"?
> https://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#lustremaint.remove_ost
>  
> Chris Horn
> 
> 
> 
> From: lustre-discuss <lustre-discuss-bounces@lists.lustre.org> on behalf of
> Alejandro Sierra via lustre-discuss <lustre-discuss@lists.lustre.org>
> Date: Wednesday, August 9, 2023 at 11:55 AM
> To: Jeff Johnson <jeff.johnson@aeoncomputing.com>
> Cc: lustre-discuss <lustre-discuss@lists.lustre.org>
> Subject: Re: [lustre-discuss] How to eliminate zombie OSTs
> 
> Yes, it is.
> 
> On Wed, Aug 9, 2023 at 10:49 AM, Jeff Johnson
> (jeff.johnson@aeoncomputing.com) wrote:
> > 
> > Alejandro,
> > 
> > Is your MGS located on the same node as your primary MDT? (combined MGS/MDT node)
> > 
> > --Jeff
> > 
> > On Wed, Aug 9, 2023 at 9:46 AM Alejandro Sierra via lustre-discuss
> > <lustre-discuss@lists.lustre.org> wrote:
> > > 
> > > Hello,
> > > 
> > > In 2018 we implemented a Lustre 2.10.5 system with 20 OSTs on two OSSs
> > > with 4 JBOD enclosures, each with 24 disks of 12 TB, for a total of
> > > nearly 1 PB. In all that time we have had power failures and failed RAID
> > > controller cards, all of which made us adjust the configuration. After
> > > the last failure, the system keeps sending error messages about OSTs
> > > that are no longer in the system. On the MDS I do
> > > 
> > > # lctl dl
> > > 
> > > and I get the 20 currently active OSTs
> > > 
> > > oss01.lanot.unam.mx     -       OST00   /dev/disk/by-label/lustre-OST0000
> > > oss01.lanot.unam.mx     -       OST01   /dev/disk/by-label/lustre-OST0001
> > > oss01.lanot.unam.mx     -       OST02   /dev/disk/by-label/lustre-OST0002
> > > oss01.lanot.unam.mx     -       OST03   /dev/disk/by-label/lustre-OST0003
> > > oss01.lanot.unam.mx     -       OST04   /dev/disk/by-label/lustre-OST0004
> > > oss01.lanot.unam.mx     -       OST05   /dev/disk/by-label/lustre-OST0005
> > > oss01.lanot.unam.mx     -       OST06   /dev/disk/by-label/lustre-OST0006
> > > oss01.lanot.unam.mx     -       OST07   /dev/disk/by-label/lustre-OST0007
> > > oss01.lanot.unam.mx     -       OST08   /dev/disk/by-label/lustre-OST0008
> > > oss01.lanot.unam.mx     -       OST09   /dev/disk/by-label/lustre-OST0009
> > > oss02.lanot.unam.mx     -       OST15   /dev/disk/by-label/lustre-OST000f
> > > oss02.lanot.unam.mx     -       OST16   /dev/disk/by-label/lustre-OST0010
> > > oss02.lanot.unam.mx     -       OST17   /dev/disk/by-label/lustre-OST0011
> > > oss02.lanot.unam.mx     -       OST18   /dev/disk/by-label/lustre-OST0012
> > > oss02.lanot.unam.mx     -       OST19   /dev/disk/by-label/lustre-OST0013
> > > oss02.lanot.unam.mx     -       OST25   /dev/disk/by-label/lustre-OST0019
> > > oss02.lanot.unam.mx     -       OST26   /dev/disk/by-label/lustre-OST001a
> > > oss02.lanot.unam.mx     -       OST27   /dev/disk/by-label/lustre-OST001b
> > > oss02.lanot.unam.mx     -       OST28   /dev/disk/by-label/lustre-OST001c
> > > oss02.lanot.unam.mx     -       OST29   /dev/disk/by-label/lustre-OST001d
> > > 
> > > but I also get 5 that are not currently active and in fact do not exist:
> > > 
> > > 28 IN osp lustre-OST0014-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
> > > 29 UP osp lustre-OST0015-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
> > > 30 UP osp lustre-OST0016-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
> > > 31 UP osp lustre-OST0017-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
> > > 32 UP osp lustre-OST0018-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
> > > 
> > > When I try to eliminate them with
> > > 
> > > lctl conf_param -P osp.lustre-OST0015-osc-MDT0000.active=0
> > > 
> > > I get the error
> > > 
> > > conf_param: invalid option -- 'P'
> > > set a permanent config parameter.
> > > This command must be run on the MGS node
> > > usage: conf_param [-d] <target.keyword=val>
> > > -d  Remove the permanent setting.
> > > 
> > > If I do
> > > 
> > > lctl --device 28 deactivate
> > > 
> > > I don't get an error, but nothing changes
> > > 
> > > What can I do?
> > > 
> > > Thank you in advance for any help.
> > > 
> > > --
> > > Alejandro Aguilar Sierra
> > > LANOT, ICAyCC, UNAM
> > 
> > 
> > 
> > --
> > ------------------------------
> > Jeff Johnson
> > Co-Founder
> > Aeon Computing
> > 
> > jeff.johnson@aeoncomputing.com
> > http://www.aeoncomputing.com
> > t: 858-412-3810 x1001   f: 858-412-3845
> > m: 619-204-9061
> > 
> > 4170 Morena Boulevard, Suite C - San Diego, CA 92117
> > 
> > High-Performance Computing / Lustre Filesystems / Scale-out Storage
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


