List: lustre-discuss
Subject: Re: [lustre-discuss] [EXTERNAL] [BULK] MDS hardware - NVME?
From: Cameron Harr via lustre-discuss <lustre-discuss () lists ! lustre ! org>
Date: 2024-01-10 20:17:36
Message-ID: eb7b62e0-0d7d-94c8-30e9-00bff842d349 () llnl ! gov
On 1/10/24 11:59, Thomas Roth via lustre-discuss wrote:
> Actually we had MDTs on software raid-1 *connecting two JBODs* for
> quite some time - it worked surprisingly well and was stable.
I'm glad it's working for you!
>
> Hmm, if you have your MDTs on a zpool of mirrors aka raid-10, wouldn't
> going towards raidz2 increase data safety, something you don't need if
> the SSDs anyhow never fail? Doesn't raidz2 protect against failure of
> *any* two disks - in a pool of mirrors the second failure could
> destroy one mirror?
>
With raidz2 you can replace any disk in the raid group, but there are
also a lot more drives that can fail. With mirrors, there's a 1:1
replacement ratio with essentially no rebuild time. Of course that
assumes the 2 drives you lost weren't the 2 drives in the same mirror,
but we consider that low-probability. ZFS is also smart enough to (try
to) suspend the pool if it loses too many devices. And the striped
mirrors may see better performance than raidz2.
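For illustration, a sketch of the two layouts being compared; device
names are hypothetical placeholders, not our actual configuration:

```shell
# Striped mirrors ("raid-10"): 3x 2-drive mirrors. Each drive has exactly
# one partner, and a resilver only copies that one mirror.
zpool create mdt0 \
    mirror nvme0n1 nvme1n1 \
    mirror nvme2n1 nvme3n1 \
    mirror nvme4n1 nvme5n1

# raidz2 alternative over the same 6 drives: survives *any* two failures,
# but a rebuild reads the whole vdev and random IOPS are lower.
zpool create mdt0 raidz2 nvme0n1 nvme1n1 nvme2n1 nvme3n1 nvme4n1 nvme5n1
```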
>
> Regards
> Thomas
>
> On 1/9/24 20:57, Cameron Harr via lustre-discuss wrote:
> > Thomas,
> >
> > We value management over performance and have knowingly left
> > performance on the floor in the name of standardization, robustness,
> > management, etc., while still maintaining our performance targets. We
> > are a heavy ZFS-on-Linux (ZoL) shop so we never considered MD-RAID,
> > which, IMO, is very far behind ZoL in enterprise storage features.
> >
> > As Jeff mentioned, we have done some tuning (and if you haven't
> > noticed there are *a lot* of possible ZFS parameters) to further
> > improve performance and are at a good place performance-wise.
> >
> > Cameron
> >
> > On 1/8/24 10:33, Jeff Johnson wrote:
> > > Today nvme/mdraid/ldiskfs will beat nvme/zfs on MDS IOPs but you can
> > > close the gap somewhat with tuning, zfs ashift/recordsize and special
> > > allocation class vdevs. While the IOPs performance favors
> > > nvme/mdraid/ldiskfs there are tradeoffs. The snapshot/backup abilities
> > > of ZFS and the security it provides to the most critical function in a
> > > Lustre file system shouldn't be undervalued. From personal experience,
> > > I'd much rather deal with zfs in the event of a seriously jackknifed
> > > MDT than mdraid/ldiskfs and both zfs and mdraid/ldiskfs are preferable
> > > to trying to unscramble a vendor blackbox hwraid volume. ;-)
> > >
> > > When zfs directio lands and is fully integrated into Lustre the
> > > performance differences *should* be negligible.
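The tuning knobs Jeff names (ashift, recordsize, special allocation
class vdevs) might look like the following; pool and device names are
hypothetical, and exact values should be validated against your drives:

```shell
# ashift=12 matches 4K sectors; it is fixed at pool creation and
# cannot be changed afterward.
zpool create -o ashift=12 mdt0 mirror nvme0n1 nvme1n1

# A special allocation class vdev puts pool metadata (and optionally
# small blocks) on its own mirror.
zpool add mdt0 special mirror nvme2n1 nvme3n1

# A smaller recordsize can help IOPS-heavy metadata workloads.
zfs set recordsize=16k mdt0
```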
> > >
> > > Just my $.02 worth
> > >
> > > On Mon, Jan 8, 2024 at 8:23 AM Thomas Roth via lustre-discuss
> > > <lustre-discuss@lists.lustre.org> wrote:
> > > > Hi Cameron,
> > > >
> > > > did you run a performance comparison between ZFS and mdadm-raid on
> > > > the MDTs?
> > > > I'm currently doing some tests, and the results favor software
> > > > raid, in particular when it comes to IOPS.
> > > >
> > > > Regards
> > > > Thomas
> > > >
> > > > On 1/5/24 19:55, Cameron Harr via lustre-discuss wrote:
> > > > > This doesn't answer your question about ldiskfs on zvols, but
> > > > > we've been running MDTs on ZFS on NVMe in production for a couple
> > > > > years (and on SAS SSDs for many years prior). Our current
> > > > > production MDTs using NVMe consist of one zpool/node made up of 3x
> > > > > 2-drive mirrors, but we've been experimenting lately with using
> > > > > raidz3 and possibly even raidz2 for MDTs since SSDs have been
> > > > > pretty reliable for us.
> > > > >
> > > > > Cameron
> > > > >
> > > > > On 1/5/24 9:07 AM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology,
> > > > > Inc.] via lustre-discuss wrote:
> > > > > > We are in the process of retiring two long standing LFS's (about
> > > > > > 8 years old), which we built and managed ourselves. Both use ZFS
> > > > > > and have the MDTs on SSDs in a JBOD that require the kind of
> > > > > > software-based management you describe, in our case ZFS pools
> > > > > > built on multipath devices. The MDT in one is ZFS and the MDT in
> > > > > > the other LFS is ldiskfs but uses ZFS and a zvol as you describe
> > > > > > - we build the ldiskfs MDT on top of the zvol. Generally, this
> > > > > > has worked well for us, with one big caveat. If you look for my
> > > > > > posts to this list and the ZFS list you'll find more details.
> > > > > > The short version is that we utilize ZFS snapshots and clones to
> > > > > > do backups of the metadata. We've run into situations where the
> > > > > > backup process stalls, leaving a clone hanging around. We've
> > > > > > experienced a situation a couple of times where the clone and the
> > > > > > primary zvol get swapped, effectively rolling back our metadata
> > > > > > to the point when the clone was created. I have tried,
> > > > > > unsuccessfully, to recreate
> > > > > > that in a test environment. So if you do that kind of setup,
> > > > > > make sure you have good monitoring in place to detect if your
> > > > > > backups/clones stall. We've kept up with lustre and ZFS updates
> > > > > > over the years and are currently on lustre 2.14 and ZFS 2.1.
> > > > > > We've seen the gap between our ZFS MDT and ldiskfs performance
> > > > > > shrink to the point where they are pretty much on par with each
> > > > > > other now. I think our ZFS MDT performance could be better with more
> > > > > > hardware and software tuning but our small team hasn't had the
> > > > > > bandwidth to tackle that.
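A minimal sketch of the snapshot/clone backup flow Darby describes,
with a check for leftover clones from a stalled run; dataset names are
hypothetical:

```shell
# Snapshot the MDT dataset and clone it for a consistent backup source.
zfs snapshot mdtpool/mdt0@backup-$(date +%F)
zfs clone mdtpool/mdt0@backup-$(date +%F) mdtpool/mdt0-backup

# ... mount the clone read-only, copy the metadata off, then clean up:
zfs destroy mdtpool/mdt0-backup
zfs destroy mdtpool/mdt0@backup-$(date +%F)

# Monitoring: any dataset with a non-empty origin is a clone; one that
# survives past the backup window indicates a stalled backup.
zfs list -H -o name,origin -t filesystem,volume | awk '$2 != "-"'
```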
> > > > > >
> > > > > > Our newest LFS is vendor provided and uses NVMe MDTs. I'm not at
> > > > > > liberty to talk about the proprietary way those devices are
> > > > > > managed. However, the metadata performance is SO much better
> > > > > > than our older LFS's, for a lot of reasons, and I'd highly
> > > > > > recommend NVMe for your MDTs.
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: lustre-discuss <lustre-discuss-bounces@lists.lustre.org
> > > > > > <mailto:lustre-discuss-bounces@lists.lustre.org>> on behalf of
> > > > > > Thomas Roth via lustre-discuss <lustre-discuss@lists.lustre.org
> > > > > > <mailto:lustre-discuss@lists.lustre.org>>
> > > > > > Reply-To: Thomas Roth <t.roth@gsi.de <mailto:t.roth@gsi.de>>
> > > > > > Date: Friday, January 5, 2024 at 9:03 AM
> > > > > > To: Lustre Diskussionsliste <lustre-discuss@lists.lustre.org
> > > > > > <mailto:lustre-discuss@lists.lustre.org>>
> > > > > > Subject: [EXTERNAL] [BULK] [lustre-discuss] MDS hardware - NVME?
> > > > > >
> > > > > >
> > > > > > Dear all,
> > > > > >
> > > > > >
> > > > > > considering NVME storage for the next MDS.
> > > > > >
> > > > > >
> > > > > > As I understand it, NVMe disks are bundled in software, not by a
> > > > > > hardware RAID controller.
> > > > > > This would be done using Linux software raid, mdadm, correct?
> > > > > >
> > > > > >
> > > > > > We have some experience with ZFS, which we use on our OSTs.
> > > > > > But I would like to stick to ldiskfs for the MDTs, and a zpool
> > > > > > with a zvol on top which is then formatted with ldiskfs - too much
> > > > > > voodoo...
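For reference, the layering in question looks roughly like this; pool,
filesystem, and node names are hypothetical placeholders:

```shell
# Carve a zvol out of a zpool, then format it as an ldiskfs MDT.
zfs create -V 2T -o volblocksize=4k mdtpool/mdt0vol
mkfs.lustre --mdt --backfstype=ldiskfs --fsname=testfs --index=0 \
    --mgsnode=mgs@tcp /dev/zvol/mdtpool/mdt0vol
```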
> > > > > >
> > > > > >
> > > > > > How is this handled elsewhere? Any experiences?
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > The available devices are quite large. If I create a raid-10 out
> > > > > > of 4 disks, e.g. 7 TB each, my MDT will be 14 TB - already close
> > > > > > to the 16 TB limit.
> > > > > > So no need for a box with lots of U.3 slots.
> > > > > >
> > > > > >
> > > > > > But for MDS operations, we will still need a powerful dual-CPU
> > > > > > system with lots of RAM.
> > > > > > Then the NVME devices should be distributed between the CPUs?
> > > > > > Is there a way to pinpoint this in a call for tender?
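One way to verify NVMe-to-socket locality on delivered hardware is to
read the standard sysfs attribute; this is a quick inspection sketch,
assuming a Linux system with NVMe devices present:

```shell
# Print the NUMA node each NVMe controller is attached to.
# -1 means the platform did not report locality.
for d in /sys/class/nvme/nvme*; do
    echo "$(basename "$d"): NUMA node $(cat "$d/device/numa_node")"
done
```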
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > Best regards,
> > > > > > Thomas
> > > > > >
> > > > > >
> > > > > > --------------------------------------------------------------------
> > > > > > Thomas Roth
> > > > > >
> > > > > >
> > > > > > GSI Helmholtzzentrum für Schwerionenforschung GmbH
> > > > > > Planckstraße 1, 64291 Darmstadt, Germany,
> > > > > > http://www.gsi.de/
> > > > > >
> > > > > >
> > > > > > Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB
> > > > > > 1528
> > > > > > Managing Directors / Geschäftsführung:
> > > > > > Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
> > > > > > Chairman of the Supervisory Board / Vorsitzender des
> > > > > > GSI-Aufsichtsrats:
> > > > > > State Secretary / Staatssekretär Dr. Volkmar Dietz
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > >
> > >
> >
>
>
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org