List: lustre-discuss
Subject: Re: [lustre-discuss] [EXTERNAL] [BULK] MDS hardware - NVME?
From: Cameron Harr via lustre-discuss <lustre-discuss () lists ! lustre ! org>
Date: 2024-01-10 20:17:36
Message-ID: eb7b62e0-0d7d-94c8-30e9-00bff842d349 () llnl ! gov
On 1/10/24 11:59, Thomas Roth via lustre-discuss wrote:
> Actually we had MDTs on software raid-1 *connecting two JBODs* for
> quite some time - it worked surprisingly well and was stable.
I'm glad it's working for you!
>
> Hmm, if you have your MDTs on a zpool of mirrors aka raid-10, wouldn't
> going towards raidz2 increase data safety, something you don't need if
> the SSDs anyhow never fail? Doesn't raidz2 protect against failure of
> *any* two disks - in a pool of mirrors the second failure could
> destroy one mirror?
>
With raidz2 you can replace any disk in the raid group, but there are
also a lot more drives that can fail. With mirrors, there's a 1:1
replacement ratio with essentially no rebuild time. Of course that
assumes the 2 drives you lost weren't the 2 drives in the same mirror,
but we consider that low-probability. ZFS is also smart enough to (try
to) suspend the pool if it loses too many devices. And the striped
mirrors may see better performance than raidz2.
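For illustration, a sketch of the two layouts being compared; device
names are hypothetical placeholders, not our actual configuration:

```shell
# Striped mirrors ("raid-10"): 3x 2-drive mirrors. Each drive has exactly
# one partner, and a resilver only copies that one mirror.
zpool create mdt0 \
    mirror nvme0n1 nvme1n1 \
    mirror nvme2n1 nvme3n1 \
    mirror nvme4n1 nvme5n1

# raidz2 alternative over the same 6 drives: survives *any* two failures,
# but a rebuild reads the whole vdev and random IOPS are lower.
zpool create mdt0 raidz2 nvme0n1 nvme1n1 nvme2n1 nvme3n1 nvme4n1 nvme5n1
```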
>
> Regards
> Thomas
>
> On 1/9/24 20:57, Cameron Harr via lustre-discuss wrote:
> > Thomas,
> >
> > We value management over performance and have knowingly left
> > performance on the floor in the name of standardization, robustness,
> > management, etc., while still maintaining our performance targets. We
> > are a heavy ZFS-on-Linux (ZoL) shop so we never considered MD-RAID,
> > which, IMO, is very far behind ZoL in enterprise storage features.
> >
> > As Jeff mentioned, we have done some tuning (and if you haven't
> > noticed there are *a lot* of possible ZFS parameters) to further
> > improve performance and are at a good place performance-wise.
> >
> > Cameron
> >
> > On 1/8/24 10:33, Jeff Johnson wrote:
> > > Today nvme/mdraid/ldiskfs will beat nvme/zfs on MDS IOPs but you can
> > > close the gap somewhat with tuning, zfs ashift/recordsize and special
> > > allocation class vdevs. While the IOPs performance favors
> > > nvme/mdraid/ldiskfs there are tradeoffs. The snapshot/backup abilities
> > > of ZFS and the security it provides to the most critical function in a
> > > Lustre file system shouldn't be undervalued. From personal experience,
> > > I'd much rather deal with zfs in the event of a seriously jackknifed
> > > MDT than mdraid/ldiskfs and both zfs and mdraid/ldiskfs are preferable
> > > to trying to unscramble a vendor blackbox hwraid volume. ;-)
> > >
> > > When zfs directio lands and is fully integrated into Lustre the
> > > performance differences *should* be negligible.
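The tuning knobs Jeff names (ashift, recordsize, special allocation
class vdevs) might look like the following; pool and device names are
hypothetical, and exact values should be validated against your drives:

```shell
# ashift=12 matches 4K sectors; it is fixed at pool creation and
# cannot be changed afterward.
zpool create -o ashift=12 mdt0 mirror nvme0n1 nvme1n1

# A special allocation class vdev puts pool metadata (and optionally
# small blocks) on its own mirror.
zpool add mdt0 special mirror nvme2n1 nvme3n1

# A smaller recordsize can help IOPS-heavy metadata workloads.
zfs set recordsize=16k mdt0
```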
> > >
> > > Just my $.02 worth
> > >
> > > On Mon, Jan 8, 2024 at 8:23 AM Thomas Roth via lustre-discuss
> > > <lustre-discuss@lists.lustre.org> wrote:
> > > > Hi Cameron,
> > > >
> > > > did you run a performance comparison between ZFS and mdadm-raid on
> > > > the MDTs?
> > > > I'm currently doing some tests, and the results favor software
> > > > raid, in particular when it comes to IOPS.
> > > >
> > > > Regards
> > > > Thomas
> > > >
> > > > On 1/5/24 19:55, Cameron Harr via lustre-discuss wrote:
> > > > > This doesn't answer your question about ldiskfs on zvols, but
> > > > > we've been running MDTs on ZFS on NVMe in production for a couple
> > > > > years (and on SAS SSDs for many years prior). Our current
> > > > > production MDTs using NVMe consist of one zpool/node made up of 3x
> > > > > 2-drive mirrors, but we've been experimenting lately with using
> > > > > raidz3 and possibly even raidz2 for MDTs since SSDs have been
> > > > > pretty reliable for us.
> > > > >
> > > > > Cameron
> > > > >
> > > > > On 1/5/24 9:07 AM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology,
> > > > > Inc.] via lustre-discuss wrote:
> > > > > > We are in the process of retiring two long standing LFS's (about
> > > > > > 8 years old), which we built and managed ourselves. Both use ZFS
> > > > > > and have the MDTs on SSDs in a JBOD that require the kind of
> > > > > > software-based management you describe, in our case ZFS pools
> > > > > > built on multipath devices. The MDT in one is ZFS and the MDT in
> > > > > > the other LFS is ldiskfs but uses ZFS and a zvol as you describe
> > > > > > - we build the ldiskfs MDT on top of the zvol. Generally, this
> > > > > > has worked well for us, with one big caveat. If you look for my
> > > > > > posts to this list and the ZFS list you'll find more details.
> > > > > > The short version is that we utilize ZFS snapshots and clones to
> > > > > > do backups of the metadata. We've run into situations where the
> > > > > > backup process stalls, leaving a clone hanging around. We've
> > > > > > experienced a situation a couple of times where the clone and the
> > > > > > primary zvol get swapped, effectively rolling back our metadata
> > > > > > to the point when the clone was created. I have tried,
> > > > > > unsuccessfully, to recreate
> > > > > > that in a test environment. So if you do that kind of setup,
> > > > > > make sure you have good monitoring in place to detect if your
> > > > > > backups/clones stall. We've kept up with lustre and ZFS updates
> > > > > > over the years and are currently on lustre 2.14 and ZFS 2.1.
> > > > > > We've seen the gap between our ZFS MDT and ldiskfs performance
> > > > > > shrink to the point where they are pretty much on par with each
> > > > > > other now. I think our ZFS MDT performance could be better with more
> > > > > > hardware and software tuning but our small team hasn't had the
> > > > > > bandwidth to tackle that.
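A minimal sketch of the snapshot/clone backup flow Darby describes,
with a check for leftover clones from a stalled run; dataset names are
hypothetical:

```shell
# Snapshot the MDT dataset and clone it for a consistent backup source.
zfs snapshot mdtpool/mdt0@backup-$(date +%F)
zfs clone mdtpool/mdt0@backup-$(date +%F) mdtpool/mdt0-backup

# ... mount the clone read-only, copy the metadata off, then clean up:
zfs destroy mdtpool/mdt0-backup
zfs destroy mdtpool/mdt0@backup-$(date +%F)

# Monitoring: any dataset with a non-empty origin is a clone; one that
# survives past the backup window indicates a stalled backup.
zfs list -H -o name,origin -t filesystem,volume | awk '$2 != "-"'
```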
> > > > > >
> > > > > > Our newest LFS is vendor provided and uses NVMe MDTs. I'm not at
> > > > > > liberty to talk about the proprietary way those devices are
> > > > > > managed. However, the metadata performance is SO much better
> > > > > > than our older LFS's, for a lot of reasons, and I'd highly
> > > > > > recommend NVMe for your MDTs.
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: lustre-discuss <lustre-discuss-bounces@lists.lustre.org
> > > > > > <mailto:lustre-discuss-bounces@lists.lustre.org>> on behalf of
> > > > > > Thomas Roth via lustre-discuss <lustre-discuss@lists.lustre.org
> > > > > > <mailto:lustre-discuss@lists.lustre.org>>
> > > > > > Reply-To: Thomas Roth <t.roth@gsi.de <mailto:t.roth@gsi.de>>
> > > > > > Date: Friday, January 5, 2024 at 9:03 AM
> > > > > > To: Lustre Diskussionsliste <lustre-discuss@lists.lustre.org
> > > > > > <mailto:lustre-discuss@lists.lustre.org>>
> > > > > > Subject: [EXTERNAL] [BULK] [lustre-discuss] MDS hardware - NVME?
> > > > > >
> > > > > >
> > > > > > Dear all,
> > > > > >
> > > > > >
> > > > > > considering NVME storage for the next MDS.
> > > > > >
> > > > > >
> > > > > > As I understand it, NVMe disks are bundled in software, not by a
> > > > > > hardware RAID controller.
> > > > > > This would be done using Linux software raid, mdadm, correct?
> > > > > >
> > > > > >
> > > > > > We have some experience with ZFS, which we use on our OSTs.
> > > > > > But I would like to stick to ldiskfs for the MDTs, and a zpool
> > > > > > with a zvol on top which is then formatted with ldiskfs - too much
> > > > > > voodoo...
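For reference, the layering in question looks roughly like this; pool,
filesystem, and node names are hypothetical placeholders:

```shell
# Carve a zvol out of a zpool, then format it as an ldiskfs MDT.
zfs create -V 2T -o volblocksize=4k mdtpool/mdt0vol
mkfs.lustre --mdt --backfstype=ldiskfs --fsname=testfs --index=0 \
    --mgsnode=mgs@tcp /dev/zvol/mdtpool/mdt0vol
```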
> > > > > >
> > > > > >
> > > > > > How is this handled elsewhere? Any experiences?
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > The available devices are quite large. If I create a raid-10 out
> > > > > > of 4 disks, e.g. 7 TB each, my MDT will be 14 TB - already close
> > > > > > to the 16 TB limit.
> > > > > > So no need for a box with lots of U.3 slots.
> > > > > >
> > > > > >
> > > > > > But for MDS operations, we will still need a powerful dual-CPU
> > > > > > system with lots of RAM.
> > > > > > Then the NVME devices should be distributed between the CPUs?
> > > > > > Is there a way to pinpoint this in a call for tender?
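One way to verify NVMe-to-socket locality on delivered hardware is to
read the standard sysfs attribute; this is a quick inspection sketch,
assuming a Linux system with NVMe devices present:

```shell
# Print the NUMA node each NVMe controller is attached to.
# -1 means the platform did not report locality.
for d in /sys/class/nvme/nvme*; do
    echo "$(basename "$d"): NUMA node $(cat "$d/device/numa_node")"
done
```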
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > Best regards,
> > > > > > Thomas
> > > > > >
> > > > > >
> > > > > > --------------------------------------------------------------------
> > > > > > Thomas Roth
> > > > > >
> > > > > >
> > > > > > GSI Helmholtzzentrum für Schwerionenforschung GmbH
> > > > > > Planckstraße 1, 64291 Darmstadt, Germany,
> > > > > > http://www.gsi.de/
> > > > > >
> > > > > >
> > > > > > Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB
> > > > > > 1528
> > > > > > Managing Directors / Geschäftsführung:
> > > > > > Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
> > > > > > Chairman of the Supervisory Board / Vorsitzender des
> > > > > > GSI-Aufsichtsrats:
> > > > > > State Secretary / Staatssekretär Dr. Volkmar Dietz
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > >
> > >
> >
>
>
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org