
List:       ceph-users
Subject:    [ceph-users] Re: High ceph_osd_commit_latency_ms on Toshiba MG07ACA14TE HDDs
From:       Igor Fedotov <ifedotov () suse ! de>
Date:       2020-06-24 12:58:47
Message-ID: 207a0eec-7ece-ab24-6520-9659666c14a9 () suse ! de

Benoit, thanks for the update.

For the sake of completeness, one more experiment please, if possible:

turn off the write cache on the HGST drives and measure commit latency once again.
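
A rough sketch of what I mean (the device names below are placeholders; substitute whichever devices back the HGST OSDs on that node):

```
# disable the volatile write cache on the HGST drives only
# (placeholder device names; adjust to the actual layout on the node)
for disk in /dev/sd{m..p}; do
    hdparm -W0 "$disk"
done
```

Then generate the same load and compare ceph_osd_commit_latency_ms as before.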


Kind regards,

Igor

On 6/24/2020 3:53 PM, Benoît Knecht wrote:
> Thank you all for your answers, this was really helpful!
> 
> Stefan Priebe wrote:
> > Yes, we have the same issues and switched to Seagate for those reasons.
> > You can fix at least a big part of it by disabling the write cache of
> > those drives - generally speaking, it seems the Toshiba firmware is
> > broken.
> > I was not able to find a newer one.
> Good to know that we're not alone :) I also looked for a newer firmware, to no
> avail.
> 
> Igor Fedotov wrote:
> > Benoit, wondering what the write cache settings are in your case?
> > 
> > And do you see any difference after disabling it, if any?
> Write cache is enabled on all our OSDs (including the HGST drives that don't
> have a latency issue).
> 
> To see if disabling write cache on the Toshiba drives would help, I turned it
> off on all 12 drives in one of our OSD nodes:
> 
> ```
> for disk in /dev/sd{a..l}; do hdparm -W0 $disk; done
> ```
> 
> and left it on in the remaining nodes. I used `rados bench write` to create
> some load on the cluster, and looked at
> 
> ```
> avg by (hostname) (ceph_osd_commit_latency_ms * on (ceph_daemon) group_left \
> (hostname) ceph_osd_metadata)
> ```
> 
> in Prometheus. The hosts with write cache _enabled_ had a commit latency around
> 145ms, while the host with write cache _disabled_ had a commit latency around
> 25ms. So it definitely helps!
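> 
> (For anyone who wants to reproduce this, the load was generated with something
> along the lines of the commands below; the pool name, runtime and thread count
> are placeholders rather than the exact values I used:)
> 
> ```
> # create a throwaway pool and write to it for 60 seconds with 16 concurrent ops
> ceph osd pool create bench 64 64
> rados bench -p bench 60 write -t 16 --no-cleanup
> ```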
> 
> Mark Nelson wrote:
> > This isn't the first time I've seen drive cache cause problematic
> > latency issues, and not always from the same manufacturer.
> > Unfortunately it seems like you really have to test the drives you
> > want to use before deploying them to make sure you don't run into
> > issues.
> That's very true! Data sheets and even public benchmarks can be quite
> deceiving, and two hard drives that seem to have similar performance profiles
> can perform very differently within a Ceph cluster. Lesson learned.
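> 
> (For what it's worth, a single-threaded synchronous write test with fio seems
> to surface this kind of write-cache behaviour without needing a full cluster;
> a rough sketch, and note that it writes to the raw device, so only point it at
> a disk with no data on it:)
> 
> ```
> # rough sketch: measure single-depth sync write latency on one drive
> # WARNING: writes directly to the raw device and destroys any data on it
> fio --name=synclat --filename=/dev/sdX --direct=1 --sync=1 \
>     --rw=write --bs=4k --iodepth=1 --numjobs=1 \
>     --runtime=60 --time_based
> ```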
> 
> Cheers,
> 
> --
> Ben
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io

