
List:       ceph-users
Subject:    [ceph-users] Changing the failure domain
From:       David Turner <drakonstein@gmail.com>
Date:       2017-08-31 14:56:57
Message-ID: <CAN-Gep+-jTXOCHZAVfb-T+MGJqwcCGu2Gm=TYeKByxz3kjysHw@mail.gmail.com>

How long are you seeing these blocked requests for?  Initially or
perpetually?  Changing the failure domain causes all PGs to peer at the
same time, which would be the cause if the blocking happens very quickly;
there is no way to avoid all of them peering while making a change like
this.  After that, it could easily be caused by the fact that a fair
majority of your data is probably set to move around.  I would check what
might be causing the blocked requests during this time.  See if there is
an OSD that might be dying (large backfills have a tendency to find a
couple of failing drives), which could easily cause things to block.  Also,
checking with iostat whether your disks or journals are maxed out could
shine some light on any contributing factor.
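
For example, the throttling and iostat checks above look roughly like
this (a sketch; the OSD wildcard, values, and device name are
illustrative, and injectargs changes do not persist across OSD restarts):

```shell
# Throttle backfill/recovery at runtime on all OSDs (example values).
ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'

# See which requests are blocked/slow and which OSDs report them.
ceph health detail | grep -i 'blocked\|slow'

# Check whether a data disk or journal device is saturated; %util near
# 100 suggests the device itself is the bottleneck (/dev/sdX is a placeholder).
iostat -x 5 /dev/sdX
```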

On Thu, Aug 31, 2017 at 9:01 AM Laszlo Budai <laszlo@componentsoft.eu>
wrote:

> Dear all!
>
> In our Hammer cluster we are planning to switch our failure domain from
> host to chassis. We have performed some simulations, and regardless of the
> settings we have used some slow requests have appeared all the time.
>
> We had the following settings:
>
>      "osd_max_backfills": "1",
>      "osd_backfill_full_ratio": "0.85",
>      "osd_backfill_retry_interval": "10",
>      "osd_backfill_scan_min": "1",
>      "osd_backfill_scan_max": "4",
>      "osd_kill_backfill_at": "0",
>      "osd_debug_skip_full_check_in_backfill_reservation": "false",
>      "osd_debug_reject_backfill_probability": "0",
>
>      "osd_min_recovery_priority": "0",
>      "osd_allow_recovery_below_min_size": "true",
>      "osd_recovery_threads": "1",
>      "osd_recovery_thread_timeout": "60",
>      "osd_recovery_thread_suicide_timeout": "300",
>      "osd_recovery_delay_start": "0",
>      "osd_recovery_max_active": "1",
>      "osd_recovery_max_single_start": "1",
>      "osd_recovery_max_chunk": "8388608",
>      "osd_recovery_forget_lost_objects": "false",
>      "osd_recovery_op_priority": "1",
>      "osd_recovery_op_warn_multiple": "16",
>
>
> We have also tested it with the CFQ I/O scheduler on the OSDs and the
> following parameters:
>      "osd_disk_thread_ioprio_priority": "7"
>      "osd_disk_thread_ioprio_class": "idle"
>
> and with the nodeep-scrub flag set.
>
> Is there anything else to try? Is there a good way to switch from one
> kind of failure domain to another without slow requests?
>
> Thank you in advance for any suggestions.
>
> Kind regards,
> Laszlo
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
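
For reference, the mechanics of the switch being discussed are typically a
CRUSH map edit along these lines (a sketch; the file names are arbitrary,
and the exact rule to edit depends on your crushmap):

```shell
# Export and decompile the current CRUSH map.
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# Edit crushmap.txt: in the relevant rule, change the failure domain, e.g.
#   step chooseleaf firstn 0 type host
# to
#   step chooseleaf firstn 0 type chassis

# Recompile and inject the new map; this triggers peering and data movement.
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin
```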

