
List:       ceph-users
Subject:    [ceph-users] Ceph performances
From:       Rémi BUISSON <remi-buisson () orange ! fr>
Date:       2015-11-30 14:07:19
Message-ID: b9e787a02d08e872e6dc3d4a25e35343 () didiandrems ! fr

Hello Robert,

OK. I already tried this but, as you said, performance decreases. I just 
built the 10.0.0 version and there seem to be some regressions in it: I 
now get 3.5 Kiops instead of the 21 Kiops I had with 9.2.0 :-/

Thanks.

Rémi

On 2015-11-25 18:54, Robert LeBlanc wrote:
> I'm really surprised that you are getting 100K IOPS from the Intel
> S3610s. We are already in the process of ordering some to test
> alongside other drives, so I should be able to verify that as well.
> With the S3700 and S3500, I was only able to get 20K IOPS when running
> 8 threads (that's about where the performance tapered off). With a
> single thread, I was getting 5.5K on the S3500 and 4.4K on the S3700.
> I'll be really happy if we are seeing 100K out of the S3610s; that
> will make some decisions much easier.
> 
> Something else you can try is increasing the number of disk threads in
> the OSD so you get more parallelism to the drive. I've heard from
> someone else that increasing the thread count did not do as much as
> partitioning up the drive and having multiple OSDs on the same SSD; he
> found that the gains diminished after 4 OSDs per drive. You can try
> something like that.
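> 
> A hedged sketch of the knobs usually meant by "disk threads" (the
> values are illustrative, not recommendations; check the defaults for
> your version):
> 
> [osd]
> # threads servicing the OSD op work queue
> osd op threads = 8
> # threads applying writes to the backing filesystem
> filestore op threads = 8
> # background disk threads (scrubbing, snap trimming)
> osd disk threads = 4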
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Wed, Nov 25, 2015 at 5:40 AM, Rémi BUISSON <remi-buisson at orange.fr> wrote:
> > 
> > Hello Robert,
> > 
> > Sorry for the late answer.
> > 
> > Thanks for your reply. I updated to Infernalis and applied all your
> > recommendations, but it doesn't change anything, with or without
> > cache tiering :-/
> > 
> > I also compared XFS to EXT4 and BTRFS, but it doesn't make a 
> > difference.
> > 
> > The fio command from Sebastien Han tells me my disks can actually do
> > 100 Kiops, so it's really frustrating :-S
> > 
> > Rémi
> > 
> > On 2015-11-07 15:59, Robert LeBlanc wrote:
> > 
> > You most likely did the wrong test to get baseline IOPS for Ceph or
> > for your SSDs. Ceph is really hard on SSDs: it does direct sync
> > writes, which drives handle very differently, even between models of
> > the same brand. Start with
> > http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> > for your base numbers, and realize that Hammer still can't use all of
> > that IOPS. I was able to gain 50% in SSD IOPS by: disabling
> > transparent huge pages, LD_PRELOADing jemalloc (uses a little more
> > RAM, but your config should be OK), enabling numad, tuning
> > irqbalance, setting vfs_cache_pressure to 500, greatly increasing the
> > network buffers, and disabling TCP slow start after idle. We are also
> > using EXT4, which I've found is a bit faster, but it has recently
> > been reported that someone is having deadlocks/crashes with it. We
> > are having an XFS log issue on one of our clusters causing an OSD or
> > two to fail every week.
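> > 
> > For reference, the journal test from that post is roughly the
> > following (a sketch: /dev/sdX is a placeholder for the SSD under
> > test, and this writes directly to the raw device):
> > 
> > fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
> >     --numjobs=1 --iodepth=1 --runtime=60 --time_based \
> >     --group_reporting --name=journal-test
> > 
> > And a hedged sketch of some of the tunings above (values and paths
> > are illustrative; the jemalloc library path is distro-dependent):
> > 
> > # disable transparent huge pages
> > echo never > /sys/kernel/mm/transparent_hugepage/enabled
> > # reclaim inode/dentry caches more aggressively
> > sysctl -w vm.vfs_cache_pressure=500
> > # enlarge the network buffers (example values)
> > sysctl -w net.core.rmem_max=16777216 net.core.wmem_max=16777216
> > # don't fall back to TCP slow start on idle connections
> > sysctl -w net.ipv4.tcp_slow_start_after_idle=0
> > # preload jemalloc into the process starting the OSDs
> > export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1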
> > 
> > When I tested the same workload in an SSD cache tier, the performance
> > was only 50% of what I was able to achieve on the pure SSD tier (I'm
> > guessing overhead of the cache tier). And that was with the entire
> > test set in the SSD tier, so there was no spindle activity.
> > 
> > The short answer is that you will need a lot more SSDs to hit your
> > target with Hammer. Or, if you can wait for Jewel, you may be able to
> > get by with only a little bit more.
> > 
> > Robert LeBlanc
> > 
> > Sent from a mobile device, please excuse any typos.
> > 
> > On Nov 7, 2015 1:24 AM, "Rémi BUISSON" <remi-buisson at orange.fr> wrote:
> > > 
> > > Hi guys,
> > > 
> > > I need your help to figure out performance issues on my Ceph
> > > cluster. I've read pretty much every thread on the net about this
> > > topic, but I didn't manage to get acceptable performance.
> > > In my company, we are planning to replace the NAS of our existing
> > > virtualization infrastructure with a Ceph cluster, in order to
> > > improve the platform's overall performance, scalability, and
> > > security. The NAS we currently have handles about 50k IOPS.
> > > 
> > > For this we bought:
> > > 2 x NFS servers: 2 x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 32 GB RAM, 2 x 10Gbps network interfaces (bonding)
> > > 3 x MON servers: 1 x Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz, 16 GB RAM, 2 x 10Gbps network interfaces (bonding)
> > > 2 x MDS servers: 2 x Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz, 32 GB RAM, 2 x 10Gbps network interfaces (bonding)
> > > 2 x OSD servers (cache): 2 x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 256 GB RAM, 2 x SSD INTEL SSDSC2BX200G4 (200 GB) for journals, 6 x SSD INTEL SSDSC2BX016T4R (1.6 TB) for data, 2 x 10Gbps network interfaces (bonding)
> > > 4 x OSD servers (storage): 2 x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 256 GB RAM, 4 x SSD TOSHIBA PX02SMF020 (200 GB) for journals, 18 x HGST Ultrastar HUC101818CS4204 (1.8 TB) for data, 2 x 10Gbps network interfaces (bonding)
> > > 
> > > The total is 84 OSDs (2 x 6 SSD + 4 x 18 HDD).
> > > 
> > > I created two pools of 4096 PGs each, one called rbd-cold-storage
> > > and the other rbd-hot-storage. As you may guess, rbd-cold-storage
> > > is composed of the 4 OSD servers with platter disks and
> > > rbd-hot-storage is composed of the 2 OSD servers with SSDs.
> > > On rbd-cold-storage, I created an RBD device which is mapped on the
> > > NFS server.
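> > > 
> > > (For context, a hypothetical sketch of that setup; the image name
> > > and size are illustrative:)
> > > 
> > > rbd create rbd-cold-storage/nfs-backing --size 1048576  # size in MB on hammer
> > > rbd map rbd-cold-storage/nfs-backing                    # appears as e.g. /dev/rbd0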
> > > 
> > > I benched each of the SSDs we have, and each can handle 40k IOPS.
> > > As my replication factor is 2, the theoretical performance of the
> > > cluster is (2 servers x 6 SSDs (OSD cache) x 40k) / 2 = 240k IOPS.
> > > 
> > > I'm currently benching the cluster with fio from one NFS server.
> > > Here is my fio job file:
> > > [global]
> > > ioengine=libaio
> > > iodepth=32
> > > runtime=300
> > > direct=1
> > > filename=/dev/rbd0
> > > group_reporting=1
> > > gtod_reduce=1
> > > randrepeat=1
> > > size=4G
> > > numjobs=1
> > > 
> > > [4k-rand-write]
> > > new_group
> > > bs=4k
> > > rw=randwrite
> > > stonewall
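> > > 
> > > (Assuming the job file above is saved as, say, rand-write.fio, it
> > > is run from the NFS server as:)
> > > 
> > > fio rand-write.fio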
> > > 
> > > The problem is I can't get more than 15k IOPS for writes. In my
> > > monitoring engine, I can see that each of the OSD (cache) SSDs is
> > > doing no more than 2.5k IOPS, which matches 6 x 2.5k = 15k IOPS. I
> > > don't expect to reach the theoretical value, but reaching 100k IOPS
> > > would be perfect.
> > > 
> > > My cluster is running on Debian jessie with the Ceph Hammer v0.94.5
> > > Debian package (compiled with the --with-jemalloc option; I also
> > > tried without). Here is my ceph.conf:
> > > 
> > > 
> > > [global]
> > > fsid = 5046f766-670f-4705-adcc-290f434c8a83
> > > 
> > > # basic settings
> > > mon initial members = a01cepmon001,a01cepmon002,a01cepmon003
> > > mon host = 10.10.69.254,10.10.69.253,10.10.69.252
> > > mon osd allow primary affinity = true
> > > # network settings
> > > public network = 10.10.69.128/25
> > > cluster network = 10.10.69.0/25
> > > 
> > > # auth settings
> > > auth cluster required = cephx
> > > auth service required = cephx
> > > auth client required = cephx
> > > 
> > > # default pools settings
> > > osd pool default size = 2
> > > osd pool default min size = 1
> > > osd pool default pg num = 8192
> > > osd pool default pgp num = 8192
> > > osd crush chooseleaf type = 1
> > > 
> > > # debug settings
> > > debug lockdep = 0/0
> > > debug context = 0/0
> > > debug crush = 0/0
> > > debug buffer = 0/0
> > > debug timer = 0/0
> > > debug journaler = 0/0
> > > debug osd = 0/0
> > > debug optracker = 0/0
> > > debug objclass = 0/0
> > > debug filestore = 0/0
> > > debug journal = 0/0
> > > debug ms = 0/0
> > > debug monc = 0/0
> > > debug tp = 0/0
> > > debug auth = 0/0
> > > debug finisher = 0/0
> > > debug heartbeatmap = 0/0
> > > debug perfcounter = 0/0
> > > debug asok = 0/0
> > > debug throttle = 0/0
> > > 
> > > throttler perf counter = false
> > > osd enable op tracker = false
> > > 
> > > ## OSD settings
> > > [osd]
> > > # OSD FS settings
> > > osd mkfs type = xfs
> > > osd mkfs options xfs = -f -i size=2048
> > > osd mount options xfs = rw,noatime,logbsize=256k,delaylog
> > > 
> > > # OSD journal settings
> > > osd journal block align = true
> > > osd journal aio = true
> > > osd journal dio = true
> > > 
> > > # Performance tuning
> > > filestore xattr use omap = true
> > > filestore merge threshold = 40
> > > filestore split multiple = 8
> > > filestore max sync interval = 10
> > > filestore queue max ops = 100000
> > > filestore queue max bytes = 1GiB
> > > filestore op threads = 20
> > > filestore journal writeahead = true
> > > filestore fd cache size = 10240
> > > osd op threads = 8
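> > > 
> > > A quick way to check which of these values the running OSDs
> > > actually picked up (osd.0 is an example; run it on the host
> > > carrying that OSD):
> > > 
> > > ceph daemon osd.0 config show | grep -E 'op_threads|queue_max'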
> > > 
> > > Disabling throttling doesn't change anything.
> > > So, after everything I have read, I would like to know: since those
> > > months-old threads, has anyone managed to fix this kind of problem?
> > > Any ideas or thoughts on how to improve this?
> > > 
> > > Thanks.
> > > 
> > > Rémi
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users at lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

