List:       opensolaris-lvm-discuss
Subject:    [lvm-discuss] Re: Comment: SVM default interlace and resync buffer
From:       Truong.Q.Nguyen@Sun.COM (Tony Nguyen)
Date:       2005-10-09 23:09:44
Message-ID: 434A046A.8020108@sun.com

D. Rock wrote:

> Tony Nguyen wrote:
>
>> Daniel
>>
>> Very interesting. We've not seen this, so I'd like to get more 
>> information about this behavior. How large an interlace value did 
>> you use (512k)?  It would be great if we could understand the 
>> procedure you used to measure/count the probability of full-line 
>> writes to the metadevice.
>>
>> Note that having the fs on top of the RAID-5 metadevice will change 
>> the behavior of I/O to the metadevice.  For example, a 1 MB I/O to a 
>> UFS filesystem will not necessarily result in a single 1 MB write to 
>> the underlying metadevice, since UFS has its own tunable I/O size.
>> -tony
>
>
> Well,
>
> it was partly my own fault.
>
> On x86 hardware the default value of maxphys is ridiculously small: 
> 56 KB (maybe for historical compatibility reasons?). After setting 
> maxphys to a reasonable value (1 MB), normal I/O performance 
> increased with the interlace factor.
>
> For UFS filesystems you mostly don't get enough data for full line 
> writes. You have to write ((n-1) * interlace) worth of data - also 
> properly aligned - to get the benefit of full line writes. But I did 
> another test: writing directly to the metadevice with no filesystem 
> in between.
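>
> (For example, with the four slices used below, a full line holds
> (4-1) * 8k = 24k of data at -i 8k, or (4-1) * 64k = 192k at -i 64k.)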
>
> Let's begin with:
> # metainit d100 -r c0d0s6 c0d1s6 c1d0s6 c1d1s6 -i 8k
> # dd if=/dev/zero of=/dev/md/rdsk/d100 bs=1024k &
> # iostat -xnz
>   537.8 1074.3  330.8 8863.3  0.0  0.3    0.0    0.2   0  21 c1d0
>   538.4 1074.1  331.1 8861.6  0.0  0.2    0.0    0.1   0  14 c1d1
>   547.8 1091.1  399.3 9001.2  0.0  0.3    0.0    0.2   0  33 c0d0
>   548.0 1090.3  396.4 8994.5  0.0  0.4    0.0    0.2   0  22 c0d1
>     0.0   12.5    0.0 12787.3  0.0  1.0    0.0   78.9   0  99 d100
>
> You can clearly see the -i 8k on the physical devices (8863.3/1074.3) 
> and the bs=1024k on the logical device (12787.3/12.5).
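>
> (Dividing kw/s by w/s gives the average I/O size: 8863.3 / 1074.3 is
> roughly 8.2k per write to each disk, and 12787.3 / 12.5 is roughly
> 1023k on d100, i.e. the 1024k block size.)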
>
> Now the same with -i 64k
>    95.6  189.9  793.6 12202.8  0.0  0.8    0.0    2.7   0  50 c1d0
>    96.2  189.9  793.9 12202.8  0.0  0.7    0.0    2.3   0  54 c1d1
>   109.5  213.8 1559.1 13737.7  0.0  0.9    0.0    2.8   1  66 c0d0
>   109.7  213.0 1533.9 13686.5  0.0  1.1    0.0    3.3   0  65 c0d1
>     0.0   17.7    0.0 18143.7  0.0  1.0    0.0   55.3   0  98 d100
>
> Ok, much better, as expected.
>
> But if you increase your interlace too much, so that a full line is 
> larger than bs=1024k (in this case -i 512k):
>    14.6   28.0 7158.5 14323.4  0.0  0.6    0.0   14.4   0  52 c1d0
>    15.0   27.6 7056.5 14118.8  0.0  0.6    0.0   13.9   0  50 c1d1
>    16.2   28.0 7057.1 14323.4  0.0  0.6    0.1   13.2   0  52 c0d0
>    17.0   27.6 7159.7 14118.8  0.0  0.6    0.0   13.7   0  51 c0d1
>     0.0   13.8    0.0 14111.9  0.0  1.0    0.0   71.4   0  98 d100
>
> Better than -i 8k, but worse than -i 64k.
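>
> (With four columns, a full line at -i 512k is (4-1) * 512k = 1536k of
> data, more than the 1024k writes coming in, so no single write can
> fill a full line; the nonzero kr/s above is presumably the resulting
> read-modify-write overhead.)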
>
> So if you want to use a RAID-5 metadevice as a raw device for a 
> database (a bad idea, BTW), you shouldn't set the interlace too large.
>
>
>
>
> But now on to filesystem benchmarks. Before each test I recreated the 
> RAID-5 and newfs'd a filesystem on top of it with default parameters. 
> Then I extracted usr/src/cmd (tons of small files) from the 
> OpenSolaris source distribution. I measured the time of extraction + 
> sync + umount (the tarball was put in /tmp):
>
> -i 8k      3:40.18
> -i 32k     2:51.54
> -i 64k     2:46.46
> -i 128k    2:47.73
> -i 256k    2:43.30
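>
> (Each run was roughly along the following lines, with the -i value
> varied per run; the tarball name is just illustrative:
>
> # metaclear d100
> # metainit d100 -r c0d0s6 c0d1s6 c1d0s6 c1d1s6 -i 64k
> # newfs /dev/md/rdsk/d100
> # mount /dev/md/dsk/d100 /mnt
> # ptime sh -c 'cd /mnt && tar xf /tmp/cmd.tar && sync && cd / && umount /mnt'
> )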
>
> I was a little surprised by the results. I thought the small file 
> sizes would also favour small interlace factors, but I was wrong.
>
> Maybe I can do some more detailed tests in a few weeks. At the moment 
> I don't have spare drives available. The above test setup was not 
> optimal:
> - ATA drives shared as master/slave on the same controller
> - write cache enabled
>
>
>
> Regards,
>
> Daniel

Daniel,

Sorry for the late response. I agree with your observations. However, 
I would add that better I/O performance is achieved if the RAID-5 
metadevice with a 512k interlace size has 3 components (so the data per 
line is 1024k).  See my testing below.

d30 -r c1t0d0s3 c1t0d1s3 c1t0d2s3 c1t0d3s3 -k -i 128b (64k interlace)
# dd if=/dev/zero of=/dev/md/rdsk/d30 bs=1024k
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.4    0.0    0.2    0.0  0.0  0.0    0.0    0.2   0   0 c2d0
    0.0   16.6    0.0 17034.5  0.0  1.0    0.0   57.3   0  95 d30
  100.0  200.2 1450.0 12864.3  0.0  0.9    0.0    3.1   0  78 c1t0d1
  100.0  200.3 1450.0 12870.8  0.0  1.1    0.0    3.7   0  74 c1t0d0
   89.1  178.1  744.5 11441.4  0.0  1.0    0.0    3.6   0  67 c1t0d2
   89.1  178.1  744.5 11441.4  0.0  0.8    0.0    3.0   0  71 c1t0d3

d40 -r c1t0d0s4 c1t0d1s4 c1t0d2s4 -k -i 1024b (512k interlace)
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.4    0.0    0.2    0.0  0.0  0.0    0.0    3.7   0   0 c2d0
    0.0   26.1    0.0 26687.8  0.0  0.9    0.0   35.7   0  93 d40
   26.1   52.1   13.0 26700.8  0.0  1.4    0.0   18.3   0  99 c1t0d1
   26.1   52.1   13.0 26700.8  0.0  1.4    0.0   17.6   0  99 c1t0d0
   26.2   52.1   13.1 26700.8  0.0  1.4    0.0   17.6   0  99 c1t0d2

d20 -r c1t0d0s1 c1t0d1s1 c1t0d2s1 c1t0d3s1 c1t0d4s1 -k -i 512b (256k interlace)
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.4    0.0    0.2    0.0  0.0  0.0    0.0    0.1   0   0 c2d0
    0.0   27.3    0.0 27999.1  0.0  0.9    0.0   33.9   0  93 d20
   27.2   54.6   13.6 13987.6  0.0  0.9    0.0   11.5   0  85 c1t0d1
   27.2   54.6   13.6 13987.6  0.0  1.0    0.0   11.9   0  85 c1t0d0
   27.2   54.6   13.6 13987.6  0.0  1.0    0.0   11.8   0  85 c1t0d2
   27.2   54.6   13.6 13987.6  0.0  0.9    0.0   10.6   0  85 c1t0d3
   27.2   54.6   13.6 13987.6  0.0  0.9    0.0   10.6   0  85 c1t0d4

I think we can generalize that we'll get the best raw I/O performance 
when the size of a full line equals the I/O size. Moreover, to optimize 
for raw I/O apps with a fixed I/O size, you should probably divide the 
I/O size by the number of data columns to get the best interlace size. 
What do you think?
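
Worked out against the numbers above: d30 has 4 columns (3 data 
columns), so a full line is 3 * 64k = 192k, which doesn't match the 
1024k dd writes and tops out around 17 MB/s on the metadevice; d40 
(2 data columns * 512k) and d20 (4 data columns * 256k) both have 
1024k lines and reach roughly 26-28 MB/s.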

[Filesystem test]
Your finding is also consistent with the testing I've done. Besides 
file creation with multiple processes, I've seen comparable or much 
improved performance with larger interlace sizes for the following 
tests: creation and deletion with a single process, populating a large 
number of files in a single directory, lock file creation, directory 
walks, filling a filesystem and deleting alternate files, filling a 
fragmented filesystem, and read I/O.  I'm not sure why you would expect 
better I/O for smaller files with a smaller interlace size. Could you 
expand on that?

Which Solaris release are you running?  Yes, the 56k maxphys value is 
there for backward compatibility with old hardware. It's interesting, 
since I don't see any I/O performance difference when changing maxphys 
to 1 MB. As you can see from my numbers above, the average I/O size to 
the disks is greater than the default maxphys value in all cases. Does 
this mean SVM uses some other means to determine its md_maxphys value? 
I don't actually remember and will need to look more into this :^)
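
(For reference, both can be overridden in /etc/system and take effect 
after a reboot; the 1 MB values below are just an example:

    set maxphys=1048576
    set md:md_maxphys=1048576
)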

Regards,

-tony
