List:       evms-devel
Subject:    Re: [Evms-devel] Just when you thought you were done with options...
From:       Greg Lehey <grog () FreeBSD ! org>
Date:       2001-10-04 22:33:02

On Wednesday,  3 October 2001 at 23:30:02 -0600, Andreas Dilger wrote:
> On Oct 04, 2001  12:04 +0930, Greg Lehey wrote:
>>> Granted, it will be a bit of a chore to read/write this metadata (you need
>>> to iterate each runlist to a buffer to write, and detect contiguous runs
>>> and compact them on read), but in general LVM doesn't need to do that very
>>> often, so performance is not critical (and the one-time hit doing this may
>>> very well be offset by the potentially large reduction in memory
>>> usage).
>>
>> I'm not sure that I understand this.  I probably need to go and read
>> the code first, but that would take too long.  I would expect that on
>> startup, EVMS (or the respective plugin) will read the configuration
>> information and transform it into the appropriate structure (which,
>> depending on the volume layout, may be unchanged from the on-disk
>> configuration, or it could be some form of run list).
>
> (sorry in advance for the long email)
>
> <example truncated>

Yes, I think we're on the same page here.  When you get to the
following point, you have a list which is very similar to what Vinum
uses:

> With the PROPOSED runlist-based scheme, we would have it in memory as:
>
> logical_start logical_end	physical_start	physical_end	device
>     0		 5119		    0		 5119		1
>  5120		10239		10240		15459		1
> 10240		15459		 4096		 9215		3

In Vinum, each plex has an array of pointers to its subdisk structures.
The information is actually stored in the subdisk structures as plex
offset (logical start), drive offset (physical start), length, and a
pointer to the drive (device) structure.  This does not have the
locality of reference that a single table would have, but I don't
believe that this is a problem, since the object structures need to be
accessed for other information anyway.
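
Just to make the comparison concrete, the two layouts look roughly
like this (a sketch only; the field names are mine and don't match the
real Vinum or EVMS structures):

    struct drive;                       /* the underlying device */

    /* Vinum-style: the mapping lives in per-subdisk objects */
    struct subdisk {
            long long plexoffset;       /* logical start within the plex */
            long long driveoffset;      /* physical start on the drive */
            long long length;           /* length in blocks */
            struct drive *drive;        /* pointer to the drive structure */
    };

    struct plex {
            int subdisks;               /* number of subdisks */
            struct subdisk **sdlist;    /* array of pointers to subdisks */
    };

    /* Runlist-style: the mapping is a small self-contained list */
    struct runlist {
            long long logical_start, logical_end;
            long long physical_start, physical_end;
            struct drive *device;
            struct runlist *next;
    };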

> To map the same logical block we do (simple search, could be better if needed):
> 	a) 12345 > 0 && 12345 < 5119		-> no
> 	b) 12345 > 5120 && 12345 < 10239	-> no
> 	c) 12345 > 10240 && 12345 < 15459	-> yes
> 		physical block = 12345 - 10240 + 4096 = 6201;
> 		device = PV3;
>
> While the searching _may_ be slower (especially for fragmented LVs with many
> runlists), in many cases it will be only one or two runlists, so this would
> mean _very little_ searching in practice.  Benefits are:
> a) All of the runlists can easily fit in L1 cache (on the order of 40 bytes
>    per runlist: 4 * __u64 + kdev_t/pointer + next pointer), so no L1/L2 cache
>    misses when trying to access a 500kB LE array.

Yes, this appears to be an advantage over the way Vinum does it.
Whether it really is an advantage depends a lot on the other
processing: if you need to access the objects anyway, you may just be
adding another table to a long list.
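
In code, the search you describe is only a few lines.  Roughly (a
sketch only, using inclusive bounds and the runlist structure sketched
above):

    /* Map a logical block to (device, physical block) by walking the
     * runlist.  Sketch only; no error handling or locking. */
    struct drive *map_block(struct runlist *rl, long long lblk,
                            long long *pblk)
    {
            for (; rl != NULL; rl = rl->next)
                    if (lblk >= rl->logical_start
                        && lblk <= rl->logical_end) {
                            *pblk = lblk - rl->logical_start
                                + rl->physical_start;
                            return rl->device;
                    }
            return NULL;                /* not mapped */
    }

For block 12345 in your example this returns PV3 and block 6201, the
same answer as the hand calculation.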

> b) No 64-bit divide, no 64-bit modulus (many hundreds of CPU cycles).

Nothing in comparison to the many millions of cycles you spend waiting
for the transfer.  We've discussed this particular issue before, and
came to the conclusion that the cost of a divide would probably go
down in the noise.  Not dividing also increases the likelihood of
additional hot spots on striped drives, which could have a much more
drastic effect on performance.

> Conceivably the cost of the searching (if any) would be outweighed by the
> fact that you nearly always have to go to main memory to get the PE table,
> and then you have to do a very slow (on ia32 at least) 64-bit divide/modulus

We discussed this before:

> On Thursday, 19 July 2001 at 21:34:33 -0600, Andreas Dilger wrote:
>> Greg Lehey writes:
>>> On Friday, 20 July 2001 at 11:44:02 +1000, Andrew Clausen wrote:
>>>> On Fri, Jul 20, 2001 at 10:52:28AM +0930, Greg Lehey wrote:
>>>>> Recall that we're talking here in terms of disk access speeds.  An
>>>>> ia64 can perform a large number of divisions in that time.
>>>>
>>>> So what?  The CPU may be in demand for other things.
>>>
>>> So this doesn't happen very often.  If you're performing, say, 5000
>>> transfers a second, that might translate to 10,000 divisions.  On even
>>> the slowest ia32, this doesn't make it out of the noise.
>>
>> Except that we're talking about 64-bit divisions on ia32 here, and just
>> as a quick check I wrote a program that looks essentially like:
>>
>>       long long result, val, div;
>>       result = val / div;
>>
>> and compiled it with gcc -O2, and then disassembled it.  It gave:
>>
>> 152 asm instructions for __divdi3 and 5 asm instructions for the
>> caller,
>
> Well, OK, including all branches probably.
>
>> so if you are doing 10,000 of these a second (not at all
>> unreasonable on a busy storage server), you use up 1.5MHz just doing
>> the divisions, vs 0.01MHz doing shifts.
>
> Well, assuming you execute all the instructions.  Note that a cache
> miss is probably worth more than 150 instructions.
>
> I tried this on a 750 MHz Athlon.  I didn't count the instructions,
> but the count looked similar.  2**31 divisions and a loop and things
> took 225 seconds.  I make that about 100 ns per iteration, so 10,000
> of them per second would use about 0.1% of the processor.

(end of quote)
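
For reference, the shift you were comparing against only works when
the PE size is a power of two; something like this (sketch only, the
constants are mine):

    /* With a power-of-two PE size, divide and modulus collapse into a
     * shift and a mask. */
    #define PE_SHIFT        12                  /* 4096-block PEs */
    #define PE_SIZE         (1LL << PE_SHIFT)
    #define PE_MASK         (PE_SIZE - 1)

    long long pe_index(long long lblk)  { return lblk >> PE_SHIFT; }
    long long pe_offset(long long lblk) { return lblk & PE_MASK; }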

> (I don't know if GCC is smart enough to detect that you are doing both and
> cache the modulus from the divide for later use; I think not).

No, it doesn't.  There should be a way to optimize that, though.
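
One way to get both results with a single division, whatever the
compiler does, is to derive the remainder from the quotient by hand
(sketch):

    /* One call to __divdi3 for the quotient; the remainder falls out
     * with a multiply and a subtract. */
    void divmod64(long long n, long long d,
                  long long *q, long long *r)
    {
            *q = n / d;                 /* the only 64-bit division */
            *r = n - *q * d;            /* no second division needed */
    }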

> As for generating contiguous I/Os, it may help a bit, but not much.  You
> will potentially need to split I/Os on each PE boundary for the array case,
> while you only need to split I/Os at each runlist boundary in that case.
> Since PEs are relatively large anyways, this will not affect overall
> performance in most cases.

Yes, that's correct.  I was thinking of striping, which is a
completely different issue.
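
For what it's worth, the split you describe is just a clamp at the end
of the current run (sketch, using the runlist structure from above):

    /* How many blocks of a request starting at lblk can go to the run
     * that contains it; anything beyond that must be split into a
     * second request. */
    long long blocks_this_run(struct runlist *rl, long long lblk,
                              long long len)
    {
            long long left = rl->logical_end - lblk + 1;
            return len < left ? len : left;
    }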

Greg
--
See complete headers for address and phone numbers
