'Fwd: Re: RFC: [PATCH] ext2 BLOCK_SIZE independence'


List:       linux-fsdevel
Subject:    Fwd: Re: RFC: [PATCH] ext2 BLOCK_SIZE independence
From:       Daniel Phillips <phillips () innominate ! de>
Date:       2001-03-12 17:29:12



----------  Forwarded Message  ----------
Subject: Re: RFC: [PATCH] ext2 BLOCK_SIZE independence
Date: Mon, 12 Mar 2001 17:12:36 +0000
From: Anton Altaparmakov <aia21@cam.ac.uk>


Daniel,

Any reason for not cc:-ing fsdevel? If not, feel free to forward both your 
mail and this reply to the list. - Hope you don't mind, but I have CC:-ed 
Andries Brouwer as well, as his input would be valuable here.

At 12:45 12/03/01, Daniel Phillips wrote:
>On Mon, 12 Mar 2001, Anton Altaparmakov wrote:
> > At 02:44 12/03/2001, Alexander Viro wrote:
> > >On Mon, 12 Mar 2001, Anton Altaparmakov wrote:
> > [snip]
> > > > 1) Makes the ext2 filesystem independent of the kernel's BLOCK_SIZE by
> > > > making use of the already defined but for some reason unused
> > > > EXT2_MIN_BLOCK_SIZE. This makes ext2 work on kernels with BLOCK_SIZE !=
> > > > 1024 and anyway, there is no good reason to depend on BLOCK_SIZE being
> > > > any particular value.
> > >
> > >Why would you need to redefine it? Not that I had any objections on the
> > >ext2 side, but... What's the point?
> >
> > Point is to eventually have the kernel working in 512 byte blocks rather
> > than 1024, which makes a lot more sense IMHO, as most block devices (most
> > hard drives) use a sector size of 512 bytes and NTFS for example has a
> > granularity of 512 bytes at the lowest level.
>
>What is sacred about 512 byte blocks?  Talking to hard disk engineers I
>get the impression that many disks are using whatever blocking scheme
>they want internally and doing a translation to create the appearance
>of 512 byte blocks.  The only reason we see 512 byte sectors by
>default is that Sony floppy disk controllers worked in 512 byte
>sectors and DOS->VFAT has never been able to understand a hard disk as
>anything other than a big floppy disk.  Right now, Linux is trying to
>be part of the problem, the only difference is that we are trying to
>impose an arbitrary 1K block measure on the world instead of the equally
>arbitrary 512 byte measure.
>
>Instead of continuing to be part of the problem, why don't we try to
>make some forward progress?  It would be so much nicer if we measured
>all device sizes and filesystem sizes in the kernel in units of
>device->block_shift, then we could get away completely from the
>awkward 1K *and* 512 byte magic numbers, not to mention fixing at least
>one bug, losing some filesystem/file size limits, making the code more
>efficient and improving readability.

Agreed. I can see one potential problem though: the arrays blk_size and 
blksize_size. blk_size will be in units of device->block_shift, too. Which 
means we have to make sure that any uses of these arrays play ball.
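
To make the unit change concrete, here is a minimal userspace sketch of 
the conversion each blk_size user would need. It assumes blk_size entries 
are currently held in 1 KiB units and that the device block size is 
1 << block_shift bytes; kb_to_device_blocks() is a made-up helper name 
for illustration, not an existing kernel function.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert a size expressed in 1 KiB units (the
 * assumed unit of today's blk_size entries) into units of the device
 * block size, i.e. 1 << block_shift bytes.
 */
static inline uint64_t kb_to_device_blocks(uint64_t size_kb,
                                           unsigned int block_shift)
{
    if (block_shift <= 10)
        return size_kb << (10 - block_shift);  /* smaller blocks: more of them */
    return size_kb >> (block_shift - 10);      /* larger blocks: fewer, rounds down */
}
```

With block_shift == 9 a 100 KiB device becomes 200 blocks; with 
block_shift == 12 it becomes 25. Note that the second branch rounds down, 
so a device that is not a multiple of the larger block size loses its tail.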

> > My reasoning for making ext2
> > independent of BLOCK_SIZE is that Linus might not want to have the
> > BLOCK_SIZE changed in the kernel but he might accept to have it as a
> > configure option, but for such an option to be viable it means all the
> > code in the kernel has to become independent of BLOCK_SIZE or at least
> > cope with it being a different value rather than assume that it equals
> > 1024. I just thought submitting my ext2 patch is as good a place to
> > start as any.
>
>I'd suggest leaving BLOCK_SIZE entirely alone - just rename it to
>BLOCK_SIZE_1024 or similar - and get started on the process of
>generalizing the 42 or so kernel bits one at a time, getting completely
>away from the idea of a fixed BLOCK_SIZE in a series of nice, easy,
>forward/backward-compatible steps.  My point is, it doesn't make
>any sense at all to change the constant BLOCK_SIZE to a different
>constant.  Block size is by nature variable so let's recognize that and
>handle it.

Point taken. But it is in a way already handled by the blk_size and 
blksize_size arrays. It's just that the default size is currently set to 
1024 and not 512. Even if we introduce a device->block_shift this still 
doesn't solve the problem. Many devices have blksize_size == 0, which 
means the default value of BLOCK_SIZE is used. And this is one of the 
problems I am trying to address. The default is too high at 1024 bytes, it 
needs to be 512.
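
The fallback described above can be sketched roughly as follows. This is 
an illustration in userspace C, not the kernel's actual lookup code: 
"table" stands in for one major's row of blksize_size, and 
DEFAULT_BLOCK_SIZE and effective_block_size() are names invented for the 
example.

```c
#include <stddef.h>

/* Stand-in for the kernel's BLOCK_SIZE default (currently 1024). */
#define DEFAULT_BLOCK_SIZE 1024

/*
 * Hypothetical lookup: return the block size recorded for a minor, or
 * the default when the driver registered no table or left the entry 0.
 */
static inline int effective_block_size(const int *table, size_t minor)
{
    if (table == NULL || table[minor] == 0)
        return DEFAULT_BLOCK_SIZE;
    return table[minor];
}
```

Any device whose entry is zero silently inherits the default, which is 
why the value of that default matters even once per-device sizes exist.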

> > Note that changing BLOCK_SIZE to 512 automagically solves the current
> > problem of not being able to access the last odd sector on a
> > disk/partition for example and it solves software RAID devices on NTFS
> > (NTFS quite happily will split a cluster into two parts to use up the
> > last not cluster size granular number of sectors on a partition, so the
> > cluster will be contained part in one partition and part in the next
> > partition in the RAID array! That really kills us at the moment as
> > software RAID uses BLOCK_SIZE blocks and assumes nobody is crazy enough
> > to split a block in two.)
>
>I noticed that, but using device->block_shift is a better solution.
>If we go to BLOCK_SIZE=512 we get the 2TB limit on everything unless we
>go to long long, which generates horrible code on ARCH=i386.  Using
>device->blocksize_bits is much better - it actually improves the 32
>bit code in many places.

Er, how is a 2TiB limit worse than a 4TiB limit? (I mean, it is worse, but 
not that much; if you are talking 2TiB you will only a few months later be 
talking 4TiB...)
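
For reference, the arithmetic behind those two figures, assuming block 
numbers are held in an unsigned 32-bit quantity (max_device_bytes() is 
just an illustrative name):

```c
#include <stdint.h>

/*
 * Largest addressable device size in bytes when block numbers are
 * 32 bits wide: 2^32 blocks of 2^block_shift bytes each.
 */
static inline uint64_t max_device_bytes(unsigned int block_shift)
{
    return (uint64_t)1 << (32 + block_shift);
}
```

This gives 2^41 bytes (2 TiB) for 512-byte blocks and 2^42 bytes (4 TiB) 
for 1024-byte blocks, so the two limits are a single doubling apart.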

And we will need to go to 64-bit code at some point, anyway. NTFS requires 
it already for sparse files as used by the usn journal for example, and I 
am writing the new driver using __[us]64 everywhere, where it is necessary, 
for that matter. IMHO it doesn't matter if the code is horrid, it's a fact 
of life. Using just shifts, addition/subtraction and logical operators can 
solve a lot of the code ugliness problems, especially in file system/block 
device related code where most numbers are powers of 2. True 64-bit 
multiplication and division are seldom necessary in these scenarios, 
divisions can often be simplified to dividing 64-bit by 32-bit numbers. All 
multiplications can be simplified to a shift left and an add. So there is 
not that much ugliness...
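
As a rough illustration of the shift-and-mask style meant here, the 
following userspace sketch converts between 64-bit byte positions and 
block numbers without any 64-bit multiply or divide. The helper names are 
invented for the example, and it assumes the block size is a power of two 
given as a shift smaller than 32.

```c
#include <stdint.h>

/* Block number containing byte position pos. */
static inline uint64_t pos_to_block(uint64_t pos, unsigned int shift)
{
    return pos >> shift;
}

/* Byte offset of pos within its block (mask instead of modulo). */
static inline uint32_t pos_to_offset(uint64_t pos, unsigned int shift)
{
    return (uint32_t)(pos & (((uint64_t)1 << shift) - 1));
}

/* Byte position of the start of a block (shift instead of multiply). */
static inline uint64_t block_to_pos(uint64_t block, unsigned int shift)
{
    return block << shift;
}
```

On a 32-bit machine each of these compiles down to a few shifts and masks 
on a register pair, which is much cheaper than calling out to a generic 
64-bit division routine.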

> > [snip]
> > >I don't see the bug in #1. As a matter of taste - why not, but if you
> > >really want to redefine the BLOCK_SIZE you'll most likely find that
> > >places where you want a new value are _seriously_ outnumbered by
> > >places where you'll need to preserve the old one. I.e. it's easier to
> > >replace BLOCK_SIZE with something else in the places where you want a
> > >new value.
>
>Here's a pointer to all those references for your browsing enjoyment:
>
>   http://innominate.org/~graichen/projects/lxr/ident?v=v2.4&i=BLOCK_SIZE
>   (BLOCK_SIZE referenced in 42 files)

That is an understatement of the problem. After that, do searches for 
blksize_size (referenced in 59 files) and blk_size (referenced in 43 files) 
and all of these will need checking/adapting to the new scheme. And there are 
some places that use 1024 (or 10 if bits) as a literal number instead of 
referencing BLOCK_SIZE, which might not be covered by the above searches.

> > Well, but then it would become confusing! BLOCK_SIZE is supposed to be the
> > default kernel block size. (I know we could rename it, but...) I realize
> > that it is not trivial.
>
>It's trivial - BLOCK_SIZE -> BLOCK_SIZE_1024, and don't reuse the old
>symbol.  Then at our leisure we can exterminate most of the
>BLOCK_SIZE_1024's, wherever the underlying device blocksize
>is a better measure (roughly speaking: everywhere but fs->stat, and even
>there we can easily drop the number of refs to 1/function instead of
>~5).
>
> > I know as I have converted all subsystems that I
> > use personally and am running a BLOCK_SIZE = 512 kernel on my VMware setup
> > and have been doing for ages without problems (well, there are a few minor
> > things left to sort out but it does work and is stable, the only thing I
> > get are spurious messages from ll_rw_block about submitting wrong sized
> > requests, and that only when I run fdisk(!), but then again ll_rw_block
> > will be dropping that limitation in the future anyway).
> >
> > I guess the alternative to changing the BLOCK_SIZE would be to fix
> > ll_rw_block and friends so they can return the last sector (by modifying
> > the out of bounds check, jumping out of the fast path and performing a
> > proper check perhaps) as well as software raid and that would suffice but
> > it's a hack.
>
>*Ick*.  I guess that's the reaction you wanted?

Yes; as I said, it is a hack...
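
The last-odd-sector problem that this hack would paper over is easy to 
show with numbers. A minimal sketch, assuming 512-byte sectors, a block 
shift of at least 9, and device sizes small enough that the shifts do not 
overflow; both helper names are made up for the example.

```c
#include <stdint.h>

/* Whole blocks of 2^block_shift bytes that fit in 'sectors' 512-byte sectors. */
static inline uint64_t whole_blocks(uint64_t sectors, unsigned int block_shift)
{
    return (sectors << 9) >> block_shift;
}

/* Trailing sectors that no whole block covers, hence unreachable by block I/O. */
static inline uint64_t unreachable_sectors(uint64_t sectors,
                                           unsigned int block_shift)
{
    uint64_t covered = (whole_blocks(sectors, block_shift) << block_shift) >> 9;

    return sectors - covered;
}
```

A 7-sector partition holds only 3 whole 1024-byte blocks, leaving 1 sector 
out of reach; with 512-byte blocks the remainder is always 0.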

>I strongly support the idea that BLOCK_SIZE is broken and needs to be
>fixed but I don't think you're going far enough.

Agreed, it is a good idea to take it further. But we need to work out 
exactly how. One thing I mentioned above which we need to sort out is what 
we do if devices leave their device->block_shift == 0. Or are we going to 
forbid this? There must be a default minimum size, and we need to have that 
changed to 512 from 1024.

Also I think a better name for device->block_shift would be 
device->blksize_shift to remain consistent with the existing blksize_size 
naming scheme. Do we want to actually have a device->block_shift or do we 
want this to be a static array a-la blksize_size and blk_size? If the 
former, we should probably consider destroying blksize_size and blk_size 
and making them appear in device->blksize_size/blk_size? - I think Andries 
has ideas about blk_size and blksize_size for 2.5 already. Andries?

Comments?

Anton


-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://sourceforge.net/projects/linux-ntfs/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/
-------------------------------------------------------

-- 
Daniel
