List: linux-fsdevel
Subject: Fwd: Re: RFC: [PATCH] ext2 BLOCK_SIZE independence
From: Daniel Phillips <phillips () innominate ! de>
Date: 2001-03-12 17:29:12
---------- Forwarded Message ----------
Subject: Re: RFC: [PATCH] ext2 BLOCK_SIZE independence
Date: Mon, 12 Mar 2001 17:12:36 +0000
From: Anton Altaparmakov <aia21@cam.ac.uk>
Daniel,
Any reason for not cc:-ing fsdevel? If not, feel free to forward both your
mail and this reply to the list. - Hope you don't mind, but I have CC:-ed
Andries Brouwer as well, as his input would be valuable here.
At 12:45 12/03/01, Daniel Phillips wrote:
>On Mon, 12 Mar 2001, Anton Altaparmakov wrote:
> > At 02:44 12/03/2001, Alexander Viro wrote:
> > >On Mon, 12 Mar 2001, Anton Altaparmakov wrote:
> > [snip]
> > > > 1) Makes the ext2 filesystem independent of the kernel's BLOCK_SIZE by
> > > > making use of the already defined but for some reason unused
> > > > EXT2_MIN_BLOCK_SIZE. This makes ext2 work on kernels with BLOCK_SIZE !=
> > > > 1024 and anyway, there is no good reason to depend on BLOCK_SIZE being
> > > > any particular value.
> > >
> > >Why would you need to redefine it? Not that I had any objections on the
> > >ext2 side, but... What's the point?
> >
> > Point is to eventually have the kernel working in 512 byte blocks rather
> > than 1024, which makes a lot more sense IMHO, as most block devices (most
> > hard drives) use a sector size of 512 bytes and NTFS for example has a
> > granularity of 512 bytes at the lowest level.
>
>What is sacred about 512 byte blocks? Talking to hard disk engineers I
>get the impression that many disks are using whatever blocking scheme
>they want internally and doing a translation to create the appearance
>of 512 byte blocks. The only reason we see 512 byte sectors by
>default is that Sony floppy disk controllers worked in 512 byte
>sectors and DOS->VFAT has never been able to understand a hard disk as
>anything other than a big floppy disk. Right now, Linux is trying to
>be part of the problem, the only difference is that we are trying to
>impose an arbitrary 1K block measure on the world instead of the equally
>arbitrary 512 byte measure.
>
>Instead of continuing to be part of the problem, why don't we try to
>make some forward progress? It would be so much nicer if we measured
>all device sizes and filesystem sizes in the kernel in units of
>device->block_shift, then we could get away completely from the
>awkward 1K *and* 512 byte magic numbers, not to mention fixing at least
>one bug, losing some filesystem/file size limits, making the code more
>efficient and improving readability.
Agreed. I can see one potential problem though: the arrays blk_size and
blksize_size. blk_size will be in units of device->block_shift, too. Which
means we have to make sure that any uses of these arrays play ball.
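
To make the unit change concrete, here is a minimal sketch in plain C (user-space, not kernel code). It assumes blk_size entries are currently counts of 1K units and shows how such a count maps to units of the device block size. The struct blkdev_sketch and both helper names are hypothetical; block_shift is the field proposed in this thread, not an existing kernel member.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-device record; "block_shift" is the field proposed
 * in this thread, not something the kernel has today. */
struct blkdev_sketch {
    unsigned int block_shift;  /* log2 of the device block size in bytes */
    uint32_t     nr_blocks;    /* device size in units of 1 << block_shift */
};

/* Convert a size held in 1K units into device blocks.
 * Assumes 9 <= block_shift < 32. For block_shift < 10 the result can
 * overflow 32 bits on very large devices; callers must check that. */
static uint32_t kb_to_dev_blocks(uint32_t size_kb, unsigned int block_shift)
{
    assert(block_shift >= 9 && block_shift < 32);
    if (block_shift >= 10)
        return size_kb >> (block_shift - 10);
    return size_kb << (10 - block_shift);
}

/* Device size in bytes: one widening shift, no 64-bit multiply. */
static uint64_t dev_size_bytes(const struct blkdev_sketch *d)
{
    return (uint64_t)d->nr_blocks << d->block_shift;
}
```

Every user of blk_size would need a conversion of this kind (or would need to switch to the new unit outright), which is why each use has to be audited.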
> > My reasoning for making ext2
> > independent of BLOCK_SIZE is that Linus might not want to have the
> > BLOCK_SIZE changed in the kernel but he might accept to have it as a
> > configure option, but for such an option to be viable it means all the
> > code in the kernel has to become independent of BLOCK_SIZE or at least
> > cope with it being a different value rather than assume that it equals
> > 1024. I just thought submitting my ext2 patch is as good a place to start
> > as any.
>
>I'd suggest leaving BLOCK_SIZE entirely alone - just rename it to
>BLOCK_SIZE_1024 or similar - and get started on the process of
>generalizing the 42 or so kernel bits one at a time, getting completely
>away from the idea of a fixed BLOCK_SIZE in a series of nice, easy,
>forward/backward-compatible steps. My point is, it doesn't make
>any sense at all to change the constant BLOCK_SIZE to a different
>constant. Block size is by nature variable so let's recognize that and
>handle it.
Point taken. But it is in a way already handled by the blk_size and
blksize_size arrays. It's just that the default size is currently set to
1024 and not 512. Even if we introduce a device->block_shift this still
doesn't solve the problem. Many devices have blksize_size == 0 which
implies to use the default value of BLOCK_SIZE. And this is one of the
problems I am trying to address. The default is too high at 1024 bytes, it
needs to be 512.
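
The fallback described above can be sketched as follows. This is a user-space illustration, not the kernel's code: the function name, the table parameter (standing in for one major's row of blksize_size) and DEFAULT_BLOCK_SIZE_SKETCH are all hypothetical, and the default of 512 is the value being argued for here, not the current one.

```c
#include <stddef.h>

/* The default argued for in this mail; the kernel's BLOCK_SIZE is 1024. */
#define DEFAULT_BLOCK_SIZE_SKETCH 512

/* table: per-minor block sizes in bytes for one major number, or NULL
 * if the driver registered no table. A zero entry means "use default". */
static int effective_block_size(const int *table, unsigned int minor)
{
    if (table != NULL && table[minor] != 0)
        return table[minor];
    return DEFAULT_BLOCK_SIZE_SKETCH;
}
```

Whatever replaces the arrays, a device that leaves its size unset still needs some rule like this, so the question of what the default should be does not go away.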
> > Note that changing BLOCK_SIZE to 512 automagically solves the current
> > problem of not being able to access the last odd sector on a
> > disk/partition for example and it solves software RAID devices on NTFS
> > (NTFS quite happily will split a cluster into two parts to use up the last
> > not cluster size granular number of sectors on a partition, so the cluster
> > will be contained part in one partition and part in the next partition in
> > RAID array! That really kills us at the moment as software RAID uses
> > BLOCK_SIZE blocks and assumes nobody is crazy enough to split a block in
> > two.)
>
>I noticed that, but using device->block_shift is a better solution.
>If we go to BLOCK_SIZE=512 we get the 2TB limit on everything unless we
>go to long long, which generates horrible code on ARCH=i386. Using
>device->blocksize_bits is much better - it actually improves the 32
>bit code in many places.
Er, how is a 2TiB limit worse than a 4TiB limit? (I mean, it is worse, but
not that much; if you are talking 2TiB you will only a few months later be
talking 4TiB...)
And we will need to go to 64-bit code at some point, anyway. NTFS requires
it already for sparse files as used by the usn journal for example, and I
am writing the new driver using __[us]64 everywhere, where it is necessary,
for that matter. IMHO it doesn't matter if the code is horrid, it's a fact
of life. Using just shifts, addition/subtraction and logical operators can
solve a lot of the code ugliness problems, especially in file system/block
device related code where most numbers are powers of 2. True 64-bit
multiplication and division are seldom necessary in these scenarios,
divisions can often be simplified to dividing 64-bit by 32-bit numbers. All
multiplications can be simplified to a shift left and an add. So there is
not that much ugliness...
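
As a rough illustration of the point about shifts, here is a minimal user-space sketch, assuming block sizes are powers of two and block numbers are 32 bits wide. The function names are hypothetical. It also shows where the 2TiB and 4TiB figures come from: 2^32 blocks of 512 bytes is 2^41 bytes, and 2^32 blocks of 1024 bytes is 2^42 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of a block: a widening shift replaces a 64-bit multiply. */
static uint64_t block_to_offset(uint32_t block, unsigned int shift)
{
    assert(shift < 32);
    return (uint64_t)block << shift;
}

/* Block containing a byte offset: a shift replaces a 64-bit divide. */
static uint64_t offset_to_block(uint64_t offset, unsigned int shift)
{
    assert(shift < 32);
    return offset >> shift;
}

/* Position inside the block: a mask replaces a 64-bit modulo. */
static uint32_t offset_in_block(uint64_t offset, unsigned int shift)
{
    assert(shift < 32);
    return (uint32_t)(offset & (((uint64_t)1 << shift) - 1));
}

/* Bytes addressable with 32-bit block numbers at a given block shift. */
static uint64_t max_bytes_32bit(unsigned int shift)
{
    assert(shift < 32);
    return (uint64_t)1 << (32 + shift);
}
```

For example, max_bytes_32bit(9) is 2^41 (2TiB) and max_bytes_32bit(10) is 2^42 (4TiB). None of these helpers needs a true 64-bit multiply or divide.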
> > [snip]
> > >I don't see the bug in #1. As a matter of taste - why not, but if you
> > >really want to redefine the BLOCK_SIZE you'll most likely find that
> > >places where you want a new value are _seriously_ outnumbered by
> > >places where you'll need to preserve the old one. I.e. it's easier to
> > >replace BLOCK_SIZE with something else in the places where you want a
> > >new value.
>
>Here's a pointer to all those references for your browsing enjoyment:
>
> http://innominate.org/~graichen/projects/lxr/ident?v=v2.4&i=BLOCK_SIZE
> (BLOCK_SIZE referenced in 42 files)
That is an understatement of the problem. After that, do searches for
blksize_size (referenced in 59 files) and blk_size (referenced in 43 files);
all of these will need checking/adapting to the new scheme. And there are
some places that hard-code 1024 (or 10 when counting bits) instead of
referencing BLOCK_SIZE, which the above searches might not cover.
> > Well, but then it would become confusing! BLOCK_SIZE is supposed to be the
> > default kernel block size. (I know we could rename it, but...) I realize
> > that it is not trivial.
>
>It's trivial - BLOCK_SIZE -> BLOCK_SIZE_1024, and don't reuse the old
>symbol. Then at our leisure we can exterminate most of the
>BLOCK_SIZE_1024's, wherever the underlying device blocksize
>is a better measure (roughly speaking: everywhere but fs->stat, and even
>there we can easily drop the number of refs to 1/function instead of
>~5).
>
> > I know as I have converted all subsystems that I
> > use personally and am running a BLOCK_SIZE = 512 kernel on my VMware setup
> > and have been doing for ages without problems (well, there are a few minor
> > things left to sort out but it does work and is stable, the only thing I
> > get are spurious messages from ll_rw_block about submitting wrong sized
> > requests, and that only when I run fdisk(!), but then again ll_rw_block
> > will be dropping that limitation in the future anyway).
> >
> > I guess the alternative to changing the BLOCK_SIZE would be to fix
> > ll_rw_block and friends so they can return the last sector (by modifying
> > the out of bounds check, jumping out of the fast path and performing a
> > proper check perhaps) as well as software raid and that would suffice but
> > it's a hack.
>
>*Ick*. I guess that's the reaction you wanted?
Yes; as I said, it is a hack...
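
The last-sector problem behind this hack comes down to rounding. A minimal user-space sketch, assuming 512-byte sectors and a block shift of at least 9 (the function names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Whole blocks that fit into a partition of nr_sectors 512-byte sectors. */
static uint32_t addressable_blocks(uint32_t nr_sectors, unsigned int block_shift)
{
    assert(block_shift >= 9 && block_shift < 32);
    return nr_sectors >> (block_shift - 9);
}

/* Sectors at the end of the partition that no whole block covers. */
static uint32_t stranded_sectors(uint32_t nr_sectors, unsigned int block_shift)
{
    assert(block_shift >= 9 && block_shift < 32);
    return nr_sectors & ((1u << (block_shift - 9)) - 1);
}
```

With 1K blocks (shift 10), a partition of 2001 sectors has 1000 addressable blocks and one stranded sector. With 512-byte blocks (shift 9) nothing is stranded, which is why changing the unit fixes the problem without touching the bounds check.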
>I strongly support the idea that BLOCK_SIZE is broken and needs to be
>fixed but I don't think you're going far enough.
Agreed, it is a good idea to take it further. But we need to work out
exactly how. One thing I mentioned above which we need to sort out is what
we do if devices leave their device->block_shift == 0. Or are we going to
forbid this? There must be a default minimum size, and we need to have that
changed from 1024 to 512.
Also I think a better name for device->block_shift would be
device->blksize_shift to remain consistent with the existing blksize_size
naming scheme. Do we want to actually have a device->block_shift or do we
want this to be a static array a-la blksize_size and blk_size? If the
former, we should probably consider destroying blksize_size and blk_size
and making them appear in device->blksize_size/blk_size? - I think Andries
has ideas about blk_size and blksize_size for 2.5 already. Andries?
Comments?
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://sourceforge.net/projects/linux-ntfs/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/
-------------------------------------------------------
--
Daniel