List: fuse-devel
Subject: [fuse-devel] RFC: Read/Write I/Os to block devices - Zero-Copy
From: Brad.Goodman () schange ! com
Date: 2008-06-17 15:18:46
Message-ID: OF6CBF5389.ECC70C10-ON8525746B.00506A39-8525746B.00541DC4 () schange ! com
So I've been thinking about this a lot, and have come up with what I
believe is a pretty cool idea. I'm not completely there yet, but here's a
rough sketch:
FUSE [seems to have been] written primarily with the concept of user-space
"translation" filesystems - i.e. things like gmailfs and sshfs, which
translate data, involve user-space processes, and do other things that
would be cumbersome to implement in kernel-space - and which, even if they
were implemented there, would have to communicate with a whole bunch of
user-space code anyway (like SSH, etc.)
It seems as though a lot of people (like myself, NTFS-3G, ext2-in-userspace,
etc.) are now trying to use it for "traditional" filesystems, in the sense
of filesystems which are backed by block devices. Examples would be
[user-space implementations of] ext[234], NTFS (NTFS-3G) or ZFS. Examples
would NOT include things like gmailfs or sshfs.
However, FUSE is not really intended to be as optimized as normal in-kernel
filesystems in this respect - i.e. it incurs user<->kernel translation,
context switching, copying, etc.
1. This implementation is geared primarily for "traditional" type
   filesystems - i.e. ones backed by block devices.
2. The implementation aims to accomplish the following: Let user-space
   code do the "heavy-lifting" - i.e. the "logic" of the filesystem is
   implemented in normal FUSE code - handling of the VFS functions, etc.
   Just like FUSE does today. User-space code implements almost the entire
   filesystem. HOWEVER when it comes to the actual I/O - get the
   user-space code out of the way and let the kernel code take over.
   This provides the following optimizations:
   a. Reducing context switching - user-space code may be avoided in
      most cases during normal I/O (read/write/readpage, etc)
   b. Allowing the filesystem to "redirect" its I/Os to the underlying
      block devices. (I have re-included some code snippets below of how
      in-kernel filesystems do this). This would be optimum for zero-copy
      I/O.
IMPLEMENTATION:
So FUSE and the user-space code basically act as they do today.
"Step-One" - allows readpage to directly call the kernel's
block_read_full_page() to complete a request. This is the heart of (and
often the *sole* thing in) what in-kernel "traditional" filesystems need
to do for a readpage(). The readpage() calls this with a pointer to a
function (within the filesystem) that helps the filesystem locate the
block and block device. From there, the kernel takes care of the I/O (see
details at bottom of message). This can be mmap, zero-copy from the page
cache, etc. It just "redirects" the filesystem I/O to the block device.
Very very simple.
This would give us all the zero-copy semantics.
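
To make "Step-One" a bit more concrete, here is a rough user-space model
(NOT kernel code) of the contract between block_read_full_page() and the
get_block function it is handed: the kernel walks the blocks that make up
one page and asks the filesystem to map each of them. Everything named
sim_* below, linear_get_block, and the page and block sizes are made up for
illustration; the real kernel works on struct buffer_head and does the
actual I/O submission, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_PAGE_SIZE  4096u   /* assumed page size */
#define SIM_BLOCK_SIZE 1024u   /* assumed filesystem block size */
#define SIM_BLOCKS_PER_PAGE (SIM_PAGE_SIZE / SIM_BLOCK_SIZE)

/* Stand-in for the buffer_head fields that a get_block function fills in. */
struct sim_buffer {
    int      bdev;      /* hypothetical small integer id for the device */
    uint64_t blocknr;   /* block number on that device */
    int      mapped;    /* set once the mapping is known */
};

/* Stand-in for the kernel's get_block callback type. */
typedef int (*sim_get_block_t)(uint64_t ino, uint64_t iblock,
                               struct sim_buffer *bh);

/*
 * Model of the mapping half of block_read_full_page(): map every block
 * of one page through the filesystem's callback. Returns the number of
 * blocks mapped, or the callback's error code if it fails.
 */
int sim_read_full_page(uint64_t ino, uint64_t page_index,
                       sim_get_block_t get_block,
                       struct sim_buffer out[SIM_BLOCKS_PER_PAGE])
{
    assert(get_block != NULL);
    uint64_t first = page_index * SIM_BLOCKS_PER_PAGE;
    int mapped = 0;

    for (unsigned i = 0; i < SIM_BLOCKS_PER_PAGE; i++) {
        out[i].mapped = 0;
        int err = get_block(ino, first + i, &out[i]);
        if (err)
            return err;
        if (out[i].mapped)
            mapped++;   /* the real kernel would queue I/O for this block */
    }
    return mapped;
}

/* Example callback: the file is one contiguous run starting at block 1000. */
int linear_get_block(uint64_t ino, uint64_t iblock, struct sim_buffer *bh)
{
    (void)ino;
    bh->bdev = 0;
    bh->blocknr = 1000 + iblock;
    bh->mapped = 1;
    return 0;
}
```

With these assumed sizes, one page covers four blocks, so reading page 2
of the example file asks the callback about logical blocks 8 through 11.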
"Step-Two" would aim to reduce or eliminate a lot of the user-space code
execution and context switching involved during I/O operations. This would
effectively be done by implementing a "fuse_get_block" (kernel-space)
function (which will be passed to, and called by, block_read_full_page -
see code snip at bottom of message for detail). A very basic stub of this
is inherently required by "step one" - but a "fuller" implementation would:
   1. Have access to a "cached" or "prefetched" block map - optionally
      provided by the user-space code - possibly when the file was
      opened. It would be a list, generated by the user-code and made
      available to the kernel-code, of the block/blockdevs in the file.
      It may also be a *partial* list, or it may not exist. This would
      require an API to create this or "attach" the data to kernel space.
   2. FUSE userspace would require a callback for the fuse_get_block
      function. If a requested block was not in the cache (or the cache
      did not exist), the kernel would pass the request to this
      user-space callback to resolve the block/blockdev.
   3. Once it has the block/blockdev, it would return it, and let the
      kernel continue with block_read_full_page to service the I/O (in
      the zero-copy generic block-device fashion described above).
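
The three points above could be modelled roughly like this. It is a plain
user-space sketch, not the proposed kernel patch: the names block_map,
map_attach, sim_fuse_get_block and user_get_block_t are invented here, the
fixed-size array stands in for whatever structure a real block map would
use, and the "upcall" is just a function pointer rather than a real FUSE
request.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_SLOTS 64   /* assumed fixed cache size per open file */

struct block_loc {
    int      bdev;      /* hypothetical id of the backing block device */
    uint64_t blocknr;   /* block number on that device */
};

/* Hypothetical shape of the user-space resolver (point 2). */
typedef int (*user_get_block_t)(uint64_t iblock, struct block_loc *loc,
                                void *ctx);

struct block_map {
    struct block_loc loc[MAP_SLOTS];
    uint8_t          valid[MAP_SLOTS];
    user_get_block_t resolve;   /* may be NULL if there is no callback */
    void            *ctx;
    unsigned         upcalls;   /* how often user space had to be asked */
};

void map_init(struct block_map *m, user_get_block_t resolve, void *ctx)
{
    assert(m != NULL);
    for (size_t i = 0; i < MAP_SLOTS; i++)
        m->valid[i] = 0;
    m->resolve = resolve;
    m->ctx = ctx;
    m->upcalls = 0;
}

/* Point 1: user space attaches a (possibly partial) list ahead of time. */
int map_attach(struct block_map *m, uint64_t iblock, struct block_loc loc)
{
    if (iblock >= MAP_SLOTS)
        return -1;
    m->loc[iblock] = loc;
    m->valid[iblock] = 1;
    return 0;
}

/* Points 2 and 3: answer from the cache, else ask user space and cache it. */
int sim_fuse_get_block(struct block_map *m, uint64_t iblock,
                       struct block_loc *out)
{
    if (iblock < MAP_SLOTS && m->valid[iblock]) {
        *out = m->loc[iblock];          /* fast path: no user-space code */
        return 0;
    }
    if (m->resolve == NULL)
        return -1;                      /* nobody can resolve this block */

    m->upcalls++;
    int err = m->resolve(iblock, out, m->ctx);
    if (err)
        return err;
    if (iblock < MAP_SLOTS) {           /* remember it for next time */
        m->loc[iblock] = *out;
        m->valid[iblock] = 1;
    }
    return 0;
}

/* Example resolver: pretend the file lives at block 5000 onward on dev 1. */
int demo_resolver(uint64_t iblock, struct block_loc *loc, void *ctx)
{
    (void)ctx;
    loc->bdev = 1;
    loc->blocknr = 5000 + iblock;
    return 0;
}
```

The upcalls counter is only there to make the point of the proposal
visible: a block that was attached up front, or resolved once already,
never costs another trip to user space.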
Example:
So VFS calls would get serviced through normal means - except for readpage
(and writepage) which [appear] to be the actual heart of all the I/Os. So
if a block was in the cache it would look like:
Kernel/Syscall    FUSE Kernel              Kernel
readpage()
                                           block_read_full_page
                                             (page,fuse_get_callback)
                  Call fuse_get_block
                  Return block in cache
                                           Hand I/O to block
                                             driver
If the block was NOT in the cache, it would look like:
Kernel/Syscall    FUSE Kernel                Kernel                      Userspace
readpage()
                                             block_read_full_page
                                               (page,fuse_get_callback)
                  Call fuse_get_block
                  Call user-space get_block
                                                                         get_block user callback
                                                                           to get block
                  Return Block now in cache
                                             Hand I/O to block
                                               driver
Note that, as I said before, the user-space code has the option of
populating the cache at any point in time, i.e. only when needed, or when
the file is opened, etc - or of populating only a portion of it.
This would be fantastic for a filesystem like ext2, for example - where
the first "x" blocks of the file are stored in the "normal" inode record
(I forget the exact term). When you opened, or even stat'ed a file, you'd
have to read this record anyway - so you might as well jam this data in
the cache. (It's tiny). From this point onward - any readpage/writepage
would be all in kernel space - unless the file was larger than those first
"x" blocks - at which point an access beyond them would hit uncached data
and get vectored through the user-space get_block callback.
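
As a rough illustration of the ext2 case: the ext2 inode carries 12 direct
block pointers (followed by single, double and triple indirect pointers),
so the first 12 logical blocks of a file can be resolved from the inode
alone. The sketch below is user-space C with a cut-down inode struct; the
names mini_ext2_inode, prefetch_map, prefill_from_inode and needs_upcall
are invented for this example, and treating a zero pointer as "stop and
let the slow path deal with it" is a simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NDIR_BLOCKS 12   /* ext2 keeps 12 direct block pointers in the inode */
#define N_BLOCKS    15   /* plus single, double and triple indirect pointers */

/* Only the parts of the on-disk inode that this sketch needs. */
struct mini_ext2_inode {
    uint32_t i_size;              /* file size in bytes */
    uint32_t i_block[N_BLOCKS];   /* block pointers, direct ones first */
};

/* The "tiny" prefetched map handed over at open/stat time. */
struct prefetch_map {
    uint32_t blocknr[NDIR_BLOCKS];
    size_t   count;   /* number of leading logical blocks already resolved */
};

/* Copy the direct pointers that are in use into the prefetched map. */
size_t prefill_from_inode(const struct mini_ext2_inode *ino,
                          uint32_t block_size, struct prefetch_map *map)
{
    assert(ino != NULL && map != NULL && block_size > 0);

    /* Number of blocks the file occupies, rounded up. */
    uint64_t nblocks = ((uint64_t)ino->i_size + block_size - 1) / block_size;

    map->count = 0;
    for (size_t i = 0; i < NDIR_BLOCKS && i < nblocks; i++) {
        if (ino->i_block[i] == 0)
            break;   /* unallocated block: leave it to the slow path */
        map->blocknr[i] = ino->i_block[i];
        map->count++;
    }
    return map->count;
}

/* Would a read of logical block iblock need the user-space callback? */
int needs_upcall(const struct prefetch_map *map, uint64_t iblock)
{
    return iblock >= map->count;
}
```

For a small file every block is covered by the prefetched map, so no I/O
on it would need user space at all; only larger files ever reach the
callback, and only for blocks past the direct pointers.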
No - I haven't started implementing it yet - but am fairly close to
starting. I just wanted to run it by people for feedback and
sanity-checking first.
Any comments?
Thanks,
-BKG
------ Original message on optimizing FUSE for block-device access -----
All I really need is for readpage() to do what every other filesystem
does:

    block_read_full_page(page, my_get_block);

And for my_get_block to [effectively] do what it does for a normal block
device:

    bh->b_bdev = I_BDEV(inode);
    bh->b_blocknr = iblock;
    set_buffer_mapped(bh);

Where "inode" and "iblock" would come from my user-space code, via a
"new" FUSE request.
I believe this would allow FUSE filesystems to "redirect" I/Os to other
block devices - allowing the system to use the same page cache and
zero-copy semantics for readpage, sendfile, (and probably splice, though I
don't know much about it). This is basically what other filesystems
appear to do - using their own "my_get_block" function to translate the
iblock and inode values.
Outside of my own use, other filesystems backed by "normal" block-device
stores (such as "ext2-in-userspace") would benefit from this.
Does this sound correct?
-BKG