'[fuse-devel] RFC: Read/Write I/Os to block devices - Zero-Copy'


List:       fuse-devel
Subject:    [fuse-devel] RFC:  Read/Write I/Os to block devices - Zero-Copy
From:       Brad.Goodman () schange ! com
Date:       2008-06-17 15:18:46
Message-ID: OF6CBF5389.ECC70C10-ON8525746B.00506A39-8525746B.00541DC4 () schange ! com

So I've been thinking about this a lot, and have come up with what I 
believe is a pretty cool idea. I'm not completely there yet, but here's a 
rough sketch:

FUSE [seems to have been] written primarily with the concept of user-space 
"translation" filesystems in mind - i.e. things like gmailfs and sshfs, 
which translate data, involve user-space processes, and do other things 
that would be cumbersome to implement in kernel-space - and which, even if 
they were implemented there, would have to communicate with a whole bunch 
of user-space code anyway (like SSH, etc.)

It seems as though a lot of people (like myself, NTFS-3g, ext2inuserspace, 
etc.) are now trying to use it for "traditional" filesystems - "traditional" 
in the sense that they are backed by block devices. Examples would be 
[user-space implementations of] ext[234], NTFS (NTFS-3g) or ZFS. Examples 
would NOT include things like gmailfs or sshfs. 


However, FUSE is not really optimized the way normal in-kernel filesystems 
are in this respect - i.e. it pays for user<->kernel translation, context 
switching, copying, etc.

1. This implementation is geared primarily for "traditional" type 
filesystems - i.e. ones backed by block devices.

2. The patch aims to accomplish the following: Let user-space code do the 
"heavy-lifting" - i.e. the "logic" of the filesystem is implemented in 
normal FUSE code - handling of the VFS functions, etc. Just like FUSE does 
today. User-space code implements almost the entire filesystem.  HOWEVER 
when it comes to the actual I/O - get the user-space code out of the way 
and let the kernel code take over. This provides the following 
optimizations:

        a. Reducing context switching - user-space code may be avoided in 
most cases during normal I/O (read/write/readpage, etc)

        b. Allow the filesystem to "redirect" its I/Os to the underlying 
block devices. (I have re-included some code snippets below showing how 
in-kernel filesystems do this). This would be optimal for zero-copy I/O.

IMPLEMENTATION:

So FUSE and the user-space code basically act as they do today. 

"Step-One" - allows readpage to directly call the kernel's 
block_read_full_page() to complete a request. This is the heart of (or 
often the *sole* thing in) what "traditional" kernel filesystems need to do 
for a readpage(). The readpage() calls this, with a pointer to a function 
(within the filesystem) that helps the filesystem locate the block and 
block device. From there, the kernel takes care of the I/O (See details at 
bottom of message). This can be mmap, zero-copy from the page cache, etc. 
It just "redirects" the filesystem I/O to the block device. Very very 
simple.

This would give us all the zero-copy semantics.
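
To make the division of labor in "Step-One" concrete, here is a minimal 
user-space model (not kernel code) of how one page read is split into 
per-block lookups through a get_block-style callback. Everything named 
model_* is hypothetical and exists only for this sketch; the 4096-byte 
page size is an assumption, and the real block_read_full_page() works on 
struct page and buffer_head objects rather than these toy structs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12 /* assumption: 4096-byte pages */

/* Hypothetical stand-in for the (device, block) pair a get_block fills in. */
struct model_mapping {
    int      bdev_id;  /* illustrative id of the backing block device */
    uint64_t blocknr;  /* physical block number on that device */
    int      mapped;   /* nonzero once the lookup has succeeded */
};

/* Callback type: map logical file block `iblock`; return 0 on success. */
typedef int (*model_get_block_t)(uint64_t iblock, struct model_mapping *out);

/*
 * Walk every filesystem block covered by one page and ask get_block to
 * map each of them. Returns the number of blocks mapped, or -1 if `out`
 * cannot hold one entry per block in the page.
 */
int model_read_full_page(uint64_t page_index, unsigned blkbits,
                         model_get_block_t get_block,
                         struct model_mapping *out, size_t out_len)
{
    assert(blkbits <= MODEL_PAGE_SHIFT);

    size_t blocks_per_page = (size_t)1 << (MODEL_PAGE_SHIFT - blkbits);
    /* First logical block of the page: page index scaled by blocks/page. */
    uint64_t first = page_index << (MODEL_PAGE_SHIFT - blkbits);
    int mapped = 0;

    if (out_len < blocks_per_page)
        return -1;

    for (size_t i = 0; i < blocks_per_page; i++) {
        out[i].mapped = 0;
        if (get_block(first + i, &out[i]) == 0) {
            out[i].mapped = 1;
            mapped++;
        }
    }
    return mapped;
}

/* Toy get_block: the file is laid out contiguously from block 100 on. */
int model_linear_get_block(uint64_t iblock, struct model_mapping *out)
{
    out->bdev_id = 0;
    out->blocknr = 100 + iblock;
    return 0;
}
```

With 1024-byte blocks (blkbits = 10), page 2 covers logical blocks 8 
through 11, so the toy callback above maps it to physical blocks 108 
through 111. The point of the sketch is that the filesystem only answers 
"where is this block?"; moving the data is left to the generic code.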

"Step-Two" would aim to reduce or eliminate a lot of the user-space code 
execution and context switching involved during I/O operations. This would 
effectively be done by implementing a "fuse_get_block" (kernel-space) 
function (which will be passed to, and called by block_read_full_page - 
see code snip at bottom of message for detail).  A very basic stub of this 
is inherently required by "step one" - but a "fuller" implementation would:

        1. Have access to a "cached" or "prefetched" block map - 
optionally provided by the user-space code - possibly when the file was 
opened. It would be a list, generated by the user-space code and made 
available to the kernel code, of the block/blockdevs in the file. It may 
also be a *partial* list, or it may not exist. This would require an API to 
create this data or "attach" it to kernel space.

        2.  FUSE userspace would require a callback for the fuse_get_block 
function. If a requested block was not in the cache (or the cache did not 
exist), the kernel would pass the request through this callback to user 
space to resolve the block/blockdev.

        3. Once it has the block/blockdev, it would return it, and let the 
kernel continue with block_read_full_page to service the I/O (in the 
zero-copy generic block-device fashion described above).
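
The three points above can be sketched as a small user-space model of the 
lookup path: consult a (possibly partial) block map first, and fall back 
to a resolver callback only on a miss. This is an illustration under 
stated assumptions, not the proposed kernel code: every bmap_* name is 
hypothetical, the direct-mapped 16-slot table is an arbitrary choice, and 
the resolver stands in for the round trip to the user-space filesystem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BMAP_SLOTS 16 /* arbitrary size for this sketch */

struct bmap_entry {
    uint64_t iblock;   /* logical block within the file */
    int      bdev_id;  /* illustrative id of the backing block device */
    uint64_t blocknr;  /* physical block on that device */
    int      valid;
};

/* Stand-in for the user-space get_block callback; returns 0 on success. */
typedef int (*bmap_resolve_t)(uint64_t iblock, int *bdev_id,
                              uint64_t *blocknr, void *ctx);

struct bmap_cache {
    struct bmap_entry slots[BMAP_SLOTS];
    bmap_resolve_t    resolve; /* may be NULL: then a miss is an error */
    void             *ctx;
    unsigned          misses;  /* how often we had to leave the fast path */
};

void bmap_init(struct bmap_cache *c, bmap_resolve_t resolve, void *ctx)
{
    assert(c != NULL);
    for (size_t i = 0; i < BMAP_SLOTS; i++)
        c->slots[i].valid = 0;
    c->resolve = resolve;
    c->ctx = ctx;
    c->misses = 0;
}

/* Point 1: user space may populate the map at any time, fully or partly. */
void bmap_insert(struct bmap_cache *c, uint64_t iblock,
                 int bdev_id, uint64_t blocknr)
{
    struct bmap_entry *e = &c->slots[iblock % BMAP_SLOTS];

    e->iblock = iblock;
    e->bdev_id = bdev_id;
    e->blocknr = blocknr;
    e->valid = 1;
}

/* Points 2 and 3: serve hits locally, resolve misses through the callback. */
int bmap_get_block(struct bmap_cache *c, uint64_t iblock,
                   int *bdev_id, uint64_t *blocknr)
{
    struct bmap_entry *e = &c->slots[iblock % BMAP_SLOTS];

    if (e->valid && e->iblock == iblock) {
        *bdev_id = e->bdev_id;
        *blocknr = e->blocknr;
        return 0;
    }

    c->misses++;
    if (c->resolve == NULL ||
        c->resolve(iblock, bdev_id, blocknr, c->ctx) != 0)
        return -1;

    bmap_insert(c, iblock, *bdev_id, *blocknr); /* remember for next time */
    return 0;
}

/* Toy resolver: pretends the filesystem maps block n to 5000 + n. */
int bmap_toy_resolve(uint64_t iblock, int *bdev_id,
                     uint64_t *blocknr, void *ctx)
{
    (void)ctx;
    *bdev_id = 0;
    *blocknr = 5000 + iblock;
    return 0;
}
```

In this model a block inserted ahead of time never reaches the resolver, 
and a block resolved once is a hit on the next lookup, which is the 
behavior the "fuller" fuse_get_block is meant to have. A real 
implementation would also need locking and invalidation when the file's 
layout changes, neither of which is modeled here.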


Example:

So VFS calls would get serviced through normal means - except for readpage 
(and writepage) which [appear] to be the actual heart of all the I/Os. So 
if a block was in the cache it would look like:

Kernel/Syscall           FUSE Kernel                  Kernel

readpage()
                         block_read_full_page
                         (page,fuse_get_callback)
                                                      Call fuse_get_block
                         Return block in cache
                                                      Hand I/O to block driver

If the block was NOT in the cache, it would look like:

Kernel/Syscall           FUSE Kernel                  Kernel                    Userspace

readpage()
                         block_read_full_page
                         (page,fuse_get_callback)
                                                      Call fuse_get_block
                         Call user-space get_block
                                                                                get_block user callback
                                                                                to get block
                         Return block now in cache
                                                      Hand I/O to block driver


Note that as I said before, the user-space code has the option of 
populating the cache at any point in time, i.e. only when needed, or when 
the file is opened, etc - or of populating only a portion of it. 

This would be fantastic for a filesystem like ext2, for example - where 
the locations of the first "x" blocks of the file are stored in the 
"normal" inode record (I forget the exact term). When you opened, or even 
stat'ed, a file, you'd have to read this record anyway - so you might as 
well jam this data into the cache. (It's tiny). From this point onward, any 
readpage/writepage would be handled entirely in kernel space - unless the 
file was larger than those first "x" blocks, in which case an access 
beyond them would miss the cache and get vectored through the user-space 
get_block callback.
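
As a rough illustration of the ext2 case: an ext2 inode carries 12 direct 
block pointers, so those are the first "x" blocks that could be handed 
over at open or stat time. The sketch below is a user-space model only; 
the prefetch_* names and the entry struct are hypothetical, and it assumes 
the caller has already read the inode's block-pointer array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NDIR_BLOCKS 12 /* ext2 keeps 12 direct block pointers per inode */

/* Hypothetical entry handed to the block-map cache. */
struct prefetch_entry {
    uint32_t iblock;   /* logical block within the file */
    uint32_t blocknr;  /* physical block on the device */
};

/*
 * Collect the direct pointers for the blocks the file actually uses.
 * A zero pointer marks a hole and is skipped, so a later access to it
 * would still miss the cache. Returns the number of entries written.
 */
size_t prefetch_direct_blocks(const uint32_t *i_block, uint32_t file_blocks,
                              struct prefetch_entry *out, size_t out_len)
{
    assert(i_block != NULL && out != NULL);

    uint32_t limit = file_blocks < NDIR_BLOCKS ? file_blocks : NDIR_BLOCKS;
    size_t n = 0;

    for (uint32_t i = 0; i < limit && n < out_len; i++) {
        if (i_block[i] == 0)
            continue;
        out[n].iblock = i;
        out[n].blocknr = i_block[i];
        n++;
    }
    return n;
}

/* Blocks past the direct range need the indirect blocks, i.e. user space. */
int beyond_direct_blocks(uint32_t iblock)
{
    return iblock >= NDIR_BLOCKS;
}
```

For a file of up to 12 blocks with no holes, every lookup would then be 
answered from the prefetched entries; the first access to logical block 12 
or later is where the user-space get_block callback would come into play.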

No - I haven't started implementing it yet - but am fairly close to 
starting. I just wanted to run it by people for feedback and 
sanity-checking first.

Any comments?

Thanks,

-BKG

------ Original message on optimizing FUSE for block-device access -----

All I really need is for readpage() to do what every other filesystem 
does:

        block_read_full_page(page, my_get_block);

And for my_get_block to [effectively] do what it does for a normal block 
device:

          bh->b_bdev = I_BDEV(inode);
          bh->b_blocknr = iblock;
          set_buffer_mapped(bh);

Where "inode" and "iblock" would come from my user-space code, via a 
"new" FUSE request.

I believe this would allow FUSE filesystems to "redirect" I/Os to other 
block devices - allowing the system to use the same page cache and 
zero-copy semantics for readpage, sendfile, (and probably splice, though I 
don't know much about it).  This is basically what other filesystems 
appear to do - using their own "my_get_block" function to translate the 
iblock and inode values.

Outside of my own use, other filesystems backed by "normal" block-device 
stores (such as "ext2-in-userspace") would benefit from this.

Does this sound correct? 

-BKG