List:       linux-fsdevel
Subject:    problem with orangefs readpage...
From:       Mike Marshall <hubcap () omnibond ! com>
Date:       2020-12-31 21:51:53
Message-ID: CAOg9mSQkkZtqBND-HKb2oSB8jxT6bkQU1LuExo0hPsEUhcMrPw () mail ! gmail ! com

Greetings...

I hope some of you will suffer through reading this long message :-) ...

Orangefs isn't built to do small IO. Reading a big file in
page-cache-sized chunks is slow and painful.
I tried to write orangefs_readpage so that it would do a reasonable
sized hard IO, fill the page that was being called for, and then
go ahead and fill a whole bunch of the following pages into the
page cache with the extra data in the IO buffer.

Anywho... I thought this code was working pretty much like I
designed it to work, but on closer inspection I see that it is
not, and I thought I'd ask for some help or suggestions.

Here's the core of the loop in orangefs_readpage that tries to fill extra
pages, and what follows is a description of how it is not working
the way I designed it to work:

        while there's still data in the IO buffer
        {
            index++;
            slot_index++;
            next_page = find_get_page(inode->i_mapping, index);
            if (next_page) {
                gossip_debug(GOSSIP_FILE_DEBUG,
                    "%s: found next page, quitting\n",
                    __func__);
                put_page(next_page);
                goto out;
            }
            next_page = find_or_create_page(inode->i_mapping,
                            index,
                            GFP_KERNEL);
            /*
             * I've never hit this, leave it as a printk for
             * now so it will be obvious.
             */
            if (!next_page) {
                printk("%s: can't create next page, quitting\n",
                    __func__);
                goto out;
            }
            kaddr = kmap_atomic(next_page);
            orangefs_bufmap_page_fill(kaddr,
                        buffer_index,
                        slot_index);
            kunmap_atomic(kaddr);
            SetPageUptodate(next_page);
            unlock_page(next_page);
            put_page(next_page);
        }

So... my design was that orangefs_readpage would get called, the
needed page would be supplied and a bunch of the following pages
would get filled as well. That way if more pages were needed,
they would be in the page cache already.
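
To make the intended design concrete, here is a userspace toy model of it. This is a sketch, not the kernel code: the names toy_readpage, toy_read_file, cached and MAX_PAGES are made up for illustration, the "page cache" is just a bool array, and the only things taken from the real code are the 4096-byte page size and the 524288-byte hard IO size.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_SZ   4096u
#define READ_SIZE 524288u        /* size of one "hard IO" in this model */
#define MAX_PAGES 256u           /* hypothetical limit of the toy cache */

bool cached[MAX_PAGES];          /* toy page cache: true = page is uptodate */
unsigned hard_ios;               /* number of simulated hard IOs so far */

/* Toy readpage: one hard IO, fill the requested page, then prefill followers. */
void toy_readpage(unsigned index, unsigned file_pages)
{
    unsigned avail = file_pages - index;      /* pages left in the file */
    unsigned io_pages = READ_SIZE / PAGE_SZ;  /* pages one IO can carry */
    unsigned n = avail < io_pages ? avail : io_pages;

    hard_ios++;
    cached[index] = true;                     /* the page that was asked for */
    for (unsigned i = 1; i < n; i++) {
        if (cached[index + i])                /* found next page, quitting */
            break;
        cached[index + i] = true;             /* prefill from the IO buffer */
    }
}

/* Read a file sequentially, one page at a time; return the hard IO count. */
unsigned toy_read_file(unsigned file_pages)
{
    assert(file_pages <= MAX_PAGES);
    memset(cached, 0, sizeof(cached));
    hard_ios = 0;
    for (unsigned p = 0; p < file_pages; p++)
        if (!cached[p])                       /* only a cache miss calls readpage */
            toy_readpage(p, file_pages);
    return hard_ios;
}
```

In this model toy_read_file(9) returns 1 and toy_read_file(200) returns 2: one hard IO per 128 pages, which is what the design was aiming for.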

My plan "kind of" works when a file is read a page at a time:

  /pvfsmnt/nine is nine pages long.
  -rwxr-xr-x. 1 root root 36864 Dec 29 11:09 /pvfsmnt/nine

  dd if=/pvfsmnt/nine of=/tmp/nine bs=4096 count=9

orangefs_readpage gets called for the first four pages, and then my
prefill kicks in and fills the next pages, and the right data ends
up in /tmp/nine. I, of course, wished and planned for orangefs_readpage
to only get called once. I don't understand why it gets called four
times, which results in three extraneous expensive hard IOs.

A nine-page file is just an example. In general, when files are read
a page at a time, orangefs_readpage gets called four times and the
rest of the pages (up to the design limit) are pre-filled.

When a file gets read all at once, though, my design
fails in a different way...

  dd if=/pvfsmnt/nine of=/tmp/nine bs=36864 count=1

In the above, orangefs_readpage gets called nine times, with
eight extraneous expensive hard IOs. Further investigation into
larger and larger block sizes shows a pattern.

I hope it is apparent to some of you why my page-at-a-time
reads don't start pre-filling until after four calls to orangefs_readpage.
Below are some more examples that show what happens with larger and larger
block sizes. Hopefully the pattern there will be suggestive as well.

/pvfsmnt/N is a file exactly N pages long.

Key: orangefs_readpage->X times foo, bar, baz, ..., qux

X = number of calls to orangefs_readpage.
foo = number of bytes fetched from Orangefs on the first read.
bar = number of bytes fetched from Orangefs on the extraneous 2nd read.
baz = number of bytes fetched from Orangefs on the extraneous 3rd read.
qux = number of bytes fetched from Orangefs on the extraneous last read.


  dd if=/pvfsmnt/32 of=/tmp/32 bs=131072 count=1
       orangefs_readpage->32 times 131072, 126976, 122880, ..., 4096
       orangefs_bufmap_page_fill->0 times

  dd if=/pvfsmnt/33 of=/tmp/33 bs=135168 count=1
       orangefs_readpage->32 times 135168, 131072, 126976, ..., 8192
       orangefs_bufmap_page_fill->1 time

  dd if=/pvfsmnt/34 of=/tmp/34 bs=139264 count=1
       orangefs_readpage->32 times 139264, 135168, 131072, ..., 12288
       orangefs_bufmap_page_fill->2 times

  dd if=/pvfsmnt/35 of=/tmp/35 bs=143360 count=1
       orangefs_readpage->32 times 143360, 139264, 135168, ..., 16384
       orangefs_bufmap_page_fill->3 times

  dd if=/pvfsmnt/36 of=/tmp/36 bs=147456 count=1
       orangefs_readpage->32 times 147456, 143360, 139264, ..., 20480
       orangefs_bufmap_page_fill->4 times

  dd if=/pvfsmnt/37 of=/tmp/37 bs=151552 count=1
       orangefs_readpage->32 times 151552, 147456, 143360, ..., 24576
       orangefs_bufmap_page_fill->5 times

  dd if=/pvfsmnt/38 of=/tmp/38 bs=155648 count=1
       orangefs_readpage->32 times 155648, 151552, 147456, ..., 28672
       orangefs_bufmap_page_fill->6 times

  dd if=/pvfsmnt/39 of=/tmp/39 bs=159744 count=1
       orangefs_readpage->32 times 159744, 155648, 151552, ..., 32768
       orangefs_bufmap_page_fill->7 times

  dd if=/pvfsmnt/40 of=/tmp/40 bs=163840 count=1
       orangefs_readpage->32 times 163840, 159744, 155648, ..., 36864
       orangefs_bufmap_page_fill->8 times

  dd if=/pvfsmnt/41 of=/tmp/41 bs=167936 count=1
       orangefs_readpage->32 times 167936, 163840, 159744, ..., 40960
       orangefs_bufmap_page_fill->9 times

                     .
                     .
                     .

  dd if=/pvfsmnt/47 of=/tmp/47 bs=192512 count=1
       orangefs_readpage->32 times 192512, 188416, 184320, ..., 65536
       orangefs_bufmap_page_fill->15 times

  dd if=/pvfsmnt/48 of=/tmp/48 bs=196608 count=1
       orangefs_readpage->32 times 196608, 192512, 188416, ..., 69632
       orangefs_bufmap_page_fill->16 times

  dd if=/pvfsmnt/49 of=/tmp/49 bs=200704 count=1
       orangefs_readpage->32 times 200704, 196608, 192512, ..., 73728
       orangefs_bufmap_page_fill->17 times

                     .
                     .
                     .

  dd if=/pvfsmnt/63 of=/tmp/63 bs=258048 count=1
       orangefs_readpage->32 times 258048, 253952, 249856, ..., 131072
       orangefs_bufmap_page_fill->31 times

  dd if=/pvfsmnt/64 of=/tmp/64 bs=262144 count=1
       orangefs_readpage->32 times 262144, 258048, 253952, ..., 135168
       orangefs_bufmap_page_fill->32 times

  dd if=/pvfsmnt/65 of=/tmp/65 bs=266240 count=1
       orangefs_readpage->32 times 266240, 262144, 258048, ..., 139264
       orangefs_bufmap_page_fill->33 times

                     .
                     .
                     .

  dd if=/pvfsmnt/127 of=/tmp/127 bs=520192 count=1
       orangefs_readpage->32 times 520192, 516096, 512000, ..., 393216
       orangefs_bufmap_page_fill->95 times

  dd if=/pvfsmnt/128 of=/tmp/128 bs=524288 count=1
       orangefs_readpage->32 times 524288, 520192, 516096, ..., 397312
       orangefs_bufmap_page_fill->96 times

It kind of starts over here, since the hard IOs are all 524288 bytes.
# grep 524288 fs/orangefs/inode.c
    read_size = 524288;
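
The numbers above follow a simple arithmetic pattern, which can be written down as a short sketch. It only restates the observations and explains nothing about the cause. The helper names are hypothetical. The cap at 32 calls is taken from the runs listed above (with the nine-page case giving nine calls), and the behaviour for file sizes not shown, and for files larger than 128 pages, is an assumption.

```c
#define PAGE_SZ            4096u
#define READ_SIZE          524288u  /* read_size, per the grep above */
#define OBSERVED_MAX_CALLS 32u      /* seen in the runs above, not derived */

/* Number of orangefs_readpage calls seen for a single-block read of N pages. */
unsigned observed_calls(unsigned file_pages)
{
    return file_pages < OBSERVED_MAX_CALLS ? file_pages : OBSERVED_MAX_CALLS;
}

/* Bytes fetched from Orangefs on the k-th call (k = 0 is the first read). */
unsigned bytes_on_call(unsigned file_pages, unsigned k)
{
    if (k >= observed_calls(file_pages))
        return 0;
    /* Each call fetches what is left of the file from page k onward... */
    unsigned remaining = (file_pages - k) * PAGE_SZ;
    /* ...capped at one hard IO (assumption for files over 128 pages). */
    return remaining < READ_SIZE ? remaining : READ_SIZE;
}

/* Number of orangefs_bufmap_page_fill calls seen for an N-page file. */
unsigned observed_page_fills(unsigned file_pages)
{
    return file_pages > OBSERVED_MAX_CALLS ? file_pages - OBSERVED_MAX_CALLS : 0;
}

/* Total bytes pulled from Orangefs over all calls for one read of the file. */
unsigned long total_bytes_fetched(unsigned file_pages)
{
    unsigned long total = 0;
    for (unsigned k = 0; k < observed_calls(file_pages); k++)
        total += bytes_on_call(file_pages, k);
    return total;
}
```

For the 32-page file this gives 131072 bytes on the first call, 4096 on the last, and 2162688 bytes fetched in total to read a 131072-byte file, which is why the extra calls hurt so much.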

Thanks for any help y'all can give. I'll of course keep on trying
to understand what is going on.

-Mike