List:       openldap-bugs
Subject:    Re: (ITS#9017) Improving performance of commit sync in Windows
From:       kriszyp () gmail ! com
Date:       2019-09-18 18:56:34
Message-ID: E1iAf7y-0003yj-BP () gauss ! openldap ! net

Checking on this again: is this still a possibility for merging into LMDB?
The fix is still working great (improved performance) on our systems.
Thanks,
Kris

On Mon, Jun 17, 2019 at 1:04 PM Kris Zyp <kriszyp@gmail.com> wrote:

> Is this still being considered/reviewed? Let me know if there are any
> other changes you would like me to make. This patch has continued to yield
> significant and reliable performance improvements for us, and it seems like
> it would be nice for it to be available to other Windows users.
>
> On Fri, May 3, 2019 at 3:52 PM Kris Zyp <kriszyp@gmail.com> wrote:
>
>> For the sake of putting this in the email thread (other code discussion
>> in GitHub), here is the latest squashed commit of the proposed patch (with
>> the on-demand, retained overlapped array to reduce re-malloc and opening
>> event handles):
>> https://github.com/kriszyp/node-lmdb/commit/726a9156662c703bf3d453aab75ee222072b990f
>>
>>
>>
>> Thanks,
>> Kris
>>
>>
>>
>> From: Kris Zyp <kriszyp@gmail.com>
>> Sent: April 30, 2019 12:43 PM
>> To: Howard Chu <hyc@symas.com>; openldap-its@OpenLDAP.org
>> Subject: RE: (ITS#9017) Improving performance of commit sync in Windows
>>
>>
>>
>> > What is the point of using writemap mode if you still need to use
>> WriteFile
>>
>> > on every individual page?
>>
>>
>>
>> As I understood from the documentation, and have observed, writemap mode
>> is faster (and uses less temporary memory) because it doesn't require
>> mallocs to allocate dirty pages (docs: "This is faster and uses fewer
>> mallocs"). To be clear, though, LMDB is so fast and efficient that in
>> sync mode it takes enormous transactions before the time spent allocating
>> and building the dirty pages for the updated b-tree comes anywhere close
>> to the time spent waiting for the disk flush, even with an SSD. The more
>> pertinent question is efficiency: CPU cycles consumed rather than wall
>> time. When I ran my tests this morning of 100 (sync) transactions with
>> 100 puts per transaction, times varied quite a bit, but running with
>> writemap enabled typically averaged about 500ms of CPU, and with writemap
>> disabled it typically averaged around 600ms. Not a huge difference, but
>> still definitely worthwhile, I think.
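>>
>> (For illustration, one way to collect those CPU numbers on Windows is
>> GetProcessTimes; this is just a measurement sketch, not code from the
>> patch, and the benchmark body named below is hypothetical:)
>>
>>   #include <windows.h>
>>   #include <stdio.h>
>>
>>   /* Total kernel+user CPU consumed by this process so far, in ms. */
>>   static double process_cpu_ms(void)
>>   {
>>       FILETIME createT, exitT, kernelT, userT;
>>       ULARGE_INTEGER k, u;
>>       GetProcessTimes(GetCurrentProcess(), &createT, &exitT, &kernelT, &userT);
>>       k.LowPart = kernelT.dwLowDateTime; k.HighPart = kernelT.dwHighDateTime;
>>       u.LowPart = userT.dwLowDateTime;   u.HighPart = userT.dwHighDateTime;
>>       return (double)(k.QuadPart + u.QuadPart) / 10000.0; /* 100ns units */
>>   }
>>
>>   /* usage:
>>    *   double before = process_cpu_ms();
>>    *   run_sync_txn_benchmark();          // hypothetical: 100 txns x 100 puts
>>    *   printf("CPU: %.1f ms\n", process_cpu_ms() - before);
>>    */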
>>
>>
>>
>> Caveat emptor: LMDB performance with sync interactions on Windows is one
>> of the most frustratingly erratic things to measure. It is sunny outside
>> right now, and the times could be different when it starts raining later,
>> but this is what I saw this morning...
>>
>>
>>
>> > What is the performance difference between your patch using writemap,
>> and just
>>
>> > not using writemap in the first place?
>>
>>
>>
>> Running 1000 sync transactions on a 3GB db with a single put per
>> transaction, without writemap and without the patch, took about 60
>> seconds. It took about 1 second with the patch with writemap mode
>> enabled! (There is no significant difference in sync times with writemap
>> enabled or disabled once the patch is applied.) So the difference was
>> huge in my test. And not only that: without the patch, the CPU usage was
>> actually _higher_ during those 60 seconds (close to 100% of a core) than
>> during the one-second run with the patch (close to 50%). There are
>> certainly tests I have run where the differences are not as large (doing
>> small commits on large dbs accentuates the differences), but the patch
>> always seems to win. It could also be that my particular configuration
>> causes bigger differences (an SSD drive, and maybe a more fragmented file?).
>>
>>
>>
>> Anyway, I added error handling for the malloc and fixed/changed the
>> other things you suggested. I'd be happy to make any other changes you want.
>> The updated patch is here:
>>
>>
>> https://github.com/kriszyp/node-lmdb/commit/25366dea9453749cf6637f43ec17b9b62094acde
>>
>>
>>
>> > OVERLAPPED* ov = malloc((pagecount - keep) * sizeof(OVERLAPPED));
>>
>> > Probably this ought to just be pre-allocated based on the maximum
>> number of dirty pages a txn allows.
>>
>>
>>
>> I wasn't sure I understood this comment. Are you suggesting we malloc(MDB_IDL_UM_MAX
>> * sizeof(OVERLAPPED)) for each environment, and retain it for the life of
>> the environment? I think that is 4MB, if my math is right, which seems like
>> a lot of memory to keep allocated (we usually have a lot of open
>> environments). If the goal is to reduce the number of mallocs, how about we
>> retain the OVERLAPPED array, and only free and re-malloc if the previous
>> allocation wasn't large enough? Then there isn't unnecessary allocation,
>> and we only malloc when there is a bigger transaction than any previous. I
>> put this together in a separate commit, as I wasn't sure if this is what you
>> wanted (can squash if you prefer):
>> https://github.com/kriszyp/node-lmdb/commit/2fe68fb5269c843e2e789746a17a4b2adefaac40
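>>
>> (To illustrate the retain-and-grow idea, here is a rough sketch; the
>> names are made up for the example and this is not the actual patch code:)
>>
>>   #include <windows.h>
>>   #include <stdlib.h>
>>   #include <string.h>
>>
>>   /* Retained for the life of the environment (illustrative globals; in
>>    * real code this state would live with the environment). */
>>   static OVERLAPPED *ov_list;   /* NULL until the first commit needs it */
>>   static size_t      ov_size;   /* entries currently allocated */
>>
>>   /* Ensure the retained array can hold `need` entries, re-allocating only
>>    * when a transaction is bigger than any previous one. (Pre-allocating
>>    * MDB_IDL_UM_MAX entries up front would be roughly 128K x 32 bytes,
>>    * i.e. about 4MB per environment on 64-bit.) */
>>   static int ov_reserve(size_t need)
>>   {
>>       if (need > ov_size) {
>>           free(ov_list);
>>           ov_list = malloc(need * sizeof(OVERLAPPED));
>>           if (!ov_list) {
>>               ov_size = 0;
>>               return -1;            /* caller reports ENOMEM */
>>           }
>>           ov_size = need;
>>       }
>>       memset(ov_list, 0, need * sizeof(OVERLAPPED));
>>       return 0;
>>   }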
>>
>>
>>
>> Thank you for the review!
>>
>>
>>
>> Thanks,
>> Kris
>>
>>
>>
>> From: Howard Chu <hyc@symas.com>
>> Sent: April 30, 2019 7:12 AM
>> To: kriszyp@gmail.com; openldap-its@OpenLDAP.org
>> Subject: Re: (ITS#9017) Improving performance of commit sync in Windows
>>
>>
>>
>> kriszyp@gmail.com wrote:
>>
>> > Full_Name: Kristopher William Zyp
>>
>> > Version: LMDB 0.9.23
>>
>> > OS: Windows
>>
>> > URL:
>> https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332b7fc4ce9
>>
>> > Submission from: (NULL) (71.199.6.148)
>>
>> >
>>
>> >
>>
>> > We have seen very poor performance on the sync of commits on large
>> databases in
>>
>> > Windows. On databases with 2GB of data, in writemap mode, the sync of
>> even small
>>
>> > commits is consistently well over 100ms (without writemap it is faster,
>> but
>>
>> > still slow). It is expected that a sync should take some time while
>> waiting for
>>
>> > disk confirmation of the writes, but more concerning is that these sync
>>
>> > operations (in writemap mode) are instead dominated by nearly 100%
>> system CPU
>>
>> > utilization, so operations that require only sub-millisecond b-tree
>>
>> > updates are then dominated by very large amounts of system CPU
>> cycles during
>>
>> > the sync phase.
>>
>> >
>>
>> > I think that the fundamental problem is that FlushViewOfFile seems to
>> be an O(n)
>>
>> > operation where n is the size of the file (or map). I presume that
>> Windows is
>>
>> > scanning the entire map/file for dirty pages to flush, I'm guessing
>> because it
>>
>> > doesn't have an internal index of all the dirty pages for every
>> file/map-view in
>>
>> > the OS disk cache. Therefore, this turns into an extremely expensive,
>> CPU-bound
>>
>> > operation to find the dirty pages for a large file and initiate their
>> writes,
>>
>> > which, of course, is contrary to the whole goal of a scalable database
>> system.
>>
>> > And FlushFileBuffers is also relatively slow. We have attempted
>> to batch
>>
>> > as many operations into a single transaction as possible, but this is
>> still a very
>>
>> > large overhead.
>>
>> >
>>
>> > The Windows documentation for FlushFileBuffers itself warns about the
>> inefficiencies of
>>
>> > this function (
>> https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-flushfilebuffers
>> ).
>>
>> > That page also points to the solution: it is much faster to write out
>> the dirty
>>
>> > pages with WriteFile through a sync file handle
>> (FILE_FLAG_WRITE_THROUGH).
>>
>> >
>>
>> > The associated patch
>>
>> > (
>> https://github.com/kriszyp/node-lmdb/commit/7ff525ae57684a163d32af74a0ab9332b7fc4ce9
>> )
>>
>> > is my attempt at implementing this solution, for Windows. Fortunately,
>> with the
>>
>> > design of LMDB, this is relatively straightforward. LMDB already
>> supports
>>
>> > writing out dirty pages with WriteFile calls. I added a write-through
>> handle for
>>
>> > sending these writes directly to disk. I then made that file-handle
>>
>> > overlapped/asynchronous, so all the writes for a commit could be
>> started in
>>
>> > overlap mode, and (at least theoretically) transfer in parallel to the
>> drive and
>>
>> > then I used GetOverlappedResult to wait for the completion. So basically
>>
>> > mdb_page_flush becomes the sync. I extended the writing of dirty pages
>> through
>>
>> > WriteFile to writemap mode as well (for writing meta too), so that
>> WriteFile
>>
>> > with write-through can be used to flush the data without ever needing
>> to call
>>
>> > FlushViewOfFile or FlushFileBuffers. I also implemented support for
>> write
>>
>> > gathering in writemap mode, where contiguous file positions imply
>> contiguous
>>
>> > memory (by tracking the starting position with wdp and writing
>> contiguous pages
>>
>> > in single operations). Sorting of the dirty list is maintained even in
>> writemap
>>
>> > mode for this purpose.
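>> >
>> > For illustration, a rough sketch of the write-through/overlapped flush
>> > described above (simplified, with made-up names and no error handling;
>> > not the actual patch code):
>> >
>> >   #include <windows.h>
>> >   #include <string.h>
>> >
>> >   /* Flush n dirty pages through a write-through, overlapped handle.
>> >    * ov is the retained OVERLAPPED array; pages[i] points at a dirty
>> >    * page and pgno[i] is its page number. In the real patch the handle
>> >    * would be opened once per environment, not per flush. */
>> >   static void flush_pages_writethrough(const char *path, void **pages,
>> >                                        ULONGLONG *pgno, DWORD pagesize,
>> >                                        size_t n, OVERLAPPED *ov)
>> >   {
>> >       size_t i;
>> >       DWORD written;
>> >       HANDLE wfd = CreateFileA(path, GENERIC_WRITE,
>> >           FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING,
>> >           FILE_FLAG_WRITE_THROUGH | FILE_FLAG_OVERLAPPED, NULL);
>> >
>> >       /* Start all the writes for the commit... */
>> >       for (i = 0; i < n; i++) {
>> >           ULONGLONG off = pgno[i] * pagesize;
>> >           memset(&ov[i], 0, sizeof(OVERLAPPED));
>> >           ov[i].Offset     = (DWORD)(off & 0xffffffffu);
>> >           ov[i].OffsetHigh = (DWORD)(off >> 32);
>> >           /* Returns FALSE with ERROR_IO_PENDING once queued. */
>> >           WriteFile(wfd, pages[i], pagesize, NULL, &ov[i]);
>> >       }
>> >
>> >       /* ...then wait for them all; with write-through, completion means
>> >        * the data is on disk, so no FlushViewOfFile/FlushFileBuffers. */
>> >       for (i = 0; i < n; i++)
>> >           GetOverlappedResult(wfd, &ov[i], &written, TRUE);
>> >
>> >       CloseHandle(wfd);
>> >   }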
>>
>>
>>
>> What is the point of using writemap mode if you still need to use
>> WriteFile
>>
>> on every individual page?
>>
>>
>>
>> > The performance benefits of this patch, in my testing, are
>> considerable. Writing
>>
>> > out/syncing transactions is typically over 5x faster in writemap mode,
>> and 2x
>>
>> > faster in standard mode. And perhaps more importantly (especially in
>> environments
>>
>> > with many threads/processes), the efficiency benefits are even larger,
>>
>> > particularly in writemap mode, where there can be a 50-100x reduction
>> in the
>>
>> > system CPU usage by using this patch. This brings Windows performance
>> with
>>
>> > sync'ed transactions in LMDB back into the range of "lightning"
>> performance :).
>>
>>
>>
>> What is the performance difference between your patch using writemap, and
>> just
>>
>> not using writemap in the first place?
>>
>>
>>
>> --
>>
>>   -- Howard Chu
>>
>>   CTO, Symas Corp.           http://www.symas.com
>>
>>   Director, Highland Sun     http://highlandsun.com/hyc/
>>
>>   Chief Architect, OpenLDAP  http://www.openldap.org/project/
>>
>>
>>
>>
>>
>
