[prev in list] [next in list] [prev in thread] [next in thread]
List: cfe-dev
Subject: Re: [cfe-dev] [Analyzer] Attempting to speed up static analysis
From: Ben Craig via cfe-dev <cfe-dev () lists ! llvm ! org>
Date: 2015-08-31 17:29:25
Message-ID: 178901d0e412$95c563f0$c1502bd0$ () codeaurora ! org
[Download RAW message or body]
This is a multipart message in MIME format.
[Attachment #2 (multipart/alternative)]
This is a multipart message in MIME format.
Progress so far…
With the FoldingSet changes that I have prototyped on my machine, I've reduced a \
6m57.730s analysis to 6m20.126s. This is without caching a hash value, but it does \
do an up-front bucket reservation that I will discuss below.
My responses inline.
I'll be honest that an extra 4 bytes per ExplodedNode does make me cringe, but maybe \
that is not so bad. I'm totally fine with ditching FoldingSetNodeID; that was just a \
way to get some hashing based on a mechanism already in place in LLVM.
I've got an alternative pitch. There are two times when the hash of a given node is \
needed:
1. When it is added to the FoldingSet
2. When the FoldingSet grows and needs to rebucket items.
Saving the hash made case 2 significantly faster. I could instead try to avoid \
growing the FoldingSet entirely by guessing that the eventual number of buckets will \
roughly the number of work items that will be run. This approach will make simple \
analyze runs take up more memory, but it will make large analyze runs take up less, \
since we won't need to allocate, then free the too-small buckets.
It may be more profitable to put the cached hash in ProgramState. Conceptually, \
ExplodedNode is just a <ProgramState, ProgramPoint> pair, and both ProgramPoint and \
ProgramState should be independently hashable. ProgramPoints should be quickly \
hashable on their own, but ProgramStates are very expensive to both hash and \
semantically compare. That's why we do all sorts of uniquing with the ImmutableMaps \
so that we can often use referential identity in some places to provide semantic \
identity, even thought he referential identity of the underlying ImmutableMap objects \
doesn't actually matter.
Currently, the address of the ProgramState is going into the hash. The contents of \
the ProgramState aren't inspected at all for hashing or equality purposes. I think \
this is a good way to have it too. The performance trouble I'm seeing mostly come \
from cache misses, and cache misses are normally a result of chasing pointers. I \
don't want to put a hash inside of ProgramState, because that's just one more pointer \
to chase.
I've considered trying to use a b-tree instead of a hash map.
Just for clarification, which hash map are you talking about?
I was referring to the Nodes hash map inside of ExplodedGraph. I have my doubts that \
I'll be able to squeeze a comparison key and a pointer into 16-20 bytes though. I \
would need to get the key-pointer pair down to that size in order to get a 3 element \
b-tree node into a 64-byte cache line.
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation \
Collaborative Project
From: Ted Kremenek [mailto:kremenek@apple.com]
Sent: Friday, August 28, 2015 1:21 AM
To: Ben Craig
Cc: cfe-dev@lists.llvm.org
Subject: Re: [cfe-dev] [Analyzer] Attempting to speed up static analysis
Hi Ben,
Great points, and I've provided responses inline.
On Aug 27, 2015, at 1:31 PM, Ben Craig <ben.craig@codeaurora.org \
<mailto:ben.craig@codeaurora.org> > wrote:
Thanks Ted. Great information. I was hoping you would chime in :)
RE: FoldingSet
Some of the reasons that I'm asking about de-duplication is that I've already \
attempted to change things in some reckless experiments. My quick-and-dirty \
attempts at removing the ExplodedGraph cache "worked" (for some value of working) and \
was faster, but the analysis quality was lowered enough that check-clang failed. I \
made a few feeble attempts at being selective with de-duplication, but those haven't \
worked out either.
I have had some success with forking + modifying llvm::FoldingSet. Instead of \
hashing in terms of a FoldingSetID, I just have a trait with a Hash and an Equals \
method. This gives me a lot more flexibility in calculating the hash more quickly. \
I even went as far as caching the hash on the ExplodedNode itself. This is looking \
like it will result in a ~9% speedup, at the cost of an extra 4 bytes per \
ExplodedNode, and the (possibly temporary) maintenance burden of an extra FoldingSet \
class.
I'll be honest that an extra 4 bytes per ExplodedNode does make me cringe, but maybe \
that is not so bad. I'm totally fine with ditching FoldingSetNodeID; that was just a \
way to get some hashing based on a mechanism already in place in LLVM.
It may be more profitable to put the cached hash in ProgramState. Conceptually, \
ExplodedNode is just a <ProgramState, ProgramPoint> pair, and both ProgramPoint and \
ProgramState should be independently hashable. ProgramPoints should be quickly \
hashable on their own, but ProgramStates are very expensive to both hash and \
semantically compare. That's why we do all sorts of uniquing with the ImmutableMaps \
so that we can often use referential identity in some places to provide semantic \
identity, even thought he referential identity of the underlying ImmutableMap objects \
doesn't actually matter.
I've considered trying to use a b-tree instead of a hash map.
Just for clarification, which hash map are you talking about?
If (and that's a big if) I can find / construct a small comparison key that is \
accurate and encourages good access locality, then I think it may end up being a more \
cache friendly structure than the hash map. I think that the insertions and lookups \
happen ‘near' each other in terms of the AST, so I think the cache-friendliness can \
likely outweigh the downgrade from O(1) to O(lg n). But that does all hinge on \
making a small accurate comparison key.
RE: ImmutableMap
I'm going to rephrase what you said to check for my own understanding.
Instead of ImmutableMap, a naïve implementation could have stored a std::map (or \
some other associative container) instead. Doing so would get the same quality of \
analysis. However, it would use a lot more memory, and the allocations would be \
slower, since they would be using the system / mallocator instead of the bump \
allocator. The ImmutableMap is taking advantage of the fact that the program state \
changes in little bits and pieces, instead of being drastically different from one \
node to the next. The ImmutableMap allows multiple ExplodedNodes to share most of \
the common program state elements, without mutating the old elements.
If that is accurate, then I can definitely see why an ImmutableMap is being used \
here. I'll have to think more about what can change. I'll also keep finger search \
trees in mind.
Just to be clear, it's mostly about semantics. The allocations using a bump \
allocator is gravy, because we allocate so many ImmutableMap.
A ProgramState, which itself is an immutable value, contains a bunch of ImmutableMaps \
to represent things like bindings of variables to values, constraints on symbols, and \
so on. You don't mutate an ImmutableMap at all; you create a new ImmutableMap from \
an existing one, and they share state for the purpose of efficiency. That memory \
sharing between ImmutableMap values is purely an (important) optimization; what is \
most important is the semantics.
When you add a value to an std::map, that map changes in place. If two ProgramStates \
had the same std::map, then adding a value to that std::map would change multiple \
ProgramStates, which is not what we want at all. Indeed, a common mistake that \
people make when making changes to the analyzer or writing checkers is using mutating \
data structures (e.g., DenseMap) when they are in a context where immutable values \
are needed.
We could conceptually use std::map in the same places we used ImmutableMap, and even \
have sharing of std::maps between ProgramState values, but we would need to do some \
kind of copy-on-write when changing the maps for a ProgramState so that it did not \
change the underlying values of another ProgramState. This is entirely doable, but \
essentially would re-implement much of the core semantics of ImmutableMap. \
ImmutableMap also can have maps of maps, all composing to preserve immutability, \
which is really important. If you try building that all on top of std::map directly \
you can have a big house of cards that is hard to get semantic integrity in the \
analyzer core.
RE: General scalability
For now, I am just trying to make the existing 150,000 iterations provide the same \
quality in less time. Longer term, it may make sense for me to look at things like \
the summary based algorithm.
That is totally fine, and a great place to start. Even a constant factor improvement \
will make a big impact in practice.
One other hand-wavey approach that I've considered is to save off a file with an \
intermediate analysis state, so that future analysis runs can reuse the parts of that \
analysis that haven't changed. Both of those are going to need to wait until I've \
gotten more familiar with the code base.
I think this is all great, but please when making changes keep into account the \
memory overhead. I put a fair amount of tweaking into the current uses of \
ImmutableMap, including how aggressively we reclaim ExplodedNodes, based on tuning \
for memory usage. Changes in core representation of these data structures, or even \
minor algorithmic tweaks, can have a huge impact on memory. I even changed some of \
the balancing heuristics for the AVL tree so that the trees for the ImmutableMaps \
would be slightly less balanced at the benefit of using up less memory in practice.
To put it into context, suppose we made a change that caused the analyzer to use 10MB \
more memory on a complicated translation unit (this isn't a totally unrealistic \
number for very complicated files). If we had 8 cores each doing static analysis, \
that's 80MB additional memory being used for static analysis on a machine. As 10MB \
goes to 20MB, 20MB goes to 30MB, you can quickly see how much that additional memory \
overhead really matters. And the problem tends to be more pronounced on machines \
with more cores, with don't always have the same amount of memory per core as \
machines with fewer cores.
Cheers,
Ted
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation \
Collaborative Project
From: Ted Kremenek [mailto:kremenek@apple.com]
Sent: Wednesday, August 26, 2015 9:54 PM
To: Ben Craig
Cc: cfe-dev@lists.llvm.org <mailto:cfe-dev@lists.llvm.org>
Subject: Re: [cfe-dev] [Analyzer] Attempting to speed up static analysis
Hi Ben,
Thank you for taking a crack at looking at how to speed up the analyzer. The last \
time this was done was years ago, and I'm glad to see some fresh eyes taking a look \
at it.
I think to answer your questions I think it is worth stepping back and understanding \
a bit better how the pieces of the analyzer work together, and how they accomplish \
things as a whole. While the profile data is illuminating, I don't think it provides \
a total picture of what is going on.
To put things in perspective, the static analyzer employs a path-sensitive dataflow \
analysis, or analogously, a symbolic execution. In the worst case this is a super \
exponential algorithm. Analysis is essentially driven by a worklist algorithm with a \
finite number of items that can be removed from the worklist, which is essentially \
bounds the amount of work. That still means the analyzer can take a long time, but \
it is guaranteed to terminate.
Because the algorithm is super exponential, in practice the real performance wins are \
usually algorithmic. Essentially, they can fall into two categories:
1. Don't repeat the same work. The main example of this in the analyzer right now is \
to perform memoization so that the same path is not analyzed more than once. The \
FoldingSet code that you see show up in the profile is crucial to this memoization. \
It's not surprising that it shows up a lot, but keep in mind that it is being used to \
do intelligent caching so that we don't get super exponential complexity in practice.
2. Do the same work more intelligently in a way that, algorithmically, takes less \
time. One example of this is that currently the analyzer unrolls loops 3 times, but \
there are potential ways to analyze some loops without unrolling. You thus can get \
the same analysis quality with less work, and increase scalability.
Speeding up the raw data structures in isolation can sometimes be a win, but usually \
this results in a constant factor win at best. That really doesn't chip away at the \
scalability problem. From your email, it sounds like you're initially interested in \
reducing the current analysis time with keeping the analyzer doing the same amount of \
work. That's great, but also keep in mind there is a bunch of analysis that is not \
done right now because the analyzer doesn't scale. Finding ways to truly make the \
analyzer scale to larger problems requires much deeper changes to the analyzer, but \
will likely result in far more profound performance wins in the end.
Another crucial thing to take into account when profiling the analyzer is memory \
usage. Some of the current tradeoffs were reached because keeping the memory usage \
down is a critical part of making the analyzer perform. There's an immense amount of \
parallelism in the analyzer's analysis; the most basic one that is used now is that \
multiple translation units can be analyzed in parallel, which means N number of clang \
processes doing static analysis (where N is the number of cores the machine can \
saturate). If the memory usage is too high, the machine will start to thrash. More \
intelligent scheduling of analyzer processes is also another way to increase the \
analyzer's scalability in practice, and this is not a task anyone has taken on. \
There's also an immense amount of parallelism to be exploited when analyzing a single \
translation unit. That can be done in two ways: parallelize the current worklist \
algorithm, which will be tricky (but not impossible) since many of the data \
structures used are not thread-safe, or move more towards a summary-based algorithm \
approach the a few of us having been chatting about for years (which would be the \
basis of doing whole program, interprocedural analysis). Both are non-trivial \
efforts, but are likely to achieve HUGE wins in performance and scalability because \
they algorithmically scale up the performance of the analyzer.
I'll respond to your other points inline, since they have important context.
On Aug 26, 2015, at 2:34 PM, Ben Craig via cfe-dev < <mailto:cfe-dev@lists.llvm.org> \
cfe-dev@lists.llvm.org> wrote:
I am trying to speed up the static analyzer. I also haven't really worked on LLVM or \
Clang much in the past, so some of this may be old news, or naïve, or just plain \
wrong. Regardless, here are some of the things I've discovered while profiling and \
experimenting in the code base:
1. Getting consistent static analysis times is difficult. This has been making \
it difficult for me to quantify the change in performance as I make changes. If \
anyone has suggestions, or even a standard benchmark to use, then it would help me \
out a lot.
2. For my toy test case, roughly 10% of the CPU's time is spent in the checkers \
(inclusive time / time including children). This seems low.
3. Roughly 20% (inclusive) is spent in the Immutable family of data structures.
4. Roughly 25% (inclusive) is spent in the llvm::FoldingSet code. Much of this \
is memory / cache bound.
Neither 3 or 4 is surprising. This is basically the meat-and-potatoes of the \
analyzer engine. Making these faster would obviously speed up the analyzer.
5. In my toy test case, ExplodedGraph::getNode returns a new node ~99% of the \
time. The majority of the old nodes found are purge nodes, but there are a few \
non-purge nodes that get returned.
With all this in mind, it seems reasonable for me to focus on the data structures \
behind the static analyzer, rather than on the checkers.
To be honest, I'm not convinced this is the best investment of your time. I don't \
want to discourage you, but aren't going to get the huge wins you allude to in the \
following sentence:
I suspect that there are some algorithmic changes hiding that could improve the \
speed by orders of magnitude, if I only understood the underlying code and their \
assumptions better.
That said, speeding up the core data structures could be instructive… but you must \
do it with GREAT care. The data structures chosen have specific semantic properties \
which make them viable. I'll comment on them more below. But I do think there are \
wins to be made here, but a key requirement is that the memory usage of the analyzer \
doesn't greatly increase. In fact, I'd be willing to slow the analyzer down slightly \
if we could save more memory. Using less memory makes the analyzer more viable on \
more complicated projects where we tend to hit the maximum analysis time.
For example, with ExplodedGraph::Node, I think I am averaging 3 cache misses per \
ExplodedGraph::getNode call in order to support deduplication, even though less than \
1% of lookups are going to be deduplicated. Does anyone on this list know if there \
are some reasonable criteria to know in advance if a node needs to be stored for \
later deduplication? If we know that a node won't / doesn't need to be reduplicated, \
then we could insert the node blindly into the hash map without looking at all the \
colliding nodes.
Yes, there are a bunch of heuristics you can employ, but I think you'll need to do a \
lot more profiling to be sure.
Essentially the de-duplication happens when paths can possibly merge together. I can \
see this in three places:
(1) At confluence points in the CFG, where multiple branches merge together at one \
basic block. At that point, multiple paths may have the same ProgramState, and thus \
one of them can "cache out". Profiling would show this in practice.
(2) After nodes generated by a checker. Checkers can generate new ProgramStates that \
after reaching some critical point in the analysis have information removed or \
otherwise reach an equilibrium, and thus ExplodedNodes from different paths may \
suddenly have the same ProgramStates (and thus cache out). This is only a theory, \
however, in how much of an impact this really has in practice. Profiling would show \
this in practice.
(3) After a ProgramState has been cleaned out using RemoveDeadBindings. The \
ProgramState returned by RemoveDeadBindings may have a bunch of data cleansed from it \
that made it more unique from other states. At that point, the path is more likely \
to cache out. Again a theory, and one that can be verified by more profiling.
When profiling, you'll also want to analyze a variety of codebases, as the \
characteristics could be very different.
The other big area that I'm confused by is the use and purpose of the ImmutableMap \
typedef Environment::BindingsTy, which maps EnvironmentEntry keys to SVal values. 5% \
of my time is getting spent in this instantiations ImmutableMap::Factory::add, \
largely because of a call to getCanonicalTree. Why is ImmutableMap being used \
instead of alternative associative types? Of particular note is that even iteration \
is expensive because ImmutableMap nodes can't know their parent. There are some \
other ImmutableMaps being used, but I'm hoping that explanations of the \
EnvironmentEntry->SVal instantiation will help me understand the other \
instantiations.
Stepping back, ProgramStates are immutable values. The symbolic simulation works by \
keeping all ProgramStates along a path, and generated a reachable state graph (the \
ExplodedGraph). Since ProgramStates are likely to differ only slightly from each \
other, we use a functional data structure, ImmutableMap, that represents a mapping \
using a functional AVL tree. When you add a value to an ImmutableMap, you really \
just create a new ImmutableMap and the original stays intact. The trick is that \
ImmutableMaps *share* large amounts of their map representation (subtrees of the AVL \
tree) between each other, which represents a significant memory savings. \
Alternatively, we could use a reference-counted std::map or something, and when we \
needed to add a value to the map to create a new ProgramState would could copy the \
entire map. That would likely be very wasteful in practice. Also, ImmutableMap uses \
a BumpAllocator, which is very fast for allocations. This is the reason we use \
ImmutableMap.
But… your profiling data hits on something I saw before. getCanonicalTree() gets \
called when we unique the maps. This uniquing is critical for de-duplication of \
ProgramStates, but we don't always need to de-duplicate ProgramStates unless we think \
there is a strong possibility we will want to merge two paths together. That said, \
recall I mentioned that saving memory was also important. We also want to \
de-duplicate ImmutableMaps, since they are used everywhere, for memory savings. \
Again, I think profiling data could be the key here, and we can be in the position \
where we de-duplicate some ImmutableMaps we use less often than others based \
depending on what the profiling data says. But the profiling data must include \
memory usage as a factor; otherwise it would be easy to optimize these CPU numbers \
but tank memory usage.
Another thing to explore is that for smaller maps maybe we don't want an ImmutableMap \
at all, but just something like an ImmutableList to record a bunch of key value \
pairs. Once we de-duplicate states, we could canonicalize around use ImmutableMaps \
when creating a canonicalized ProgramState, or we can switch to ImmutableMap once the \
list gets too big. For example, Environment uses an ImmutableMap, but there is an \
access pattern to values in an Environment. Typically we just want the \
subexpressions of the current expression. If we are working with a binary operator \
or anything with multiple subexpressions an ImmutableMap might be good, but if there \
is only one subexpression sometimes an ImmutableMap is overkill. Or perhaps we can \
create a data structure that is a mixture of both ImmutableLists and ImmutableMaps, \
chaining them together. Why would we do this? It's less work to manage a small list \
and an immutable map, but it would be an increase in implementation complexity (that \
could possibly be hidden by the right abstractions).
Question summary:
1. Is there a good, existing benchmark for static analysis that I should be \
using?
In the past, I've often looked at:
- sqlite3.c
- postgresql
These were good stable C baselines. I've also analyzed the Python codebase in the \
past, which also exhibited some of the more algorithmic issues in the analyzer \
(analyzing the interpreter code in Python has to deal with a bunch of branches and \
switch statements). Sqlite3 is nice since it is one file.
LLVM/Clang is also a good codebase for C++. I think we should also pick some other \
C++ codebases.
2. Is there some reasonable criteria to know in advance if an ExplodedNode \
needs to be stored for later reduplication?
See my comment above.
3. Why is ImmutableMap<EnvironmentEntry, SVal> being used instead of std::map, \
std::hash_map, or even llvmFoldingSet?
Hopefully I've answered this question.
Cheers,
Ted
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation \
Collaborative Project
_______________________________________________
cfe-dev mailing list
<mailto:cfe-dev@lists.llvm.org> cfe-dev@lists.llvm.org
<https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.llvm.org_cgi-2Dbin_mailman \
_listinfo_cfe-2Ddev&d=BQIGaQ&c=eEvniauFctOgLOKGJOplqw&r=UVc407_CCx3FapxjS2xZ9jo4Q91upS \
GpJHRF8fPPYVY&m=rDsVjZ5h8hLZXMQlZz7pvL5iklwjHNwyE9dfwASlyQQ&s=evEZF87HnXhB6gk0zZ-ejVmUiqWR-Qe5GWiFYtMqUBk&e=> \
https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.llvm.org_cgi-2Dbin_mailman_l \
istinfo_cfe-2Ddev&d=BQIGaQ&c=eEvniauFctOgLOKGJOplqw&r=UVc407_CCx3FapxjS2xZ9jo4Q91upSGp \
JHRF8fPPYVY&m=rDsVjZ5h8hLZXMQlZz7pvL5iklwjHNwyE9dfwASlyQQ&s=evEZF87HnXhB6gk0zZ-ejVmUiqWR-Qe5GWiFYtMqUBk&e=
[Attachment #5 (text/html)]
<html xmlns:v="urn:schemas-microsoft-com:vml" \
xmlns:o="urn:schemas-microsoft-com:office:office" \
xmlns:w="urn:schemas-microsoft-com:office:word" \
xmlns:x="urn:schemas-microsoft-com:office:excel" \
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" \
xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type \
content="text/html; charset=utf-8"><meta name=Generator content="Microsoft Word 15 \
(filtered medium)"><style><!-- /* Font Definitions */
@font-face
{font-family:Helvetica;
panose-1:2 11 6 4 2 2 2 2 2 4;}
@font-face
{font-family:Wingdings;
panose-1:5 0 0 0 0 0 0 0 0 0;}
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman",serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
span.apple-converted-space
{mso-style-name:apple-converted-space;}
span.EmailStyle18
{mso-style-type:personal;
font-family:"Calibri",sans-serif;
color:#1F497D;}
span.EmailStyle19
{mso-style-type:personal-compose;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=EN-US link=blue vlink=purple><div \
class=WordSection1><p class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'>Progress so \
far…<o:p></o:p></span></p><p class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'>With the \
FoldingSet changes that I have prototyped on my machine, I've reduced a 6m57.730s \
analysis to 6m20.126s. This is without caching a hash value, but it does do an \
up-front bucket reservation that I will discuss below.<o:p></o:p></span></p><p \
class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'><o:p> </o:p></span></p><p \
class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'>My responses \
inline. <o:p></o:p></span></p><p class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'><o:p> </o:p></span></p><p \
class=MsoNormal style='margin-left:.5in'>I'll be honest that an extra 4 bytes per \
ExplodedNode does make me cringe, but maybe that is not so bad. I'm totally \
fine with ditching FoldingSetNodeID; that was just a way to get some hashing based on \
a mechanism already in place in LLVM.<o:p></o:p></p><p class=MsoNormal \
style='margin-left:.5in'><o:p> </o:p></p><p class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'>I've got an \
alternative pitch. There are two times when the hash of a given node is \
needed:<o:p></o:p></span></p><p class=MsoNormal style='text-indent:.5in'><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'>1. When it is \
added to the FoldingSet<o:p></o:p></span></p><p class=MsoNormal \
style='text-indent:.5in'><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'>2. When the \
FoldingSet grows and needs to rebucket items.<o:p></o:p></span></p><p \
class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'>Saving the \
hash made case 2 significantly faster. I could instead try to avoid growing the \
FoldingSet entirely by guessing that the eventual number of buckets will roughly the \
number of work items that will be run. This approach will make simple analyze runs \
take up more memory, but it will make large analyze runs take up less, since we won't \
need to allocate, then free the too-small buckets.</span><o:p></o:p></p><p \
class=MsoNormal style='margin-left:.5in'><o:p> </o:p></p><p class=MsoNormal \
style='margin-left:.5in'>It may be more profitable to put the cached hash in \
ProgramState. Conceptually, ExplodedNode is just a <ProgramState, \
ProgramPoint> pair, and both ProgramPoint and ProgramState should be independently \
hashable. ProgramPoints should be quickly hashable on their own, but \
ProgramStates are very expensive to both hash and semantically compare. That's \
why we do all sorts of uniquing with the ImmutableMaps so that we can often use \
referential identity in some places to provide semantic identity, even thought he \
referential identity of the underlying ImmutableMap objects doesn't actually \
matter.<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p \
class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'>Currently, \
the address of the ProgramState is going into the hash. The contents of the \
ProgramState aren't inspected at all for hashing or equality purposes. I think this \
is a good way to have it too. The performance trouble I'm seeing mostly come from \
cache misses, and cache misses are normally a result of chasing pointers. I don't \
want to put a hash inside of ProgramState, because that's just one more pointer to \
chase.</span> <o:p></o:p></p><p class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'><o:p> </o:p></span></p><p \
class=MsoNormal style='margin-left:.5in;text-indent:.5in'><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'>I've \
considered trying to use a b-tree instead of a hash map.</span><o:p></o:p></p><p \
class=MsoNormal style='margin-left:.5in'><o:p> </o:p></p><p class=MsoNormal \
style='margin-left:.5in'>Just for clarification, which hash map are you talking \
about?<o:p></o:p></p><p class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'><o:p> </o:p></span></p><p \
class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'>I was \
referring to the Nodes hash map inside of ExplodedGraph. I have my doubts that I'll \
be able to squeeze a comparison key and a pointer into 16-20 bytes though. I would \
need to get the key-pointer pair down to that size in order to get a 3 element b-tree \
node into a 64-byte cache line.<o:p></o:p></span></p><p class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'><o:p> </o:p></span></p><div><p \
class=MsoNormal><span style='font-size:10.0pt;font-family:"Courier \
New";color:#1F497D'>Employee of Qualcomm Innovation Center, \
Inc.<o:p></o:p></span></p><p class=MsoNormal><span \
style='font-size:10.0pt;font-family:"Courier New";color:#1F497D'>Qualcomm Innovation \
Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative \
Project<o:p></o:p></span></p></div><p class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'><o:p> </o:p></span></p><div><div \
style='border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in'><p \
class=MsoNormal><b><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif'>From:</span></b><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif'> Ted Kremenek \
[mailto:kremenek@apple.com] <br><b>Sent:</b> Friday, August 28, 2015 1:21 \
AM<br><b>To:</b> Ben Craig<br><b>Cc:</b> cfe-dev@lists.llvm.org<br><b>Subject:</b> \
Re: [cfe-dev] [Analyzer] Attempting to speed up static \
analysis<o:p></o:p></span></p></div></div><p \
class=MsoNormal><o:p> </o:p></p><div><p class=MsoNormal>Hi \
Ben,<o:p></o:p></p></div><div><p class=MsoNormal><o:p> </o:p></p></div><div><p \
class=MsoNormal>Great points, and I've provided responses \
inline.<o:p></o:p></p></div><p class=MsoNormal><o:p> </o:p></p><div><blockquote \
style='margin-top:5.0pt;margin-bottom:5.0pt'><div><p class=MsoNormal>On Aug 27, 2015, \
at 1:31 PM, Ben Craig <<a \
href="mailto:ben.craig@codeaurora.org">ben.craig@codeaurora.org</a>> \
wrote:<o:p></o:p></p></div><p class=MsoNormal><o:p> </o:p></p><div><div><p \
class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'>Thanks \
Ted. Great information. I was hoping you would chime in<span \
class=apple-converted-space> </span></span><span \
style='font-size:11.0pt;font-family:Wingdings;color:#1F497D'>J</span><o:p></o:p></p></div><div><p \
class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'> </span><o:p></o:p></p></div><div><p \
class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'>RE: \
FoldingSet</span><o:p></o:p></p></div><div><p class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'>Some of the \
reasons that I'm asking about de-duplication is that I've already attempted to change \
things in some reckless experiments. My quick-and-dirty attempts at \
removing the ExplodedGraph cache "worked" (for some value of working) and was faster, \
but the analysis quality was lowered enough that check-clang failed. I made a \
few feeble attempts at being selective with de-duplication, but those haven't worked \
out either.</span><o:p></o:p></p></div><div><p class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'> </span><o:p></o:p></p></div><div><p \
class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'>I have had \
some success with forking + modifying llvm::FoldingSet. Instead of hashing in \
terms of a FoldingSetID, I just have a trait with a Hash and an Equals method. \
This gives me a lot more flexibility in calculating the hash more quickly. I \
even went as far as caching the hash on the ExplodedNode itself. This is \
looking like it will result in a ~9% speedup, at the cost of an extra 4 bytes per \
ExplodedNode, and the (possibly temporary) maintenance burden of an extra FoldingSet \
class.</span><o:p></o:p></p></div></div></blockquote><div><p \
class=MsoNormal><o:p> </o:p></p></div><div><p class=MsoNormal>I'll be honest \
that an extra 4 bytes per ExplodedNode does make me cringe, but maybe that is not so \
bad. I'm totally fine with ditching FoldingSetNodeID; that was just a way to \
get some hashing based on a mechanism already in place in \
LLVM.<o:p></o:p></p></div><div><p class=MsoNormal><o:p> </o:p></p></div><div><p \
class=MsoNormal>It may be more profitable to put the cached hash in ProgramState. \
Conceptually, ExplodedNode is just a <ProgramState, ProgramPoint> pair, \
and both ProgramPoint and ProgramState should be independently hashable. \
ProgramPoints should be quickly hashable on their own, but ProgramStates are \
very expensive to both hash and semantically compare. That's why we do all \
sorts of uniquing with the ImmutableMaps so that we can often use referential \
identity in some places to provide semantic identity, even thought he referential \
identity of the underlying ImmutableMap objects doesn't actually \
matter.<o:p></o:p></p></div><p class=MsoNormal><br><br><o:p></o:p></p><blockquote \
style='margin-top:5.0pt;margin-bottom:5.0pt'><div><div><p class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'> </span><o:p></o:p></p></div><div><p \
class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'>I've \
considered trying to use a b-tree instead of a hash \
map.</span><o:p></o:p></p></div></div></blockquote><div><p \
class=MsoNormal><o:p> </o:p></p></div><div><p class=MsoNormal>Just for \
clarification, which hash map are you talking about?<o:p></o:p></p></div><p \
class=MsoNormal><br><br><o:p></o:p></p><blockquote \
style='margin-top:5.0pt;margin-bottom:5.0pt'><div><div><p class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'> If \
(and that's a big if) I can find / construct a small comparison key that is accurate \
and encourages good access locality, then I think it may end up being a more cache \
friendly structure than the hash map. I think that the insertions and lookups \
happen ‘near' each other in terms of the AST, so I think the cache-friendliness can \
likely outweigh the downgrade from O(1) to O(lg n). But that does all hinge on \
making a small accurate comparison key.</span><o:p></o:p></p></div><div><p \
class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'> </span><o:p></o:p></p></div><div><p \
class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'>RE: \
ImmutableMap</span><o:p></o:p></p></div><div><p class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'>I'm going to \
rephrase what you said to check for my own \
understanding.</span><o:p></o:p></p></div><div><p class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'>Instead of \
ImmutableMap, a naïve implementation could have stored a std::map (or some other \
associative container) instead. Doing so would get the same quality of \
analysis. However, it would use a lot more memory, and the allocations would be \
slower, since they would be using the system / mallocator instead of the bump \
allocator. The ImmutableMap is taking advantage of the fact that the program \
state changes in little bits and pieces, instead of being drastically different from \
one node to the next. The ImmutableMap allows multiple ExplodedNodes to share \
most of the common program state elements, without mutating the old \
elements.</span><o:p></o:p></p></div><div><p class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'> </span><o:p></o:p></p></div><div><p \
class=MsoNormal><span \
style='font-size:11.0pt;font-family:"Calibri",sans-serif;color:#1F497D'>If that is \
accurate, then I can definitely see why an ImmutableMap is being used here. \
I'll have to think more about what can change. I'll also keep finger search \
trees in mind.</span><o:p></o:p></p></div></div></blockquote><div><p \
class=MsoNormal><o:p> </o:p></p></div><div><p class=MsoNormal>Just to be clear, \
it's mostly about semantics. The allocations using a bump allocator is gravy, \
because we allocate so many ImmutableMap.<o:p></o:p></p></div><div><p \
class=MsoNormal><o:p> </o:p></p></div><div><p class=MsoNormal>A ProgramState, \
which itself is an immutable value, contains a bunch of ImmutableMaps to represent \
things like bindings of variables to values, constraints on symbols, and so on. \
You don't mutate an ImmutableMap at all; you create a new ImmutableMap from an \
existing one, and they share state for the purpose of efficiency. That memory \
sharing between ImmutableMap values is purely an (important) optimization; what is \
most important is the semantics.<o:p></o:p></p></div><div><p \
class=MsoNormal><o:p> </o:p></p></div><div><p class=MsoNormal>When you add a \
value to an std::map, that map changes in place. If two ProgramStates had the \
same std::map, then adding a value to that std::map would change multiple \
ProgramStates, which is not what we want at all. Indeed, a common mistake that \
people make when making changes to the analyzer or writing checkers is using mutating \
data structures (e.g., DenseMap) when they are in a context where immutable values \
are needed.<o:p></o:p></p></div><div><p \
class=MsoNormal><o:p> </o:p></p></div><div><p class=MsoNormal>We could \
conceptually use std::map in the same places we used ImmutableMap, and even have \
sharing of std::maps between ProgramState values, but we would need to do some kind \
of copy-on-write when changing the maps for a ProgramState so that it did not change \
the underlying values of another ProgramState. This is entirely doable, but \
essentially would re-implement much of the core semantics of ImmutableMap. \
ImmutableMap also can have maps of maps, all composing to preserve \
immutability, which is really important. If you try building that all on top of \
std::map directly you can have a big house of cards that is hard to get semantic \
integrity in the analyzer core.<o:p></o:p></p></div><div><p \
class=MsoNormal><o:p> </o:p></p></div><p \
class=MsoNormal><br><br><o:p></o:p></p><blockquote \
style='margin-top:5.0pt;margin-bottom:5.0pt'><div><div><p class=MsoNormal><span \
[Attachment #6 (text/plain)]
_______________________________________________
cfe-dev mailing list
cfe-dev@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
[prev in list] [next in list] [prev in thread] [next in thread]
Configure |
About |
News |
Add a list |
Sponsored by KoreLogic