
List:       cfe-dev
Subject:    Re: [cfe-dev] [Analyzer] Attempting to speed up static analysis
From:       Ben Craig via cfe-dev <cfe-dev@lists.llvm.org>
Date:       2015-08-31 17:29:25
Message-ID: 178901d0e412$95c563f0$c1502bd0$@codeaurora.org

Progress so far…

With the FoldingSet changes that I have prototyped on my machine, I've reduced a 6m57.730s analysis to 6m20.126s.  This is without caching a hash value, but it does do an up-front bucket reservation that I will discuss below.

 

My responses inline.  

 

I'll be honest that an extra 4 bytes per ExplodedNode does make me cringe, but maybe that is not so bad.  I'm totally fine with ditching FoldingSetNodeID; that was just a way to get some hashing based on a mechanism already in place in LLVM.

 

I've got an alternative pitch.  There are two times when the hash of a given node is needed:

1. When it is added to the FoldingSet

2. When the FoldingSet grows and needs to rebucket items.

Saving the hash made case 2 significantly faster.  I could instead try to avoid growing the FoldingSet entirely by guessing that the eventual number of buckets will be roughly the number of work items that will be run.  This approach will make simple analyze runs take up more memory, but it will make large analyze runs take up less, since we won't need to allocate, then free, the too-small buckets.
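A minimal sketch of that up-front reservation, using stock llvm::FoldingSet with a stand-in node type (not the analyzer's ExplodedNode):

  #include "llvm/ADT/FoldingSet.h"
  #include "llvm/Support/MathExtras.h"
  #include <algorithm>

  struct MyNode : llvm::FoldingSetNode {
    int Value;
    explicit MyNode(int V) : Value(V) {}
    void Profile(llvm::FoldingSetNodeID &ID) const { ID.AddInteger(Value); }
  };

  void presizedExample(unsigned ExpectedItems) {
    // FoldingSet's constructor takes the log2 of the initial bucket count,
    // so sizing it near the expected work-item count avoids every later
    // grow-and-rebucket pass (case 2 above).
    llvm::FoldingSet<MyNode> Set(
        llvm::Log2_32_Ceil(std::max(ExpectedItems, 2u)));
    // ... insert nodes with Set.GetOrInsertNode(...) as usual ...
  }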

 

It may be more profitable to put the cached hash in ProgramState.  Conceptually, ExplodedNode is just a <ProgramState, ProgramPoint> pair, and both ProgramPoint and ProgramState should be independently hashable.  ProgramPoints should be quickly hashable on their own, but ProgramStates are very expensive to both hash and semantically compare.  That's why we do all sorts of uniquing with the ImmutableMaps so that we can often use referential identity in some places to provide semantic identity, even though the referential identity of the underlying ImmutableMap objects doesn't actually matter.

 

Currently, the address of the ProgramState is going into the hash.  The contents of the ProgramState aren't inspected at all for hashing or equality purposes.  I think this is a good way to have it too.  The performance trouble I'm seeing mostly comes from cache misses, and cache misses are normally a result of chasing pointers.  I don't want to put a hash inside of ProgramState, because that's just one more pointer to chase.
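To make the address-only hashing concrete, here is a hedged sketch of the Hash/Equals trait shape described above.  Every name is hypothetical, and the sketch assumes the ProgramPoint half of the key is interned so both halves compare by address:

  #include "llvm/ADT/Hashing.h"
  #include <cstdint>

  // Stand-in key: the real node pairs a ProgramPoint with a ProgramState*.
  struct NodeKey {
    const void *State; // uniqued ProgramState: address equality is identity
    const void *Point; // assumed interned ProgramPoint, for this sketch
  };

  struct NodeKeyTraits {
    static unsigned Hash(const NodeKey &K) {
      // Mix the two addresses; no ProgramState contents are read, so no
      // extra pointer is chased just to compute the hash.
      return static_cast<unsigned>(
          llvm::hash_combine(reinterpret_cast<uintptr_t>(K.State),
                             reinterpret_cast<uintptr_t>(K.Point)));
    }
    static bool Equals(const NodeKey &A, const NodeKey &B) {
      return A.State == B.State && A.Point == B.Point;
    }
  };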

 

I've considered trying to use a b-tree instead of a hash map.

 

Just for clarification, which hash map are you talking about?

 

I was referring to the Nodes hash map inside of ExplodedGraph.  I have my doubts that I'll be able to squeeze a comparison key and a pointer into 16-20 bytes though.  I would need to get the key-pointer pair down to that size in order to get a 3 element b-tree node into a 64-byte cache line.

 

Employee of Qualcomm Innovation Center, Inc.

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project

 

From: Ted Kremenek [mailto:kremenek@apple.com] 
Sent: Friday, August 28, 2015 1:21 AM
To: Ben Craig
Cc: cfe-dev@lists.llvm.org
Subject: Re: [cfe-dev] [Analyzer] Attempting to speed up static analysis

 

Hi Ben,

 

Great points, and I've provided responses inline.

 

On Aug 27, 2015, at 1:31 PM, Ben Craig <ben.craig@codeaurora.org> wrote:

 

Thanks Ted.  Great information.  I was hoping you would chime in :)

 

RE: FoldingSet

One of the reasons that I'm asking about de-duplication is that I've already attempted to change things in some reckless experiments.  My quick-and-dirty attempts at removing the ExplodedGraph cache "worked" (for some value of working) and were faster, but the analysis quality was lowered enough that check-clang failed.  I made a few feeble attempts at being selective with de-duplication, but those haven't worked out either.

 

I have had some success with forking + modifying llvm::FoldingSet.  Instead of hashing in terms of a FoldingSetNodeID, I just have a trait with a Hash and an Equals method.  This gives me a lot more flexibility in calculating the hash more quickly.  I even went as far as caching the hash on the ExplodedNode itself.  This is looking like it will result in a ~9% speedup, at the cost of an extra 4 bytes per ExplodedNode, and the (possibly temporary) maintenance burden of an extra FoldingSet class.

 

I'll be honest that an extra 4 bytes per ExplodedNode does make me cringe, but maybe that is not so bad.  I'm totally fine with ditching FoldingSetNodeID; that was just a way to get some hashing based on a mechanism already in place in LLVM.

 

It may be more profitable to put the cached hash in ProgramState.  Conceptually, ExplodedNode is just a <ProgramState, ProgramPoint> pair, and both ProgramPoint and ProgramState should be independently hashable.  ProgramPoints should be quickly hashable on their own, but ProgramStates are very expensive to both hash and semantically compare.  That's why we do all sorts of uniquing with the ImmutableMaps so that we can often use referential identity in some places to provide semantic identity, even though the referential identity of the underlying ImmutableMap objects doesn't actually matter.





 

I've considered trying to use a b-tree instead of a hash map.

 

Just for clarification, which hash map are you talking about?





If (and that's a big if) I can find / construct a small comparison key that is accurate and encourages good access locality, then I think it may end up being a more cache friendly structure than the hash map.  I think that the insertions and lookups happen 'near' each other in terms of the AST, so I think the cache-friendliness can likely outweigh the downgrade from O(1) to O(lg n).  But that does all hinge on making a small accurate comparison key.

 

RE: ImmutableMap

I'm going to rephrase what you said to check for my own understanding.

Instead of ImmutableMap, a naïve implementation could have stored a std::map (or some other associative container) instead.  Doing so would get the same quality of analysis.  However, it would use a lot more memory, and the allocations would be slower, since they would be using the system malloc instead of the bump allocator.  The ImmutableMap is taking advantage of the fact that the program state changes in little bits and pieces, instead of being drastically different from one node to the next.  The ImmutableMap allows multiple ExplodedNodes to share most of the common program state elements, without mutating the old elements.

 

If that is accurate, then I can definitely see why an ImmutableMap is being used here.  I'll have to think more about what can change.  I'll also keep finger search trees in mind.

 

Just to be clear, it's mostly about semantics.  The allocations using a bump allocator are gravy, because we allocate so many ImmutableMaps.

 

A ProgramState, which itself is an immutable value, contains a bunch of ImmutableMaps to represent things like bindings of variables to values, constraints on symbols, and so on.  You don't mutate an ImmutableMap at all; you create a new ImmutableMap from an existing one, and they share state for the purpose of efficiency.  That memory sharing between ImmutableMap values is purely an (important) optimization; what is most important is the semantics.
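A self-contained example of those value semantics, using the real llvm::ImmutableMap API:

  #include "llvm/ADT/ImmutableMap.h"
  #include <cassert>

  void immutableMapExample() {
    llvm::ImmutableMap<int, int>::Factory F;
    llvm::ImmutableMap<int, int> M1 = F.add(F.getEmptyMap(), 1, 10);
    llvm::ImmutableMap<int, int> M2 = F.add(M1, 2, 20);
    // The second add leaves M1 untouched; M2 shares M1's subtrees
    // internally, but each map behaves as an independent value.
    assert(M1.lookup(2) == nullptr);
    assert(M2.lookup(2) != nullptr && *M2.lookup(2) == 20);
  }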

 

When you add a value to an std::map, that map changes in place.  If two ProgramStates had the same std::map, then adding a value to that std::map would change multiple ProgramStates, which is not what we want at all.  Indeed, a common mistake that people make when making changes to the analyzer or writing checkers is using mutating data structures (e.g., DenseMap) when they are in a context where immutable values are needed.
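The aliasing hazard in miniature (a deliberately buggy sketch, not analyzer code):

  #include <map>
  #include <memory>
  #include <string>

  struct State {
    std::shared_ptr<std::map<std::string, int>> Bindings;
  };

  void buggy() {
    auto Shared = std::make_shared<std::map<std::string, int>>();
    State A{Shared}, B{Shared}; // two states alias one mutable map
    (*A.Bindings)["x"] = 1;     // meant to update only A...
    // ...but B now also sees x == 1, which is exactly the mistake that
    // mutating structures invite where immutable values are needed.
  }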

 

We could conceptually use std::map in the same places we used ImmutableMap, and even have sharing of std::maps between ProgramState values, but we would need to do some kind of copy-on-write when changing the maps for a ProgramState so that it did not change the underlying values of another ProgramState.  This is entirely doable, but essentially would re-implement much of the core semantics of ImmutableMap.  ImmutableMap also can have maps of maps, all composing to preserve immutability, which is really important.  If you try building that all on top of std::map directly, you end up with a big house of cards that makes it hard to maintain semantic integrity in the analyzer core.
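What that copy-on-write alternative would look like in miniature (a single-threaded sketch; the class is invented for illustration):

  #include <map>
  #include <memory>

  template <typename K, typename V>
  class CowMap {
    std::shared_ptr<std::map<K, V>> Data =
        std::make_shared<std::map<K, V>>();

  public:
    const V *lookup(const K &Key) const {
      auto It = Data->find(Key);
      return It == Data->end() ? nullptr : &It->second;
    }
    void set(const K &Key, const V &Val) {
      if (Data.use_count() > 1) // shared: clone before the first write
        Data = std::make_shared<std::map<K, V>>(*Data);
      (*Data)[Key] = Val;
    }
  };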

 





 

RE: General scalability

For now, I am just trying to make the existing 150,000 iterations provide the same quality in less time.  Longer term, it may make sense for me to look at things like the summary based algorithm.

 

That is totally fine, and a great place to start.  Even a constant factor improvement will make a big impact in practice.





One other hand-wavey approach that I've considered is to save off a file with an intermediate analysis state, so that future analysis runs can reuse the parts of that analysis that haven't changed.  Both of those are going to need to wait until I've gotten more familiar with the code base.

 

I think this is all great, but when making changes please keep the memory overhead in mind.  I put a fair amount of tweaking into the current uses of ImmutableMap, including how aggressively we reclaim ExplodedNodes, based on tuning for memory usage.  Changes in the core representation of these data structures, or even minor algorithmic tweaks, can have a huge impact on memory.  I even changed some of the balancing heuristics for the AVL tree so that the trees for the ImmutableMaps would be slightly less balanced, in exchange for using less memory in practice.

 

To put it into context, suppose we made a change that caused the analyzer to use 10MB more memory on a complicated translation unit (this isn't a totally unrealistic number for very complicated files).  If we had 8 cores each doing static analysis, that's 80MB additional memory being used for static analysis on a machine.  As 10MB goes to 20MB, 20MB goes to 30MB, you can quickly see how much that additional memory overhead really matters.  And the problem tends to be more pronounced on machines with more cores, which don't always have the same amount of memory per core as machines with fewer cores.

 

Cheers,

Ted

 

 

Employee of Qualcomm Innovation Center, Inc.

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project

 

From: Ted Kremenek [mailto:kremenek@apple.com] 
Sent: Wednesday, August 26, 2015 9:54 PM
To: Ben Craig
Cc: cfe-dev@lists.llvm.org
Subject: Re: [cfe-dev] [Analyzer] Attempting to speed up static analysis

 

Hi Ben,

 

Thank you for taking a crack at looking at how to speed up the analyzer.  The last time this was done was years ago, and I'm glad to see some fresh eyes taking a look at it.

 

To answer your questions, I think it is worth stepping back and understanding a bit better how the pieces of the analyzer work together, and how they accomplish things as a whole.  While the profile data is illuminating, I don't think it provides a total picture of what is going on.

 

To put things in perspective, the static analyzer employs a path-sensitive dataflow analysis, or analogously, a symbolic execution.  In the worst case this is a super exponential algorithm.  Analysis is essentially driven by a worklist algorithm with a finite number of items that can be removed from the worklist, which essentially bounds the amount of work.  That still means the analyzer can take a long time, but it is guaranteed to terminate.
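Reduced to a toy, the shape of that bounded worklist (plus the memoization discussed in point 1 below) looks roughly like this; every type here is a stand-in:

  #include <set>
  #include <tuple>
  #include <vector>

  struct SimState {
    int Point, State; // toy stand-ins for <ProgramPoint, ProgramState>
    bool operator<(const SimState &O) const {
      return std::tie(Point, State) < std::tie(O.Point, O.State);
    }
  };

  // Toy transfer function: at most two successors per state, with the
  // point counter capping the depth, mirroring the finite work bound.
  std::vector<SimState> successors(const SimState &S) {
    if (S.Point >= 3)
      return {};
    return {{S.Point + 1, S.State}, {S.Point + 1, S.State + 1}};
  }

  void explore(SimState Entry) {
    std::set<SimState> Visited; // plays the role of the FoldingSet
    std::vector<SimState> Worklist{Entry};
    while (!Worklist.empty()) {
      SimState S = Worklist.back();
      Worklist.pop_back();
      if (!Visited.insert(S).second)
        continue; // already explored: this path "caches out"
      for (const SimState &Next : successors(S))
        Worklist.push_back(Next);
    }
  }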

 

Because the algorithm is super exponential, in practice the real performance wins are usually algorithmic.  Essentially, they can fall into two categories:

 

1. Don't repeat the same work.  The main example of this in the analyzer right now is to perform memoization so that the same path is not analyzed more than once.  The FoldingSet code that you see show up in the profile is crucial to this memoization.  It's not surprising that it shows up a lot, but keep in mind that it is being used to do intelligent caching so that we don't get super exponential complexity in practice.

 

2. Do the same work more intelligently in a way that, algorithmically, takes less time.  One example of this is that currently the analyzer unrolls loops 3 times, but there are potential ways to analyze some loops without unrolling.  You thus can get the same analysis quality with less work, and increase scalability.

 

Speeding up the raw data structures in isolation can sometimes be a win, but usually this results in a constant factor win at best.  That really doesn't chip away at the scalability problem.  From your email, it sounds like you're initially interested in reducing the current analysis time while keeping the analyzer doing the same amount of work.  That's great, but also keep in mind there is a bunch of analysis that is not done right now because the analyzer doesn't scale.  Finding ways to truly make the analyzer scale to larger problems requires much deeper changes to the analyzer, but will likely result in far more profound performance wins in the end.

 

Another crucial thing to take into account when profiling the analyzer is memory usage.  Some of the current tradeoffs were reached because keeping the memory usage down is a critical part of making the analyzer perform.  There's an immense amount of parallelism in the analyzer's analysis; the most basic form used now is that multiple translation units can be analyzed in parallel, which means N clang processes doing static analysis (where N is the number of cores the machine can saturate).  If the memory usage is too high, the machine will start to thrash.  More intelligent scheduling of analyzer processes is also another way to increase the analyzer's scalability in practice, and this is not a task anyone has taken on.  There's also an immense amount of parallelism to be exploited when analyzing a single translation unit.  That can be done in two ways: parallelize the current worklist algorithm, which will be tricky (but not impossible) since many of the data structures used are not thread-safe, or move more towards a summary-based approach that a few of us have been chatting about for years (which would be the basis of doing whole program, interprocedural analysis).  Both are non-trivial efforts, but are likely to achieve HUGE wins in performance and scalability because they algorithmically scale up the performance of the analyzer.

 

I'll respond to your other points inline, since they have important context.

 

 

On Aug 26, 2015, at 2:34 PM, Ben Craig via cfe-dev <cfe-dev@lists.llvm.org> wrote:

 

I am trying to speed up the static analyzer.  I also haven't really worked on LLVM or Clang much in the past, so some of this may be old news, or naïve, or just plain wrong.  Regardless, here are some of the things I've discovered while profiling and experimenting in the code base:

 

1. Getting consistent static analysis times is difficult.  This has been making it difficult for me to quantify the change in performance as I make changes.  If anyone has suggestions, or even a standard benchmark to use, then it would help me out a lot.

2. For my toy test case, roughly 10% of the CPU's time is spent in the checkers (inclusive time / time including children).  This seems low.

3. Roughly 20% (inclusive) is spent in the Immutable family of data structures.

4. Roughly 25% (inclusive) is spent in the llvm::FoldingSet code.  Much of this is memory / cache bound.

 

Neither 3 nor 4 is surprising.  This is basically the meat-and-potatoes of the analyzer engine.  Making these faster would obviously speed up the analyzer.






5. In my toy test case, ExplodedGraph::getNode returns a new node ~99% of the time.  The majority of the old nodes found are purge nodes, but there are a few non-purge nodes that get returned.

 

With all this in mind, it seems reasonable for me to focus on the data structures behind the static analyzer, rather than on the checkers.

 

To be honest, I'm not convinced this is the best investment of your time.  I don't want to discourage you, but you aren't going to get the huge wins you allude to in the following sentence:






I suspect that there are some algorithmic changes hiding that could improve the speed by orders of magnitude, if I only understood the underlying code and their assumptions better.

 

That said, speeding up the core data structures could be instructive… but you must do it with GREAT care.  The data structures chosen have specific semantic properties which make them viable.  I'll comment on them more below.  I do think there are wins to be made here, but a key requirement is that the memory usage of the analyzer doesn't greatly increase.  In fact, I'd be willing to slow the analyzer down slightly if we could save more memory.  Using less memory makes the analyzer more viable on more complicated projects where we tend to hit the maximum analysis time.

 

 

For example, with ExplodedGraph::getNode, I think I am averaging 3 cache misses per ExplodedGraph::getNode call in order to support deduplication, even though less than 1% of lookups are going to be deduplicated.  Does anyone on this list know if there are some reasonable criteria to know in advance if a node needs to be stored for later deduplication?  If we know that a node won't / doesn't need to be deduplicated, then we could insert the node blindly into the hash map without looking at all the colliding nodes.
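A sketch of what that blind insert could look like.  CannotDeduplicate and InsertNodeBlind are hypothetical: stock llvm::FoldingSet always probes the bucket first, so a true blind insert would have to live in a forked set like the one described earlier in the thread:

  #include "llvm/ADT/FoldingSet.h"

  template <typename NodeT, typename SetT>
  NodeT *getNodeSketch(SetT &Nodes, const llvm::FoldingSetNodeID &ID,
                       NodeT *Fresh, bool CannotDeduplicate) {
    if (CannotDeduplicate) {
      Nodes.InsertNodeBlind(Fresh); // hypothetical: skip the bucket walk
      return Fresh;
    }
    void *InsertPos = nullptr;
    if (NodeT *Existing = Nodes.FindNodeOrInsertPos(ID, InsertPos))
      return Existing;              // the <1% of lookups that deduplicate
    Nodes.InsertNode(Fresh, InsertPos);
    return Fresh;
  }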

 

Yes, there are a bunch of heuristics you can employ, but I think you'll need to do a lot more profiling to be sure.

 

Essentially the de-duplication happens when paths can possibly merge together.  I can see this in three places:

 

(1) At confluence points in the CFG, where multiple branches merge together at one basic block.  At that point, multiple paths may have the same ProgramState, and thus one of them can "cache out".  Profiling would show this in practice.

 

(2) After nodes generated by a checker.  Checkers can generate new ProgramStates that, after reaching some critical point in the analysis, have information removed or otherwise reach an equilibrium, and thus ExplodedNodes from different paths may suddenly have the same ProgramStates (and thus cache out).  This is only a theory, however, as to how much of an impact this really has in practice.  Profiling would show this in practice.

 

(3) After a ProgramState has been cleaned out using RemoveDeadBindings.  The ProgramState returned by RemoveDeadBindings may have had a bunch of data cleansed from it that previously distinguished it from other states.  At that point, the path is more likely to cache out.  Again a theory, and one that can be verified by more profiling.

 

When profiling, you'll also want to analyze a variety of codebases, as the characteristics could be very different.






 

The other big area that I'm confused by is the use and purpose of the ImmutableMap typedef Environment::BindingsTy, which maps EnvironmentEntry keys to SVal values.  5% of my time is getting spent in this instantiation's ImmutableMap::Factory::add, largely because of a call to getCanonicalTree.  Why is ImmutableMap being used instead of alternative associative types?  Of particular note is that even iteration is expensive because ImmutableMap nodes can't know their parent.  There are some other ImmutableMaps being used, but I'm hoping that explanations of the EnvironmentEntry->SVal instantiation will help me understand the other instantiations.

 

Stepping back, ProgramStates are immutable values.  The symbolic simulation works by keeping all ProgramStates along a path, and generating a reachable state graph (the ExplodedGraph).  Since ProgramStates are likely to differ only slightly from each other, we use a functional data structure, ImmutableMap, that represents a mapping using a functional AVL tree.  When you add a value to an ImmutableMap, you really just create a new ImmutableMap and the original stays intact.  The trick is that ImmutableMaps *share* large amounts of their map representation (subtrees of the AVL tree) between each other, which represents a significant memory savings.  Alternatively, we could use a reference-counted std::map or something, and when we needed to add a value to the map to create a new ProgramState we could copy the entire map.  That would likely be very wasteful in practice.  Also, ImmutableMap uses a BumpAllocator, which is very fast for allocations.  This is the reason we use ImmutableMap.

 

But… your profiling data hits on something I saw before.  getCanonicalTree() gets called when we unique the maps.  This uniquing is critical for de-duplication of ProgramStates, but we don't always need to de-duplicate ProgramStates unless we think there is a strong possibility we will want to merge two paths together.  That said, recall I mentioned that saving memory was also important.  We also want to de-duplicate ImmutableMaps, since they are used everywhere, for memory savings.  Again, I think profiling data could be the key here, and we can be in the position where we de-duplicate some ImmutableMaps we use less often than others, depending on what the profiling data says.  But the profiling data must include memory usage as a factor; otherwise it would be easy to optimize these CPU numbers but tank memory usage.

 

Another thing to explore is that for smaller maps maybe we don't want an ImmutableMap at all, but just something like an ImmutableList to record a bunch of key value pairs.  Once we de-duplicate states, we could canonicalize around ImmutableMaps when creating a canonicalized ProgramState, or we can switch to ImmutableMap once the list gets too big.  For example, Environment uses an ImmutableMap, but there is an access pattern to values in an Environment.  Typically we just want the subexpressions of the current expression.  If we are working with a binary operator or anything with multiple subexpressions an ImmutableMap might be good, but if there is only one subexpression sometimes an ImmutableMap is overkill.  Or perhaps we can create a data structure that is a mixture of both ImmutableLists and ImmutableMaps, chaining them together.  Why would we do this?  It's less work to manage a small list than an immutable map, but it would be an increase in implementation complexity (that could possibly be hidden by the right abstractions).
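A hedged sketch of the small-list half of that idea, using the real llvm::ImmutableList; the binding type and promotion threshold are invented:

  #include "llvm/ADT/ImmutableList.h"
  #include <utility>

  using Binding = std::pair<int, int>; // (expression, value) stand-ins

  // Linear scan over a tiny list of bindings; a real hybrid would promote
  // to ImmutableMap once the list passes some small threshold (say ~4).
  const int *lookupBinding(llvm::ImmutableList<Binding> Bindings, int Key) {
    for (const Binding &B : Bindings)
      if (B.first == Key)
        return &B.second; // valid while the list's factory is alive
    return nullptr;
  }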

 

 

Question summary:

1. Is there a good, existing benchmark for static analysis that I should be using?

 

In the past, I've often looked at:

 

- sqlite3.c

- postgresql

 

These were good stable C baselines.  I've also analyzed the Python codebase in the past, which also exhibited some of the more algorithmic issues in the analyzer (analyzing the interpreter code in Python has to deal with a bunch of branches and switch statements).  Sqlite3 is nice since it is one file.

 

LLVM/Clang is also a good codebase for C++.  I think we should also pick some other C++ codebases.

 






2. Are there some reasonable criteria to know in advance if an ExplodedNode needs to be stored for later deduplication?

 

See my comment above.






3. Why is ImmutableMap<EnvironmentEntry, SVal> being used instead of std::map, std::hash_map, or even llvm::FoldingSet?

 

Hopefully I've answered this question.

 

Cheers,

Ted






 

Employee of Qualcomm Innovation Center, Inc.

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project

 

_______________________________________________
cfe-dev mailing list
cfe-dev@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


 

