
List:       httpclient-users
Subject:    Re: HttpClient performance with multiple threads; Re: AbstractNIOConnPool memory leak?
From:       Ken Krugler <kkrugler_lists () transpac ! com>
Date:       2013-01-07 21:38:36
Message-ID: 4F414B9B-6E63-473A-AA15-0770A0DBE140 () transpac ! com


Hi Oleg,

Thanks for the responses. I've filed a Bixo issue to try using the new minimal
version of HttpClient, and also the unlimited connection manager.

I'll try to test using an existing crawl workflow that hits the top-level pages for
60K domains, though that's not exactly the same as a large-scale crawl.

-- Ken


On Jan 7, 2013, at 2:39am, Oleg Kalnichevski wrote:

> On Sun, 2013-01-06 at 15:48 -0800, Ken Krugler wrote:
> > Hi Oleg,
> > 
> > [snip]
> > 
> > > Ken,
> > > 
> > > You might want to have a look at the latest code in SVN trunk (to be
> > > released as 4.3). Several classes such as the scheme registry that
> > > previously had to be synchronized in order to ensure thread safety have
> > > been replaced with immutable equivalents. There is also now a way to
> > > create HttpClient in a minimal configuration without authentication,
> > > state management (cookies), proxy support and other non-essential
> > > functions.
> > 
> > That sounds interesting - any hints as to how to create this minimal HttpClient?
> > 
> 
> The new API is not yet final and not properly documented. Presently this
> can be done with HttpClients#createMinimal.
> 
> 
> > > These functions are not merely disabled but physically
> > > removed from the processing pipeline, which should result in somewhat
> > > better performance in high thread contention scenarios, as the only
> > > synchronization point involved in request execution would be the lock of
> > > the connection pool. Minimal HttpClient may be particularly useful for
> > > anonymous web crawling when authentication and state management are not
> > > required.
> > > 
> > > 
> > > > 3. Global lock on connection pool
> > > > 
> > > > Oleg had written:
> > > > 
> > > > > Yes, your observation is correct. The problem is that the connection
> > > > > pool is guarded by a global lock. Naturally if you have 400 threads
> > > > > trying to obtain a connection at about the same time all of them end up
> > > > > contending for one lock. The problem is that I can't think of a
> > > > > different way to ensure the max limits (per route and total) are
> > > > > guaranteed not to be exceeded. If anyone can think of a better algorithm
> > > > > please do let me know. What might be a possibility is creating a more
> > > > > lenient and less prone to lock contention issues implementation that may
> > > > > under stress occasionally allocate a few more connections than the max
> > > > > limits.
> > > > 
> > > > I don't know if this has been resolved. My work-around from a few years ago
> > > > was to rely on having multiple Hadoop reducers running on the server (each in
> > > > their own JVM), where I could then limit each JVM to at most 300 connections.
> > > > 
> > > 
> > > I experimented with the idea of lock-less (unlimited) connection manager
> > > but in my tests it did not perform any better than the standard
> > > connection manager.
> > 
> > Previously I'd asked:
> > 
> > > Would it work to go for finer-grained locking, by using atomic counters to
> > > track & enforce limits on per route/total connections?
> > 
> > Any thoughts on that approach? E.g. have a map from route to atomic counter, and
> > a single atomic counter for total connections?
> 
> This may be worthwhile to try. However, in theory this should not
> perform any better than the approach I took with my experiments. The
> main problem is, though, that I do not have a good test framework that
> emulates an environment a web crawler is expected to operate in (and
> have no justification for building one in my spare time). So, this kind
> of effort ideally should be led by an external contributor.
> 
> Oleg
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: httpclient-users-unsubscribe@hc.apache.org
> For additional commands, e-mail: httpclient-users-help@hc.apache.org
> 
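
For reference, the atomic-counter idea quoted above (a map from route to atomic counter, plus a single atomic counter for total connections) might look roughly like the sketch below. This is only an illustration under assumptions, not HttpClient code: the class name AtomicConnectionLimiter, its methods, and the use of a plain String as the route key are all hypothetical, and it covers only limit enforcement, not connection reuse or blocking until a connection becomes free.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical sketch: enforces per-route and total connection limits
 * with atomic counters instead of a global lock. Not part of HttpClient.
 */
final class AtomicConnectionLimiter {
    private final int maxPerRoute;
    private final int maxTotal;
    // Single counter for all connections currently handed out.
    private final AtomicInteger total = new AtomicInteger();
    // One counter per route; the String key stands in for a real route type.
    private final ConcurrentHashMap<String, AtomicInteger> perRoute =
            new ConcurrentHashMap<>();

    AtomicConnectionLimiter(int maxPerRoute, int maxTotal) {
        this.maxPerRoute = maxPerRoute;
        this.maxTotal = maxTotal;
    }

    /** Increments the counter only while it is below the limit (CAS loop). */
    private static boolean incrementIfBelow(AtomicInteger counter, int limit) {
        while (true) {
            int current = counter.get();
            if (current >= limit) {
                return false;
            }
            if (counter.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    /** Tries to reserve one connection slot for the route without blocking. */
    boolean tryAcquire(String route) {
        if (!incrementIfBelow(total, maxTotal)) {
            return false;
        }
        AtomicInteger routeCount =
                perRoute.computeIfAbsent(route, r -> new AtomicInteger());
        if (!incrementIfBelow(routeCount, maxPerRoute)) {
            total.decrementAndGet(); // roll back the total reservation
            return false;
        }
        return true;
    }

    /** Returns a slot; callers must only release what they acquired. */
    void release(String route) {
        AtomicInteger routeCount = perRoute.get(route);
        if (routeCount == null) {
            throw new IllegalStateException("No connections held for " + route);
        }
        routeCount.decrementAndGet();
        total.decrementAndGet();
    }

    int totalInUse() {
        return total.get();
    }
}
```

Neither limit is ever exceeded here, but a caller can be refused spuriously: while one thread holds a total slot that it is about to roll back, another thread may see the total limit as reached. Whether this actually beats the global lock under crawler-style load is exactly the open question in this thread.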

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr









