List:       hadoop-user
Subject:    Re: Speed up a job thats been running for 60+ hours (long)
From:       "Tamer Elsayed" <telsayed () gmail ! com>
Date:       2008-05-29 14:06:31
Message-ID: a8d399c00805290706v3225f4cax83c968a560cc89a5 () mail ! gmail ! com

Hi,

The number of pairs you are emitting is totally dominated by the most frequent
users (i.e., the users with the longest lists of groups). If you can accept
approximate results, then I'd suggest dropping the *top* 1% (or even 0.1%) of
users, ranked by frequency. In a very similar study at the University of
Maryland, with about a million documents (which correspond to the groups in
your problem), we managed to get a linear-time approximation of this
quadratic-complexity problem. Here is the link to the study:

http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf

In a search-related application, we found experimentally that this trick
(specifically, dropping the top 0.1%) results in a drop of just 2% in
effectiveness. Of course, this might be different for different applications.
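
In practice the cut is cheap to implement: one pass to count each user's
groups, pick the cutoff that trims the top 1%, then have the mapper skip users
above it. A minimal sketch of that guard against the 0.17 API (the class name
and the pairs.max.groups parameter are hypothetical):

    import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    public class PairEmitter extends MapReduceBase
        implements Mapper<Text, MapWritable, Text, IntWritable> {

      // Cutoff on group-list size, chosen in a first pass so that
      // roughly the top 1% (or 0.1%) of users exceed it.
      private int maxGroups;

      public void configure(JobConf job) {
        maxGroups = job.getInt("pairs.max.groups", 1000);
      }

      public void map(Text user, MapWritable groups,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        if (groups.size() > maxGroups) {
          return;  // drop the most prolific users entirely
        }
        // ... emit group pairs as usual ...
      }
    }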

Tamer

On 5/26/08, jkupferman <jkupferman@umail.ucsb.edu> wrote:
> 
> 
> Hi Yuri,
> So each user actually outputs (n^2 - n)/2 records, where n is the number of
> groups it is a member of. If the groups are arranged in an array indexed
> 0...n-1, then the group at index x is paired with every index from x+1 to
> n-1.
> 
> But yes, it does output a LOT of records. This is why I used the combiner,
> which has been shown to decrease the number of output records by about 10x,
> and based on my understanding the combiner is run locally, so only the
> combined records actually make it to the sort.
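> 
> Roughly, the map side plus that combiner hookup looks like this (a sketch
> against the 0.17 API; SumReducer stands in for whatever the summing reducer
> class is actually called):
> 
>     import java.io.IOException;
>     import org.apache.hadoop.io.*;
>     import org.apache.hadoop.mapred.*;
> 
>     public class PairMapper extends MapReduceBase
>         implements Mapper<Text, MapWritable, Text, IntWritable> {
> 
>       private static final IntWritable ONE = new IntWritable(1);
> 
>       public void map(Text user, MapWritable groups,
>                       OutputCollector<Text, IntWritable> output,
>                       Reporter reporter) throws IOException {
>         // Index the groups so index x pairs with x+1..n-1, which is
>         // where the (n^2 - n)/2 records per user come from.
>         Writable[] g = groups.keySet().toArray(new Writable[0]);
>         for (int i = 0; i < g.length; i++) {
>           for (int j = i + 1; j < g.length; j++) {
>             output.collect(new Text(g[i] + "-" + g[j]), ONE);
>           }
>         }
>       }
>     }
> 
>     // in the driver: the summing reducer doubles as the combiner, so
>     // partial sums happen locally before records hit the sort
>     conf.setCombinerClass(SumReducer.class);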
> 
> I took a look at the implementation and the output is buffered, so hopefully
> that helps, since if it were written directly to disk on every output it's
> understandable why it would be slow. I have io.file.buffer.size set to 4096;
> since I am outputting so much, should I increase this size quite a bit? How
> big should I be looking to make it?
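> 
> For reference, this is how it's wired up in my driver at the moment (sketch;
> GroupPairs is a stand-in for the actual job class):
> 
>     JobConf conf = new JobConf(GroupPairs.class);
>     // 4096 is the default; unclear how much larger is sensible here
>     conf.setInt("io.file.buffer.size", 4096);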
> 
> Thanks for the help
> 
> 
> 
> 
> Yuri Kudryavcev wrote:
> > 
> > Hi.
> > 
> > I would really like some input on this case, since I'm trying to scale up a
> > similar algorithm.
> > 
> > I may be totally wrong, please correct me. :)
> > So you're emitting C(n,2) = n(n-1)/2 group pairs from every user record by
> > pairing up its groups?
> > For n = 100 groups for an average user, that's 100*99/2 = 4950 output
> > records for every user. Do you see similar numbers in the logs?
> > I think inflating the intermediate record count by that factor is what
> > degrades performance.
> > 
> > - Yuri.
> > 
> > On 5/26/08, jkupferman <jkupferman@umail.ucsb.edu> wrote:
> > > 
> > > 
> > > Hi everyone,
> > > I am using Hadoop (0.17) to try to do some large-scale user comparisons,
> > > and although the programs are all written, it's taking incredibly long to
> > > run, and it seems like it should be going faster. I would really like some
> > > insight into what I could do to speed this up aside from just "add more
> > > computers". I would really appreciate some help from all of the sagacious
> > > Hadoop core-users.
> > > 
> > > The basic idea is that there are a bunch of users, each of which is in
> > > some groups. I would like to know how many users each combination of
> > > groups has in common. I laid out the data using sequence files, which
> > > seems to be working well and quickly; each sequence file entry has a Text
> > > user name and a MapWritable which contains all of the groups they are in.
> > > The map function takes in each user and outputs every combination of the
> > > groups it is a part of, along with a 1 as the instance counter (like in
> > > wordcount). So user x, which is a member of groups 1,2,3,4, will output
> > > 1-2,1-3,1-4,2-3,2-4,3-4 as keys. Given that there are a lot of users, I
> > > added a combiner, which reduces the number of records about 10x. The
> > > reducer is really simple: it just sums up the total for each combination
> > > and then outputs it to a file. Just as an aside, I make sure to use
> > > IntWritables just about everywhere, which I hoped would help since there
> > > are inevitably tons of comparisons going on.
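> > > 
> > > The reducer is essentially just this (sketch, 0.17-style API, plus
> > > java.util.Iterator and the usual org.apache.hadoop imports):
> > > 
> > >     public static class SumReducer extends MapReduceBase
> > >         implements Reducer<Text, IntWritable, Text, IntWritable> {
> > >       public void reduce(Text pair, Iterator<IntWritable> counts,
> > >                          OutputCollector<Text, IntWritable> output,
> > >                          Reporter reporter) throws IOException {
> > >         int sum = 0;
> > >         while (counts.hasNext()) {
> > >           sum += counts.next().get();  // total users sharing this pair
> > >         }
> > >         output.collect(pair, new IntWritable(sum));
> > >       }
> > >     }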
> > > 
> > > This is being done on about 4 GB of user data on a cluster of 20 Large
> > > instances on Amazon's EC2. With that much data there are about 240 map
> > > tasks, and I have it set to run 10 map tasks per task tracker. With those
> > > settings the slaves are running at about 100% CPU, and memory is just
> > > about at capacity but there is almost no paging. Although the tasks seem
> > > to be progressing, some of the tasks that have just completed ran for 30+
> > > hours. Some of the tasks have failed with a "Lost task tracker:" error,
> > > which I intend to fix with HADOOP-3403_0_20080516.patch whenever this job
> > > finishes.
> > > 
> > > It seemed to me that the problem might have been calling the collector so
> > > many times, since users can be in thousands of groups and it does about
> > > n^2 comparisons. I tried another version which outputs only n records per
> > > user by having each entry output a map, but this did not prove much better
> > > in the test trials I ran, and the extra work in the reducer is a real
> > > killer.
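> > > 
> > > That alternative mapper looked roughly like this: one MapWritable "stripe"
> > > of co-occurring groups per group, instead of individual pairs (sketch):
> > > 
> > >     public void map(Text user, MapWritable groups,
> > >                     OutputCollector<Text, MapWritable> output,
> > >                     Reporter reporter) throws IOException {
> > >       Writable[] g = groups.keySet().toArray(new Writable[0]);
> > >       for (int i = 0; i < g.length; i++) {
> > >         // every other group this user links to g[i], counted once
> > >         MapWritable stripe = new MapWritable();
> > >         for (int j = 0; j < g.length; j++) {
> > >           if (i != j) {
> > >             stripe.put(g[j], new IntWritable(1));
> > >           }
> > >         }
> > >         output.collect(new Text(g[i].toString()), stripe);
> > >       }
> > >     }
> > > 
> > > The reducer then has to merge all of those maps element-wise, which is
> > > where the extra work goes.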
> > > 
> > > It is not clear to me what is dragging down this job, or what I can do to
> > > increase the rate at which it is computing. Although there is quite a bit
> > > of data, it doesn't seem like it should be taking this long on 20 nodes.
> > > Any help/questions/comments would be greatly appreciated. Thanks for all
> > > of your help.
> > > 
> > > 
> > > 
> > > 
> > 
> > 
> 
> 
> 


-- 
Proud to be a follower of the "Best of Mankind"
"æóÇÐúßõÑú ÑóÈøóßó ÅöÐóÇ äóÓöíÊó æóÞõáú ÚóÓóì Ãóäú íóåúÏöíóäöí ÑóÈøöí
áÃÞúÑóÈó ãöäú åóÐóÇ ÑóÔóÏðÇ"


