
List:       hadoop-user
Subject:    Re: How to select random n records using mapreduce ?
From:       Matt Pouttu-Clarke <Matt.Pouttu-Clarke () icrossing ! com>
Date:       2011-06-27 20:01:28
Message-ID: CA2E2FA8.6427%Matt.Pouttu-Clarke () icrossing ! com

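If the incoming records are unique, you can hash each record and take the
hash modulo some number to select a pseudo-random subset.  So if you wanted
about 10% of the data:

hash % 10 == 0

This gives roughly 10% of the records, provided the hash function spreads
them evenly. The selection is deterministic: the same record is always
either in or out of the sample.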
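
A minimal sketch of the hash-and-modulus test in Python, for example for a
Hadoop Streaming mapper. The function name keep_record, the choice of MD5 as
the hash, and the sample records are assumptions for illustration, not part
of the original suggestion. A stable digest is used because Python's built-in
hash() is salted per process for strings, so separate mapper processes could
disagree about the same record.

```python
import hashlib


def keep_record(record: str, buckets: int = 10) -> bool:
    """Return True when the record's stable hash falls into bucket 0."""
    # MD5 is used only as a well-distributed, process-independent hash;
    # it is not being relied on for security here.
    digest = hashlib.md5(record.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an unsigned integer.
    value = int.from_bytes(digest[:8], "big")
    return value % buckets == 0


# Hypothetical input: 10,000 unique records.
records = [f"record-{i}" for i in range(10000)]
sample = [r for r in records if keep_record(r)]
```

With buckets=10 the sample should hold about one tenth of the records. The
exact count varies with the data, so this yields a percentage rather than a
fixed N.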


On 6/27/11 12:54 PM, "Habermaas, William" <William.Habermaas@fatwire.com>
wrote:

> I did something similar.  Basically I had a random sampling algorithm that I
> called from the mapper. If it returned true I would collect the data,
> otherwise I would discard it.
> 
> Bill 
> 
> -----Original Message-----
> From: niels@basj.es [mailto:niels@basj.es] On Behalf Of Niels Basjes
> Sent: Monday, June 27, 2011 3:29 PM
> To: mapreduce-user@hadoop.apache.org
> Cc: core-user@hadoop.apache.org
> Subject: Re: How to select random n records using mapreduce ?
> 
> The only solution I can think of is by creating a counter in Hadoop
> that is incremented each time a mapper lets a record through.
> As soon as the value reaches a preselected value the mappers simply
> discard the additional input they receive.
> 
> Note that this will not be random at all, yet it's the best I can
> come up with right now.
> 
> HTH
> 
> On Mon, Jun 27, 2011 at 09:11, Jeff Zhang <zjffdu@gmail.com> wrote:
> > 
> > Hi all,
> > I'd like to select N random records from a large amount of data using
> > Hadoop, and I wonder how I can achieve this. My current idea is to let
> > each mapper task select N / mapper_number records. Does anyone have
> > experience with this?
> > 
> > --
> > Best Regards
> > 
> > Jeff Zhang
> > 
> 
> 
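
The mapper-side filter Bill describes in the quoted message (call a sampling
function, collect the record if it returns true, discard it otherwise) could
look roughly like the sketch below. This is not Bill's actual algorithm, which
the thread does not show; it assumes a simple per-record coin flip with a
fixed keep rate. The names should_keep and map_records and the seed parameter
are hypothetical.

```python
import random


def should_keep(rng: random.Random, rate: float) -> bool:
    """Flip a biased coin: True with probability `rate`."""
    return rng.random() < rate


def map_records(records, rate, seed=None):
    """Yield only the records the sampler accepts, as a mapper would emit them."""
    # Each mapper would hold its own generator; a seed makes a run repeatable.
    rng = random.Random(seed)
    for record in records:
        if should_keep(rng, rate):
            yield record


# Hypothetical usage: keep about 10% of 10,000 input records.
kept = list(map_records(range(10000), rate=0.1, seed=42))
```

Like the hash approach, this returns approximately rate * total records, not
an exact N. Hitting N exactly needs an extra step, such as trimming the
combined output in a single reducer.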



