List:       sas-l
Subject:    Re: Randomly Print Records
From:       John Whittington <John.W () MEDISCIENCE ! CO ! UK>
Date:       1999-09-30 21:24:49

Since I suggested:

proc print data = yourfile (where = ( ranuni(2345346) < 20/1000000 ) ) ;
run ;
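For anyone who wants to see the mechanics of this Bernoulli-style selection outside SAS, here is a rough Python sketch of the same idea (the seed and the record source are purely illustrative, not part of the SAS code above): each record is kept independently with probability 20/1,000,000, so the sample size itself is random.

```python
import random

random.seed(2345346)  # illustrative seed, echoing the RANUNI seed above

def bernoulli_sample(records, target, total):
    """Keep each record independently with probability target/total,
    mirroring the WHERE clause  ranuni(seed) < 20/1000000."""
    p = target / total
    return [r for r in records if random.random() < p]

# stand-in for 'yourfile': one million dummy record numbers
sample = bernoulli_sample(range(1_000_000), 20, 1_000_000)
print(len(sample))  # usually close to 20, but it varies from run to run
```

The point of the sketch is that nothing forces the sample to contain exactly 20 records - hence the question about the distribution of sample sizes, answered below.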

... several people have asked me about the chances of ending up with
samples of particular sizes using this approach.  The theoretical answer to
this is simply a matter of application of the binomial distribution.  The
following code, and the output which follows, gives the probabilities of
getting a sample of each possible size from 0 to 40, from a source dataset
of 1,000,000 when 'aiming for' a sample of 20 - although the 20 and the
1,000,000 can obviously be changed in the code to suit any other situation:

data doit ;
   /* Exact binomial probabilities of sample sizes 0 to 40 when each of
      1,000,000 records is selected with probability 20/1,000,000.
      PROBBNML(p, n, m) is cumulative, so successive differences give
      the probability of exactly n successes. */
   n = 0 ; p = probbnml(20/1e6, 1e6 , 0) ; p2 = p*100 ; output ;
   do n = 1 to 40 ;
      p = probbnml(20/1e6, 1e6 , n) - probbnml(20/1e6, 1e6, n-1) ;
      p2 = p*100 ;
      output ;
   end ;
   format p p2 16.14 ;
   label n = 'No. in Sample'
         p = 'Probability of this'
         p2 = 'Probability as %' ;
run ;

proc print data = doit label noobs ; run ;

... gives output:

No. in         Probability
Sample             of this    Probability as %

   0      0.00000000206074    0.00000020607414
   1      0.00000004121565    0.00000412156529
   2      0.00000041216436    0.00004121643597
   3      0.00000274781186    0.00027478118590
   4      0.00001373929286    0.00137392928637
   5      0.00005495805079    0.00549580507870
   6      0.00018319625058    0.01831962505810
   7      0.00052342518680    0.05234251868001
   8      0.00130857997866    0.13085799786603
   9      0.00290799040430    0.29079904043011
  10      0.00581604478568    0.58160447856764
  11      0.01057473263144    1.05747326314427
  12      0.01762471300992    1.76247130099162
  13      0.02711516001609    2.71151600160886
  14      0.03873621403719    3.87362140371920
  15      0.05164859527888    5.16485952788821
  16      0.06456106690885    6.45610669088481
  17      0.07595450018630    7.59545001862986
  18      0.08439414228268    8.43941422826766
  19      0.08883611692044    8.88361169204376
  20      0.08883620575845    8.88362057584528
  21      0.08460591024620    8.46059102461959
  22      0.07691438694430    7.69143869442992
  23      0.06688194183694    6.68819418369368
  24      0.05573478432266    5.57347843226550
  25      0.04458764910328    4.45876491032804
  26      0.03429802012356    3.42980201235620
  27      0.02540578839360    2.54057883936003
  28      0.01814686467825    1.81468646782497
  29      0.01251497896582    1.25149789658247
  30      0.00834324421918    0.83432442191819
  31      0.00538268437747    0.53826843774717
  32      0.00336414072923    0.33641407292276
  33      0.00203884870226    0.20388487022636
  34      0.00119930717453    0.11993071745305
  35      0.00068530879080    0.06853087907976
  36      0.00038072139498    0.03807213949781
  37      0.00020579205585    0.02057920558457
  38      0.00010830976701    0.01083097670057
  39      0.00005554247046    0.00555424704570
  40      0.00002777070756    0.00277707075644
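As a cross-check on PROBBNML, the table above can be reproduced with the exact binomial PMF; this Python sketch (standard library only, not part of the original SAS program) computes the same probabilities directly from the binomial coefficient:

```python
from math import comb

def binom_pmf(k, n, p):
    # exact integer binomial coefficient times floating-point powers
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 1_000_000, 20 / 1_000_000
probs = {k: binom_pmf(k, n, p) for k in range(41)}
# probs[19] and probs[20] both come out at about 0.0888,
# agreeing with rows 19 and 20 of the table
```

This agrees with the SAS output to many decimal places, which is a useful sanity check on both computations.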

It can be seen that the highest probability is, indeed, that of getting
exactly 20 in the sample, but there is only an 8.9% chance of that happening
- and 19 has almost exactly the same chance (the ratio of successive
probabilities, P(20)/P(19), is almost exactly 1 when the expected count is
20), with probabilities becoming steadily smaller as one moves away from 20.

Regards,


John

----------------------------------------------------------------
Dr John Whittington,       Voice:    +44 (0) 1296 730225
Mediscience Services       Fax:      +44 (0) 1296 738893
Twyford Manor, Twyford,    E-mail:   John.W@mediscience.co.uk
Buckingham  MK18 4EL, UK             mediscience@compuserve.com
----------------------------------------------------------------
