List:       sas-l
Subject:    Re: Predicting size of summarized dataset for every permutation
From:       HERMANS1 <hermans1 () WESTAT ! COM>
Date:       1997-11-30 18:08:18

We summarize datasets whenever possible.  The only real difficulties
occur when data contain continuous or nearly-continuous values.  Some
degree of rounding, aggregation, or other summarization of these types
of variables has to take place prior to any attempt to summarize the
dataset.  In general, simple frequencies of individual variables show
how many distinct instances of values appear in each.  From there, the
maximum number of distinct rows of values in the dataset equals the
product of the number of distinct instances per variable.  For
example, a dataset composed of 2 possible values of gender, 10 age
groups, 50 geographical regions, and 10 categories of illness could
have up to 10,000 distinct states of the 4 variables.  Nonetheless,
many data sources have significant levels of correlation among several
variables.  If, for example, females in 2 age categories living in 5
regions and in 1 illness category account for 80% of all N rows of
source data, then the maximum number of distinct states of the 4
variables becomes 10 + .2N: at most 10 states cover that 80% of the
rows, and at worst each of the remaining .2N rows is distinct.
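The product-of-cardinalities bound is easy to get in one pass with PROC
SQL.  A sketch only -- the dataset name SOURCE and the variable names
GENDER, AGEGRP, REGION, and ILLNESS are hypothetical stand-ins for the
example above:

```sas
/* Each COUNT(DISTINCT var) is that variable's cardinality; their
   product is the upper bound on the number of distinct rows a
   GROUP BY of all four variables could produce.  CALCULATED lets
   later columns reuse the counts computed earlier in the SELECT. */
proc sql;
  select count(distinct gender)  as n_gender,
         count(distinct agegrp)  as n_agegrp,
         count(distinct region)  as n_region,
         count(distinct illness) as n_illness,
         calculated n_gender * calculated n_agegrp *
           calculated n_region * calculated n_illness as max_states
  from source;
quit;
```

With the cardinalities in the example (2, 10, 50, 10), MAX_STATES would
come out to 10,000.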

Your knowledge of the source data offers the best bet for finding a
good summarization scheme.  Partitioning data into different tables
also facilitates summarization of complex datasets.  No general rule
exists for predicting the actual number of distinct states that x
variables will have.  If one existed, every dataset could be reduced
to a smaller summary dataset without any loss of information!   Sig
______________________________ Reply Separator _________________________________
Subject: Predicting size of summarized dataset for every permutation
Author:  Harmon Jolley <jolleyh@HLTHSRC.COM> at Internet-E-Mail
Date:    11/26/97 4:49 PM


We frequently create summarized datasets using PROC SQL's GROUP BY
feature.  If we knew how many observations each possible combination of
the variables eligible for the GROUP BY would create, we could pick the
best set of variables to include.  That is, we would balance the
benefit gained from having summary statistics for a given list of
GROUP BY variables against the storage required.  If the wrong list of
GROUP BY variables is chosen, the result is nearly one-to-one in size
with the detail source data.

One notion that I had for doing this would be to generate every
possible combination of the GROUP BY variables and get a COUNT(*) for
each result.  For instance, a GROUP BY A, B, C results in 100 rows, a
GROUP BY A, C gives 75 rows, and a GROUP BY A, B gives 50 rows.  Does
anyone know how to generate all possible combinations using SAS?
Anyone know of an alternative approach to get the results that we need?
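Since order does not matter to GROUP BY, the sets to enumerate are
subsets (combinations) rather than permutations, and a bitmask loop in
the macro language can walk through all of them.  A sketch only -- the
dataset DETAIL and variables A B C are hypothetical, and COUNTW and
BAND via %SYSFUNC assume a later SAS release:

```sas
/* Enumerate every nonempty subset of the candidate GROUP BY
   variables by treating the loop index MASK as a bitmask, then
   count how many distinct groups each subset yields. */
%macro count_groups(data=, vars=);
  %local nvars mask j list ngroups;
  %let nvars = %sysfunc(countw(&vars));
  %do mask = 1 %to %eval(2**&nvars - 1);
    %let list = ;
    %do j = 1 %to &nvars;
      /* Bit j of MASK set --> include the j-th variable. */
      %if %sysfunc(band(&mask, %eval(2**(&j - 1)))) %then %do;
        %if %length(&list) %then %let list = &list, %scan(&vars, &j);
        %else %let list = %scan(&vars, &j);
      %end;
    %end;
    proc sql noprint;
      select count(*) into :ngroups
        from (select distinct &list from &data);
    quit;
    %put NOTE- GROUP BY &list yields %left(&ngroups) rows.;
  %end;
%mend count_groups;

%count_groups(data=detail, vars=A B C);
```

With 3 candidate variables this runs 7 queries; with x variables it
runs 2**x - 1, so the approach is only practical for a modest number of
candidates.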

