'Re: subsetting large raw file?'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       sas-l
Subject:    Re: subsetting large raw file?
From:       Peter Crawford <Peter () CRAWFORDSOFTWARE ! DEMON ! CO ! UK>
Date:       1998-06-30 20:00:14
[Download RAW message or body]

In article <6naunr$uuf$1@nnrp1.dejanews.com>, Bruce Rogers
<B.Rogers@virgin.net> writes
>In article <vcNzzaAgK$l1EwwN@crawfordsoftware.demon.co.uk>,
>  Peter Crawford <Peter@crawfordsoftware.demon.co.uk> wrote:
>>
>> In article <6n7m5i$e50$1@nnrp1.dejanews.com>, Bruce Rogers
>> <B.Rogers@virgin.net> writes
>> >In article <6n3db9$p50@arctic.nprdc.navy.mil>,
>> >  stanonik@nprdc.navy.mil (Ron Stanonik) wrote:
><original post snipped>
>
>> >Assuming that your datasets are both sorted correctly (which your merge
would
>> >require in any case) a simple solution (untested code) follows:
>> >
>> >data both ;
>> >length id act_id $ 8 ;
>> >retain act_id ;
>> >infile history ;
>> >input id @ ; /* id only - '@' leaves record open */
>> >  do while(act_id < id  & ^eof) ;
>> >     infile action end=eof;
>> >     input act_id ;
>> >  end;
>> >  if eof then STOP ;
>> >  if act_id = id ;
>> >RUN;
>
>Whoops! Nobody spotted the deliberate mistake - the lack of any further input
>statements from the history file!
>
>> Bruce,
>> I'm feeling ready for your challenge... almost
>> So ...big IF
>> if the big file has no order into which the little file can be forced,
>> to coordinate a pseudo merge like you suggest, a little bit of sql (not
>> allways my favourite) might help
>> data action ;
>>  infile '<action file definition>' ;
>>  input @NNN id fff. /* replace nnn and fff with position and informat*/
>>        @..... further action fields as necessary           ;
>> run;
>>
>> proc sql noprint;
>>   select id into :ids seperated by ',' from action ;
>> quit;    /* load &ids with a comma delimited list of the id required
>>             from history; and this list can be up up to 32k long! ****/
>>
>> data changing /view=changing;
>>   infile history;
>>   input @nnn id fff.@ ;
>>   if id in( &ids );
>>   input .....all the fields you might like to use/keep/update;
>> run;
>> /*I suggest using a view to avoid storing another intermediate copy **/
>
>Sometimes, I find that the act of using a view can actually slow the process
>- I guess it must be dependant on platform/machine/resources etc.
One of the guidance notes on datastep views, indicates that to obtain
best performance switch off the stimer option. If that is in place, the
the view is timed .. so it is probably executed entirely separately -
just providing the behaviour you describe.
It's a catch 22
        to gain performance, switch off performance monitoring!!


>
>The original post implies that the action file contains _only_ the id, so your
>solution (which I rather like) could just stop here (without the view, of
>course). The rest would only be required if there were further fields on the
>action dataset.
>
>>
>> proc sort data= changing out= histbit; by id; run;
>>
>> /** the original merge but now, only using relevant history records
>>     and in id order     **/
>>
>>         data subset;
>>         merge action (in=in_a) histbit (in=in_h);
>>         by id;
>>         if in_a & in_h;
>>
>> It might only improve on your solution, if the history is not ordered
>
>Certainly, if the history file is not sorted, then something like the above
>would be required. Of course, depending on the platform used and the reletive
>file sizes, an external file on the history file might be more efficient than
>a SAS sort on the subset dataset.
>
>I do like the use of symbolic variables from proc sql, although I confess I've
>never seen the 'seperated[sic] by' parameter before. (Actually, having just
>tried it, I find that SAS quite happily accepts the mis-spelling! )
>
>There is a problem with this solution, however. It will only work if the ID
>values are numeric, as the resulting statement would read :
>if id in(a,b,c,d ... ) ;
>and therefore fall over with a syntax error. In this case the relevant lines
>would have to change to :
>
>select id into :ids separated by '","' from action ;
>  and
>if id in("&ids" ) ;
I'ld recommend using
        select quote(id)  into :ids separated by ', ' from action
instead, but it adds up to the same,

>
>Another, similar solution might be to transpose the action dataset and use
>something like:
>
>proc sql ;
>select count(id) into:idcnt from action ;
>quit ; /* &idcnt will be the number of col vars in the transposed dataset */
>
>proc transpose data=action out=idcols ;
>run; /* creates a single record containing col1-colnnn - max 32k */
>
>data subset (drop=col1-col&idcnt) ;
>if _n_ = 1 then set idcols ;
>infile history ;
>input id @ ;
>if id in (col1-col&idcnt) ; /
>input .... /* rest of variables */
>
>It's quite likely that this would be less efficient than the symbolic variable
>list, as the PDV would have to contain the extra variables. On the other hand,
>maybe the long literal list would take up more virtual memory - I'd be
>interested to know which but haven't got the time now. Got any ideas?

I too, would be interested to learn which is more resource hungry.
My expectation is that managing datastep data vector variables consumes
more cycles than one long macro var ...

... but who can tell us ?

>
>Regards
--
Peter Crawford

[prev in list] [next in list] [prev in thread] [next in thread]
Configure | About | News | Add a list | Sponsored by KoreLogic