[prev in list] [next in list] [prev in thread] [next in thread] 

List:       sas-l
Subject:    Re: how to delete a subject when some variable has missing values
From:       Sigurd Hermansen <HERMANS1 () WESTAT ! COM>
Date:       2003-10-31 21:40:21
[Download RAW message or body]

Ian:
I agree that most RDBMS's (all that I know) do not preserve details in group
by queries. One danger of that is the situation in SAS SQL where the yield
of a query contains a summary column. In a remerge, the values in the column
likely will not add up as one might expect.

In the form of query that you and I wrote, the yield of the summary function
does not appear in the yield of the query. For that reason, I would
distinguish it from a remerge, whether or not SAS remerges to solve the
query.

In a private message about a related question, Harry Droogendyk suggested a
NOT EXISTS <subquery> as way to implement key set complements. The SAS SQL
compiler could apply something similar under the hood to each ID group to
implement a NOT SUM(wt IS MISSING) expression. For example, sort the dataset
by ID and then write each group of ID's to a temporary file, checking each
row for a missing value. If none in the group, write the group to an output
dataset; else, discard it. Or better, index all rows with missing values,
and then write to output all rows with ID's not found on the index. Though
neither method would select output during one pass through the data, neither
merges anything onto rows in the output dataset.
Sig


-----Original Message-----
From: Ian Whitlock
Sent: Thursday, October 30, 2003 9:40 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: how to delete a subject when some variable has missing values ?
example inside


Sigurd,

I agree that the summing of IS NULL may have more generality, but part of my
point is that for generality one has to avoid an implicit remerge.

14   proc sql  ;
15      create table q as
16      select * from w
17         group by id
18         having nmiss(wt) = 0
19      ;
NOTE: The query requires remerging summary statistics back with the original
data.
NOTE: Table WORK.Q created, with 850 rows and 2 columns.

20   quit ;
NOTE: PROCEDURE SQL used:
      real time           0.35 seconds
      cpu time            0.04 seconds

The remerging is required because the GROUP BY and HAVING clauses ask for
processing at the group level, but WT is not at the group level therefore
the group level information has to be remerged with the original W.  Now
maybe one could, in principle, have an SQL compiler that would solve the
issue without remerging.  For example, the record level data could be stored
in an array as the group processing is done and then records output from the
array when the group level information is available.  But that solution
sounds like one any SQL implementer would choose to avoid.  Can you suggest
a believable implementation plan without using a JOIN?  Perhaps saving a
pointer to the beginning record of the current group would provide such a
method.  However, that begins to sound an awful lot like a join, i.e.
remerge.  Perhaps it is the term, REMERGE, that you think can be avoided
rather than the process.

To be specific, Access, for example, simply treats it as an error.  I think
you will find that the ANSI requirement leaves it up to the developer as to
whether it should be treated as an error in specification or not.  To me,
SAS made a very good choice here, albeit an easy one considering how often
one wants group statistics at the record level in SAS.  The only argument
against, that I can see, is that it might be a mistake (I have often seen
students making this mistake in classroom exercises), and the implementer
wants to protect you from your oversight by making you spell out the request
in explicit detail.

Can you, or anyone else, name any SQL other than SAS that will accept the
above SQL code when the HAVING clause is replaced by an appropriately
general form?

IanWhitlock@westat.com

-----Original Message-----
From: Sigurd Hermansen
Sent: Wednesday, October 29, 2003 10:30 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: how to delete a subject when some variable has missing values ?
example inside


Ian:
I agree that the NMISS() function in SAS SQL (only) is clearer than anything
based on Boolean expressions. I had forgotten about that SAS SQL summary
function.

As for generality, I don't find the NMISS() function outside SAS SQL.
Boolean expression in HAVING clauses (perhaps substituting IS NULL for IS
MISSING) might work better in other flavours of SQL.

Further, while SAS may remerge to implement a GROUP BY clause, I don't see
any direct connection between a HAVING clause and the SAS REMERGE. A SQL
dialect may implement groups as pointers to file locations or indexes.
Implementing a condition on a HAVING clause does not have to involve
remerging. Sig

-----Original Message-----
From: Ian Whitlock
To: SAS-L@LISTSERV.UGA.EDU
Sent: 10/29/2003 3:14 PM
Subject: Re: how to delete a subject when some variable has missing values ?
example inside

Sigurd,

I would think

   group by id having nmiss(wt) = 0

would be much clearer in SAS.

I then wondered about the conventionality.  Most SQL's do not allow the
remerging that makes this code possible in SAS.  I should think a form using
a subquery to get the list of wanted (or not wanted) ID's would be more
standard SQL way.

I was suprised at how much more efficient the remerging process was in SAS
SQL than the subquery form.  However, I think this is another indication
that SAS SQL has been optimised for joins rather than subqueries.

In comparing the implied join with an explicit join I was again surprised at
how much better the performance of the implied join over an explicit one.

group by
      real time           10.16 seconds
      cpu time            6.75 seconds

explicit join
      real time           15.61 seconds
      cpu time            12.45 seconds

subquery
      real time           22.11 seconds
      cpu time            21.21 seconds

IanWhitlock@westat.com

-----Original Message-----
From: Sigurd Hermansen
Sent: Wednesday, October 29, 2003 1:21 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: how to delete a subject when some variable has missing values ?
example inside


A more conventional SQL solution looks agrammatic but a bit more intuitive
and robust than the solution using MONOTONIC():  ......  select * from test
group by id having not sum(wt is missing)  ; Sig

<snip>

[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic