[prev in list] [next in list] [prev in thread] [next in thread] 

List:       sas-l
Subject:    Re: Missing Values, Primary Keys, and Unique Indexes
From:       Kevin Myers <KevinMyers () AUSTIN ! RR ! COM>
Date:       2003-01-31 20:48:29
[Download RAW message or body]

I wanted to follow up on this discussion after having some additional time
to think it over a bit further...

Consider the following example of a (simplified) table used to hold land
survey data for the state of Texas.  In Texas, most land surveys are based
on a hierarchical arrangement of land measures: Surveys, blocks, and
sections.  So, it would make sense for example, to define a Section table
with a composite primary key of SurveyName, BlockNumber, and SectionNumber.
Unfortunately, however, there is a complication:  Some surveys or blocks
consist of only a single, unnumbered, undivided area.  In those sitiuations
(and many others similar to them with other types of data), the widespread,
and commonly accepted business practice is to simply use a blank value for
BlockNumber and/or SectionNumber when the higher level land area is
undivided.

I also have some other tables that are updated fairly routinely, which are
related to the Survey, Block, and Section tables.  To insure the integrity
of these tables, I need to be able to define foreign keys that point to
primary keys on the Survey, Block, and Section tables.  I also need the
primary keys to help enforce the uniqueness of the key values in the Survey,
Block, and Section tables themselves.  But the present SAS implementation
won't let me do that, because the desired primary key fields contain blanks,
since that is the accepted, standard business manner of encoding these
items.  SAS interprets the blank values as NULLS, and therefore refuses to
allow the primary key to be created.

Now if this was the only example like this that I could think of, then I
could just work around this single problem by substituting some oddball
value and forget about it.  But this same scenario turns up over and over
again, in multiple tables of most every single database that I look at.  And
once again, I should mention that this issue is extremely prevelant in data
from outside suppliers where I have absolutely no control over their coding
practices.

So, let's consider this problem a little deeper.  Some others have suggested
that I am violating relational theory by desiring to index blank values.
BUT, that is simply NOT true!  In making that statement, these fine folks
are making the very same invalid assumption that SAS has implemented in
code.  They are assuming that blank values are the same as NULL values, and
that is simply NOT true!!!

A NULL value is inherently unknown.  My blank values are NOT unknown.  They
are very specifically blank and nothing else.  Blank is a valid value here,
equivalent in stature to any other value.  For example, a blank value can be
compared to another value to determine equality, whereas that is NOT the
case for a true NULL.

Now of course the problem in SAS is that there is no real distinction
between NULL and blank values.  SAS has decided in this specific case to
treat blank values as if they are truly NULL, which therefore means they are
inappropriate for use in a primary key.  But who asked SAS to make that very
limiting decision for me???

Now to be fair I must mention that there are some situations where it makes
reasonably good sense in SAS to treat blank values in some respects as if
they are NULL, since they are the closest thing to a null character value
that SAS has.  However, as shown by my examples, there is a real need in
many cases for SAS to NOT make this assumption, and carry this treatment to
undesirable extremes.

One might argue that to provide an option to disable treating blank value as
NULLs would be too complex and require too much effort for implementation.
But that would seem extremely unlikely since SAS has a long history of NOT
treating blank values as NULLS before they ever got started down this path.
Everywhere else in SAS, blank values are NOT given any special NULL
treatment.  All that needs to be done to provide the support that I'm asking
for is to simply provide an option that bypasses the current non-blank check
for primary keys.  Any other necessary handling must essentially already be
in place, because as far as all of the underlying guts of SAS are concerned,
a blank value is no different from any other value.  It can be sorted,
compared, concatenated, etc.  There is no special significance to blank
values that distinguishes them from any other value in SAS, as far as their
ability to be indexed, etc. under the covers.

I hope that made some sense.  As usual, it took much longer to say all that
than I might have hoped.  Summarized in many fewer words:

1. My desire to include blank values in a primary key does NOT violate
relational principles.  It is the inaccurate assumption on the part of the
present SAS implementation and others on this list that blank=NULL which
produces this invalid and undesirably limiting conclusion.

2. SAS should NOT force the preceding assumption on the user, adversely
affecting their ability to define primary keys that may include blank
values.  Blank values within a primary key are an extremely common real
world business need, and once again, that is NOT the same as NULL values.

3. It should be very easy to provide an option to turn off the check for
blank values when defining a primary key.  SAS would then simply revert to
treating blank values the same way that it treats any other value, which
would by the way, completelely support uniqueness.

So you see, when you get right down to the nitty gritty details and don't
make invalid assumptions, I am actually on the side of the relational
purists after all.  SAS shouldn't force me into an undesirable corner
because of invalid assumptions regarding the actual "nullness" of my blank
values.  Nor should the current restriction even apply at all, given that
blanks are only being used to "simulate" nulls where convenient in PROC SQL,
and that they are most definitely NOT the same thing.

I hope this may help to clarify my position somewhat.
s/KAM


----- Original Message -----
From: "Kevin Myers" <KevinMyers@austin.rr.com>
To: "SAS Users" <sas-l@listserv.uga.edu>
Sent: Thursday, January 30, 2003 12:34 PM
Subject: Re: Missing Values, Primary Keys, and Unique Indexes


> See below...
>
> ----- Original Message -----
> From: "Ian Whitlock" <WHITLOI1@WESTAT.com>
> To: "'Kevin Myers'" <KevinMyers@AUSTIN.RR.COM>; <SAS-L@LISTSERV.UGA.EDU>
> Sent: Thursday, January 30, 2003 12:09 PM
> Subject: RE: Missing Values, Primary Keys, and Unique Indexes
>
>
> > Kevin,
> >
> > In part you write:
> >
> > >I don't have a problem with the *default* implementation of something
> being
> > >the normal "safe" way.  But there should be a way to turn that off when
> the
> > >safety net gets in the way.
> >
> > I agree, and in this case you should either turn off the indexing or
> create
> > your own index.
>
> It isn't desirable to turn off the indexing in my application, because
users
> may be perform ad-hoc updates to the data using multiple updating
> applications and multiple data sources.  Therefore the data respository
> needs all of the protection that integrity constraints can provide.  The
> same issues make creating my own index undesirable for a couple of
reasons:
> 1) You can't define a foreign key against a plain old index; it must be a
> primary key.  2) Creating my own index has the same non-blank requirement
as
> creating a primary key, unless I go with a non-unique key, which sort of
> defeats using the index as an integrity constraint.
>
> >
> > After my suggestion that you use "FF"x to stand for all blank you wrote
> >
> > >>>>>>>
> > Now, let's say for example that I run PROC SUMMARY with NWAY against a
> > table, and some of the class columns contain blank values.  Furthermore,
> > let's say that I want to store that information in another table,
possibly
> > along with some additional data that is relevant to each specific
> > combination of class column values.  Now in this scenario, the class
> columms
> > are the OBVIOUS primary keys for my new table.  BUT, since SAS won't let
> me
> > create a unique index on columns that include blank values, I can't use
my
> > class columns as the primary keys for the table!!!!!  Now what other
> options
> > might I have...?   Well let's see, how about a surrogate key???  Oh
darn,
> > SAS doesn't support those either, and the relational purist types hate
'em
> > anyway!
> > >>>>>>>
> >
> > So it appears that your character index can contain all possible values
of
> a
> > byte. Well then let me suggest another way to create your own index.
> > Suppose X is the character classification variable then add a DATA step
> > after the summary making a viable _X which has one extra byte in it.
> >
> >     if x = " " then _x = <your choice(not blank)> || x ;
> >     else _x = <second choice> || x ;
> >
> > Yes, I suppose you will loose a byte on 32K character strings.  If it is
> > important then perhaps 32K character strings should not be used as
indexes
> > or you can partition the variable into two character strings and replace
> the
> > one index with two indices.
> >
> > Now that I think about, it probably makes better sense to put the extra
> byte
> > at the end instead of the beginning. It could lead to inefficiencies,
but
> > you get the advantage that you could always control what is printed with
a
> > format or create the orignal variable by simply changing the length.
>
> Yes, I had considered something along those lines as well.  Although I
could
> probably make it work, it is very ugly and undesirable.  From a data
> modeler/data adminstrator/data user point of view you are making changes
to
> the real data, and that is almost always a bad thing.  It will produce
> greater application complexity, maintenance headaches, user errors, and
> increased user training requirements.  Unfortunately, I'll have to agree
> that might be the best that I can do for the moment...
>
> But, as an application developer, why should I need to write my
application
> around the limitation of not having blank values within columns of a
primary
> key, when there are no good reasons for that restriction to be *REQUIRED*,
> and some very good reasons why it shouldn't be?  The language is going out
> of its way to block the easiest solution to my problem, and it shouldn't
be
> doing that!
>
>
> >
> > IanWhitlock@westat.com
>

[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic