
List:       sbcl-devel
Subject:    [Sbcl-devel] FILE-POSITION & friends [was Re: Encoding/decoding errors & restarts]
From:       Richard M Kreuter via Sbcl-devel <sbcl-devel () lists ! sourceforge ! net>
Date:       2023-07-02 18:03:57
Message-ID: 45577.1688321037 () ruk ! vpn ! home ! arpa

Christophe Rhodes <csr21@cantab.net> wrote:

> [Re: FILE-POSITION] I have tests for these behaviours, and I think
> that they are reasonable.  (I can also see why they might be
> characterized as weird).

"Weird" was counterproductively imprecise. I'll explain below.

But first ISTM worthwhile to spell out what I think ANSI requires of
a portable program that uses FILE-POSITION on character streams.

One-arg FILE-POSITION returns a file position (a nonnegative integer) or
NIL. Two-arg FILE-POSITION returns true if repositioning succeeded,
returns NIL if repositioning failed, or signals an error if the position
"is too large or otherwise inappropriate". No stream is required to
return non-NIL from either one-arg or two-arg FILE-POSITION. That's
about all ANSI says.

One thing to notice is that in order to be portable, a program should be
prepared to receive NIL on each and every one-arg or two-arg call, and
should be prepared for an error in each two-arg call. That is, a
portable program cannot assume that if FILE-POSITION returned non-NIL on
a stream once before, then it will continue to return non-NIL for that
stream in the future, or vice versa. (How could a stream change whether
it can determine its file position or reposition somewhere? Some
possibilities below [*]. Let's suppose it's just a general
possibility. Also, I'll omit mention of what a portable program might do
when FILE-POSITION returns NIL or errors; let's assume a typical program
would halt or exit.)
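
As a minimal sketch of that defensive posture (not something ANSI or
SBCL provides), a program could wrap both forms of FILE-POSITION in a
helper that never assumes success. The name TRY-FILE-POSITION and its
status keywords are hypothetical, chosen only for illustration:

```lisp
;; Hypothetical helper: TRY-FILE-POSITION and its status keywords are
;; illustrative assumptions, not part of ANSI CL or SBCL.
(defun try-file-position (stream &optional (position-spec nil position-spec-p))
  "Call FILE-POSITION defensively.
Return three values: the FILE-POSITION result (or NIL), a status keyword
(:OK, :UNAVAILABLE, or :ERROR), and the condition if one was signaled."
  (handler-case
      (let ((result (if position-spec-p
                        (file-position stream position-spec)
                        (file-position stream))))
        ;; A NIL result is a legitimate answer for any stream, on any call.
        (values result (if result :ok :unavailable) nil))
    (error (condition)
      ;; Two-arg FILE-POSITION may signal for an inappropriate position.
      (values nil :error condition))))
```

A caller would check the status on every call, since a stream that
answered :OK once is not required to do so again.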

Next, although all file positions are nonnegative integers, not all
nonnegative integers are "good" to use as file positions for
repositioning. An arbitrary integer must be considered "not good": using
one for repositioning might put the stream into the middle of a
multi-byte or variable-width character, the middle of a CR/LF, or
wherever; or it might cause FILE-POSITION to error (see "inappropriate"
in the Notes section). There are only 3 explicit ways to get a "good"
file position for a file stream:

1. zero is a "good" file position

2. an integer result from one-arg FILE-POSITION is a "good" file
   position,

3. given a character output stream CS and a string S, the sum, P, of
   (FILE-POSITION CS) plus (FILE-STRING-LENGTH CS S) will be a "good"
   file position once S is written to CS, provided that

   (a) no other I/O operations are performed on CS between the
       FILE-POSITION, FILE-STRING-LENGTH, and the operation that writes S to
       CS,

   and either
   
   (b) the operation that writes S to CS is appending data to the end of
       CS, or
       
   (c) P was previously determined to be a good file position, and no
       subsequent data changes to the stream's contents have made P a
       not-good file position.

A nonnegative integer not from these three origins must be supposed
to be a "not-good" file position.

Case (3.c) needs explanation. A "good" file position can become
"not-good" if something changes the stream's contents. For example, on
SBCL,

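  ;; Assert that FORM's primary value is true, then return all its values.
  ;; (Testing the list from MULTIPLE-VALUE-LIST itself would always
  ;; succeed, since a NIL primary value still yields the list (NIL).)
  (defmacro assert* (form &aux (result (gensym)))
    `(let (,result)
       (assert (first (setq ,result (multiple-value-list ,form))))
       (values-list ,result)))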
  
  (defun where-now-and-then (stream &optional string)
    (let* ((now (assert* (file-position stream)))
           (then (when string
                   (+ now
                      (assert* (file-string-length stream string))))))
      (values now then)))
  
  ;; Assume you're just writing at the end of a normal file stream.
  (let ((string1 "abc"))
    (multiple-value-bind (fp0 fp1)
        (where-now-and-then stream string1)
      (write-string string1 stream) ;; stream now at FP1
      (assert* (file-position stream fp0)) ;; now at FP0
      ;; Suppose CHAR here is any character that encodes to
      ;; a different number of octets than #\c, say
      ;; #\currency_sign (from Latin-1), etc.;
      ;; or is #\Newline if STREAM terminates lines with CR/LF.
      (format stream "ab~C" char) ;; FP1 is not-good now
      (file-position stream fp1))) ;; impossible to know what happens here

ISTM that this subtle "not-good-ification" of file positions has always
been implicit in Common Lisp. A program that's serious about portably
doing "overwrites" on a random access stream probably needs to do some
careful bookkeeping, or padding, or something.
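
One form that bookkeeping might take is to refuse any overwrite whose
encoded length differs from what it replaces. The sketch below is only
an illustration: OVERWRITE-IF-SAME-LENGTH is a hypothetical name, and it
assumes that POSITION is a "good" file position at which OLD-STRING
begins and that FILE-STRING-LENGTH does not depend on hidden stream
state (which a stateful encoding could violate).

```lisp
;; Hypothetical helper, not part of ANSI CL or SBCL.
(defun overwrite-if-same-length (stream position old-string new-string)
  "Overwrite OLD-STRING at POSITION with NEW-STRING, but only when STREAM
reports the same FILE-STRING-LENGTH for both.
Return two values: true or NIL, and a keyword describing the outcome."
  (let ((old-length (file-string-length stream old-string))
        (new-length (file-string-length stream new-string)))
    (cond ((or (null old-length) (null new-length))
           ;; FILE-STRING-LENGTH may return NIL if it cannot tell.
           (values nil :unknown-length))
          ((/= old-length new-length)
           ;; Writing would shift later data and spoil later positions.
           (values nil :length-mismatch))
          ((not (file-position stream position))
           ;; Repositioning failed; an error here simply propagates.
           (values nil :cannot-reposition))
          (t
           (write-string new-string stream)
           (values t :overwritten)))))
```

Padding every field to a fixed encoded length would be another way to
get the same guarantee.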

Otherwise, it should, I think, be possible for a program to portably do
random I/O on a character stream: to read or write something, go back
and read it back in, move around and (carefully) overwrite data, etc.

(I've got a couple stray things to say about "random access" in general,
but they're inessential for a sequential read through this
message. Reposition to [**] if you want those things now.)

> (CLHS describes FILE-POSITION as monotonically increasing.  In my
> understanding, that is distinct from strictly increasing: staying
> at the current value is acceptable.)

I believe I know why you'd think that: that's how I was taught the
meanings of "monotonically increasing" vs "strictly increasing",
too. However, I'm afraid Steele and then ANSI used a more obscure
variant of the mathematical jargon. Have a look at the arithmetic
comparisons:

  The value of < is true if the numbers are in monotonically increasing
  order; otherwise it is false.

That "monotonically increasing" must mean "strictly increasing"; otherwise

  (< 1 1)

could return true. As it turns out, there's an older set of jargon, that
goes like this:

How I was taught                       Steele & ANSI
----------------                       -------------
strictly increasing                    monotonically increasing
monotonically increasing               monotonically nondecreasing
strictly decreasing                    monotonically decreasing
monotonically decreasing               monotonically nonincreasing

The jargon Steele & ANSI use is nowadays so obscure, it's hard even to
find references on the web, but here are two:

https://planetmath.org/monotonicallynondecreasing

https://math.stackexchange.com/questions/302423/monotonically-increasing-vs-non-decreasing

(Some poking around in Google Scholar suggests that even by the time
Steele wrote CLtL1 in 1984, "strictly increasing" or "strictly
decreasing" had 150x as many citations as "monotonically nondecreasing"
or "monotonically nonincreasing"; nowadays the modern terms have a
million times as many citations. So there's no reason anybody would know
this bit of trivia.)
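
The two readings can be stated with the arithmetic comparisons
themselves. In this small sketch (the function names are made up for
illustration), a "flat" run of file positions passes the modern
"monotonically increasing" test but fails the Steele/ANSI one:

```lisp
;; Illustrative names only; both functions take a list of real numbers.
(defun strictly-increasing-p (numbers)
  "Steele/ANSI 'monotonically increasing': each number exceeds the last."
  (or (null numbers) (apply #'< numbers)))

(defun nondecreasing-p (numbers)
  "Modern 'monotonically increasing': no number is below its predecessor."
  (or (null numbers) (apply #'<= numbers)))

;; A sequence with a flat stretch, as in (p0 p0 p1):
;; (strictly-increasing-p '(0 0 1)) is false,
;; (nondecreasing-p '(0 0 1)) is true.
```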

Anyhow, that's just the jargon Steele seemed to use, and we can actually
find it in his initial proposal for how FILE-POSITION should work for
character streams: at https://www.saildart.org/COMMON.4[COM,LSP], in the
message dated 2 Sep 83 0047 EDT (Friday), with Subject: "Comments on
Excelsior manual", Steele wrote:

  [How about just requiring that it be a monotonically increasing function
  of the number of READ-CHAR/WRITE-CHAR operations; that is, xxx-CHAR always
  increments it by some positive integer but not necessarily by 1?]

"monotonically increasing" is explained here as "increments... by
some positive integer".

So I'm pretty confident that strictly increasing was Steele's intent,
for whatever that's worth: FILE-POSITION was to be strictly increasing.

There is one detail in ANSI incompatible with the "strictly increasing"
meaning. However, it's such a uniquely peculiar case that I've put it in
a footnote [***]. Modulo that detail, ISTM that FILE-POSITION was
"supposed to be" strictly increasing in ANSI.

But that's an assessment of what some people intended a long time
ago. Even if they intended that, their intentions are non-binding. So
let's say that both would be conforming options for FILE-POSITION:
either strictly increasing, else monotonically increasing (in the modern
sense).

What's better for users' programs?

Let's imagine that a stream's data would present this progression of
results of pairs of FILE-POSITION and READ-CHAR calls (the "p" row is
the file position prior to a READ-CHAR call in the "c" row):

p: ... p0 ... <0 or more p0>... p0   p1 ...  ;; assume all non-NIL
c: --- c0 ... <0 or more cs>... cN-1 cN ...

That is, this is a picture of a FILE-POSITION that's not strictly
increasing: it's "flat" between [c0..cN-1). Let's suppose that this
file's contents are not changing over the duration being considered, and
that there are no undetectable bit flips, off-by-ones in the OS, or
other data errors that can't reasonably be handled by the
implementation.

Suppose a program is reading characters from the stream, and that the
characters up through cX, 0 <= X < N-1, constitute input that causes
the program to get the file position, in order to be able to resume
reading characters from cX+1. The program calls one-arg FILE-POSITION,
and gets back p0.

When the program subsequently uses p0 with two-arg FILE-POSITION, the
stream state will change so that the next character will (presumably) be
c0. c0 is arbitrarily many characters prior to where the program
"wanted" to be in the stream.

That repositioning behavior certainly puts the "random" in "random
access". :-)

Kidding aside, the serious problem here is that the stream is in an
unwanted location within the stream's progression of elements /due to
the mechanism for getting and setting that location/. I/O operations are
how a program inquires about and manipulates external state: if a
program must know the state in order to decide whether to trust an I/O
operation that reports on that state, why do the I/O operation in the
first place?  A monotonically increasing one-arg FILE-POSITION is just
unusably untrustworthy (that's what I meant by "weird").

For the case above, if the program needs to know to do a certain number
of READ-CHAR calls after the two-arg FILE-POSITION with p0, how is it
supposed to get that knowledge? Call one-arg FILE-POSITION before and
after each READ-CHAR to see whether the file position function has gone
flat? And also avoid READ-LINE, READ-SEQUENCE, or any other operation
that might input characters but not track file positions for each
character of input? That seems an unworkable "workaround" for a
non-strictly-increasing FILE-POSITION.
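
For concreteness, the per-character check described above might look
like the following sketch. READ-CHAR-CHECKING-POSITION is a hypothetical
name; the point is how much ceremony each character costs, and that the
check is lost as soon as the program uses READ-LINE or READ-SEQUENCE
instead.

```lisp
;; Hypothetical helper, not part of ANSI CL or SBCL.
(defun read-char-checking-position (stream)
  "Read one character from STREAM, bracketed by one-arg FILE-POSITION calls.
Return four values: the character (NIL at end of file), the position
before the read, the position after it, and one of :ADVANCED, :FLAT,
or :UNKNOWN."
  (let* ((before (file-position stream))
         (char (read-char stream nil nil))
         (after (file-position stream)))
    (values char before after
            (cond ((not (and char before after))
                   ;; End of file, or a NIL file position on either side.
                   :unknown)
                  ((< before after) :advanced)
                  ;; The position did not grow across this character.
                  (t :flat)))))
```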

None of this scenario can happen with a FILE-POSITION that is strictly
increasing; that seems an unambiguous win.

But what if a stream implementor simply can't return a number bigger
than the previous one, for whatever reason? ISTM that it would be better
for one-arg FILE-POSITION to return NIL in such cases. More precisely,
I'd say one-arg FILE-POSITION should return NIL whenever a stream is in
a state it won't be able to later return to, where the state is
identified as the one where the characters that will be available to
read afterward are the ones that are available to read now, and the
characters that will be overwritten afterward are the ones that will be
overwritten now, modulo any intervening data changes in either case. If
the stream later transitions to states it can reliably reposition to, it
can start returning non-NIL file positions then.

Returning NIL would be a less untrustworthy interface for users:
portable programs must be prepared for NIL returns anyhow. An early and
honest "you can't reposition back to here" result is better than a
number that will silently land the program somewhere random later, don't
you think?

(So far I've deliberately avoided the reason why SBCL's FILE-POSITION
can "go flat": multi-character replacements for input decoding errors,
and zero-length replacements for output encoding errors. No matter what
some future design for encoding/decoding might be, because ISTM programs
"ought to be prepared" to get NIL from FILE-POSITION in any case, I
think NIL is a reasonable thing for SBCL to return when no more precise
and usable a number is available. If there is a real need for
arbitrary-length replacements in the stream layer, I think it'd be
reasonable to have an extension function whose semantic is "get the Unix
file offset of the octet that gave rise to some number of recent and
pending characters on this stream", but only if it also returned the
number of characters read since that offset and the number of pending
characters "within" that offset as extra values. The offset just isn't
usable by itself, without those other numbers.)
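
To illustrate why the extra values matter, here is a sketch of how a
program might use them. No such extension function exists; the sketch
simply assumes the program has somehow obtained an offset and the number
of characters already consumed since that offset, and that decoding from
the offset is deterministic. RESUME-AFTER-OFFSET is a hypothetical name.

```lisp
;; Hypothetical helper; OFFSET and CHARACTERS-CONSUMED stand in for the
;; values the imagined extension function would return.
(defun resume-after-offset (stream offset characters-consumed)
  "Reposition STREAM to OFFSET, then re-read CHARACTERS-CONSUMED characters.
Return true if repositioning succeeded and every character was re-read,
NIL otherwise."
  (when (file-position stream offset)
    (dotimes (i characters-consumed t)
      (declare (ignorable i))
      ;; Hitting end of file early means the bookkeeping was wrong.
      (when (null (read-char stream nil nil))
        (return nil)))))
```

With only the offset, the program would land on the first of the pending
characters; the count is what gets it back to where it actually was.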

Regards,
Richard

[*] Some ways a stream might become unable to determine a file position:

- ANSI does not require that any stream must be able to determine a file
  position. If there are any streams that can't, then a synonym stream
  would be able to "change" whether it can determine a file position,
  since its symbol's value might change over time.

- Suppose a file stream provides a record-oriented view of plain-text
  Unix files, internally counting newlines and computing file positions
  as R*L+C, where R is the current record number, L is some upper bound
  for line length (say, user-supplied during OPEN), and C is the number
  of characters read or written since the last newline or the start of
  the file. (Such a stream would allow a conforming-but-non-portable
  program to do line-addressed random access via arithmetic on file
  positions, say.) For such a stream,

    (file-position stream :end)

  could presumably operate by lseek(fd, 0, SEEK_END), or else by
  spooling through the file sequentially to count newlines. If the
  stream does lseek(), it wouldn't know the line number at the end of
  the file, and would presumably have to stop returning file
  positions. Repositioning such a stream to :START or any file position
  within the highest-numbered sequentially-accessed line could
  conceivably "restore" the stream's ability to determine its file
  position.

- In the same vein, such a stream would probably signal an error during
  two-arg FILE-POSITION when a supplied position decodes into a
  character position bigger than the length of the corresponding
  line. If a file's content were equivalent to

    (format nil "abc~%1234567890~%") ;; Assume Unix Newline convention

  and the file were opened with such a record-oriented view with "L"
  factor ten, then [0..3] and [10..20] would be "appropriate" file
  positions, but [4..9] might not be; so two-arg FILE-POSITION might
  error. If there were a reason for the stream implementor to have the
  error be continuable in a way that leaves the stream in that
  "inappropriate" state, it might be necessary for the stream to cease
  to be able to determine its position until some later operation (e.g.,
  repositioning to an appropriate file position.)

- Still thinking about a record-oriented view of plain text files on
  Unix, opening a file with :IF-EXISTS :APPEND could start the stream in
  a state where it cannot determine a file position. Repositioning to
  :START could perhaps make the stream able to do so.
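
The R*L+C encoding from the bullets above can be sketched in a few
lines. The function names, and the choice to treat the position just
past the final newline as record N, column 0, are assumptions made for
illustration; a real record-oriented stream might decode differently.

```lisp
;; Illustrative sketch only; none of these names come from ANSI or SBCL.
(defun encode-record-position (record column max-line-length)
  "Encode RECORD and COLUMN as a file position, R*L+C."
  (+ (* record max-line-length) column))

(defun decode-record-position (position max-line-length)
  "Return the record number and column encoded in POSITION, as two values."
  (floor position max-line-length))

(defun appropriate-record-position-p (position line-lengths max-line-length)
  "True if POSITION decodes to a column that exists in its line.
LINE-LENGTHS lists each line's length in characters, newline excluded."
  (multiple-value-bind (record column)
      (decode-record-position position max-line-length)
    (let ((record-count (length line-lengths)))
      (cond ((< record record-count)
             ;; A column equal to the line length is the spot just
             ;; before that line's newline.
             (<= column (elt line-lengths record)))
            ;; Assumed: end of file decodes as record N, column 0.
            (t (and (= record record-count) (zerop column)))))))
```

With LINE-LENGTHS '(3 10) and MAX-LINE-LENGTH 10, this predicate accepts
0 through 3 and 10 through 20 and rejects 4 through 9, matching the
example above.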

(Btw. although record-oriented files, or views of files, are
old-fashioned, they're sometimes handy, and certainly "available", e.g.,
via BerkeleyDB. It's not hard to build a conforming Lisp file stream on
top of BerkeleyDB's recno as a "view" of a plain text file, IME. Such a
stream will offer slightly different random access semantics than an
FD-STREAM, but then so does opening a file with element-type
(UNSIGNED-BYTE 8) vs. CHARACTER.)

Anyhow, the point is that file systems can be anything, and file streams
can be anything else again, and ANSI neither specifies nor really
constrains what an implementation might do.

[**] Two more things about random access:

- For FILE-POSITION, the descriptions in CLtL and ANSI differ in some
  small but semantically interesting ways. For two-arg FILE-POSITION,
  CLtL said

    The position may be an integer, or :START for the beginning of the
    stream, or :END for the end of the stream. If the integer is too
    large or otherwise inappropriate, an error is signaled...

  IOW, the error language only applied when the position is an integer.

  ANSI reorganized things, and reworded the Exceptional Situation
  language to

    If POSITION-SPEC is supplied, but is too large or otherwise
    inappropriate, an error is signaled.

  Since ANSI doesn't say otherwise, ISTM that it would be valid for a
  stream implementor to consider :START or :END "otherwise
  inappropriate" for some kind of stream, and so signal an error. This
  is sort of a bummer: it would have been nicer for users had ANSI
  explicitly required that :START and :END always count as
  "appropriate" and that FILE-POSITION return NIL if the stream
  cannot reposition to the start or end. (This would cut down on the
  possible ways a conforming program might have to handle
  implementation-dependent errors.)

- ANSI does not require any relationship between file positions and
  FILE-LENGTH for character streams. (CLtL did, but it was perhaps too
  restrictive.) Consequently, a portable program cannot compare an
  integer result from FILE-LENGTH with a file position for a character
  stream.

  This is also sort of a bummer; I believe ANSI could have required that
  when FILE-LENGTH returns an integer, it must be equal to the one
  you would get when

    (when (file-position stream :end) (file-position stream))

  returns an integer [and does not error, per the previous
  point]. IOW, either FILE-POSITION or FILE-LENGTH could independently
  return NIL, but when they both return integers, they need to use the
  same unit of measure.
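
A program can at least observe how a given implementation relates the
two numbers. The following sketch (the function name is hypothetical)
returns both without assuming either is available; whether they agree
is implementation-dependent, so it is a probe rather than a portable
test.

```lisp
;; Hypothetical helper, not part of ANSI CL or SBCL. Errors from
;; FILE-POSITION or FILE-LENGTH are deliberately left to propagate.
(defun end-position-and-length (stream)
  "Return two values: the file position at the end of STREAM (or NIL),
and the result of FILE-LENGTH. Tries to restore the original position."
  (let ((saved (file-position stream)))
    (unwind-protect
         (values (when (file-position stream :end)
                   (file-position stream))
                 (file-length stream))
      ;; Best effort: go back to where we started, if that was knowable.
      (when saved
        (file-position stream saved)))))
```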

[***] One thing I can find in ANSI that goes against a strictly
increasing FILE-POSITION is in the dictionary entry for
BROADCAST-STREAM, which specifies that FILE-POSITION always returns zero
for a broadcast stream with no components (a "null output stream"). A
constant function isn't strictly increasing, so that would contradict
the interpretation of "monotonically increasing" as "strictly
increasing" in all cases.

So one option would be to say "Ha! Told you so! 'Monotonically
increasing' doesn't mean 'strictly increasing'!"

I think that's a valid move, but ISTM worthwhile to observe that this
detail got into ANSI via the X3J13 Issue,
BROADCAST-STREAM-RETURN-VALUES, whose effect on the language merits
inspection.

First, it defines FILE-POSITION as a constant function for one type of
stream, either contradicting or forcing a reinterpretation of language
in FILE-POSITION.

Second, it added a more explicit contradiction to the standard, requiring

   (let ((string <any-string>)
         (bs (make-broadcast-stream)))
     (= (+ (file-position bs)              ; required to be 0
           (file-string-length bs string)) ; required to be 1
        (progn (write-string string bs)
               (file-position bs))))       ; required to be 0
   => NIL                                  ; supposed to be T

which goes against the specification for FILE-STRING-LENGTH.
FILE-STRING-LENGTH on a null broadcast stream should have been specified
to return zero, to get the required file position algebra to work right.

Next, the issue specifies that

  (stream-element-type (make-broadcast-stream)) => T

I believe this makes a null output stream the only standard stream that
can return a supertype of CHARACTER or UNSIGNED-BYTE. That's not
terrible, but it means that if you're the sort of person who ever does
something like

  (subtypep (stream-element-type stream) 'character)

you'll miss a null output stream.

Next, the Issue also specifies that

  (stream-external-format (make-broadcast-stream)) => :DEFAULT

This requirement either implies that STREAM-EXTERNAL-FORMAT returns an
external file format designator (not an external file format), or that
:DEFAULT is required (not merely permitted) to be an external file
format. In SBCL and some other Lisps, using :DEFAULT as an external file
format designator gets you a stream for which STREAM-EXTERNAL-FORMAT
returns an object other than :DEFAULT; IOW, :DEFAULT is merely an
external file format designator, not an external file format per se. So
on the off chance you might need to do, say

  ;; Exactly which comparison is right for external file formats
  ;; is implementation-dependent.
  (equal (stream-external-format s1) (stream-external-format s2))
         
you probably have to special-case a null output stream.
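
One way to write that special case is a predicate like the following
sketch. NULL-OUTPUT-STREAM-P is a hypothetical name; it relies only on
the standard BROADCAST-STREAM type and the BROADCAST-STREAM-STREAMS
accessor.

```lisp
;; Hypothetical helper built from standard operators only.
(defun null-output-stream-p (stream)
  "True if STREAM is a broadcast stream with no component streams."
  (and (typep stream 'broadcast-stream)
       (null (broadcast-stream-streams stream))))
```

Code that compares element types or external formats could test for
this case first and handle it separately.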

I suspect that BROADCAST-STREAM-RETURN-VALUES was a bit of a "rush job"
during the Public Comment period in 1993. AFAICT, by the time this issue
was drafted, the X3J13 Cleanup subcommittee might have become inactive,
and it appears that the issue was voted on by letter ballot (most
preceding issues were approved at in-person meetings, perhaps after some
in-person discussion). So it may be that the Issue wasn't so closely
examined as earlier ones during X3J13's earlier years.

So ISTM these specifications for a null output stream ought to be taken
with a grain of salt. In multiple respects, a null output stream must be
treated by users as "the exception that proves the rule": for its
element type, its external format, its behavior for
FILE-STRING-LENGTH. I say it's worthwhile to consider a null output
stream a unique exception for FILE-POSITION, too.

(If there were ever a future revision of the standard, I think it would
be okay to simply state "a broadcast stream with no components is the
unique case where..." say; except somebody really ought to fix
FILE-STRING-LENGTH. One and zero ain't zero.)

