[Sbcl-commits] master: translate a not-quite-ASCII data file to ASCII


List:       sbcl-commits
Subject:    [Sbcl-commits] master: translate a not-quite-ASCII data file to ASCII
From:       Christophe Rhodes via Sbcl-commits <sbcl-commits () lists ! sourceforge ! net>
Date:       2019-09-28 17:17:34
Message-ID: 1569691055.620.12838 () sfp-scm-3 ! v30 ! lw ! sourceforge ! com

The branch "master" has been updated in SBCL:
       via  88d63bb7e09e79b246f73bc4d0611b4a4833966b (commit)
      from  bb63caa5683c83459eca45a9de7881096bd5c68f (commit)

- Log -----------------------------------------------------------------
commit 88d63bb7e09e79b246f73bc4d0611b4a4833966b
Author: Christophe Rhodes <csr21@cantab.net>
Date:   Sat Sep 21 16:41:03 2019 +0100

    translate a not-quite-ASCII data file to ASCII
    
    This is an inelegant remedy for the fact that we can assume almost
    nothing about our host compiler's character encoding.  For maximum
    portability, anything that is read as text in the course of the SBCL
    build should constrain itself to the standard-char repertoire, which
    most but not all Unicode data files do.  Unfortunately,
    CaseFolding.txt has some explanatory comments at its head, including
    a brief discussion of the fact that a single character, such as the
    German eszet, can upcase to more than one character.  That would be
    fine except that the data file contains a UTF-8 encoded eszet.
    
    Treat the CaseFolding.txt data file as a binary artifact, and rewrite
    it such that it can be read as text in the build.
---
 tools-for-build/ucd.lisp | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/tools-for-build/ucd.lisp b/tools-for-build/ucd.lisp
index f28f4d414..d70e7cdaf 100644
--- a/tools-for-build/ucd.lisp
+++ b/tools-for-build/ucd.lisp
@@ -556,7 +556,37 @@ Length should be adjusted when the standard changes.")
              (setf (ucd-misc (gethash code-point *ucd-entries*)) new-misc))))))
 
 (defun fixup-casefolding ()
-  (with-input-txt-file (s "CaseFolding")
+  ;; KLUDGE: CaseFolding.txt as distributed by Unicode contains a
+  ;; non-ASCII character, an eszet, within a comment to act as an
+  ;; example.  We can't in general assume that our host lisp will let
+  ;; us read that, and we can't portably write that we don't care
+  ;; about the text content of anything on a line after a hash because
+  ;; text decoding happens at a lower level.  So here we rewrite the
+  ;; CaseFolding.txt file to exclude the UTF-8 sequence corresponding
+  ;; to the eszet character.
+  (with-open-file (in (make-pathname :name "CaseFolding" :type "txt"
+                                     :defaults *unicode-character-database*)
+                      :element-type '(unsigned-byte 8))
+    (with-open-file (out (make-pathname :name "CaseFolding" :type "txt"
+                                        :defaults *output-directory*)
+                         :direction :output
+                         :if-exists :supersede
+                         :if-does-not-exist :create
+                         :element-type '(unsigned-byte 8))
+      ;; KLUDGE: it's inefficient, though simple, to do the I/O
+      ;; byte-by-byte.
+      (do ((inbyte (read-byte in nil nil) (read-byte in nil nil))
+           (eszet (map '(vector (unsigned-byte 8)) 'char-code "<eszet>")))
+          ((null inbyte))
+        (if (= inbyte #xc3)
+            (let ((second (read-byte in nil nil)))
+              (cond
+                ((null second) (write-byte inbyte out) (return nil))
+                ((= second #x9f) (write-sequence eszet out))
+                (t (write-byte inbyte out) (write-byte second out))))
+            (write-byte inbyte out)))))
+  (with-open-file (s (make-pathname :name "CaseFolding" :type "txt"
+                                    :defaults *output-directory*))
     (loop for line = (read-line s nil nil)
        while line
        unless (or (not (position #\; line)) (equal (position #\# line) 0))

-----------------------------------------------------------------------
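
The byte-level substitution in the diff above can be tried in isolation.
What follows is a minimal sketch, not code from the commit: the function
name replace-eszet-bytes is hypothetical, and it works on an in-memory
octet vector rather than on the pair of file streams that
fixup-casefolding opens.  Like the original, it assumes the host's
char-code values for the characters of "<eszet>" are their ASCII codes.
On well-formed UTF-8 input it should produce the same bytes as the loop
in the commit.

```lisp
;; Hypothetical helper, not part of the commit: the same #xC3 #x9F
;; substitution applied to an octet vector instead of file streams.
(defun replace-eszet-bytes (octets)
  "Return a fresh octet vector in which every UTF-8 encoded eszet
(#xC3 #x9F) in OCTETS is replaced by the bytes of \"<eszet>\"."
  (let ((eszet (map '(vector (unsigned-byte 8)) #'char-code "<eszet>"))
        (out (make-array (length octets)
                         :element-type '(unsigned-byte 8)
                         :adjustable t
                         :fill-pointer 0)))
    (do ((i 0 (1+ i)))
        ((>= i (length octets))
         (coerce out '(simple-array (unsigned-byte 8) (*))))
      (let ((byte (aref octets i)))
        (cond
          ;; #xC3 followed by #x9F is the two-byte UTF-8 form of U+00DF.
          ((and (= byte #xc3)
                (< (1+ i) (length octets))
                (= (aref octets (1+ i)) #x9f))
           (loop for b across eszet
                 do (vector-push-extend b out))
           ;; Skip the continuation byte that was just consumed.
           (incf i))
          ;; Anything else, including a trailing lone #xC3, is copied
          ;; through unchanged.
          (t (vector-push-extend byte out)))))))
```

For example, (replace-eszet-bytes #(97 #xc3 #x9f 98)) yields the octets
of "a<eszet>b" on an ASCII-compatible host, while other two-byte
sequences that start with #xC3 pass through untouched.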


hooks/post-receive
-- 
SBCL


_______________________________________________
Sbcl-commits mailing list
Sbcl-commits@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sbcl-commits