List: sbcl-commits
Subject: [Sbcl-commits] master: translate a not-quite-ASCII data file to ASCII
From: Christophe Rhodes via Sbcl-commits <sbcl-commits () lists ! sourceforge ! net>
Date: 2019-09-28 17:17:34
Message-ID: 1569691055.620.12838 () sfp-scm-3 ! v30 ! lw ! sourceforge ! com
The branch "master" has been updated in SBCL:
via 88d63bb7e09e79b246f73bc4d0611b4a4833966b (commit)
from bb63caa5683c83459eca45a9de7881096bd5c68f (commit)
- Log -----------------------------------------------------------------
commit 88d63bb7e09e79b246f73bc4d0611b4a4833966b
Author: Christophe Rhodes <csr21@cantab.net>
Date: Sat Sep 21 16:41:03 2019 +0100
translate a not-quite-ASCII data file to ASCII
This is not an elegant remedy to the fact that we can assume almost
nothing about our host compiler's character encoding. For maximum
portability, anything that is read as text in the course of the SBCL
build should constrain itself to the standard-char repertoire, which
most but not all Unicode data files do: unfortunately, CaseFolding has
some explanatory comments at its head, including a brief discussion of
the fact that a single character, such as the German eszet, can upcase
to more than one character. That would be fine, except that the data
file contains a UTF-8 encoded eszet.
Treat the CaseFolding.txt data file as a binary artifact, and rewrite
it such that it can be read as text in the build.
---
tools-for-build/ucd.lisp | 32 +++++++++++++++++++++++++++++++-
1 file changed, 31 insertions(+), 1 deletion(-)
diff --git a/tools-for-build/ucd.lisp b/tools-for-build/ucd.lisp
index f28f4d414..d70e7cdaf 100644
--- a/tools-for-build/ucd.lisp
+++ b/tools-for-build/ucd.lisp
@@ -556,7 +556,37 @@ Length should be adjusted when the standard changes.")
(setf (ucd-misc (gethash code-point *ucd-entries*)) new-misc))))))
 (defun fixup-casefolding ()
-  (with-input-txt-file (s "CaseFolding")
+  ;; KLUDGE: CaseFolding.txt as distributed by Unicode contains a
+  ;; non-ASCII character, an eszet, within a comment to act as an
+  ;; example.  We can't in general assume that our host lisp will let
+  ;; us read that, and we can't portably write that we don't care
+  ;; about the text content of anything on a line after a hash because
+  ;; text decoding happens at a lower level.  So here we rewrite the
+  ;; CaseFolding.txt file to exclude the UTF-8 sequence corresponding
+  ;; to the eszet character.
+  (with-open-file (in (make-pathname :name "CaseFolding" :type "txt"
+                                     :defaults *unicode-character-database*)
+                      :element-type '(unsigned-byte 8))
+    (with-open-file (out (make-pathname :name "CaseFolding" :type "txt"
+                                        :defaults *output-directory*)
+                         :direction :output
+                         :if-exists :supersede
+                         :if-does-not-exist :create
+                         :element-type '(unsigned-byte 8))
+      ;; KLUDGE: it's inefficient, though simple, to do the I/O
+      ;; byte-by-byte.
+      (do ((inbyte (read-byte in nil nil) (read-byte in nil nil))
+           (eszet (map '(vector (unsigned-byte 8)) 'char-code "<eszet>")))
+          ((null inbyte))
+        (if (= inbyte #xc3)
+            (let ((second (read-byte in nil nil)))
+              (cond
+                ((null second) (write-byte inbyte out) (return nil))
+                ((= second #x9f) (write-sequence eszet out))
+                (t (write-byte inbyte out) (write-byte second out))))
+            (write-byte inbyte out)))))
+  (with-open-file (s (make-pathname :name "CaseFolding" :type "txt"
+                                    :defaults *output-directory*))
     (loop for line = (read-line s nil nil)
           while line
           unless (or (not (position #\; line)) (equal (position #\# line) 0))
-----------------------------------------------------------------------
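For readers who want to try the byte-level rewrite outside a Lisp image, the following is a minimal Python sketch of the same filtering logic as the patch above. It is an illustration only, not part of the commit: the function name asciify_eszett and the sample input are hypothetical, and the sketch works on in-memory streams rather than on the files under *unicode-character-database* and *output-directory*.

```python
import io

# Assumption for illustration: the same ASCII stand-in the patch writes.
PLACEHOLDER = b"<eszet>"


def asciify_eszett(src, dst):
    """Copy the binary stream src to dst one byte at a time, replacing the
    UTF-8 sequence C3 9F (U+00DF) with PLACEHOLDER.  Hypothetical helper
    mirroring the DO loop in the patch; other bytes pass through unchanged."""
    while True:
        first = src.read(1)
        if not first:                      # end of input
            break
        if first == b"\xc3":
            second = src.read(1)
            if not second:                 # lone C3 at end of file: keep it
                dst.write(first)
                break
            if second == b"\x9f":          # C3 9F is the eszet
                dst.write(PLACEHOLDER)
            else:                          # some other two-byte sequence
                dst.write(first + second)
        else:
            dst.write(first)


# Usage on a made-up comment line (not the real CaseFolding.txt text).
source = io.BytesIO(b"# e.g. \xc3\x9f -> ss\n")
target = io.BytesIO()
asciify_eszett(source, target)
result = target.getvalue()
```

As in the patch, only the exact pair C3 9F is rewritten; any other non-ASCII bytes would be copied through unchanged, so the output is ASCII only if the eszet is the sole non-ASCII character in the input.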
hooks/post-receive
--
SBCL
_______________________________________________
Sbcl-commits mailing list
Sbcl-commits@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sbcl-commits