List:       xml-dev
Subject:    [xml-dev] How to make better files?
From:       Roger L Costello <costello () mitre ! org>
Date:       2022-08-12 20:32:21
Message-ID: SA9PR09MB5952AC158E6B5DD252DDEB58C8679 () SA9PR09MB5952 ! namprd09 ! prod ! outlook ! com

Hi Folks,
Scenario: There is a file.
What's in the file? What kind of file is it? Who produced it? When? What kind of data does it hold? Is it safe to open?
Where will you find answers to those questions?
Old-school Unix used a stream-of-bytes metaphor for files. Every file is just a sequence of bytes. Some authors refer to this as formatless files. Michael Kay points out that, in reality, the files are not formatless; rather, their format is simply not known at some level of the system, and it is up to applications to determine the file's format. Michael Kay wrote:
    Applications are left to guess by making inferences from
    the file name extension, or by sniffing the content, all of
    which is unreliable and insecure.
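
To make the two kinds of guessing concrete, here is a minimal Python sketch of extension lookup versus content sniffing. The signature table and the function names are illustrative assumptions, not any particular tool's implementation, and a real signature database would be far larger.

```python
import mimetypes

# Hypothetical, deliberately tiny signature table: (leading bytes, label).
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"PK\x03\x04", "application/zip"),
    (b"<?xml", "application/xml"),
]


def guess_from_name(filename):
    """Guess a media type from the file name extension alone."""
    media_type, _encoding = mimetypes.guess_type(filename)
    return media_type


def guess_from_content(head):
    """Guess a media type by sniffing the first bytes of the content."""
    for signature, media_type in MAGIC_SIGNATURES:
        if head.startswith(signature):
            return media_type
    return None  # unknown: the bytes alone do not tell us


# PNG bytes stored under a misleading name: the two guesses disagree,
# and nothing in the file itself says which one to trust.
png_head = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
by_name = guess_from_name("report.pdf")
by_content = guess_from_content(png_head)
print(by_name, by_content)
```

Neither guess is authoritative. The name can be changed by anyone, and a signature such as the ZIP one is shared by many unrelated formats that happen to use ZIP as a container.
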
Liam pointed out that there is a Unix command called "file" which does a pretty decent job of inspecting files and figuring out what they are.
There is a spectrum of "file knowingness." At one end of the spectrum is old-school Unix: a file is a stream of bytes. Nothing is known about the file. You need to sniff its content and make inferences. What lies at the other end of the spectrum? How would you characterize that end of the spectrum? How about this characterization: We know virtually everything about a file. We know its character encoding. We know what application produced it. How long it is. When it was created. Where it was created. What kind of data it contains. What kind of applications can process it. Whether it is or isn't safe to open. Do you agree with that characterization? What else would you add?
At which end of the spectrum do you want your files? Is one end of the spectrum better? Better in what way? Should we all strive to transition our files to one end of the spectrum?
Where does XML live in the spectrum? I suspect it lives somewhere in the middle. Michael Kay argues that XML doesn't do a particularly good job of "file knowingness," as he wrote:
    Conventions like putting the encoding in a header or using
    strings like xmlns="..." to identify the vocabulary, are ad-hoc
    and unsystematic, and they're very often at the wrong level
    of the system (you should know the encoding before you start
    trying to interpret the characters).
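
The chicken-and-egg problem in that last parenthesis can be sketched in a few lines of Python. This is a simplified take on the autodetection idea described in the non-normative appendix of the XML 1.0 specification, not a conforming implementation: the function name is made up for illustration, and the sketch ignores EBCDIC and UTF-16 without a byte order mark.

```python
import codecs
import re

# Byte order marks, checked before any characters can be interpreted.
# UTF-32 comes before UTF-16 because BOM_UTF32_LE starts with BOM_UTF16_LE.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Reading the declaration already assumes an ASCII-compatible encoding:
# we must guess the encoding in order to read the encoding.
DECLARED = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def guess_xml_encoding(data):
    """Guess the encoding of an XML document from its first bytes."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    match = DECLARED.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"  # XML's default when there is no BOM and no declaration


doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'
print(guess_xml_encoding(doc))
```

The encoding is only discovered by sniffing bytes first and trusting the document's own claim second, which is the layering problem the quote describes.
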
How can we make better XML?
How can we make better files?
/Roger

