List:       avro-commits
Subject:    svn commit: r1414978 - in /avro/trunk: CHANGES.txt lang/java/avro/src/main/java/org/apache/avro/io/p
From:       cutting () apache ! org
Date:       2012-11-28 22:42:35
Message-ID: 20121128224236.045EE2388A4A () eris ! apache ! org

Author: cutting
Date: Wed Nov 28 22:42:34 2012
New Revision: 1414978

URL: http://svn.apache.org/viewvc?rev=1414978&view=rev
Log:
AVRO-1178. Java: Fix typos in parsing document. Contributed by Martin Kleppmann.

Modified:
    avro/trunk/CHANGES.txt
    avro/trunk/lang/java/avro/src/main/java/org/apache/avro/io/parsing/doc-files/parsing.html


Modified: avro/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/avro/trunk/CHANGES.txt?rev=1414978&r1=1414977&r2=1414978&view=diff
 ==============================================================================
--- avro/trunk/CHANGES.txt (original)
+++ avro/trunk/CHANGES.txt Wed Nov 28 22:42:34 2012
@@ -39,6 +39,9 @@ Trunk (not yet released)
     AVRO-1210. Java: Fix mistakes in AvroMultipleOutputs error messages.
     (Dave Beech via cutting)
 
+    AVRO-1178. Java: Fix typos in parsing document.
+    (Martin Kleppmann via cutting)
+
   BUG FIXES
 
     AVRO-1171. Java: Don't call configure() twice on mappers & reducers.

Modified: avro/trunk/lang/java/avro/src/main/java/org/apache/avro/io/parsing/doc-files/parsing.html
URL: http://svn.apache.org/viewvc/avro/trunk/lang/java/avro/src/main/java/org/apache/avro/io/parsing/doc-files/parsing.html?rev=1414978&r1=1414977&r2=1414978&view=diff
 ==============================================================================
--- avro/trunk/lang/java/avro/src/main/java/org/apache/avro/io/parsing/doc-files/parsing.html (original)
+++ avro/trunk/lang/java/avro/src/main/java/org/apache/avro/io/parsing/doc-files/parsing.html Wed Nov 28 22:42:34 2012
@@ -24,7 +24,7 @@
 
 This document shows how an Avro schema can be interpreted as the definition of a \
context-free grammar in LL(1).  We use such an interpretation for two use-cases.  In \
one use-case, we use them to validate readers and writers of data against a single \
Avro schema.  Specifically, sequences of <code>Encoder.writeXyz</code> methods can be \
validated against a schema, and similarly sequences of <code>Decoder.readXyz</code> \
methods can be validated against a schema.  
-The second use-case is using grammars to perform schema resolution.  For this \
use-case, we've developed a subclass of <code>Decoder</code> which takes two Avro \
schemas as input -- a reader and a writer schema.  This subclass accepts an input \
stream written according to the writer schema, and presents it to a client expecting \
the reader schema.  If the writer writes a long, for example, where the reader \
expects a double, then the <code>Decoder.readDoubl</code> method will convert the \
writer's long into a double.
+The second use-case is using grammars to perform schema \
resolution.  For this use-case, we've developed a subclass of <code>Decoder</code> \
which takes two Avro schemas as input -- a reader and a writer schema.  This subclass \
accepts an input stream written according to the writer schema, and presents it to a \
client expecting the reader schema.  If the writer writes a long, for example, where \
the reader expects a double, then the <code>Decoder.readDouble</code> method will \
convert the writer's long into a double.  
 This document looks at grammars in the context of these two use-cases.  We first \
look at the single-schema case, then the double-schema case.  In the future, we \
believe the interpretation of Avro schemas as CFGs will find other uses (for example, \
to determine whether or not a schema admits finite-sized values).  
@@ -33,7 +33,7 @@ This document looks at grammars in the c
 
 <p> We parse a schema into a set of JSON objects.  For each record, map, array, \
union schema inside this set, this parse is going to generate a unique identifier \
"n<sub>i</sub>" (the "pointer" to the schema).  By convention, n<sub>0</sub> is the \
identifier for the "top-level" schema (i.e., the schema we want to read or write).  \
In addition, where n<sub>i</sub> is a union, the parse will generate a unique \
identifier "b<sub>ij</sub>" for each branch of the union.  
-<p> A context-free grammar (CFG) consists of a set of terminal-symbols, a set of \
non-terminal symbols, a set of productions, and a start symbol.  Here's how we \
interpret an Avro schema as a CFG:
+<p> A context-free grammar (CFG) consists of a \
set of terminal symbols, a set of non-terminal symbols, a set of productions, and a \
start symbol.  Here's how we interpret an Avro schema as a CFG:  
 <p> <b>Terminal symbols:</b> The terminal symbols of the CFG consist of \
<code>null</code>, <code>bool</code>, <code>int</code>, <code>long</code>, \
<code>float</code>, <code>double</code>, <code>string</code>, <code>bytes</code>, \
<code>enum</code>, <code>fixed</code>, <code>arraystart</code>, \
<code>arrayend</code>, <code>mapstart</code>, <code>mapend</code>, and \
<code>union</code>.  In addition, we define the special terminals <code>"1"</code>, \
<code>"2"</code>, <code>"3"</code>, <code>...</code> which designate the "tag" of a \
union (i.e., which branch of the union is actually being written or was found in the \
data).  
@@ -206,7 +206,7 @@ Note that <code>T</code> is defined as <
 
 <p>The first section ("The interpretation") informally describes the grammer \
generated by an Avro schema.  This section provides a more formal description using a \
set of induction rules.  The earlier description in section one is fine for \
describing how a single Avro schema generates a grammar.  But soon we're going to \
describe how two schemas together define a "resolving" grammar, and for that \
description we'll need the more formal mechanism described here.  
-<p>The terminal and non-terminal symbols in our grammar are as described in the \
first section.  Our induction rules will define a function "C(S)=&lt;G,a&gt;", which \
takes an Avro schema "S" and returns a pair consisting of a set of productions "X" \
and a symbol "a".  This symbol "a" -- which is either a terminal, or a non-terminal \
defined by G -- generates the values described by schema S.
+<p>The terminal and \
non-terminal symbols in our grammar are as described in the first section.  Our \
induction rules will define a function "C(S)=&lt;G,a&gt;", which takes an Avro schema \
"S" and returns a pair consisting of a set of productions "G" and a symbol "a".  This \
symbol "a" -- which is either a terminal, or a non-terminal defined by G -- generates \
the values described by schema S.  
 <p>The first rule applies to all Avro primitive types:
 
@@ -223,14 +223,14 @@ Note that <code>T</code> is defined as <
 <table align=center>
   <tr><td align=center>
   <table cellspacing=0 cellpadding=0><tr><td>S=</td><td><code>{"type":"record", \
"name":</code>a<code>,</code></td></tr>
-         <tr><td></td><td><code>"fields":[{"name":</code>F<sub>1</sub><code>, \
"type":</code>S<sub>1</sub><code>},</code>...<code>, \
{"name":</code>F<sub>n</sub><code>, \
"type":</code>S<sub>n</sub><code>}]}</code></td></tr></table></td></tr>
+         <tr><td></td><td><code>"fields":[{"name":</code>F<sub>1</sub><code>, \
"type":</code>S<sub>1</sub><code>}, ..., {"name":</code>F<sub>n</sub><code>, \
"type":</code>S<sub>n</sub><code>}]}</code></td></tr></table></td></tr>
   <tr align=center><td>C(S<sub>j</sub>)=&lt;G<sub>j</sub>, f<sub>j</sub>&gt;</td></tr>
   <tr align=center><td><hr></td></tr>
   <tr align=center><td>C(S)=&lt;G<sub>1</sub> &#8746; ... &#8746; G<sub>n</sub> &#8746; \
{a::=f<sub>1</sub> f<sub>2</sub> ... f<sub>n</sub>}, a&gt;</td></tr>
 </tr>
 </table>
 
-<p>In this case, the set of output-productions consists of all the productions \
generated by the element-types of the record, plus a production that defines the \
non-terminal "n" to be the sequence of field-types.  We return "n"as the grammar \
symbol representing this record-schema.
+<p>In this case, the set of \
output-productions consists of all the productions generated by the element-types of \
the record, plus a production that defines the non-terminal "a" to be the sequence of \
field-types.  We return "a" as the grammar symbol representing this record-schema.  
 <p>Next, we define the rule for arrays:
 
@@ -241,7 +241,7 @@ Note that <code>T</code> is defined as <
   <tr align=center><td>C(S)=&lt;G<sub>e</sub> &#8746; {r ::= e r, r ::= &#949;, a \
::= <code>arraystart</code> r <code>arrayend</code>}, a&gt;</td></tr>
 </table>
 
-<p>For arrays, the set of output productions again contains all productions \
generated by the element-type.  In addition, we define <em>two</em> productions for \
"r", which represents the repetition of this element type.  The first production is \
the recursive case, which consists of the element-type followed by "r" all over \
again.  The next case is the base case, which is the empty production.  Having \
defined this repetition, we can then define "n" as this repetation bracketed by the \
terminal symbols <code>arraystart</code> and <code>arrayend</code>.
+<p>For arrays, \
the set of output productions again contains all productions generated by the \
element-type.  In addition, we define <em>two</em> productions for "r", which \
represents the repetition of this element type.  The first production is the \
recursive case, which consists of the element-type followed by "r" all over again.  \
The next case is the base case, which is the empty production.  Having defined this \
repetition, we can then define "a" as this repetition bracketed by the terminal \
symbols <code>arraystart</code> and <code>arrayend</code>.  
 <p>The rule for maps is almost identical to that for arrays:
 
@@ -257,7 +257,7 @@ Note that <code>T</code> is defined as <
 <p>The rule for unions:
 <table align=center>
 <tr align=center>
- <td>S=[S<sub>1</sub>, S<sub>2</sub><code>, ..., S<sub>n</sub>]</td>
+<td>S=<code>[S<sub>1</sub>, S<sub>2</sub>, ..., S<sub>n</sub>]</code></td>
 </tr>
 <tr align=center>
  <td>C(S<sub>j</sub>)=&lt;G<sub>j</sub>, b<sub>j</sub>&gt;</td>
@@ -266,30 +266,30 @@ Note that <code>T</code> is defined as <
 <tr align=center><td>C(S)=&lt;G<sub>1</sub> &#8746; ... &#8746; G<sub>n</sub> \
&#8746; {u::=1 b<sub>1</sub>, u::=2 b<sub>2</sub>, ..., u::=n b<sub>n</sub>, \
a::=<code>union</code> u}, a&gt;</td></tr>
 </table>
 
-<p>In this rule, we again accumulate productions (G<sub>j</sub>)generated by each of \
the sub-schemas contained by the top-level schemas.  If there are "k" branches, we \
define "k" different productions for the non-terminal symbol "u", one for each branch \
in the union.  These per-branch productions consist of the index of the branch (1 for \
the first branch, 2 for the second, and so-forth), followed by the symbol \
representing the schema of that branch.  With these productions for "u" defined, we \
can define "n" as simply the terminal-symbol <code>union</code> followed by this \
non-terminal "u".
+<p>In this rule, we again accumulate productions (G<sub>j</sub>) \
generated by each of the sub-schemas for each branch of the union.  If there are "k" \
branches, we define "k" different productions for the non-terminal symbol "u", one \
for each branch in the union.  These per-branch productions consist of the index of \
the branch (1 for the first branch, 2 for the second, and so forth), followed by the \
symbol representing the schema of that branch.  With these productions for "u" \
defined, we can define "a" as simply the terminal symbol <code>union</code> followed \
by this non-terminal "u".  
 
 <p>The rule for fixed size binaries:
 <table align=center>
 <tr align=center>
- <td>S=<code>{"type"="fixed", "name"=a, "size"=s}</code></td>
+ <td>S=<code>{"type":"fixed", "name":a, "size":s}</code></td>
 </tr>
 <tr align=center><td><hr></td></tr>
 <tr align=center><td>C(S)=&lt;{a::=<code>fixed</code> f, f::=&#949;}, \
a&gt;</td></tr>
 </table>
 
-<p>In this rule, we define a new non-termial f which has associated size of the \
fixed-binary.
+<p>In this rule, we define a new non-terminal f which has associated \
size of the fixed-binary.  
 <p>The rule for enums:
 <table align=center>
 <tr align=center>
- <td>S=<code>{"type"="enum", "name"=a, "symbols"=["s1", "s2", "s3", \
...]}</code></td>
+ <td>S=<code>{"type":"enum", "name":a, "symbols":["s1", "s2", \
"s3", ...]}</code></td>
 </tr>
 <tr align=center><td><hr></td></tr>
 <tr align=center><td>C(S)=&lt;{a::=<code>enum</code> e, e::=&#949;}, a&gt;</td></tr>
 </table>
 
-<p>In this rule, we define a new non-termial f which has associated range of values.
+<p>In this rule, we define a new non-terminal e which has associated range of \
values.  
 <h1>Resolution using action symbols</h1>
 
@@ -308,12 +308,12 @@ We want to use grammars to represent Avr
 
 <p> <li> <b>Enum actions:</b> when we have reader- and writer-schema has \
enumerations, enum actions are used to map the writer's numerical value to the \
reader's numeric value.  
-<p> <li> <b>Error actions:</b> in general, errors in schema-resolution can only be \
detected when data is being read.  For example, if the writer writers a \
<code>[long,&nbsp;string]</code> union, and the reader is expecting just a \
<code>long</code>, an error is only reported when the writer sends a string rather \
than a long.  Further, the Avro spec recommends that <em>all</em> errors be detected \
at reading-time, even if they could be detected earlier.  Error actions support the \
deferral of errors.
+<p> <li> <b>Error actions:</b> in general, errors in \
schema-resolution can only be detected when data is being read.  For example, if the \
writer writes a <code>[long,&nbsp;string]</code> union, and the reader is expecting \
just a <code>long</code>, an error is only reported when the writer sends a string \
rather than a long.  Further, the Avro spec recommends that <em>all</em> errors be \
detected at reading-time, even if they could be detected earlier.  Error actions \
support the deferral of errors.
 </ul>
 
 <p>These actions will become "action symbols" in our grammar.  Action symbols are \
symbols that cause our parser to perform special activities when they appear on the \
top of the parsing stack.  For example, when the skip-action makes it to the top of \
the stack, the parser will automatically skip the next value in the input stream.  \
(Again, Fischer and LeBlanc has a nice description of action symbols.)  
-<p>We're going to use induction rules to define a grammar.  This time, our induction \
rules will define a two-argument function "C(W,R)=&lt;G,a&gt;", which takes two \
schema, the writer's and reader's schemas respectively.  The results of this function \
the same as they where for the single-schema case.
+<p>We're going to use induction \
rules to define a grammar.  This time, our induction rules will define a two-argument \
function "C(W,R)=&lt;G,a&gt;", which takes two schema, the writer's and reader's \
schemas respectively.  The results of this function are the same as they were for the \
single-schema case.  
 <p>The first rule applies to all Avro primitive types:
 
@@ -337,7 +337,7 @@ We want to use grammars to represent Avr
 
 <p> When this parameterized action is encountered, the parser will resolve the \
writer's value into the reader's expected-type for that value.  In the parsing loop, \
when we encounter this symbol, we use the "r" parameter of this symbol to check that \
the reader is asking for the right type of value, and we use the "w" parameter to \
figure out how to parse the data in the input stream.  
-<p>On final possibility for pimitive types are incompatible types:
+<p>One final possibility for primitive types is that they are incompatible types:
 
 <table align=center>
   <tr align=center><td>The w,r pair does not fit the previous two rules, AND \
neither</td></tr>
@@ -347,21 +347,21 @@ We want to use grammars to represent Avr
   <tr align=center><td>C(w,r)=&lt;{}, ErrorAction&gt;</td></tr>
 </table>
 
-<p> When this parameterized action is encountered, the parser will throw an error.  \
Keep in mind that this symbol might be generated in the middle of a recursive call to \
"G."  For example, if the reader's schema is long, and the writers is \
[long,&nbsp;string], we'll generate an error symbol for the string-branch of the \
union; if this branch is occurred in actual input, an error will then be generated.
+<p> When this parameterized action is encountered, the parser will throw an error.  \
Keep in mind that this symbol might be generated in the middle of a recursive call to \
"G."  For example, if the reader's schema is long, and the writer's is \
[long,&nbsp;string], we'll generate an error symbol for the string-branch of the \
union; if this branch is occurred in actual input, an error will then be generated.  
-<p>The next rule deals with resolution fixed size binaries:
+<p>The next rule deals with resolution of fixed size binaries:
 
 <table align=center>
-  <tr align=center><td>w = {"type"="fixed", "name":"n1", "size"=s1}</td></tr>
-  <tr align=center><td>r = {"type"="fixed", "name":"n2", "size"=s2}</td></tr>
+  <tr align=center><td>w = <code>{"type":"fixed", "name":"n1", \
"size":s1}</code></td></tr>
+  <tr align=center><td>r = <code>{"type":"fixed", \
"name":"n2", "size":s2}</code></td></tr>
   <tr align=center><td>n1 != n2 or s1 != s2</td></tr>
   <tr><td><hr></td></tr>
   <tr align=center><td>C(w,r)=&lt;{}, ErrorAction&gt;</td></tr>
 </table>
 
 <table align=center>
-  <tr align=center><td>w = {"type"="fixed", "name":"n1", "size"=s1}</td></tr>
-  <tr align=center><td>r = {"type"="fixed", "name":"n2", "size"=s2}</td></tr>
+  <tr align=center><td>w = <code>{"type":"fixed", "name":"n1", \
"size":s1}</code></td></tr>
+  <tr align=center><td>r = <code>{"type":"fixed", \
"name":"n2", "size":s2}</code></td></tr>
   <tr align=center><td>n1 == n2 and s1 == s2</td></tr>
   <tr><td><hr></td></tr>
   <tr align=center><td>C(w,r)=&lt;{ a::=<code>fixed</code> f, f::=&#949;}, \
a&gt;</td></tr>
@@ -369,11 +369,11 @@ We want to use grammars to represent Avr
 
 If the names are identical and sizes are identical, then we match otherwise an error \
is generated.  
-<p>The next rule deals with resolution enums:
+<p>The next rule deals with resolution of enums:
 
 <table align=center>
-  <tr align=center><td>w = {"type"="enum", "symbols":[sw<sub>1</sub>, \
sw<sub>2</sub>, ..., sw<sub>m</sub>] }</td></tr>
-  <tr align=center><td>r = {"type"="enum", "symbols":[sr<sub>1</sub>, \
sr<sub>2</sub>, ..., sr<sub>n</sub>] }</td></tr>
+  <tr align=center><td>w = <code>{"type":"enum", "symbols":[sw<sub>1</sub>, \
sw<sub>2</sub>, ..., sw<sub>m</sub>] }</code></td></tr>
+  <tr align=center><td>r = <code>{"type":"enum", "symbols":[sr<sub>1</sub>, \
sr<sub>2</sub>, ..., sr<sub>n</sub>] }</code></td></tr>
   <tr align=center><td>f<sub>i</sub> = EnumAction(i, j) if sw<sub>i</sub> == \
sr<sub>j</sub></td></tr>
   <tr align=center><td>f<sub>i</sub> = ErrorAction if sw<sub>i</sub> does not match any \
sr<sub>j</sub></td></tr>
   <tr><td><hr></td></tr>
@@ -456,11 +456,11 @@ The symbol e has the set of actions f<su
 
 <p>The substance of this rule lies in the definion of the "f'<sub>j</sub>".  If the \
writer's field F<sub>j</sub> is not a member of the reader's schema, then a \
skip-action is generated, which will cause the parser to automatically skip over the \
field without the reader knowing.  (In this case, note that we use the \
<em>single</em>-argument version of "C", i.e., the version defined in the previous \
section!)  
-If the wrtier's field F<sub>j</sub> <em>is</em> a member f the reader's schema, then \
"f'<sub>j</sub>" is a two-symbol sequence: the first symbol is a (parameterized) \
field-action which is used to tell the reader which of it's own fields is coming \
next, followed by the symbol for parsing the value written by the writer.
+If the \
writer's field F<sub>j</sub> <em>is</em> a member f the reader's schema, then \
"f'<sub>j</sub>" is a two-symbol sequence: the first symbol is a (parameterized) \
field-action which is used to tell the reader which of its own fields is coming next, \
followed by the symbol for parsing the value written by the writer.  
 <p>The above rule for records works only when the reader and writer have the same \
name, and the reader's fields are subset of the writer's.  In other cases, an error \
is producted.  
-<p> The rule for arrays is straight forward:
+<p>The rule for arrays is straightforward:
 
 <table align=center>
 <tr align=center>
@@ -473,7 +473,7 @@ If the wrtier's field F<sub>j</sub> <em>
  <td>C(S<sub>w</sub>, S<sub>r</sub>)=&lt;G<sub>e</sub>,e&gt;
 </tr>
 <tr><td><hr></td></tr>
-<tr align=center><td>C(W,R)=&lt;G<sub>e</sub> U {r ::= e r, r ::= &#949;, a ::= \
<code>arraystart</code> r <code>arrayend}, a&gt;</td></tr>
+<tr align=center><td>C(W,R)=&lt;G<sub>e</sub> &#8746; {r ::= e r, r ::= &#949;, a ::= \
<code>arraystart</code> r <code>arrayend}, a&gt;</td></tr>
 </table>
 
 <p>Here the rule is largely the same as for the single-schema case, although the \
recursive use of G may result in productions that are very different.  The rule for \
maps changes in a similarly-small way, so we don't bother to detail that case in this \
document.
@@ -522,7 +522,7 @@ If the wrtier's field F<sub>j</sub> <em>
  <td>R=[R<sub>1</sub>, ..., R<sub>n</sub>]</td>
 </tr>
 <tr><td align=center>Branch "j" of R is the best match for W</td></tr>
-<tr><td align=center>C(W,R<sub>j</sub>)=&lt;&nbsp;G,w&gt;</td></tr>
+<tr><td align=center>C(W,R<sub>j</sub>)=&lt;G,w&gt;</td></tr>
 <tr><td><hr></td></tr>
 <tr><td align=center>C(W,R)=&lt;G, ReaderUnionAction(j,w)&gt;</td></tr>
 </table>
@@ -589,7 +589,7 @@ Here's a stylized version of the actual 
           else, T(X,t) is undefined, so throw an error;
 
       X = stack.pop();
-    }
+
     // We've left the loop, so X is a terminal symbol:
     case X:
       ResolvingTable(w,r):
@@ -611,5 +611,5 @@ Here's a stylized version of the actual 
       
     // Fall-through case:
     if (X == t) then return X
-    else throw an aerror 
+    else throw an error
 </pre>
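
For readers following the corrected induction rules in parsing.html, the single-schema function C(S)=<G,a> can be sketched in a few lines. This is a hypothetical Python illustration, not Avro's Java implementation. The name `compile_schema`, the tuple encoding of productions, and the generated symbol names (`n0`, `r0`, `u0`, ...) are assumptions made for the example. The primitive rule is elided from the diff, so the sketch assumes a primitive contributes no productions and is its own terminal. Maps and references to previously named types are left out.

```python
import itertools

# Assumption: a primitive type name doubles as its terminal symbol.
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "string", "bytes"}
EPSILON = ()  # the empty right-hand side


def compile_schema(schema, fresh=None):
    """Return (productions, symbol), mirroring C(S) = <G, a>.

    A production is a pair (non_terminal, right_hand_side_tuple).
    """
    if fresh is None:
        fresh = itertools.count()  # source of unique suffixes for new symbols

    if isinstance(schema, str):
        if schema not in PRIMITIVES:
            raise ValueError(f"named-type references are not handled: {schema!r}")
        return set(), schema

    if isinstance(schema, list):  # union: u ::= j b_j for each branch, a ::= union u
        i = next(fresh)
        a, u = f"n{i}", f"u{i}"
        productions = {(a, ("union", u))}
        for tag, branch in enumerate(schema, start=1):
            g, b = compile_schema(branch, fresh)
            productions |= g
            productions.add((u, (str(tag), b)))
        return productions, a

    kind = schema["type"]
    if kind == "record":  # a ::= f_1 f_2 ... f_n
        a = schema["name"]
        productions, symbols = set(), []
        for field in schema["fields"]:
            g, f = compile_schema(field["type"], fresh)
            productions |= g
            symbols.append(f)
        productions.add((a, tuple(symbols)))
        return productions, a

    if kind == "array":  # r ::= e r, r ::= epsilon, a ::= arraystart r arrayend
        g, e = compile_schema(schema["items"], fresh)
        i = next(fresh)
        a, r = f"n{i}", f"r{i}"
        return g | {(r, (e, r)), (r, EPSILON), (a, ("arraystart", r, "arrayend"))}, a

    if kind in ("fixed", "enum"):  # a ::= fixed f, f ::= epsilon (likewise for enum)
        a = schema["name"]
        aux = f"{kind[0]}{next(fresh)}"
        return {(a, (kind, aux)), (aux, EPSILON)}, a

    raise ValueError(f"unsupported schema type: {kind!r}")


# Usage: a record with a long field and an array-of-string field.
productions, start = compile_schema({
    "type": "record",
    "name": "Pair",
    "fields": [
        {"name": "x", "type": "long"},
        {"name": "tags", "type": {"type": "array", "items": "string"}},
    ],
})
for lhs, rhs in sorted(productions):
    print(lhs, "::=", " ".join(rhs) or "ε")
```

Representing each production as a hashable tuple lets the set union in the rules (G1 ∪ ... ∪ Gn) map directly onto Python's set operations.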

