
List:       xmlbeans-dev
Subject:    RE: [jira] Updated: (XMLBEANS-135) bad handling of embeded CDATA
From:       "Radu Preotiuc-Pietro" <radup () bea ! com>
Date:       2005-04-21 0:53:30
Message-ID: 4B2B4C417991364996F035E1EE39E2E102C66A08 () uskiex01 ! bea ! com

Right, my intention was for the first occurrence of "]]>" to mark the end of the first CDATA. Then, the following ">" would be outside all CDATA sections and it would go through fine.
So, starting with the following document (which I was able to parse):

<c><![CDATA[<b><![CDATA[<a>
bla<bla
</a>]]]]>><![CDATA[</b>]]></c>

after one parse, the content will be

<b><![CDATA[<a>
bla<bla
</a>]]></b>

and after the second one

<a>
bla<bla
</a>

So, in my opinion, this should work ok.
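
A minimal sketch of that round trip, assuming the JDK's built-in DOM parser rather than XmlBeans or piccolo. The class CdataRoundTrip and the helpers wrapInCdata and textOf are hypothetical names used only for illustration; they are not part of any XmlBeans API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class CdataRoundTrip {
    // Hypothetical helper: wrap text in a CDATA section. Whenever "]]>" occurs,
    // the section is closed right before the '>' so that the '>' ends up
    // outside any CDATA section, and a new section is opened after it.
    static String wrapInCdata(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]>><![CDATA[") + "]]>";
    }

    // Hypothetical helper: parse a document and return the text content of
    // its root element (CDATA sections and plain text are concatenated).
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("document did not parse", e);
        }
    }

    public static void main(String[] args) {
        String a = "<a>\nbla<bla\n</a>";
        String b = "<b>" + wrapInCdata(a) + "</b>";
        String c = "<c>" + wrapInCdata(b) + "</c>";

        System.out.println(c);                       // the starting document
        System.out.println(textOf(c).equals(b));     // first parse gives back b
        System.out.println(textOf(b).equals(a));     // second parse gives back a
    }
}
```

The only design choice here is where the outer section is split: between "]]" and ">", so no character of the original text has to be rewritten as an entity.
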
Radu

-----Original Message-----
From: Martin Hamel [mailto:martin@komunide.com] 
Sent: Saturday, April 16, 2005 6:54 AM
To: dev@xmlbeans.apache.org
Subject: Re: [jira] Updated: (XMLBEANS-135) bad handling of embeded CDATA

Because of the CDATA end string. The second CDATA start is not parsed as
anything special, whereas the first occurrence of ]]> marks the end of the
first CDATA. Parsers do not keep a counter of how many CDATA starts they have
encountered so far, so they cannot unstack the CDATA endings. The
specification addresses this clearly in the quote below:

"Within a CDATA section, only the CDEnd string is recognized as markup, so 
that left angle brackets and ampersands may occur in their literal form; they 
need not (and cannot) be escaped using "&lt;" and "&amp;". CDATA sections 
cannot nest."
http://www.w3.org/TR/2000/WD-xml-2e-20000814#dt-cdsection
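
The two points in that quote can be checked with a small sketch, assuming the JDK's built-in DOM parser (not the parsers discussed in this thread). The class CdataEntityDemo and the helper textOf are hypothetical names for illustration only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class CdataEntityDemo {
    // Hypothetical helper: parse a document and return the text content of
    // its root element.
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("document did not parse", e);
        }
    }

    public static void main(String[] args) {
        // Inside CDATA, "&gt;" is kept as five literal characters.
        System.out.println(textOf("<c><![CDATA[x ]]&gt; y]]></c>"));
        // Outside CDATA, the same reference is decoded to '>'.
        System.out.println(textOf("<c>x ]]&gt; y</c>"));
        // A second "<![CDATA[" inside a section is plain text, not a nested start.
        System.out.println(textOf("<c><![CDATA[<![CDATA[x]]></c>"));
    }
}
```
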

On 15 April 2005 22:47, Radu Preotiuc-Pietro wrote:
> Why is having a CDATA + ><!CDATA[]]> illegal?
>
> -----Original Message-----
> From: Martin Hamel [mailto:martin@komunide.com]
> Sent: Friday, April 15, 2005 3:36 PM
> To: dev@xmlbeans.apache.org
> Subject: Re: [jira] Updated: (XMLBEANS-135) bad handling of embeded CDATA
>
> Hi Radu,
>
> Your proposal is not valid XML:
> <c><![CDATA[<b><![CDATA[<a>
> blabla
> </a>
> </b>
> ]]>><!CDATA[]]>
> </c>
>
> The CDATA-enclosed "b" tag in there is seen as being closed at the first
> occurrence of ]]>. So when parsing, tag "c" contains a CDATA followed by
> "><!CDATA[]]>", which is illegal.
>
> This is why XmlBeans escapes the inner CDATA end, giving this file:
> <c><![CDATA[<b><![CDATA[<a>
> blabla
> </a>
> </b>
> ]]&gt;><!CDATA[]]>
> </c>
>
> But most parsers do not handle that at deserialization, since all characters
> inside a CDATA should stay as is. The spec says: "Within a CDATA
> section, only the CDEnd string is recognized as markup, so that left angle
> brackets and ampersands may occur in their literal form; they need not (and
> cannot) be escaped using "&lt;" and "&amp;". CDATA sections cannot nest."
> http://www.w3.org/TR/2000/WD-xml-2e-20000814#dt-cdsection
>
> That is why most parsers do not support it.
>
> On 15 April 2005 22:05, Radu Preotiuc-Pietro wrote:
> > My proposal no. 1 would also be automatic. Why wouldn't it work for
> > you? It was my intention for the outer CDATA to get closed so that the
> > '>' sign appears outside of a CDATA section.
> >
> > Radu
> >
> > -----Original Message-----
> > From: Martin Hamel [mailto:martin@komunide.com]
> > Sent: Friday, April 15, 2005 5:50 AM
> > To: dev@xmlbeans.apache.org
> > Subject: Re: [jira] Updated: (XMLBEANS-135) bad handling of embeded CDATA
> >
> > Hi Radu,
> >
> > The problem here is that the inner CDATA ends the outer CDATA. I get the
> > error: "The content of elements must consist of well-formed character
> > data or markup."
> >
> > That is why the inner cdata ending was escaped in the first place from
> > ]]> to ]]&gt;
> >
> > I agree that disabling CDATA through an option would do the job too. My
> > proposal was just more automatic, but a manual option could also do the job.
> >
> > On 15 April 2005 03:08, Radu Preotiuc-Pietro wrote:
> > > Let me chime in on this.
> > > In my opinion, we have two issues here:
> > > 1. The document that XmlBeans generates, while valid XML, does not
> > > correctly represent the infoset that was being saved, because of the
> > > encoded "&gt;" which is not equivalent to ">" when inside of a CDATA.
> > > This needs to be fixed either by not using CDATA at all in this
> > > situation (approach that I don't personally like because it has a
> > > performance impact even when there would be no issue, and also because
> > > the resulting text looks uglier), or by ending the CDATA right before
> > > the problematic '>' as in:
> > >
> > > <c><![CDATA[<b><![CDATA[<a>
> > > blabla
> > > </a>
> > > </b>
> > > ]]>><!CDATA[]]>
> > > </c>
> > >
> > > which seems good enough for piccolo to parse back (there is an empty
> > > CDATA there because the code doesn't know that there are no more
> > > characters left but that's ok). Would that work for you, Martin?
> > >
> > > 2. XmlBeans doesn't offer any mechanism to disable CDATA altogether,
> > > which is something that would be useful in other (albeit contrived)
> > > scenarios.
> > >
> > > Radu
> > >
> > > -----Original Message-----
> > > From: Martin Hamel [mailto:martin@komunide.com]
> > > Sent: Thursday, April 14, 2005 2:59 PM
> > > To: dev@xmlbeans.apache.org
> > > Subject: Re: [jira] Updated: (XMLBEANS-135) bad handling of embeded
> > > CDATA
> > >
> > > Hi Jacob,
> > >
> > > I agree that step 3 is OK and does not really have embedded CDATA. The
> > > problem is that most SAX parsers (all the ones we tried it with) can't
> > > work with that second CDATA escaping. They do not unescape the &amp; since
> > > it is in a CDATA, which kind of makes sense. And that is our problem.
> > > The expected output is syntactically correct and easy for any SAX
> > > parser to understand. That is why we need it.
> > >
> > > We'll get to it :-)
> > >
> > > On 14 April 2005 21:32, Jacob Danner wrote:
> > > > Hi Martin,
> > > > Thanks for the info, but I'm still not sure I understand.
> > > > The CData Element is used to escape blocks of text. What happens in
> > > > step 3 (I marked the steps below) and what you are expecting are not
> > > > syntactically equivalent. Step 3 says - I've got content that should
> > > > not be recognized as markup. The expected says - here's entitized
> > > > content that will look like an entitized CDATA section. The contents
> > > > of <c /> in step 3 is ==
> > > > <b><![CDATA[<a>blabla>]]&amp;</a></b> The contents of <c /> that you
> > > > are expecting is a string with entitized elements ==
> > > > &lt;b>&lt;![CDATA[&lt;a>blabla>]]&amp;&lt;/a>&lt;/b>]]> So the value
> > > > of the expected != the value of step3.
> > > >
> > > > In step 3 the content does not embed multiple CDATA sections, because
> > > > there exists only one CDEnd string (']]>'). All other content is not
> > > > recognized as markup. In the content you expect there is no CDATA.
> > > >
> > > > Am I still misunderstanding what you expect?
> > > > -Jacobd
> > > >
> > > > P.S.
> > > > In the example of the code snippet you posted in the bug, xmlText()
> > > > and toString() perform differently. I think in your snippet you may
> > > > want to try toString().
> > > >
> > > > -----Original Message-----
> > > > From: Martin Hamel [mailto:martin@komunide.com]
> > > > Sent: Thursday, April 14, 2005 11:43 AM
> > > > To: dev@xmlbeans.apache.org
> > > > Subject: Re: [jira] Updated: (XMLBEANS-135) bad handling of embeded
> > > > CDATA
> > > >
> > > > Hi Jacob,
> > > >
> > > > The problem is not understanding embedded CDATA. The problem is that
> > > > XMLBeans is producing it in some circumstances that we have here. ;-)
> > > > So I use XMLBeans to produce the data, and without the patch it
> > > > produces embedded CDATA. Here is what is happening with the current
> > > > XMLBeans:
> > > >
> > > > # Step 1 -
> > > > document A
> > > > <a>
> > > > blabla
> > > > </a>
> > > >
> > > > # Step 2 -
> > > > I put document A in document B
> > > > <b><![CDATA[<a>
> > > > blabla>
> > > > ]]>
> > > > </a>
> > > > </b>
> > > >
> > > > # Step 3 -
> > > > I put document B in document C and the problem arises.
> > > > <c><![CDATA[<b><![CDATA[<a>
> > > > blabla>
> > > > ]]&amp;
> > > > </a>
> > > > </b>
> > > > ]]>
> > > > </c>
> > > >
> > > > # Expected -
> > > > My patch produces the following, which is good because it has just one
> > > > CDATA (harder for a human to read, but readable by a machine since it
> > > > is good XML):
> > > > <c>&lt;b>&lt;![CDATA[&lt;a>
> > > > blabla>
> > > > ]]&amp;
> > > > &lt;/a>
> > > > &lt;/b>
> > > > ]]>
> > > > </c>
> > > >
> > > >
> > > >
> > > > Indeed, the spec says that I should not embed CDATA. That is just
> > > > what my fix is doing. If there is already a CDATA in the string, it
> > > > will not enclose it in another one but will escape the characters that
> > > > need to be escaped. This way the spec is respected. This is not what
> > > > XMLBeans is doing right now. It is blindly embedding CDATA and thus
> > > > violates the spec.
> > > >
> > > > Thanks :-)
> > > >
> > > > On 14 April 2005 18:26, Jacob Danner wrote:
> > > > > Hi Martin,
> > > > > By probably not in the V2 release, it means that the committers
> > > > > probably don't need to look into a fix for the v2 release. The TBD
> > > > > assignment means that it is not scheduled to be fixed in the
> > > > > current release.
> > > > >
> > > > > Now onto this issue, I'm unsure the fix would be wanted. If I
> > > > > understand the bug and the patch correctly, you want to be able to
> > > > > understand nested CDATA items. This is against section 2.7 of
> > > > > the XML spec, and breaking conformance with the XML spec or the XSD
> > > > > spec is usually frowned upon.
> > > > >
> > > > > What I would ask is why the XML you receive is invalid to begin
> > > > > with? Please let me know if I understand your issue correctly?
> > > > > Thanks,
> > > > > -Jacobd
> > > > >
> > > > > -----Original Message-----
> > > > > From: Martin Hamel [mailto:martin@komunide.com]
> > > > > Sent: Thursday, April 14, 2005 5:36 AM
> > > > > To: Jacob Danner (JIRA)
> > > > > Subject: Re: [jira] Updated: (XMLBEANS-135) bad handling of embeded
> > > > > CDATA
> > > > >
> > > > > hi,
> > > > >
> > > > > By "probably not in the v2 release", do you imply that it could be
> > > > > in a 1.0.5 release? We are currently maintaining our own version to
> > > > > have this fix. That is something we really do not want. Is there a
> > > > > 1.0.5 that will come out? If yes, is there a schedule?
> > > > >
> > > > > Cordially
> > > > >
> > > > > On 13 April 2005 22:56, you wrote:
> > > > > >      [
> > > > > > http://issues.apache.org/jira/browse/XMLBEANS-135?page=history ]
> > > > > >
> > > > > > Jacob Danner updated XMLBEANS-135:
> > > > > > ----------------------------------
> > > > > >
> > > > > >     Fix Version: TBD
> > > > > >
> > > > > > probably not in the v2 release
> > > > > >
> > > > > > > bad handling of embeded CDATA
> > > > > > > -----------------------------
> > > > > > >
> > > > > > >          Key: XMLBEANS-135
> > > > > > >          URL: http://issues.apache.org/jira/browse/XMLBEANS-135
> > > > > > >      Project: XMLBeans
> > > > > > >         Type: Bug
> > > > > > >     Versions: Version 1.0.3, Version 1.0.4, Version 2 Beta 1
> > > > > > > >  Environment: I ran into it on Windows with JDK 1.4.2.
> > > > > > >     Reporter: Martin Hamel
> > > > > > >      Fix For: TBD
> > > > > > >
> > > > > > >
> > > > > > > > I have a case of bad XML. It is an envelope document that
> > > > > > > > includes another document. The parser expects the enclosed
> > > > > > > > document to be in CDATA. The problem is that the second
> > > > > > > > document now includes a third document, which is also expected
> > > > > > > > to be in CDATA.
> > > > > > > > I create document A with an XMLBean. I put it as a text
> > > > > > > > element of document B after transforming document A to a
> > > > > > > > string with xmlText(). I then do the same with document B by
> > > > > > > > putting it in document C. Everything works well and
> > > > > > > > automatically, and it creates CDATA every time it needs to.
> > > > > > > > // fragment
> > > > > > > > XmlOptions options = new XmlOptions();
> > > > > > > > options.setSavePrettyPrint();
> > > > > > > > Field field = getAssessmentFields().addNewField();
> > > > > > > > field.setFieldName("AssessmentContent");
> > > > > > > > field.setFieldValue(answersDocument.xmlText(options));
> > > > > > > > ..
> > > > > > > > The problem is that on the second escaping the CDATA end
> > > > > > > > (]]>) is escaped to "]]&gt;". The SAX parser that reads all this
> > > > > > > > (Xalan) just can't handle it. Also, the specification says that
> > > > > > > > there should not be any CDATA containing a CDATA. Here is the
> > > > > > > > modification I made for embedded CDATA. Do you think it would
> > > > > > > > be worthy of being included? Here is the entitizeContent method
> > > > > > > > in Saver.java:
> > > > > > > > Pattern cdataPattern = Pattern.compile("CDATA");
> > > > > > >         private void entitizeContent ( )
> > > > > > >         {
> > > > > > >             if (_lastEmitCch == 0)
> > > > > > >                 return;
> > > > > > >             int i = _lastEmitIn;
> > > > > > >             final int n = _buf.length;
> > > > > > >             boolean hasOutOfRange = false;
> > > > > > >
> > > > > > >             int count = 0;
> > > > > > >             for ( int cch = _lastEmitCch ; cch > 0 ; cch-- )
> > > > > > >             {
> > > > > > >                 char ch = _buf[ i ];
> > > > > > >                 if (ch == '<' || ch == '&')
> > > > > > >                     count++;
> > > > > > >                 else if (isBadChar( ch ))
> > > > > > >                     hasOutOfRange = true;
> > > > > > >                 if (++i == n)
> > > > > > >                     i = 0;
> > > > > > >             }
> > > > > > >             if (count == 0 && !hasOutOfRange)
> > > > > > >                 return;
> > > > > > >             i = _lastEmitIn;
> > > > > > >             //
> > > > > > >             // Heuristic for knowing when to save out stuff as a CDATA.
> > > > > > >             //
> > > > > > >
> > > > > > >             // We'll check if we have a CDATA in the buffer.
> > > > > > >             // If we do, we won't nest another one.
> > > > > > >             CharBuffer charBuffer = CharBuffer.wrap(_buf);
> > > > > > >             boolean hasCDATA = cdataPattern.matcher(charBuffer).find();
> > > > > > >             if (_lastEmitCch > 32 && count > 5 &&
> > > > > > >                     count * 100 / _lastEmitCch > 1 && !hasCDATA)
> > > > > > >             {
> > > > > > >                 boolean lastWasBracket = _buf[ i ] == ']';
> > > > > > >                 i = replace( i, "<![CDATA[" + _buf[ i ] );
> > > > > > >                 boolean secondToLastWasBracket = lastWasBracket;
> > > > > > >                 lastWasBracket = _buf[ i ] == ']';
> > > > > > >                 if (++i == _buf.length)
> > > > > > >                     i = 0;
> > > > > > >                 for ( int cch = _lastEmitCch ; cch > 0 ; cch-- )
> > > > > > >                 {
> > > > > > >                     char ch = _buf[ i ];
> > > > > > >                     if (ch == '>' && secondToLastWasBracket && lastWasBracket)
> > > > > > >                         i = replace( i, "&gt;" );
> > > > > > >                     else if (isBadChar( ch ))
> > > > > > >                         i = replace( i, "?" );
> > > > > > >                     else
> > > > > > >                         i++;
> > > > > > >                     secondToLastWasBracket = lastWasBracket;
> > > > > > >                     lastWasBracket = ch == ']';
> > > > > > >                     if (i == _buf.length)
> > > > > > >                         i = 0;
> > > > > > >                 }
> > > > > > >                 emit( "]]>" );
> > > > > > >             }
> > > > > > >             else
> > > > > > >             {
> > > > > > >                 for ( int cch = _lastEmitCch ; cch > 0 ; cch-- )
> > > > > > >                 {
> > > > > > >                     char ch = _buf[ i ];
> > > > > > >                     if (ch == '<')
> > > > > > >                         i = replace( i, "&lt;" );
> > > > > > >                     else if (hasCDATA && ch == '>')
> > > > > > >                         i = replace(i, "&gt;");
> > > > > > >                     else if (ch == '&')
> > > > > > >                         i = replace( i, "&amp;" );
> > > > > > >                     else if (isBadChar( ch ))
> > > > > > >                         i = replace( i, "?" );
> > > > > > >                     else
> > > > > > >                         i++;
> > > > > > >                     if (i == _buf.length)
> > > > > > >                         i = 0;
> > > > > > >                 }
> > > > > > >             }
> > > > > > >         }
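
For readers who want to try the decision logic of the quoted patch outside of Saver.java, here is a rough sketch on a plain String. It is an approximation under stated assumptions: the class EntitizeSketch and the method entitize are hypothetical names, the circular buffer is replaced by a String, and the isBadChar handling is left out.

```java
final class EntitizeSketch {
    // Hypothetical stand-in for the patched entitizeContent(): same thresholds
    // and the same "already contains CDATA" test, but on a String.
    static String entitize(String text) {
        int count = 0; // characters that would need escaping as entities
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch == '<' || ch == '&')
                count++;
        }
        if (count == 0)
            return text;

        boolean hasCdata = text.contains("CDATA");
        int len = text.length();

        if (len > 32 && count > 5 && count * 100 / len > 1 && !hasCdata) {
            // Long text with enough markup characters and no CDATA inside:
            // wrap it in a single CDATA section. As in the quoted code, a
            // literal "]]>" is written as "]]&gt;".
            return "<![CDATA[" + text.replace("]]>", "]]&gt;") + "]]>";
        }

        // Otherwise escape with entities; '>' is escaped only when the text
        // already contains "CDATA", so an inner "]]>" cannot end anything.
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch == '<')
                out.append("&lt;");
            else if (hasCdata && ch == '>')
                out.append("&gt;");
            else if (ch == '&')
                out.append("&amp;");
            else
                out.append(ch);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(entitize("<a><b><c>some text here that is long enough</c></b></a>"));
        System.out.println(entitize("<b><![CDATA[<a>x</a>]]></b>"));
    }
}
```
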

-- 
Martin Hamel
phone: (418)261-2222
PGP key: 0xA6D61023


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@xmlbeans.apache.org
For additional commands, e-mail: dev-help@xmlbeans.apache.org

