

List:       lucene-dev
Subject:    Re: UTF-8 well-formedness for SimpleTextCodec
From:       Adrien Grand <jpountz () gmail ! com>
Date:       2023-12-19 16:50:36
Message-ID: CAPsWd+M+XZ07UiUrdh-jE5Dx3ntF=8TvaJFr2pbx-k3ODf1SFg () mail ! gmail ! com

Hey Michael,

Writing well-formed UTF-8 with the SimpleText format sounds desirable indeed,
and your PR makes sense. I don't think we would want to be heroic about
it, but if we can serialize the same information easily, then it sounds
like something we should do. Thanks for improving SimpleTextCodec!
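
To illustrate the problem and the kind of fix discussed here, below is a minimal, standalone sketch. It is not the code from the PR: the class and method names (SegmentIdText, toText, fromText, isWellFormedUtf8) are hypothetical, and using Arrays.toString as "the text representation of a byte array" is an assumption about one easy way to serialize the same information. Only standard JDK calls are used.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: shows why raw ID bytes can break
// UTF-8 and how a textual form of the same bytes avoids that.
final class SegmentIdText {

  /** Returns true if the bytes decode as strict UTF-8 with no malformed input. */
  static boolean isWellFormedUtf8(byte[] bytes) {
    try {
      StandardCharsets.UTF_8
          .newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  /** Renders the bytes as ASCII text such as "[1, -2]" (assumed representation). */
  static String toText(byte[] id) {
    return Arrays.toString(id);
  }

  /** Parses the output of toText back into the original bytes. */
  static byte[] fromText(String text) {
    String inner = text.substring(1, text.length() - 1).trim();
    if (inner.isEmpty()) {
      return new byte[0];
    }
    String[] parts = inner.split(",");
    byte[] out = new byte[parts.length];
    for (int i = 0; i < parts.length; i++) {
      out[i] = Byte.parseByte(parts[i].trim());
    }
    return out;
  }

  public static void main(String[] args) {
    // Arbitrary example bytes: 0xC3 followed by 0x28 is an invalid UTF-8
    // sequence, and 0xFF never appears in well-formed UTF-8.
    byte[] id = {(byte) 0xC3, 0x28, (byte) 0xFF, 0x10};

    System.out.println("raw bytes well-formed: " + isWellFormedUtf8(id));

    String text = toText(id);
    byte[] encoded = text.getBytes(StandardCharsets.UTF_8);
    System.out.println("text form: " + text);
    System.out.println("text form well-formed: " + isWellFormedUtf8(encoded));
    System.out.println("round trip ok: " + Arrays.equals(id, fromText(text)));
  }
}
```

The text form costs a few more bytes per ID, but it stays readable in any editor and still round-trips to the original bytes, which seems in line with the "not heroic, but easy" bar above.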

On Mon, Dec 18, 2023 at 6:01 PM Michael Froh <msfroh@gmail.com> wrote:

> Hi there,
>
> I was recently writing up a short Lucene file format tutorial (
> https://msfroh.github.io/lucene-university/docs/DirectoryFileContents.html),
> using SimpleTextCodec for educational purposes.
>
> I found that SimpleTextSegmentInfo tries to output the segment ID as raw
> bytes, which will often result in malformed UTF-8 output. I wrote a little
> fix to output it as the text representation of a byte array (
> https://github.com/apache/lucene/pull/12897). I noticed that it's a
> similar sort of thing with binary doc values (where the bytes get written
> directly).
>
> Is there any general desire for SimpleTextCodec to output well-formed
> UTF-8 where possible?
>
> Thanks,
> Froh
>


-- 
Adrien
