
List:       postgresql-general
Subject:    Re: [HACKERS] Html parsing and inline elements
From:       Oleg Bartunov <obartunov () gmail ! com>
Date:       2016-04-30 21:43:37
Message-ID: CAF4Au4zH2Rc4=-nj92y=NVcF_QTN1PCHE0Rj8_mh9zA0NH9JoA () mail ! gmail ! com

On Wed, Apr 13, 2016 at 6:57 PM, Marcelo Zabani <mzabani@gmail.com> wrote:

> Hi, Tom,
>
> You're right, I don't think one can argue that the default parser should
> know HTML.
> How about your suggestion of there being an HTML parser, is it feasible? I
> ask this because I think that a lot of people store HTML documents these
> days, and although there probably aren't lots of HTML with words written
> along multiple inline elements, it would certainly be nice to have a proper
> parser for these use cases.
>
> What do you think?
>

I think a separate parser could be useful. The problem is how to fully
utilize it to facilitate ranking: for example, words in the title could be
considered more important than words in the body. Currently, the setweight()
function provides this separately from the parser.
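
A minimal sketch of how weights are attached today, independently of the
parser. It assumes the caller has already split the document into a title
part and a body part; the sample strings are only illustrative:

```sql
-- Assumption: title and body were extracted by the application beforehand;
-- the parser itself knows nothing about which part is which.
SELECT ts_rank(doc, to_tsquery('english', 'parser')) AS title_hit,
       ts_rank(doc, to_tsquery('english', 'text'))   AS body_hit
FROM (
    SELECT setweight(to_tsvector('english', 'parser'), 'A') ||  -- title words get weight A
           setweight(to_tsvector('english', 'text'),   'D')     -- body words get weight D
           AS doc
) AS d;
```

With the default ts_rank weights, the match on the A-weighted word should
score higher than the match on the D-weighted one.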

The parser outputs tokid and token:

select * from ts_parse('default','<title>parser</title><body>text</body>');
 tokid |  token
-------+----------
    13 | <title>
     1 | parser
    13 | </title>
    13 | <body>
     1 | text
    13 | </body>
(6 rows)

If we changed the parser to also output a rank flag, we could use it to
assign different weights.



>
> On Wed, Apr 13, 2016 at 11:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> Marcelo Zabani <mzabani@gmail.com> writes:
>> > I was here wondering whether HTML parsing should separate tokens that
>> are
>> > not separated by spaces in the original text, but are separated by an
>> > inline element. Let me show you an example:
>>
>> > *SELECT to_tsvector('english', 'Hello<p>neighbor</p>, you are
>> > <strong>n</strong>i<em>ce</em>')*
>> > *Results:** "'ce':7 'hello':1 'n':5 'neighbor':2"*
>>
>> > "Hello" and "neighbor" should really be separated, because *<p>* is a
>> block
>> > element, but "nice" should be a single word there, since there is no
>> visual
>> > separation when rendered (*<em>* and *<strong>* are inline elements).
>>
>> I can't imagine that we want to_tsvector to know that much about HTML.
>> It doesn't, really, even have license to assume that its input *is*
>> HTML.  So even if you see things that look like <foo> and </foo> in the
>> string, it could easily be XML or SGML or some other SGML-like markup
>> format with different semantics for the markup keywords.
>>
>> Perhaps it'd be sane to do something like this as long as the
>> HTML-specific behavior was broken out into a separate function.
>> (Or maybe it could be done within to_tsvector as a separate parser
>> or separate dictionary?)  But I don't think it should be part of
>> the default behavior.
>>
>>                         regards, tom lane
>>
>
>



