List:       postgresql-general
Subject:    Re: [GENERAL] Fastest Index/Algorithm to find similar sentences
From:       Beena Emerson <memissemerson () gmail ! com>
Date:       2013-07-31 14:08:22
Message-ID: CAOG9ApEaGjHaFtm2XrVGYc6WbYFva3JzLxa6ANSFFyW_-mFkQA () mail ! gmail ! com

I am sorry, I just re-read your mail and realized you have already tried
pg_trgm.



On Wed, Jul 31, 2013 at 7:23 PM, Beena Emerson <memissemerson@gmail.com> wrote:

> On Sat, Jul 27, 2013 at 10:34 PM, Janek Sendrowski <janek12@web.de> wrote:
>
>> Hi Sergey Konoplev,
>>
>> If I'm searching for a sentence like "The tiger is the largest cat
>> species" for example.
>>
>> I can only find the sentences, which include the words "tiger, largest,
>> cat, species", but I also like to have the sentences with only three or
>> even two of these words.
>>
>> Janek
>>
>>
>> --
>> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-general
>>
>
> Hi,
>
> You may use similarity functions of pg_trgm.
>
> Example:
> =# \d+ test
>                         Table "public.test"
>  Column | Type | Modifiers | Storage  | Stats target | Description
> --------+------+-----------+----------+--------------+-------------
>  col    | text |           | extended |              |
> Indexes:
>     "test_idx" gin (col gin_trgm_ops)
> Has OIDs: no
>
> =# SELECT * FROM test;
>                    col
> -----------------------------------------
>  The tiger is the largest cat species
>  The cheetah is the fastest  cat species
>  The peacock is the largest bird species
> (3 rows)
>
> =# SELECT show_limit();
>  show_limit
> ------------
>         0.3
> (1 row)
>
> =# SELECT col, similarity(col, 'The tiger is the largest cat species') AS sml
>   FROM test WHERE col % 'The tiger is the largest cat species'
>   ORDER BY sml DESC, col;
>                    col                   |   sml
> -----------------------------------------+----------
>  The tiger is the largest cat species    |        1
>  The peacock is the largest bird species | 0.511111
>  The cheetah is the fastest  cat species | 0.466667
> (3 rows)
>
> =# SELECT set_limit(0.5);
>  set_limit
> -----------
>        0.5
> (1 row)
>
> =# SELECT col, similarity(col, 'The tiger is the largest cat species') AS sml
>   FROM test WHERE col % 'The tiger is the largest cat species'
>   ORDER BY sml DESC, col;
>                    col                   |   sml
> -----------------------------------------+----------
>  The tiger is the largest cat species    |        1
>  The peacock is the largest bird species | 0.511111
> (2 rows)
>
> =# SELECT set_limit(0.9);
>  set_limit
> -----------
>        0.9
> (1 row)
>
> =# SELECT col, similarity(col, 'The tiger is the largest cat species') AS sml
>   FROM test WHERE col % 'The tiger is the largest cat species'
>   ORDER BY sml DESC, col;
>                  col                  | sml
> --------------------------------------+-----
>  The tiger is the largest cat species |   1
> (1 row)
>
>
> The higher you set the limit, the closer a row has to be to the search
> string before the % operator returns it. At 0.9 only the exact sentence
> is left.
>
>
> --
> Beena Emerson
>
>
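
For anyone wondering where the sml values in the quoted example come from,
here is a rough Python sketch of the trigram comparison. It assumes the rules
described in the pg_trgm documentation: text is lowercased, split into words
on non-alphanumeric characters, each word is padded with two leading spaces
and one trailing space, and the score is the number of shared trigrams divided
by the number of distinct trigrams in both strings. The function names are
made up for illustration, only ASCII letters and digits are handled, and the
boundary behaviour at exactly the limit is an assumption. The real extension
is C code and may differ in edge cases.

```python
import re


def trigrams(text):
    """Approximate the set of trigrams pg_trgm would extract (ASCII only)."""
    grams = set()
    # Lowercase, then treat every run of non-alphanumerics as a separator.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def similar_rows(rows, query, limit=0.3):
    """Mimic 'WHERE col % query ORDER BY sml DESC, col' for a list of strings.

    Rows at or above the limit are kept (an assumption about the boundary).
    """
    scored = [(row, similarity(row, query)) for row in rows]
    kept = [(row, sml) for row, sml in scored if sml >= limit]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    rows = [
        "The tiger is the largest cat species",
        "The cheetah is the fastest  cat species",
        "The peacock is the largest bird species",
    ]
    query = "The tiger is the largest cat species"
    for row, sml in similar_rows(rows, query, limit=0.3):
        print(f"{row:<40} {sml:.6f}")
```

With the three rows from the quoted table this gives 1.000000, 0.511111
(23 shared trigrams out of 45) and 0.466667 (21 out of 45), which matches the
psql output. Raising the limit argument to 0.5 or 0.9 drops rows the same way
set_limit() does in the quoted session.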


-- 
Beena Emerson
