[prev in list] [next in list] [prev in thread] [next in thread] 

List:       postgis-users
Subject:    Re: [postgis-users] Fuzzy Address Matching - PostgreSql equivalent to FuzzyStringComparer using Pyth
From:       Shaozhong SHI <shishaozhong () gmail ! com>
Date:       2020-05-12 2:56:32
Message-ID: CA+i5JwbT4MJA4RDKkqJgRskcC5Ks7_70ZGkEneaGfSv7QD27Bw () mail ! gmail ! com
[Download RAW message or body]

[Attachment #2 (multipart/alternative)]


Sorry.  It is a very interesting topic, so far as our field is concerned.

Regards,

Shao

On Tue, 12 May 2020 at 03:55, Shaozhong SHI <shishaozhong@gmail.com> wrote:

> Dear All Friends,
>
> Thank you very much for all the information.
>
> I will spend time to digest and think over all these.
>
> It is certainly a very interest topic.
>
> Regards,
>
> Shao
>
> On Mon, 11 May 2020 at 18:34, James Klassen <klassen.js@gmail.com> wrote:
>
>> There are lots of classes of tricky cases:
>>
>> For example:
>>
>>     1234 County Road E, Hudson
>>
>> Is actually "County Road E", not "County Road East".  The county roads
>> there are indexed by letters.  There is also a County Road F, and a County
>> Road N, and a County Road FF, etc.)  Last I checked, even Google, with all
>> their data, gets this one wrong.
>>
>> In general, address parsing is a hard problem and there are always (at
>> least in the US) going to be cases that are ambiguous to parse.  The people
>> who have been answering your questions on this list have been working on
>> this problem for as long as I can remember.  And they have spent many years
>> developing libraries that make dealing with many of the corner cases of
>> address parsing easier.  They started these libraries because the quick and
>> simple solutions you seem to be hoping for failed to deal with the
>> realities of addressing.  There is not a perfect solution, but what they
>> are suggesting are the best solutions available.
>>
>> If you want to start over and reinvent a simpler solution, go ahead.  But
>> be advised that the people answering you have been down that path.  They
>> know what the next problem will be, and the problem after that, and they
>> know that at the end of the chain of problems you will discover you need
>> something like the tools and libraries they are suggesting.
>>
>> On Mon, May 11, 2020 at 12:31 PM Paul Ramsey <pramsey@cleverelephant.ca>
>> wrote:
>>
>>> There's a huge range of potential functionality/features that can reside
>>> under the banner "fuzzy address matching".
>>>
>>> At the simplest, you're trying to catch common lexicographic
>>> differences, like off-by-one addresses or alternate spellings. This is the
>>> realm of trigrams and levenstein distances.
>>> Then you start dealing with different forms of abbreviations (rd, road,
>>> r) and formats (unit 4, #4, apt 4, 4). This is the realm of data-less
>>> standardizers.
>>> Then you start dealing with larger forms of aliasing. Standard city and
>>> state names, recognizing major components of addresses. This is the real of
>>> dictionary-backed standardizers.
>>> Then you start dealing with all forms of aliasing. This is actually
>>> getting down into geocoding, and address standardizers backed by complete
>>> address and road databases.
>>>
>>> What people mean when they say "can you do address standardization" can
>>> vary massively. It can also be frustrating because the performance of
>>> something like Google's geocoder algorithm, backed by the largest, and most
>>> up-to-data geographic database in the world gets casually compared to
>>> simple format and dictionary backed standardizer by folks with no
>>> understanding of the complexity or amount of data that lives under the
>>> covers of this "simple" task. I think people are far more understanding of
>>> something like machine translation and "get" that it's a really hard
>>> problem, because learning a new language is hard. But geocoding is "easy",
>>> anyone can look at an address and then look that address up in a map book
>>> (ha ha, well anyone over 40 at least).
>>>
>>> P.
>>>
>>>
>>> > On May 11, 2020, at 10:15 AM, Imre Samu <pella.samu@gmail.com> wrote:
>>> >
>>> > > ... Fuzzy Address Matching? ...
>>> > > For instance, South Great Avenue, A City, Planet Earth may be put
>>> down as the following:
>>> >
>>> > sometimes .. the same  country/city/building   -  has a different
>>> names ..
>>> >
>>> > "A City"  example:
>>> > https://en.wikipedia.org/wiki/Vienna
>>> >
>>> > Vienna:
>>> > "Beč (Croatian, Serbian, older Bulgarian), Beç (older Turkish)*, Bech
>>> or Vidnya (Romani), Bécs (Hungarian)*, Bin / Pin - 빈 (Korean), Dunaj
>>> (Slovene)*, Fienna (Welsh), Vedunia (Celtic), Vena - Вена (Russian), Vídeň
>>> (Czech)*, Viden' / Videň (Ukrainian)*, Viedeň (Slovak), Viên (Vietnamese),
>>> Viena / Vijena/ Виена (Belarusian, Bulgarian, Macedonian), Viena (Catalan*,
>>> Lithuanian, Portuguese*, Romanian*, Spanish*, Tagalog*), Vienna (Italian)*,
>>> Vienne (French)*, Viénni - Βιέννη (Greek), Vieno (Esperanto), Viin
>>> (Estonian), Vin - װין (Yiddish), Vín (Irish, Icelandic), Vina - וי ה
>>> (Hebrew), Vínarborg (Icelandic variant), Vindobona (Latin), Vīne
>>> (Latvian)*, Viyana (Turkish)*, Vjenë (Albanian), Vjenna (Maltese), Vyana
>>> (Azeri), Wean (local Viennese, Austrian and Bavarian dialects)*, Weiyena -
>>> 維也納 (Chinese)*, Wene (Afrikaans), Wenen (Dutch)*, Wiedeń (Polish)*, Wien
>>> (Danish*, Finnish*, German*, Norwegian*, Swedish*), Wīn - ウィーン (Japanese)*,
>>> Wina (Indonesian), فيينا (Arabic), وين (Persian)"
>>> > "
>>> > see:
>>> https://en.wikipedia.org/wiki/Names_of_European_cities_in_different_languages_(U%E2%80%93Z)#V
>>> >
>>> https://en.wikipedia.org/wiki/List_of_names_of_European_cities_in_different_languages
>>> >
>>> > or see the Nominatim names for Vienna:
>>> >
>>> https://nominatim.openstreetmap.org/details.php?osmtype=R&osmid=109166&class=boundary
>>> >
>>> >
>>> > And sometimes - the buildings has a different names ...
>>> >
>>> https://nominatim.openstreetmap.org/details.php?osmtype=R&osmid=11101&class=amenity
>>> > - Wien Rathaus (name)
>>> > -  атуша (name:be)
>>> > - Vídeňská radnice (name:cs)
>>> > - Rathaus (name:de)
>>> > - Vienna City Hall (name:en)
>>> > - 市庁舎 (name:ja)
>>> > -  атуша (name:ru)
>>> > - Віденська ратуша (name:uk)
>>> >
>>> >
>>> >
>>> > Best,
>>> >  Imre
>>> >
>>> >
>>> >
>>> >
>>> > Shaozhong SHI <shishaozhong@gmail.com> ezt írta (időpont: 2020. máj.
>>> 11., H, 18:24):
>>> > Hello,
>>> >
>>> > I got a few questions as follows:
>>> >
>>> > 1.  Which one is the best way for Fuzzy Address Matching?
>>> >
>>> > 2.  FME FuzzyStringComparer uses  Python difflib module.  Which one in
>>> Postgres is equivalent or similar to it?
>>> >
>>> > 3.  Often, addresses collected by different people may well be
>>> correct.  But, there may be typing errors, or addresses are composed not in
>>> a consistent manner.
>>> >
>>> > For instance, South Great Avenue, A City, Planet Earth may be put down
>>> as the following:
>>> >
>>> > S. Great Aveue, City A, Earth Planet
>>> > Great Avene South, A City, Earth Planet
>>> > Great Avenue S, A City, Planet Earth
>>> >
>>> > Surely, there would be solutions to deal with this problem.
>>> >
>>> > Can anyone enlighten me?
>>> >
>>> > Regards,
>>> >
>>> > Shao
>>> > _______________________________________________
>>> > postgis-users mailing list
>>> > postgis-users@lists.osgeo.org
>>> > https://lists.osgeo.org/mailman/listinfo/postgis-users
>>> > _______________________________________________
>>> > postgis-users mailing list
>>> > postgis-users@lists.osgeo.org
>>> > https://lists.osgeo.org/mailman/listinfo/postgis-users
>>>
>>> _______________________________________________
>>> postgis-users mailing list
>>> postgis-users@lists.osgeo.org
>>> https://lists.osgeo.org/mailman/listinfo/postgis-users
>>
>> _______________________________________________
>> postgis-users mailing list
>> postgis-users@lists.osgeo.org
>> https://lists.osgeo.org/mailman/listinfo/postgis-users
>
>

[Attachment #5 (text/html)]

<div dir="ltr">Sorry.   It is a very interesting topic, so far as our field is \
concerned.<div><br></div><div>Regards,</div><div><br></div><div>Shao</div></div><br><div \
class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, 12 May 2020 at 03:55, \
Shaozhong SHI &lt;<a \
href="mailto:shishaozhong@gmail.com">shishaozhong@gmail.com</a>&gt; \
wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px \
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Dear \
All Friends,<div><br></div><div>Thank you very much for all the \
information.</div><div><br></div><div>I will spend time to digest and think over all \
these.</div><div><br></div><div>It is certainly a very interest \
topic.</div><div><br></div><div>Regards,</div><div><br></div><div>Shao</div></div><br><div \
class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, 11 May 2020 at 18:34, \
James Klassen &lt;<a href="mailto:klassen.js@gmail.com" \
target="_blank">klassen.js@gmail.com</a>&gt; wrote:<br></div><blockquote \
class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid \
rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>There are lots of classes of \
tricky cases:</div><div><br></div><div>For example:</div><div><br></div><div>      \
1234 County Road E, Hudson  </div><div><br></div><div>Is actually &quot;County Road \
E&quot;, not &quot;County Road East&quot;.   The county roads there are indexed by \
letters.   There is also a County Road F, and a County Road N, and a County Road FF, \
etc.)   Last I checked, even Google, with all their data, gets this one \
wrong.</div><div><br></div><div>In general, address  parsing is a hard problem and \
there are always (at least in the US) going to be cases that are ambiguous to parse.  \
The people who have been answering your questions on this list have been working on \
this problem for as long as I can remember.   And they have spent many years \
developing libraries that make dealing with many of the corner cases of address \
parsing easier.   They started these  libraries because the quick and simple \
solutions you seem  to be hoping  for failed to deal with the realities of \
addressing.   There  is not a perfect solution, but what they are suggesting are the \
best solutions available.</div><div><br></div><div>If you want to start over and \
reinvent a simpler solution, go ahead.   But be advised that the people answering you \
have been down that path.   They know what the next problem will be, and the problem \
after that, and they know that at the end of the chain of problems you will discover \
you need something like the tools and libraries they are \
suggesting.</div></div><br><div class="gmail_quote"><div dir="ltr" \
class="gmail_attr">On Mon, May 11, 2020 at 12:31 PM Paul Ramsey &lt;<a \
href="mailto:pramsey@cleverelephant.ca" \
target="_blank">pramsey@cleverelephant.ca</a>&gt; wrote:<br></div><blockquote \
class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid \
rgb(204,204,204);padding-left:1ex">There&#39;s a huge range of potential \
functionality/features that can reside under the banner &quot;fuzzy address \
matching&quot;. <br> <br>
At the simplest, you&#39;re trying to catch common lexicographic differences, like \
off-by-one addresses or alternate spellings. This is the realm of trigrams and \
levenstein distances.<br> Then you start dealing with different forms of \
abbreviations (rd, road, r) and formats (unit 4, #4, apt 4, 4). This is the realm of \
data-less standardizers.<br> Then you start dealing with larger forms of aliasing. \
Standard city and state names, recognizing major components of addresses. This is the \
real of dictionary-backed standardizers.<br> Then you start dealing with all forms of \
aliasing. This is actually getting down into geocoding, and address standardizers \
backed by complete address and road databases.<br> <br>
What people mean when they say &quot;can you do address standardization&quot; can \
vary massively. It can also be frustrating because the performance of something like \
Google&#39;s geocoder algorithm, backed by the largest, and most up-to-data \
geographic database in the world gets casually compared to simple format and \
dictionary backed standardizer by folks with no understanding of the complexity or \
amount of data that lives under the covers of this &quot;simple&quot; task. I think \
people are far more understanding of something like machine translation and \
&quot;get&quot; that it&#39;s a really hard problem, because learning a new language \
is hard. But geocoding is &quot;easy&quot;, anyone can look at an address and then \
look that address up in a map book (ha ha, well anyone over 40 at least).<br> <br>
P.<br>
<br>
<br>
&gt; On May 11, 2020, at 10:15 AM, Imre Samu &lt;<a \
href="mailto:pella.samu@gmail.com" target="_blank">pella.samu@gmail.com</a>&gt; \
wrote:<br> &gt; <br>
&gt; &gt; ... Fuzzy Address Matching? ... <br>
&gt; &gt; For instance, South Great Avenue, A City, Planet Earth may be put down as \
the following:<br> &gt; <br>
&gt; sometimes .. the same   country/city/building     -   has a different names .. \
<br> &gt; <br>
&gt; &quot;A City&quot;   example:<br>
&gt; <a href="https://en.wikipedia.org/wiki/Vienna" rel="noreferrer" \
target="_blank">https://en.wikipedia.org/wiki/Vienna</a><br> &gt; <br>
&gt; Vienna:<br>
&gt; &quot;Beč (Croatian, Serbian, older Bulgarian), Beç (older Turkish)*, Bech or \
Vidnya (Romani), Bécs (Hungarian)*, Bin / Pin - 빈 (Korean), Dunaj (Slovene)*, \
Fienna (Welsh), Vedunia (Celtic), Vena - Вена (Russian), Vídeň (Czech)*, \
Viden&#39; / Videň (Ukrainian)*, Viedeň (Slovak), Viên (Vietnamese), Viena / \
Vijena/ Виена (Belarusian, Bulgarian, Macedonian), Viena (Catalan*, Lithuanian, \
Portuguese*, Romanian*, Spanish*, Tagalog*), Vienna (Italian)*, Vienne (French)*, \
Viénni - Βιέννη (Greek), Vieno (Esperanto), Viin (Estonian), Vin - װין \
(Yiddish), Vín (Irish, Icelandic), Vina - וי ה (Hebrew), Vínarborg (Icelandic \
variant), Vindobona (Latin), Vīne (Latvian)*, Viyana (Turkish)*, Vjenë (Albanian), \
Vjenna (Maltese), Vyana (Azeri), Wean (local Viennese, Austrian and Bavarian \
dialects)*, Weiyena - 維也納 (Chinese)*, Wene (Afrikaans), Wenen (Dutch)*, Wiedeń \
(Polish)*, Wien (Danish*, Finnish*, German*, Norwegian*, Swedish*), Wīn - \
ウィーン (Japanese)*, Wina (Indonesian), فيينا (Arabic), وين \
(Persian)&quot;<br> &gt; &quot;<br>
&gt; see: <a href="https://en.wikipedia.org/wiki/Names_of_European_cities_in_different_languages_(U%E2%80%93Z)#V" \
rel="noreferrer" target="_blank">https://en.wikipedia.org/wiki/Names_of_European_cities_in_different_languages_(U%E2%80%93Z)#V</a><br>
 &gt; <a href="https://en.wikipedia.org/wiki/List_of_names_of_European_cities_in_different_languages" \
rel="noreferrer" target="_blank">https://en.wikipedia.org/wiki/List_of_names_of_European_cities_in_different_languages</a><br>
 &gt; <br>
&gt; or see the Nominatim names for Vienna:<br>
&gt; <a href="https://nominatim.openstreetmap.org/details.php?osmtype=R&amp;osmid=109166&amp;class=boundary" \
rel="noreferrer" target="_blank">https://nominatim.openstreetmap.org/details.php?osmtype=R&amp;osmid=109166&amp;class=boundary</a><br>
 &gt; <br>
&gt; <br>
&gt; And sometimes - the buildings has a different names ...<br>
&gt; <a href="https://nominatim.openstreetmap.org/details.php?osmtype=R&amp;osmid=11101&amp;class=amenity" \
rel="noreferrer" target="_blank">https://nominatim.openstreetmap.org/details.php?osmtype=R&amp;osmid=11101&amp;class=amenity</a><br>
 &gt; - Wien Rathaus (name)<br>
&gt; -  атуша (name:be)<br>
&gt; - Vídeňská radnice (name:cs)<br>
&gt; - Rathaus (name:de)<br>
&gt; - Vienna City Hall (name:en)<br>
&gt; - 市庁舎 (name:ja)<br>
&gt; -  атуша (name:ru)<br>
&gt; - Віденська ратуша (name:uk)<br>
&gt; <br>
&gt; <br>
&gt; <br>
&gt; Best,<br>
&gt;   Imre<br>
&gt; <br>
&gt; <br>
&gt; <br>
&gt; <br>
&gt; Shaozhong SHI &lt;<a href="mailto:shishaozhong@gmail.com" \
target="_blank">shishaozhong@gmail.com</a>&gt; ezt írta (időpont: 2020. máj. 11., \
H, 18:24):<br> &gt; Hello,<br>
&gt; <br>
&gt; I got a few questions as follows:<br>
&gt; <br>
&gt; 1.   Which one is the best way for Fuzzy Address Matching?<br>
&gt; <br>
&gt; 2.   FME FuzzyStringComparer uses   Python difflib module.   Which one in \
Postgres is equivalent or similar to it?<br> &gt; <br>
&gt; 3.   Often, addresses collected by different people may well be correct.   But, \
there may be typing errors, or addresses are composed not in a consistent manner.<br> \
&gt; <br> &gt; For instance, South Great Avenue, A City, Planet Earth may be put down \
as the following:<br> &gt; <br>
&gt; S. Great Aveue, City A, Earth Planet<br>
&gt; Great Avene South, A City, Earth Planet<br>
&gt; Great Avenue S, A City, Planet Earth<br>
&gt; <br>
&gt; Surely, there would be solutions to deal with this problem.<br>
&gt; <br>
&gt; Can anyone enlighten me?<br>
&gt; <br>
&gt; Regards,<br>
&gt; <br>
&gt; Shao<br>
&gt; _______________________________________________<br>
&gt; postgis-users mailing list<br>
&gt; <a href="mailto:postgis-users@lists.osgeo.org" \
target="_blank">postgis-users@lists.osgeo.org</a><br> &gt; <a \
href="https://lists.osgeo.org/mailman/listinfo/postgis-users" rel="noreferrer" \
target="_blank">https://lists.osgeo.org/mailman/listinfo/postgis-users</a><br> &gt; \
_______________________________________________<br> &gt; postgis-users mailing \
list<br> &gt; <a href="mailto:postgis-users@lists.osgeo.org" \
target="_blank">postgis-users@lists.osgeo.org</a><br> &gt; <a \
href="https://lists.osgeo.org/mailman/listinfo/postgis-users" rel="noreferrer" \
target="_blank">https://lists.osgeo.org/mailman/listinfo/postgis-users</a><br> <br>
_______________________________________________<br>
postgis-users mailing list<br>
<a href="mailto:postgis-users@lists.osgeo.org" \
target="_blank">postgis-users@lists.osgeo.org</a><br> <a \
href="https://lists.osgeo.org/mailman/listinfo/postgis-users" rel="noreferrer" \
target="_blank">https://lists.osgeo.org/mailman/listinfo/postgis-users</a></blockquote></div>
 _______________________________________________<br>
postgis-users mailing list<br>
<a href="mailto:postgis-users@lists.osgeo.org" \
target="_blank">postgis-users@lists.osgeo.org</a><br> <a \
href="https://lists.osgeo.org/mailman/listinfo/postgis-users" rel="noreferrer" \
target="_blank">https://lists.osgeo.org/mailman/listinfo/postgis-users</a></blockquote></div>
 </blockquote></div>


[Attachment #6 (text/plain)]

_______________________________________________
postgis-users mailing list
postgis-users@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/postgis-users

[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic