[prev in list] [next in list] [prev in thread] [next in thread]
List: postgis-users
Subject: Re: [postgis-users] Fuzzy Address Matching - PostgreSql equivalent to FuzzyStringComparer using Pyth
From: Shaozhong SHI <shishaozhong () gmail ! com>
Date: 2020-05-12 2:56:32
Message-ID: CA+i5JwbT4MJA4RDKkqJgRskcC5Ks7_70ZGkEneaGfSv7QD27Bw () mail ! gmail ! com
[Download RAW message or body]
[Attachment #2 (multipart/alternative)]
Sorry. It is a very interesting topic, so far as our field is concerned.
Regards,
Shao
On Tue, 12 May 2020 at 03:55, Shaozhong SHI <shishaozhong@gmail.com> wrote:
> Dear All Friends,
>
> Thank you very much for all the information.
>
> I will spend time to digest and think over all these.
>
> It is certainly a very interest topic.
>
> Regards,
>
> Shao
>
> On Mon, 11 May 2020 at 18:34, James Klassen <klassen.js@gmail.com> wrote:
>
>> There are lots of classes of tricky cases:
>>
>> For example:
>>
>> 1234 County Road E, Hudson
>>
>> Is actually "County Road E", not "County Road East". The county roads
>> there are indexed by letters. There is also a County Road F, and a County
>> Road N, and a County Road FF, etc.) Last I checked, even Google, with all
>> their data, gets this one wrong.
>>
>> In general, address parsing is a hard problem and there are always (at
>> least in the US) going to be cases that are ambiguous to parse. The people
>> who have been answering your questions on this list have been working on
>> this problem for as long as I can remember. And they have spent many years
>> developing libraries that make dealing with many of the corner cases of
>> address parsing easier. They started these libraries because the quick and
>> simple solutions you seem to be hoping for failed to deal with the
>> realities of addressing. There is not a perfect solution, but what they
>> are suggesting are the best solutions available.
>>
>> If you want to start over and reinvent a simpler solution, go ahead. But
>> be advised that the people answering you have been down that path. They
>> know what the next problem will be, and the problem after that, and they
>> know that at the end of the chain of problems you will discover you need
>> something like the tools and libraries they are suggesting.
>>
>> On Mon, May 11, 2020 at 12:31 PM Paul Ramsey <pramsey@cleverelephant.ca>
>> wrote:
>>
>>> There's a huge range of potential functionality/features that can reside
>>> under the banner "fuzzy address matching".
>>>
>>> At the simplest, you're trying to catch common lexicographic
>>> differences, like off-by-one addresses or alternate spellings. This is the
>>> realm of trigrams and levenstein distances.
>>> Then you start dealing with different forms of abbreviations (rd, road,
>>> r) and formats (unit 4, #4, apt 4, 4). This is the realm of data-less
>>> standardizers.
>>> Then you start dealing with larger forms of aliasing. Standard city and
>>> state names, recognizing major components of addresses. This is the real of
>>> dictionary-backed standardizers.
>>> Then you start dealing with all forms of aliasing. This is actually
>>> getting down into geocoding, and address standardizers backed by complete
>>> address and road databases.
>>>
>>> What people mean when they say "can you do address standardization" can
>>> vary massively. It can also be frustrating because the performance of
>>> something like Google's geocoder algorithm, backed by the largest, and most
>>> up-to-data geographic database in the world gets casually compared to
>>> simple format and dictionary backed standardizer by folks with no
>>> understanding of the complexity or amount of data that lives under the
>>> covers of this "simple" task. I think people are far more understanding of
>>> something like machine translation and "get" that it's a really hard
>>> problem, because learning a new language is hard. But geocoding is "easy",
>>> anyone can look at an address and then look that address up in a map book
>>> (ha ha, well anyone over 40 at least).
>>>
>>> P.
>>>
>>>
>>> > On May 11, 2020, at 10:15 AM, Imre Samu <pella.samu@gmail.com> wrote:
>>> >
>>> > > ... Fuzzy Address Matching? ...
>>> > > For instance, South Great Avenue, A City, Planet Earth may be put
>>> down as the following:
>>> >
>>> > sometimes .. the same country/city/building - has a different
>>> names ..
>>> >
>>> > "A City" example:
>>> > https://en.wikipedia.org/wiki/Vienna
>>> >
>>> > Vienna:
>>> > "Beč (Croatian, Serbian, older Bulgarian), Beç (older Turkish)*, Bech
>>> or Vidnya (Romani), Bécs (Hungarian)*, Bin / Pin - 빈 (Korean), Dunaj
>>> (Slovene)*, Fienna (Welsh), Vedunia (Celtic), Vena - Вена (Russian), Vídeň
>>> (Czech)*, Viden' / Videň (Ukrainian)*, Viedeň (Slovak), Viên (Vietnamese),
>>> Viena / Vijena/ Виена (Belarusian, Bulgarian, Macedonian), Viena (Catalan*,
>>> Lithuanian, Portuguese*, Romanian*, Spanish*, Tagalog*), Vienna (Italian)*,
>>> Vienne (French)*, Viénni - Βιέννη (Greek), Vieno (Esperanto), Viin
>>> (Estonian), Vin - װין (Yiddish), Vín (Irish, Icelandic), Vina - וי ה
>>> (Hebrew), Vínarborg (Icelandic variant), Vindobona (Latin), Vīne
>>> (Latvian)*, Viyana (Turkish)*, Vjenë (Albanian), Vjenna (Maltese), Vyana
>>> (Azeri), Wean (local Viennese, Austrian and Bavarian dialects)*, Weiyena -
>>> 維也納 (Chinese)*, Wene (Afrikaans), Wenen (Dutch)*, Wiedeń (Polish)*, Wien
>>> (Danish*, Finnish*, German*, Norwegian*, Swedish*), Wīn - ウィーン (Japanese)*,
>>> Wina (Indonesian), فيينا (Arabic), وين (Persian)"
>>> > "
>>> > see:
>>> https://en.wikipedia.org/wiki/Names_of_European_cities_in_different_languages_(U%E2%80%93Z)#V
>>> >
>>> https://en.wikipedia.org/wiki/List_of_names_of_European_cities_in_different_languages
>>> >
>>> > or see the Nominatim names for Vienna:
>>> >
>>> https://nominatim.openstreetmap.org/details.php?osmtype=R&osmid=109166&class=boundary
>>> >
>>> >
>>> > And sometimes - the buildings has a different names ...
>>> >
>>> https://nominatim.openstreetmap.org/details.php?osmtype=R&osmid=11101&class=amenity
>>> > - Wien Rathaus (name)
>>> > - атуша (name:be)
>>> > - Vídeňská radnice (name:cs)
>>> > - Rathaus (name:de)
>>> > - Vienna City Hall (name:en)
>>> > - 市庁舎 (name:ja)
>>> > - атуша (name:ru)
>>> > - Віденська ратуша (name:uk)
>>> >
>>> >
>>> >
>>> > Best,
>>> > Imre
>>> >
>>> >
>>> >
>>> >
>>> > Shaozhong SHI <shishaozhong@gmail.com> ezt írta (időpont: 2020. máj.
>>> 11., H, 18:24):
>>> > Hello,
>>> >
>>> > I got a few questions as follows:
>>> >
>>> > 1. Which one is the best way for Fuzzy Address Matching?
>>> >
>>> > 2. FME FuzzyStringComparer uses Python difflib module. Which one in
>>> Postgres is equivalent or similar to it?
>>> >
>>> > 3. Often, addresses collected by different people may well be
>>> correct. But, there may be typing errors, or addresses are composed not in
>>> a consistent manner.
>>> >
>>> > For instance, South Great Avenue, A City, Planet Earth may be put down
>>> as the following:
>>> >
>>> > S. Great Aveue, City A, Earth Planet
>>> > Great Avene South, A City, Earth Planet
>>> > Great Avenue S, A City, Planet Earth
>>> >
>>> > Surely, there would be solutions to deal with this problem.
>>> >
>>> > Can anyone enlighten me?
>>> >
>>> > Regards,
>>> >
>>> > Shao
>>> > _______________________________________________
>>> > postgis-users mailing list
>>> > postgis-users@lists.osgeo.org
>>> > https://lists.osgeo.org/mailman/listinfo/postgis-users
>>> > _______________________________________________
>>> > postgis-users mailing list
>>> > postgis-users@lists.osgeo.org
>>> > https://lists.osgeo.org/mailman/listinfo/postgis-users
>>>
>>> _______________________________________________
>>> postgis-users mailing list
>>> postgis-users@lists.osgeo.org
>>> https://lists.osgeo.org/mailman/listinfo/postgis-users
>>
>> _______________________________________________
>> postgis-users mailing list
>> postgis-users@lists.osgeo.org
>> https://lists.osgeo.org/mailman/listinfo/postgis-users
>
>
[Attachment #5 (text/html)]
<div dir="ltr">Sorry. It is a very interesting topic, so far as our field is \
concerned.<div><br></div><div>Regards,</div><div><br></div><div>Shao</div></div><br><div \
class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, 12 May 2020 at 03:55, \
Shaozhong SHI <<a \
href="mailto:shishaozhong@gmail.com">shishaozhong@gmail.com</a>> \
wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px \
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Dear \
All Friends,<div><br></div><div>Thank you very much for all the \
information.</div><div><br></div><div>I will spend time to digest and think over all \
these.</div><div><br></div><div>It is certainly a very interest \
topic.</div><div><br></div><div>Regards,</div><div><br></div><div>Shao</div></div><br><div \
class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, 11 May 2020 at 18:34, \
James Klassen <<a href="mailto:klassen.js@gmail.com" \
target="_blank">klassen.js@gmail.com</a>> wrote:<br></div><blockquote \
class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid \
rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>There are lots of classes of \
tricky cases:</div><div><br></div><div>For example:</div><div><br></div><div> \
1234 County Road E, Hudson </div><div><br></div><div>Is actually "County Road \
E", not "County Road East". The county roads there are indexed by \
letters. There is also a County Road F, and a County Road N, and a County Road FF, \
etc.) Last I checked, even Google, with all their data, gets this one \
wrong.</div><div><br></div><div>In general, address parsing is a hard problem and \
there are always (at least in the US) going to be cases that are ambiguous to parse. \
The people who have been answering your questions on this list have been working on \
this problem for as long as I can remember. And they have spent many years \
developing libraries that make dealing with many of the corner cases of address \
parsing easier. They started these libraries because the quick and simple \
solutions you seem to be hoping for failed to deal with the realities of \
addressing. There is not a perfect solution, but what they are suggesting are the \
best solutions available.</div><div><br></div><div>If you want to start over and \
reinvent a simpler solution, go ahead. But be advised that the people answering you \
have been down that path. They know what the next problem will be, and the problem \
after that, and they know that at the end of the chain of problems you will discover \
you need something like the tools and libraries they are \
suggesting.</div></div><br><div class="gmail_quote"><div dir="ltr" \
class="gmail_attr">On Mon, May 11, 2020 at 12:31 PM Paul Ramsey <<a \
href="mailto:pramsey@cleverelephant.ca" \
target="_blank">pramsey@cleverelephant.ca</a>> wrote:<br></div><blockquote \
class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid \
rgb(204,204,204);padding-left:1ex">There's a huge range of potential \
functionality/features that can reside under the banner "fuzzy address \
matching". <br> <br>
At the simplest, you're trying to catch common lexicographic differences, like \
off-by-one addresses or alternate spellings. This is the realm of trigrams and \
levenstein distances.<br> Then you start dealing with different forms of \
abbreviations (rd, road, r) and formats (unit 4, #4, apt 4, 4). This is the realm of \
data-less standardizers.<br> Then you start dealing with larger forms of aliasing. \
Standard city and state names, recognizing major components of addresses. This is the \
real of dictionary-backed standardizers.<br> Then you start dealing with all forms of \
aliasing. This is actually getting down into geocoding, and address standardizers \
backed by complete address and road databases.<br> <br>
What people mean when they say "can you do address standardization" can \
vary massively. It can also be frustrating because the performance of something like \
Google's geocoder algorithm, backed by the largest, and most up-to-data \
geographic database in the world gets casually compared to simple format and \
dictionary backed standardizer by folks with no understanding of the complexity or \
amount of data that lives under the covers of this "simple" task. I think \
people are far more understanding of something like machine translation and \
"get" that it's a really hard problem, because learning a new language \
is hard. But geocoding is "easy", anyone can look at an address and then \
look that address up in a map book (ha ha, well anyone over 40 at least).<br> <br>
P.<br>
<br>
<br>
> On May 11, 2020, at 10:15 AM, Imre Samu <<a \
href="mailto:pella.samu@gmail.com" target="_blank">pella.samu@gmail.com</a>> \
wrote:<br> > <br>
> > ... Fuzzy Address Matching? ... <br>
> > For instance, South Great Avenue, A City, Planet Earth may be put down as \
the following:<br> > <br>
> sometimes .. the same country/city/building - has a different names .. \
<br> > <br>
> "A City" example:<br>
> <a href="https://en.wikipedia.org/wiki/Vienna" rel="noreferrer" \
target="_blank">https://en.wikipedia.org/wiki/Vienna</a><br> > <br>
> Vienna:<br>
> "Beč (Croatian, Serbian, older Bulgarian), Beç (older Turkish)*, Bech or \
Vidnya (Romani), Bécs (Hungarian)*, Bin / Pin - 빈 (Korean), Dunaj (Slovene)*, \
Fienna (Welsh), Vedunia (Celtic), Vena - Вена (Russian), Vídeň (Czech)*, \
Viden' / Videň (Ukrainian)*, Viedeň (Slovak), Viên (Vietnamese), Viena / \
Vijena/ Виена (Belarusian, Bulgarian, Macedonian), Viena (Catalan*, Lithuanian, \
Portuguese*, Romanian*, Spanish*, Tagalog*), Vienna (Italian)*, Vienne (French)*, \
Viénni - Βιέννη (Greek), Vieno (Esperanto), Viin (Estonian), Vin - װין \
(Yiddish), Vín (Irish, Icelandic), Vina - וי ה (Hebrew), Vínarborg (Icelandic \
variant), Vindobona (Latin), Vīne (Latvian)*, Viyana (Turkish)*, Vjenë (Albanian), \
Vjenna (Maltese), Vyana (Azeri), Wean (local Viennese, Austrian and Bavarian \
dialects)*, Weiyena - 維也納 (Chinese)*, Wene (Afrikaans), Wenen (Dutch)*, Wiedeń \
(Polish)*, Wien (Danish*, Finnish*, German*, Norwegian*, Swedish*), Wīn - \
ウィーン (Japanese)*, Wina (Indonesian), فيينا (Arabic), وين \
(Persian)"<br> > "<br>
> see: <a href="https://en.wikipedia.org/wiki/Names_of_European_cities_in_different_languages_(U%E2%80%93Z)#V" \
rel="noreferrer" target="_blank">https://en.wikipedia.org/wiki/Names_of_European_cities_in_different_languages_(U%E2%80%93Z)#V</a><br>
> <a href="https://en.wikipedia.org/wiki/List_of_names_of_European_cities_in_different_languages" \
rel="noreferrer" target="_blank">https://en.wikipedia.org/wiki/List_of_names_of_European_cities_in_different_languages</a><br>
> <br>
> or see the Nominatim names for Vienna:<br>
> <a href="https://nominatim.openstreetmap.org/details.php?osmtype=R&osmid=109166&class=boundary" \
rel="noreferrer" target="_blank">https://nominatim.openstreetmap.org/details.php?osmtype=R&osmid=109166&class=boundary</a><br>
> <br>
> <br>
> And sometimes - the buildings has a different names ...<br>
> <a href="https://nominatim.openstreetmap.org/details.php?osmtype=R&osmid=11101&class=amenity" \
rel="noreferrer" target="_blank">https://nominatim.openstreetmap.org/details.php?osmtype=R&osmid=11101&class=amenity</a><br>
> - Wien Rathaus (name)<br>
> - атуша (name:be)<br>
> - Vídeňská radnice (name:cs)<br>
> - Rathaus (name:de)<br>
> - Vienna City Hall (name:en)<br>
> - 市庁舎 (name:ja)<br>
> - атуша (name:ru)<br>
> - Віденська ратуша (name:uk)<br>
> <br>
> <br>
> <br>
> Best,<br>
> Imre<br>
> <br>
> <br>
> <br>
> <br>
> Shaozhong SHI <<a href="mailto:shishaozhong@gmail.com" \
target="_blank">shishaozhong@gmail.com</a>> ezt írta (időpont: 2020. máj. 11., \
H, 18:24):<br> > Hello,<br>
> <br>
> I got a few questions as follows:<br>
> <br>
> 1. Which one is the best way for Fuzzy Address Matching?<br>
> <br>
> 2. FME FuzzyStringComparer uses Python difflib module. Which one in \
Postgres is equivalent or similar to it?<br> > <br>
> 3. Often, addresses collected by different people may well be correct. But, \
there may be typing errors, or addresses are composed not in a consistent manner.<br> \
> <br> > For instance, South Great Avenue, A City, Planet Earth may be put down \
as the following:<br> > <br>
> S. Great Aveue, City A, Earth Planet<br>
> Great Avene South, A City, Earth Planet<br>
> Great Avenue S, A City, Planet Earth<br>
> <br>
> Surely, there would be solutions to deal with this problem.<br>
> <br>
> Can anyone enlighten me?<br>
> <br>
> Regards,<br>
> <br>
> Shao<br>
> _______________________________________________<br>
> postgis-users mailing list<br>
> <a href="mailto:postgis-users@lists.osgeo.org" \
target="_blank">postgis-users@lists.osgeo.org</a><br> > <a \
href="https://lists.osgeo.org/mailman/listinfo/postgis-users" rel="noreferrer" \
target="_blank">https://lists.osgeo.org/mailman/listinfo/postgis-users</a><br> > \
_______________________________________________<br> > postgis-users mailing \
list<br> > <a href="mailto:postgis-users@lists.osgeo.org" \
target="_blank">postgis-users@lists.osgeo.org</a><br> > <a \
href="https://lists.osgeo.org/mailman/listinfo/postgis-users" rel="noreferrer" \
target="_blank">https://lists.osgeo.org/mailman/listinfo/postgis-users</a><br> <br>
_______________________________________________<br>
postgis-users mailing list<br>
<a href="mailto:postgis-users@lists.osgeo.org" \
target="_blank">postgis-users@lists.osgeo.org</a><br> <a \
href="https://lists.osgeo.org/mailman/listinfo/postgis-users" rel="noreferrer" \
target="_blank">https://lists.osgeo.org/mailman/listinfo/postgis-users</a></blockquote></div>
_______________________________________________<br>
postgis-users mailing list<br>
<a href="mailto:postgis-users@lists.osgeo.org" \
target="_blank">postgis-users@lists.osgeo.org</a><br> <a \
href="https://lists.osgeo.org/mailman/listinfo/postgis-users" rel="noreferrer" \
target="_blank">https://lists.osgeo.org/mailman/listinfo/postgis-users</a></blockquote></div>
</blockquote></div>
[Attachment #6 (text/plain)]
_______________________________________________
postgis-users mailing list
postgis-users@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/postgis-users
[prev in list] [next in list] [prev in thread] [next in thread]
Configure |
About |
News |
Add a list |
Sponsored by KoreLogic