[prev in list] [next in list] [prev in thread] [next in thread] 

List:       postgis-users
Subject:    Re: [postgis-users] Various ways to handle addresses in postgresql
From:       Stephen Woodbridge <stephenwoodbridge37 () gmail ! com>
Date:       2020-04-29 17:52:29
Message-ID: be8fc181-957e-1ab3-731d-7dbb295e2e09 () gmail ! com
[Download RAW message or body]

On 4/29/2020 9:15 AM, Shaozhong SHI wrote:
> Hi, Steve,
> 
> I have address_standardizer extension created.
> 
> It is not like other functions.  How to review its code? Does it need 
> to adapted to handle other countries addresses?

As I tried to indicate, the postGIS address_standardizer extension is 
extremely hard to understand and nearly impossible to customize the 
files. I migrated the PAGC code into this extension but I barely 
understood the code. Conceptually, the ideas are straight forward, but 
the code is very hard to follow and understand so I portted it as a 
black box.

If the exisiting address_standardizer_data_us does not work for you or 
you need to support other countries, you will need to build and install 
the one here:

https://github.com/woodbri/address-standardizer

This was written as a replacement
This one has sample files for 25 countries here:

https://github.com/woodbri/address-standardizer/tree/develop/data/sample

that you can customize and has (hopefully) good documentation here:

https://github.com/woodbri/address-standardizer/blob/develop/README.md
https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
https://github.com/woodbri/address-standardizer/tree/develop/data

Also, this code is written in C++ and I hope it is written to be easier 
to understand and review.

Not sure I can help much more than this unless you have questions on 
https://github.com/woodbri/address-standardizer which we should probably 
take off the PostGIS list as this is not part of PostGIS.

-Steve

> 
> Regards,
> 
> Shao
> 
> On Sun, 26 Apr 2020 at 17:23, Stephen Woodbridge 
> <stephenwoodbridge37@gmail.com <mailto:stephenwoodbridge37@gmail.com>> 
> wrote:
> 
> Shao,
> 
> I just remembered the lex, gaz, and rules data is in a separate
> extension. The correct way to install it is with:
> 
> create extension address_standardizer_data_us;
> 
> -Steve
> 
> On 4/26/2020 8:09 AM, Shaozhong SHI wrote:
> > Hi, Steve,
> > 
> > Thanks.
> > 
> > 2 questions.
> > 
> > 1.  How can we remove things like Room 2a, Buildings 2-6b and
> etc with
> > regexp replace?
> > 2.  Once extensions created, can these functions be adapted?  Are
> > codes available ?    I will see whether to put it into a
> project, so
> > that our programmers can have work to do.
> > 
> > Regards,
> > 
> > Shao
> > 
> > On Sun, 26 Apr 2020 at 03:09, Stephen Woodbridge
> > <stephenwoodbridge37@gmail.com
> <mailto:stephenwoodbridge37@gmail.com>
> <mailto:stephenwoodbridge37@gmail.com
> <mailto:stephenwoodbridge37@gmail.com>>>
> > wrote:
> > 
> > On 4/25/2020 7:19 PM, Shaozhong SHI wrote:
> > > Hi, Steve,
> > > 
> > > Many thanks.  Please send me the link to parse_address() and
> > > standardize_address().
> > 
> > If you already have postGIS installed then
> > 
> > create extension address_standardizer;
> > 
> > # \df parse_address
> > List of functions
> > Schema |     Name      | Result data type | Argument data
> types
> > > Type
> > 
> --------+---------------+------------------+---------------------------------------- \
> -------------------------------------------------------------------------------------------------------------------+--------
> 
> > public | parse_address | record           | text, OUT num
> text, OUT
> > street text, OUT street2 text, OUT address1 text, OUT city
> text, OUT
> > state text, OUT zip text, OUT zipplus text, OUT country text
> > normal
> > (1 row)
> > 
> > # select * from parse_address('123-2 main street city ny');
> > num  |   street    | street2 |     address1      | city |
> state
> > > zip
> > > zipplus | country
> > 
> -------+-------------+---------+-------------------+------+-------+-----+---------+---------
> 
> > 123-2 | main street |         | 123-2 main street | city |
> NY |
> > > > US
> > (1 row)
> > 
> > # \df standardize*
> > List of functions
> > Schema |        Name         | Result data type |
> > Argument data types                      | Type
> > 
> --------+---------------------+------------------+---------------------------------------------------------------+--------
> 
> > public | standardize_address | stdaddr          | lextab text,
> > gaztab
> > text, rultab text, address text           | normal
> > public | standardize_address | stdaddr          | lextab text,
> > gaztab
> > text, rultab text, micro text, macro text | normal
> > 
> > You need tables for the lexicon, gazetteer, and rules, which
> > should be
> > included in the extension but I'm not seeing them. So you
> can grab
> > these
> > from:
> > 
> > 
> https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-gaz.sql
> 
> > 
> https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-lex.sql
> 
> > 
> https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-rules.sql
> 
> > 
> > and load them like:
> > 
> > psql mydb -f us-gaz.sql
> > psql mydb -f us-lex.sql
> > psql mydb -f us-rules.sql
> > 
> > # select * from standardize_address('lex', 'gaz', 'rules',
> '123-2
> > main
> > street city ny');
> > building | house_num | predir | qual | pretype | name  |
> suftype |
> > sufdir | ruralroute | extra | city |  state   | country |
> postcode
> > > box
> > > unit
> > 
> ----------+-----------+--------+------+---------+--------+---------+--------+------------+-------+------+----------+---------+----------+-----+------
> 
> > > 123       |        |      |         | 2 MAIN |
> STREET
> > > > > > CITY | NEW YORK | USA     |
> > > 
> > (1 row)
> > 
> > 
> > This is a good example of why parsing addresses is so
> difficult. The
> > rules for standardize_address do not account for a house
> number like
> > "123-2", but the regexp in parse_address do handle it. It is
> easy
> > to get
> > the 80% right and very hard to get it much above that.
> > 
> > -Steve
> > 
> > 
> > > 
> > > I need to find these first before test-running.
> > > 
> > > Regards,
> > > 
> > > Shao
> > > 
> > > On Sat, 25 Apr 2020 at 21:20, Stephen Woodbridge
> > > <stephenwoodbridge37@gmail.com
> <mailto:stephenwoodbridge37@gmail.com>
> > <mailto:stephenwoodbridge37@gmail.com
> <mailto:stephenwoodbridge37@gmail.com>>
> > <mailto:stephenwoodbridge37@gmail.com
> <mailto:stephenwoodbridge37@gmail.com>
> > <mailto:stephenwoodbridge37@gmail.com
> <mailto:stephenwoodbridge37@gmail.com>>>>
> > > wrote:
> > > 
> > > Shao,
> > > 
> > > '^( *Building *[0-9]+)?[- 0-9]*'
> > > 
> > > or something like that should do it. But I think you will
> > find that a
> > > more robust solution is to use parse_address() and/or
> > > standardize_address() as they will recognize a lot of
> other
> > address
> > > constructs, like "apt 3a" for for example.
> > > 
> > > parse_address() that a text field and breaks it into
> "house
> > number
> > > street name" and "city state zip", but only works well
> in North
> > > America.
> > > 
> > > standardize_address() that comes with postGIS, breaks the
> > address
> > > down
> > > into its components and can separate out things like
> > buildings, and
> > > apartment/unit specifiers so you can then take the fields
> > you are
> > > interested in and recombine just them in a new string.
> > Again, this
> > > works
> > > best in North America.
> > > 
> > > My github address-standardizer is built to recognize
> address
> > for most
> > > counties, but it can also be configured to recognize
> address
> > > standards
> > > for any county without too much effort. It compiles and
> > installs as
> > > postgresql extension.
> > > 
> > > Addresses are generally very messy and unless your
> addresses
> > are vary
> > > simple you will be constantly fighting with this or that
> > exception.
> > > 
> > > -Steve
> > > 
> > > On 4/25/2020 2:55 PM, Shaozhong SHI wrote:
> > > > Is there a way to left trim including the building
> and number?
> > > > 
> > > > Building 3  21-1              Great Avenue, a city, a
> > country, this
> > > > planet.
> > > > 
> > > > How to take way those things which are too local to an
> > address?
> > > > 
> > > > Regards,
> > > > 
> > > > Shao
> > > > 
> > > > On Sat, 25 Apr 2020 at 01:48, Shaozhong SHI
> > > <shishaozhong@gmail.com
> <mailto:shishaozhong@gmail.com> <mailto:shishaozhong@gmail.com
> <mailto:shishaozhong@gmail.com>>
> > <mailto:shishaozhong@gmail.com
> <mailto:shishaozhong@gmail.com> <mailto:shishaozhong@gmail.com
> <mailto:shishaozhong@gmail.com>>>
> > > > <mailto:shishaozhong@gmail.com
> <mailto:shishaozhong@gmail.com>
> > <mailto:shishaozhong@gmail.com
> <mailto:shishaozhong@gmail.com>> <mailto:shishaozhong@gmail.com
> <mailto:shishaozhong@gmail.com>
> > <mailto:shishaozhong@gmail.com
> <mailto:shishaozhong@gmail.com>>>>>
> > > wrote:
> > > > 
> > > > I find this is a simple, but important question.
> > > > 
> > > > How best to split numbers and the rest of address?
> > > > 
> > > > For instance, one tricky one is as follows:
> > > > 
> > > > 21-1 Great Avenue, a city, a country, this planet
> > > > 
> > > > How to turn this into the following:
> > > > 
> > > > column 1,       column 2
> > > > 
> > > > 21-1              Great Avenue, a city, a country,
> > this planet
> > > > 
> > > > Note:  there is a hyphen in  21-1
> > > > 
> > > > Any clue?
> > > > 
> > > > Regards,
> > > > 
> > > > Shao
> > > > 
> > > > 
> > > > _______________________________________________
> > > > postgis-users mailing list
> > > > postgis-users@lists.osgeo.org
> <mailto:postgis-users@lists.osgeo.org>
> > <mailto:postgis-users@lists.osgeo.org
> <mailto:postgis-users@lists.osgeo.org>>
> > <mailto:postgis-users@lists.osgeo.org
> <mailto:postgis-users@lists.osgeo.org>
> > <mailto:postgis-users@lists.osgeo.org
> <mailto:postgis-users@lists.osgeo.org>>>
> > > > https://lists.osgeo.org/mailman/listinfo/postgis-users
> > > 
> > > _______________________________________________
> > > postgis-users mailing list
> > > postgis-users@lists.osgeo.org
> <mailto:postgis-users@lists.osgeo.org>
> > <mailto:postgis-users@lists.osgeo.org
> <mailto:postgis-users@lists.osgeo.org>>
> > <mailto:postgis-users@lists.osgeo.org
> <mailto:postgis-users@lists.osgeo.org>
> > <mailto:postgis-users@lists.osgeo.org
> <mailto:postgis-users@lists.osgeo.org>>>
> > > https://lists.osgeo.org/mailman/listinfo/postgis-users
> > > 
> > > 
> > > _______________________________________________
> > > postgis-users mailing list
> > > postgis-users@lists.osgeo.org
> <mailto:postgis-users@lists.osgeo.org>
> <mailto:postgis-users@lists.osgeo.org
> <mailto:postgis-users@lists.osgeo.org>>
> > > https://lists.osgeo.org/mailman/listinfo/postgis-users
> > 
> > _______________________________________________
> > postgis-users mailing list
> > postgis-users@lists.osgeo.org
> <mailto:postgis-users@lists.osgeo.org>
> <mailto:postgis-users@lists.osgeo.org
> <mailto:postgis-users@lists.osgeo.org>>
> > https://lists.osgeo.org/mailman/listinfo/postgis-users
> > 
> > 
> > _______________________________________________
> > postgis-users mailing list
> > postgis-users@lists.osgeo.org <mailto:postgis-users@lists.osgeo.org>
> > https://lists.osgeo.org/mailman/listinfo/postgis-users
> 
> _______________________________________________
> postgis-users mailing list
> postgis-users@lists.osgeo.org <mailto:postgis-users@lists.osgeo.org>
> https://lists.osgeo.org/mailman/listinfo/postgis-users
> 
> 
> _______________________________________________
> postgis-users mailing list
> postgis-users@lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-users

_______________________________________________
postgis-users mailing list
postgis-users@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/postgis-users


[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic