
List:       spamassassin-devel
Subject:    [Bug 7215] New: Towards supporting IDNA (Internationalizing Domain Names in Applications)
From:       bugzilla-daemon@bugzilla.spamassassin.org
Date:       2015-06-22 19:06:26
Message-ID: <bug-7215-26@https.bz.apache.org/SpamAssassin/>

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7215

            Bug ID: 7215
           Summary: Towards supporting IDNA (Internationalizing Domain
                    Names in Applications)
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Hardware: All
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Libraries
          Assignee: dev@spamassassin.apache.org
          Reporter: Mark.Martinec@ijs.si

I am opening this ticket to coordinate our efforts towards supporting
Internationalized Domain Names (which also goes hand in hand with
making better use of Perl's Unicode features).

As Kevin plans to work on this during the summer, I'm attaching
my current work in this area to avoid duplicated effort. None of
this is set in stone yet, so it is open to reshuffling of code,
or to ditching/replacing/reorganizing it altogether. The main idea
is to provide some tools and example code.

The patch makes use of the Perl module Net::LibIDN and issues a
warning if that module is not available (in which case the feature
is disabled). It should be compatible with existing code. It might
even work with Perl 5.8.9, although 5.12 or later would be a better
choice for its much-improved Unicode support.
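
The "optional dependency" pattern described above (load the module if
present, otherwise warn once and disable the feature) can be sketched
as follows. The sketch is in Python for illustration; the patch itself
does this in Perl around Net::LibIDN, and all names here are hypothetical:

```python
import warnings

def try_import(name):
    """Attempt to load an optional module; warn and return None if it
    is missing, so the caller can disable the dependent feature."""
    try:
        return __import__(name)
    except ImportError:
        warnings.warn("module %s not available; IDN support disabled" % name)
        return None

# Feature flag, analogous to the patch's check for Net::LibIDN
# (the module name below is a hypothetical placeholder):
idn_module = try_import('some_idn_module')
HAVE_IDN = idn_module is not None
```

The point of the flag is that the rest of the code can cheaply test
HAVE_IDN instead of wrapping every call site in its own exception handler.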

I have been running this changed code (on SA trunk (4.0), with Perl
5.20 and 5.22) for the last two months: it solves my immediate problem
of turning U-labels (in Unicode URIs) into ASCII Compatible Encoding
(ACE) for the purpose of URI lookups against black/white-lists.
It is not perfect, but better than nothing.
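
For readers unfamiliar with the U-label to A-label step: the conversion
the patch performs via Net::LibIDN is, conceptually, the standard IDNA
ToASCII/ToUnicode mapping. A minimal sketch using Python's built-in
"idna" codec (which implements IDNA 2003; IDNA 2008, as in Net::LibIDN,
differs in some details):

```python
# Convert a Unicode domain (U-labels) to ASCII Compatible Encoding
# (A-labels, the "xn--" form) so it can be matched against
# DNS-based black/white-lists, which operate on ASCII names.
u_label = 'bücher.example'
a_label = u_label.encode('idna').decode('ascii')
print(a_label)   # xn--bcher-kva.example

# The reverse mapping (ACE back to Unicode) also exists:
print(b'xn--bcher-kva.example'.decode('idna'))   # bücher.example
```

The lookup lists only ever see the ASCII form, which is why the
conversion has to happen before the URI is queried.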

The main problem there is that the text parser (or HTML parser)
does a poor job of extracting Unicode URIs from text; e.g. it
has no notion of Unicode whitespace or of the set of characters
allowed in U-labels. Instead of the more complex task of fixing
the text parser, as a stop-gap solution I added some sanitization
code for extracted URIs: it trims prefix and suffix characters
that cannot appear in a valid Unicode URI. This sanitization code
should eventually be removed once the parser is improved.
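
The trimming step can be sketched as below (again in Python for
illustration; the character set shown is an illustrative subset I chose,
not the exact set used in the patch, and the function name is hypothetical):

```python
import re

# Characters that cannot appear at the edges of a valid URI:
# whitespace (ASCII and Unicode, via \s on str patterns), quotes,
# brackets, and trailing punctuation. Illustrative set only.
_EDGE_JUNK = re.compile(
    r'^[\s"\'<>«»‹›“”‘’()\[\]{}]+'          # junk before the URI
    r'|[\s"\'<>«»‹›“”‘’()\[\]{},.;:!?]+$')  # junk after the URI

def trim_uri_junk(s):
    """Trim leading/trailing characters that cannot belong to a URI,
    as a stop-gap for an imprecise text/HTML parser."""
    return _EDGE_JUNK.sub('', s)
```

For example, an extracted fragment like «http://bücher.example/», is
reduced to just the URI itself, which can then be handed to the
ToASCII conversion and the list lookups.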

The general-purpose subroutines provided are:
- MS::Util::idn_to_ascii
- MS::Util::is_valid_utf_8

and the three user-defined character classes:
- InIDNAWhitespace
- InIDNAFullStop
- InIDNA2008
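
To illustrate what a class like InIDNAFullStop covers: IDNA (RFC 3490,
section 3.1) recognizes four label separators, not just U+002E — also
the ideographic full stop U+3002, the fullwidth full stop U+FF0E, and
the halfwidth ideographic full stop U+FF61. A sketch of normalizing
them (in Python; the function name is hypothetical, and the actual
semantics of the three classes are defined in the attached patch):

```python
# The three non-ASCII label separators recognized by IDNA,
# in addition to the ordinary full stop U+002E:
FULL_STOPS = {'\u3002', '\uFF0E', '\uFF61'}

def normalize_full_stops(domain):
    """Map all IDNA label separators to U+002E so that a domain can
    be split into labels the usual way."""
    return ''.join('.' if ch in FULL_STOPS else ch for ch in domain)
```

Without such a mapping, a spammer could write example。com and the
label-splitting (and hence the list lookup) would miss the domain.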

-- 
You are receiving this mail because:
You are the assignee for the bug.
