
List:       nutch-developers
Subject:    [Nutch-dev] [ nutch-Bugs-993380 ] figure out charset from the meta tag
From:       "SourceForge.net" <noreply () sourceforge ! net>
Date:       2004-07-22 16:37:03
Message-ID: E1BngZ5-0002cI-00 () sc8-sf-web1 ! sourceforge ! net

Bugs item #993380, was opened at 2004-07-18 11:04
Message generated for change (Comment added) made by cutting
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=993380&group_id=59548

Category: plugin: parse-html
Group: None
>Status: Closed
>Resolution: Accepted
Priority: 5
Submitted By: Jungshik Shin (jshin)
Assigned to: Nobody/Anonymous (nobody)
Summary: figure out charset from the meta tag

Initial Comment:
The majority of HTML documents on the web are served
without a 'charset' specified in the Content-Type HTTP
header. Currently, Nutch doesn't look 'into' HTML files
to figure out the character encoding (charset), which is
often specified in a 'meta tag' like this:

<meta http-equiv="content-type" content="text/html;
charset=ISO-8859-7">

I changed the HTML parser to look 'into' HTML files and
read off the value of 'charset'. A SAX XML parser can't
be used because it needs to know the encoding before
parsing. My patch uses a technique often used by
browsers (I know for sure Mozilla does this), which is
to (blindly) inflate the byte sequence to 2-byte units
(by zero-padding) and apply a regular expression. Only
the first 2000 bytes are examined this way, and I was
able to figure out the charset of most documents
whose encoding wouldn't otherwise be known.
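
For readers of the archive, the sniffing technique described above could
be sketched roughly as follows. This is not the committed patch: the
class name, the regular expressions and the null return value are
assumptions made for illustration. Only the method name
sniffCharacterEncoding (mentioned later in this thread), the 2000-byte
window and the zero-padding idea come from the report.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption, not from the patch.
final class CharsetSniffer {
    // Only the first 2000 bytes are inspected, as described in the report.
    private static final int CHUNK_SIZE = 2000;

    // Assumed pattern: a <meta ...> tag carrying http-equiv="content-type".
    private static final Pattern META_PATTERN = Pattern.compile(
        "<meta\\s+[^>]*http-equiv\\s*=\\s*[\"']?content-type[\"']?[^>]*>",
        Pattern.CASE_INSENSITIVE);

    // Assumed pattern: the charset=... parameter inside that tag.
    private static final Pattern CHARSET_PATTERN = Pattern.compile(
        "charset\\s*=\\s*[\"']?([a-z][_\\-0-9a-z]*)",
        Pattern.CASE_INSENSITIVE);

    // Returns the charset name found in a meta tag, or null if none is found.
    static String sniffCharacterEncoding(byte[] content) {
        int length = Math.min(content.length, CHUNK_SIZE);
        // Inflate each byte to a 16-bit char by zero-padding the high byte,
        // so the regex can run without knowing the real encoding.
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) (content[i] & 0xFF));
        }
        Matcher meta = META_PATTERN.matcher(sb);
        if (meta.find()) {
            Matcher charset = CHARSET_PATTERN.matcher(meta.group());
            if (charset.find()) {
                return charset.group(1);
            }
        }
        return null;
    }
}
```

This works because the tag itself is plain ASCII in the encodings
commonly seen in HTML, so the zero-padded text still matches even when
the rest of the page is in another charset.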

I also explicitly set the fallback charset (when both
HTTP C-T header and meta tag are missing) to
'windows-1252' (a superset of ISO-8859-1). Probably,
this should be made even smarter in two ways:

1) make the fallback charset configurable
2) use a per-TLD / ccTLD fallback charset mapping table
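
A rough sketch of how the two suggestions could fit together is given
below. Everything here is hypothetical: the class and method names, the
order in which the sources are consulted (HTTP header, then meta tag,
then TLD table, then the configured default) and the two table entries
are assumptions chosen only to illustrate the idea. A real table would
have to be researched.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; not part of the patch discussed in this report.
final class FallbackCharset {
    // Illustrative entries only (assumed, not taken from the report).
    private static final Map<String, String> TLD_DEFAULTS = new HashMap<>();
    static {
        TLD_DEFAULTS.put("gr", "ISO-8859-7");
        TLD_DEFAULTS.put("ru", "windows-1251");
    }

    // Picks a charset: HTTP header first, then meta tag, then the
    // per-TLD table, and finally the configured default.
    static String choose(String httpCharset, String metaCharset,
                         String host, String defaultCharset) {
        if (httpCharset != null) {
            return httpCharset;
        }
        if (metaCharset != null) {
            return metaCharset;
        }
        if (host != null) {
            // Take the text after the last dot as the TLD.
            String tld = host.substring(host.lastIndexOf('.') + 1)
                             .toLowerCase(Locale.ROOT);
            String byTld = TLD_DEFAULTS.get(tld);
            if (byTld != null) {
                return byTld;
            }
        }
        return defaultCharset;
    }
}
```

For example, choose(null, null, "www.example.gr", "windows-1252") would
pick the table entry for "gr", while a host under a TLD that is not in
the table falls through to "windows-1252".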
  
In addition, I'm storing the value of the 'detected'
character encoding as 'metadata' so that 'cached.jsp'
can make use of it. I'll file a new bug and upload a
patch to cached.jsp.



----------------------------------------------------------------------

>Comment By: Doug Cutting (cutting)
Date: 2004-07-22 09:37

Message:
Logged In: YES 
user_id=21778

I just committed this.  Thanks!

----------------------------------------------------------------------

Comment By: Jungshik Shin (jshin)
Date: 2004-07-21 22:29

Message:
Logged In: YES 
user_id=307557

This is a new patch addressing Doug's concerns.


----------------------------------------------------------------------

Comment By: Doug Cutting (cutting)
Date: 2004-07-20 08:57

Message:
Logged In: YES 
user_id=21778

I guess resolveEncodingAlias is okay in StringUtils,
although I wonder if it might be better on a new class like
TextUtils or I18NUtils... But I'm okay with that in StringUtils.

----------------------------------------------------------------------

Comment By: Jungshik Shin (jshin)
Date: 2004-07-19 21:39

Message:
Logged In: YES 
user_id=307557

Thanks for taking a look at the patch. I'll do what you
suggested (later this week) and upload a new patch. Btw,
charset alias resolution is useful not only for HTML but
also for other 'plugins' (say, text/plain). Given that,
how about keeping 'String resolveEncodingAlias' in
StringUtils while moving 'String sniffCharacterEncoding' to
htmlParser?
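
To illustrate what alias resolution means here, a minimal sketch follows.
It is an assumption about the general idea, not the body of the method in
the patch: it leans on the JDK's own java.nio.charset.Charset alias table,
and the class name and the choice to return null for unknown names are
made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical holder class; only the method name comes from this thread.
final class EncodingAliases {
    // Maps an alias such as "latin1" to the JDK's canonical charset name,
    // or returns null when the name is missing, malformed or unsupported.
    static String resolveEncodingAlias(String encoding) {
        if (encoding == null) {
            return null;
        }
        try {
            return Charset.forName(encoding.trim()).name();
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return null;
        }
    }
}
```

Because nothing in it is HTML-specific, a helper like this could be
shared by a text/plain parser as well, which is the point made above.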

----------------------------------------------------------------------

Comment By: Doug Cutting (cutting)
Date: 2004-07-19 14:21

Message:
Logged In: YES 
user_id=21778

Sorry I missed this one.

Could you please move the utility methods from StringUtil
into HtmlParser?  These are not really universal string
utilities.

Also, please add a default charset to conf/nutch-default.xml
and use NutchConf to access it.  Look at other classes for
how they do this, or ask for help if it is not obvious.

Thanks!

----------------------------------------------------------------------

Comment By: Jungshik Shin (jshin)
Date: 2004-07-19 13:53

Message:
Logged In: YES 
user_id=307557

Doug, thanks for applying my patch for bug 993385
(cached.jsp fix :
https://sourceforge.net/tracker/index.php?func=detail&aid=993385&group_id=59548&atid=491356),
but that patch doesn't work without my patch for this bug.
Can you take a look at this patch and commit it as you see
fit? Thanks.


----------------------------------------------------------------------


