
List:       nutch-developers
Subject:    [Nutch-dev] [ nutch-Bugs-995730 ] needs 'character encoding' detector
From:       "SourceForge.net" <noreply () sourceforge ! net>
Date:       2004-07-22 6:56:23
Message-ID: E1BnXV9-0001jj-00 () sc8-sf-web4 ! sourceforge ! net

Bugs item #995730, was opened at 2004-07-22 02:56
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=995730&group_id=59548

Category: plugin: other
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Jungshik Shin (jshin)
Assigned to: Nobody/Anonymous (nobody)
Summary: needs 'character encoding' detector

Initial Comment:
This is a follow-up to bug 993380 (figure out 'charset'
from the meta tag).

We can cover a lot of ground using the Content-Type
('C-T') field in the HTTP header and the corresponding
meta tag in HTML documents (XML needs similar but
different parsing). In the wild, however, there are
many documents that carry no information about the
character encoding used. Browsers like Mozilla and
search engines like Google use character encoding
detectors to deal with these 'unlabelled' documents.
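
To make the two 'labelled' sources concrete, here is a minimal Java
sketch that looks for a charset first in the Content-Type header value
and then in a meta tag near the top of the page. The class and method
names (LabelledCharset, fromContentType, fromMetaTag, labelled) are
made up for illustration and are not existing Nutch code, and the
regular expressions are a simplification, not a real HTML parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: finds a declared charset label.
final class LabelledCharset {
    // Matches charset=VALUE with optional quotes, case-insensitively.
    private static final Pattern CHARSET_PARAM = Pattern.compile(
        "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Matches a whole <meta ...> tag (simplified; assumes no '>' inside attributes).
    private static final Pattern META_TAG = Pattern.compile(
        "<meta\\s[^>]*>", Pattern.CASE_INSENSITIVE);

    // Charset parameter of an HTTP Content-Type header value, if any.
    static Optional<String> fromContentType(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET_PARAM.matcher(headerValue);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Charset named in the first meta tag that mentions one, if any.
    static Optional<String> fromMetaTag(String htmlPrefix) {
        if (htmlPrefix == null) {
            return Optional.empty();
        }
        Matcher tag = META_TAG.matcher(htmlPrefix);
        while (tag.find()) {
            Matcher m = CHARSET_PARAM.matcher(tag.group());
            if (m.find()) {
                return Optional.of(m.group(1));
            }
        }
        return Optional.empty();
    }

    // The header wins over the meta tag; empty means the document is unlabelled.
    static Optional<String> labelled(String headerValue, String htmlPrefix) {
        Optional<String> fromHeader = fromContentType(headerValue);
        return fromHeader.isPresent() ? fromHeader : fromMetaTag(htmlPrefix);
    }
}
```

An empty result from labelled() is exactly the 'unlabelled' case that a
detector would have to handle.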

Mozilla's character encoding detector is GPL/MPL'd and
we might be able to port it to Java. Unfortunately,
it's not fool-proof. However, combined with some other
heuristics used by Mozilla and elsewhere, it should be
possible to achieve a high detection rate.
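
As an example of the kind of simple heuristic that could sit next to a
ported detector, the Java sketch below checks for a byte order mark and
then tests whether the bytes decode as strict UTF-8. This is not
Mozilla's detector, which is far more elaborate; the class name
UnlabelledGuess and the caller-supplied fallback are assumptions made
for illustration only.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch or Mozilla: cheap checks for unlabelled bytes.
final class UnlabelledGuess {
    // Returns a charset name guessed from the bytes, or the fallback if no check fires.
    static String guess(byte[] data, String fallback) {
        // A byte order mark is the strongest hint available.
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        // Note: FF FE 00 00 would be a UTF-32LE mark; this sketch ignores UTF-32.
        if (startsWith(data, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        // Text in legacy multi-byte encodings rarely happens to be valid UTF-8.
        // Pure ASCII also passes, which is harmless because ASCII is a UTF-8 subset.
        if (isValidUtf8(data)) {
            return "UTF-8";
        }
        return fallback;
    }

    // True if the whole array decodes as UTF-8 with no malformed sequence.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Compares the first bytes of data, as unsigned values, with the given prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

When neither check fires, the fallback stands in for whatever a real
statistical detector would report. Telling apart the legacy encodings
themselves is the hard part and is not attempted here.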

The following page has links to some other related pages.

http://trainedmonkey.com/week/2004/26

In addition to the character encoding detection, we
also need to detect the language of a document, which
is even harder and should be a separate bug (although
it's related). 
