List: webkit-changes
Subject: [webkit-changes] [42023] trunk/WebCore
From: eric () webkit ! org
Date: 2009-03-27 0:28:47
Message-ID: 20090327002848.E1CF012FF0AD () beta ! macosforge ! org
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta http-equiv="content-type" content="text/html; charset=utf-8" />
<title>[42023] trunk/WebCore</title>
</head>
<body>
<style type="text/css"><!--
#msg dl.meta { border: 1px #006 solid; background: #369; padding: 6px; color: #fff; }
#msg dl.meta dt { float: left; width: 6em; font-weight: bold; }
#msg dt:after { content:':';}
#msg dl, #msg dt, #msg ul, #msg li, #header, #footer, #logmsg { font-family: verdana,arial,helvetica,sans-serif; font-size: 10pt; }
#msg dl a { font-weight: bold}
#msg dl a:link { color:#fc3; }
#msg dl a:active { color:#ff0; }
#msg dl a:visited { color:#cc6; }
h3 { font-family: verdana,arial,helvetica,sans-serif; font-size: 10pt; font-weight: bold; }
#msg pre { overflow: auto; background: #ffc; border: 1px #fa0 solid; padding: 6px; }
#logmsg { background: #ffc; border: 1px #fa0 solid; padding: 1em 1em 0 1em; }
#logmsg p, #logmsg pre, #logmsg blockquote { margin: 0 0 1em 0; }
#logmsg p, #logmsg li, #logmsg dt, #logmsg dd { line-height: 14pt; }
#logmsg h1, #logmsg h2, #logmsg h3, #logmsg h4, #logmsg h5, #logmsg h6 { margin: .5em 0; }
#logmsg h1:first-child, #logmsg h2:first-child, #logmsg h3:first-child, #logmsg h4:first-child, #logmsg h5:first-child, #logmsg h6:first-child { margin-top: 0; }
#logmsg ul, #logmsg ol { padding: 0; list-style-position: inside; margin: 0 0 0 1em; }
#logmsg ul { text-indent: -1em; padding-left: 1em; }
#logmsg ol { text-indent: -1.5em; padding-left: 1.5em; }
#logmsg > ul, #logmsg > ol { margin: 0 0 1em 0; }
#logmsg pre { background: #eee; padding: 1em; }
#logmsg blockquote { border: 1px solid #fa0; border-left-width: 10px; padding: 1em 1em 0 1em; background: white;}
#logmsg dl { margin: 0; }
#logmsg dt { font-weight: bold; }
#logmsg dd { margin: 0; padding: 0 0 0.5em 0; }
#logmsg dd:before { content:'\00bb';}
#logmsg table { border-spacing: 0px; border-collapse: collapse; border-top: 4px solid #fa0; border-bottom: 1px solid #fa0; background: #fff; }
#logmsg table th { text-align: left; font-weight: normal; padding: 0.2em 0.5em; border-top: 1px dotted #fa0; }
#logmsg table td { text-align: right; border-top: 1px dotted #fa0; padding: 0.2em 0.5em; }
#logmsg table thead th { text-align: center; border-bottom: 1px solid #fa0; }
#logmsg table th.Corner { text-align: left; }
#logmsg hr { border: none 0; border-top: 2px dashed #fa0; height: 1px; }
#header, #footer { color: #fff; background: #636; border: 1px #300 solid; padding: 6px; }
#patch { width: 100%; }
#patch h4 {font-family: verdana,arial,helvetica,sans-serif;font-size:10pt;padding:8px;background:#369;color:#fff;margin:0;}
#patch .propset h4, #patch .binary h4 {margin:0;}
#patch pre {padding:0;line-height:1.2em;margin:0;}
#patch .diff {width:100%;background:#eee;padding: 0 0 10px 0;overflow:auto;}
#patch .propset .diff, #patch .binary .diff {padding:10px 0;}
#patch span {display:block;padding:0 10px;}
#patch .modfile, #patch .addfile, #patch .delfile, #patch .propset, #patch .binary, #patch .copfile {border:1px solid #ccc;margin:10px 0;}
#patch ins {background:#dfd;text-decoration:none;display:block;padding:0 10px;}
#patch del {background:#fdd;text-decoration:none;display:block;padding:0 10px;}
#patch .lines, .info {color:#888;background:#fff;}
--></style>
<div id="msg">
<dl class="meta">
<dt>Revision</dt> <dd><a href="http://trac.webkit.org/projects/webkit/changeset/42023">42023</a></dd>
<dt>Author</dt> <dd>eric@webkit.org</dd>
<dt>Date</dt> <dd>2009-03-26 17:28:47 -0700 (Thu, 26 Mar 2009)</dd>
</dl>
<h3>Log Message</h3>
<pre> No additional review, committing previously reviewed files for build fix only.
Add files I missed when committing Jungshik's patch in r42022.
https://bugs.webkit.org/show_bug.cgi?id=16482
* icu/unicode/ucsdet.h: Added.
* platform/text/TextEncodingDetector.h: Added.
* platform/text/TextEncodingDetectorICU.cpp: Added.
(WebCore::detectTextEncoding):
* platform/text/TextEncodingDetectorNone.cpp: Added.
(WebCore::detectTextEncoding):</pre>
<h3>Modified Paths</h3>
<ul>
<li><a href="#trunkWebCoreChangeLog">trunk/WebCore/ChangeLog</a></li>
</ul>
<h3>Added Paths</h3>
<ul>
<li><a href="#trunkWebCoreicuunicodeucsdeth">trunk/WebCore/icu/unicode/ucsdet.h</a></li>
<li><a href="#trunkWebCoreplatformtextTextEncodingDetectorh">trunk/WebCore/platform/text/TextEncodingDetector.h</a></li>
<li><a href="#trunkWebCoreplatformtextTextEncodingDetectorICUcpp">trunk/WebCore/platform/text/TextEncodingDetectorICU.cpp</a></li>
<li><a href="#trunkWebCoreplatformtextTextEncodingDetectorNonecpp">trunk/WebCore/platform/text/TextEncodingDetectorNone.cpp</a></li>
</ul>
</div>
<div id="patch">
<h3>Diff</h3>
<a id="trunkWebCoreChangeLog"></a>
<div class="modfile"><h4>Modified: trunk/WebCore/ChangeLog (42022 => 42023)</h4>
<pre class="diff"><span>
<span class="info">--- trunk/WebCore/ChangeLog 2009-03-27 00:01:58 UTC (rev 42022)
+++ trunk/WebCore/ChangeLog 2009-03-27 00:28:47 UTC (rev 42023)
</span><span class="lines">@@ -1,3 +1,17 @@
</span><ins>+2009-03-26 Eric Seidel <eric@webkit.org>
+
+ No additional review, committing previously reviewed files for build fix only.
+
+ Add files I missed when committing Jungshik's patch in r42022.
+ https://bugs.webkit.org/show_bug.cgi?id=16482
+
+ * icu/unicode/ucsdet.h: Added.
+ * platform/text/TextEncodingDetector.h: Added.
+ * platform/text/TextEncodingDetectorICU.cpp: Added.
+ (WebCore::detectTextEncoding):
+ * platform/text/TextEncodingDetectorNone.cpp: Added.
+ (WebCore::detectTextEncoding):
+
</ins><span class="cx"> 2009-03-26 Jungshik Shin <jshin@chromium.org>
</span><span class="cx">
</span><span class="cx"> Reviewed by Alexey Proskuryakov.
</span></span></pre></div>
<a id="trunkWebCoreicuunicodeucsdeth"></a>
<div class="addfile"><h4>Added: trunk/WebCore/icu/unicode/ucsdet.h (0 => 42023)</h4>
<pre class="diff"><span>
<span class="info">--- trunk/WebCore/icu/unicode/ucsdet.h (rev 0)
+++ trunk/WebCore/icu/unicode/ucsdet.h 2009-03-27 00:28:47 UTC (rev 42023)
</span><span class="lines">@@ -0,0 +1,350 @@
</span><ins>+/*
+ **********************************************************************
+ * Copyright (C) 2005-2006, International Business Machines
+ * Corporation and others. All Rights Reserved.
+ **********************************************************************
+ * file name: ucsdet.h
+ * encoding: US-ASCII
+ * indentation:4
+ *
+ * created on: 2005Aug04
+ * created by: Andy Heninger
+ *
+ * ICU Character Set Detection, API for C
+ *
+ * Draft version 18 Oct 2005
+ *
+ */
+
+#ifndef __UCSDET_H
+#define __UCSDET_H
+
+#include "unicode/utypes.h"
+
+#if !UCONFIG_NO_CONVERSION
+#include "unicode/uenum.h"
+
+/**
+ * \file
+ * \brief C API: Charset Detection API
+ *
+ * This API provides a facility for detecting the
+ * charset or encoding of character data in an unknown text format.
+ * The input data can be from an array of bytes.
+ * <p>
+ * Character set detection is at best an imprecise operation. The detection
+ * process will attempt to identify the charset that best matches the characteristics
+ * of the byte data, but the process is partly statistical in nature, and
+ * the results can not be guaranteed to always be correct.
+ * <p>
+ * For best accuracy in charset detection, the input data should be primarily
+ * in a single language, and a minimum of a few hundred bytes worth of plain text
+ * in the language are needed. The detection process will attempt to
+ * ignore html or xml style markup that could otherwise obscure the content.
+ */
+
+
+struct UCharsetDetector;
+/**
+ * Structure representing a charset detector
+ * @draft ICU 3.6
+ */
+typedef struct UCharsetDetector UCharsetDetector;
+
+struct UCharsetMatch;
+/**
+ * Opaque structure representing a match that was identified
+ * from a charset detection operation.
+ * @draft ICU 3.6
+ */
+typedef struct UCharsetMatch UCharsetMatch;
+
+/**
+ * Open a charset detector.
+ *
+ * @param status Any error conditions occurring during the open
+ * operation are reported back in this variable.
+ * @return the newly opened charset detector.
+ * @draft ICU 3.6
+ */
+U_DRAFT UCharsetDetector * U_EXPORT2
+ucsdet_open(UErrorCode *status);
+
+/**
+ * Close a charset detector. All storage and any other resources
+ * owned by this charset detector will be released. Failure to
+ * close a charset detector when finished with it can result in
+ * memory leaks in the application.
+ *
+ * @param ucsd The charset detector to be closed.
+ * @draft ICU 3.6
+ */
+U_DRAFT void U_EXPORT2
+ucsdet_close(UCharsetDetector *ucsd);
+
+/**
+ * Set the input byte data whose charset is to be detected.
+ *
+ * Ownership of the input text byte array remains with the caller.
+ * The input string must not be altered or deleted until the charset
+ * detector is either closed or reset to refer to different input text.
+ *
+ * @param ucsd the charset detector to be used.
+ * @param textIn the input text of unknown encoding.
+ * @param len the length of the input text, or -1 if the text
+ * is NUL terminated.
+ * @param status any error conditions are reported back in this variable.
+ *
+ * @draft ICU 3.6
+ */
+U_DRAFT void U_EXPORT2
+ucsdet_setText(UCharsetDetector *ucsd, const char *textIn, int32_t len, UErrorCode *status);
+
+
+/** Set the declared encoding for charset detection.
+ * The declared encoding of an input text is an encoding obtained
+ * by the user from an http header or xml declaration or similar source that
+ * can be provided as an additional hint to the charset detector.
+ *
+ * How and whether the declared encoding will be used during the
+ * detection process is TBD.
+ *
+ * @param ucsd the charset detector to be used.
+ * @param encoding an encoding for the current data obtained from
+ * a header or declaration or other source outside
+ * of the byte data itself.
+ * @param length the length of the encoding name, or -1 if the name string
+ * is NUL terminated.
+ * @param status any error conditions are reported back in this variable.
+ *
+ * @draft ICU 3.6
+ */
+U_DRAFT void U_EXPORT2
+ucsdet_setDeclaredEncoding(UCharsetDetector *ucsd, const char *encoding, int32_t length, UErrorCode *status);
+
+
+/**
+ * Return the charset that best matches the supplied input data.
+ *
+ * Note though, that because the detection
+ * only looks at the start of the input data,
+ * there is a possibility that the returned charset will fail to handle
+ * the full set of input data.
+ * <p>
+ * The returned UCharsetMatch object is owned by the UCharsetDetector.
+ * It will remain valid until the detector input is reset, or until
+ * the detector is closed.
+ * <p>
+ * The function will fail if
+ * <ul>
+ * <li>no charset appears to match the data.</li>
+ * <li>no input text has been provided</li>
+ * </ul>
+ *
+ * @param ucsd the charset detector to be used.
+ * @param status any error conditions are reported back in this variable.
+ * @return a UCharsetMatch representing the best matching charset,
+ * or NULL if no charset matches the byte data.
+ *
+ * @draft ICU 3.6
+ */
+U_DRAFT const UCharsetMatch * U_EXPORT2
+ucsdet_detect(UCharsetDetector *ucsd, UErrorCode *status);
+
+
+/**
+ * Find all charset matches that appear to be consistent with the input,
+ * returning an array of results. The results are ordered with the
+ * best quality match first.
+ *
+ * Because the detection only looks at a limited amount of the
+ * input byte data, some of the returned charsets may fail to handle
+ * all of the input data.
+ * <p>
+ * The returned UCharsetMatch objects are owned by the UCharsetDetector.
+ * They will remain valid until the detector is closed or modified.
+ *
+ * <p>
+ * Return an error if
+ * <ul>
+ * <li>no charsets appear to match the input data.</li>
+ * <li>no input text has been provided</li>
+ * </ul>
+ *
+ * @param ucsd the charset detector to be used.
+ * @param matchesFound pointer to a variable that will be set to the
+ * number of charsets identified that are consistent with
+ * the input data. Output only.
+ * @param status any error conditions are reported back in this variable.
+ * @return A pointer to an array of pointers to UCharsetMatch objects.
+ * This array, and the UCharsetMatch instances to which it refers,
+ * are owned by the UCharsetDetector, and will remain valid until
+ * the detector is closed or modified.
+ * @draft ICU 3.4
+ */
+U_DRAFT const UCharsetMatch ** U_EXPORT2
+ucsdet_detectAll(UCharsetDetector *ucsd, int32_t *matchesFound, UErrorCode *status);
+
+
+
+/**
+ * Get the name of the charset represented by a UCharsetMatch.
+ *
+ * The storage for the returned name string is owned by the
+ * UCharsetMatch, and will remain valid while the UCharsetMatch
+ * is valid.
+ *
+ * The name returned is suitable for use with the ICU conversion APIs.
+ *
+ * @param ucsm The charset match object.
+ * @param status Any error conditions are reported back in this variable.
+ * @return The name of the matching charset.
+ *
+ * @draft ICU 3.6
+ */
+U_DRAFT const char * U_EXPORT2
+ucsdet_getName(const UCharsetMatch *ucsm, UErrorCode *status);
+
+/**
+ * Get a confidence number for the quality of the match of the byte
+ * data with the charset. Confidence numbers range from zero to 100,
+ * with 100 representing complete confidence and zero representing
+ * no confidence.
+ *
+ * The confidence values are somewhat arbitrary. They define
+ * an ordering within the results for any single detection operation
+ * but are not generally comparable between the results for different input.
+ *
+ * A confidence value of ten does have a general meaning - it is used
+ * for charsets that can represent the input data, but for which there
+ * is no other indication that suggests that the charset is the correct one.
+ * Pure 7 bit ASCII data, for example, is compatible with a
+ * great many charsets, most of which will appear as possible matches
+ * with a confidence of 10.
+ *
+ * @param ucsm The charset match object.
+ * @param status Any error conditions are reported back in this variable.
+ * @return A confidence number for the charset match.
+ *
+ * @draft ICU 3.6
+ */
+U_DRAFT int32_t U_EXPORT2
+ucsdet_getConfidence(const UCharsetMatch *ucsm, UErrorCode *status);
+
+/**
+ * Get the RFC 3066 code for the language of the input data.
+ *
+ * The Charset Detection service is intended primarily for detecting
+ * charsets, not language. For some, but not all, charsets, a language is
+ * identified as a byproduct of the detection process, and that is what
+ * is returned by this function.
+ *
+ * CAUTION:
+ * 1. Language information is not available for input data encoded in
+ * all charsets. In particular, no language is identified
+ * for UTF-8 input data.
+ *
+ * 2. Closely related languages may sometimes be confused.
+ *
+ * If more accurate language detection is required, a linguistic
+ * analysis package should be used.
+ *
+ * The storage for the returned name string is owned by the
+ * UCharsetMatch, and will remain valid while the UCharsetMatch
+ * is valid.
+ *
+ * @param ucsm The charset match object.
+ * @param status Any error conditions are reported back in this variable.
+ * @return The RFC 3066 code for the language of the input data, or
+ * an empty string if the language could not be determined.
+ *
+ * @draft ICU 3.6
+ */
+U_DRAFT const char * U_EXPORT2
+ucsdet_getLanguage(const UCharsetMatch *ucsm, UErrorCode *status);
+
+
+/**
+ * Get the entire input text as a UChar string, placing it into
+ * a caller-supplied buffer. A terminating
+ * NUL character will be appended to the buffer if space is available.
+ *
+ * The number of UChars in the output string, not including the terminating
+ * NUL, is returned.
+ *
+ * If the supplied buffer is smaller than required to hold the output,
+ * the contents of the buffer are undefined. The full output string length
+ * (in UChars) is returned as always, and can be used to allocate a buffer
+ * of the correct size.
+ *
+ *
+ * @param ucsm The charset match object.
+ * @param buf A UChar buffer to be filled with the converted text data.
+ * @param cap The capacity of the buffer in UChars.
+ * @param status Any error conditions are reported back in this variable.
+ * @return The number of UChars in the output string.
+ *
+ * @draft ICU 3.6
+ */
+U_DRAFT int32_t U_EXPORT2
+ucsdet_getUChars(const UCharsetMatch *ucsm,
+ UChar *buf, int32_t cap, UErrorCode *status);
+
+
+
+/**
+ * Get an iterator over the set of all detectable charsets -
+ * over the charsets that are known to the charset detection
+ * service.
+ *
+ * The returned UEnumeration provides access to the names of
+ * the charsets.
+ *
+ * The state of the Charset detector that is passed in does not
+ * affect the result of this function, but requiring a valid, open
+ * charset detector as a parameter ensures that the charset detection
+ * service has been safely initialized and that the required detection
+ * data is available.
+ *
+ * @param ucsd a Charset detector.
+ * @param status Any error conditions are reported back in this variable.
+ * @return an iterator providing access to the detectable charset names.
+ * @draft ICU 3.6
+ */
+
+U_DRAFT UEnumeration * U_EXPORT2
+ucsdet_getAllDetectableCharsets(const UCharsetDetector *ucsd, UErrorCode *status);
+
+
+/**
+ * Test whether input filtering is enabled for this charset detector.
+ * Input filtering removes text that appears to be HTML or xml
+ * markup from the input before applying the code page detection
+ * heuristics.
+ *
+ * @param ucsd The charset detector to check.
+ * @return TRUE if filtering is enabled.
+ * @draft ICU 3.4
+ */
+U_DRAFT UBool U_EXPORT2
+ucsdet_isInputFilterEnabled(const UCharsetDetector *ucsd);
+
+
+/**
+ * Enable filtering of input text. If filtering is enabled,
+ * text within angle brackets ("<" and ">") will be removed
+ * before detection, which will remove most HTML or xml markup.
+ *
+ * @param ucsd the charset detector to be modified.
+ * @param filter <code>true</code> to enable input text filtering.
+ * @return The previous setting.
+ *
+ * @draft ICU 3.6
+ */
+U_DRAFT UBool U_EXPORT2
+ucsdet_enableInputFilter(UCharsetDetector *ucsd, UBool filter);
+
+#endif
+#endif /* __UCSDET_H */
+
+
</ins></span></pre></div>
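The header above says a detector that is opened must be closed, or the application may leak memory. The ICU-based implementation later in this change returns early after ucsdet_setText fails without calling ucsdet_close, so one way to make that rule hard to miss is a scope guard. The sketch below is illustrative only: FakeDetector, fakeOpen, fakeClose and detectWithGuard are hypothetical stand-ins for UCharsetDetector, ucsdet_open and ucsdet_close so that it compiles without ICU, and it is not part of the committed patch.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for UCharsetDetector, ucsdet_open() and
// ucsdet_close(); they only count how many detectors are still open.
struct FakeDetector { };
static int g_openDetectors = 0;

FakeDetector* fakeOpen()
{
    ++g_openDetectors;
    return new FakeDetector;
}

void fakeClose(FakeDetector* detector)
{
    assert(detector != nullptr);
    --g_openDetectors;
    delete detector;
}

// Deleter that closes the detector whenever the owning pointer goes away.
struct DetectorCloser {
    void operator()(FakeDetector* detector) const { fakeClose(detector); }
};
using DetectorPtr = std::unique_ptr<FakeDetector, DetectorCloser>;

// Returns false on the simulated failure path. The detector is closed on
// every path because the unique_ptr destructor runs the deleter.
bool detectWithGuard(bool failEarly)
{
    DetectorPtr detector(fakeOpen());
    if (failEarly)
        return false; // early return: the deleter still runs
    return true;
}
```

With real ICU the same shape would presumably wrap the pointer returned by ucsdet_open and call ucsdet_close in the deleter; that mapping is an assumption, not something this change does.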
<a id="trunkWebCoreplatformtextTextEncodingDetectorh"></a>
<div class="addfile"><h4>Added: trunk/WebCore/platform/text/TextEncodingDetector.h (0 => 42023)</h4>
<pre class="diff"><span>
<span class="info">--- trunk/WebCore/platform/text/TextEncodingDetector.h (rev 0)
+++ trunk/WebCore/platform/text/TextEncodingDetector.h 2009-03-27 00:28:47 UTC (rev 42023)
</span><span class="lines">@@ -0,0 +1,48 @@
</span><ins>+/*
+ * Copyright (C) 2009 Google Inc. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Google Inc. nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef TextEncodingDetector_h
+#define TextEncodingDetector_h
+
+namespace WebCore {
+
+ class TextEncoding;
+
+ // Given a sequence of bytes in |data| of length |len| and an optional
+ // hintEncodingName, detect the most likely character encoding.
+ // The way hintEncodingName is used is up to an implementation.
+ // Currently, the only caller sets it to the parent frame encoding.
+ bool detectTextEncoding(const char* data, size_t len,
+ const char* hintEncodingName,
+ TextEncoding* detectedEncoding);
+
+} // namespace WebCore
+
+#endif
</ins></span></pre></div>
<a id="trunkWebCoreplatformtextTextEncodingDetectorICUcpp"></a>
<div class="addfile"><h4>Added: trunk/WebCore/platform/text/TextEncodingDetectorICU.cpp (0 => 42023)</h4>
<pre class="diff"><span>
<span class="info">--- trunk/WebCore/platform/text/TextEncodingDetectorICU.cpp (rev 0)
+++ trunk/WebCore/platform/text/TextEncodingDetectorICU.cpp 2009-03-27 00:28:47 UTC (rev 42023)
</span><span class="lines">@@ -0,0 +1,127 @@
</span><ins>+/*
+ * Copyright (C) 2008, 2009 Google Inc. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Google Inc. nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "config.h"
+#include "TextEncodingDetector.h"
+
+#include "TextEncoding.h"
+#ifndef BUILDING_ON_TIGER
+#include "unicode/ucnv.h"
+#include "unicode/ucsdet.h"
+#endif
+
+namespace WebCore {
+
+bool detectTextEncoding(const char* data, size_t len,
+ const char* hintEncodingName,
+ TextEncoding* detectedEncoding)
+{
+ *detectedEncoding = TextEncoding();
+#ifdef BUILDING_ON_TIGER
+ // Tiger came with ICU 3.2 and does not have the encoding detector.
+ UNUSED_PARAM(data);
+ UNUSED_PARAM(len);
+ UNUSED_PARAM(hintEncodingName);
+ return false;
+#else
+ int matchesCount = 0;
+ UErrorCode status = U_ZERO_ERROR;
+ UCharsetDetector* detector = ucsdet_open(&status);
+ if (U_FAILURE(status))
+ return false;
+ ucsdet_enableInputFilter(detector, true);
+ ucsdet_setText(detector, data, static_cast<int32_t>(len), &status);
+ if (U_FAILURE(status))
+ return false;
+
+ // FIXME: A few things we can do other than improving
+ // the ICU detector itself.
+ // 1. Use ucsdet_detectAll and pick the most likely one given
+ // "the context" (parent-encoding, referrer encoding, etc).
+ // 2. 'Emulate' Firefox/IE's non-Universal detectors (e.g.
+ // Chinese, Japanese, Russian, Korean and Hebrew) by picking the
+ // encoding with the highest confidence among the detector-specific
+ // limited set of candidate encodings.
+ // Below is a partial implementation of the first part of what's outlined
+ // above.
+ const UCharsetMatch** matches = ucsdet_detectAll(detector, &matchesCount, &status);
+ if (U_FAILURE(status)) {
+ ucsdet_close(detector);
+ return false;
+ }
+
+ const char* encoding = 0;
+ if (hintEncodingName) {
+ TextEncoding hintEncoding(hintEncodingName);
+ // 10 is the minimum confidence value consistent with the codepoint
+ // allocation in a given encoding. The size of a chunk passed to
+ // us varies even for the same html file (apparently depending on
+ // the network load). When we're given a rather short chunk, we
+ // don't have a sufficiently reliable signal other than the fact that
+ // the chunk is consistent with a set of encodings. So, instead of
+ // setting an arbitrary threshold, we have to scan all the encodings
+ // consistent with the data.
+ const int32_t kThreshold = 10;
+ for (int i = 0; i < matchesCount; ++i) {
+ int32_t confidence = ucsdet_getConfidence(matches[i], &status);
+ if (U_FAILURE(status)) {
+ status = U_ZERO_ERROR;
+ continue;
+ }
+ if (confidence < kThreshold)
+ break;
+ const char* matchEncoding = ucsdet_getName(matches[i], &status);
+ if (U_FAILURE(status)) {
+ status = U_ZERO_ERROR;
+ continue;
+ }
+ if (TextEncoding(matchEncoding) == hintEncoding) {
+ encoding = hintEncodingName;
+ break;
+ }
+ }
+ }
+ // If no match is found so far, just pick the top match.
+ // This can happen, say, when a parent frame in EUC-JP refers to
+ // a child frame in Shift_JIS and both frames do NOT specify the encoding
+ // making us resort to auto-detection (when it IS turned on).
+ if (!encoding && matchesCount > 0)
+ encoding = ucsdet_getName(matches[0], &status);
+ if (U_SUCCESS(status)) {
+ *detectedEncoding = TextEncoding(encoding);
+ ucsdet_close(detector);
+ return true;
+ }
+ ucsdet_close(detector);
+ return false;
+#endif
+}
+
+}
</ins></span></pre></div>
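The selection step in detectTextEncoding above can be read as: walk the best-first match list, accept the hint encoding if it appears before confidence drops below 10, and otherwise fall back to the top match. The following is a minimal sketch of that logic only, written without ICU. Candidate and chooseEncoding are hypothetical names, and plain string equality stands in for the patch's TextEncoding comparison, which may also treat aliases as equal.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one match; the patch reads these values through
// ucsdet_getName() and ucsdet_getConfidence().
struct Candidate {
    std::string name;
    int confidence; // 0..100, as documented in ucsdet.h
};

// Picks an encoding name from candidates sorted best-first.
// Returns an empty string when there are no candidates.
std::string chooseEncoding(const std::vector<Candidate>& candidates,
                           const std::string& hint,
                           int threshold = 10)
{
    assert(threshold >= 0);
    if (!hint.empty()) {
        for (const Candidate& candidate : candidates) {
            if (candidate.confidence < threshold)
                break; // sorted best-first, so nothing later can qualify
            if (candidate.name == hint)
                return hint; // the hint is consistent with the data
        }
    }
    // No usable hint: fall back to the top match, if there is one.
    return candidates.empty() ? std::string() : candidates.front().name;
}
```

For example, with candidates Shift_JIS (40), EUC-JP (30) and windows-1252 (5), a hint of EUC-JP is accepted, while a hint of windows-1252 falls below the threshold and the function falls back to Shift_JIS. Unlike this sketch, the patch does not special-case an empty match list.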
<a id="trunkWebCoreplatformtextTextEncodingDetectorNonecpp"></a>
<div class="addfile"><h4>Added: trunk/WebCore/platform/text/TextEncodingDetectorNone.cpp (0 => 42023)</h4>
<pre class="diff"><span>
<span class="info">--- trunk/WebCore/platform/text/TextEncodingDetectorNone.cpp (rev 0)
+++ trunk/WebCore/platform/text/TextEncodingDetectorNone.cpp 2009-03-27 00:28:47 UTC (rev 42023)
</span><span class="lines">@@ -0,0 +1,50 @@
</span><ins>+/*
+ * Copyright (C) 2009 Google Inc. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Google Inc. nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "config.h"
+#include "TextEncodingDetector.h"
+
+#include "TextEncoding.h"
+
+namespace WebCore {
+
+bool detectTextEncoding(const char* data, size_t len,
+ const char* hintEncodingName,
+ TextEncoding* detectedEncoding)
+{
+ UNUSED_PARAM(data);
+ UNUSED_PARAM(len);
+ UNUSED_PARAM(hintEncodingName);
+
+ *detectedEncoding = TextEncoding();
+ return false;
+}
+
+}
</ins></span></pre>
</div>
</div>
</body>
</html>
_______________________________________________
webkit-changes mailing list
webkit-changes@lists.webkit.org
http://lists.webkit.org/mailman/listinfo.cgi/webkit-changes