List:       kde-bugs-dist
Subject:    Bug#809: <html> with leading spaces (#809)
From:       David Faure <faure () kde ! org>
Date:       1999-03-10 13:09:25

Fix for #809 explained below, for those interested.


On Tue, Mar 09, 1999 at 05:27:03PM -0500, Dawit Alemayehu wrote:
> On Tue, 09 Mar 1999, David Faure wrote:
> >The HTML standard says, AFAIK, that the document has to start with <HTML>.
> >Remove the leading spaces, and the document will be displayed in kfm.
> 
> David, this is not correct.  The HTML 4.0 spec explicitly states that both
> <HTML> & </HTML> tags are optional.  In fact, so are <HEAD></HEAD> and
> <BODY></BODY> tags.  And I think this was also true for HTML 3.2 spec., but I
> could be wrong about the 3.2 (Wilbur) spec's.  

Ahem. Ok. I was wrong.

I had a look at why kmimemagic doesn't recognize the given file as HTML.
kmimemagic does two things:
- first it uses the $KDEDIR/share/mimelnk/magic file - but this works for fixed
offsets only (not for <HTML> with spaces before it). The real 'file' command
does the same.
- then it uses an internal table, in case the magic file was not found or
nothing matched. This method simply looks for keywords:
<html> and <head> point towards HTML; "the" points towards English.
The real 'file' command simply takes the first keyword found, but our
kmimemagic looks at all keywords and assigns weights (I don't know if this
is because Torben implemented it that way or because the old 'file'
command he took the code from was like that...)
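
The two stages above can be sketched roughly like this. It is only an
illustration in Python, not the kmimemagic code: the rule table, the keyword
lists and all function names are made up for the example.

```python
# Hypothetical sketch of the two-stage lookup; none of these tables or names
# come from kmimemagic itself.

# Stage 1 rules: (offset, expected bytes, MIME type). Fixed offsets only.
MAGIC_RULES = [
    (0, b"<HTML>", "text/html"),
    (0, b"<html>", "text/html"),
]

# Stage 2 table: keywords that hint at a type.
KEYWORDS = {
    "text/html": [b"<html>", b"<head>"],
    "text/english": [b"the"],
}


def match_magic(data):
    """Stage 1: compare bytes at fixed offsets, as a magic file does."""
    for offset, expected, mime in MAGIC_RULES:
        if data[offset:offset + len(expected)] == expected:
            return mime
    return None


def find_keywords(data):
    """Stage 2: list every (type, keyword) hit among whitespace-separated tokens."""
    hits = []
    for token in data.lower().split():
        for mime, words in KEYWORDS.items():
            if token in words:
                hits.append((mime, token))
    return hits


def detect(data):
    """Try the fixed-offset rules first, then fall back to keyword hits."""
    mime = match_magic(data)
    if mime is not None:
        return ("magic", mime)
    return ("keywords", find_keywords(data))
```

With leading spaces the fixed-offset rule no longer matches, so
detect(b"  <html>\n<head>\nthe title") falls through to the keyword stage,
which is where the weights come into play.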

Anyway, with the example file, it resulted in :

Adding type 1 as token <html> was found
Adding type 1 as token <head> was found               (1 == text/html)
Adding type 32 as token the was found
Adding type 32 as token the was found
Adding type 32 as token the was found                 (32 == text/english)

With the current weights, this resulted in English text being stronger than HTML.

I easily fixed the problem for the particular file given in the bug report,
by increasing the weight of HTML (and also by fixing the number of words for
HTML, which was too high - the number of matches is divided by the number of
words describing the type).
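
To make the arithmetic concrete, here is a small Python sketch of that kind of
scoring. The only part taken from the description above is "matches divided by
the number of words describing the type"; multiplying by the weight is my
assumption about how the two combine, and every number in the two tables is
invented, not the real kmimemagic values.

```python
# Hypothetical scoring sketch; the weights and word counts are invented.

def score(hits, weight, n_words):
    """Matches divided by the number of words for the type, scaled by a weight."""
    return weight * hits / n_words


def classify(hit_counts, table):
    """Return (best type, all scores) for per-type hit counts.

    table maps a MIME type to (weight, number of words describing the type).
    """
    scores = {
        mime: score(hit_counts.get(mime, 0), weight, n_words)
        for mime, (weight, n_words) in table.items()
    }
    return max(scores, key=scores.get), scores


# Hit counts from the trace above: 2 HTML tokens, 3 English tokens.
HITS = {"text/html": 2, "text/english": 3}

# Invented "before" table: HTML word count too high, low weight.
BEFORE = {"text/html": (1.0, 15), "text/english": (1.0, 5)}

# Invented "after" table: corrected word count, higher weight for HTML.
AFTER = {"text/html": (3.0, 3), "text/english": (1.0, 5)}

print(classify(HITS, BEFORE)[0])
print(classify(HITS, AFTER)[0])
```

With the invented "before" numbers English wins (0.6 against about 0.13);
with the "after" numbers HTML wins (2.0 against 0.6).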

But you might wonder: why use weights at all, and not simple first-keyword
matching? Well, a file containing only the line:

A file containing <body > is not necessarily a HTML file... 

is a text file, right? 'file' says it's HTML, because the first keyword
found is <body > :). Put 'The' as the first word, and then it's English text.

On the other hand, 'file' says that the following is English text :

  <html>
<body>
This document briefly describes the features of the Pmw megawidget
framework and how to use the megawidgets
</body>
</html>

... but that's a bug in file (it looks for "<body ", not for "<body>").
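
For comparison, first-keyword matching can be sketched as below (Python, for
illustration only). The keyword table is invented and contains just the two
keywords mentioned here; the real table in 'file' is larger and I am not
quoting it.

```python
import re

# Hypothetical, minimal keyword table: only the two keywords discussed above.
FIRST_KEYWORDS = [
    (re.compile(r"<body "), "text/html"),       # note the trailing space
    (re.compile(r"\bthe\b"), "text/english"),
]


def first_keyword_type(text):
    """Return the type of whichever keyword occurs earliest in the text."""
    lowered = text.lower()
    best = None  # (position, type) of the earliest hit so far
    for pattern, mime in FIRST_KEYWORDS:
        match = pattern.search(lowered)
        if match and (best is None or match.start() < best[0]):
            best = (match.start(), mime)
    return best[1] if best else None


one_liner = "A file containing <body > is not necessarily a HTML file..."
html_doc = (
    "  <html>\n<body>\n"
    "This document briefly describes the features of the Pmw megawidget\n"
    "framework and how to use the megawidgets\n"
    "</body>\n</html>\n"
)

print(first_keyword_type(one_liner))            # text/html
print(first_keyword_type("The " + one_liner))   # text/english
print(first_keyword_type(html_doc))             # text/english
```

The last case shows the same misfire: "<body>" never matches the "<body "
keyword, so the first keyword found is "the".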

Another reason for weighting in kfm is the following example :

The file containing the tag <HTML> is not necessarily the HTML file...
Well, another the and we're done.

This one is now recognized as English text by kfm, because it contains a
lot of 'the' and very few html tags... Hum. A big HTML page could look
exactly the same.

But the file

A file containing <HTML> is not necessarily a HTML file... 

is recognized as HTML by kfm, because it doesn't have the word 'the' in it. We
could add other words that look like English text... but then those words
would appear in HTML files, and here we go again :)

What is really missing is the ability to accept only spaces before
<HTML>, to tell that case apart from real text before <HTML>... But that's
another story... the magic file syntax doesn't allow regexps...
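
Just to show what such a check would look like if regexps were available, a
Python sketch follows. The pattern and the function name are mine; nothing
like this exists in the magic file syntax discussed here.

```python
import re

# Hypothetical check: optional whitespace, then an <html> tag, at the very
# start of the data. Case-insensitive, works on raw bytes.
LEADING_HTML = re.compile(rb"\s*<html[\s>]", re.IGNORECASE)


def starts_with_html_tag(data):
    """True if only whitespace precedes the opening <html> tag."""
    return LEADING_HTML.match(data) is not None


print(starts_with_html_tag(b"  <html>\n<body>"))                      # True
print(starts_with_html_tag(b"A file containing <HTML> is not ..."))   # False
```

This accepts the file from the bug report but rejects the one-line text file,
because real text comes before the tag there.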

That's all. Sorry for going on so long!
With the new numbers, most cases should be ok now, but as you can see
above, not all of them :)
The real fix belongs in the 'file' command, which should allow regexps or
at least skipping of leading spaces.

Does anybody want to contact the author of 'file'? :)

-- 
David FAURE
david.faure@insa-lyon.fr, faure@kde.org
http://www.insa-lyon.fr/People/AEDI/dfaure/index.html 
KDE, Making The Future of Computing Available Today
