List:       kfm-devel
Subject:    Re: Bug#809: <html> with leading spaces (#809)
From:       Stephan Kulow <coolo () kde ! org>
Date:       1999-03-10 13:23:20

David Faure wrote:
> 
> Fix for #809 explained below, for those interested.
> 
> On Tue, Mar 09, 1999 at 05:27:03PM -0500, Dawit Alemayehu wrote:
> > On Tue, 09 Mar 1999, David Faure wrote:
> > >The HTML standard says, AFAIK, that the document has to start with <HTML>.
> > >Remove the leading spaces, and the document will be displayed in kfm.
> >
> > David, this is not correct.  The HTML 4.0 spec explicitly states that both
> > <HTML> & </HTML> tags are optional.  In fact, so are <HEAD></HEAD> and
> > <BODY></BODY> tags.  And I think this was also true for HTML 3.2 spec., but I
> > could be wrong about the 3.2 (Wilbur) spec's.
> 
> Ahem. Ok. I was wrong.
> 
> I had a look at why kmimemagic doesn't identify the given file as HTML.
> Mimemagic does two things :
> - first, it uses the $KDEDIR/share/mimelnk/magic file - but this works for
> fixed offsets only (not for <HTML> with spaces before it). The real 'file'
> command works the same way.
> - then it uses an internal table, in case that file was not found or nothing
> matched. This method simply looks for keywords.
> <html> and <head> point towards HTML ; "the" points towards English.
> The real 'file' command simply takes the first keyword found, but our
> kmimemagic looks at all keywords and assigns weights (I don't
> know if this is because Torben implemented it or because the old 'file'
> command he took the code from was like that...)
> 
> Anyway, with the example file, it resulted in :
> 
> Adding type 1 as token <html> was found
> Adding type 1 as token <head> was found               (1 == text/html)
> Adding type 32 as token the was found
> Adding type 32 as token the was found
> Adding type 32 as token the was found                 (32 == text/english)
> 
> With the current weight, this resulted in text being stronger than HTML.
> 
> I easily fixed the problem for the particular file given in the bug report,
> by increasing the weight of HTML (and also fixing the number of words for
> HTML, which was too high - the number of matches is divided by the number of
> words describing the type).
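
A rough sketch of that weighting, in Python for brevity. The keyword table, the weights and all function names here are made up for illustration; they are not kmimemagic's real values, and the mail only says that the HTML weight was increased.

```python
import re

# Hypothetical keyword table (not kmimemagic's real one): token -> type.
KEYWORDS = {
    "<html>": "text/html",
    "<head>": "text/html",
    "<body>": "text/html",
    "the": "text/english",
}

# Hypothetical weights; the mail only says the HTML weight was increased.
OLD_WEIGHTS = {"text/html": 1.0, "text/english": 1.0}
NEW_WEIGHTS = {"text/html": 10.0, "text/english": 1.0}


def tokenize(text):
    """Split into lower-cased tags such as <html> and bare words."""
    return re.findall(r"<[a-z]+>|[a-z]+", text.lower())


def first_keyword_guess(text):
    """'file'-style guess: the first keyword found decides."""
    for token in tokenize(text):
        if token in KEYWORDS:
            return KEYWORDS[token]
    return None


def weighted_guess(text, weights):
    """Score each type as weight * matches / number of keywords for that type."""
    words_per_type = {}
    for mime in KEYWORDS.values():
        words_per_type[mime] = words_per_type.get(mime, 0) + 1
    matches = {}
    for token in tokenize(text):
        mime = KEYWORDS.get(token)
        if mime is not None:
            matches[mime] = matches.get(mime, 0) + 1
    if not matches:
        return None
    scores = {m: weights[m] * n / words_per_type[m] for m, n in matches.items()}
    return max(scores, key=scores.get)


# Same token mix as the debug output above: <html>, <head>, and three "the".
sample = "  <html>\n<head>\nthe cat, the dog, the bird\n"
print(weighted_guess(sample, OLD_WEIGHTS))  # text/english
print(weighted_guess(sample, NEW_WEIGHTS))  # text/html
```

With equal weights the three "the" matches win; raising the HTML weight flips the result, which is the kind of fix described above. Dividing by the number of keywords per type keeps a type with a long keyword list from winning just because it has more chances to match.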
> 
> But you might wonder : why use weights at all, and not simple first-keyword
> matching ? Well, a file containing only the line :
> 
> A file containing <body > is not necessarily a HTML file...
> 
> is a text file, right ? file says it's HTML, because the first keyword
> found is <body > :). Put 'The' as first word, and then it's English text.
> 
> On the other hand, 'file' says that the following is English text :
> 
>   <html>
> <body>
> This document briefly describes the features of the Pmw megawidget
> framework and how to use the megawidgets
> </body>
> </html>
> 
> ... but that's a bug in file (it looks for "<body ", not for "<body>").
> 
> Another reason for weighting in kfm is the following example :
> 
> The file containing the tag <HTML> is not necessarily the HTML file...
> Well, another the and we're done.
> 
> This one is now recognized as English text by kfm, because it contains a
> lot of 'the' and very few html tags... Hum. A big HTML page could look
> exactly the same.
> 
> But the file
> 
> A file containing <HTML> is not necessarily a HTML file...
> 
> is recognized as HTML by kfm, because it doesn't have the word 'the' in it.
> We could add other words that look like English text... but then those words
> would appear in HTML files, and here we go again :)
> 
> What is really missing is the ability to look for spaces only before
> <HTML>, to tell that apart from real text before <HTML>... But that's
> another story... the magic file syntax doesn't allow regexps...
> 
> That's all. Sorry for being so long !
> With the new numbers, most cases should be ok now, but as you can see
> above, not all of them :)
> The real solution should be done in the file command, allowing regexps or
> at least leading-space removal.
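
A small sketch of what leading-space removal would buy, again in Python. Both function names and the regular expression are assumptions for illustration only; neither 'file' nor the magic file syntax discussed here offers this.

```python
import re


def matches_at_fixed_offset(data: bytes) -> bool:
    """Fixed-offset rule, like a magic file entry: "<html>" must sit at byte 0."""
    return data[:6].lower() == b"<html>"


# Hypothetical relaxed rule: only whitespace may precede the tag.
LEADING_HTML = re.compile(rb"\s*<html[\s>]", re.IGNORECASE)


def matches_after_whitespace(data: bytes) -> bool:
    """Accept <html> preceded by nothing but whitespace."""
    return LEADING_HTML.match(data) is not None


indented = b"  <html>\n<body>\n</body>\n</html>\n"
prose = b"A file containing <HTML> is not necessarily a HTML file...\n"

print(matches_at_fixed_offset(indented))   # False
print(matches_after_whitespace(indented))  # True
print(matches_after_whitespace(prose))     # False
```

The relaxed rule accepts the file from the bug report while still rejecting a line of prose that merely mentions the tag, which the keyword weighting cannot guarantee.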
> 
> Does anybody want to contact the author of 'file' ? :)
> 
Well, the kmimemagic code is out of Apache :)

Greetings, Stephan

-- 
As long as Linux remains a religion of freeware fanatics,
Microsoft have nothing to worry about.  
                       By Michael Surkan, PC Week Online
