List:       mandoc-tech
Subject:    Re: Improve catman mandocdb(8) heuristics.
From:       Ingo Schwarze <schwarze () usta ! de>
Date:       2011-12-08 1:10:01
Message-ID: 20111208011001.GC19643 () iris ! usta ! de

Hi Kristaps,

Kristaps Dzonsons wrote on Wed, Dec 07, 2011 at 03:56:52PM +0100:

> Enclosed is a patch to de-backspace Nm/Nd lines for mandocdb(8).

I think that makes sense.

> This arose from seeing the results for some LAPACK manuals, which
> are notoriously shitty.  It also cleans up handling of the
> non-terminated string a bit

That part seems good, too.

> and adds a quick check to see if SYNOPSIS has been reached right
> after the NAME.  This occurs when manuals look like this:
> 
>  NAME
>  SYNOPSIS
>    Blah blah blah
> 
> Again, LAPACK...

I don't think I like that; it seems too specific, in particular
looking for the exact string "SYNOPSIS".

Perhaps we should move this check before stripping out the backspace
encoding and do just this:

	line = fgetln(stream, &len);
	if (NULL == line || ' ' != line[0] || '\n' != line[len-1]) {
		buf_appendb(dbuf, buf->cp, buf->size);
		hash_put(hash, buf, TYPE_Nd);
		fclose(stream);
		return;
	}
	fclose(stream);
	line[--len] = '\0';

That removes a lot of duplicate code and may even be a better
heuristic.  ANY section header right after the first one
is fundamentally unusable, not just SYNOPSIS.

Note that your "} else if (0 == len) {" can be dropped as well
if we take that route.
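
To make the condition easier to reason about in isolation, here is a
minimal sketch of the same test as a standalone predicate.  The name
name_line_usable() is hypothetical and not part of mandocdb(8); the
sketch only assumes that section headers start in column one while
body lines are indented, as in the snippet above.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not taken from mandocdb(8).
 * Decide whether the line following the first section header can
 * serve as a description.  A usable line is indented (starts with
 * a blank) and complete (ends with a newline).  Any section header,
 * e.g. "SYNOPSIS", starts in column one and so fails the check.
 */
static int
name_line_usable(const char *line, size_t len)
{
	if (line == NULL || len == 0)
		return 0;
	return line[0] == ' ' && line[len - 1] == '\n';
}
```

When this returns 0, the caller would fall back to the title it
already has instead of trying to parse the line any further.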

> If it's relevant, a check for NAME could also occur, then loop back
> into the fgets().  Thoughts?

If there is another section before NAME, do we really want to use
the content of the NAME section?  I'd say just use the first
section.

Experience with the current makewhatis(8) tells me that being
clever and trying hard buys us very little: all reasonable pages
do not need clever tricks, so at best that finds a bit of additional
information from a small number of botched pages, i.e. most
probably not the most valuable info, and not the largest amount.

Being very resilient buys us more:  Never complain, always return
something at least semi-useful.  The biggest problem with the
current tool is that being so clever, it easily gets really badly
confused, and then it starts complaining loudly.

Yes, we should implement -t (test mode) later on, to help porters
spot pages with broken NAME sections.  But it is very important
to be absolutely silent and not too clever in production mode.

> This area still needs a bit more attention to handle situations like:
> 
>  foo -[\n]?
>  [whitespace]foo - bar[whitespace][\n]?

Oh well, I think delivering just

  foo(1) -
  foo(1) - bar

is good enough in these two cases.  Stripping trailing whitespace
and falling back to "foo(1) - foo" would be a bit more fancy
in the first case, probably worthwhile, but hardly critical.
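
As a rough sketch of that fancier variant, the following strips
surrounding whitespace, splits at the first " -" separator, and falls
back to the name when the description is empty.  The helpers rstrip()
and split_name_desc() are hypothetical names for illustration, not
existing mandocdb(8) functions, and appending the "(1)" section suffix
is left to the caller.

```c
#include <ctype.h>
#include <string.h>

/* Strip trailing whitespace, including a newline, in place. */
static size_t
rstrip(char *s)
{
	size_t len = strlen(s);

	while (len > 0 && isspace((unsigned char)s[len - 1]))
		s[--len] = '\0';
	return len;
}

/*
 * Hypothetical helper: split "name - desc" in place.
 * Returns 1 and sets *name and *desc on success, 0 if there is no
 * " -" separator followed by a blank or the end of the line.
 * An empty description falls back to the name itself.
 */
static int
split_name_desc(char *line, char **name, char **desc)
{
	char *sep;

	while (isspace((unsigned char)*line))
		line++;
	rstrip(line);

	for (sep = strstr(line, " -"); sep != NULL;
	    sep = strstr(sep + 1, " -"))
		if (sep[2] == ' ' || sep[2] == '\0')
			break;
	if (sep == NULL)
		return 0;

	*sep = '\0';		/* terminate the name */
	*name = line;
	rstrip(*name);
	*desc = sep + 2;
	while (isspace((unsigned char)**desc))
		(*desc)++;
	if (**desc == '\0')	/* "foo -": reuse the name */
		*desc = *name;
	return 1;
}
```

With this, "foo -\n" yields name "foo" and description "foo", while
"  foo - bar  \n" yields name "foo" and description "bar".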

One thing that i want to do is read through the current
Makewhatis::Formated and see which features should be ported
and which are better done in a simpler way.  Of course,
I won't complain if somebody beats me to it.

Yours,
  Ingo
--
 To unsubscribe send an email to tech+unsubscribe@mdocml.bsd.lv