List: ruby-talk
Subject: Re: Splitting a text file into sentences
From: Jeffrey Schwab <jeff () schwabcenter ! com>
Date: 2005-11-30 23:42:30
Message-ID: PLqjf.325$TU6.11994 () twister ! southeast ! rr ! com
Dave Howell wrote:
> I think "right" or "wrong" are a tad strong for most of the cases cited.
> But as a professional book designer and typographer, there's
> unquestionably "better" and "worse."
>
> For improved legibility, inter-sentence space should generally be a bit
> greater than inter-word space.
>
> Typewriters only had one distance they could travel. Either 1/10th of an
> inch ("Pica") or 1/12th ("Elite"). So the only way to add extra space
> after a sentence was to double it. That's way too much extra space, but
> it was generally better than the alternative. The real problem was that
> the words were too far apart, not that the sentences were too close, but
> again, the fixed spacing was already an abominable situation.
>
> Proportional type, dating all the way back to Gutenberg, would generally
> use 1/3rd or 1/4th of the height of the type as the inter-word spacing.
> This would usually work out to about the width of a lower case "t" or "l".
>
> When setting modern (by which you may also read "all type before
> typewriters" as well) proportional type in fully justified form (left
> and right margins both even), the spaces must be stretched out on a
> line-by-line basis to fit. Really good typesetting programs (and really
> good typesetters sticking little bits of lead between their words (and
> I've done that, too)) will add more of the space between sentences than
> between words, so as the line stretches, the inter-word space to
> inter-sentence space ratio actually changes. (Take a look at a narrow
> newspaper column sometime.)
>
> More sophisticated approaches to space will ignore a user's attempt to
> sprinkle extraneous space in. Less sophisticated ones might allow it,
> and even treat them as individual spaces, stretching both of them during
> expansion. {shudder}
>
> The fact that both the MLA Guidelines and the Bedford Handbook encourage
> poor typography is regrettable. ("If you cannot type appropriate
> punctuation, e.g. an em-dash or en-dash, please use appropriate
> substitutions. For both dashes, substitute a pair of hyphens, which,
> like true dashes, are typed without adjacent spaces." There's still
> software out there that will happily wrap a line between the two
> hyphens. Ick!) Nevertheless, if you're submitting a paper to an
> institution that expects or requires that, then to not follow them is
> wrong, even if the legibility of the submission is better.
>
> What it all boils down to is "Putting two spaces after a period at the
> end of a sentence is an artifact left over from the days when the
> typewriter was the prevalent text-making tool. Unless you have a
> specific reason or requirement to do otherwise, it's preferable to put
> only one space between sentences."
>
> *****
>
> For breaking text into sentences, sometimes I find it easier to work
> backwards. Also, only very colloquial writing will have a one-word
> sentence, so you can solve all "Mr./Dr./Ph.D." cases by the fact that if
> a word starts with a cap and ends with a period, it's not a sentence.
> For a more sophisticated approach that's still not too complex to
> program, check the final word of a sentence against a dictionary. If
> it's found there without a final dot, then you're almost certainly
> looking at the end of a sentence. If it isn't, then is it found anywhere
> else in the document without a dot? If not, then you're probably looking
> at an abbreviation. (My mail program uses a monospaced font. If I
> thought most readers would read it with a proportional font, I'd have
> typed "Ph. D." above, since it should have a thin space before the D.)
This is what I love about Usenet. :)
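
For the Ruby-minded, here is a rough sketch of the dictionary check Dave describes in his last paragraph. It is only one reading of that description, not a tested splitter: the method names (split_sentences, normalize) and the tiny word list are made up for illustration, and a real version would load a proper dictionary. A token ending in a period closes a sentence if the word is in the dictionary, or if it shows up elsewhere in the text without a dot; otherwise it is treated as an abbreviation.

```ruby
require "set"

# Strip leading/trailing punctuation and lowercase, so "mat." => "mat"
# and "Ph.D." => "ph.d" (internal dots are kept).
def normalize(token)
  token.downcase.gsub(/\A[[:punct:]]+|[[:punct:]]+\z/, "")
end

# dictionary is a hypothetical Set of known lowercase words.
def split_sentences(text, dictionary)
  tokens = text.split

  # Words that occur somewhere in the document without a trailing dot.
  undotted = tokens.reject { |t| t.end_with?(".") }
                   .map { |t| normalize(t) }
                   .to_set

  sentences = []
  current = []

  tokens.each do |token|
    current << token

    ends_sentence =
      if token.match?(/[!?]\z/)
        true
      elsif token.end_with?(".")
        word = normalize(token)
        # Known word, or seen elsewhere without a dot: real sentence end.
        # Otherwise assume an abbreviation such as "Dr." or "Ph.D.".
        dictionary.include?(word) || undotted.include?(word)
      else
        false
      end

    if ends_sentence
      sentences << current.join(" ")
      current = []
    end
  end

  # Whatever is left over becomes the final sentence.
  sentences << current.join(" ") unless current.empty?
  sentences
end

words = Set.new(%w[the cat sat on mat saw a he has from])
text  = "Dr. Smith saw the cat. The cat sat on the mat. He has a Ph.D. from Yale."
p split_sentences(text, words)
# => ["Dr. Smith saw the cat.", "The cat sat on the mat.", "He has a Ph.D. from Yale."]
```

The obvious weak spot is a sentence that ends in a word found neither in the dictionary nor elsewhere in the text; the sketch would run it into the next sentence, which is the "probably" in Dave's "probably looking at an abbreviation."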