
List:       lyx-devel
Subject:    Re: dash documentation patch
From:       Guenter Milde <milde () users ! sf ! net>
Date:       2017-08-31 20:53:41
Message-ID: oo9t0k$ev3$1 () blaine ! gmane ! org

On 2017-08-31, Enrico Forestieri wrote:
> On Wed, Aug 30, 2017 at 11:06:25PM +0200, Guenter Milde wrote:

>> +* Since version 2.2, -- and --- in the LyX source are output as -{}- and
>> +  -{}-{}- to prevent conversion to en- and em-dashes by TeX.
>> +  Occurrences in pre-2.2 documents are converted to literal Unicode dashes.
>> +  In some cases this leads to different line breaks, as:
>> +  + there is an optional line break after hyphens (also -- and ---) but not
>> +    after literal dashes, and
>> +  + hyphenation is suppressed in words following hyphens but allowed after
>> +    literal dashes.

> I think the above does not belong to the release notes for 2.3. It
> would have been appropriate for the 2.2 release notes. IMO, this has to
> go in the docs.

Description of past behaviour is IMO not out of place in a section called
"!!Caveats when upgrading from earlier versions to 2.3.x".
Is there a CHANGELOG where this could find a better place?

>> +  LyX 2.3 exports literal dashes as -- and --- by default. If you used
>> +  literal em- and en-dashes in pre-2.2 documents, you must manually unselect
>> +  "Document->Settings->Fonts->Output em- and en-dash as ligatures" to ensure
>> +  unchanged behaviour.

> The above might be useful information and thus is Ok.

>> +  It is no longer possible to differentiate dashes with/without optional
>> +  line break using --- and -- vs. literal dashes. Either convert one sort to
>> +  ERT or insert optional line break characters.

> This is true since 2.2, so this does not belong to the release notes for
> 2.3 and is better explained in the docs.

This is also one of the caveats when upgrading from earlier versions (which are
not limited to upgrading from 2.2).

>> +  lyx2lyx deletes ZWSP characters following literal em- and en-dashes when
>> +  converting to 2.3 format. If you used literal ZWSP characters (u200b) as
>> +  optional line breaks after dashes, convert them to 0dd wide space insets
>> +  before opening your document with LyX 2.3 or the optional line breaks will
>> +  be lost!

> I find the above so technical to be not understood by the vast majority
> of users. Moreover, the chance that someone used ZWSP to achieve that
> effect is negligible. Better removed from release notes and maybe explained
> in a footnote in the docs.

There is evidence of this usage, and the consequences are easily overlooked
and (because of the data loss) hard to revert. Even if not understood by
a majority of users, it should be clear to affected users. In any case,
it is not more technical than the next paragraph:

  If using TeX fonts and en- and em-dashes are output as font ligatures,
  when exporting documents containing en- and em-dashes to the format of
  LyX 2.0 or earlier, the following line has to be manually added to the
  unicodesymbols file of that LyX version:<br>
  0x200b "\\hspace{0pt}" "" "" "" "" # ZERO WIDTH SPACE<br>
  This avoids "uncodable character" issues if the document is actually
  loaded by that LyX version. LyX 2.1 and later versions already have the
  necessary definition in their unicodesymbols file


IMO, it is best to render both of them superfluous by not adding ZWSP in the
back conversion and not removing it in the forward conversion.
This also opens the way for true compatibility when exporting to 2.1 and
older formats.
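
To judge whether a given document is affected by the ZWSP removal, something
like the following could be run on its body lines first. This is only a
minimal Python 3 sketch and not part of the patch; the function name
find_zwsp_after_dashes is hypothetical, and it assumes the body is available
as a list of unicode strings, as in lyx2lyx.

```python
import re

ZWSP = "\u200b"            # ZERO WIDTH SPACE
DASHES = "\u2013\u2014"    # en dash, em dash

def find_zwsp_after_dashes(lines):
    """Return (line_index, column_of_dash) pairs where a ZWSP follows a dash.

    Hypothetical helper for illustration; `lines` is assumed to be a list
    of unicode strings (one per line of the document body).
    """
    hits = []
    pattern = re.compile("[%s]%s" % (DASHES, ZWSP))
    for i, line in enumerate(lines):
        # ZWSP directly after a dash within the same line
        for m in pattern.finditer(line):
            hits.append((i, m.start()))
        # A dash ending one line may be followed by a ZWSP that starts the
        # next line (the current conversion code handles this case, too).
        if (line and line[-1] in DASHES and i + 1 < len(lines)
                and lines[i + 1].startswith(ZWSP)):
            hits.append((i, len(line) - 1))
    return hits
```

An empty result would suggest that the document does not rely on ZWSP as an
optional line break after dashes.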

See patch below.


Günter



diff --git a/lib/lyx2lyx/lyx_2_3.py b/lib/lyx2lyx/lyx_2_3.py
index 73ac45cf00..6305d417b0 100644
--- a/lib/lyx2lyx/lyx_2_3.py
+++ b/lib/lyx2lyx/lyx_2_3.py
@@ -1841,103 +1841,57 @@ def revert_chapterbib(document):
 
 
 def convert_dashligatures(document):
-    " Remove a zero-length space (U+200B) after en- and em-dashes. "
-
+    " Set use_dash_ligatures according to content (literal vs. 'ligature' dashes) "
+    # Default:
+    use_dash_ligatures = False # TODO: Get the default from stdtemplate.lyx
+    # Look for dashes (followed by a word or no-break space):
+    # (Documents by LyX 2.1 or older have "\twohyphens\n" or "\threehyphens\n"
+    # as interim representation for dash ligatures in 2.2.)
+    has_literal_dashes = has_ligature_dashes = False
+    for i, line in enumerate(document.body):
+        if re.search(u"[\u2013\u2014]([\w\u00A0]|$)", line, flags=re.UNICODE):
+            has_literal_dashes = True
+        if re.search(ur"(\\twohyphens|\\threehyphens)", line, flags=re.UNICODE):
+            # print "dash in line ", i, document.body[i+1].encode('utf8')
+            if re.match(u"[\w\u00A0]", document.body[i+1], flags=re.UNICODE):
+                has_ligature_dashes = True
+    if has_literal_dashes and has_ligature_dashes:
+        # TODO: insert a warning note in the document?
+        document.warning("""This document contained both literal and "ligature" dashes.
+            Line breaks may have changed. See UserGuide chapter 3.9.1 for details.""")
+    elif has_literal_dashes:
+        # print "has literal dashes"
+        use_dash_ligatures = False
+    elif has_ligature_dashes:
+        # print "has ligature dashes"
+        use_dash_ligatures = True
+    # insert the setting
     i = find_token(document.header, "\\use_microtype", 0)
     if i != -1:
-        if document.initial_format > 474 and document.initial_format < 509:
-            # This was created by LyX 2.2
-            document.header[i+1:i+1] = ["\\use_dash_ligatures false"]
-        else:
-            # This was created by LyX 2.1 or earlier
-            document.header[i+1:i+1] = ["\\use_dash_ligatures true"]
-
-    i = 0
-    while i < len(document.body):
-        words = document.body[i].split()
-        # Skip some document parts where dashes are not converted
-        if len(words) > 1 and words[0] == "\\begin_inset" and \
-           words[1] in ["CommandInset", "ERT", "External", "Formula", \
-                        "FormulaMacro", "Graphics", "IPA", "listings"]:
-            j = find_end_of_inset(document.body, i)
-            if j == -1:
-                document.warning("Malformed LyX document: Can't find end of " \
-                                 + words[1] + " inset at line " + str(i))
-                i += 1
-            else:
-                i = j
-            continue
-        if len(words) > 0 and words[0] in ["\\leftindent", \
-                "\\paragraph_spacing", "\\align", "\\labelwidthstring"]:
-            i += 1
-            continue
-
-        start = 0
-        while True:
-            j = document.body[i].find(u"\u2013", start) # en-dash
-            k = document.body[i].find(u"\u2014", start) # em-dash
-            if j == -1 and k == -1:
-                break
-            if j == -1 or (k != -1 and k < j):
-                j = k
-            after = document.body[i][j+1:]
-            if after.startswith(u"\u200B"):
-                document.body[i] = document.body[i][:j+1] + after[1:]
-            else:
-                if len(after) == 0 and document.body[i+1].startswith(u"\u200B"):
-                    document.body[i+1] = document.body[i+1][1:]
-                    break
-            start = j+1
-        i += 1
-
+        document.header.insert(i+1, "\\use_dash_ligatures %s"
+                               % str(use_dash_ligatures).lower())
 
 def revert_dashligatures(document):
-    " Remove font ligature settings for en- and em-dashes. "
+    r""" Remove font ligature settings for en- and em-dashes.
+    Revert conversion of \twohyphens or \threehyphens to literal dashes. """
     i = find_token(document.header, "\\use_dash_ligatures", 0)
     if i == -1:
         return
     use_dash_ligatures = get_bool_value(document.header, "\\use_dash_ligatures", i)
     del document.header[i]
-    use_non_tex_fonts = False
     i = find_token(document.header, "\\use_non_tex_fonts", 0)
-    if i != -1:
+    if i == -1:
+        use_non_tex_fonts = False
+    else:
         use_non_tex_fonts = get_bool_value(document.header, "\\use_non_tex_fonts", i)
     if not use_dash_ligatures or use_non_tex_fonts:
         return
-
-    # Add a zero-length space (U+200B) after en- and em-dashes
-    i = 0
-    while i < len(document.body):
-        words = document.body[i].split()
-        # Skip some document parts where dashes are not converted
-        if len(words) > 1 and words[0] == "\\begin_inset" and \
-           words[1] in ["CommandInset", "ERT", "External", "Formula", \
-                        "FormulaMacro", "Graphics", "IPA", "listings"]:
-            j = find_end_of_inset(document.body, i)
-            if j == -1:
-                document.warning("Malformed LyX document: Can't find end of " \
-                                 + words[1] + " inset at line " + str(i))
-                i += 1
-            else:
-                i = j
-            continue
-        if len(words) > 0 and words[0] in ["\\leftindent", \
-                "\\paragraph_spacing", "\\align", "\\labelwidthstring"]:
-            i += 1
-            continue
-
-        start = 0
-        while True:
-            j = document.body[i].find(u"\u2013", start) # en-dash
-            k = document.body[i].find(u"\u2014", start) # em-dash
-            if j == -1 and k == -1:
-                break
-            if j == -1 or (k != -1 and k < j):
-                j = k
-            after = document.body[i][j+1:]
-            document.body[i] = document.body[i][:j+1] + u"\u200B" + after
-            start = j+1
-        i += 1
+    new_body = []
+    for line in document.body:
+        line = '\\twohyphens\n'.join(line.split(u'\u2013'))
+        line = '\\threehyphens\n'.join(line.split(u'\u2014'))
+        new_body.extend(line.split('\n'))
+    document.body = new_body

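The two core steps of the patch can be tried outside of lyx2lyx. The
following is a minimal Python 3 sketch, assuming the body is a plain list of
unicode strings; the function names classify_dashes and revert_line are
hypothetical, and the index guard in classify_dashes is an addition that the
patch itself does not have.

```python
import re

# Literal en/em dash followed by a word character, a no-break space,
# or the end of the line (same pattern as in the patch).
LITERAL_DASH = re.compile("[\u2013\u2014]([\\w\u00A0]|$)")
# Interim tokens for dash ligatures, as used in the patch.
LIGATURE_DASH = re.compile(r"(\\twohyphens|\\threehyphens)")
WORD_START = re.compile("[\\w\u00A0]")

def classify_dashes(body):
    """Return (has_literal_dashes, has_ligature_dashes) for a list of lines."""
    has_literal = has_ligature = False
    for i, line in enumerate(body):
        if LITERAL_DASH.search(line):
            has_literal = True
        # Guard against a token on the very last line (sketch-only addition).
        if LIGATURE_DASH.search(line) and i + 1 < len(body):
            if WORD_START.match(body[i + 1]):
                has_ligature = True
    return has_literal, has_ligature

def revert_line(line):
    """Replace literal dashes by the interim tokens, one token per line."""
    line = "\\twohyphens\n".join(line.split("\u2013"))
    line = "\\threehyphens\n".join(line.split("\u2014"))
    return line.split("\n")

if __name__ == "__main__":
    print(classify_dashes(["pages 1\u20132"]))
    print(revert_line("pages 1\u20132"))
```

Note that with this split/join approach a dash at the very end of a line
yields an additional empty line after the token.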
