Make \c operate like Perl in EBCDIC environments.

2015-06-13 16:10:14 +00:00 · 2015-06-13 16:10:14 +00:00 · c146059c22
parent 149aa29209
commit c146059c22
7 changed files with 134 additions and 40 deletions
--- a/3
+++ b/3
@@ -161,6 +161,9 @@ itself. For example: /^(?:(?(1)x|)+)+$()/.
 41. In an EBCDIC environment, \a in a pattern was converted to the ASCII
 instead of the EBCDIC value.

+42. The handling of \c in an EBCDIC environment has been revised so that it is
+now compatible with the specification in Perl's perlebcdic page.
+

 Version 10.10 06-March-2015
 ---------------------------
--- a/doc/pcre2pattern.3
+++ b/doc/pcre2pattern.3
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "19 May 2015" "PCRE2 10.20"
+.TH PCRE2PATTERN 3 "13 June 2015" "PCRE2 10.20"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -337,10 +337,11 @@ A second use of backslash provides a way of encoding non-printing characters
 in patterns in a visible manner. There is no restriction on the appearance of
 non-printing characters in a pattern, but when a pattern is being prepared by
 text editing, it is often easier to use one of the following escape sequences
-than the binary character it represents:
+than the binary character it represents. In an ASCII or Unicode environment, 
+these escapes are as follows:
 .sp
  \ea        alarm, that is, the BEL character (hex 07)
-  \ecx       "control-x", where x is any ASCII character
+  \ecx       "control-x", where x is any printable ASCII character
  \ee        escape (hex 1B)
  \ef        form feed (hex 0C)
  \en        linefeed (hex 0A)
@@ -351,27 +352,40 @@ than the binary character it represents:
  \eo{ddd..} character with octal code ddd..
  \exhh      character with hex code hh
  \ex{hhh..} character with hex code hhh.. (default mode)
-  \euhhhh    character with hex code hhhh (only when PCRE2_ALT_BSUX is set)
+  \euhhhh    character with hex code hhhh (when PCRE2_ALT_BSUX is set)
 .sp
 The precise effect of \ecx on ASCII characters is as follows: if x is a lower
 case letter, it is converted to upper case. Then bit 6 of the character (hex
 40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
 but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the
-code unit following \ec has a value greater than 127, a compile-time error
-occurs. This locks out non-ASCII characters in all modes.
+code unit following \ec has a value less than 32 or greater than 126, a
+compile-time error occurs. This locks out non-printable ASCII characters in all
+modes.
 .P
-The \ec facility was designed for use with ASCII characters, but with the
-extension to Unicode it is even less useful than it once was. It is, however,
-recognized when PCRE2 is compiled in EBCDIC mode, where data items are always
-bytes. In this mode, all values are valid after \ec. If the next character is a
-lower case letter, it is converted to upper case. Then the 0xc0 bits of the
-byte are inverted. Thus \ecA becomes hex 01, as in ASCII (A is C1), but because
-the EBCDIC letters are disjoint, \ecZ becomes hex 29 (Z is E9), and other
-characters also generate different values.
+When PCRE2 is compiled in EBCDIC mode, \ea, \ee, \ef, \en, \er, and \et
+generate the appropriate EBCDIC code values. The \ec escape is processed
+as specified for Perl in the \fBperlebcdic\fP document. The only characters
+that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ], ^, _, or ?. Any
+other character provokes a compile-time error. The sequence \ec@ encodes
+character code 0; the letters (in either case) encode characters 1-26 (hex 01
+to hex 1A); [, \e, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and
+\ec? becomes either 255 (hex FF) or 95 (hex 5F).
+.P
+Thus, apart from \ec?, these escapes generate the same character code values as
+they do in an ASCII environment, though the meanings of the values mostly
+differ. For example, \ecG always generates code value 7, which is BEL in ASCII
+but DEL in EBCDIC.
+.P
+The sequence \ec? generates DEL (127, hex 7F) in an ASCII environment, but
+because 127 is not a control character in EBCDIC, Perl makes it generate the
+APC character. Unfortunately, there are several variants of EBCDIC. In most of
+them the APC character has the value 255 (hex FF), but in the one Perl calls
+POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
+values, PCRE2 makes \ec? generate 95; otherwise it generates 255.
 .P
 After \e0 up to two further octal digits are read. If there are fewer than two
-digits, just those that are present are used. Thus the sequence \e0\ex\e07
-specifies two binary zeros followed by a BEL character (code value 7). Make
+digits, just those that are present are used. Thus the sequence \e0\ex\e015
+specifies two binary zeros followed by a CR character (code value 13). Make
 sure you supply two digits after the initial zero if the pattern character that
 follows is itself an octal digit.
 .P
@@ -3347,6 +3361,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 19 May 2015
+Last updated: 13 June 2015
 Copyright (c) 1997-2015 University of Cambridge.
 .fi
--- a/doc/pcre2syntax.3
+++ b/doc/pcre2syntax.3
@@ -1,4 +1,4 @@
-.TH PCRE2SYNTAX 3 "23 April 2015" "PCRE2 10.20"
+.TH PCRE2SYNTAX 3 "13 June 2015" "PCRE2 10.20"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -21,9 +21,11 @@ documentation. This document contains a quick-reference summary of the syntax.
 .
 .SH "ESCAPED CHARACTERS"
 .rs
+.sp
+This table applies to ASCII and Unicode environments.
 .sp
  \ea         alarm, that is, the BEL character (hex 07)
-  \ecx        "control-x", where x is any ASCII character
+  \ecx        "control-x", where x is any ASCII printing character
  \ee         escape (hex 1B)
  \ef         form feed (hex 0C)
  \en         newline (hex 0A)
@@ -47,7 +49,8 @@ in the
 .\" HREF
 \fBpcre2pattern\fP
 .\"
-documentation.
+documentation, where details of escape processing in EBCDIC environments are 
+also given.
 .P
 When \ex is not followed by {, from zero to two hexadecimal digits are read,
 but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
@@ -567,6 +570,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 23 April 2015
+Last updated: 13 June 2015
 Copyright (c) 1997-2015 University of Cambridge.
 .fi
--- a/src/pcre2_compile.c
+++ b/src/pcre2_compile.c
@@ -268,8 +268,9 @@ invalid. */
 in UTF-8 mode. It runs from '0' to 'z'. */

 #ifndef EBCDIC
-#define ESCAPES_FIRST  CHAR_0
-#define ESCAPES_LAST   CHAR_z
+#define ESCAPES_FIRST       CHAR_0
+#define ESCAPES_LAST        CHAR_z
+#define ESCAPES_UPPER_CASE  (-32)    /* Add this to upper case a letter */

 static const short int escapes[] = {
     0,                       0,
@@ -319,12 +320,14 @@ It runs from 'a' to '9'. For some minimal testing of EBCDIC features, the code
 is sometimes compiled on an ASCII system. In this case, we must not use CHAR_a
 because it is defined as 'a', which of course picks up the ASCII value. */

-#if 'a' == 0x81                 /* Check for a real EBCDIC environment */
-#define ESCAPES_FIRST  CHAR_a
-#define ESCAPES_LAST   CHAR_9
-#else                           /* Testing in an ASCII environment */
+#if 'a' == 0x81                    /* Check for a real EBCDIC environment */
+#define ESCAPES_FIRST       CHAR_a
+#define ESCAPES_LAST        CHAR_9
+#define ESCAPES_UPPER_CASE  (+64)  /* Add this to upper case a letter */
+#else                              /* Testing in an ASCII environment */
 #define ESCAPES_FIRST  ((unsigned char)'\x81')   /* EBCDIC 'a' */
 #define ESCAPES_LAST   ((unsigned char)'\xf9')   /* EBCDIC '9' */
+#define ESCAPES_UPPER_CASE  (-32)  /* Add this to upper case a letter */
 #endif

 static const short int escapes[] = {
@@ -346,6 +349,11 @@ static const short int escapes[] = {
 /*  F8 */     0,     0
 };

+/* We also need a table of characters that may follow \c in an EBCDIC
+environment for characters 0-31. */
+
+static unsigned char ebcdic_escape_c[] = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_";
+
 #endif   /* EBCDIC */


@@ -1238,7 +1246,7 @@ for (code = first_significant_code(code + PRIV(OP_lengths)[*code], TRUE);
  PCRE2_SPTR ccode;

  c = *code;
-  
+
  /* Skip over forward assertions; the other assertions are skipped by
  first_significant_code() with a TRUE final argument. */

@@ -2076,30 +2084,62 @@ else
      }       /* End of Perl-style \x handling */
    break;

-    /* For \c, a following letter is upper-cased; then the 0x40 bit is flipped.
-    An error is given if the byte following \c is not a printable ASCII
-    character. This coding is ASCII-specific, but then the whole concept of \cx
-    is ASCII-specific. (However, an EBCDIC equivalent has now been added.) */
+    /* The handling of \c is different in ASCII and EBCDIC environments. In an
+    ASCII (or Unicode) environment, an error is given if the character
+    following \c is not a printable ASCII character. Otherwise, the following
+    character is upper-cased if it is a letter, and after that the 0x40 bit is
+    flipped. The result is the value of the escape.

+    In an EBCDIC environment the handling of \c is compatible with the
+    specification in the perlebcdic document. The following character must be
+    a letter or one of a small number of special characters. These provide a
+    means of defining the character values 0-31.
+
+    For testing the EBCDIC handling of \c in an ASCII environment, recognize
+    the EBCDIC value of 'c' explicitly. */
+
+#if defined EBCDIC && 'a' != 0x81
+    case 0x83:
+#else
    case CHAR_c:
+#endif
+
    c = *(++ptr);
+    if (c >= CHAR_a && c <= CHAR_z) c += ESCAPES_UPPER_CASE;
    if (c == CHAR_NULL && ptr >= cb->end_pattern)
      {
      *errorcodeptr = ERR2;
      break;
      }
+
+    /* Handle \c in an ASCII/Unicode environment. */
+
 #ifndef EBCDIC    /* ASCII/UTF-8 coding */
    if (c < 32 || c > 126)  /* Excludes all non-printable ASCII */
      {
      *errorcodeptr = ERR68;
      break;
      }
-    if (c >= CHAR_a && c <= CHAR_z) c -= 32;
    c ^= 0x40;
-#else             /* EBCDIC coding */
-    if (c >= CHAR_a && c <= CHAR_z) c += 64;
-    c ^= 0xC0;
-#endif
+
+    /* Handle \c in an EBCDIC environment. The special case \c? is converted to
+    255 (0xff), or to 95 (0x5f) if other characters suggest we are using the
+    POSIX-BC encoding. (This is the way Perl indicates that it handles \c?.) The
+    other valid sequences correspond to a list of specific characters. */
+
+#else
+    if (c == CHAR_QUESTION_MARK)
+      c = ('\\' == 188 && '`' == 74)? 0x5f : 0xff;
+    else
+      {
+      for (i = 0; i < 32; i++)
+        {
+        if (c == ebcdic_escape_c[i]) break;
+        }
+      if (i < 32) c = i; else *errorcodeptr = ERR68;
+      }
+#endif  /* EBCDIC */
+
    break;

    /* Any other alphanumeric following \ is an error. Perl gives an error only
@@ -6492,7 +6532,7 @@ for (;; ptr++)
              goto FAILED;
              }
            recno = recno * 10 + *ptr++ - CHAR_0;
-            } 
+            }

          if (*ptr != (PCRE2_UCHAR)terminator)
            {
--- a/src/pcre2_error.c
+++ b/src/pcre2_error.c
@@ -145,7 +145,11 @@ static const char compile_error_texts[] =
  "different names for subpatterns of the same number are not allowed\0"
  "(*MARK) must have an argument\0"
  "non-hex character in \\x{} (closing brace missing?)\0"
+#ifndef EBCDIC   
  "\\c must be followed by a printable ASCII character\0"
+#else   
+  "\\c must be followed by a letter or one of [\\]^_?\0"
+#endif
  "\\k is not followed by a braced, angle-bracketed, or quoted name\0"
  /* 70 */
  "internal error: unknown opcode in find_fixedlength()\0"
--- a/testdata/testinputEBC
+++ b/testdata/testinputEBC
@@ -1,5 +1,5 @@
 # This is a specialized test for checking, when PCRE2 is compiled with the
-# EBCDIC option but in an ASCII environment, that newline and white space
+# EBCDIC option but in an ASCII environment, that newline, white space, and \c
 # functionality is working. It catches cases where explicit values such as 0x0a
 # have been used instead of names like CHAR_LF. Needless to say, it is not a
 # genuine EBCDIC test! In patterns, alphabetic characters that follow a
@@ -117,5 +117,18 @@
    A\x25B
    A\x0bB
    A\x0cB
+    
+# Test \c functionality 
+    
+/\ƒ@\ƒA\ƒb\ƒC\ƒd\ƒE\ƒf\ƒG\ƒh\ƒI\ƒJ\ƒK\ƒl\ƒm\ƒN\ƒO\ƒp\ƒq\ƒr\ƒS\ƒT\ƒu\ƒV\ƒW\ƒX\ƒy\ƒZ/
+    \x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
+
+/\ƒ[\ƒ\\ƒ]\ƒ^\ƒ_/
+    \x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
+    
+/\ƒ?/
+    A\xffB
+
+/\ƒ&/

 # End
--- a/testdata/testoutputEBC
+++ b/testdata/testoutputEBC
@@ -1,5 +1,5 @@
 # This is a specialized test for checking, when PCRE2 is compiled with the
-# EBCDIC option but in an ASCII environment, that newline and white space
+# EBCDIC option but in an ASCII environment, that newline, white space, and \c
 # functionality is working. It catches cases where explicit values such as 0x0a
 # have been used instead of names like CHAR_LF. Needless to say, it is not a
 # genuine EBCDIC test! In patterns, alphabetic characters that follow a
@@ -178,5 +178,22 @@ No match
 No match
    A\x0cB
 No match
+    
+# Test \c functionality 
+    
+/\ƒ@\ƒA\ƒb\ƒC\ƒd\ƒE\ƒf\ƒG\ƒh\ƒI\ƒJ\ƒK\ƒl\ƒm\ƒN\ƒO\ƒp\ƒq\ƒr\ƒS\ƒT\ƒu\ƒV\ƒW\ƒX\ƒy\ƒZ/
+    \x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
+ 0: \x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a
+
+/\ƒ[\ƒ\\ƒ]\ƒ^\ƒ_/
+    \x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
+ 0: \x1b\x1c\x1d\x1e\x1f
+    
+/\ƒ?/
+    A\xffB
+ 0: \xff
+
+/\ƒ&/
+Failed: error 168 at offset 2: \c\x20must\x20be\x20followed\x20by\x20a\x20letter\x20or\x20one\x20of\x20[\]^_\x3f

 # End