From c146059c223da897bd662324480210971dba4ffb Mon Sep 17 00:00:00 2001 From: "Philip.Hazel" Date: Sat, 13 Jun 2015 16:10:14 +0000 Subject: [PATCH] Make \c operate like Perl in EBCDIC environments. --- ChangeLog | 3 ++ doc/pcre2pattern.3 | 48 +++++++++++++++++---------- doc/pcre2syntax.3 | 11 ++++--- src/pcre2_compile.c | 74 ++++++++++++++++++++++++++++++++---------- src/pcre2_error.c | 4 +++ testdata/testinputEBC | 15 ++++++++- testdata/testoutputEBC | 19 ++++++++++- 7 files changed, 134 insertions(+), 40 deletions(-) diff --git a/ChangeLog b/ChangeLog index 766f6b2..523ea55 100644 --- a/ChangeLog +++ b/ChangeLog @@ -161,6 +161,9 @@ itself. For example: /^(?:(?(1)x|)+)+$()/. 41. In an EBCDIC environment, \a in a pattern was converted to the ASCII instead of the EBCDIC value. +42. The handling of \c in an EBCDIC environment has been revised so that it is +now compatible with the specification in Perl's perlebcdic page. + Version 10.10 06-March-2015 --------------------------- diff --git a/doc/pcre2pattern.3 b/doc/pcre2pattern.3 index f8fbb2f..5aee596 100644 --- a/doc/pcre2pattern.3 +++ b/doc/pcre2pattern.3 @@ -1,4 +1,4 @@ -.TH PCRE2PATTERN 3 "19 May 2015" "PCRE2 10.20" +.TH PCRE2PATTERN 3 "13 June 2015" "PCRE2 10.20" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "PCRE2 REGULAR EXPRESSION DETAILS" @@ -337,10 +337,11 @@ A second use of backslash provides a way of encoding non-printing characters in patterns in a visible manner. There is no restriction on the appearance of non-printing characters in a pattern, but when a pattern is being prepared by text editing, it is often easier to use one of the following escape sequences -than the binary character it represents: +than the binary character it represents. 
In an ASCII or Unicode environment, +these escapes are as follows: .sp \ea alarm, that is, the BEL character (hex 07) - \ecx "control-x", where x is any ASCII character + \ecx "control-x", where x is any printable ASCII character \ee escape (hex 1B) \ef form feed (hex 0C) \en linefeed (hex 0A) @@ -351,27 +352,40 @@ than the binary character it represents: \eo{ddd..} character with octal code ddd.. \exhh character with hex code hh \ex{hhh..} character with hex code hhh.. (default mode) - \euhhhh character with hex code hhhh (only when PCRE2_ALT_BSUX is set) + \euhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set) .sp The precise effect of \ecx on ASCII characters is as follows: if x is a lower case letter, it is converted to upper case. Then bit 6 of the character (hex 40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A), but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the -code unit following \ec has a value greater than 127, a compile-time error -occurs. This locks out non-ASCII characters in all modes. +code unit following \ec has a value less than 32 or greater than 126, a +compile-time error occurs. This locks out non-printable ASCII characters in all +modes. .P -The \ec facility was designed for use with ASCII characters, but with the -extension to Unicode it is even less useful than it once was. It is, however, -recognized when PCRE2 is compiled in EBCDIC mode, where data items are always -bytes. In this mode, all values are valid after \ec. If the next character is a -lower case letter, it is converted to upper case. Then the 0xc0 bits of the -byte are inverted. Thus \ecA becomes hex 01, as in ASCII (A is C1), but because -the EBCDIC letters are disjoint, \ecZ becomes hex 29 (Z is E9), and other -characters also generate different values. +When PCRE2 is compiled in EBCDIC mode, \ea, \ee, \ef, \en, \er, and \et +generate the appropriate EBCDIC code values. 
The \ec escape is processed +as specified for Perl in the \fBperlebcdic\fP document. The only characters +that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ], ^, _, or ?. Any +other character provokes a compile-time error. The sequence \ec@ encodes +character code 0; the letters (in either case) encode characters 1-26 (hex 01 +to hex 1A); [, \e, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and +\ec? becomes either 255 (hex FF) or 95 (hex 5F). +.P +Thus, apart from \ec?, these escapes generate the same character code values as +they do in an ASCII environment, though the meanings of the values mostly +differ. For example, \ecG always generates code value 7, which is BEL in ASCII +but DEL in EBCDIC. +.P +The sequence \ec? generates DEL (127, hex 7F) in an ASCII environment, but +because 127 is not a control character in EBCDIC, Perl makes it generate the +APC character. Unfortunately, there are several variants of EBCDIC. In most of +them the APC character has the value 255 (hex FF), but in the one Perl calls +POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC +values, PCRE2 makes \ec? generate 95; otherwise it generates 255. .P After \e0 up to two further octal digits are read. If there are fewer than two -digits, just those that are present are used. Thus the sequence \e0\ex\e07 -specifies two binary zeros followed by a BEL character (code value 7). Make +digits, just those that are present are used. Thus the sequence \e0\ex\e015 +specifies two binary zeros followed by a CR character (code value 13). Make sure you supply two digits after the initial zero if the pattern character that follows is itself an octal digit. .P @@ -3347,6 +3361,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 19 May 2015 +Last updated: 13 June 2015 Copyright (c) 1997-2015 University of Cambridge. 
.fi diff --git a/doc/pcre2syntax.3 b/doc/pcre2syntax.3 index 398be1e..0e2aae8 100644 --- a/doc/pcre2syntax.3 +++ b/doc/pcre2syntax.3 @@ -1,4 +1,4 @@ -.TH PCRE2SYNTAX 3 "23 April 2015" "PCRE2 10.20" +.TH PCRE2SYNTAX 3 "13 June 2015" "PCRE2 10.20" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" @@ -21,9 +21,11 @@ documentation. This document contains a quick-reference summary of the syntax. . .SH "ESCAPED CHARACTERS" .rs +.sp +This table applies to ASCII and Unicode environments. .sp \ea alarm, that is, the BEL character (hex 07) - \ecx "control-x", where x is any ASCII character + \ecx "control-x", where x is any ASCII printing character \ee escape (hex 1B) \ef form feed (hex 0C) \en newline (hex 0A) @@ -47,7 +49,8 @@ in the .\" HREF \fBpcre2pattern\fP .\" -documentation. +documentation, where details of escape processing in EBCDIC environments are +also given. .P When \ex is not followed by {, from zero to two hexadecimal digits are read, but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to @@ -567,6 +570,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 23 April 2015 +Last updated: 13 June 2015 Copyright (c) 1997-2015 University of Cambridge. .fi diff --git a/src/pcre2_compile.c b/src/pcre2_compile.c index 9ad36d0..17dc77b 100644 --- a/src/pcre2_compile.c +++ b/src/pcre2_compile.c @@ -268,8 +268,9 @@ invalid. */ in UTF-8 mode. It runs from '0' to 'z'. */ #ifndef EBCDIC -#define ESCAPES_FIRST CHAR_0 -#define ESCAPES_LAST CHAR_z +#define ESCAPES_FIRST CHAR_0 +#define ESCAPES_LAST CHAR_z +#define ESCAPES_UPPER_CASE (-32) /* Add this to upper case a letter */ static const short int escapes[] = { 0, 0, @@ -319,12 +320,14 @@ It runs from 'a' to '9'. For some minimal testing of EBCDIC features, the code is sometimes compiled on an ASCII system. In this case, we must not use CHAR_a because it is defined as 'a', which of course picks up the ASCII value. 
*/ -#if 'a' == 0x81 /* Check for a real EBCDIC environment */ -#define ESCAPES_FIRST CHAR_a -#define ESCAPES_LAST CHAR_9 -#else /* Testing in an ASCII environment */ +#if 'a' == 0x81 /* Check for a real EBCDIC environment */ +#define ESCAPES_FIRST CHAR_a +#define ESCAPES_LAST CHAR_9 +#define ESCAPES_UPPER_CASE (+64) /* Add this to upper case a letter */ +#else /* Testing in an ASCII environment */ #define ESCAPES_FIRST ((unsigned char)'\x81') /* EBCDIC 'a' */ #define ESCAPES_LAST ((unsigned char)'\xf9') /* EBCDIC '9' */ +#define ESCAPES_UPPER_CASE (-32) /* Add this to upper case a letter */ #endif static const short int escapes[] = { @@ -346,6 +349,11 @@ static const short int escapes[] = { /* F8 */ 0, 0 }; +/* We also need a table of characters that may follow \c in an EBCDIC +environment for characters 0-31. */ + +static unsigned char ebcdic_escape_c[] = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_"; + #endif /* EBCDIC */ @@ -1238,7 +1246,7 @@ for (code = first_significant_code(code + PRIV(OP_lengths)[*code], TRUE); PCRE2_SPTR ccode; c = *code; - + /* Skip over forward assertions; the other assertions are skipped by first_significant_code() with a TRUE final argument. */ @@ -2076,30 +2084,62 @@ else } /* End of Perl-style \x handling */ break; - /* For \c, a following letter is upper-cased; then the 0x40 bit is flipped. - An error is given if the byte following \c is not a printable ASCII - character. This coding is ASCII-specific, but then the whole concept of \cx - is ASCII-specific. (However, an EBCDIC equivalent has now been added.) */ + /* The handling of \c is different in ASCII and EBCDIC environments. In an + ASCII (or Unicode) environment, an error is given if the character + following \c is not a printable ASCII character. Otherwise, the following + character is upper-cased if it is a letter, and after that the 0x40 bit is + flipped. The result is the value of the escape. 
+ In an EBCDIC environment the handling of \c is compatible with the + specification in the perlebcdic document. The following character must be + a letter or one of a small number of special characters. These provide a + means of defining the character values 0-31. + + For testing the EBCDIC handling of \c in an ASCII environment, recognize + the EBCDIC value of 'c' explicitly. */ + +#if defined EBCDIC && 'a' != 0x81 + case 0x83: +#else case CHAR_c: +#endif + c = *(++ptr); + if (c >= CHAR_a && c <= CHAR_z) c += ESCAPES_UPPER_CASE; if (c == CHAR_NULL && ptr >= cb->end_pattern) { *errorcodeptr = ERR2; break; } + + /* Handle \c in an ASCII/Unicode environment. */ + #ifndef EBCDIC /* ASCII/UTF-8 coding */ if (c < 32 || c > 126) /* Excludes all non-printable ASCII */ { *errorcodeptr = ERR68; break; } - if (c >= CHAR_a && c <= CHAR_z) c -= 32; c ^= 0x40; -#else /* EBCDIC coding */ - if (c >= CHAR_a && c <= CHAR_z) c += 64; - c ^= 0xC0; -#endif + + /* Handle \c in an EBCDIC environment. The special case \c? is converted to + 255 (0xff) or 95 (0x5f) if other characters suggest we are using the POSIX-BC + encoding. (This is the way Perl indicates that it handles \c?.) The other + valid sequences correspond to a list of specific characters. */ + +#else + if (c == CHAR_QUESTION_MARK) + c = ('\\' == 188 && '`' == 74)? 0x5f : 0xff; + else + { + for (i = 0; i < 32; i++) + { + if (c == ebcdic_escape_c[i]) break; + } + if (i < 32) c = i; else *errorcodeptr = ERR68; + } +#endif /* EBCDIC */ + break; /* Any other alphanumeric following \ is an error. 
Perl gives an error only @@ -6492,7 +6532,7 @@ for (;; ptr++) goto FAILED; } recno = recno * 10 + *ptr++ - CHAR_0; - } + } if (*ptr != (PCRE2_UCHAR)terminator) { diff --git a/src/pcre2_error.c b/src/pcre2_error.c index 48320ad..b88351e 100644 --- a/src/pcre2_error.c +++ b/src/pcre2_error.c @@ -145,7 +145,11 @@ static const char compile_error_texts[] = "different names for subpatterns of the same number are not allowed\0" "(*MARK) must have an argument\0" "non-hex character in \\x{} (closing brace missing?)\0" +#ifndef EBCDIC "\\c must be followed by a printable ASCII character\0" +#else + "\\c must be followed by a letter or one of [\\]^_?\0" +#endif "\\k is not followed by a braced, angle-bracketed, or quoted name\0" /* 70 */ "internal error: unknown opcode in find_fixedlength()\0" diff --git a/testdata/testinputEBC b/testdata/testinputEBC index e3f1154..7aa845c 100644 --- a/testdata/testinputEBC +++ b/testdata/testinputEBC @@ -1,5 +1,5 @@ # This is a specialized test for checking, when PCRE2 is compiled with the -# EBCDIC option but in an ASCII environment, that newline and white space +# EBCDIC option but in an ASCII environment, that newline, white space, and \c # functionality is working. It catches cases where explicit values such as 0x0a # have been used instead of names like CHAR_LF. Needless to say, it is not a # genuine EBCDIC test! 
In patterns, alphabetic characters that follow a @@ -117,5 +117,18 @@ A\x25B A\x0bB A\x0cB + +# Test \c functionality + +/\ƒ@\ƒA\ƒb\ƒC\ƒd\ƒE\ƒf\ƒG\ƒh\ƒI\ƒJ\ƒK\ƒl\ƒm\ƒN\ƒO\ƒp\ƒq\ƒr\ƒS\ƒT\ƒu\ƒV\ƒW\ƒX\ƒy\ƒZ/ + \x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f + +/\ƒ[\ƒ\\ƒ]\ƒ^\ƒ_/ + \x18\x19\x1a\x1b\x1c\x1d\x1e\x1f + +/\ƒ?/ + A\xffB + +/\ƒ&/ # End diff --git a/testdata/testoutputEBC b/testdata/testoutputEBC index 7904d22..5bb9797 100644 --- a/testdata/testoutputEBC +++ b/testdata/testoutputEBC @@ -1,5 +1,5 @@ # This is a specialized test for checking, when PCRE2 is compiled with the -# EBCDIC option but in an ASCII environment, that newline and white space +# EBCDIC option but in an ASCII environment, that newline, white space, and \c # functionality is working. It catches cases where explicit values such as 0x0a # have been used instead of names like CHAR_LF. Needless to say, it is not a # genuine EBCDIC test! In patterns, alphabetic characters that follow a @@ -178,5 +178,22 @@ No match No match A\x0cB No match + +# Test \c functionality + +/\ƒ@\ƒA\ƒb\ƒC\ƒd\ƒE\ƒf\ƒG\ƒh\ƒI\ƒJ\ƒK\ƒl\ƒm\ƒN\ƒO\ƒp\ƒq\ƒr\ƒS\ƒT\ƒu\ƒV\ƒW\ƒX\ƒy\ƒZ/ + \x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f + 0: \x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a + +/\ƒ[\ƒ\\ƒ]\ƒ^\ƒ_/ + \x18\x19\x1a\x1b\x1c\x1d\x1e\x1f + 0: \x1b\x1c\x1d\x1e\x1f + +/\ƒ?/ + A\xffB + 0: \xff + +/\ƒ&/ +Failed: error 168 at offset 2: \c\x20must\x20be\x20followed\x20by\x20a\x20letter\x20or\x20one\x20of\x20[\]^_\x3f # End
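
Reviewer note: the two \c mappings described in this patch can be compared side by side with a small standalone sketch. It is not code from PCRE2. The function names ctrl_escape_ascii and ctrl_escape_ebcdic_rules and the posix_bc flag are hypothetical, and the EBCDIC rules are written with ASCII character literals, so the sketch assumes it is built on an ASCII host (much as the testinputEBC tests are). A return value of -1 stands in for the compile-time error.

```c
#include <string.h>

/* Hypothetical helper, not part of PCRE2: the value of \c followed by ch
   under the ASCII/Unicode rule, or -1 where the patch reports an error. */
int ctrl_escape_ascii(int ch)
{
    if (ch < 32 || ch > 126) return -1;      /* printable ASCII only */
    if (ch >= 'a' && ch <= 'z') ch -= 32;    /* upper-case a letter */
    return ch ^ 0x40;                        /* flip bit 6 (hex 40) */
}

/* Hypothetical helper, not part of PCRE2: the perlebcdic-style rule, with
   the allowed characters spelled as ASCII literals for illustration.
   posix_bc is an assumed flag: nonzero selects the POSIX-BC value of APC. */
int ctrl_escape_ebcdic_rules(int ch, int posix_bc)
{
    /* Position in this string is the resulting code value, 0 to 31. */
    static const char allowed[] = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_";
    const char *p;

    if (ch == '?') return posix_bc ? 0x5f : 0xff;   /* APC, not DEL */
    if (ch >= 'a' && ch <= 'z') ch -= 32;           /* upper-case a letter */
    if (ch <= 0 || ch > 126) return -1;             /* keep strchr off NUL */
    p = strchr(allowed, ch);
    return p ? (int)(p - allowed) : -1;
}
```

With these helpers, '{' and ';' are accepted by the ASCII rule but rejected by the EBCDIC rule, while letters and @ [ \ ] ^ _ give the same values under both; only ? differs in value. The table lookup is used for the EBCDIC rule because EBCDIC letters are not contiguous, so a single bit flip cannot produce 1-26.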