Make \c operate like Perl in EBCDIC environments.

Philip.Hazel 2015-06-13 16:10:14 +00:00
parent 149aa29209
commit c146059c22
7 changed files with 134 additions and 40 deletions


@ -161,6 +161,9 @@ itself. For example: /^(?:(?(1)x|)+)+$()/.
41. In an EBCDIC environment, \a in a pattern was converted to the ASCII 41. In an EBCDIC environment, \a in a pattern was converted to the ASCII
instead of the EBCDIC value. instead of the EBCDIC value.
42. The handling of \c in an EBCDIC environment has been revised so that it is
now compatible with the specification in Perl's perlebcdic page.
Version 10.10 06-March-2015 Version 10.10 06-March-2015
--------------------------- ---------------------------


@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "19 May 2015" "PCRE2 10.20" .TH PCRE2PATTERN 3 "13 June 2015" "PCRE2 10.20"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS" .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -337,10 +337,11 @@ A second use of backslash provides a way of encoding non-printing characters
in patterns in a visible manner. There is no restriction on the appearance of in patterns in a visible manner. There is no restriction on the appearance of
non-printing characters in a pattern, but when a pattern is being prepared by non-printing characters in a pattern, but when a pattern is being prepared by
text editing, it is often easier to use one of the following escape sequences text editing, it is often easier to use one of the following escape sequences
than the binary character it represents: than the binary character it represents. In an ASCII or Unicode environment,
these escapes are as follows:
.sp .sp
\ea alarm, that is, the BEL character (hex 07) \ea alarm, that is, the BEL character (hex 07)
\ecx "control-x", where x is any ASCII character \ecx "control-x", where x is any printable ASCII character
\ee escape (hex 1B) \ee escape (hex 1B)
\ef form feed (hex 0C) \ef form feed (hex 0C)
\en linefeed (hex 0A) \en linefeed (hex 0A)
@ -351,27 +352,40 @@ than the binary character it represents:
\eo{ddd..} character with octal code ddd.. \eo{ddd..} character with octal code ddd..
\exhh character with hex code hh \exhh character with hex code hh
\ex{hhh..} character with hex code hhh.. (default mode) \ex{hhh..} character with hex code hhh.. (default mode)
\euhhhh character with hex code hhhh (only when PCRE2_ALT_BSUX is set) \euhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
.sp .sp
The precise effect of \ecx on ASCII characters is as follows: if x is a lower The precise effect of \ecx on ASCII characters is as follows: if x is a lower
case letter, it is converted to upper case. Then bit 6 of the character (hex case letter, it is converted to upper case. Then bit 6 of the character (hex
40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A), 40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the
code unit following \ec has a value greater than 127, a compile-time error code unit following \ec has a value less than 32 or greater than 126, a
occurs. This locks out non-ASCII characters in all modes. compile-time error occurs. This locks out non-printable ASCII characters in all
modes.
.P .P
The \ec facility was designed for use with ASCII characters, but with the When PCRE2 is compiled in EBCDIC mode, \ea, \ee, \ef, \en, \er, and \et
extension to Unicode it is even less useful than it once was. It is, however, generate the appropriate EBCDIC code values. The \ec escape is processed
recognized when PCRE2 is compiled in EBCDIC mode, where data items are always as specified for Perl in the \fBperlebcdic\fP document. The only characters
bytes. In this mode, all values are valid after \ec. If the next character is a that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ], ^, _, or ?. Any
lower case letter, it is converted to upper case. Then the 0xc0 bits of the other character provokes a compile-time error. The sequence \ec@ encodes
byte are inverted. Thus \ecA becomes hex 01, as in ASCII (A is C1), but because character code 0; the letters (in either case) encode characters 1-26 (hex 01
the EBCDIC letters are disjoint, \ecZ becomes hex 29 (Z is E9), and other to hex 1A); [, \e, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and
characters also generate different values. \ec? becomes either 255 (hex FF) or 95 (hex 5F).
.P
Thus, apart from \ec?, these escapes generate the same character code values as
they do in an ASCII environment, though the meanings of the values mostly
differ. For example, \ecG always generates code value 7, which is BEL in ASCII
but DEL in EBCDIC.
.P
The sequence \ec? generates DEL (127, hex 7F) in an ASCII environment, but
because 127 is not a control character in EBCDIC, Perl makes it generate the
APC character. Unfortunately, there are several variants of EBCDIC. In most of
them the APC character has the value 255 (hex FF), but in the one Perl calls
POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
values, PCRE2 makes \e? generate 95; otherwise it generates 255.
.P .P
After \e0 up to two further octal digits are read. If there are fewer than two After \e0 up to two further octal digits are read. If there are fewer than two
digits, just those that are present are used. Thus the sequence \e0\ex\e07 digits, just those that are present are used. Thus the sequence \e0\ex\e015
specifies two binary zeros followed by a BEL character (code value 7). Make specifies two binary zeros followed by a CR character (code value 13). Make
sure you supply two digits after the initial zero if the pattern character that sure you supply two digits after the initial zero if the pattern character that
follows is itself an octal digit. follows is itself an octal digit.
.P .P
@ -3347,6 +3361,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 19 May 2015 Last updated: 13 June 2015
Copyright (c) 1997-2015 University of Cambridge. Copyright (c) 1997-2015 University of Cambridge.
.fi .fi
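
The ASCII/Unicode rule for \cx documented in the pcre2pattern changes above (reject anything outside printable ASCII, upper-case a letter, then invert bit 6) can be illustrated with a minimal sketch. The function name is hypothetical and is not part of PCRE2; the sketch assumes it is compiled on a host whose execution character set is ASCII.

```c
/* Sketch of the ASCII/Unicode \cx rule. Returns the resulting code value,
   or -1 where PCRE2 would report a compile-time error. The name
   ascii_control_value is hypothetical, not a PCRE2 function. */
static int ascii_control_value(int c)
{
  if (c < 32 || c > 126) return -1;    /* not a printable ASCII character */
  if (c >= 'a' && c <= 'z') c -= 32;   /* upper-case a lower case letter */
  return c ^ 0x40;                     /* invert bit 6 (hex 40) */
}
```

With this rule \cA and \ca both give hex 01, \cZ gives hex 1A, \c{ gives hex 3B, \c; gives hex 7B, and \c? gives hex 7F (DEL).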


@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "23 April 2015" "PCRE2 10.20" .TH PCRE2SYNTAX 3 "13 June 2015" "PCRE2 10.20"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -21,9 +21,11 @@ documentation. This document contains a quick-reference summary of the syntax.
. .
.SH "ESCAPED CHARACTERS" .SH "ESCAPED CHARACTERS"
.rs .rs
.sp
This table applies to ASCII and Unicode environments.
.sp .sp
\ea alarm, that is, the BEL character (hex 07) \ea alarm, that is, the BEL character (hex 07)
\ecx "control-x", where x is any ASCII character \ecx "control-x", where x is any ASCII printing character
\ee escape (hex 1B) \ee escape (hex 1B)
\ef form feed (hex 0C) \ef form feed (hex 0C)
\en newline (hex 0A) \en newline (hex 0A)
@ -47,7 +49,8 @@ in the
.\" HREF .\" HREF
\fBpcre2pattern\fP \fBpcre2pattern\fP
.\" .\"
documentation. documentation, where details of escape processing in EBCDIC environments are
also given.
.P .P
When \ex is not followed by {, from zero to two hexadecimal digits are read, When \ex is not followed by {, from zero to two hexadecimal digits are read,
but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
@ -567,6 +570,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 23 April 2015 Last updated: 13 June 2015
Copyright (c) 1997-2015 University of Cambridge. Copyright (c) 1997-2015 University of Cambridge.
.fi .fi


@ -268,8 +268,9 @@ invalid. */
in UTF-8 mode. It runs from '0' to 'z'. */ in UTF-8 mode. It runs from '0' to 'z'. */
#ifndef EBCDIC #ifndef EBCDIC
#define ESCAPES_FIRST CHAR_0 #define ESCAPES_FIRST CHAR_0
#define ESCAPES_LAST CHAR_z #define ESCAPES_LAST CHAR_z
#define ESCAPES_UPPER_CASE (-32) /* Add this to upper case a letter */
static const short int escapes[] = { static const short int escapes[] = {
0, 0, 0, 0,
@ -319,12 +320,14 @@ It runs from 'a' to '9'. For some minimal testing of EBCDIC features, the code
is sometimes compiled on an ASCII system. In this case, we must not use CHAR_a is sometimes compiled on an ASCII system. In this case, we must not use CHAR_a
because it is defined as 'a', which of course picks up the ASCII value. */ because it is defined as 'a', which of course picks up the ASCII value. */
#if 'a' == 0x81 /* Check for a real EBCDIC environment */ #if 'a' == 0x81 /* Check for a real EBCDIC environment */
#define ESCAPES_FIRST CHAR_a #define ESCAPES_FIRST CHAR_a
#define ESCAPES_LAST CHAR_9 #define ESCAPES_LAST CHAR_9
#else /* Testing in an ASCII environment */ #define ESCAPES_UPPER_CASE (+64) /* Add this to upper case a letter */
#else /* Testing in an ASCII environment */
#define ESCAPES_FIRST ((unsigned char)'\x81') /* EBCDIC 'a' */ #define ESCAPES_FIRST ((unsigned char)'\x81') /* EBCDIC 'a' */
#define ESCAPES_LAST ((unsigned char)'\xf9') /* EBCDIC '9' */ #define ESCAPES_LAST ((unsigned char)'\xf9') /* EBCDIC '9' */
#define ESCAPES_UPPER_CASE (-32) /* Add this to upper case a letter */
#endif #endif
static const short int escapes[] = { static const short int escapes[] = {
@ -346,6 +349,11 @@ static const short int escapes[] = {
/* F8 */ 0, 0 /* F8 */ 0, 0
}; };
/* We also need a table of characters that may follow \c in an EBCDIC
environment for characters 0-31. */
static unsigned char ebcdic_escape_c[] = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_";
#endif /* EBCDIC */ #endif /* EBCDIC */
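
The two values of ESCAPES_UPPER_CASE follow from the code layouts: in ASCII, upper case letters sit 32 below their lower case forms, while in EBCDIC they sit 64 above (a is hex 81 and A is hex C1). A minimal sketch, with hypothetical helper names and with the code values written out numerically so that it behaves the same on any host:

```c
/* Hypothetical helpers, not PCRE2 code. Code values are given explicitly
   so the result does not depend on the host character set. */
static int ascii_to_upper(int c)
{
  return (c >= 0x61 && c <= 0x7A) ? c - 32 : c;   /* a-z -> A-Z */
}

static int ebcdic_to_upper(int c)
{
  /* EBCDIC lower case letters occupy three separate runs. */
  if ((c >= 0x81 && c <= 0x89) ||   /* a-i */
      (c >= 0x91 && c <= 0x99) ||   /* j-r */
      (c >= 0xA2 && c <= 0xA9))     /* s-z */
    return c + 64;
  return c;
}
```

Unlike this sketch, the committed code tests the single range CHAR_a to CHAR_z, which in EBCDIC also covers some non-letters; those values are later rejected because they do not appear in the ebcdic_escape_c table.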
@ -2076,30 +2084,62 @@ else
} /* End of Perl-style \x handling */ } /* End of Perl-style \x handling */
break; break;
/* For \c, a following letter is upper-cased; then the 0x40 bit is flipped. /* The handling of \c is different in ASCII and EBCDIC environments. In an
An error is given if the byte following \c is not a printable ASCII ASCII (or Unicode) environment, an error is given if the character
character. This coding is ASCII-specific, but then the whole concept of \cx following \c is not a printable ASCII character. Otherwise, the following
is ASCII-specific. (However, an EBCDIC equivalent has now been added.) */ character is upper-cased if it is a letter, and after that the 0x40 bit is
flipped. The result is the value of the escape.
In an EBCDIC environment the handling of \c is compatible with the
specification in the perlebcdic document. The following character must be
a letter or one of a small number of special characters. These provide a
means of defining the character values 0-31.
For testing the EBCDIC handling of \c in an ASCII environment, recognize
the EBCDIC value of 'c' explicitly. */
#if defined EBCDIC && 'a' != 0x81
case 0x83:
#else
case CHAR_c: case CHAR_c:
#endif
c = *(++ptr); c = *(++ptr);
if (c >= CHAR_a && c <= CHAR_z) c += ESCAPES_UPPER_CASE;
if (c == CHAR_NULL && ptr >= cb->end_pattern) if (c == CHAR_NULL && ptr >= cb->end_pattern)
{ {
*errorcodeptr = ERR2; *errorcodeptr = ERR2;
break; break;
} }
/* Handle \c in an ASCII/Unicode environment. */
#ifndef EBCDIC /* ASCII/UTF-8 coding */ #ifndef EBCDIC /* ASCII/UTF-8 coding */
if (c < 32 || c > 126) /* Excludes all non-printable ASCII */ if (c < 32 || c > 126) /* Excludes all non-printable ASCII */
{ {
*errorcodeptr = ERR68; *errorcodeptr = ERR68;
break; break;
} }
if (c >= CHAR_a && c <= CHAR_z) c -= 32;
c ^= 0x40; c ^= 0x40;
#else /* EBCDIC coding */
if (c >= CHAR_a && c <= CHAR_z) c += 64; /* Handle \c in an EBCDIC environment. The special case \c? is converted to
c ^= 0xC0; 255 (0xff) or 95 (0x5f) if other characters suggest we are using the POSIX-BC
#endif encoding. (This is the way Perl indicates that it handles \c?.) The other
valid sequences correspond to a list of specific characters. */
#else
if (c == CHAR_QUESTION_MARK)
c = ('\\' == 188 && '`' == 74)? 0x5f : 0xff;
else
{
for (i = 0; i < 32; i++)
{
if (c == ebcdic_escape_c[i]) break;
}
if (i < 32) c = i; else *errorcodeptr = ERR68;
}
#endif /* EBCDIC */
break; break;
/* Any other alphanumeric following \ is an error. Perl gives an error only /* Any other alphanumeric following \ is an error. Perl gives an error only
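
The table-driven EBCDIC handling of \c in the hunk above can be tried outside PCRE2 with a small standalone sketch. The function and table names here are hypothetical. Because the tables are built from character literals, the lookup does not depend on the host character set; on an ASCII host the POSIX-BC test is false, so \c? yields hex FF.

```c
#include <string.h>

/* Characters that may follow \c, in order of the value they encode (0-31).
   These names are hypothetical, not PCRE2 identifiers. */
static const char ctrl_chars[] = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_";
static const char lower_case[] = "abcdefghijklmnopqrstuvwxyz";

/* Returns the value encoded by \c followed by c (assumed to be a byte
   value), or -1 where a compile-time error would be given. */
static int table_control_value(int c)
{
  const char *p;
  if (c == 0) return -1;               /* strchr would match the terminator */
  if (c == '?')                        /* same POSIX-BC check as the commit */
    return ('\\' == 188 && '`' == 74) ? 0x5f : 0xff;
  p = strchr(lower_case, c);           /* lower case letters encode 1-26 */
  if (p != NULL) return (int)(p - lower_case) + 1;
  p = strchr(ctrl_chars, c);           /* position in the table is the value */
  return (p != NULL) ? (int)(p - ctrl_chars) : -1;
}
```

The sketch folds case with a second lookup table rather than adding ESCAPES_UPPER_CASE, which is a simplification and not how the committed code does it.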


@ -145,7 +145,11 @@ static const char compile_error_texts[] =
"different names for subpatterns of the same number are not allowed\0" "different names for subpatterns of the same number are not allowed\0"
"(*MARK) must have an argument\0" "(*MARK) must have an argument\0"
"non-hex character in \\x{} (closing brace missing?)\0" "non-hex character in \\x{} (closing brace missing?)\0"
#ifndef EBCDIC
"\\c must be followed by a printable ASCII character\0" "\\c must be followed by a printable ASCII character\0"
#else
"\\c must be followed by a letter or one of [\\]^_?\0"
#endif
"\\k is not followed by a braced, angle-bracketed, or quoted name\0" "\\k is not followed by a braced, angle-bracketed, or quoted name\0"
/* 70 */ /* 70 */
"internal error: unknown opcode in find_fixedlength()\0" "internal error: unknown opcode in find_fixedlength()\0"

testdata/testinputEBC

@ -1,5 +1,5 @@
# This is a specialized test for checking, when PCRE2 is compiled with the # This is a specialized test for checking, when PCRE2 is compiled with the
# EBCDIC option but in an ASCII environment, that newline and white space # EBCDIC option but in an ASCII environment, that newline, white space, and \c
# functionality is working. It catches cases where explicit values such as 0x0a # functionality is working. It catches cases where explicit values such as 0x0a
# have been used instead of names like CHAR_LF. Needless to say, it is not a # have been used instead of names like CHAR_LF. Needless to say, it is not a
# genuine EBCDIC test! In patterns, alphabetic characters that follow a # genuine EBCDIC test! In patterns, alphabetic characters that follow a
@ -118,4 +118,17 @@
A\x0bB A\x0bB
A\x0cB A\x0cB
# Test \c functionality
/\ƒ@\ƒA\ƒb\ƒC\ƒd\ƒE\ƒf\ƒG\ƒh\ƒI\ƒJ\ƒK\ƒl\ƒm\ƒN\ƒO\ƒp\ƒq\ƒr\ƒS\ƒT\ƒu\ƒV\ƒW\ƒX\ƒy\ƒZ/
\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
/\ƒ[\ƒ\\ƒ]\ƒ^\ƒ_/
\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
/\ƒ?/
A\xffB
/\ƒ&/
# End # End


@ -1,5 +1,5 @@
# This is a specialized test for checking, when PCRE2 is compiled with the # This is a specialized test for checking, when PCRE2 is compiled with the
# EBCDIC option but in an ASCII environment, that newline and white space # EBCDIC option but in an ASCII environment, that newline, white space, and \c
# functionality is working. It catches cases where explicit values such as 0x0a # functionality is working. It catches cases where explicit values such as 0x0a
# have been used instead of names like CHAR_LF. Needless to say, it is not a # have been used instead of names like CHAR_LF. Needless to say, it is not a
# genuine EBCDIC test! In patterns, alphabetic characters that follow a # genuine EBCDIC test! In patterns, alphabetic characters that follow a
@ -179,4 +179,21 @@ No match
A\x0cB A\x0cB
No match No match
# Test \c functionality
/\ƒ@\ƒA\ƒb\ƒC\ƒd\ƒE\ƒf\ƒG\ƒh\ƒI\ƒJ\ƒK\ƒl\ƒm\ƒN\ƒO\ƒp\ƒq\ƒr\ƒS\ƒT\ƒu\ƒV\ƒW\ƒX\ƒy\ƒZ/
\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
0: \x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a
/\ƒ[\ƒ\\ƒ]\ƒ^\ƒ_/
\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
0: \x1b\x1c\x1d\x1e\x1f
/\ƒ?/
A\xffB
0: \xff
/\ƒ&/
Failed: error 168 at offset 2: \c\x20must\x20be\x20followed\x20by\x20a\x20letter\x20or\x20one\x20of\x20[\]^_\x3f
# End # End