Make \c operate like Perl in EBCDIC environments.

This commit is contained in:
Philip.Hazel 2015-06-13 16:10:14 +00:00
parent 149aa29209
commit c146059c22
7 changed files with 134 additions and 40 deletions

View File

@ -161,6 +161,9 @@ itself. For example: /^(?:(?(1)x|)+)+$()/.
41. In an EBCDIC environment, \a in a pattern was converted to the ASCII
instead of the EBCDIC value.
42. The handling of \c in an EBCDIC environment has been revised so that it is
now compatible with the specification in Perl's perlebcdic page.
Version 10.10 06-March-2015
---------------------------

View File

@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "19 May 2015" "PCRE2 10.20"
.TH PCRE2PATTERN 3 "13 June 2015" "PCRE2 10.20"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -337,10 +337,11 @@ A second use of backslash provides a way of encoding non-printing characters
in patterns in a visible manner. There is no restriction on the appearance of
non-printing characters in a pattern, but when a pattern is being prepared by
text editing, it is often easier to use one of the following escape sequences
than the binary character it represents:
than the binary character it represents. In an ASCII or Unicode environment,
these escapes are as follows:
.sp
\ea alarm, that is, the BEL character (hex 07)
\ecx "control-x", where x is any ASCII character
\ecx "control-x", where x is any printable ASCII character
\ee escape (hex 1B)
\ef form feed (hex 0C)
\en linefeed (hex 0A)
@ -351,27 +352,40 @@ than the binary character it represents:
\eo{ddd..} character with octal code ddd..
\exhh character with hex code hh
\ex{hhh..} character with hex code hhh.. (default mode)
\euhhhh character with hex code hhhh (only when PCRE2_ALT_BSUX is set)
\euhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
.sp
The precise effect of \ecx on ASCII characters is as follows: if x is a lower
case letter, it is converted to upper case. Then bit 6 of the character (hex
40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the
code unit following \ec has a value greater than 127, a compile-time error
occurs. This locks out non-ASCII characters in all modes.
code unit following \ec has a value less than 32 or greater than 126, a
compile-time error occurs. This locks out non-printable ASCII characters in all
modes.
.P
The \ec facility was designed for use with ASCII characters, but with the
extension to Unicode it is even less useful than it once was. It is, however,
recognized when PCRE2 is compiled in EBCDIC mode, where data items are always
bytes. In this mode, all values are valid after \ec. If the next character is a
lower case letter, it is converted to upper case. Then the 0xc0 bits of the
byte are inverted. Thus \ecA becomes hex 01, as in ASCII (A is C1), but because
the EBCDIC letters are disjoint, \ecZ becomes hex 29 (Z is E9), and other
characters also generate different values.
When PCRE2 is compiled in EBCDIC mode, \ea, \ee, \ef, \en, \er, and \et
generate the appropriate EBCDIC code values. The \ec escape is processed
as specified for Perl in the \fBperlebcdic\fP document. The only characters
that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ], ^, _, or ?. Any
other character provokes a compile-time error. The sequence \e@ encodes
character code 0; the letters (in either case) encode characters 1-26 (hex 01
to hex 1A); [, \e, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and
\e? becomes either 255 (hex FF) or 95 (hex 5F).
.P
Thus, apart from \e?, these escapes generate the same character code values as
they do in an ASCII environment, though the meanings of the values mostly
differ. For example, \eG always generates code value 7, which is BEL in ASCII
but DEL in EBCDIC.
.P
The sequence \e? generates DEL (127, hex 7F) in an ASCII environment, but
because 127 is not a control character in EBCDIC, Perl makes it generate the
APC character. Unfortunately, there are several variants of EBCDIC. In most of
them the APC character has the value 255 (hex FF), but in the one Perl calls
POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
values, PCRE2 makes \e? generate 95; otherwise it generates 255.
.P
After \e0 up to two further octal digits are read. If there are fewer than two
digits, just those that are present are used. Thus the sequence \e0\ex\e07
specifies two binary zeros followed by a BEL character (code value 7). Make
digits, just those that are present are used. Thus the sequence \e0\ex\e015
specifies two binary zeros followed by a CR character (code value 13). Make
sure you supply two digits after the initial zero if the pattern character that
follows is itself an octal digit.
.P
@ -3347,6 +3361,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 19 May 2015
Last updated: 13 June 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi

View File

@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "23 April 2015" "PCRE2 10.20"
.TH PCRE2SYNTAX 3 "13 June 2015" "PCRE2 10.20"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -21,9 +21,11 @@ documentation. This document contains a quick-reference summary of the syntax.
.
.SH "ESCAPED CHARACTERS"
.rs
.sp
This table applies to ASCII and Unicode environments.
.sp
\ea alarm, that is, the BEL character (hex 07)
\ecx "control-x", where x is any ASCII character
\ecx "control-x", where x is any ASCII printing character
\ee escape (hex 1B)
\ef form feed (hex 0C)
\en newline (hex 0A)
@ -47,7 +49,8 @@ in the
.\" HREF
\fBpcre2pattern\fP
.\"
documentation.
documentation, where details of escape processing in EBCDIC environments are
also given.
.P
When \ex is not followed by {, from zero to two hexadecimal digits are read,
but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
@ -567,6 +570,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 23 April 2015
Last updated: 13 June 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi

View File

@ -268,8 +268,9 @@ invalid. */
in UTF-8 mode. It runs from '0' to 'z'. */
#ifndef EBCDIC
#define ESCAPES_FIRST CHAR_0
#define ESCAPES_LAST CHAR_z
#define ESCAPES_FIRST CHAR_0
#define ESCAPES_LAST CHAR_z
#define ESCAPES_UPPER_CASE (-32) /* Add this to upper case a letter */
static const short int escapes[] = {
0, 0,
@ -319,12 +320,14 @@ It runs from 'a' to '9'. For some minimal testing of EBCDIC features, the code
is sometimes compiled on an ASCII system. In this case, we must not use CHAR_a
because it is defined as 'a', which of course picks up the ASCII value. */
#if 'a' == 0x81 /* Check for a real EBCDIC environment */
#define ESCAPES_FIRST CHAR_a
#define ESCAPES_LAST CHAR_9
#else /* Testing in an ASCII environment */
#if 'a' == 0x81 /* Check for a real EBCDIC environment */
#define ESCAPES_FIRST CHAR_a
#define ESCAPES_LAST CHAR_9
#define ESCAPES_UPPER_CASE (+64) /* Add this to upper case a letter */
#else /* Testing in an ASCII environment */
#define ESCAPES_FIRST ((unsigned char)'\x81') /* EBCDIC 'a' */
#define ESCAPES_LAST ((unsigned char)'\xf9') /* EBCDIC '9' */
#define ESCAPES_UPPER_CASE (-32) /* Add this to upper case a letter */
#endif
static const short int escapes[] = {
@ -346,6 +349,11 @@ static const short int escapes[] = {
/* F8 */ 0, 0
};
/* We also need a table of characters that may follow \c in an EBCDIC
environment for characters 0-31. */
static unsigned char ebcdic_escape_c[] = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_";
#endif /* EBCDIC */
@ -2076,30 +2084,62 @@ else
} /* End of Perl-style \x handling */
break;
/* For \c, a following letter is upper-cased; then the 0x40 bit is flipped.
An error is given if the byte following \c is not a printable ASCII
character. This coding is ASCII-specific, but then the whole concept of \cx
is ASCII-specific. (However, an EBCDIC equivalent has now been added.) */
/* The handling of \c is different in ASCII and EBCDIC environments. In an
ASCII (or Unicode) environment, an error is given if the character
following \c is not a printable ASCII character. Otherwise, the following
character is upper-cased if it is a letter, and after that the 0x40 bit is
flipped. The result is the value of the escape.
In an EBCDIC environment the handling of \c is compatible with the
specification in the perlebcdic document. The following character must be
a letter or one of small number of special characters. These provide a
means of defining the character values 0-31.
For testing the EBCDIC handling of \c in an ASCII environment, recognize
the EBCDIC value of 'c' explicitly. */
#if defined EBCDIC && 'a' != 0x81
case 0x83:
#else
case CHAR_c:
#endif
c = *(++ptr);
if (c >= CHAR_a && c <= CHAR_z) c += ESCAPES_UPPER_CASE;
if (c == CHAR_NULL && ptr >= cb->end_pattern)
{
*errorcodeptr = ERR2;
break;
}
/* Handle \c in an ASCII/Unicode environment. */
#ifndef EBCDIC /* ASCII/UTF-8 coding */
if (c < 32 || c > 126) /* Excludes all non-printable ASCII */
{
*errorcodeptr = ERR68;
break;
}
if (c >= CHAR_a && c <= CHAR_z) c -= 32;
c ^= 0x40;
#else /* EBCDIC coding */
if (c >= CHAR_a && c <= CHAR_z) c += 64;
c ^= 0xC0;
#endif
/* Handle \c in an EBCDIC environment. The special case \c? is converted to
255 (0xff) or 95 (0x5f) if other character suggest we are using th POSIX-BC
encoding. (This is the way Perl indicates that it handles \c?.) The other
valid sequences correspond to a list of specific characters. */
#else
if (c == CHAR_QUESTION_MARK)
c = ('\\' == 188 && '`' == 74)? 0x5f : 0xff;
else
{
for (i = 0; i < 32; i++)
{
if (c == ebcdic_escape_c[i]) break;
}
if (i < 32) c = i; else *errorcodeptr = ERR68;
}
#endif /* EBCDIC */
break;
/* Any other alphanumeric following \ is an error. Perl gives an error only

View File

@ -145,7 +145,11 @@ static const char compile_error_texts[] =
"different names for subpatterns of the same number are not allowed\0"
"(*MARK) must have an argument\0"
"non-hex character in \\x{} (closing brace missing?)\0"
#ifndef EBCDIC
"\\c must be followed by a printable ASCII character\0"
#else
"\\c must be followed by a letter or one of [\\]^_?\0"
#endif
"\\k is not followed by a braced, angle-bracketed, or quoted name\0"
/* 70 */
"internal error: unknown opcode in find_fixedlength()\0"

15
testdata/testinputEBC vendored
View File

@ -1,5 +1,5 @@
# This is a specialized test for checking, when PCRE2 is compiled with the
# EBCDIC option but in an ASCII environment, that newline and white space
# EBCDIC option but in an ASCII environment, that newline, white space, and \c
# functionality is working. It catches cases where explicit values such as 0x0a
# have been used instead of names like CHAR_LF. Needless to say, it is not a
# genuine EBCDIC test! In patterns, alphabetic characters that follow a
@ -118,4 +118,17 @@
A\x0bB
A\x0cB
# Test \c functionality
/\ƒ@\ƒA\ƒb\ƒC\ƒd\ƒE\ƒf\ƒG\ƒh\ƒI\ƒJ\ƒK\ƒl\ƒm\ƒN\ƒO\ƒp\ƒq\ƒr\ƒS\ƒT\ƒu\ƒV\ƒW\ƒX\ƒy\ƒZ/
\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
/\ƒ[\ƒ\\ƒ]\ƒ^\ƒ_/
\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
/\ƒ?/
A\xffB
/\ƒ&/
# End

View File

@ -1,5 +1,5 @@
# This is a specialized test for checking, when PCRE2 is compiled with the
# EBCDIC option but in an ASCII environment, that newline and white space
# EBCDIC option but in an ASCII environment, that newline, white space, and \c
# functionality is working. It catches cases where explicit values such as 0x0a
# have been used instead of names like CHAR_LF. Needless to say, it is not a
# genuine EBCDIC test! In patterns, alphabetic characters that follow a
@ -179,4 +179,21 @@ No match
A\x0cB
No match
# Test \c functionality
/\ƒ@\ƒA\ƒb\ƒC\ƒd\ƒE\ƒf\ƒG\ƒh\ƒI\ƒJ\ƒK\ƒl\ƒm\ƒN\ƒO\ƒp\ƒq\ƒr\ƒS\ƒT\ƒu\ƒV\ƒW\ƒX\ƒy\ƒZ/
\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
0: \x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a
/\ƒ[\ƒ\\ƒ]\ƒ^\ƒ_/
\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
0: \x1b\x1c\x1d\x1e\x1f
/\ƒ?/
A\xffB
0: \xff
/\ƒ&/
Failed: error 168 at offset 2: \c\x20must\x20be\x20followed\x20by\x20a\x20letter\x20or\x20one\x20of\x20[\]^_\x3f
# End