Make \c operate like Perl in EBCDIC environments.
parent
149aa29209
commit
c146059c22
|
@ -161,6 +161,9 @@ itself. For example: /^(?:(?(1)x|)+)+$()/.
|
||||||
41. In an EBCDIC environment, \a in a pattern was converted to the ASCII
|
41. In an EBCDIC environment, \a in a pattern was converted to the ASCII
|
||||||
instead of the EBCDIC value.
|
instead of the EBCDIC value.
|
||||||
|
|
||||||
|
42. The handling of \c in an EBCDIC environment has been revised so that it is
|
||||||
|
now compatible with the specification in Perl's perlebcdic page.
|
||||||
|
|
||||||
|
|
||||||
Version 10.10 06-March-2015
|
Version 10.10 06-March-2015
|
||||||
---------------------------
|
---------------------------
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2PATTERN 3 "19 May 2015" "PCRE2 10.20"
|
.TH PCRE2PATTERN 3 "13 June 2015" "PCRE2 10.20"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||||
|
@ -337,10 +337,11 @@ A second use of backslash provides a way of encoding non-printing characters
|
||||||
in patterns in a visible manner. There is no restriction on the appearance of
|
in patterns in a visible manner. There is no restriction on the appearance of
|
||||||
non-printing characters in a pattern, but when a pattern is being prepared by
|
non-printing characters in a pattern, but when a pattern is being prepared by
|
||||||
text editing, it is often easier to use one of the following escape sequences
|
text editing, it is often easier to use one of the following escape sequences
|
||||||
than the binary character it represents:
|
than the binary character it represents. In an ASCII or Unicode environment,
|
||||||
|
these escapes are as follows:
|
||||||
.sp
|
.sp
|
||||||
\ea alarm, that is, the BEL character (hex 07)
|
\ea alarm, that is, the BEL character (hex 07)
|
||||||
\ecx "control-x", where x is any ASCII character
|
\ecx "control-x", where x is any printable ASCII character
|
||||||
\ee escape (hex 1B)
|
\ee escape (hex 1B)
|
||||||
\ef form feed (hex 0C)
|
\ef form feed (hex 0C)
|
||||||
\en linefeed (hex 0A)
|
\en linefeed (hex 0A)
|
||||||
|
@ -351,27 +352,40 @@ than the binary character it represents:
|
||||||
\eo{ddd..} character with octal code ddd..
|
\eo{ddd..} character with octal code ddd..
|
||||||
\exhh character with hex code hh
|
\exhh character with hex code hh
|
||||||
\ex{hhh..} character with hex code hhh.. (default mode)
|
\ex{hhh..} character with hex code hhh.. (default mode)
|
||||||
\euhhhh character with hex code hhhh (only when PCRE2_ALT_BSUX is set)
|
\euhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
|
||||||
.sp
|
.sp
|
||||||
The precise effect of \ecx on ASCII characters is as follows: if x is a lower
|
The precise effect of \ecx on ASCII characters is as follows: if x is a lower
|
||||||
case letter, it is converted to upper case. Then bit 6 of the character (hex
|
case letter, it is converted to upper case. Then bit 6 of the character (hex
|
||||||
40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
|
40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
|
||||||
but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the
|
but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the
|
||||||
code unit following \ec has a value greater than 127, a compile-time error
|
code unit following \ec has a value less than 32 or greater than 126, a
|
||||||
occurs. This locks out non-ASCII characters in all modes.
|
compile-time error occurs. This locks out non-printable ASCII characters in all
|
||||||
|
modes.
|
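The ASCII rule described in this hunk can be sketched in a few lines of C. This is an illustration only, not the code in the PCRE2 sources: the function name `ascii_control_escape` and the `-1` error return are assumptions made for the example, and the character literals assume the sketch is built on an ASCII host.

```c
#include <assert.h>

/* Hypothetical helper (not part of PCRE2): returns the code value that \cx
   yields in an ASCII/Unicode environment, or -1 where the documentation says
   a compile-time error occurs. Assumes an ASCII host character set. */
int ascii_control_escape(int x)
{
  if (x < 32 || x > 126) return -1;    /* only printable ASCII may follow \c */
  if (x >= 'a' && x <= 'z') x -= 32;   /* convert a lower case letter to upper case */
  return x ^ 0x40;                     /* invert bit 6 (hex 40) */
}

/* The examples quoted in the documentation text. */
void ascii_control_escape_examples(void)
{
  assert(ascii_control_escape('A') == 0x01);
  assert(ascii_control_escape('z') == 0x1A);
  assert(ascii_control_escape('{') == 0x3B);
  assert(ascii_control_escape(';') == 0x7B);
  assert(ascii_control_escape(0x7F) == -1);
}
```

Returning a sentinel keeps the sketch self-contained; the real code instead sets an error code (ERR68 in the C hunk of this commit).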
||||||
.P
|
.P
|
||||||
The \ec facility was designed for use with ASCII characters, but with the
|
When PCRE2 is compiled in EBCDIC mode, \ea, \ee, \ef, \en, \er, and \et
|
||||||
extension to Unicode it is even less useful than it once was. It is, however,
|
generate the appropriate EBCDIC code values. The \ec escape is processed
|
||||||
recognized when PCRE2 is compiled in EBCDIC mode, where data items are always
|
as specified for Perl in the \fBperlebcdic\fP document. The only characters
|
||||||
bytes. In this mode, all values are valid after \ec. If the next character is a
|
that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ], ^, _, or ?. Any
|
||||||
lower case letter, it is converted to upper case. Then the 0xc0 bits of the
|
other character provokes a compile-time error. The sequence \ec@ encodes
|
||||||
byte are inverted. Thus \ecA becomes hex 01, as in ASCII (A is C1), but because
|
character code 0; the letters (in either case) encode characters 1-26 (hex 01
|
||||||
the EBCDIC letters are disjoint, \ecZ becomes hex 29 (Z is E9), and other
|
to hex 1A); [, \e, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and
|
||||||
characters also generate different values.
|
\ec? becomes either 255 (hex FF) or 95 (hex 5F).
|
||||||
|
.P
|
||||||
|
Thus, apart from \ec?, these escapes generate the same character code values as
|
||||||
|
they do in an ASCII environment, though the meanings of the values mostly
|
||||||
|
differ. For example, \ecG always generates code value 7, which is BEL in ASCII
|
||||||
|
but DEL in EBCDIC.
|
||||||
|
.P
|
||||||
|
The sequence \ec? generates DEL (127, hex 7F) in an ASCII environment, but
|
||||||
|
because 127 is not a control character in EBCDIC, Perl makes it generate the
|
||||||
|
APC character. Unfortunately, there are several variants of EBCDIC. In most of
|
||||||
|
them the APC character has the value 255 (hex FF), but in the one Perl calls
|
||||||
|
POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
|
||||||
|
values, PCRE2 makes \ec? generate 95; otherwise it generates 255.
|
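The EBCDIC rule amounts to a table lookup in which the position of the character in the list gives the code value. The sketch below is a hedged illustration, not the committed code: the name `ebcdic_control_escape`, the `posix_bc` flag, and the `-1` error return are assumptions for the example. In a real EBCDIC build the string literal is compiled in the EBCDIC character set; here it is simply in the host character set, so the same logic can be tried on an ASCII machine.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical helper (not part of PCRE2): maps the character after \c to a
   code value using the Perl-compatible EBCDIC rule. posix_bc is an assumed
   flag that selects the POSIX-BC value for \c?. Returns -1 for a character
   that would provoke a compile-time error. */
int ebcdic_control_escape(int x, int posix_bc)
{
  /* 32 entries: index 0 is '@', 1-26 are the letters, 27-31 are [ \ ] ^ _ */
  static const char allowed[] = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_";
  int i;

  if (x == '?') return posix_bc ? 0x5f : 0xff;          /* APC character */
  if (x >= 0 && x <= 255 && islower(x)) x = toupper(x);  /* letters in either case */
  for (i = 0; i < 32; i++)
    if (x == allowed[i]) return i;                       /* position is the value */
  return -1;
}
```

With this rule, `G` and `g` both give 7, `[` gives 27, and `&` is rejected, which matches the test cases added later in this commit.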
||||||
.P
|
.P
|
||||||
After \e0 up to two further octal digits are read. If there are fewer than two
|
After \e0 up to two further octal digits are read. If there are fewer than two
|
||||||
digits, just those that are present are used. Thus the sequence \e0\ex\e07
|
digits, just those that are present are used. Thus the sequence \e0\ex\e015
|
||||||
specifies two binary zeros followed by a BEL character (code value 7). Make
|
specifies two binary zeros followed by a CR character (code value 13). Make
|
||||||
sure you supply two digits after the initial zero if the pattern character that
|
sure you supply two digits after the initial zero if the pattern character that
|
||||||
follows is itself an octal digit.
|
follows is itself an octal digit.
|
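The "up to two further octal digits" rule for \e0 can be illustrated with a short C sketch. The function name `octal_after_zero` and its `consumed` out-parameter are hypothetical, chosen only for this example; the argument points at the text immediately after the `\0` in the pattern.

```c
#include <stddef.h>

/* Hypothetical helper (not part of PCRE2): reads at most two octal digits
   following "\0" and returns the resulting code value. If consumed is not
   NULL, it receives the number of digits that were used. */
int octal_after_zero(const char *s, int *consumed)
{
  int value = 0;
  int n = 0;

  while (n < 2 && s[n] >= '0' && s[n] <= '7')
  {
    value = value * 8 + (s[n] - '0');   /* accumulate one octal digit */
    n++;
  }
  if (consumed != NULL) *consumed = n;
  return value;
}
```

For the text `15` this yields 13 (CR), and for text that does not start with an octal digit it yields 0 with no digits consumed. For `0123` only `01` is read, which is why two digits should be supplied when the next pattern character is itself an octal digit.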
||||||
.P
|
.P
|
||||||
|
@ -3347,6 +3361,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 19 May 2015
|
Last updated: 13 June 2015
|
||||||
Copyright (c) 1997-2015 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2SYNTAX 3 "23 April 2015" "PCRE2 10.20"
|
.TH PCRE2SYNTAX 3 "13 June 2015" "PCRE2 10.20"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
||||||
|
@ -21,9 +21,11 @@ documentation. This document contains a quick-reference summary of the syntax.
|
||||||
.
|
.
|
||||||
.SH "ESCAPED CHARACTERS"
|
.SH "ESCAPED CHARACTERS"
|
||||||
.rs
|
.rs
|
||||||
|
.sp
|
||||||
|
This table applies to ASCII and Unicode environments.
|
||||||
.sp
|
.sp
|
||||||
\ea alarm, that is, the BEL character (hex 07)
|
\ea alarm, that is, the BEL character (hex 07)
|
||||||
\ecx "control-x", where x is any ASCII character
|
\ecx "control-x", where x is any ASCII printing character
|
||||||
\ee escape (hex 1B)
|
\ee escape (hex 1B)
|
||||||
\ef form feed (hex 0C)
|
\ef form feed (hex 0C)
|
||||||
\en newline (hex 0A)
|
\en newline (hex 0A)
|
||||||
|
@ -47,7 +49,8 @@ in the
|
||||||
.\" HREF
|
.\" HREF
|
||||||
\fBpcre2pattern\fP
|
\fBpcre2pattern\fP
|
||||||
.\"
|
.\"
|
||||||
documentation.
|
documentation, where details of escape processing in EBCDIC environments are
|
||||||
|
also given.
|
||||||
.P
|
.P
|
||||||
When \ex is not followed by {, from zero to two hexadecimal digits are read,
|
When \ex is not followed by {, from zero to two hexadecimal digits are read,
|
||||||
but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
|
but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
|
||||||
|
@ -567,6 +570,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 23 April 2015
|
Last updated: 13 June 2015
|
||||||
Copyright (c) 1997-2015 University of Cambridge.
|
Copyright (c) 1997-2015 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -268,8 +268,9 @@ invalid. */
|
||||||
in UTF-8 mode. It runs from '0' to 'z'. */
|
in UTF-8 mode. It runs from '0' to 'z'. */
|
||||||
|
|
||||||
#ifndef EBCDIC
|
#ifndef EBCDIC
|
||||||
#define ESCAPES_FIRST CHAR_0
|
#define ESCAPES_FIRST CHAR_0
|
||||||
#define ESCAPES_LAST CHAR_z
|
#define ESCAPES_LAST CHAR_z
|
||||||
|
#define ESCAPES_UPPER_CASE (-32) /* Add this to upper case a letter */
|
||||||
|
|
||||||
static const short int escapes[] = {
|
static const short int escapes[] = {
|
||||||
0, 0,
|
0, 0,
|
||||||
|
@ -319,12 +320,14 @@ It runs from 'a' to '9'. For some minimal testing of EBCDIC features, the code
|
||||||
is sometimes compiled on an ASCII system. In this case, we must not use CHAR_a
|
is sometimes compiled on an ASCII system. In this case, we must not use CHAR_a
|
||||||
because it is defined as 'a', which of course picks up the ASCII value. */
|
because it is defined as 'a', which of course picks up the ASCII value. */
|
||||||
|
|
||||||
#if 'a' == 0x81 /* Check for a real EBCDIC environment */
|
#if 'a' == 0x81 /* Check for a real EBCDIC environment */
|
||||||
#define ESCAPES_FIRST CHAR_a
|
#define ESCAPES_FIRST CHAR_a
|
||||||
#define ESCAPES_LAST CHAR_9
|
#define ESCAPES_LAST CHAR_9
|
||||||
#else /* Testing in an ASCII environment */
|
#define ESCAPES_UPPER_CASE (+64) /* Add this to upper case a letter */
|
||||||
|
#else /* Testing in an ASCII environment */
|
||||||
#define ESCAPES_FIRST ((unsigned char)'\x81') /* EBCDIC 'a' */
|
#define ESCAPES_FIRST ((unsigned char)'\x81') /* EBCDIC 'a' */
|
||||||
#define ESCAPES_LAST ((unsigned char)'\xf9') /* EBCDIC '9' */
|
#define ESCAPES_LAST ((unsigned char)'\xf9') /* EBCDIC '9' */
|
||||||
|
#define ESCAPES_UPPER_CASE (-32) /* Add this to upper case a letter */
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
static const short int escapes[] = {
|
static const short int escapes[] = {
|
||||||
|
@ -346,6 +349,11 @@ static const short int escapes[] = {
|
||||||
/* F8 */ 0, 0
|
/* F8 */ 0, 0
|
||||||
};
|
};
|
||||||
|
|
||||||
|
/* We also need a table of characters that may follow \c in an EBCDIC
|
||||||
|
environment for characters 0-31. */
|
||||||
|
|
||||||
|
static unsigned char ebcdic_escape_c[] = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_";
|
||||||
|
|
||||||
#endif /* EBCDIC */
|
#endif /* EBCDIC */
|
||||||
|
|
||||||
|
|
||||||
|
@ -2076,30 +2084,62 @@ else
|
||||||
} /* End of Perl-style \x handling */
|
} /* End of Perl-style \x handling */
|
||||||
break;
|
break;
|
||||||
|
|
||||||
/* For \c, a following letter is upper-cased; then the 0x40 bit is flipped.
|
/* The handling of \c is different in ASCII and EBCDIC environments. In an
|
||||||
An error is given if the byte following \c is not a printable ASCII
|
ASCII (or Unicode) environment, an error is given if the character
|
||||||
character. This coding is ASCII-specific, but then the whole concept of \cx
|
following \c is not a printable ASCII character. Otherwise, the following
|
||||||
is ASCII-specific. (However, an EBCDIC equivalent has now been added.) */
|
character is upper-cased if it is a letter, and after that the 0x40 bit is
|
||||||
|
flipped. The result is the value of the escape.
|
||||||
|
|
||||||
|
In an EBCDIC environment the handling of \c is compatible with the
|
||||||
|
specification in the perlebcdic document. The following character must be
|
||||||
|
a letter or one of a small number of special characters. These provide a
|
||||||
|
means of defining the character values 0-31.
|
||||||
|
|
||||||
|
For testing the EBCDIC handling of \c in an ASCII environment, recognize
|
||||||
|
the EBCDIC value of 'c' explicitly. */
|
||||||
|
|
||||||
|
#if defined EBCDIC && 'a' != 0x81
|
||||||
|
case 0x83:
|
||||||
|
#else
|
||||||
case CHAR_c:
|
case CHAR_c:
|
||||||
|
#endif
|
||||||
|
|
||||||
c = *(++ptr);
|
c = *(++ptr);
|
||||||
|
if (c >= CHAR_a && c <= CHAR_z) c += ESCAPES_UPPER_CASE;
|
||||||
if (c == CHAR_NULL && ptr >= cb->end_pattern)
|
if (c == CHAR_NULL && ptr >= cb->end_pattern)
|
||||||
{
|
{
|
||||||
*errorcodeptr = ERR2;
|
*errorcodeptr = ERR2;
|
||||||
break;
|
break;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/* Handle \c in an ASCII/Unicode environment. */
|
||||||
|
|
||||||
#ifndef EBCDIC /* ASCII/UTF-8 coding */
|
#ifndef EBCDIC /* ASCII/UTF-8 coding */
|
||||||
if (c < 32 || c > 126) /* Excludes all non-printable ASCII */
|
if (c < 32 || c > 126) /* Excludes all non-printable ASCII */
|
||||||
{
|
{
|
||||||
*errorcodeptr = ERR68;
|
*errorcodeptr = ERR68;
|
||||||
break;
|
break;
|
||||||
}
|
}
|
||||||
if (c >= CHAR_a && c <= CHAR_z) c -= 32;
|
|
||||||
c ^= 0x40;
|
c ^= 0x40;
|
||||||
#else /* EBCDIC coding */
|
|
||||||
if (c >= CHAR_a && c <= CHAR_z) c += 64;
|
/* Handle \c in an EBCDIC environment. The special case \c? is converted to
|
||||||
c ^= 0xC0;
|
255 (0xff), or to 95 (0x5f) if other characters suggest we are using the POSIX-BC
|
||||||
#endif
|
encoding. (This is the way Perl indicates that it handles \c?.) The other
|
||||||
|
valid sequences correspond to a list of specific characters. */
|
||||||
|
|
||||||
|
#else
|
||||||
|
if (c == CHAR_QUESTION_MARK)
|
||||||
|
c = ('\\' == 188 && '`' == 74)? 0x5f : 0xff;
|
||||||
|
else
|
||||||
|
{
|
||||||
|
for (i = 0; i < 32; i++)
|
||||||
|
{
|
||||||
|
if (c == ebcdic_escape_c[i]) break;
|
||||||
|
}
|
||||||
|
if (i < 32) c = i; else *errorcodeptr = ERR68;
|
||||||
|
}
|
||||||
|
#endif /* EBCDIC */
|
||||||
|
|
||||||
break;
|
break;
|
||||||
|
|
||||||
/* Any other alphanumeric following \ is an error. Perl gives an error only
|
/* Any other alphanumeric following \ is an error. Perl gives an error only
|
||||||
|
|
|
@ -145,7 +145,11 @@ static const char compile_error_texts[] =
|
||||||
"different names for subpatterns of the same number are not allowed\0"
|
"different names for subpatterns of the same number are not allowed\0"
|
||||||
"(*MARK) must have an argument\0"
|
"(*MARK) must have an argument\0"
|
||||||
"non-hex character in \\x{} (closing brace missing?)\0"
|
"non-hex character in \\x{} (closing brace missing?)\0"
|
||||||
|
#ifndef EBCDIC
|
||||||
"\\c must be followed by a printable ASCII character\0"
|
"\\c must be followed by a printable ASCII character\0"
|
||||||
|
#else
|
||||||
|
"\\c must be followed by a letter or one of [\\]^_?\0"
|
||||||
|
#endif
|
||||||
"\\k is not followed by a braced, angle-bracketed, or quoted name\0"
|
"\\k is not followed by a braced, angle-bracketed, or quoted name\0"
|
||||||
/* 70 */
|
/* 70 */
|
||||||
"internal error: unknown opcode in find_fixedlength()\0"
|
"internal error: unknown opcode in find_fixedlength()\0"
|
||||||
|
|
|
@ -1,5 +1,5 @@
|
||||||
# This is a specialized test for checking, when PCRE2 is compiled with the
|
# This is a specialized test for checking, when PCRE2 is compiled with the
|
||||||
# EBCDIC option but in an ASCII environment, that newline and white space
|
# EBCDIC option but in an ASCII environment, that newline, white space, and \c
|
||||||
# functionality is working. It catches cases where explicit values such as 0x0a
|
# functionality is working. It catches cases where explicit values such as 0x0a
|
||||||
# have been used instead of names like CHAR_LF. Needless to say, it is not a
|
# have been used instead of names like CHAR_LF. Needless to say, it is not a
|
||||||
# genuine EBCDIC test! In patterns, alphabetic characters that follow a
|
# genuine EBCDIC test! In patterns, alphabetic characters that follow a
|
||||||
|
@ -118,4 +118,17 @@
|
||||||
A\x0bB
|
A\x0bB
|
||||||
A\x0cB
|
A\x0cB
|
||||||
|
|
||||||
|
# Test \c functionality
|
||||||
|
|
||||||
|
/\ƒ@\ƒA\ƒb\ƒC\ƒd\ƒE\ƒf\ƒG\ƒh\ƒI\ƒJ\ƒK\ƒl\ƒm\ƒN\ƒO\ƒp\ƒq\ƒr\ƒS\ƒT\ƒu\ƒV\ƒW\ƒX\ƒy\ƒZ/
|
||||||
|
\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
|
||||||
|
|
||||||
|
/\ƒ[\ƒ\\ƒ]\ƒ^\ƒ_/
|
||||||
|
\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
|
||||||
|
|
||||||
|
/\ƒ?/
|
||||||
|
A\xffB
|
||||||
|
|
||||||
|
/\ƒ&/
|
||||||
|
|
||||||
# End
|
# End
|
||||||
|
|
|
@ -1,5 +1,5 @@
|
||||||
# This is a specialized test for checking, when PCRE2 is compiled with the
|
# This is a specialized test for checking, when PCRE2 is compiled with the
|
||||||
# EBCDIC option but in an ASCII environment, that newline and white space
|
# EBCDIC option but in an ASCII environment, that newline, white space, and \c
|
||||||
# functionality is working. It catches cases where explicit values such as 0x0a
|
# functionality is working. It catches cases where explicit values such as 0x0a
|
||||||
# have been used instead of names like CHAR_LF. Needless to say, it is not a
|
# have been used instead of names like CHAR_LF. Needless to say, it is not a
|
||||||
# genuine EBCDIC test! In patterns, alphabetic characters that follow a
|
# genuine EBCDIC test! In patterns, alphabetic characters that follow a
|
||||||
|
@ -179,4 +179,21 @@ No match
|
||||||
A\x0cB
|
A\x0cB
|
||||||
No match
|
No match
|
||||||
|
|
||||||
|
# Test \c functionality
|
||||||
|
|
||||||
|
/\ƒ@\ƒA\ƒb\ƒC\ƒd\ƒE\ƒf\ƒG\ƒh\ƒI\ƒJ\ƒK\ƒl\ƒm\ƒN\ƒO\ƒp\ƒq\ƒr\ƒS\ƒT\ƒu\ƒV\ƒW\ƒX\ƒy\ƒZ/
|
||||||
|
\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
|
||||||
|
0: \x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a
|
||||||
|
|
||||||
|
/\ƒ[\ƒ\\ƒ]\ƒ^\ƒ_/
|
||||||
|
\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
|
||||||
|
0: \x1b\x1c\x1d\x1e\x1f
|
||||||
|
|
||||||
|
/\ƒ?/
|
||||||
|
A\xffB
|
||||||
|
0: \xff
|
||||||
|
|
||||||
|
/\ƒ&/
|
||||||
|
Failed: error 168 at offset 2: \c\x20must\x20be\x20followed\x20by\x20a\x20letter\x20or\x20one\x20of\x20[\]^_\x3f
|
||||||
|
|
||||||
# End
|
# End
|
||||||
|
|