Make \c operate like Perl in EBCDIC environments.
parent 149aa29209
commit c146059c22
|
@ -161,6 +161,9 @@ itself. For example: /^(?:(?(1)x|)+)+$()/.
|
|||
41. In an EBCDIC environment, \a in a pattern was converted to the ASCII
|
||||
instead of the EBCDIC value.
|
||||
|
||||
42. The handling of \c in an EBCDIC environment has been revised so that it is
|
||||
now compatible with the specification in Perl's perlebcdic page.
|
||||
|
||||
|
||||
Version 10.10 06-March-2015
|
||||
---------------------------
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2PATTERN 3 "19 May 2015" "PCRE2 10.20"
|
||||
.TH PCRE2PATTERN 3 "13 June 2015" "PCRE2 10.20"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||
|
@ -337,10 +337,11 @@ A second use of backslash provides a way of encoding non-printing characters
|
|||
in patterns in a visible manner. There is no restriction on the appearance of
|
||||
non-printing characters in a pattern, but when a pattern is being prepared by
|
||||
text editing, it is often easier to use one of the following escape sequences
|
||||
than the binary character it represents:
|
||||
than the binary character it represents. In an ASCII or Unicode environment,
|
||||
these escapes are as follows:
|
||||
.sp
|
||||
\ea alarm, that is, the BEL character (hex 07)
|
||||
\ecx "control-x", where x is any ASCII character
|
||||
\ecx "control-x", where x is any printable ASCII character
|
||||
\ee escape (hex 1B)
|
||||
\ef form feed (hex 0C)
|
||||
\en linefeed (hex 0A)
|
||||
|
@ -351,27 +352,40 @@ than the binary character it represents:
|
|||
\eo{ddd..} character with octal code ddd..
|
||||
\exhh character with hex code hh
|
||||
\ex{hhh..} character with hex code hhh.. (default mode)
|
||||
\euhhhh character with hex code hhhh (only when PCRE2_ALT_BSUX is set)
|
||||
\euhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
|
||||
.sp
|
||||
The precise effect of \ecx on ASCII characters is as follows: if x is a lower
|
||||
case letter, it is converted to upper case. Then bit 6 of the character (hex
|
||||
40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
|
||||
but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the
|
||||
code unit following \ec has a value greater than 127, a compile-time error
|
||||
occurs. This locks out non-ASCII characters in all modes.
|
||||
code unit following \ec has a value less than 32 or greater than 126, a
|
||||
compile-time error occurs. This locks out non-printable ASCII characters in all
|
||||
modes.
|
||||
.P
|
||||
The \ec facility was designed for use with ASCII characters, but with the
|
||||
extension to Unicode it is even less useful than it once was. It is, however,
|
||||
recognized when PCRE2 is compiled in EBCDIC mode, where data items are always
|
||||
bytes. In this mode, all values are valid after \ec. If the next character is a
|
||||
lower case letter, it is converted to upper case. Then the 0xc0 bits of the
|
||||
byte are inverted. Thus \ecA becomes hex 01, as in ASCII (A is C1), but because
|
||||
the EBCDIC letters are disjoint, \ecZ becomes hex 29 (Z is E9), and other
|
||||
characters also generate different values.
|
||||
When PCRE2 is compiled in EBCDIC mode, \ea, \ee, \ef, \en, \er, and \et
|
||||
generate the appropriate EBCDIC code values. The \ec escape is processed
|
||||
as specified for Perl in the \fBperlebcdic\fP document. The only characters
|
||||
that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ], ^, _, or ?. Any
|
||||
other character provokes a compile-time error. The sequence \ec@ encodes
|
||||
character code 0; the letters (in either case) encode characters 1-26 (hex 01
|
||||
to hex 1A); [, \e, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and
|
||||
\ec? becomes either 255 (hex FF) or 95 (hex 5F).
|
||||
.P
|
||||
Thus, apart from \ec?, these escapes generate the same character code values as
|
||||
they do in an ASCII environment, though the meanings of the values mostly
|
||||
differ. For example, \ecG always generates code value 7, which is BEL in ASCII
|
||||
but DEL in EBCDIC.
|
||||
.P
|
||||
The sequence \ec? generates DEL (127, hex 7F) in an ASCII environment, but
|
||||
because 127 is not a control character in EBCDIC, Perl makes it generate the
|
||||
APC character. Unfortunately, there are several variants of EBCDIC. In most of
|
||||
them the APC character has the value 255 (hex FF), but in the one Perl calls
|
||||
POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
|
||||
values, PCRE2 makes \ec? generate 95; otherwise it generates 255.
|
||||
.P
|
||||
After \e0 up to two further octal digits are read. If there are fewer than two
|
||||
digits, just those that are present are used. Thus the sequence \e0\ex\e07
|
||||
specifies two binary zeros followed by a BEL character (code value 7). Make
|
||||
digits, just those that are present are used. Thus the sequence \e0\ex\e015
|
||||
specifies two binary zeros followed by a CR character (code value 13). Make
|
||||
sure you supply two digits after the initial zero if the pattern character that
|
||||
follows is itself an octal digit.
|
||||
.P
|
||||
|
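The ASCII/Unicode rule for \cx described in the hunk above can be illustrated with a minimal sketch. The function name `ascii_control_escape` and the -1 error return are assumptions made for illustration and are not part of PCRE2; the sketch also assumes it is compiled on an ASCII host, so that 'a' and 'z' have their ASCII values.

```c
/* Hypothetical helper, not a PCRE2 function. Returns the code value that
   \cx encodes in an ASCII/Unicode environment, or -1 in the cases where the
   documentation says a compile-time error occurs. */
static int ascii_control_escape(int x)
{
  if (x < 32 || x > 126) return -1;     /* not a printable ASCII character */
  if (x >= 'a' && x <= 'z') x -= 32;    /* convert a lower case letter to upper case */
  return x ^ 0x40;                      /* invert the hex 40 bit */
}
```

Under these assumptions, 'A' to 'Z' map to hex 01 to hex 1A, '{' maps to hex 3B, and ';' maps to hex 7B, matching the examples in the man page text.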
@ -3347,6 +3361,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 19 May 2015
|
||||
Last updated: 13 June 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
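The rule for \0 quoted in the pcre2pattern hunk (up to two further octal digits are read after the initial zero) can be sketched as follows. The helper name `read_octal_after_zero` and its interface are hypothetical and chosen only for this illustration; it is not how PCRE2 structures its own parser.

```c
#include <stddef.h>

/* Hypothetical helper, not a PCRE2 function. p points at the character
   immediately after "\0". At most two octal digits are consumed; *used
   receives the number of digits read and the return value is the code. */
static int read_octal_after_zero(const char *p, size_t *used)
{
  int value = 0;
  size_t n = 0;
  while (n < 2 && p[n] >= '0' && p[n] <= '7')
  {
    value = value * 8 + (p[n] - '0');   /* accumulate one octal digit */
    n++;
  }
  *used = n;
  return value;
}
```

With this sketch, the text "15" following \0 yields code value 13 (CR), and a following non-octal character yields a binary zero with no digits consumed, which is why the documentation advises supplying two digits when the next pattern character is itself an octal digit.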
@ -1,4 +1,4 @@
|
|||
.TH PCRE2SYNTAX 3 "23 April 2015" "PCRE2 10.20"
|
||||
.TH PCRE2SYNTAX 3 "13 June 2015" "PCRE2 10.20"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
||||
|
@ -21,9 +21,11 @@ documentation. This document contains a quick-reference summary of the syntax.
|
|||
.
|
||||
.SH "ESCAPED CHARACTERS"
|
||||
.rs
|
||||
.sp
|
||||
This table applies to ASCII and Unicode environments.
|
||||
.sp
|
||||
\ea alarm, that is, the BEL character (hex 07)
|
||||
\ecx "control-x", where x is any ASCII character
|
||||
\ecx "control-x", where x is any ASCII printing character
|
||||
\ee escape (hex 1B)
|
||||
\ef form feed (hex 0C)
|
||||
\en newline (hex 0A)
|
||||
|
@ -47,7 +49,8 @@ in the
|
|||
.\" HREF
|
||||
\fBpcre2pattern\fP
|
||||
.\"
|
||||
documentation.
|
||||
documentation, where details of escape processing in EBCDIC environments are
|
||||
also given.
|
||||
.P
|
||||
When \ex is not followed by {, from zero to two hexadecimal digits are read,
|
||||
but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
|
||||
|
@ -567,6 +570,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 23 April 2015
|
||||
Last updated: 13 June 2015
|
||||
Copyright (c) 1997-2015 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -268,8 +268,9 @@ invalid. */
|
|||
in UTF-8 mode. It runs from '0' to 'z'. */
|
||||
|
||||
#ifndef EBCDIC
|
||||
#define ESCAPES_FIRST CHAR_0
|
||||
#define ESCAPES_LAST CHAR_z
|
||||
#define ESCAPES_FIRST CHAR_0
|
||||
#define ESCAPES_LAST CHAR_z
|
||||
#define ESCAPES_UPPER_CASE (-32) /* Add this to upper case a letter */
|
||||
|
||||
static const short int escapes[] = {
|
||||
0, 0,
|
||||
|
@ -319,12 +320,14 @@ It runs from 'a' to '9'. For some minimal testing of EBCDIC features, the code
|
|||
is sometimes compiled on an ASCII system. In this case, we must not use CHAR_a
|
||||
because it is defined as 'a', which of course picks up the ASCII value. */
|
||||
|
||||
#if 'a' == 0x81 /* Check for a real EBCDIC environment */
|
||||
#define ESCAPES_FIRST CHAR_a
|
||||
#define ESCAPES_LAST CHAR_9
|
||||
#else /* Testing in an ASCII environment */
|
||||
#if 'a' == 0x81 /* Check for a real EBCDIC environment */
|
||||
#define ESCAPES_FIRST CHAR_a
|
||||
#define ESCAPES_LAST CHAR_9
|
||||
#define ESCAPES_UPPER_CASE (+64) /* Add this to upper case a letter */
|
||||
#else /* Testing in an ASCII environment */
|
||||
#define ESCAPES_FIRST ((unsigned char)'\x81') /* EBCDIC 'a' */
|
||||
#define ESCAPES_LAST ((unsigned char)'\xf9') /* EBCDIC '9' */
|
||||
#define ESCAPES_UPPER_CASE (-32) /* Add this to upper case a letter */
|
||||
#endif
|
||||
|
||||
static const short int escapes[] = {
|
||||
|
@ -346,6 +349,11 @@ static const short int escapes[] = {
|
|||
/* F8 */ 0, 0
|
||||
};
|
||||
|
||||
/* We also need a table of characters that may follow \c in an EBCDIC
|
||||
environment for characters 0-31. */
|
||||
|
||||
static unsigned char ebcdic_escape_c[] = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_";
|
||||
|
||||
#endif /* EBCDIC */
|
||||
|
||||
|
||||
|
@ -1238,7 +1246,7 @@ for (code = first_significant_code(code + PRIV(OP_lengths)[*code], TRUE);
|
|||
PCRE2_SPTR ccode;
|
||||
|
||||
c = *code;
|
||||
|
||||
|
||||
/* Skip over forward assertions; the other assertions are skipped by
|
||||
first_significant_code() with a TRUE final argument. */
|
||||
|
||||
|
@ -2076,30 +2084,62 @@ else
|
|||
} /* End of Perl-style \x handling */
|
||||
break;
|
||||
|
||||
/* For \c, a following letter is upper-cased; then the 0x40 bit is flipped.
|
||||
An error is given if the byte following \c is not a printable ASCII
|
||||
character. This coding is ASCII-specific, but then the whole concept of \cx
|
||||
is ASCII-specific. (However, an EBCDIC equivalent has now been added.) */
|
||||
/* The handling of \c is different in ASCII and EBCDIC environments. In an
|
||||
ASCII (or Unicode) environment, an error is given if the character
|
||||
following \c is not a printable ASCII character. Otherwise, the following
|
||||
character is upper-cased if it is a letter, and after that the 0x40 bit is
|
||||
flipped. The result is the value of the escape.
|
||||
|
||||
In an EBCDIC environment the handling of \c is compatible with the
|
||||
specification in the perlebcdic document. The following character must be
|
||||
a letter or one of a small number of special characters. These provide a
|
||||
means of defining the character values 0-31.
|
||||
|
||||
For testing the EBCDIC handling of \c in an ASCII environment, recognize
|
||||
the EBCDIC value of 'c' explicitly. */
|
||||
|
||||
#if defined EBCDIC && 'a' != 0x81
|
||||
case 0x83:
|
||||
#else
|
||||
case CHAR_c:
|
||||
#endif
|
||||
|
||||
c = *(++ptr);
|
||||
if (c >= CHAR_a && c <= CHAR_z) c += ESCAPES_UPPER_CASE;
|
||||
if (c == CHAR_NULL && ptr >= cb->end_pattern)
|
||||
{
|
||||
*errorcodeptr = ERR2;
|
||||
break;
|
||||
}
|
||||
|
||||
/* Handle \c in an ASCII/Unicode environment. */
|
||||
|
||||
#ifndef EBCDIC /* ASCII/UTF-8 coding */
|
||||
if (c < 32 || c > 126) /* Excludes all non-printable ASCII */
|
||||
{
|
||||
*errorcodeptr = ERR68;
|
||||
break;
|
||||
}
|
||||
if (c >= CHAR_a && c <= CHAR_z) c -= 32;
|
||||
c ^= 0x40;
|
||||
#else /* EBCDIC coding */
|
||||
if (c >= CHAR_a && c <= CHAR_z) c += 64;
|
||||
c ^= 0xC0;
|
||||
#endif
|
||||
|
||||
/* Handle \c in an EBCDIC environment. The special case \c? is converted to
|
||||
255 (0xff) or 95 (0x5f) if other characters suggest we are using the POSIX-BC
|
||||
encoding. (This is the way Perl indicates that it handles \c?.) The other
|
||||
valid sequences correspond to a list of specific characters. */
|
||||
|
||||
#else
|
||||
if (c == CHAR_QUESTION_MARK)
|
||||
c = ('\\' == 188 && '`' == 74)? 0x5f : 0xff;
|
||||
else
|
||||
{
|
||||
for (i = 0; i < 32; i++)
|
||||
{
|
||||
if (c == ebcdic_escape_c[i]) break;
|
||||
}
|
||||
if (i < 32) c = i; else *errorcodeptr = ERR68;
|
||||
}
|
||||
#endif /* EBCDIC */
|
||||
|
||||
break;
|
||||
|
||||
/* Any other alphanumeric following \ is an error. Perl gives an error only
|
||||
|
@ -6492,7 +6532,7 @@ for (;; ptr++)
|
|||
goto FAILED;
|
||||
}
|
||||
recno = recno * 10 + *ptr++ - CHAR_0;
|
||||
}
|
||||
}
|
||||
|
||||
if (*ptr != (PCRE2_UCHAR)terminator)
|
||||
{
|
||||
|
|
|
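The table-driven EBCDIC handling of \c in the hunk above can be approximated in a standalone form. This is a sketch only: the function name `table_control_escape`, the macro `APC_VALUE`, and the -1 error return are assumptions for illustration, and the use of `toupper` stands in for the patch's own case conversion. Because the table is a string literal, it takes the values of whatever character set the compiler uses, so the index lookup behaves the same on an ASCII host as on an EBCDIC one.

```c
#include <ctype.h>

/* Value for \c?: the same compile-time check as in the patch, which selects
   95 when the character values look like POSIX-BC and 255 otherwise. */
#define APC_VALUE (('\\' == 188 && '`' == 74) ? 0x5f : 0xff)

/* Hypothetical helper, not a PCRE2 function. Returns the code value for
   \cx under the perlebcdic-style rule, or -1 if x is not permitted. */
static int table_control_escape(int x)
{
  static const unsigned char allowed[] = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_";
  int i;
  if (x <= 0 || x > 255) return -1;     /* reject NUL and out-of-range input */
  if (x == '?') return APC_VALUE;       /* special case: APC character */
  x = toupper(x);                       /* letters are accepted in either case */
  for (i = 0; i < 32; i++)
    if (x == allowed[i]) return i;      /* position in the table is the code value */
  return -1;                            /* any other character is an error */
}
```

On an ASCII host, '@' gives 0, letters give 1 to 26, '[' through '_' give 27 to 31, '?' gives 255, and a character such as '&' is rejected, which corresponds to the cases exercised by the test files later in this commit.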
@ -145,7 +145,11 @@ static const char compile_error_texts[] =
|
|||
"different names for subpatterns of the same number are not allowed\0"
|
||||
"(*MARK) must have an argument\0"
|
||||
"non-hex character in \\x{} (closing brace missing?)\0"
|
||||
#ifndef EBCDIC
|
||||
"\\c must be followed by a printable ASCII character\0"
|
||||
#else
|
||||
"\\c must be followed by a letter or one of [\\]^_?\0"
|
||||
#endif
|
||||
"\\k is not followed by a braced, angle-bracketed, or quoted name\0"
|
||||
/* 70 */
|
||||
"internal error: unknown opcode in find_fixedlength()\0"
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
# This is a specialized test for checking, when PCRE2 is compiled with the
|
||||
# EBCDIC option but in an ASCII environment, that newline and white space
|
||||
# EBCDIC option but in an ASCII environment, that newline, white space, and \c
|
||||
# functionality is working. It catches cases where explicit values such as 0x0a
|
||||
# have been used instead of names like CHAR_LF. Needless to say, it is not a
|
||||
# genuine EBCDIC test! In patterns, alphabetic characters that follow a
|
||||
|
@ -117,5 +117,18 @@
|
|||
A\x25B
|
||||
A\x0bB
|
||||
A\x0cB
|
||||
|
||||
# Test \c functionality
|
||||
|
||||
/\ƒ@\ƒA\ƒb\ƒC\ƒd\ƒE\ƒf\ƒG\ƒh\ƒI\ƒJ\ƒK\ƒl\ƒm\ƒN\ƒO\ƒp\ƒq\ƒr\ƒS\ƒT\ƒu\ƒV\ƒW\ƒX\ƒy\ƒZ/
|
||||
\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
|
||||
|
||||
/\ƒ[\ƒ\\ƒ]\ƒ^\ƒ_/
|
||||
\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
|
||||
|
||||
/\ƒ?/
|
||||
A\xffB
|
||||
|
||||
/\ƒ&/
|
||||
|
||||
# End
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
# This is a specialized test for checking, when PCRE2 is compiled with the
|
||||
# EBCDIC option but in an ASCII environment, that newline and white space
|
||||
# EBCDIC option but in an ASCII environment, that newline, white space, and \c
|
||||
# functionality is working. It catches cases where explicit values such as 0x0a
|
||||
# have been used instead of names like CHAR_LF. Needless to say, it is not a
|
||||
# genuine EBCDIC test! In patterns, alphabetic characters that follow a
|
||||
|
@ -178,5 +178,22 @@ No match
|
|||
No match
|
||||
A\x0cB
|
||||
No match
|
||||
|
||||
# Test \c functionality
|
||||
|
||||
/\ƒ@\ƒA\ƒb\ƒC\ƒd\ƒE\ƒf\ƒG\ƒh\ƒI\ƒJ\ƒK\ƒl\ƒm\ƒN\ƒO\ƒp\ƒq\ƒr\ƒS\ƒT\ƒu\ƒV\ƒW\ƒX\ƒy\ƒZ/
|
||||
\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
|
||||
0: \x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a
|
||||
|
||||
/\ƒ[\ƒ\\ƒ]\ƒ^\ƒ_/
|
||||
\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
|
||||
0: \x1b\x1c\x1d\x1e\x1f
|
||||
|
||||
/\ƒ?/
|
||||
A\xffB
|
||||
0: \xff
|
||||
|
||||
/\ƒ&/
|
||||
Failed: error 168 at offset 2: \c\x20must\x20be\x20followed\x20by\x20a\x20letter\x20or\x20one\x20of\x20[\]^_\x3f
|
||||
|
||||
# End
|
||||
|
|