Add support for \N{U+dd...}, for ASCII and Unicode modes only.
Parent: 775481293a
Commit: e9aa3c0a21
|
@ -129,6 +129,8 @@ present.
|
|||
|
||||
28. A (*MARK) name was not being passed back for positive assertions that were
|
||||
terminated by (*ACCEPT).
|
||||
|
||||
29. Add support for \N{U+dddd}, but not in EBCDIC environments.
|
||||
|
||||
|
||||
Version 10.31 12-February-2018
|
||||
|
|
|
@ -249,10 +249,11 @@ is used.
|
|||
<P>
|
||||
The newline convention affects where the circumflex and dollar assertions are
|
||||
true. It also affects the interpretation of the dot metacharacter when
|
||||
PCRE2_DOTALL is not set, and the behaviour of \N. However, it does not affect
|
||||
what the \R escape sequence matches. By default, this is any Unicode newline
|
||||
sequence, for Perl compatibility. However, this can be changed; see the next
|
||||
section and the description of \R in the section entitled
|
||||
PCRE2_DOTALL is not set, and the behaviour of \N when not followed by an
|
||||
opening brace. However, it does not affect what the \R escape sequence
|
||||
matches. By default, this is any Unicode newline sequence, for Perl
|
||||
compatibility. However, this can be changed; see the next section and the
|
||||
description of \R in the section entitled
|
||||
<a href="#newlineseq">"Newline sequences"</a>
|
||||
below. A change of \R setting can be combined with a change of newline
|
||||
convention.
|
||||
|
@ -382,20 +383,27 @@ text editing, it is often easier to use one of the following escape sequences
|
|||
than the binary character it represents. In an ASCII or Unicode environment,
|
||||
these escapes are as follows:
|
||||
<pre>
|
||||
\a alarm, that is, the BEL character (hex 07)
|
||||
\cx "control-x", where x is any printable ASCII character
|
||||
\e escape (hex 1B)
|
||||
\f form feed (hex 0C)
|
||||
\n linefeed (hex 0A)
|
||||
\r carriage return (hex 0D)
|
||||
\t tab (hex 09)
|
||||
\0dd character with octal code 0dd
|
||||
\ddd character with octal code ddd, or backreference
|
||||
\o{ddd..} character with octal code ddd..
|
||||
\xhh character with hex code hh
|
||||
\x{hhh..} character with hex code hhh.. (default mode)
|
||||
\uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
|
||||
\a alarm, that is, the BEL character (hex 07)
|
||||
\cx "control-x", where x is any printable ASCII character
|
||||
\e escape (hex 1B)
|
||||
\f form feed (hex 0C)
|
||||
\n linefeed (hex 0A)
|
||||
\r carriage return (hex 0D)
|
||||
\t tab (hex 09)
|
||||
\0dd character with octal code 0dd
|
||||
\ddd character with octal code ddd, or backreference
|
||||
\o{ddd..} character with octal code ddd..
|
||||
\xhh character with hex code hh
|
||||
\x{hhh..} character with hex code hhh.. (default mode)
|
||||
\N{U+hhh..} character with Unicode code point hhh..
|
||||
\uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
|
||||
</pre>
|
||||
Note that when \N is not followed by an opening brace (curly bracket) it has
|
||||
an entirely different meaning, matching any character that is not a newline.
|
||||
Perl also uses \N{name} to specify characters by Unicode name; PCRE2 does not
|
||||
support this.
|
||||
</P>
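The three readings of \N described above (code point, non-newline, unsupported Perl name) can be illustrated with a small classifier. This is a minimal sketch in Python, not PCRE2's own parser: the function name `classify_backslash_n` and its return labels are hypothetical, and it does no range checking on the code point.

```python
import re

# Hypothetical helper, not PCRE2 code: matches the \N{U+hhh..} form only.
_CODE_POINT = re.compile(r"\\N\{U\+([0-9A-Fa-f]+)\}")

def classify_backslash_n(text):
    """Classify a pattern fragment that starts with backslash-N.

    Returns ("codepoint", value), ("non_newline", None) or ("unsupported", None).
    """
    if not text.startswith("\\N"):
        raise ValueError("text must start with \\N")
    if not text.startswith("\\N{"):
        # No opening brace: \N matches any character that is not a newline.
        return ("non_newline", None)
    m = _CODE_POINT.match(text)
    if m:
        # \N{U+hhh..}: the hex digits give the Unicode code point.
        return ("codepoint", int(m.group(1), 16))
    # Any other braced form, such as Perl's \N{name}, is not supported.
    return ("unsupported", None)
```

For instance, `classify_backslash_n(r"\N{U+20AC}")` yields the code point hex 20AC, while `classify_backslash_n(r"\Nabc")` reports the non-newline meaning.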
|
||||
<P>
|
||||
The precise effect of \cx on ASCII characters is as follows: if x is a lower
|
||||
case letter, it is converted to upper case. Then bit 6 of the character (hex
|
||||
40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A),
|
||||
|
@ -404,14 +412,14 @@ code unit following \c has a value less than 32 or greater than 126, a
|
|||
compile-time error occurs.
|
||||
</P>
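The \cx rule for ASCII characters can be written out as a few lines of arithmetic. The following is a minimal Python sketch of the rule as stated in the text; the function name `control_escape` is hypothetical, and a Python exception stands in for the compile-time error.

```python
def control_escape(x):
    """Code value produced by \\cx in an ASCII environment, per the rule above."""
    code = ord(x)
    if code < 32 or code > 126:
        # The text says this case is a compile-time error.
        raise ValueError("character after \\c must be printable ASCII")
    if "a" <= x <= "z":
        # Lower case letters are converted to upper case first.
        code = ord(x.upper())
    # Invert bit 6 (hex 40).
    return code ^ 0x40
```

With this sketch, `control_escape("A")` is hex 01, `control_escape("z")` is hex 1A, `control_escape("{")` is hex 3B, and `control_escape(";")` is hex 7B.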
|
||||
<P>
|
||||
When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t
|
||||
generate the appropriate EBCDIC code values. The \c escape is processed
|
||||
as specified for Perl in the <b>perlebcdic</b> document. The only characters
|
||||
that are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?. Any
|
||||
other character provokes a compile-time error. The sequence \c@ encodes
|
||||
character code 0; after \c the letters (in either case) encode characters 1-26
|
||||
(hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex
|
||||
1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F).
|
||||
When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported. \a, \e,
|
||||
\f, \n, \r, and \t generate the appropriate EBCDIC code values. The \c
|
||||
escape is processed as specified for Perl in the <b>perlebcdic</b> document. The
|
||||
only characters that are allowed after \c are A-Z, a-z, or one of @, [, \, ],
|
||||
^, _, or ?. Any other character provokes a compile-time error. The sequence
|
||||
\c@ encodes character code 0; after \c the letters (in either case) encode
|
||||
characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31
|
||||
(hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F).
|
||||
</P>
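The EBCDIC mapping for \c described above amounts to a small lookup. Here is a minimal Python sketch of it; the function name `ebcdic_control_escape` and the `question_mark_value` parameter are assumptions of this sketch, since the text says only that \c? becomes either 255 or 95.

```python
def ebcdic_control_escape(x, question_mark_value=255):
    """Character code for \\cx as described for EBCDIC mode.

    question_mark_value is a hypothetical parameter: the caller chooses which
    of the two documented values (255 or 95) \\c? should produce.
    """
    if len(x) != 1:
        raise ValueError("expected a single character")
    if x == "@":
        return 0
    if x.isascii() and x.isalpha():
        # Letters in either case encode characters 1-26.
        return ord(x.upper()) - ord("A") + 1
    specials = "[\\]^_"
    if x in specials:
        # [, \, ], ^, and _ encode characters 27-31.
        return 27 + specials.index(x)
    if x == "?":
        if question_mark_value not in (255, 95):
            raise ValueError("\\c? must map to 255 or 95")
        return question_mark_value
    # Any other character is a compile-time error in PCRE2.
    raise ValueError("character not allowed after \\c in EBCDIC mode")
```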
|
||||
<P>
|
||||
Thus, apart from \c?, these escapes generate the same character code values as
|
||||
|
@ -443,9 +451,9 @@ to be unambiguously specified.
|
|||
</P>
|
||||
<P>
|
||||
For greater clarity and unambiguity, it is best to avoid following \ by a
|
||||
digit greater than zero. Instead, use \o{} or \x{} to specify character
|
||||
numbers, and \g{} to specify backreferences. The following paragraphs
|
||||
describe the old, ambiguous syntax.
|
||||
digit greater than zero. Instead, use \o{} or \x{} to specify numerical
|
||||
character code points, and \g{} to specify backreferences. The following
|
||||
paragraphs describe the old, ambiguous syntax.
|
||||
</P>
|
||||
<P>
|
||||
The handling of a backslash followed by a digit other than 0 is complicated,
|
||||
|
@ -528,10 +536,10 @@ and outside character classes. In addition, inside a character class, \b is
|
|||
interpreted as the backspace character (hex 08).
|
||||
</P>
|
||||
<P>
|
||||
\N is not allowed in a character class. \B, \R, and \X are not special
|
||||
inside a character class. Like other unrecognized alphabetic escape sequences,
|
||||
they cause an error. Outside a character class, these sequences have different
|
||||
meanings.
|
||||
When not followed by an opening brace, \N is not allowed in a character class.
|
||||
\B, \R, and \X are not special inside a character class. Like other
|
||||
unrecognized alphabetic escape sequences, they cause an error. Outside a
|
||||
character class, these sequences have different meanings.
|
||||
</P>
|
||||
<br><b>
|
||||
Unsupported escape sequences
|
||||
|
@ -577,6 +585,7 @@ Another use of backslash is for specifying generic character types:
|
|||
\D any character that is not a decimal digit
|
||||
\h any horizontal white space character
|
||||
\H any character that is not a horizontal white space character
|
||||
\N any character that is not a newline
|
||||
\s any white space character
|
||||
\S any character that is not a white space character
|
||||
\v any vertical white space character
|
||||
|
@ -584,11 +593,14 @@ Another use of backslash is for specifying generic character types:
|
|||
\w any "word" character
|
||||
\W any "non-word" character
|
||||
</pre>
|
||||
There is also the single sequence \N, which matches a non-newline character.
|
||||
This is the same as
|
||||
The \N escape sequence has the same meaning as
|
||||
<a href="#fullstopdot">the "." metacharacter</a>
|
||||
when PCRE2_DOTALL is not set. Perl also uses \N to match characters by name;
|
||||
PCRE2 does not support this.
|
||||
when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the
|
||||
meaning of \N. Note that when \N is followed by an opening brace it has a
|
||||
different meaning. See the section entitled
|
||||
<a href="#digitsafterbackslash">"Non-printing characters"</a>
|
||||
above for details. Perl also uses \N{name} to specify characters by Unicode
|
||||
name; PCRE2 does not support this.
|
||||
</P>
|
||||
<P>
|
||||
Each pair of lower and upper case escape sequences partitions the complete set
|
||||
|
@ -1297,9 +1309,15 @@ dollar, the only relationship being that they both involve newlines. Dot has no
|
|||
special meaning in a character class.
|
||||
</P>
|
||||
<P>
|
||||
The escape sequence \N behaves like a dot, except that it is not affected by
|
||||
the PCRE2_DOTALL option. In other words, it matches any character except one
|
||||
that signifies the end of a line. Perl also uses \N to match characters by
|
||||
The escape sequence \N when not followed by an opening brace behaves like a
|
||||
dot, except that it is not affected by the PCRE2_DOTALL option. In other words,
|
||||
it matches any character except one that signifies the end of a line.
|
||||
</P>
|
||||
<P>
|
||||
When \N is followed by an opening brace it has a different meaning. See the
|
||||
section entitled
|
||||
<a href="#digitsafterbackslash">"Non-printing characters"</a>
|
||||
above for details. Perl also uses \N{name} to specify characters by Unicode
|
||||
name; PCRE2 does not support this.
|
||||
</P>
|
||||
<br><a name="SEC8" href="#TOC1">MATCHING A SINGLE CODE UNIT</a><br>
|
||||
|
@ -1385,10 +1403,11 @@ string, and therefore it fails if the current pointer is at the end of the
|
|||
string.
|
||||
</P>
|
||||
<P>
|
||||
When caseless matching is set, any letters in a class represent both their
|
||||
upper case and lower case versions, so for example, a caseless [aeiou] matches
|
||||
"A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a
|
||||
caseful version would.
|
||||
Characters in a class may be specified by their code points using \o, \x, or
|
||||
\N{U+hh..} in the usual way. When caseless matching is set, any letters in a
|
||||
class represent both their upper case and lower case versions, so for example,
|
||||
a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
|
||||
match "A", whereas a caseful version would.
|
||||
</P>
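The caseless class behaviour described above can be observed in a short experiment. The sketch below uses Python's `re` module as a stand-in, which is an assumption: it is a different engine from PCRE2, although it treats this particular case in the same way, with `re.IGNORECASE` playing the role of caseless matching.

```python
import re

# Python's re module stands in for PCRE2 here; it is not the same engine.
vowel_caseless = re.compile(r"[aeiou]", re.IGNORECASE)
non_vowel_caseless = re.compile(r"[^aeiou]", re.IGNORECASE)
non_vowel_caseful = re.compile(r"[^aeiou]")

def class_matches(compiled, ch):
    """True if the single character ch is matched by the compiled class."""
    return compiled.fullmatch(ch) is not None

# A caseless [aeiou] matches "A" as well as "a".
upper_a_is_vowel = class_matches(vowel_caseless, "A")
# A caseless [^aeiou] does not match "A" ...
upper_a_is_non_vowel_caseless = class_matches(non_vowel_caseless, "A")
# ... whereas a caseful [^aeiou] does.
upper_a_is_non_vowel_caseful = class_matches(non_vowel_caseful, "A")
```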
|
||||
<P>
|
||||
Characters that might indicate line breaks are never treated in any special way
|
||||
|
@ -1397,17 +1416,18 @@ whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
|
|||
class such as [^a] always matches one of these characters.
|
||||
</P>
|
||||
<P>
|
||||
The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,
|
||||
\V, \w, and \W may appear in a character class, and add the characters that
|
||||
they match to the class. For example, [\dABCDEF] matches any hexadecimal
|
||||
digit. In UTF modes, the PCRE2_UCP option affects the meanings of \d, \s, \w
|
||||
and their upper case partners, just as it does when they appear outside a
|
||||
character class, as described in the section entitled
|
||||
The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
|
||||
\S, \v, \V, \w, and \W may appear in a character class, and add the
|
||||
characters that they match to the class. For example, [\dABCDEF] matches any
|
||||
hexadecimal digit. In UTF modes, the PCRE2_UCP option affects the meanings of
|
||||
\d, \s, \w and their upper case partners, just as it does when they appear
|
||||
outside a character class, as described in the section entitled
|
||||
<a href="#genericchartypes">"Generic character types"</a>
|
||||
above. The escape sequence \b has a different meaning inside a character
|
||||
class; it matches the backspace character. The sequences \B, \N, \R, and \X
|
||||
are not special inside a character class. Like any other unrecognized escape
|
||||
sequences, they cause an error.
|
||||
class; it matches the backspace character. The sequences \B, \R, and \X are
|
||||
not special inside a character class. Like any other unrecognized escape
|
||||
sequences, they cause an error. The same is true for \N when not followed by
|
||||
an opening brace.
|
||||
</P>
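Two of the points above, that a generic type escape adds its characters to a class and that \b means backspace inside a class, can be checked in a few lines. This sketch again uses Python's `re` module as a stand-in for PCRE2, which is an assumption; `re.ASCII` restricts \d to ASCII digits, and the helper name `is_upper_hex_digit` is hypothetical.

```python
import re

# [\dABCDEF]: the digits matched by \d plus the listed upper case letters.
hex_upper = re.compile(r"[\dABCDEF]", re.ASCII)

# Inside a class, \b is the backspace character (hex 08), not a word boundary.
backspace_class = re.compile(r"[\b]")

def is_upper_hex_digit(ch):
    """True if ch is a single hexadecimal digit written with upper case letters."""
    return hex_upper.fullmatch(ch) is not None
```

Note that the class as written accepts only upper case A to F; "a" is not matched unless caseless matching is also requested.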
|
||||
<P>
|
||||
The minus (hyphen) character can be used to specify a range of characters in a
|
||||
|
@ -3559,7 +3579,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 20 July 2018
|
||||
Last updated: 27 July 2018
|
||||
<br>
|
||||
Copyright © 1997-2018 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -70,9 +70,10 @@ This table applies to ASCII and Unicode environments.
|
|||
\ddd character with octal code ddd, or backreference
|
||||
\o{ddd..} character with octal code ddd..
|
||||
\U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
|
||||
\N{U+hh..} character with Unicode code point hh..
|
||||
\uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
|
||||
\xhh character with hex code hh
|
||||
\x{hhh..} character with hex code hhh..
|
||||
\x{hh..} character with hex code hh..
|
||||
</pre>
|
||||
Note that \0dd is always an octal code. The treatment of backslash followed by
|
||||
a non-zero digit is complicated; for details see the section
|
||||
|
@ -80,7 +81,9 @@ a non-zero digit is complicated; for details see the section
|
|||
in the
|
||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||
documentation, where details of escape processing in EBCDIC environments are
|
||||
also given.
|
||||
also given. \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not
|
||||
supported in EBCDIC environments. Note that \N not followed by an opening
|
||||
curly bracket has a different meaning (see below).
|
||||
</P>
|
||||
<P>
|
||||
When \x is not followed by {, from zero to two hexadecimal digits are read,
|
||||
|
@ -621,7 +624,7 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 21 July 2018
|
||||
Last updated: 27 July 2018
|
||||
<br>
|
||||
Copyright © 1997-2018 University of Cambridge.
|
||||
<br>
|
||||
|
|
392
doc/pcre2.txt
|
@ -6015,36 +6015,37 @@ SPECIAL START-OF-PATTERN ITEMS
|
|||
|
||||
The newline convention affects where the circumflex and dollar asser-
|
||||
tions are true. It also affects the interpretation of the dot metachar-
|
||||
acter when PCRE2_DOTALL is not set, and the behaviour of \N. However,
|
||||
it does not affect what the \R escape sequence matches. By default,
|
||||
this is any Unicode newline sequence, for Perl compatibility. However,
|
||||
this can be changed; see the next section and the description of \R in
|
||||
the section entitled "Newline sequences" below. A change of \R setting
|
||||
can be combined with a change of newline convention.
|
||||
acter when PCRE2_DOTALL is not set, and the behaviour of \N when not
|
||||
followed by an opening brace. However, it does not affect what the \R
|
||||
escape sequence matches. By default, this is any Unicode newline
|
||||
sequence, for Perl compatibility. However, this can be changed; see the
|
||||
next section and the description of \R in the section entitled "Newline
|
||||
sequences" below. A change of \R setting can be combined with a change
|
||||
of newline convention.
|
||||
|
||||
Specifying what \R matches
|
||||
|
||||
It is possible to restrict \R to match only CR, LF, or CRLF (instead of
|
||||
the complete set of Unicode line endings) by setting the option
|
||||
PCRE2_BSR_ANYCRLF at compile time. This effect can also be achieved by
|
||||
starting a pattern with (*BSR_ANYCRLF). For completeness, (*BSR_UNI-
|
||||
the complete set of Unicode line endings) by setting the option
|
||||
PCRE2_BSR_ANYCRLF at compile time. This effect can also be achieved by
|
||||
starting a pattern with (*BSR_ANYCRLF). For completeness, (*BSR_UNI-
|
||||
CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE.
|
||||
|
||||
|
||||
EBCDIC CHARACTER CODES
|
||||
|
||||
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
|
||||
character code instead of ASCII or Unicode (typically a mainframe sys-
|
||||
tem). In the sections below, character code values are ASCII or Uni-
|
||||
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
|
||||
character code instead of ASCII or Unicode (typically a mainframe sys-
|
||||
tem). In the sections below, character code values are ASCII or Uni-
|
||||
code; in an EBCDIC environment these characters may have different code
|
||||
values, and there are no code points greater than 255.
|
||||
|
||||
|
||||
CHARACTERS AND METACHARACTERS
|
||||
|
||||
A regular expression is a pattern that is matched against a subject
|
||||
string from left to right. Most characters stand for themselves in a
|
||||
pattern, and match the corresponding characters in the subject. As a
|
||||
A regular expression is a pattern that is matched against a subject
|
||||
string from left to right. Most characters stand for themselves in a
|
||||
pattern, and match the corresponding characters in the subject. As a
|
||||
trivial example, the pattern
|
||||
|
||||
The quick brown fox
|
||||
|
@ -6053,14 +6054,14 @@ CHARACTERS AND METACHARACTERS
|
|||
caseless matching is specified (the PCRE2_CASELESS option), letters are
|
||||
matched independently of case.
|
||||
|
||||
The power of regular expressions comes from the ability to include
|
||||
alternatives and repetitions in the pattern. These are encoded in the
|
||||
The power of regular expressions comes from the ability to include
|
||||
alternatives and repetitions in the pattern. These are encoded in the
|
||||
pattern by the use of metacharacters, which do not stand for themselves
|
||||
but instead are interpreted in some special way.
|
||||
|
||||
There are two different sets of metacharacters: those that are recog-
|
||||
nized anywhere in the pattern except within square brackets, and those
|
||||
that are recognized within square brackets. Outside square brackets,
|
||||
There are two different sets of metacharacters: those that are recog-
|
||||
nized anywhere in the pattern except within square brackets, and those
|
||||
that are recognized within square brackets. Outside square brackets,
|
||||
the metacharacters are as follows:
|
||||
|
||||
\ general escape character with several uses
|
||||
|
@ -6079,7 +6080,7 @@ CHARACTERS AND METACHARACTERS
|
|||
also "possessive quantifier"
|
||||
{ start min/max quantifier
|
||||
|
||||
Part of a pattern that is in square brackets is called a "character
|
||||
Part of a pattern that is in square brackets is called a "character
|
||||
class". In a character class the only metacharacters are:
|
||||
|
||||
\ general escape character
|
||||
|
@ -6096,30 +6097,30 @@ BACKSLASH
|
|||
|
||||
The backslash character has several uses. Firstly, if it is followed by
|
||||
a character that is not a number or a letter, it takes away any special
|
||||
meaning that character may have. This use of backslash as an escape
|
||||
meaning that character may have. This use of backslash as an escape
|
||||
character applies both inside and outside character classes.
|
||||
|
||||
For example, if you want to match a * character, you must write \* in
|
||||
the pattern. This escaping action applies whether or not the following
|
||||
character would otherwise be interpreted as a metacharacter, so it is
|
||||
always safe to precede a non-alphanumeric with backslash to specify
|
||||
For example, if you want to match a * character, you must write \* in
|
||||
the pattern. This escaping action applies whether or not the following
|
||||
character would otherwise be interpreted as a metacharacter, so it is
|
||||
always safe to precede a non-alphanumeric with backslash to specify
|
||||
that it stands for itself. In particular, if you want to match a back-
|
||||
slash, you write \\.
|
||||
|
||||
In a UTF mode, only ASCII numbers and letters have any special meaning
|
||||
after a backslash. All other characters (in particular, those whose
|
||||
In a UTF mode, only ASCII numbers and letters have any special meaning
|
||||
after a backslash. All other characters (in particular, those whose
|
||||
code points are greater than 127) are treated as literals.
|
||||
|
||||
If a pattern is compiled with the PCRE2_EXTENDED option, most white
|
||||
space in the pattern (other than in a character class), and characters
|
||||
between a # outside a character class and the next newline, inclusive,
|
||||
If a pattern is compiled with the PCRE2_EXTENDED option, most white
|
||||
space in the pattern (other than in a character class), and characters
|
||||
between a # outside a character class and the next newline, inclusive,
|
||||
are ignored. An escaping backslash can be used to include a white space
|
||||
or # character as part of the pattern.
|
||||
|
||||
If you want to remove the special meaning from a sequence of charac-
|
||||
ters, you can do so by putting them between \Q and \E. This is differ-
|
||||
ent from Perl in that $ and @ are handled as literals in \Q...\E
|
||||
sequences in PCRE2, whereas in Perl, $ and @ cause variable interpola-
|
||||
If you want to remove the special meaning from a sequence of charac-
|
||||
ters, you can do so by putting them between \Q and \E. This is differ-
|
||||
ent from Perl in that $ and @ are handled as literals in \Q...\E
|
||||
sequences in PCRE2, whereas in Perl, $ and @ cause variable interpola-
|
||||
tion. Note the following examples:
|
||||
|
||||
Pattern PCRE2 matches Perl matches
|
||||
|
@ -6129,36 +6130,42 @@ BACKSLASH
|
|||
\Qabc\$xyz\E abc\$xyz abc\$xyz
|
||||
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz
|
||||
|
||||
The \Q...\E sequence is recognized both inside and outside character
|
||||
classes. An isolated \E that is not preceded by \Q is ignored. If \Q
|
||||
is not followed by \E later in the pattern, the literal interpretation
|
||||
continues to the end of the pattern (that is, \E is assumed at the
|
||||
end). If the isolated \Q is inside a character class, this causes an
|
||||
error, because the character class is not terminated by a closing
|
||||
The \Q...\E sequence is recognized both inside and outside character
|
||||
classes. An isolated \E that is not preceded by \Q is ignored. If \Q
|
||||
is not followed by \E later in the pattern, the literal interpretation
|
||||
continues to the end of the pattern (that is, \E is assumed at the
|
||||
end). If the isolated \Q is inside a character class, this causes an
|
||||
error, because the character class is not terminated by a closing
|
||||
square bracket.
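The \Q...\E rules above (literal text between the markers, an isolated \E ignored, an unterminated \Q running to the end of the pattern) can be sketched as a rewriting pass. This is an illustrative Python sketch, not PCRE2's implementation: the function name `expand_quoted` is hypothetical, it ignores character classes, and so it does not model the error for an isolated \Q inside a class.

```python
def expand_quoted(pattern):
    """Rewrite \\Q...\\E runs as individually backslash-escaped characters."""
    out = []
    i = 0
    quoting = False
    while i < len(pattern):
        two = pattern[i:i + 2]
        if quoting:
            if two == "\\E":
                quoting = False          # end of the literal run
                i += 2
            else:
                ch = pattern[i]
                # Escaping a non-alphanumeric is always safe, as noted earlier.
                out.append(ch if ch.isalnum() else "\\" + ch)
                i += 1
        elif two == "\\Q":
            quoting = True               # start of a literal run
            i += 2
        elif two == "\\E":
            i += 2                       # isolated \E is ignored
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            out.append(two)              # keep any other escape pair intact
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    # Reaching the end while still quoting behaves as if \E were present.
    return "".join(out)
```

For the table row `\Qabc\$xyz\E`, the sketch produces a pattern in which the backslash and the dollar are both escaped, so it matches the literal text `abc\$xyz`.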
|
||||
|
||||
Non-printing characters
|
||||
|
||||
A second use of backslash provides a way of encoding non-printing char-
|
||||
acters in patterns in a visible manner. There is no restriction on the
|
||||
appearance of non-printing characters in a pattern, but when a pattern
|
||||
acters in patterns in a visible manner. There is no restriction on the
|
||||
appearance of non-printing characters in a pattern, but when a pattern
|
||||
is being prepared by text editing, it is often easier to use one of the
|
||||
following escape sequences than the binary character it represents. In
|
||||
following escape sequences than the binary character it represents. In
|
||||
an ASCII or Unicode environment, these escapes are as follows:
|
||||
|
||||
\a alarm, that is, the BEL character (hex 07)
|
||||
\cx "control-x", where x is any printable ASCII character
|
||||
\e escape (hex 1B)
|
||||
\f form feed (hex 0C)
|
||||
\n linefeed (hex 0A)
|
||||
\r carriage return (hex 0D)
|
||||
\t tab (hex 09)
|
||||
\0dd character with octal code 0dd
|
||||
\ddd character with octal code ddd, or backreference
|
||||
\o{ddd..} character with octal code ddd..
|
||||
\xhh character with hex code hh
|
||||
\x{hhh..} character with hex code hhh.. (default mode)
|
||||
\uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
|
||||
\a alarm, that is, the BEL character (hex 07)
|
||||
\cx "control-x", where x is any printable ASCII character
|
||||
\e escape (hex 1B)
|
||||
\f form feed (hex 0C)
|
||||
\n linefeed (hex 0A)
|
||||
\r carriage return (hex 0D)
|
||||
\t tab (hex 09)
|
||||
\0dd character with octal code 0dd
|
||||
\ddd character with octal code ddd, or backreference
|
||||
\o{ddd..} character with octal code ddd..
|
||||
\xhh character with hex code hh
|
||||
\x{hhh..} character with hex code hhh.. (default mode)
|
||||
\N{U+hhh..} character with Unicode code point hhh..
|
||||
\uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
|
||||
|
||||
Note that when \N is not followed by an opening brace (curly bracket)
|
||||
it has an entirely different meaning, matching any character that is
|
||||
not a newline. Perl also uses \N{name} to specify characters by Uni-
|
||||
code name; PCRE2 does not support this.
|
||||
|
||||
The precise effect of \cx on ASCII characters is as follows: if x is a
|
||||
lower case letter, it is converted to upper case. Then bit 6 of the
|
||||
|
@ -6167,15 +6174,15 @@ BACKSLASH
|
|||
hex 7B (; is 3B). If the code unit following \c has a value less than
|
||||
32 or greater than 126, a compile-time error occurs.
|
||||
|
||||
When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t gen-
|
||||
erate the appropriate EBCDIC code values. The \c escape is processed as
|
||||
specified for Perl in the perlebcdic document. The only characters that
|
||||
are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?.
|
||||
Any other character provokes a compile-time error. The sequence \c@
|
||||
encodes character code 0; after \c the letters (in either case) encode
|
||||
characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters
|
||||
27-31 (hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95
|
||||
(hex 5F).
|
||||
When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported.
|
||||
\a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values.
|
||||
The \c escape is processed as specified for Perl in the perlebcdic doc-
|
||||
ument. The only characters that are allowed after \c are A-Z, a-z, or
|
||||
one of @, [, \, ], ^, _, or ?. Any other character provokes a compile-
|
||||
time error. The sequence \c@ encodes character code 0; after \c the
|
||||
letters (in either case) encode characters 1-26 (hex 01 to hex 1A); [,
|
||||
\, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c?
|
||||
becomes either 255 (hex FF) or 95 (hex 5F).
|
||||
|
||||
Thus, apart from \c?, these escapes generate the same character code
|
||||
values as they do in an ASCII environment, though the meanings of the
|
||||
|
@ -6203,9 +6210,9 @@ BACKSLASH
|
|||
numbers and backreferences to be unambiguously specified.
|
||||
|
||||
For greater clarity and unambiguity, it is best to avoid following \ by
|
||||
a digit greater than zero. Instead, use \o{} or \x{} to specify charac-
|
||||
ter numbers, and \g{} to specify backreferences. The following para-
|
||||
graphs describe the old, ambiguous syntax.
|
||||
a digit greater than zero. Instead, use \o{} or \x{} to specify numeri-
|
||||
cal character code points, and \g{} to specify backreferences. The fol-
|
||||
lowing paragraphs describe the old, ambiguous syntax.
|
||||
|
||||
The handling of a backslash followed by a digit other than 0 is compli-
|
||||
cated, and Perl has changed over time, causing PCRE2 also to change.
|
||||
|
@ -6281,10 +6288,10 @@ BACKSLASH
|
|||
inside and outside character classes. In addition, inside a character
|
||||
class, \b is interpreted as the backspace character (hex 08).
|
||||
|
||||
\N is not allowed in a character class. \B, \R, and \X are not special
|
||||
inside a character class. Like other unrecognized alphabetic escape
|
||||
sequences, they cause an error. Outside a character class, these
|
||||
sequences have different meanings.
|
||||
When not followed by an opening brace, \N is not allowed in a character
|
||||
class. \B, \R, and \X are not special inside a character class. Like
|
||||
other unrecognized alphabetic escape sequences, they cause an error.
|
||||
Outside a character class, these sequences have different meanings.
|
||||
|
||||
Unsupported escape sequences
|
||||
|
||||
|
@ -6318,6 +6325,7 @@ BACKSLASH
|
|||
\D any character that is not a decimal digit
|
||||
\h any horizontal white space character
|
||||
\H any character that is not a horizontal white space character
|
||||
\N any character that is not a newline
|
||||
\s any white space character
|
||||
\S any character that is not a white space character
|
||||
\v any vertical white space character
|
||||
|
@ -6325,10 +6333,12 @@ BACKSLASH
|
|||
\w any "word" character
|
||||
\W any "non-word" character
|
||||
|
||||
There is also the single sequence \N, which matches a non-newline char-
|
||||
acter. This is the same as the "." metacharacter when PCRE2_DOTALL is
|
||||
not set. Perl also uses \N to match characters by name; PCRE2 does not
|
||||
support this.
|
||||
The \N escape sequence has the same meaning as the "." metacharacter
|
||||
when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change
|
||||
the meaning of \N. Note that when \N is followed by an opening brace it
|
||||
has a different meaning. See the section entitled "Non-printing charac-
|
||||
ters" above for details. Perl also uses \N{name} to specify characters
|
||||
by Unicode name; PCRE2 does not support this.
|
||||
|
||||
Each pair of lower and upper case escape sequences partitions the com-
|
||||
plete set of characters into two disjoint sets. Any given character
|
||||
|
@ -6867,49 +6877,54 @@ FULL STOP (PERIOD, DOT) AND \N
|
|||
flex and dollar, the only relationship being that they both involve
|
||||
newlines. Dot has no special meaning in a character class.
|
||||
|
||||
The escape sequence \N behaves like a dot, except that it is not
|
||||
affected by the PCRE2_DOTALL option. In other words, it matches any
|
||||
character except one that signifies the end of a line. Perl also uses
|
||||
\N to match characters by name; PCRE2 does not support this.
|
||||
The escape sequence \N when not followed by an opening brace behaves
|
||||
like a dot, except that it is not affected by the PCRE2_DOTALL option.
|
||||
In other words, it matches any character except one that signifies the
|
||||
end of a line.
|
||||
|
||||
When \N is followed by an opening brace it has a different meaning. See
|
||||
the section entitled "Non-printing characters" above for details. Perl
|
||||
also uses \N{name} to specify characters by Unicode name; PCRE2 does
|
||||
not support this.
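The difference between dot and a bare \N under a "dotall" setting can be illustrated by analogy. Python's `re` module has no bare \N, so this sketch uses the class `[^\n]` as a stand-in; that substitution, the LF-only newline convention, and the helper name `matches` are all assumptions of the sketch rather than PCRE2 behaviour.

```python
import re

# Stand-in for PCRE2's bare \N, assuming newline means LF only.
NON_NEWLINE = r"[^\n]"

def matches(pattern, subject, flags=0):
    """True if the whole subject is matched by pattern under the given flags."""
    return re.fullmatch(pattern, subject, flags) is not None

# Dot changes meaning when the dotall flag is set ...
dot_default = matches(".", "\n")
dot_dotall = matches(".", "\n", re.DOTALL)
# ... but the non-newline stand-in does not.
non_newline_dotall = matches(NON_NEWLINE, "\n", re.DOTALL)
```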
|
||||
|
||||
|
||||
MATCHING A SINGLE CODE UNIT
|
||||
|
||||
Outside a character class, the escape sequence \C matches any one code
|
||||
unit, whether or not a UTF mode is set. In the 8-bit library, one code
|
||||
unit is one byte; in the 16-bit library it is a 16-bit unit; in the
|
||||
32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
|
||||
line-ending characters. The feature is provided in Perl in order to
|
||||
Outside a character class, the escape sequence \C matches any one code
|
||||
unit, whether or not a UTF mode is set. In the 8-bit library, one code
|
||||
unit is one byte; in the 16-bit library it is a 16-bit unit; in the
|
||||
32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
|
||||
line-ending characters. The feature is provided in Perl in order to
|
||||
match individual bytes in UTF-8 mode, but it is unclear how it can use-
|
||||
fully be used.
|
||||
|
||||
Because \C breaks up characters into individual code units, matching
one unit with \C in UTF-8 or UTF-16 mode means that the rest of the
string may start with a malformed UTF character. This has undefined
results, because PCRE2 assumes that it is matching character by
character in a valid UTF string (by default it checks the subject
string's validity at the start of processing unless the
PCRE2_NO_UTF_CHECK option is used).

An application can lock out the use of \C by setting the
PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also
possible to build PCRE2 with the use of \C permanently disabled.

PCRE2 does not allow \C to appear in lookbehind assertions (described
below) in UTF-8 or UTF-16 modes, because this would make it impossible
to calculate the length of the lookbehind. Neither the alternative
matching function pcre2_dfa_match() nor the JIT optimizer supports \C in
these UTF modes. The former gives a match-time error; the latter fails
to optimize and so the match is always run using the interpreter.

In the 32-bit library, however, \C is always supported (when not
explicitly locked out) because it always matches a single code unit,
whether or not UTF-32 is specified.

In general, the \C escape sequence is best avoided. However, one way of
using it that avoids the problem of malformed UTF-8 or UTF-16
characters is to use a lookahead to check the length of the next
character, as in this pattern, which could be used with a UTF-8 string
(ignore white space and line breaks):

  (?| (?=[\x00-\x7f])(\C) |
@ -6917,10 +6932,10 @@ MATCHING A SINGLE CODE UNIT
  (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
  (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))

In this example, a group that starts with (?| resets the capturing
parentheses numbers in each alternative (see "Duplicate Subpattern
Numbers" below). The assertions at the start of each branch check the
next UTF-8 character for values whose encoding uses 1, 2, 3, or 4 bytes,
respectively. The character's individual bytes are then captured by the
appropriate number of \C groups.

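The lookaheads in the pattern above classify a character by how many
bytes its UTF-8 encoding occupies. As a rough illustration of that
classification outside of any regular expression, here is a minimal C
sketch. The function name utf8_encoded_length is hypothetical and is not
part of the PCRE2 API; the 2-byte range 0x80 to 0x7ff comes from the
UTF-8 encoding rules rather than from the lines shown in this hunk, and
the sketch does not treat surrogate code points specially.

```c
#include <stdint.h>

/* Illustrative helper (not part of PCRE2): returns how many bytes the
   UTF-8 encoding of code point cp occupies, using the same upper
   bounds as the lookahead ranges in the pattern. Returns 0 for values
   above 0x1fffff, which none of the lookaheads accept. */
int utf8_encoded_length(uint32_t cp)
{
  if (cp <= 0x7f) return 1;       /* matches [\x00-\x7f] */
  if (cp <= 0x7ff) return 2;      /* 2-byte range per the UTF-8 rules */
  if (cp <= 0xffff) return 3;     /* matches [\x{800}-\x{ffff}] */
  if (cp <= 0x1fffff) return 4;   /* matches [\x{10000}-\x{1fffff}] */
  return 0;
}
```

The returned count corresponds to the number of \C groups in the branch
whose lookahead succeeds for that character.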
|
@ -6929,50 +6944,53 @@ SQUARE BRACKETS AND CHARACTER CLASSES
|
|||
|
||||
An opening square bracket introduces a character class, terminated by a
closing square bracket. A closing square bracket on its own is not
special by default. If a closing square bracket is required as a member
of the class, it should be the first data character in the class (after
an initial circumflex, if present) or escaped with a backslash. This
means that, by default, an empty class cannot be defined. However, if
the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at
the start does end the (empty) class.

A character class matches a single character in the subject. A matched
character must be in the set of characters defined by the class, unless
the first character in the class definition is a circumflex, in which
case the subject character must not be in the set defined by the class.
If a circumflex is actually required as a member of the class, ensure
it is not the first character, or escape it with a backslash.

For example, the character class [aeiou] matches any lower case vowel,
while [^aeiou] matches any character that is not a lower case vowel.
Note that a circumflex is just a convenient notation for specifying the
characters that are in the class by enumerating those that are not. A
class that starts with a circumflex is not an assertion; it still
consumes a character from the subject string, and therefore it fails if
the current pointer is at the end of the string.

-When caseless matching is set, any letters in a class represent both
-their upper case and lower case versions, so for example, a caseless
-[aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
-match "A", whereas a caseful version would.
+Characters in a class may be specified by their code points using \o,
+\x, or \N{U+hh..} in the usual way. When caseless matching is set, any
+letters in a class represent both their upper case and lower case
+versions, so for example, a caseless [aeiou] matches "A" as well as
+"a", and a caseless [^aeiou] does not match "A", whereas a caseful
+version would.

Characters that might indicate line breaks are never treated in any
special way when matching character classes, whatever line-ending
sequence is in use, and whatever setting of the PCRE2_DOTALL and
PCRE2_MULTILINE options is used. A class such as [^a] always matches
one of these characters.

-The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
-\w, and \W may appear in a character class, and add the characters that
-they match to the class. For example, [\dABCDEF] matches any
-hexadecimal digit. In UTF modes, the PCRE2_UCP option affects the
-meanings of \d, \s, \w and their upper case partners, just as it does
-when they appear outside a character class, as described in the section
-entitled "Generic character types" above. The escape sequence \b has a
-different meaning inside a character class; it matches the backspace
-character. The sequences \B, \N, \R, and \X are not special inside a
-character class. Like any other unrecognized escape sequences, they
-cause an error.
+The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
+\S, \v, \V, \w, and \W may appear in a character class, and add the
+characters that they match to the class. For example, [\dABCDEF]
+matches any hexadecimal digit. In UTF modes, the PCRE2_UCP option
+affects the meanings of \d, \s, \w and their upper case partners, just
+as it does when they appear outside a character class, as described in
+the section entitled "Generic character types" above. The escape
+sequence \b has a different meaning inside a character class; it
+matches the backspace character. The sequences \B, \R, and \X are not
+special inside a character class. Like any other unrecognized escape
+sequences, they cause an error. The same is true for \N when not
+followed by an opening brace.

The minus (hyphen) character can be used to specify a range of
characters in a character class. For example, [d-m] matches any letter
|
@ -9012,7 +9030,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 20 July 2018
|
||||
Last updated: 27 July 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
@ -9873,19 +9891,23 @@ ESCAPED CHARACTERS
|
|||
\ddd character with octal code ddd, or backreference
|
||||
\o{ddd..} character with octal code ddd..
|
||||
\U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
|
||||
\N{U+hh..} character with Unicode code point hh..
|
||||
\uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
|
||||
\xhh character with hex code hh
|
||||
\x{hhh..} character with hex code hhh..
|
||||
\x{hh..} character with hex code hh..
|
||||
|
||||
Note that \0dd is always an octal code. The treatment of backslash fol-
|
||||
lowed by a non-zero digit is complicated; for details see the section
|
||||
"Non-printing characters" in the pcre2pattern documentation, where
|
||||
details of escape processing in EBCDIC environments are also given.
|
||||
details of escape processing in EBCDIC environments are also given.
|
||||
\N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not supported in
|
||||
EBCDIC environments. Note that \N not followed by an opening curly
|
||||
bracket has a different meaning (see below).
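To make the shape of the new escape concrete, the following C sketch
recognizes the text form \N{U+hh..} and extracts the code point. It is
only an illustration of the syntax described above, not PCRE2's own
parser: the function name parse_n_codepoint is hypothetical, and the
rejection of values above U+10FFFF (the top of the Unicode code space)
is an assumption of the sketch rather than something this text states.

```c
#include <ctype.h>
#include <stdint.h>
#include <string.h>

/* Illustrative parser (not PCRE2 code). If s begins with \N{U+hh..}
   stores the code point in *cp and returns the number of characters
   consumed; otherwise returns 0 and leaves *cp untouched. */
size_t parse_n_codepoint(const char *s, uint32_t *cp)
{
  static const char prefix[] = "\\N{U+";
  size_t i = sizeof(prefix) - 1;
  size_t digits = 0;
  uint32_t value = 0;

  if (strncmp(s, prefix, i) != 0) return 0;   /* e.g. \N or \N{name} */

  while (isxdigit((unsigned char)s[i])) {
    int c = tolower((unsigned char)s[i]);
    if (value > 0x10ffff) return 0;           /* assumed upper limit */
    value = value * 16 + (uint32_t)(isdigit(c) ? c - '0' : c - 'a' + 10);
    digits++;
    i++;
  }

  if (digits == 0 || s[i] != '}') return 0;   /* need digits and a close */
  if (value > 0x10ffff) return 0;
  *cp = value;
  return i + 1;
}
```

A plain \N with no brace, and the Perl form \N{name}, are both rejected
by this sketch, mirroring the distinction drawn in the note above.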
|
||||
|
||||
When \x is not followed by {, from zero to two hexadecimal digits are
|
||||
When \x is not followed by {, from zero to two hexadecimal digits are
|
||||
read, but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadec-
|
||||
imal digits to be recognized as a hexadecimal escape; otherwise it
|
||||
matches a literal "x". Likewise, if \u (in ALT_BSUX mode) is not fol-
|
||||
imal digits to be recognized as a hexadecimal escape; otherwise it
|
||||
matches a literal "x". Likewise, if \u (in ALT_BSUX mode) is not fol-
|
||||
lowed by four hexadecimal digits, it matches a literal "u".
|
||||
|
||||
|
||||
|
@ -9910,14 +9932,14 @@ CHARACTER TYPES
|
|||
\W a "non-word" character
|
||||
\X a Unicode extended grapheme cluster
|
||||
|
||||
\C is dangerous because it may leave the current matching point in the
middle of a UTF-8 or UTF-16 character. The application can lock out the
use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is also
possible to build PCRE2 with the use of \C permanently disabled.

By default, \d, \s, and \w match only ASCII characters, even in UTF-8
mode or in the 16-bit and 32-bit libraries. However, if locale-specific
matching is happening, \s and \w may also match characters with code
points in the range 128-255. If the PCRE2_UCP option is set, the
behaviour of these escape sequences is changed to use Unicode properties
and they match many more characters.
|
@ -9986,28 +10008,28 @@ PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P
|
|||
|
||||
SCRIPT NAMES FOR \p AND \P
|
||||
|
||||
Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan,
Balinese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi,
Braille, Buginese, Buhid, Canadian_Aboriginal, Carian,
Caucasian_Albanian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform,
Cypriot, Cyrillic, Deseret, Devanagari, Dogra, Duployan,
Egyptian_Hieroglyphs, Elbasan, Ethiopic, Georgian, Glagolitic, Gothic,
Grantha, Greek, Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul,
Hanifi_Rohingya, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic,
Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese,
Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki,
Khudawadi, Lao, Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian,
Lydian, Mahajani, Makasar, Malayalam, Mandaic, Manichaean, Marchen,
Masaram_Gondi, Medefaidrin, Meetei_Mayek, Mende_Kikakui,
Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro,
Multani, Myanmar, Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham,
Ol_Chiki, Old_Hungarian, Old_Italic, Old_North_Arabian, Old_Permic,
Old_Persian, Old_Sogdian, Old_South_Arabian, Old_Turkic, Oriya, Osage,
Osmanya, Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician,
Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada,
Shavian, Siddham, SignWriting, Sinhala, Sogdian, Sora_Sompeng, Soyombo,
Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan,
Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi, Zanabazar_Square.
|
||||
|
||||
|
@ -10034,8 +10056,8 @@ CHARACTER CLASSES
|
|||
word same as \w
|
||||
xdigit hexadecimal digit
|
||||
|
||||
In PCRE2, POSIX character set names recognize only ASCII characters by
|
||||
default, but some of them use Unicode properties if PCRE2_UCP is set.
|
||||
In PCRE2, POSIX character set names recognize only ASCII characters by
|
||||
default, but some of them use Unicode properties if PCRE2_UCP is set.
|
||||
You can use \Q...\E inside a character class.
|
||||
|
||||
|
||||
|
@ -10121,8 +10143,8 @@ OPTION SETTING
|
|||
(?xx) as (?x) but also ignore space and tab in classes
|
||||
(?-...) unset option(s)
|
||||
|
||||
The following are recognized only at the very start of a pattern or
|
||||
after one of the newline or \R options with similar syntax. More than
|
||||
The following are recognized only at the very start of a pattern or
|
||||
after one of the newline or \R options with similar syntax. More than
|
||||
one of them may appear. For the first three, d is a decimal number.
|
||||
|
||||
(*LIMIT_DEPTH=d) set the backtracking limit to d
|
||||
|
@ -10137,17 +10159,17 @@ OPTION SETTING
|
|||
(*UTF) set appropriate UTF mode for the library in use
|
||||
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
|
||||
|
||||
Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the
|
||||
value of the limits set by the caller of pcre2_match() or
|
||||
pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
|
||||
Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the
|
||||
value of the limits set by the caller of pcre2_match() or
|
||||
pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
|
||||
synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
|
||||
and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
|
||||
and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
|
||||
respectively, at compile time.
|
||||
|
||||
|
||||
NEWLINE CONVENTION
|
||||
|
||||
These are recognized only at the very start of the pattern or after
|
||||
These are recognized only at the very start of the pattern or after
|
||||
option settings with a similar syntax.
|
||||
|
||||
(*CR) carriage return only
|
||||
|
@ -10160,7 +10182,7 @@ NEWLINE CONVENTION
|
|||
|
||||
WHAT \R MATCHES
|
||||
|
||||
These are recognized only at the very start of the pattern or after
|
||||
These are recognized only at the very start of the pattern or after
|
||||
option setting with a similar syntax.
|
||||
|
||||
(*BSR_ANYCRLF) CR, LF, or CRLF
|
||||
|
@ -10229,16 +10251,16 @@ CONDITIONAL PATTERNS
|
|||
(?(VERSION[>]=n.m) test PCRE2 version
|
||||
(?(assert) assertion condition
|
||||
|
||||
Note the ambiguity of (?(R) and (?(Rn) which might be named reference
|
||||
conditions or recursion tests. Such a condition is interpreted as a
|
||||
Note the ambiguity of (?(R) and (?(Rn) which might be named reference
|
||||
conditions or recursion tests. Such a condition is interpreted as a
|
||||
reference condition if the relevant named group exists.
|
||||
|
||||
|
||||
BACKTRACKING CONTROL
|
||||
|
||||
All backtracking control verbs may be in the form (*VERB:NAME). For
|
||||
(*MARK) the name is mandatory, for the others it is optional. (*SKIP)
|
||||
changes its behaviour if :NAME is present. The others just set a name
|
||||
All backtracking control verbs may be in the form (*VERB:NAME). For
|
||||
(*MARK) the name is mandatory, for the others it is optional. (*SKIP)
|
||||
changes its behaviour if :NAME is present. The others just set a name
|
||||
for passing back to the caller, but this is not a name that (*SKIP) can
|
||||
see. The following act immediately they are reached:
|
||||
|
||||
|
@ -10246,7 +10268,7 @@ BACKTRACKING CONTROL
|
|||
(*FAIL) force backtrack; synonym (*F)
|
||||
(*MARK:NAME) set name to be passed back; synonym (*:NAME)
|
||||
|
||||
The following act only when a subsequent match failure causes a back-
|
||||
The following act only when a subsequent match failure causes a back-
|
||||
track to reach them. They all force a match failure, but they differ in
|
||||
what happens afterwards. Those that advance the start-of-match point do
|
||||
so only if the pattern is not anchored.
|
||||
|
@ -10258,7 +10280,7 @@ BACKTRACKING CONTROL
|
|||
(*MARK:NAME); if not found, the (*SKIP) is ignored
|
||||
(*THEN) local failure, backtrack to next alternation
|
||||
|
||||
The effect of one of these verbs in a group called as a subroutine is
|
||||
The effect of one of these verbs in a group called as a subroutine is
|
||||
confined to the subroutine call.
|
||||
|
||||
|
||||
|
@ -10269,14 +10291,14 @@ CALLOUTS
|
|||
(?C"text") callout with string data
|
||||
|
||||
The allowed string delimiters are ` ' " ^ % # $ (which are the same for
|
||||
the start and the end), and the starting delimiter { matched with the
|
||||
ending delimiter }. To encode the ending delimiter within the string,
|
||||
the start and the end), and the starting delimiter { matched with the
|
||||
ending delimiter }. To encode the ending delimiter within the string,
|
||||
double it.
|
||||
|
||||
|
||||
SEE ALSO
|
||||
|
||||
pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
|
||||
pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
|
||||
pcre2(3).
|
||||
|
||||
|
||||
|
@ -10289,7 +10311,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 21 July 2018
|
||||
Last updated: 27 July 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2API 3 "02 July 2018" "PCRE2 10.32"
|
||||
.TH PCRE2API 3 "27 July 2018" "PCRE2 10.32"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.sp
|
||||
|
@ -1400,7 +1400,8 @@ character, even if newlines are coded as CRLF. Without this option, a dot does
|
|||
not match when the current position in the subject is at a newline. This option
|
||||
is equivalent to Perl's /s option, and it can be changed within a pattern by a
|
||||
(?s) option setting. A negative class such as [^a] always matches newline
|
||||
characters, independent of the setting of this option.
|
||||
characters, and the \eN escape sequence always matches a non-newline character,
|
||||
independent of the setting of PCRE2_DOTALL.
|
||||
.sp
|
||||
PCRE2_DUPNAMES
|
||||
.sp
|
||||
|
@ -3640,6 +3641,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 02 July 2018
|
||||
Last updated: 27 July 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2PATTERN 3 "20 July 2018" "PCRE2 10.32"
|
||||
.TH PCRE2PATTERN 3 "27 July 2018" "PCRE2 10.32"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||
|
@ -218,10 +218,11 @@ is used.
|
|||
.P
|
||||
The newline convention affects where the circumflex and dollar assertions are
|
||||
true. It also affects the interpretation of the dot metacharacter when
|
||||
PCRE2_DOTALL is not set, and the behaviour of \eN. However, it does not affect
|
||||
what the \eR escape sequence matches. By default, this is any Unicode newline
|
||||
sequence, for Perl compatibility. However, this can be changed; see the next
|
||||
section and the description of \eR in the section entitled
|
||||
PCRE2_DOTALL is not set, and the behaviour of \eN when not followed by an
|
||||
opening brace. However, it does not affect what the \eR escape sequence
|
||||
matches. By default, this is any Unicode newline sequence, for Perl
|
||||
compatibility. However, this can be changed; see the next section and the
|
||||
description of \eR in the section entitled
|
||||
.\" HTML <a href="#newlineseq">
|
||||
.\" </a>
|
||||
"Newline sequences"
|
||||
|
@ -359,20 +360,26 @@ text editing, it is often easier to use one of the following escape sequences
|
|||
than the binary character it represents. In an ASCII or Unicode environment,
|
||||
these escapes are as follows:
|
||||
.sp
|
||||
\ea alarm, that is, the BEL character (hex 07)
|
||||
\ecx "control-x", where x is any printable ASCII character
|
||||
\ee escape (hex 1B)
|
||||
\ef form feed (hex 0C)
|
||||
\en linefeed (hex 0A)
|
||||
\er carriage return (hex 0D)
|
||||
\et tab (hex 09)
|
||||
\e0dd character with octal code 0dd
|
||||
\eddd character with octal code ddd, or backreference
|
||||
\eo{ddd..} character with octal code ddd..
|
||||
\exhh character with hex code hh
|
||||
\ex{hhh..} character with hex code hhh.. (default mode)
|
||||
\euhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
|
||||
\ea alarm, that is, the BEL character (hex 07)
|
||||
\ecx "control-x", where x is any printable ASCII character
|
||||
\ee escape (hex 1B)
|
||||
\ef form feed (hex 0C)
|
||||
\en linefeed (hex 0A)
|
||||
\er carriage return (hex 0D)
|
||||
\et tab (hex 09)
|
||||
\e0dd character with octal code 0dd
|
||||
\eddd character with octal code ddd, or backreference
|
||||
\eo{ddd..} character with octal code ddd..
|
||||
\exhh character with hex code hh
|
||||
\ex{hhh..} character with hex code hhh.. (default mode)
|
||||
\eN{U+hhh..} character with Unicode code point hhh..
|
||||
\euhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
|
||||
.sp
|
||||
Note that when \eN is not followed by an opening brace (curly bracket) it has
|
||||
an entirely different meaning, matching any character that is not a newline.
|
||||
Perl also uses \eN{name} to specify characters by Unicode name; PCRE2 does not
|
||||
support this.
|
||||
.P
|
||||
The precise effect of \ecx on ASCII characters is as follows: if x is a lower
|
||||
case letter, it is converted to upper case. Then bit 6 of the character (hex
|
||||
40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
|
||||
|
@ -380,14 +387,14 @@ but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the
|
|||
code unit following \ec has a value less than 32 or greater than 126, a
|
||||
compile-time error occurs.
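The arithmetic behind \ecx in an ASCII environment is small enough to
spell out. The C sketch below follows the rule as stated above: fold a
lower case letter to upper case, then invert bit 6 (hex 40). The
function name control_escape_value is hypothetical and not part of
PCRE2, the sketch assumes an ASCII execution character set, and
returning -1 merely stands in for the compile-time error that PCRE2
raises.

```c
/* Illustrative helper (not PCRE2 code): value produced by \cx for an
   ASCII character x, or -1 where PCRE2 would give a compile-time
   error (x below 32 or above 126). */
int control_escape_value(int x)
{
  if (x < 32 || x > 126) return -1;
  if (x >= 'a' && x <= 'z') x -= 32;  /* lower case to upper case */
  return x ^ 0x40;                    /* invert bit 6 (hex 40) */
}
```

With this rule, A and a both give hex 01, Z gives hex 1A, { gives
hex 3B and ; gives hex 7B, in line with the values quoted in the text.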
|
||||
.P
|
||||
When PCRE2 is compiled in EBCDIC mode, \ea, \ee, \ef, \en, \er, and \et
|
||||
generate the appropriate EBCDIC code values. The \ec escape is processed
|
||||
as specified for Perl in the \fBperlebcdic\fP document. The only characters
|
||||
that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ], ^, _, or ?. Any
|
||||
other character provokes a compile-time error. The sequence \ec@ encodes
|
||||
character code 0; after \ec the letters (in either case) encode characters 1-26
|
||||
(hex 01 to hex 1A); [, \e, ], ^, and _ encode characters 27-31 (hex 1B to hex
|
||||
1F), and \ec? becomes either 255 (hex FF) or 95 (hex 5F).
|
||||
When PCRE2 is compiled in EBCDIC mode, \eN{U+hhh..} is not supported. \ea, \ee,
|
||||
\ef, \en, \er, and \et generate the appropriate EBCDIC code values. The \ec
|
||||
escape is processed as specified for Perl in the \fBperlebcdic\fP document. The
|
||||
only characters that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ],
|
||||
^, _, or ?. Any other character provokes a compile-time error. The sequence
|
||||
\ec@ encodes character code 0; after \ec the letters (in either case) encode
|
||||
characters 1-26 (hex 01 to hex 1A); [, \e, ], ^, and _ encode characters 27-31
|
||||
(hex 1B to hex 1F), and \ec? becomes either 255 (hex FF) or 95 (hex 5F).
|
||||
.P
|
||||
Thus, apart from \ec?, these escapes generate the same character code values as
|
||||
they do in an ASCII environment, though the meanings of the values mostly
|
||||
|
@ -414,9 +421,9 @@ numbers greater than 0777, and it also allows octal numbers and backreferences
|
|||
to be unambiguously specified.
|
||||
.P
|
||||
For greater clarity and unambiguity, it is best to avoid following \e by a
|
||||
digit greater than zero. Instead, use \eo{} or \ex{} to specify character
|
||||
numbers, and \eg{} to specify backreferences. The following paragraphs
|
||||
describe the old, ambiguous syntax.
|
||||
digit greater than zero. Instead, use \eo{} or \ex{} to specify numerical
|
||||
character code points, and \eg{} to specify backreferences. The following
|
||||
paragraphs describe the old, ambiguous syntax.
|
||||
.P
|
||||
The handling of a backslash followed by a digit other than 0 is complicated,
|
||||
and Perl has changed over time, causing PCRE2 also to change.
|
||||
|
@ -507,10 +514,10 @@ All the sequences that define a single character value can be used both inside
|
|||
and outside character classes. In addition, inside a character class, \eb is
|
||||
interpreted as the backspace character (hex 08).
|
||||
.P
|
||||
\eN is not allowed in a character class. \eB, \eR, and \eX are not special
|
||||
inside a character class. Like other unrecognized alphabetic escape sequences,
|
||||
they cause an error. Outside a character class, these sequences have different
|
||||
meanings.
|
||||
When not followed by an opening brace, \eN is not allowed in a character class.
|
||||
\eB, \eR, and \eX are not special inside a character class. Like other
|
||||
unrecognized alphabetic escape sequences, they cause an error. Outside a
|
||||
character class, these sequences have different meanings.
|
||||
.
|
||||
.
|
||||
.SS "Unsupported escape sequences"
|
||||
|
@ -569,6 +576,7 @@ Another use of backslash is for specifying generic character types:
|
|||
\eD any character that is not a decimal digit
|
||||
\eh any horizontal white space character
|
||||
\eH any character that is not a horizontal white space character
|
||||
\eN any character that is not a newline
|
||||
\es any white space character
|
||||
\eS any character that is not a white space character
|
||||
\ev any vertical white space character
|
||||
|
@ -576,14 +584,20 @@ Another use of backslash is for specifying generic character types:
|
|||
\ew any "word" character
|
||||
\eW any "non-word" character
|
||||
.sp
|
||||
There is also the single sequence \eN, which matches a non-newline character.
|
||||
This is the same as
|
||||
The \eN escape sequence has the same meaning as
|
||||
.\" HTML <a href="#fullstopdot">
|
||||
.\" </a>
|
||||
the "." metacharacter
|
||||
.\"
|
||||
when PCRE2_DOTALL is not set. Perl also uses \eN to match characters by name;
|
||||
PCRE2 does not support this.
|
||||
when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the
|
||||
meaning of \eN. Note that when \eN is followed by an opening brace it has a
|
||||
different meaning. See the section entitled
|
||||
.\" HTML <a href="#digitsafterbackslash">
|
||||
.\" </a>
|
||||
"Non-printing characters"
|
||||
.\"
|
||||
above for details. Perl also uses \eN{name} to specify characters by Unicode
|
||||
name; PCRE2 does not support this.
|
||||
.P
|
||||
Each pair of lower and upper case escape sequences partitions the complete set
|
||||
of characters into two disjoint sets. Any given character matches one, and only
|
||||
|
@ -1289,9 +1303,17 @@ The handling of dot is entirely independent of the handling of circumflex and
|
|||
dollar, the only relationship being that they both involve newlines. Dot has no
|
||||
special meaning in a character class.
|
||||
.P
|
||||
The escape sequence \eN behaves like a dot, except that it is not affected by
|
||||
the PCRE2_DOTALL option. In other words, it matches any character except one
|
||||
that signifies the end of a line. Perl also uses \eN to match characters by
|
||||
The escape sequence \eN when not followed by an opening brace behaves like a
|
||||
dot, except that it is not affected by the PCRE2_DOTALL option. In other words,
|
||||
it matches any character except one that signifies the end of a line.
|
||||
.P
|
||||
When \eN is followed by an opening brace it has a different meaning. See the
|
||||
section entitled
|
||||
.\" HTML <a href="#digitsafterbackslash">
|
||||
.\" </a>
|
||||
"Non-printing characters"
|
||||
.\"
|
||||
above for details. Perl also uses \eN{name} to specify characters by Unicode
|
||||
name; PCRE2 does not support this.
|
||||
.
|
||||
.
|
||||
|
@ -1380,30 +1402,32 @@ circumflex is not an assertion; it still consumes a character from the subject
|
|||
string, and therefore it fails if the current pointer is at the end of the
|
||||
string.
|
||||
.P
|
||||
When caseless matching is set, any letters in a class represent both their
|
||||
upper case and lower case versions, so for example, a caseless [aeiou] matches
|
||||
"A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a
|
||||
caseful version would.
|
||||
Characters in a class may be specified by their code points using \eo, \ex, or
|
||||
\eN{U+hh..} in the usual way. When caseless matching is set, any letters in a
|
||||
class represent both their upper case and lower case versions, so for example,
|
||||
a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
|
||||
match "A", whereas a caseful version would.
|
||||
.P
|
||||
Characters that might indicate line breaks are never treated in any special way
|
||||
when matching character classes, whatever line-ending sequence is in use, and
|
||||
whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
|
||||
class such as [^a] always matches one of these characters.
|
||||
.P
|
||||
The character escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es, \eS, \ev,
|
||||
\eV, \ew, and \eW may appear in a character class, and add the characters that
|
||||
they match to the class. For example, [\edABCDEF] matches any hexadecimal
|
||||
digit. In UTF modes, the PCRE2_UCP option affects the meanings of \ed, \es, \ew
|
||||
and their upper case partners, just as it does when they appear outside a
|
||||
character class, as described in the section entitled
|
||||
The generic character type escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es,
|
||||
\eS, \ev, \eV, \ew, and \eW may appear in a character class, and add the
|
||||
characters that they match to the class. For example, [\edABCDEF] matches any
|
||||
hexadecimal digit. In UTF modes, the PCRE2_UCP option affects the meanings of
|
||||
\ed, \es, \ew and their upper case partners, just as it does when they appear
|
||||
outside a character class, as described in the section entitled
|
||||
.\" HTML <a href="#genericchartypes">
|
||||
.\" </a>
|
||||
"Generic character types"
|
||||
.\"
|
||||
above. The escape sequence \eb has a different meaning inside a character
|
||||
class; it matches the backspace character. The sequences \eB, \eN, \eR, and \eX
|
||||
are not special inside a character class. Like any other unrecognized escape
|
||||
sequences, they cause an error.
|
||||
class; it matches the backspace character. The sequences \eB, \eR, and \eX are
|
||||
not special inside a character class. Like any other unrecognized escape
|
||||
sequences, they cause an error. The same is true for \eN when not followed by
|
||||
an opening brace.
|
||||
.P
|
||||
The minus (hyphen) character can be used to specify a range of characters in a
|
||||
character class. For example, [d-m] matches any letter between d and m,
|
||||
|
@ -3580,6 +3604,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 20 July 2018
|
||||
Last updated: 27 July 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2SYNTAX 3 "21 July 2018" "PCRE2 10.32"
|
||||
.TH PCRE2SYNTAX 3 "27 July 2018" "PCRE2 10.32"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
||||
|
@ -35,9 +35,10 @@ This table applies to ASCII and Unicode environments.
|
|||
\eddd character with octal code ddd, or backreference
|
||||
\eo{ddd..} character with octal code ddd..
|
||||
\eU "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
|
||||
\eN{U+hh..} character with Unicode code point hh..
|
||||
\euhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
|
||||
\exhh character with hex code hh
|
||||
\ex{hhh..} character with hex code hhh..
|
||||
\ex{hh..} character with hex code hh..
|
||||
.sp
|
||||
Note that \e0dd is always an octal code. The treatment of backslash followed by
|
||||
a non-zero digit is complicated; for details see the section
|
||||
|
@ -50,7 +51,9 @@ in the
|
|||
\fBpcre2pattern\fP
|
||||
.\"
|
||||
documentation, where details of escape processing in EBCDIC environments are
|
||||
also given.
|
||||
also given. \eN{U+hh..} is synonymous with \ex{hh..} in PCRE2 but is not
|
||||
supported in EBCDIC environments. Note that \eN not followed by an opening
|
||||
curly bracket has a different meaning (see below).
|
||||
.P
|
||||
When \ex is not followed by {, from zero to two hexadecimal digits are read,
|
||||
but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
|
||||
|
@ -609,6 +612,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 21 July 2018
|
||||
Last updated: 27 July 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -316,6 +316,7 @@ pcre2_pattern_convert(). */
|
|||
#define PCRE2_ERROR_INTERNAL_BAD_CODE_IN_SKIP 190
|
||||
#define PCRE2_ERROR_NO_SURROGATES_IN_UTF16 191
|
||||
#define PCRE2_ERROR_BAD_LITERAL_OPTIONS 192
|
||||
#define PCRE2_ERROR_NOT_SUPPORTED_IN_EBCDIC 193
|
||||
|
||||
|
||||
/* "Expected" matching error codes: no match and partial match. */
|
||||
|
|
|
@ -731,7 +731,7 @@ enum { ERR0 = COMPILE_ERROR_BASE,
|
|||
ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69, ERR70,
|
||||
ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79, ERR80,
|
||||
ERR81, ERR82, ERR83, ERR84, ERR85, ERR86, ERR87, ERR88, ERR89, ERR90,
|
||||
ERR91, ERR92};
|
||||
ERR91, ERR92, ERR93 };
|
||||
|
||||
/* This is a table of start-of-pattern options such as (*UTF) and settings such
|
||||
as (*LIMIT_MATCH=nnnn) and (*CRLF). For completeness and backward
|
||||
|
@ -1441,6 +1441,42 @@ else if ((i = escapes[c - ESCAPES_FIRST]) != 0)
|
|||
escape = -i; /* Else return a special escape */
|
||||
if (cb != NULL && (escape == ESC_P || escape == ESC_p || escape == ESC_X))
|
||||
cb->external_flags |= PCRE2_HASBKPORX; /* Note \P, \p, or \X */
|
||||
|
||||
/* Perl supports \N{name} for character names and \N{U+dddd} for numerical
|
||||
Unicode code points, as well as plain \N for "not newline". PCRE does not
|
||||
support \N{name}. However, it does support quantification such as \N{2,3},
|
||||
so if \N{ is not followed by U+dddd we check for a quantifier. */
|
||||
|
||||
if (escape == ESC_N && ptr < ptrend && *ptr == CHAR_LEFT_CURLY_BRACKET)
|
||||
{
|
||||
PCRE2_SPTR p = ptr + 1;
|
||||
|
||||
/* \N{U+ can be handled by the \x{ code. However, this construction is
|
||||
not valid in EBCDIC environments because it specifies a Unicode
|
||||
character, not a codepoint in the local code. For example \N{U+0041}
|
||||
must be "A" in all environments. */
|
||||
|
||||
if (ptrend - p > 1 && *p == CHAR_U && p[1] == CHAR_PLUS)
|
||||
{
|
||||
#ifdef EBCDIC
|
||||
*errorcodeptr = ERR93;
|
||||
#else
|
||||
ptr = p + 1;
|
||||
escape = 0; /* Not a fancy escape after all */
|
||||
goto COME_FROM_NU;
|
||||
#endif
|
||||
}
|
||||
|
||||
/* Give an error if what follows is not a quantifier, but don't override
|
||||
an error set by the quantifier reader (e.g. number overflow). */
|
||||
|
||||
else
|
||||
{
|
||||
if (!read_repeat_counts(&p, ptrend, NULL, NULL, errorcodeptr) &&
|
||||
*errorcodeptr == 0)
|
||||
*errorcodeptr = ERR37;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
|
@ -1725,6 +1761,9 @@ else
|
|||
{
|
||||
if (ptr < ptrend && *ptr == CHAR_LEFT_CURLY_BRACKET)
|
||||
{
|
||||
#ifndef EBCDIC
|
||||
COME_FROM_NU:
|
||||
#endif
|
||||
if (++ptr >= ptrend || *ptr == CHAR_RIGHT_CURLY_BRACKET)
|
||||
{
|
||||
*errorcodeptr = ERR78;
|
||||
|
@ -1858,19 +1897,6 @@ else
|
|||
}
|
||||
}
|
||||
|
||||
/* Perl supports \N{name} for character names, as well as plain \N for "not
|
||||
newline". PCRE does not support \N{name}. However, it does support
|
||||
quantification such as \N{2,3}. */
|
||||
|
||||
if (escape == ESC_N && ptr < ptrend && *ptr == CHAR_LEFT_CURLY_BRACKET &&
|
||||
ptrend - ptr > 2)
|
||||
{
|
||||
PCRE2_SPTR p = ptr + 1;
|
||||
if (!read_repeat_counts(&p, ptrend, NULL, NULL, errorcodeptr) &&
|
||||
*errorcodeptr == 0)
|
||||
*errorcodeptr = ERR37;
|
||||
}
|
||||
|
||||
/* Set the pointer to the next character before returning. */
|
||||
|
||||
*ptrptr = ptr;
|
||||
|
@ -3223,7 +3249,6 @@ while (ptr < ptrend)
|
|||
tempptr = ptr;
|
||||
escape = PRIV(check_escape)(&ptr, ptrend, &c, &errorcode,
|
||||
options, TRUE, cb);
|
||||
|
||||
if (errorcode != 0)
|
||||
{
|
||||
CLASS_ESCAPE_FAILED:
|
||||
|
|
|
@ -161,7 +161,7 @@ static const unsigned char compile_error_texts[] =
|
|||
"using UCP is disabled by the application\0"
|
||||
"name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)\0"
|
||||
"character code point value in \\u.... sequence is too large\0"
|
||||
"digits missing in \\x{} or \\o{}\0"
|
||||
"digits missing in \\x{} or \\o{} or \\N{U+}\0"
|
||||
"syntax error or number too big in (?(VERSION condition\0"
|
||||
/* 80 */
|
||||
"internal error: unknown opcode in auto_possessify()\0"
|
||||
|
@ -179,6 +179,7 @@ static const unsigned char compile_error_texts[] =
|
|||
"internal error: bad code value in parsed_skip()\0"
|
||||
"PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is not allowed in UTF-16 mode\0"
|
||||
"invalid option bits with PCRE2_LITERAL\0"
|
||||
"\\N{U+dddd} is not supported in EBCDIC mode\0"
|
||||
;
|
||||
|
||||
/* Match-time and UTF error texts are in the same format. */
|
||||
|
|
|
@ -2287,5 +2287,11 @@
|
|||
\x{123}\x{122}\x{123}
|
||||
\= Expect no match
|
||||
\x{123}\x{124}\x{123}
|
||||
|
||||
/\N{U+1234}/utf
|
||||
\x{1234}
|
||||
|
||||
/[\N{U+1234}]/utf
|
||||
\x{1234}
|
||||
|
||||
# End of testinput4
|
||||
|
|
|
@ -2087,4 +2087,8 @@
|
|||
\x{655}
|
||||
\x{1D1AA}
|
||||
|
||||
/\N{U+}/
|
||||
|
||||
/\N{U}/
|
||||
|
||||
# End of testinput5
|
||||
|
|
|
@ -13194,7 +13194,7 @@ Failed: error 167 at offset 5: non-hex character in \x{} (closing brace missing?
|
|||
Failed: error 167 at offset 7: non-hex character in \x{} (closing brace missing?)
|
||||
|
||||
/^A\x{/
|
||||
Failed: error 178 at offset 5: digits missing in \x{} or \o{}
|
||||
Failed: error 178 at offset 5: digits missing in \x{} or \o{} or \N{U+}
|
||||
|
||||
/[ab]++/B,no_auto_possess
|
||||
------------------------------------------------------------------
|
||||
|
@ -13408,7 +13408,7 @@ Failed: error 133 at offset 7: parentheses are too deeply nested (stack check)
|
|||
Failed: error 155 at offset 2: missing opening brace after \o
|
||||
|
||||
/\o{}/
|
||||
Failed: error 178 at offset 3: digits missing in \x{} or \o{}
|
||||
Failed: error 178 at offset 3: digits missing in \x{} or \o{} or \N{U+}
|
||||
|
||||
/\o{whatever}/
|
||||
Failed: error 164 at offset 3: non-octal character in \o{} (closing brace missing?)
|
||||
|
@ -13416,7 +13416,7 @@ Failed: error 164 at offset 3: non-octal character in \o{} (closing brace missin
|
|||
/\xthing/
|
||||
|
||||
/\x{}/
|
||||
Failed: error 178 at offset 3: digits missing in \x{} or \o{}
|
||||
Failed: error 178 at offset 3: digits missing in \x{} or \o{} or \N{U+}
|
||||
|
||||
/\x{whatever}/
|
||||
Failed: error 167 at offset 3: non-hex character in \x{} (closing brace missing?)
|
||||
|
|
|
@ -3703,5 +3703,13 @@ No match
|
|||
\= Expect no match
|
||||
\x{123}\x{124}\x{123}
|
||||
No match
|
||||
|
||||
/\N{U+1234}/utf
|
||||
\x{1234}
|
||||
0: \x{1234}
|
||||
|
||||
/[\N{U+1234}]/utf
|
||||
\x{1234}
|
||||
0: \x{1234}
|
||||
|
||||
# End of testinput4
|
||||
|
|
|
@ -4750,4 +4750,10 @@ No match
|
|||
\x{1D1AA}
|
||||
0: \x{1d1aa}
|
||||
|
||||
/\N{U+}/
|
||||
Failed: error 178 at offset 5: digits missing in \x{} or \o{} or \N{U+}
|
||||
|
||||
/\N{U}/
|
||||
Failed: error 137 at offset 2: PCRE does not support \L, \l, \N{name}, \U, or \u
|
||||
|
||||
# End of testinput5
|
||||
|
|
|
@ -1,3 +1,4 @@
|
|||
PCRE2 version 10.32-RC1 2018-02-19
|
||||
# This is a specialized test for checking, when PCRE2 is compiled with the
|
||||
# EBCDIC option but in an ASCII environment, that newline, white space, and \c
|
||||
# functionality is working. It catches cases where explicit values such as 0x0a
|
||||
|
@ -200,6 +201,6 @@ No match
|
|||
0: \xff
|
||||
|
||||
/\ƒ&/
|
||||
Failed: error 168 at offset 2: \c\x20must\x20be\x20followed\x20by\x20a\x20letter\x20or\x20one\x20of\x20[\]^_\x3f
|
||||
Failed: error 168 at offset 3: \c\x20must\x20be\x20followed\x20by\x20a\x20letter\x20or\x20one\x20of\x20[\]^_\x3f
|
||||
|
||||
# End
|
||||
|
|