Add support for \N{U+dd...}, for ASCII and Unicode modes only.
parent 775481293a
commit e9aa3c0a21
|
@ -130,6 +130,8 @@ present.
|
||||||
28. A (*MARK) name was not being passed back for positive assertions that were
|
28. A (*MARK) name was not being passed back for positive assertions that were
|
||||||
terminated by (*ACCEPT).
|
terminated by (*ACCEPT).
|
||||||
|
|
||||||
|
29. Add support for \N{U+dddd}, but not in EBCDIC environments.
|
||||||
|
|
||||||
|
|
||||||
Version 10.31 12-February-2018
|
Version 10.31 12-February-2018
|
||||||
------------------------------
|
------------------------------
|
||||||
|
|
|
@ -249,10 +249,11 @@ is used.
|
||||||
<P>
|
<P>
|
||||||
The newline convention affects where the circumflex and dollar assertions are
|
The newline convention affects where the circumflex and dollar assertions are
|
||||||
true. It also affects the interpretation of the dot metacharacter when
|
true. It also affects the interpretation of the dot metacharacter when
|
||||||
PCRE2_DOTALL is not set, and the behaviour of \N. However, it does not affect
|
PCRE2_DOTALL is not set, and the behaviour of \N when not followed by an
|
||||||
what the \R escape sequence matches. By default, this is any Unicode newline
|
opening brace. However, it does not affect what the \R escape sequence
|
||||||
sequence, for Perl compatibility. However, this can be changed; see the next
|
matches. By default, this is any Unicode newline sequence, for Perl
|
||||||
section and the description of \R in the section entitled
|
compatibility. However, this can be changed; see the next section and the
|
||||||
|
description of \R in the section entitled
|
||||||
<a href="#newlineseq">"Newline sequences"</a>
|
<a href="#newlineseq">"Newline sequences"</a>
|
||||||
below. A change of \R setting can be combined with a change of newline
|
below. A change of \R setting can be combined with a change of newline
|
||||||
convention.
|
convention.
|
||||||
|
@ -394,8 +395,15 @@ these escapes are as follows:
|
||||||
\o{ddd..} character with octal code ddd..
|
\o{ddd..} character with octal code ddd..
|
||||||
\xhh character with hex code hh
|
\xhh character with hex code hh
|
||||||
\x{hhh..} character with hex code hhh.. (default mode)
|
\x{hhh..} character with hex code hhh.. (default mode)
|
||||||
|
\N{U+hhh..} character with Unicode code point hhh..
|
||||||
\uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
|
\uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
|
||||||
</pre>
|
</pre>
|
||||||
|
Note that when \N is not followed by an opening brace (curly bracket) it has
|
||||||
|
an entirely different meaning, matching any character that is not a newline.
|
||||||
|
Perl also uses \N{name} to specify characters by Unicode name; PCRE2 does not
|
||||||
|
support this.
|
||||||
|
</P>
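The distinction drawn in this note can be illustrated with a small sketch. It is not PCRE2's parser; `classify_backslash_n` and its return labels are hypothetical names chosen for this example, and anything in braces that is not `U+` followed by hex digits is simply reported as "other" rather than modelled.

```python
import re

# Hypothetical helper, not part of PCRE2: recognise the \N{U+hhh..} form.
_CODE_POINT = re.compile(r"\\N\{U\+([0-9A-Fa-f]+)\}")


def classify_backslash_n(pattern, pos=0):
    """Return a (kind, value) pair for the \\N escape that starts at pos."""
    if not pattern.startswith("\\N", pos):
        raise ValueError("no \\N escape at this position")
    if not pattern.startswith("{", pos + 2):
        # Bare \N: matches any character that is not a newline.
        return ("non-newline", None)
    m = _CODE_POINT.match(pattern, pos)
    if m:
        # \N{U+hhh..}: one character, given by its Unicode code point.
        return ("code-point", int(m.group(1), 16))
    # Anything else in braces (for example Perl's \N{name}) is not a
    # code point escape; this sketch does not model how it is handled.
    return ("other", None)
```

For instance, `classify_backslash_n(r"\N{U+41}")` reports a code point of hex 41, whereas the `\N` in `a\Nb` is reported as the non-newline form.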
|
||||||
|
<P>
|
||||||
The precise effect of \cx on ASCII characters is as follows: if x is a lower
|
The precise effect of \cx on ASCII characters is as follows: if x is a lower
|
||||||
case letter, it is converted to upper case. Then bit 6 of the character (hex
|
case letter, it is converted to upper case. Then bit 6 of the character (hex
|
||||||
40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A),
|
40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A),
|
||||||
|
@ -404,14 +412,14 @@ code unit following \c has a value less than 32 or greater than 126, a
|
||||||
compile-time error occurs.
|
compile-time error occurs.
|
||||||
</P>
|
</P>
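The ASCII rule for \cx described above amounts to an upper-casing step followed by flipping one bit. The following is a minimal sketch of that arithmetic only; `control_escape_ascii` is a hypothetical name, and a Python exception stands in for the compile-time error.

```python
def control_escape_ascii(x):
    """Code value of \\cx in an ASCII environment, following the rule above."""
    code = ord(x)
    if code < 32 or code > 126:
        # The documentation says this case is a compile-time error.
        raise ValueError("character after \\c must be in the range 32 to 126")
    if "a" <= x <= "z":
        # Lower case letters are converted to upper case first.
        code = ord(x.upper())
    # Invert bit 6 of the character (hex 40).
    return code ^ 0x40
```

With this sketch, `control_escape_ascii("A")` gives hex 01, `control_escape_ascii("z")` gives hex 1A, and `control_escape_ascii(";")` gives hex 7B.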
|
||||||
<P>
|
<P>
|
||||||
When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t
|
When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported. \a, \e,
|
||||||
generate the appropriate EBCDIC code values. The \c escape is processed
|
\f, \n, \r, and \t generate the appropriate EBCDIC code values. The \c
|
||||||
as specified for Perl in the <b>perlebcdic</b> document. The only characters
|
escape is processed as specified for Perl in the <b>perlebcdic</b> document. The
|
||||||
that are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?. Any
|
only characters that are allowed after \c are A-Z, a-z, or one of @, [, \, ],
|
||||||
other character provokes a compile-time error. The sequence \c@ encodes
|
^, _, or ?. Any other character provokes a compile-time error. The sequence
|
||||||
character code 0; after \c the letters (in either case) encode characters 1-26
|
\c@ encodes character code 0; after \c the letters (in either case) encode
|
||||||
(hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex
|
characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31
|
||||||
1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F).
|
(hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F).
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Thus, apart from \c?, these escapes generate the same character code values as
|
Thus, apart from \c?, these escapes generate the same character code values as
|
||||||
|
@ -443,9 +451,9 @@ to be unambiguously specified.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
For greater clarity and unambiguity, it is best to avoid following \ by a
|
For greater clarity and unambiguity, it is best to avoid following \ by a
|
||||||
digit greater than zero. Instead, use \o{} or \x{} to specify character
|
digit greater than zero. Instead, use \o{} or \x{} to specify numerical
|
||||||
numbers, and \g{} to specify backreferences. The following paragraphs
|
character code points, and \g{} to specify backreferences. The following
|
||||||
describe the old, ambiguous syntax.
|
paragraphs describe the old, ambiguous syntax.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The handling of a backslash followed by a digit other than 0 is complicated,
|
The handling of a backslash followed by a digit other than 0 is complicated,
|
||||||
|
@ -528,10 +536,10 @@ and outside character classes. In addition, inside a character class, \b is
|
||||||
interpreted as the backspace character (hex 08).
|
interpreted as the backspace character (hex 08).
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
\N is not allowed in a character class. \B, \R, and \X are not special
|
When not followed by an opening brace, \N is not allowed in a character class.
|
||||||
inside a character class. Like other unrecognized alphabetic escape sequences,
|
\B, \R, and \X are not special inside a character class. Like other
|
||||||
they cause an error. Outside a character class, these sequences have different
|
unrecognized alphabetic escape sequences, they cause an error. Outside a
|
||||||
meanings.
|
character class, these sequences have different meanings.
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Unsupported escape sequences
|
Unsupported escape sequences
|
||||||
|
@ -577,6 +585,7 @@ Another use of backslash is for specifying generic character types:
|
||||||
\D any character that is not a decimal digit
|
\D any character that is not a decimal digit
|
||||||
\h any horizontal white space character
|
\h any horizontal white space character
|
||||||
\H any character that is not a horizontal white space character
|
\H any character that is not a horizontal white space character
|
||||||
|
\N any character that is not a newline
|
||||||
\s any white space character
|
\s any white space character
|
||||||
\S any character that is not a white space character
|
\S any character that is not a white space character
|
||||||
\v any vertical white space character
|
\v any vertical white space character
|
||||||
|
@ -584,11 +593,14 @@ Another use of backslash is for specifying generic character types:
|
||||||
\w any "word" character
|
\w any "word" character
|
||||||
\W any "non-word" character
|
\W any "non-word" character
|
||||||
</pre>
|
</pre>
|
||||||
There is also the single sequence \N, which matches a non-newline character.
|
The \N escape sequence has the same meaning as
|
||||||
This is the same as
|
|
||||||
<a href="#fullstopdot">the "." metacharacter</a>
|
<a href="#fullstopdot">the "." metacharacter</a>
|
||||||
when PCRE2_DOTALL is not set. Perl also uses \N to match characters by name;
|
when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the
|
||||||
PCRE2 does not support this.
|
meaning of \N. Note that when \N is followed by an opening brace it has a
|
||||||
|
different meaning. See the section entitled
|
||||||
|
<a href="#digitsafterbackslash">"Non-printing characters"</a>
|
||||||
|
above for details. Perl also uses \N{name} to specify characters by Unicode
|
||||||
|
name; PCRE2 does not support this.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Each pair of lower and upper case escape sequences partitions the complete set
|
Each pair of lower and upper case escape sequences partitions the complete set
|
||||||
|
@ -1297,9 +1309,15 @@ dollar, the only relationship being that they both involve newlines. Dot has no
|
||||||
special meaning in a character class.
|
special meaning in a character class.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The escape sequence \N behaves like a dot, except that it is not affected by
|
The escape sequence \N when not followed by an opening brace behaves like a
|
||||||
the PCRE2_DOTALL option. In other words, it matches any character except one
|
dot, except that it is not affected by the PCRE2_DOTALL option. In other words,
|
||||||
that signifies the end of a line. Perl also uses \N to match characters by
|
it matches any character except one that signifies the end of a line.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
When \N is followed by an opening brace it has a different meaning. See the
|
||||||
|
section entitled
|
||||||
|
<a href="#digitsafterbackslash">"Non-printing characters"</a>
|
||||||
|
above for details. Perl also uses \N{name} to specify characters by Unicode
|
||||||
name; PCRE2 does not support this.
|
name; PCRE2 does not support this.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC8" href="#TOC1">MATCHING A SINGLE CODE UNIT</a><br>
|
<br><a name="SEC8" href="#TOC1">MATCHING A SINGLE CODE UNIT</a><br>
|
||||||
|
@ -1385,10 +1403,11 @@ string, and therefore it fails if the current pointer is at the end of the
|
||||||
string.
|
string.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
When caseless matching is set, any letters in a class represent both their
|
Characters in a class may be specified by their code points using \o, \x, or
|
||||||
upper case and lower case versions, so for example, a caseless [aeiou] matches
|
\N{U+hh..} in the usual way. When caseless matching is set, any letters in a
|
||||||
"A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a
|
class represent both their upper case and lower case versions, so for example,
|
||||||
caseful version would.
|
a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
|
||||||
|
match "A", whereas a caseful version would.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Characters that might indicate line breaks are never treated in any special way
|
Characters that might indicate line breaks are never treated in any special way
|
||||||
|
@ -1397,17 +1416,18 @@ whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
|
||||||
class such as [^a] always matches one of these characters.
|
class such as [^a] always matches one of these characters.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,
|
The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
|
||||||
\V, \w, and \W may appear in a character class, and add the characters that
|
\S, \v, \V, \w, and \W may appear in a character class, and add the
|
||||||
they match to the class. For example, [\dABCDEF] matches any hexadecimal
|
characters that they match to the class. For example, [\dABCDEF] matches any
|
||||||
digit. In UTF modes, the PCRE2_UCP option affects the meanings of \d, \s, \w
|
hexadecimal digit. In UTF modes, the PCRE2_UCP option affects the meanings of
|
||||||
and their upper case partners, just as it does when they appear outside a
|
\d, \s, \w and their upper case partners, just as it does when they appear
|
||||||
character class, as described in the section entitled
|
outside a character class, as described in the section entitled
|
||||||
<a href="#genericchartypes">"Generic character types"</a>
|
<a href="#genericchartypes">"Generic character types"</a>
|
||||||
above. The escape sequence \b has a different meaning inside a character
|
above. The escape sequence \b has a different meaning inside a character
|
||||||
class; it matches the backspace character. The sequences \B, \N, \R, and \X
|
class; it matches the backspace character. The sequences \B, \R, and \X are
|
||||||
are not special inside a character class. Like any other unrecognized escape
|
not special inside a character class. Like any other unrecognized escape
|
||||||
sequences, they cause an error.
|
sequences, they cause an error. The same is true for \N when not followed by
|
||||||
|
an opening brace.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The minus (hyphen) character can be used to specify a range of characters in a
|
The minus (hyphen) character can be used to specify a range of characters in a
|
||||||
|
@ -3559,7 +3579,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 20 July 2018
|
Last updated: 27 July 2018
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2018 University of Cambridge.
|
Copyright © 1997-2018 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -70,9 +70,10 @@ This table applies to ASCII and Unicode environments.
|
||||||
\ddd character with octal code ddd, or backreference
|
\ddd character with octal code ddd, or backreference
|
||||||
\o{ddd..} character with octal code ddd..
|
\o{ddd..} character with octal code ddd..
|
||||||
\U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
|
\U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
|
||||||
|
\N{U+hh..} character with Unicode code point hh..
|
||||||
\uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
|
\uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
|
||||||
\xhh character with hex code hh
|
\xhh character with hex code hh
|
||||||
\x{hhh..} character with hex code hhh..
|
\x{hh..} character with hex code hh..
|
||||||
</pre>
|
</pre>
|
||||||
Note that \0dd is always an octal code. The treatment of backslash followed by
|
Note that \0dd is always an octal code. The treatment of backslash followed by
|
||||||
a non-zero digit is complicated; for details see the section
|
a non-zero digit is complicated; for details see the section
|
||||||
|
@ -80,7 +81,9 @@ a non-zero digit is complicated; for details see the section
|
||||||
in the
|
in the
|
||||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||||
documentation, where details of escape processing in EBCDIC environments are
|
documentation, where details of escape processing in EBCDIC environments are
|
||||||
also given.
|
also given. \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not
|
||||||
|
supported in EBCDIC environments. Note that \N not followed by an opening
|
||||||
|
curly bracket has a different meaning (see below).
|
||||||
</P>
|
</P>
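Because \N{U+hh..} is described as synonymous with \x{hh..}, a pattern can in principle be rewritten from one form to the other. The sketch below shows one way to do that under simplifying assumptions: `rewrite_n_code_points` is a hypothetical helper, it skips over other backslash escapes so that an escaped backslash followed by `N{U+..}` is left alone, and it does not understand \Q...\E quoting.

```python
import re

# Hypothetical helper, not part of PCRE2: matches one \N{U+hh..} escape.
_N_CODE_POINT = re.compile(r"\\N\{U\+([0-9A-Fa-f]+)\}")


def rewrite_n_code_points(pattern):
    """Replace each \\N{U+hh..} escape in pattern with \\x{hh..}."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\" and i + 1 < len(pattern):
            m = _N_CODE_POINT.match(pattern, i)
            if m:
                out.append("\\x{" + m.group(1) + "}")
                i = m.end()
            else:
                # Copy any other escape (backslash plus one character) as is.
                out.append(pattern[i:i + 2])
                i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)
```

A bare \N, which has the different meaning noted above, passes through unchanged.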
|
||||||
<P>
|
<P>
|
||||||
When \x is not followed by {, from zero to two hexadecimal digits are read,
|
When \x is not followed by {, from zero to two hexadecimal digits are read,
|
||||||
|
@ -621,7 +624,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 21 July 2018
|
Last updated: 27 July 2018
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2018 University of Cambridge.
|
Copyright © 1997-2018 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
118
doc/pcre2.txt
|
@ -6015,12 +6015,13 @@ SPECIAL START-OF-PATTERN ITEMS
|
||||||
|
|
||||||
The newline convention affects where the circumflex and dollar asser-
|
The newline convention affects where the circumflex and dollar asser-
|
||||||
tions are true. It also affects the interpretation of the dot metachar-
|
tions are true. It also affects the interpretation of the dot metachar-
|
||||||
acter when PCRE2_DOTALL is not set, and the behaviour of \N. However,
|
acter when PCRE2_DOTALL is not set, and the behaviour of \N when not
|
||||||
it does not affect what the \R escape sequence matches. By default,
|
followed by an opening brace. However, it does not affect what the \R
|
||||||
this is any Unicode newline sequence, for Perl compatibility. However,
|
escape sequence matches. By default, this is any Unicode newline
|
||||||
this can be changed; see the next section and the description of \R in
|
sequence, for Perl compatibility. However, this can be changed; see the
|
||||||
the section entitled "Newline sequences" below. A change of \R setting
|
next section and the description of \R in the section entitled "Newline
|
||||||
can be combined with a change of newline convention.
|
sequences" below. A change of \R setting can be combined with a change
|
||||||
|
of newline convention.
|
||||||
|
|
||||||
Specifying what \R matches
|
Specifying what \R matches
|
||||||
|
|
||||||
|
@ -6158,8 +6159,14 @@ BACKSLASH
|
||||||
\o{ddd..} character with octal code ddd..
|
\o{ddd..} character with octal code ddd..
|
||||||
\xhh character with hex code hh
|
\xhh character with hex code hh
|
||||||
\x{hhh..} character with hex code hhh.. (default mode)
|
\x{hhh..} character with hex code hhh.. (default mode)
|
||||||
|
\N{U+hhh..} character with Unicode code point hhh..
|
||||||
\uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
|
\uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
|
||||||
|
|
||||||
|
Note that when \N is not followed by an opening brace (curly bracket)
|
||||||
|
it has an entirely different meaning, matching any character that is
|
||||||
|
not a newline. Perl also uses \N{name} to specify characters by Uni-
|
||||||
|
code name; PCRE2 does not support this.
|
||||||
|
|
||||||
The precise effect of \cx on ASCII characters is as follows: if x is a
|
The precise effect of \cx on ASCII characters is as follows: if x is a
|
||||||
lower case letter, it is converted to upper case. Then bit 6 of the
|
lower case letter, it is converted to upper case. Then bit 6 of the
|
||||||
character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
|
character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
|
||||||
|
@ -6167,15 +6174,15 @@ BACKSLASH
|
||||||
hex 7B (; is 3B). If the code unit following \c has a value less than
|
hex 7B (; is 3B). If the code unit following \c has a value less than
|
||||||
32 or greater than 126, a compile-time error occurs.
|
32 or greater than 126, a compile-time error occurs.
|
||||||
|
|
||||||
When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t gen-
|
When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported.
|
||||||
erate the appropriate EBCDIC code values. The \c escape is processed as
|
\a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values.
|
||||||
specified for Perl in the perlebcdic document. The only characters that
|
The \c escape is processed as specified for Perl in the perlebcdic doc-
|
||||||
are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?.
|
ument. The only characters that are allowed after \c are A-Z, a-z, or
|
||||||
Any other character provokes a compile-time error. The sequence \c@
|
one of @, [, \, ], ^, _, or ?. Any other character provokes a compile-
|
||||||
encodes character code 0; after \c the letters (in either case) encode
|
time error. The sequence \c@ encodes character code 0; after \c the
|
||||||
characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters
|
letters (in either case) encode characters 1-26 (hex 01 to hex 1A); [,
|
||||||
27-31 (hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95
|
\, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c?
|
||||||
(hex 5F).
|
becomes either 255 (hex FF) or 95 (hex 5F).
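The EBCDIC-mode mapping for \c listed in this paragraph can be written out as a lookup. This is a sketch of the documented table only, not PCRE2 code; `control_escape_ebcdic` is a hypothetical name, and because the text here gives two possible values for \c? without saying which applies, the value is left as a parameter (an assumption of this example).

```python
import string


def control_escape_ebcdic(x, question_mark_value=255):
    """Character code for \\cx in EBCDIC mode, per the mapping described above."""
    if len(x) != 1:
        raise ValueError("exactly one character must follow \\c")
    if x == "@":
        return 0
    if x in string.ascii_letters:
        # Letters in either case encode characters 1-26 (hex 01 to hex 1A).
        return ord(x.upper()) - ord("A") + 1
    specials = "[\\]^_"
    if x in specials:
        # [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F).
        return 27 + specials.index(x)
    if x == "?":
        # Either 255 (hex FF) or 95 (hex 5F); the caller chooses here.
        return question_mark_value
    # Any other character provokes a compile-time error.
    raise ValueError("character not allowed after \\c in EBCDIC mode")
```

For example, `control_escape_ebcdic("a")` and `control_escape_ebcdic("A")` both give 1, and `control_escape_ebcdic("_")` gives 31.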
|
||||||
|
|
||||||
Thus, apart from \c?, these escapes generate the same character code
|
Thus, apart from \c?, these escapes generate the same character code
|
||||||
values as they do in an ASCII environment, though the meanings of the
|
values as they do in an ASCII environment, though the meanings of the
|
||||||
|
@ -6203,9 +6210,9 @@ BACKSLASH
|
||||||
numbers and backreferences to be unambiguously specified.
|
numbers and backreferences to be unambiguously specified.
|
||||||
|
|
||||||
For greater clarity and unambiguity, it is best to avoid following \ by
|
For greater clarity and unambiguity, it is best to avoid following \ by
|
||||||
a digit greater than zero. Instead, use \o{} or \x{} to specify charac-
|
a digit greater than zero. Instead, use \o{} or \x{} to specify numeri-
|
||||||
ter numbers, and \g{} to specify backreferences. The following para-
|
cal character code points, and \g{} to specify backreferences. The fol-
|
||||||
graphs describe the old, ambiguous syntax.
|
lowing paragraphs describe the old, ambiguous syntax.
|
||||||
|
|
||||||
The handling of a backslash followed by a digit other than 0 is compli-
|
The handling of a backslash followed by a digit other than 0 is compli-
|
||||||
cated, and Perl has changed over time, causing PCRE2 also to change.
|
cated, and Perl has changed over time, causing PCRE2 also to change.
|
||||||
|
@ -6281,10 +6288,10 @@ BACKSLASH
|
||||||
inside and outside character classes. In addition, inside a character
|
inside and outside character classes. In addition, inside a character
|
||||||
class, \b is interpreted as the backspace character (hex 08).
|
class, \b is interpreted as the backspace character (hex 08).
|
||||||
|
|
||||||
\N is not allowed in a character class. \B, \R, and \X are not special
|
When not followed by an opening brace, \N is not allowed in a character
|
||||||
inside a character class. Like other unrecognized alphabetic escape
|
class. \B, \R, and \X are not special inside a character class. Like
|
||||||
sequences, they cause an error. Outside a character class, these
|
other unrecognized alphabetic escape sequences, they cause an error.
|
||||||
sequences have different meanings.
|
Outside a character class, these sequences have different meanings.
|
||||||
|
|
||||||
Unsupported escape sequences
|
Unsupported escape sequences
|
||||||
|
|
||||||
|
@ -6318,6 +6325,7 @@ BACKSLASH
|
||||||
\D any character that is not a decimal digit
|
\D any character that is not a decimal digit
|
||||||
\h any horizontal white space character
|
\h any horizontal white space character
|
||||||
\H any character that is not a horizontal white space character
|
\H any character that is not a horizontal white space character
|
||||||
|
\N any character that is not a newline
|
||||||
\s any white space character
|
\s any white space character
|
||||||
\S any character that is not a white space character
|
\S any character that is not a white space character
|
||||||
\v any vertical white space character
|
\v any vertical white space character
|
||||||
|
@ -6325,10 +6333,12 @@ BACKSLASH
|
||||||
\w any "word" character
|
\w any "word" character
|
||||||
\W any "non-word" character
|
\W any "non-word" character
|
||||||
|
|
||||||
There is also the single sequence \N, which matches a non-newline char-
|
The \N escape sequence has the same meaning as the "." metacharacter
|
||||||
acter. This is the same as the "." metacharacter when PCRE2_DOTALL is
|
when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change
|
||||||
not set. Perl also uses \N to match characters by name; PCRE2 does not
|
the meaning of \N. Note that when \N is followed by an opening brace it
|
||||||
support this.
|
has a different meaning. See the section entitled "Non-printing charac-
|
||||||
|
ters" above for details. Perl also uses \N{name} to specify characters
|
||||||
|
by Unicode name; PCRE2 does not support this.
|
||||||
|
|
||||||
Each pair of lower and upper case escape sequences partitions the com-
|
Each pair of lower and upper case escape sequences partitions the com-
|
||||||
plete set of characters into two disjoint sets. Any given character
|
plete set of characters into two disjoint sets. Any given character
|
||||||
|
@ -6867,10 +6877,15 @@ FULL STOP (PERIOD, DOT) AND \N
|
||||||
flex and dollar, the only relationship being that they both involve
|
flex and dollar, the only relationship being that they both involve
|
||||||
newlines. Dot has no special meaning in a character class.
|
newlines. Dot has no special meaning in a character class.
|
||||||
|
|
||||||
The escape sequence \N behaves like a dot, except that it is not
|
The escape sequence \N when not followed by an opening brace behaves
|
||||||
affected by the PCRE2_DOTALL option. In other words, it matches any
|
like a dot, except that it is not affected by the PCRE2_DOTALL option.
|
||||||
character except one that signifies the end of a line. Perl also uses
|
In other words, it matches any character except one that signifies the
|
||||||
\N to match characters by name; PCRE2 does not support this.
|
end of a line.
|
||||||
|
|
||||||
|
When \N is followed by an opening brace it has a different meaning. See
|
||||||
|
the section entitled "Non-printing characters" above for details. Perl
|
||||||
|
also uses \N{name} to specify characters by Unicode name; PCRE2 does
|
||||||
|
not support this.
|
||||||
|
|
||||||
|
|
||||||
MATCHING A SINGLE CODE UNIT
|
MATCHING A SINGLE CODE UNIT
|
||||||
|
@ -6951,10 +6966,12 @@ SQUARE BRACKETS AND CHARACTER CLASSES
|
||||||
sumes a character from the subject string, and therefore it fails if
|
sumes a character from the subject string, and therefore it fails if
|
||||||
the current pointer is at the end of the string.
|
the current pointer is at the end of the string.
|
||||||
|
|
||||||
When caseless matching is set, any letters in a class represent both
|
Characters in a class may be specified by their code points using \o,
|
||||||
their upper case and lower case versions, so for example, a caseless
|
\x, or \N{U+hh..} in the usual way. When caseless matching is set, any
|
||||||
[aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
|
letters in a class represent both their upper case and lower case ver-
|
||||||
match "A", whereas a caseful version would.
|
sions, so for example, a caseless [aeiou] matches "A" as well as "a",
|
||||||
|
and a caseless [^aeiou] does not match "A", whereas a caseful version
|
||||||
|
would.
|
||||||
|
|
||||||
Characters that might indicate line breaks are never treated in any
|
Characters that might indicate line breaks are never treated in any
|
||||||
special way when matching character classes, whatever line-ending
|
special way when matching character classes, whatever line-ending
|
||||||
|
@ -6962,17 +6979,18 @@ SQUARE BRACKETS AND CHARACTER CLASSES
|
||||||
PCRE2_MULTILINE options is used. A class such as [^a] always matches
|
PCRE2_MULTILINE options is used. A class such as [^a] always matches
|
||||||
one of these characters.
|
one of these characters.
|
||||||
|
|
||||||
The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
|
The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
|
||||||
\w, and \W may appear in a character class, and add the characters that
|
\S, \v, \V, \w, and \W may appear in a character class, and add the
|
||||||
they match to the class. For example, [\dABCDEF] matches any hexadeci-
|
characters that they match to the class. For example, [\dABCDEF]
|
||||||
mal digit. In UTF modes, the PCRE2_UCP option affects the meanings of
|
matches any hexadecimal digit. In UTF modes, the PCRE2_UCP option
|
||||||
\d, \s, \w and their upper case partners, just as it does when they
|
affects the meanings of \d, \s, \w and their upper case partners, just
|
||||||
appear outside a character class, as described in the section entitled
|
as it does when they appear outside a character class, as described in
|
||||||
"Generic character types" above. The escape sequence \b has a different
|
the section entitled "Generic character types" above. The escape
|
||||||
meaning inside a character class; it matches the backspace character.
|
sequence \b has a different meaning inside a character class; it
|
||||||
The sequences \B, \N, \R, and \X are not special inside a character
|
matches the backspace character. The sequences \B, \R, and \X are not
|
||||||
class. Like any other unrecognized escape sequences, they cause an
|
special inside a character class. Like any other unrecognized escape
|
||||||
error.
|
sequences, they cause an error. The same is true for \N when not fol-
|
||||||
|
lowed by an opening brace.
|
||||||
|
|
||||||
The minus (hyphen) character can be used to specify a range of charac-
|
The minus (hyphen) character can be used to specify a range of charac-
|
||||||
ters in a character class. For example, [d-m] matches any letter
|
ters in a character class. For example, [d-m] matches any letter
|
||||||
|
@ -9012,7 +9030,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 20 July 2018
|
Last updated: 27 July 2018
|
||||||
Copyright (c) 1997-2018 University of Cambridge.
|
Copyright (c) 1997-2018 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
@ -9873,14 +9891,18 @@ ESCAPED CHARACTERS
|
||||||
\ddd character with octal code ddd, or backreference
|
\ddd character with octal code ddd, or backreference
|
||||||
\o{ddd..} character with octal code ddd..
|
\o{ddd..} character with octal code ddd..
|
||||||
\U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
|
\U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
|
||||||
|
\N{U+hh..} character with Unicode code point hh..
|
||||||
\uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
|
\uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
|
||||||
\xhh character with hex code hh
|
\xhh character with hex code hh
|
||||||
\x{hhh..} character with hex code hhh..
|
\x{hh..} character with hex code hh..
|
||||||
|
|
||||||
Note that \0dd is always an octal code. The treatment of backslash fol-
|
Note that \0dd is always an octal code. The treatment of backslash fol-
|
||||||
lowed by a non-zero digit is complicated; for details see the section
|
lowed by a non-zero digit is complicated; for details see the section
|
||||||
"Non-printing characters" in the pcre2pattern documentation, where
|
"Non-printing characters" in the pcre2pattern documentation, where
|
||||||
details of escape processing in EBCDIC environments are also given.
|
details of escape processing in EBCDIC environments are also given.
|
||||||
|
\N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not supported in
|
||||||
|
EBCDIC environments. Note that \N not followed by an opening curly
|
||||||
|
bracket has a different meaning (see below).
|
||||||
|
|
||||||
When \x is not followed by {, from zero to two hexadecimal digits are
|
When \x is not followed by {, from zero to two hexadecimal digits are
|
||||||
read, but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadec-
|
read, but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadec-
|
||||||
|
@ -10289,7 +10311,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 21 July 2018
|
Last updated: 27 July 2018
|
||||||
Copyright (c) 1997-2018 University of Cambridge.
|
Copyright (c) 1997-2018 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2API 3 "02 July 2018" "PCRE2 10.32"
|
.TH PCRE2API 3 "27 July 2018" "PCRE2 10.32"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.sp
|
.sp
|
||||||
|
@ -1400,7 +1400,8 @@ character, even if newlines are coded as CRLF. Without this option, a dot does
|
||||||
not match when the current position in the subject is at a newline. This option
|
not match when the current position in the subject is at a newline. This option
|
||||||
is equivalent to Perl's /s option, and it can be changed within a pattern by a
|
is equivalent to Perl's /s option, and it can be changed within a pattern by a
|
||||||
(?s) option setting. A negative class such as [^a] always matches newline
|
(?s) option setting. A negative class such as [^a] always matches newline
|
||||||
characters, independent of the setting of this option.
|
characters, and the \eN escape sequence always matches a non-newline character,
|
||||||
|
independent of the setting of PCRE2_DOTALL.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_DUPNAMES
|
PCRE2_DUPNAMES
|
||||||
.sp
|
.sp
|
||||||
|
@ -3640,6 +3641,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 02 July 2018
|
Last updated: 27 July 2018
|
||||||
Copyright (c) 1997-2018 University of Cambridge.
|
Copyright (c) 1997-2018 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2PATTERN 3 "20 July 2018" "PCRE2 10.32"
|
.TH PCRE2PATTERN 3 "27 July 2018" "PCRE2 10.32"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||||
|
@ -218,10 +218,11 @@ is used.
|
||||||
.P
|
.P
|
||||||
The newline convention affects where the circumflex and dollar assertions are
|
The newline convention affects where the circumflex and dollar assertions are
|
||||||
true. It also affects the interpretation of the dot metacharacter when
|
true. It also affects the interpretation of the dot metacharacter when
|
||||||
PCRE2_DOTALL is not set, and the behaviour of \eN. However, it does not affect
|
PCRE2_DOTALL is not set, and the behaviour of \eN when not followed by an
|
||||||
what the \eR escape sequence matches. By default, this is any Unicode newline
|
opening brace. However, it does not affect what the \eR escape sequence
|
||||||
sequence, for Perl compatibility. However, this can be changed; see the next
|
matches. By default, this is any Unicode newline sequence, for Perl
|
||||||
section and the description of \eR in the section entitled
|
compatibility. However, this can be changed; see the next section and the
|
||||||
|
description of \eR in the section entitled
|
||||||
.\" HTML <a href="#newlineseq">
|
.\" HTML <a href="#newlineseq">
|
||||||
.\" </a>
|
.\" </a>
|
||||||
"Newline sequences"
|
"Newline sequences"
|
||||||
|
@ -371,8 +372,14 @@ these escapes are as follows:
|
||||||
\eo{ddd..} character with octal code ddd..
|
\eo{ddd..} character with octal code ddd..
|
||||||
\exhh character with hex code hh
|
\exhh character with hex code hh
|
||||||
\ex{hhh..} character with hex code hhh.. (default mode)
|
\ex{hhh..} character with hex code hhh.. (default mode)
|
||||||
|
\eN{U+hhh..} character with Unicode code point hhh..
|
||||||
\euhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
|
\euhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
|
||||||
.sp
|
.sp
|
||||||
|
Note that when \eN is not followed by an opening brace (curly bracket) it has
|
||||||
|
an entirely different meaning, matching any character that is not a newline.
|
||||||
|
Perl also uses \eN{name} to specify characters by Unicode name; PCRE2 does not
|
||||||
|
support this.
|
||||||
|
.P
|
||||||
The precise effect of \ecx on ASCII characters is as follows: if x is a lower
|
The precise effect of \ecx on ASCII characters is as follows: if x is a lower
|
||||||
case letter, it is converted to upper case. Then bit 6 of the character (hex
|
case letter, it is converted to upper case. Then bit 6 of the character (hex
|
||||||
40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
|
40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
|
||||||
|
@ -380,14 +387,14 @@ but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the
|
||||||
code unit following \ec has a value less than 32 or greater than 126, a
|
code unit following \ec has a value less than 32 or greater than 126, a
|
||||||
compile-time error occurs.
|
compile-time error occurs.
|
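Reviewer note: the \cx rule described in this context paragraph (fold lower case to upper case, then invert bit 6, hex 40) can be sketched in a few lines of C. This is a minimal illustration for ASCII environments only, not the PCRE2 implementation; the function name `ctrl_escape_ascii` and the -1 return convention are assumptions made for this sketch.

```c
#include <assert.h>

/* Hypothetical helper: returns the character code that \cx stands for in an
   ASCII environment, or -1 where the text above says a compile-time error
   occurs (code unit below 32 or above 126). */
static int ctrl_escape_ascii(int x)
{
  if (x < 32 || x > 126) return -1;        /* outside the permitted range */
  if (x >= 'a' && x <= 'z') x -= 0x20;     /* lower case is converted to upper case */
  return x ^ 0x40;                         /* invert bit 6 (hex 40) */
}
```

With this helper, `ctrl_escape_ascii('A')` gives hex 01 and `ctrl_escape_ascii('{')` gives hex 3B, matching the values quoted in the paragraph.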
||||||
.P
|
.P
|
||||||
When PCRE2 is compiled in EBCDIC mode, \ea, \ee, \ef, \en, \er, and \et
|
When PCRE2 is compiled in EBCDIC mode, \eN{U+hhh..} is not supported. \ea, \ee,
|
||||||
generate the appropriate EBCDIC code values. The \ec escape is processed
|
\ef, \en, \er, and \et generate the appropriate EBCDIC code values. The \ec
|
||||||
as specified for Perl in the \fBperlebcdic\fP document. The only characters
|
escape is processed as specified for Perl in the \fBperlebcdic\fP document. The
|
||||||
that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ], ^, _, or ?. Any
|
only characters that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ],
|
||||||
other character provokes a compile-time error. The sequence \ec@ encodes
|
^, _, or ?. Any other character provokes a compile-time error. The sequence
|
||||||
character code 0; after \ec the letters (in either case) encode characters 1-26
|
\ec@ encodes character code 0; after \ec the letters (in either case) encode
|
||||||
(hex 01 to hex 1A); [, \e, ], ^, and _ encode characters 27-31 (hex 1B to hex
|
characters 1-26 (hex 01 to hex 1A); [, \e, ], ^, and _ encode characters 27-31
|
||||||
1F), and \ec? becomes either 255 (hex FF) or 95 (hex 5F).
|
(hex 1B to hex 1F), and \ec? becomes either 255 (hex FF) or 95 (hex 5F).
|
||||||
.P
|
.P
|
||||||
Thus, apart from \ec?, these escapes generate the same character code values as
|
Thus, apart from \ec?, these escapes generate the same character code values as
|
||||||
they do in an ASCII environment, though the meanings of the values mostly
|
they do in an ASCII environment, though the meanings of the values mostly
|
||||||
|
@ -414,9 +421,9 @@ numbers greater than 0777, and it also allows octal numbers and backreferences
|
||||||
to be unambiguously specified.
|
to be unambiguously specified.
|
||||||
.P
|
.P
|
||||||
For greater clarity and unambiguity, it is best to avoid following \e by a
|
For greater clarity and unambiguity, it is best to avoid following \e by a
|
||||||
digit greater than zero. Instead, use \eo{} or \ex{} to specify character
|
digit greater than zero. Instead, use \eo{} or \ex{} to specify numerical
|
||||||
numbers, and \eg{} to specify backreferences. The following paragraphs
|
character code points, and \eg{} to specify backreferences. The following
|
||||||
describe the old, ambiguous syntax.
|
paragraphs describe the old, ambiguous syntax.
|
||||||
.P
|
.P
|
||||||
The handling of a backslash followed by a digit other than 0 is complicated,
|
The handling of a backslash followed by a digit other than 0 is complicated,
|
||||||
and Perl has changed over time, causing PCRE2 also to change.
|
and Perl has changed over time, causing PCRE2 also to change.
|
||||||
|
@ -507,10 +514,10 @@ All the sequences that define a single character value can be used both inside
|
||||||
and outside character classes. In addition, inside a character class, \eb is
|
and outside character classes. In addition, inside a character class, \eb is
|
||||||
interpreted as the backspace character (hex 08).
|
interpreted as the backspace character (hex 08).
|
||||||
.P
|
.P
|
||||||
\eN is not allowed in a character class. \eB, \eR, and \eX are not special
|
When not followed by an opening brace, \eN is not allowed in a character class.
|
||||||
inside a character class. Like other unrecognized alphabetic escape sequences,
|
\eB, \eR, and \eX are not special inside a character class. Like other
|
||||||
they cause an error. Outside a character class, these sequences have different
|
unrecognized alphabetic escape sequences, they cause an error. Outside a
|
||||||
meanings.
|
character class, these sequences have different meanings.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SS "Unsupported escape sequences"
|
.SS "Unsupported escape sequences"
|
||||||
|
@ -569,6 +576,7 @@ Another use of backslash is for specifying generic character types:
|
||||||
\eD any character that is not a decimal digit
|
\eD any character that is not a decimal digit
|
||||||
\eh any horizontal white space character
|
\eh any horizontal white space character
|
||||||
\eH any character that is not a horizontal white space character
|
\eH any character that is not a horizontal white space character
|
||||||
|
\eN any character that is not a newline
|
||||||
\es any white space character
|
\es any white space character
|
||||||
\eS any character that is not a white space character
|
\eS any character that is not a white space character
|
||||||
\ev any vertical white space character
|
\ev any vertical white space character
|
||||||
|
@ -576,14 +584,20 @@ Another use of backslash is for specifying generic character types:
|
||||||
\ew any "word" character
|
\ew any "word" character
|
||||||
\eW any "non-word" character
|
\eW any "non-word" character
|
||||||
.sp
|
.sp
|
||||||
There is also the single sequence \eN, which matches a non-newline character.
|
The \eN escape sequence has the same meaning as
|
||||||
This is the same as
|
|
||||||
.\" HTML <a href="#fullstopdot">
|
.\" HTML <a href="#fullstopdot">
|
||||||
.\" </a>
|
.\" </a>
|
||||||
the "." metacharacter
|
the "." metacharacter
|
||||||
.\"
|
.\"
|
||||||
when PCRE2_DOTALL is not set. Perl also uses \eN to match characters by name;
|
when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the
|
||||||
PCRE2 does not support this.
|
meaning of \eN. Note that when \eN is followed by an opening brace it has a
|
||||||
|
different meaning. See the section entitled
|
||||||
|
.\" HTML <a href="#digitsafterbackslash">
|
||||||
|
.\" </a>
|
||||||
|
"Non-printing characters"
|
||||||
|
.\"
|
||||||
|
above for details. Perl also uses \eN{name} to specify characters by Unicode
|
||||||
|
name; PCRE2 does not support this.
|
||||||
.P
|
.P
|
||||||
Each pair of lower and upper case escape sequences partitions the complete set
|
Each pair of lower and upper case escape sequences partitions the complete set
|
||||||
of characters into two disjoint sets. Any given character matches one, and only
|
of characters into two disjoint sets. Any given character matches one, and only
|
||||||
|
@ -1289,9 +1303,17 @@ The handling of dot is entirely independent of the handling of circumflex and
|
||||||
dollar, the only relationship being that they both involve newlines. Dot has no
|
dollar, the only relationship being that they both involve newlines. Dot has no
|
||||||
special meaning in a character class.
|
special meaning in a character class.
|
||||||
.P
|
.P
|
||||||
The escape sequence \eN behaves like a dot, except that it is not affected by
|
The escape sequence \eN when not followed by an opening brace behaves like a
|
||||||
the PCRE2_DOTALL option. In other words, it matches any character except one
|
dot, except that it is not affected by the PCRE2_DOTALL option. In other words,
|
||||||
that signifies the end of a line. Perl also uses \eN to match characters by
|
it matches any character except one that signifies the end of a line.
|
||||||
|
.P
|
||||||
|
When \eN is followed by an opening brace it has a different meaning. See the
|
||||||
|
section entitled
|
||||||
|
.\" HTML <a href="#digitsafterbackslash">
|
||||||
|
.\" </a>
|
||||||
|
"Non-printing characters"
|
||||||
|
.\"
|
||||||
|
above for details. Perl also uses \eN{name} to specify characters by Unicode
|
||||||
name; PCRE2 does not support this.
|
name; PCRE2 does not support this.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
|
@ -1380,30 +1402,32 @@ circumflex is not an assertion; it still consumes a character from the subject
|
||||||
string, and therefore it fails if the current pointer is at the end of the
|
string, and therefore it fails if the current pointer is at the end of the
|
||||||
string.
|
string.
|
||||||
.P
|
.P
|
||||||
When caseless matching is set, any letters in a class represent both their
|
Characters in a class may be specified by their code points using \eo, \ex, or
|
||||||
upper case and lower case versions, so for example, a caseless [aeiou] matches
|
\eN{U+hh..} in the usual way. When caseless matching is set, any letters in a
|
||||||
"A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a
|
class represent both their upper case and lower case versions, so for example,
|
||||||
caseful version would.
|
a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
|
||||||
|
match "A", whereas a caseful version would.
|
||||||
.P
|
.P
|
||||||
Characters that might indicate line breaks are never treated in any special way
|
Characters that might indicate line breaks are never treated in any special way
|
||||||
when matching character classes, whatever line-ending sequence is in use, and
|
when matching character classes, whatever line-ending sequence is in use, and
|
||||||
whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
|
whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
|
||||||
class such as [^a] always matches one of these characters.
|
class such as [^a] always matches one of these characters.
|
||||||
.P
|
.P
|
||||||
The character escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es, \eS, \ev,
|
The generic character type escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es,
|
||||||
\eV, \ew, and \eW may appear in a character class, and add the characters that
|
\eS, \ev, \eV, \ew, and \eW may appear in a character class, and add the
|
||||||
they match to the class. For example, [\edABCDEF] matches any hexadecimal
|
characters that they match to the class. For example, [\edABCDEF] matches any
|
||||||
digit. In UTF modes, the PCRE2_UCP option affects the meanings of \ed, \es, \ew
|
hexadecimal digit. In UTF modes, the PCRE2_UCP option affects the meanings of
|
||||||
and their upper case partners, just as it does when they appear outside a
|
\ed, \es, \ew and their upper case partners, just as it does when they appear
|
||||||
character class, as described in the section entitled
|
outside a character class, as described in the section entitled
|
||||||
.\" HTML <a href="#genericchartypes">
|
.\" HTML <a href="#genericchartypes">
|
||||||
.\" </a>
|
.\" </a>
|
||||||
"Generic character types"
|
"Generic character types"
|
||||||
.\"
|
.\"
|
||||||
above. The escape sequence \eb has a different meaning inside a character
|
above. The escape sequence \eb has a different meaning inside a character
|
||||||
class; it matches the backspace character. The sequences \eB, \eN, \eR, and \eX
|
class; it matches the backspace character. The sequences \eB, \eR, and \eX are
|
||||||
are not special inside a character class. Like any other unrecognized escape
|
not special inside a character class. Like any other unrecognized escape
|
||||||
sequences, they cause an error.
|
sequences, they cause an error. The same is true for \eN when not followed by
|
||||||
|
an opening brace.
|
||||||
.P
|
.P
|
||||||
The minus (hyphen) character can be used to specify a range of characters in a
|
The minus (hyphen) character can be used to specify a range of characters in a
|
||||||
character class. For example, [d-m] matches any letter between d and m,
|
character class. For example, [d-m] matches any letter between d and m,
|
||||||
|
@ -3580,6 +3604,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 20 July 2018
|
Last updated: 27 July 2018
|
||||||
Copyright (c) 1997-2018 University of Cambridge.
|
Copyright (c) 1997-2018 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2SYNTAX 3 "21 July 2018" "PCRE2 10.32"
|
.TH PCRE2SYNTAX 3 "27 July 2018" "PCRE2 10.32"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
||||||
|
@ -35,9 +35,10 @@ This table applies to ASCII and Unicode environments.
|
||||||
\eddd character with octal code ddd, or backreference
|
\eddd character with octal code ddd, or backreference
|
||||||
\eo{ddd..} character with octal code ddd..
|
\eo{ddd..} character with octal code ddd..
|
||||||
\eU "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
|
\eU "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
|
||||||
|
\eN{U+hh..} character with Unicode code point hh..
|
||||||
\euhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
|
\euhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
|
||||||
\exhh character with hex code hh
|
\exhh character with hex code hh
|
||||||
\ex{hhh..} character with hex code hhh..
|
\ex{hh..} character with hex code hh..
|
||||||
.sp
|
.sp
|
||||||
Note that \e0dd is always an octal code. The treatment of backslash followed by
|
Note that \e0dd is always an octal code. The treatment of backslash followed by
|
||||||
a non-zero digit is complicated; for details see the section
|
a non-zero digit is complicated; for details see the section
|
||||||
|
@ -50,7 +51,9 @@ in the
|
||||||
\fBpcre2pattern\fP
|
\fBpcre2pattern\fP
|
||||||
.\"
|
.\"
|
||||||
documentation, where details of escape processing in EBCDIC environments are
|
documentation, where details of escape processing in EBCDIC environments are
|
||||||
also given.
|
also given. \eN{U+hh..} is synonymous with \ex{hh..} in PCRE2 but is not
|
||||||
|
supported in EBCDIC environments. Note that \eN not followed by an opening
|
||||||
|
curly bracket has a different meaning (see below).
|
||||||
.P
|
.P
|
||||||
When \ex is not followed by {, from zero to two hexadecimal digits are read,
|
When \ex is not followed by {, from zero to two hexadecimal digits are read,
|
||||||
but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
|
but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
|
||||||
|
@ -609,6 +612,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 21 July 2018
|
Last updated: 27 July 2018
|
||||||
Copyright (c) 1997-2018 University of Cambridge.
|
Copyright (c) 1997-2018 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -316,6 +316,7 @@ pcre2_pattern_convert(). */
|
||||||
#define PCRE2_ERROR_INTERNAL_BAD_CODE_IN_SKIP 190
|
#define PCRE2_ERROR_INTERNAL_BAD_CODE_IN_SKIP 190
|
||||||
#define PCRE2_ERROR_NO_SURROGATES_IN_UTF16 191
|
#define PCRE2_ERROR_NO_SURROGATES_IN_UTF16 191
|
||||||
#define PCRE2_ERROR_BAD_LITERAL_OPTIONS 192
|
#define PCRE2_ERROR_BAD_LITERAL_OPTIONS 192
|
||||||
|
#define PCRE2_ERROR_NOT_SUPPORTED_IN_EBCDIC 193
|
||||||
|
|
||||||
|
|
||||||
/* "Expected" matching error codes: no match and partial match. */
|
/* "Expected" matching error codes: no match and partial match. */
|
||||||
|
|
|
@ -731,7 +731,7 @@ enum { ERR0 = COMPILE_ERROR_BASE,
|
||||||
ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69, ERR70,
|
ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69, ERR70,
|
||||||
ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79, ERR80,
|
ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79, ERR80,
|
||||||
ERR81, ERR82, ERR83, ERR84, ERR85, ERR86, ERR87, ERR88, ERR89, ERR90,
|
ERR81, ERR82, ERR83, ERR84, ERR85, ERR86, ERR87, ERR88, ERR89, ERR90,
|
||||||
ERR91, ERR92};
|
ERR91, ERR92, ERR93 };
|
||||||
|
|
||||||
/* This is a table of start-of-pattern options such as (*UTF) and settings such
|
/* This is a table of start-of-pattern options such as (*UTF) and settings such
|
||||||
as (*LIMIT_MATCH=nnnn) and (*CRLF). For completeness and backward
|
as (*LIMIT_MATCH=nnnn) and (*CRLF). For completeness and backward
|
||||||
|
@ -1441,6 +1441,42 @@ else if ((i = escapes[c - ESCAPES_FIRST]) != 0)
|
||||||
escape = -i; /* Else return a special escape */
|
escape = -i; /* Else return a special escape */
|
||||||
if (cb != NULL && (escape == ESC_P || escape == ESC_p || escape == ESC_X))
|
if (cb != NULL && (escape == ESC_P || escape == ESC_p || escape == ESC_X))
|
||||||
cb->external_flags |= PCRE2_HASBKPORX; /* Note \P, \p, or \X */
|
cb->external_flags |= PCRE2_HASBKPORX; /* Note \P, \p, or \X */
|
||||||
|
|
||||||
|
/* Perl supports \N{name} for character names and \N{U+dddd} for numerical
|
||||||
|
Unicode code points, as well as plain \N for "not newline". PCRE does not
|
||||||
|
support \N{name}. However, it does support quantification such as \N{2,3},
|
||||||
|
so if \N{ is not followed by U+dddd we check for a quantifier. */
|
||||||
|
|
||||||
|
if (escape == ESC_N && ptr < ptrend && *ptr == CHAR_LEFT_CURLY_BRACKET)
|
||||||
|
{
|
||||||
|
PCRE2_SPTR p = ptr + 1;
|
||||||
|
|
||||||
|
/* \N{U+ can be handled by the \x{ code. However, this construction is
|
||||||
|
not valid in EBCDIC environments because it specifies a Unicode
|
||||||
|
character, not a codepoint in the local code. For example \N{U+0041}
|
||||||
|
must be "A" in all environments. */
|
||||||
|
|
||||||
|
if (ptrend - p > 1 && *p == CHAR_U && p[1] == CHAR_PLUS)
|
||||||
|
{
|
||||||
|
#ifdef EBCDIC
|
||||||
|
*errorcodeptr = ERR93;
|
||||||
|
#else
|
||||||
|
ptr = p + 1;
|
||||||
|
escape = 0; /* Not a fancy escape after all */
|
||||||
|
goto COME_FROM_NU;
|
||||||
|
#endif
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Give an error if what follows is not a quantifier, but don't override
|
||||||
|
an error set by the quantifier reader (e.g. number overflow). */
|
||||||
|
|
||||||
|
else
|
||||||
|
{
|
||||||
|
if (!read_repeat_counts(&p, ptrend, NULL, NULL, errorcodeptr) &&
|
||||||
|
*errorcodeptr == 0)
|
||||||
|
*errorcodeptr = ERR37;
|
||||||
|
}
|
||||||
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
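Reviewer note: the decision the hunk above makes after seeing `\N{` can be sketched as a standalone classifier. This is an illustration of the three-way split only (code point, quantifier, or unsupported), not the code in the commit: the names `nu_kind`, `is_quantifier_body`, and `classify_N_brace` are hypothetical, the result codes are not PCRE2 error numbers, and the 0x10FFFF limit assumes a Unicode mode (the real limit depends on code unit width and UTF setting).

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch. */
enum nu_kind {
  NU_CODEPOINT,        /* \N{U+hh..}: a numerical code point */
  NU_QUANTIFIER,       /* \N{2,3}: plain \N followed by a quantifier */
  NU_ERR_NO_DIGITS,    /* \N{U+} */
  NU_ERR_BAD_HEX,      /* non-hex character or missing closing brace */
  NU_ERR_TOO_BIG,      /* value above 0x10FFFF (assumed limit) */
  NU_ERR_UNSUPPORTED   /* anything else, for example \N{name} */
};

/* True if s, the text after '{', has the form n} or n,} or n,m} */
static int is_quantifier_body(const char *s)
{
  if (!isdigit((unsigned char)*s)) return 0;
  while (isdigit((unsigned char)*s)) s++;
  if (*s == ',')
    {
    s++;
    while (isdigit((unsigned char)*s)) s++;
    }
  return *s == '}';
}

/* s points just after "\N{"; on NU_CODEPOINT the value is stored in *cp. */
static enum nu_kind classify_N_brace(const char *s, uint32_t *cp)
{
  if (s[0] == 'U' && s[1] == '+')
    {
    uint32_t value = 0;
    s += 2;
    if (*s == '\0' || *s == '}') return NU_ERR_NO_DIGITS;
    for (; *s != '}'; s++)
      {
      unsigned char ch = (unsigned char)*s;
      if (!isxdigit(ch)) return NU_ERR_BAD_HEX;   /* also catches end of string */
      value = value * 16 +
        (uint32_t)(isdigit(ch) ? ch - '0' : tolower(ch) - 'a' + 10);
      if (value > 0x10FFFF) return NU_ERR_TOO_BIG;
      }
    *cp = value;
    return NU_CODEPOINT;
    }
  return is_quantifier_body(s) ? NU_QUANTIFIER : NU_ERR_UNSUPPORTED;
}
```

Under these assumptions, the inputs `U+1234}`, `U+}`, and `U}` classify as a code point, a missing-digits error, and an unsupported sequence respectively, which lines up with the expectations added to testoutput4 and testoutput5 later in this diff.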
||||||
|
@ -1725,6 +1761,9 @@ else
|
||||||
{
|
{
|
||||||
if (ptr < ptrend && *ptr == CHAR_LEFT_CURLY_BRACKET)
|
if (ptr < ptrend && *ptr == CHAR_LEFT_CURLY_BRACKET)
|
||||||
{
|
{
|
||||||
|
#ifndef EBCDIC
|
||||||
|
COME_FROM_NU:
|
||||||
|
#endif
|
||||||
if (++ptr >= ptrend || *ptr == CHAR_RIGHT_CURLY_BRACKET)
|
if (++ptr >= ptrend || *ptr == CHAR_RIGHT_CURLY_BRACKET)
|
||||||
{
|
{
|
||||||
*errorcodeptr = ERR78;
|
*errorcodeptr = ERR78;
|
||||||
|
@ -1858,19 +1897,6 @@ else
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
/* Perl supports \N{name} for character names, as well as plain \N for "not
|
|
||||||
newline". PCRE does not support \N{name}. However, it does support
|
|
||||||
quantification such as \N{2,3}. */
|
|
||||||
|
|
||||||
if (escape == ESC_N && ptr < ptrend && *ptr == CHAR_LEFT_CURLY_BRACKET &&
|
|
||||||
ptrend - ptr > 2)
|
|
||||||
{
|
|
||||||
PCRE2_SPTR p = ptr + 1;
|
|
||||||
if (!read_repeat_counts(&p, ptrend, NULL, NULL, errorcodeptr) &&
|
|
||||||
*errorcodeptr == 0)
|
|
||||||
*errorcodeptr = ERR37;
|
|
||||||
}
|
|
||||||
|
|
||||||
/* Set the pointer to the next character before returning. */
|
/* Set the pointer to the next character before returning. */
|
||||||
|
|
||||||
*ptrptr = ptr;
|
*ptrptr = ptr;
|
||||||
|
@ -3223,7 +3249,6 @@ while (ptr < ptrend)
|
||||||
tempptr = ptr;
|
tempptr = ptr;
|
||||||
escape = PRIV(check_escape)(&ptr, ptrend, &c, &errorcode,
|
escape = PRIV(check_escape)(&ptr, ptrend, &c, &errorcode,
|
||||||
options, TRUE, cb);
|
options, TRUE, cb);
|
||||||
|
|
||||||
if (errorcode != 0)
|
if (errorcode != 0)
|
||||||
{
|
{
|
||||||
CLASS_ESCAPE_FAILED:
|
CLASS_ESCAPE_FAILED:
|
||||||
|
|
|
@ -161,7 +161,7 @@ static const unsigned char compile_error_texts[] =
|
||||||
"using UCP is disabled by the application\0"
|
"using UCP is disabled by the application\0"
|
||||||
"name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)\0"
|
"name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)\0"
|
||||||
"character code point value in \\u.... sequence is too large\0"
|
"character code point value in \\u.... sequence is too large\0"
|
||||||
"digits missing in \\x{} or \\o{}\0"
|
"digits missing in \\x{} or \\o{} or \\N{U+}\0"
|
||||||
"syntax error or number too big in (?(VERSION condition\0"
|
"syntax error or number too big in (?(VERSION condition\0"
|
||||||
/* 80 */
|
/* 80 */
|
||||||
"internal error: unknown opcode in auto_possessify()\0"
|
"internal error: unknown opcode in auto_possessify()\0"
|
||||||
|
@ -179,6 +179,7 @@ static const unsigned char compile_error_texts[] =
|
||||||
"internal error: bad code value in parsed_skip()\0"
|
"internal error: bad code value in parsed_skip()\0"
|
||||||
"PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is not allowed in UTF-16 mode\0"
|
"PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is not allowed in UTF-16 mode\0"
|
||||||
"invalid option bits with PCRE2_LITERAL\0"
|
"invalid option bits with PCRE2_LITERAL\0"
|
||||||
|
"\\N{U+dddd} is not supported in EBCDIC mode\0"
|
||||||
;
|
;
|
||||||
|
|
||||||
/* Match-time and UTF error texts are in the same format. */
|
/* Match-time and UTF error texts are in the same format. */
|
||||||
|
|
|
@ -2288,4 +2288,10 @@
|
||||||
\= Expect no match
|
\= Expect no match
|
||||||
\x{123}\x{124}\x{123}
|
\x{123}\x{124}\x{123}
|
||||||
|
|
||||||
|
/\N{U+1234}/utf
|
||||||
|
\x{1234}
|
||||||
|
|
||||||
|
/[\N{U+1234}]/utf
|
||||||
|
\x{1234}
|
||||||
|
|
||||||
# End of testinput4
|
# End of testinput4
|
||||||
|
|
|
@ -2087,4 +2087,8 @@
|
||||||
\x{655}
|
\x{655}
|
||||||
\x{1D1AA}
|
\x{1D1AA}
|
||||||
|
|
||||||
|
/\N{U+}/
|
||||||
|
|
||||||
|
/\N{U}/
|
||||||
|
|
||||||
# End of testinput5
|
# End of testinput5
|
||||||
|
|
|
@ -13194,7 +13194,7 @@ Failed: error 167 at offset 5: non-hex character in \x{} (closing brace missing?
|
||||||
Failed: error 167 at offset 7: non-hex character in \x{} (closing brace missing?)
|
Failed: error 167 at offset 7: non-hex character in \x{} (closing brace missing?)
|
||||||
|
|
||||||
/^A\x{/
|
/^A\x{/
|
||||||
Failed: error 178 at offset 5: digits missing in \x{} or \o{}
|
Failed: error 178 at offset 5: digits missing in \x{} or \o{} or \N{U+}
|
||||||
|
|
||||||
/[ab]++/B,no_auto_possess
|
/[ab]++/B,no_auto_possess
|
||||||
------------------------------------------------------------------
|
------------------------------------------------------------------
|
||||||
|
@ -13408,7 +13408,7 @@ Failed: error 133 at offset 7: parentheses are too deeply nested (stack check)
|
||||||
Failed: error 155 at offset 2: missing opening brace after \o
|
Failed: error 155 at offset 2: missing opening brace after \o
|
||||||
|
|
||||||
/\o{}/
|
/\o{}/
|
||||||
Failed: error 178 at offset 3: digits missing in \x{} or \o{}
|
Failed: error 178 at offset 3: digits missing in \x{} or \o{} or \N{U+}
|
||||||
|
|
||||||
/\o{whatever}/
|
/\o{whatever}/
|
||||||
Failed: error 164 at offset 3: non-octal character in \o{} (closing brace missing?)
|
Failed: error 164 at offset 3: non-octal character in \o{} (closing brace missing?)
|
||||||
|
@ -13416,7 +13416,7 @@ Failed: error 164 at offset 3: non-octal character in \o{} (closing brace missin
|
||||||
/\xthing/
|
/\xthing/
|
||||||
|
|
||||||
/\x{}/
|
/\x{}/
|
||||||
Failed: error 178 at offset 3: digits missing in \x{} or \o{}
|
Failed: error 178 at offset 3: digits missing in \x{} or \o{} or \N{U+}
|
||||||
|
|
||||||
/\x{whatever}/
|
/\x{whatever}/
|
||||||
Failed: error 167 at offset 3: non-hex character in \x{} (closing brace missing?)
|
Failed: error 167 at offset 3: non-hex character in \x{} (closing brace missing?)
|
||||||
|
|
|
@ -3704,4 +3704,12 @@ No match
|
||||||
\x{123}\x{124}\x{123}
|
\x{123}\x{124}\x{123}
|
||||||
No match
|
No match
|
||||||
|
|
||||||
|
/\N{U+1234}/utf
|
||||||
|
\x{1234}
|
||||||
|
0: \x{1234}
|
||||||
|
|
||||||
|
/[\N{U+1234}]/utf
|
||||||
|
\x{1234}
|
||||||
|
0: \x{1234}
|
||||||
|
|
||||||
# End of testinput4
|
# End of testinput4
|
||||||
|
|
|
@ -4750,4 +4750,10 @@ No match
|
||||||
\x{1D1AA}
|
\x{1D1AA}
|
||||||
0: \x{1d1aa}
|
0: \x{1d1aa}
|
||||||
|
|
||||||
|
/\N{U+}/
|
||||||
|
Failed: error 178 at offset 5: digits missing in \x{} or \o{} or \N{U+}
|
||||||
|
|
||||||
|
/\N{U}/
|
||||||
|
Failed: error 137 at offset 2: PCRE does not support \L, \l, \N{name}, \U, or \u
|
||||||
|
|
||||||
# End of testinput5
|
# End of testinput5
|
||||||
|
|
|
@ -1,3 +1,4 @@
|
||||||
|
PCRE2 version 10.32-RC1 2018-02-19
|
||||||
# This is a specialized test for checking, when PCRE2 is compiled with the
|
# This is a specialized test for checking, when PCRE2 is compiled with the
|
||||||
# EBCDIC option but in an ASCII environment, that newline, white space, and \c
|
# EBCDIC option but in an ASCII environment, that newline, white space, and \c
|
||||||
# functionality is working. It catches cases where explicit values such as 0x0a
|
# functionality is working. It catches cases where explicit values such as 0x0a
|
||||||
|
@ -200,6 +201,6 @@ No match
|
||||||
0: \xff
|
0: \xff
|
||||||
|
|
||||||
/\ƒ&/
|
/\ƒ&/
|
||||||
Failed: error 168 at offset 2: \c\x20must\x20be\x20followed\x20by\x20a\x20letter\x20or\x20one\x20of\x20[\]^_\x3f
|
Failed: error 168 at offset 3: \c\x20must\x20be\x20followed\x20by\x20a\x20letter\x20or\x20one\x20of\x20[\]^_\x3f
|
||||||
|
|
||||||
# End
|
# End
|
||||||
|
|