Add support for \N{U+dd...}, for ASCII and Unicode modes only.
parent 775481293a
commit e9aa3c0a21
|
@ -129,6 +129,8 @@ present.
|
||||||
|
|
||||||
28. A (*MARK) name was not being passed back for positive assertions that were
|
28. A (*MARK) name was not being passed back for positive assertions that were
|
||||||
terminated by (*ACCEPT).
|
terminated by (*ACCEPT).
|
||||||
|
|
||||||
|
29. Add support for \N{U+dddd}, but not in EBCDIC environments.
|
||||||
|
|
||||||
|
|
||||||
Version 10.31 12-February-2018
|
Version 10.31 12-February-2018
|
||||||
|
|
|
@ -249,10 +249,11 @@ is used.
|
||||||
<P>
|
<P>
|
||||||
The newline convention affects where the circumflex and dollar assertions are
|
The newline convention affects where the circumflex and dollar assertions are
|
||||||
true. It also affects the interpretation of the dot metacharacter when
|
true. It also affects the interpretation of the dot metacharacter when
|
||||||
PCRE2_DOTALL is not set, and the behaviour of \N. However, it does not affect
|
PCRE2_DOTALL is not set, and the behaviour of \N when not followed by an
|
||||||
what the \R escape sequence matches. By default, this is any Unicode newline
|
opening brace. However, it does not affect what the \R escape sequence
|
||||||
sequence, for Perl compatibility. However, this can be changed; see the next
|
matches. By default, this is any Unicode newline sequence, for Perl
|
||||||
section and the description of \R in the section entitled
|
compatibility. However, this can be changed; see the next section and the
|
||||||
|
description of \R in the section entitled
|
||||||
<a href="#newlineseq">"Newline sequences"</a>
|
<a href="#newlineseq">"Newline sequences"</a>
|
||||||
below. A change of \R setting can be combined with a change of newline
|
below. A change of \R setting can be combined with a change of newline
|
||||||
convention.
|
convention.
|
||||||
|
@ -382,20 +383,27 @@ text editing, it is often easier to use one of the following escape sequences
|
||||||
than the binary character it represents. In an ASCII or Unicode environment,
|
than the binary character it represents. In an ASCII or Unicode environment,
|
||||||
these escapes are as follows:
|
these escapes are as follows:
|
||||||
<pre>
|
<pre>
|
||||||
\a alarm, that is, the BEL character (hex 07)
|
\a alarm, that is, the BEL character (hex 07)
|
||||||
\cx "control-x", where x is any printable ASCII character
|
\cx "control-x", where x is any printable ASCII character
|
||||||
\e escape (hex 1B)
|
\e escape (hex 1B)
|
||||||
\f form feed (hex 0C)
|
\f form feed (hex 0C)
|
||||||
\n linefeed (hex 0A)
|
\n linefeed (hex 0A)
|
||||||
\r carriage return (hex 0D)
|
\r carriage return (hex 0D)
|
||||||
\t tab (hex 09)
|
\t tab (hex 09)
|
||||||
\0dd character with octal code 0dd
|
\0dd character with octal code 0dd
|
||||||
\ddd character with octal code ddd, or backreference
|
\ddd character with octal code ddd, or backreference
|
||||||
\o{ddd..} character with octal code ddd..
|
\o{ddd..} character with octal code ddd..
|
||||||
\xhh character with hex code hh
|
\xhh character with hex code hh
|
||||||
\x{hhh..} character with hex code hhh.. (default mode)
|
\x{hhh..} character with hex code hhh.. (default mode)
|
||||||
\uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
|
\N{U+hhh..} character with Unicode code point hhh..
|
||||||
|
\uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
|
||||||
</pre>
|
</pre>
|
||||||
|
Note that when \N is not followed by an opening brace (curly bracket) it has
|
||||||
|
an entirely different meaning, matching any character that is not a newline.
|
||||||
|
Perl also uses \N{name} to specify characters by Unicode name; PCRE2 does not
|
||||||
|
support this.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
The precise effect of \cx on ASCII characters is as follows: if x is a lower
|
The precise effect of \cx on ASCII characters is as follows: if x is a lower
|
||||||
case letter, it is converted to upper case. Then bit 6 of the character (hex
|
case letter, it is converted to upper case. Then bit 6 of the character (hex
|
||||||
40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A),
|
40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A),
|
||||||
|
@ -404,14 +412,14 @@ code unit following \c has a value less than 32 or greater than 126, a
|
||||||
compile-time error occurs.
|
compile-time error occurs.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t
|
When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported. \a, \e,
|
||||||
generate the appropriate EBCDIC code values. The \c escape is processed
|
\f, \n, \r, and \t generate the appropriate EBCDIC code values. The \c
|
||||||
as specified for Perl in the <b>perlebcdic</b> document. The only characters
|
escape is processed as specified for Perl in the <b>perlebcdic</b> document. The
|
||||||
that are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?. Any
|
only characters that are allowed after \c are A-Z, a-z, or one of @, [, \, ],
|
||||||
other character provokes a compile-time error. The sequence \c@ encodes
|
^, _, or ?. Any other character provokes a compile-time error. The sequence
|
||||||
character code 0; after \c the letters (in either case) encode characters 1-26
|
\c@ encodes character code 0; after \c the letters (in either case) encode
|
||||||
(hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex
|
characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31
|
||||||
1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F).
|
(hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F).
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Thus, apart from \c?, these escapes generate the same character code values as
|
Thus, apart from \c?, these escapes generate the same character code values as
|
||||||
|
@ -443,9 +451,9 @@ to be unambiguously specified.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
For greater clarity and unambiguity, it is best to avoid following \ by a
|
For greater clarity and unambiguity, it is best to avoid following \ by a
|
||||||
digit greater than zero. Instead, use \o{} or \x{} to specify character
|
digit greater than zero. Instead, use \o{} or \x{} to specify numerical
|
||||||
numbers, and \g{} to specify backreferences. The following paragraphs
|
character code points, and \g{} to specify backreferences. The following
|
||||||
describe the old, ambiguous syntax.
|
paragraphs describe the old, ambiguous syntax.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The handling of a backslash followed by a digit other than 0 is complicated,
|
The handling of a backslash followed by a digit other than 0 is complicated,
|
||||||
|
@ -528,10 +536,10 @@ and outside character classes. In addition, inside a character class, \b is
|
||||||
interpreted as the backspace character (hex 08).
|
interpreted as the backspace character (hex 08).
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
\N is not allowed in a character class. \B, \R, and \X are not special
|
When not followed by an opening brace, \N is not allowed in a character class.
|
||||||
inside a character class. Like other unrecognized alphabetic escape sequences,
|
\B, \R, and \X are not special inside a character class. Like other
|
||||||
they cause an error. Outside a character class, these sequences have different
|
unrecognized alphabetic escape sequences, they cause an error. Outside a
|
||||||
meanings.
|
character class, these sequences have different meanings.
|
||||||
</P>
|
</P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Unsupported escape sequences
|
Unsupported escape sequences
|
||||||
|
@ -577,6 +585,7 @@ Another use of backslash is for specifying generic character types:
|
||||||
\D any character that is not a decimal digit
|
\D any character that is not a decimal digit
|
||||||
\h any horizontal white space character
|
\h any horizontal white space character
|
||||||
\H any character that is not a horizontal white space character
|
\H any character that is not a horizontal white space character
|
||||||
|
\N any character that is not a newline
|
||||||
\s any white space character
|
\s any white space character
|
||||||
\S any character that is not a white space character
|
\S any character that is not a white space character
|
||||||
\v any vertical white space character
|
\v any vertical white space character
|
||||||
|
@ -584,11 +593,14 @@ Another use of backslash is for specifying generic character types:
|
||||||
\w any "word" character
|
\w any "word" character
|
||||||
\W any "non-word" character
|
\W any "non-word" character
|
||||||
</pre>
|
</pre>
|
||||||
There is also the single sequence \N, which matches a non-newline character.
|
The \N escape sequence has the same meaning as
|
||||||
This is the same as
|
|
||||||
<a href="#fullstopdot">the "." metacharacter</a>
|
<a href="#fullstopdot">the "." metacharacter</a>
|
||||||
when PCRE2_DOTALL is not set. Perl also uses \N to match characters by name;
|
when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the
|
||||||
PCRE2 does not support this.
|
meaning of \N. Note that when \N is followed by an opening brace it has a
|
||||||
|
different meaning. See the section entitled
|
||||||
|
<a href="#digitsafterbackslash">"Non-printing characters"</a>
|
||||||
|
above for details. Perl also uses \N{name} to specify characters by Unicode
|
||||||
|
name; PCRE2 does not support this.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Each pair of lower and upper case escape sequences partitions the complete set
|
Each pair of lower and upper case escape sequences partitions the complete set
|
||||||
|
@ -1297,9 +1309,15 @@ dollar, the only relationship being that they both involve newlines. Dot has no
|
||||||
special meaning in a character class.
|
special meaning in a character class.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The escape sequence \N behaves like a dot, except that it is not affected by
|
The escape sequence \N when not followed by an opening brace behaves like a
|
||||||
the PCRE2_DOTALL option. In other words, it matches any character except one
|
dot, except that it is not affected by the PCRE2_DOTALL option. In other words,
|
||||||
that signifies the end of a line. Perl also uses \N to match characters by
|
it matches any character except one that signifies the end of a line.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
When \N is followed by an opening brace it has a different meaning. See the
|
||||||
|
section entitled
|
||||||
|
<a href="#digitsafterbackslash">"Non-printing characters"</a>
|
||||||
|
above for details. Perl also uses \N{name} to specify characters by Unicode
|
||||||
name; PCRE2 does not support this.
|
name; PCRE2 does not support this.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC8" href="#TOC1">MATCHING A SINGLE CODE UNIT</a><br>
|
<br><a name="SEC8" href="#TOC1">MATCHING A SINGLE CODE UNIT</a><br>
|
||||||
|
@ -1385,10 +1403,11 @@ string, and therefore it fails if the current pointer is at the end of the
|
||||||
string.
|
string.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
When caseless matching is set, any letters in a class represent both their
|
Characters in a class may be specified by their code points using \o, \x, or
|
||||||
upper case and lower case versions, so for example, a caseless [aeiou] matches
|
\N{U+hh..} in the usual way. When caseless matching is set, any letters in a
|
||||||
"A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a
|
class represent both their upper case and lower case versions, so for example,
|
||||||
caseful version would.
|
a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
|
||||||
|
match "A", whereas a caseful version would.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Characters that might indicate line breaks are never treated in any special way
|
Characters that might indicate line breaks are never treated in any special way
|
||||||
|
@ -1397,17 +1416,18 @@ whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
|
||||||
class such as [^a] always matches one of these characters.
|
class such as [^a] always matches one of these characters.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,
|
The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
|
||||||
\V, \w, and \W may appear in a character class, and add the characters that
|
\S, \v, \V, \w, and \W may appear in a character class, and add the
|
||||||
they match to the class. For example, [\dABCDEF] matches any hexadecimal
|
characters that they match to the class. For example, [\dABCDEF] matches any
|
||||||
digit. In UTF modes, the PCRE2_UCP option affects the meanings of \d, \s, \w
|
hexadecimal digit. In UTF modes, the PCRE2_UCP option affects the meanings of
|
||||||
and their upper case partners, just as it does when they appear outside a
|
\d, \s, \w and their upper case partners, just as it does when they appear
|
||||||
character class, as described in the section entitled
|
outside a character class, as described in the section entitled
|
||||||
<a href="#genericchartypes">"Generic character types"</a>
|
<a href="#genericchartypes">"Generic character types"</a>
|
||||||
above. The escape sequence \b has a different meaning inside a character
|
above. The escape sequence \b has a different meaning inside a character
|
||||||
class; it matches the backspace character. The sequences \B, \N, \R, and \X
|
class; it matches the backspace character. The sequences \B, \R, and \X are
|
||||||
are not special inside a character class. Like any other unrecognized escape
|
not special inside a character class. Like any other unrecognized escape
|
||||||
sequences, they cause an error.
|
sequences, they cause an error. The same is true for \N when not followed by
|
||||||
|
an opening brace.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
The minus (hyphen) character can be used to specify a range of characters in a
|
The minus (hyphen) character can be used to specify a range of characters in a
|
||||||
|
@ -3559,7 +3579,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 20 July 2018
|
Last updated: 27 July 2018
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2018 University of Cambridge.
|
Copyright © 1997-2018 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -70,9 +70,10 @@ This table applies to ASCII and Unicode environments.
|
||||||
\ddd character with octal code ddd, or backreference
|
\ddd character with octal code ddd, or backreference
|
||||||
\o{ddd..} character with octal code ddd..
|
\o{ddd..} character with octal code ddd..
|
||||||
\U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
|
\U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
|
||||||
|
\N{U+hh..} character with Unicode code point hh..
|
||||||
\uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
|
\uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
|
||||||
\xhh character with hex code hh
|
\xhh character with hex code hh
|
||||||
\x{hhh..} character with hex code hhh..
|
\x{hh..} character with hex code hh..
|
||||||
</pre>
|
</pre>
|
||||||
Note that \0dd is always an octal code. The treatment of backslash followed by
|
Note that \0dd is always an octal code. The treatment of backslash followed by
|
||||||
a non-zero digit is complicated; for details see the section
|
a non-zero digit is complicated; for details see the section
|
||||||
|
@ -80,7 +81,9 @@ a non-zero digit is complicated; for details see the section
|
||||||
in the
|
in the
|
||||||
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
|
||||||
documentation, where details of escape processing in EBCDIC environments are
|
documentation, where details of escape processing in EBCDIC environments are
|
||||||
also given.
|
also given. \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not
|
||||||
|
supported in EBCDIC environments. Note that \N not followed by an opening
|
||||||
|
curly bracket has a different meaning (see below).
|
||||||
</P>
|
</P>
|
||||||
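The statement in this hunk that \N{U+hh..} is synonymous with \x{hh..} can be illustrated with a small rewriting pass. The following is a minimal Python sketch under stated assumptions, not PCRE2's implementation: `rewrite_n_escapes` is a hypothetical helper name, and the scanner ignores \Q...\E literal sections.

```python
import re

# Hypothetical helper, not part of PCRE2.
# Matches the "{U+hhh..}" body that may follow "\N".
_CODE_POINT = re.compile(r"\{U\+([0-9A-Fa-f]+)\}")


def rewrite_n_escapes(pattern: str) -> str:
    # Rewrite every \N{U+hhh..} in the pattern as the equivalent \x{hhh..}.
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch != "\\" or i + 1 >= len(pattern):
            # Ordinary character (or a trailing lone backslash): copy as-is.
            out.append(ch)
            i += 1
            continue
        if pattern[i + 1] == "N":
            m = _CODE_POINT.match(pattern, i + 2)
            if m:
                out.append("\\x{" + m.group(1) + "}")
                i = m.end()
                continue
        # Any other escape, including an escaped backslash, is copied as a
        # two-character unit so that its second character is never rescanned.
        out.append(pattern[i:i + 2])
        i += 2
    return "".join(out)
```

With this sketch, `rewrite_n_escapes(r"a\N{U+20AC}b")` returns `a\x{20AC}b`. A bare \N (the non-newline matcher) and Perl's \N{name} form are left unchanged, which mirrors the distinction the documentation draws between the brace and non-brace forms.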
<P>
|
<P>
|
||||||
When \x is not followed by {, from zero to two hexadecimal digits are read,
|
When \x is not followed by {, from zero to two hexadecimal digits are read,
|
||||||
|
@ -621,7 +624,7 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 21 July 2018
|
Last updated: 27 July 2018
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2018 University of Cambridge.
|
Copyright © 1997-2018 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
392
doc/pcre2.txt
|
@ -6015,36 +6015,37 @@ SPECIAL START-OF-PATTERN ITEMS
|
||||||
|
|
||||||
The newline convention affects where the circumflex and dollar asser-
|
The newline convention affects where the circumflex and dollar asser-
|
||||||
tions are true. It also affects the interpretation of the dot metachar-
|
tions are true. It also affects the interpretation of the dot metachar-
|
||||||
acter when PCRE2_DOTALL is not set, and the behaviour of \N. However,
|
acter when PCRE2_DOTALL is not set, and the behaviour of \N when not
|
||||||
it does not affect what the \R escape sequence matches. By default,
|
followed by an opening brace. However, it does not affect what the \R
|
||||||
this is any Unicode newline sequence, for Perl compatibility. However,
|
escape sequence matches. By default, this is any Unicode newline
|
||||||
this can be changed; see the next section and the description of \R in
|
sequence, for Perl compatibility. However, this can be changed; see the
|
||||||
the section entitled "Newline sequences" below. A change of \R setting
|
next section and the description of \R in the section entitled "Newline
|
||||||
can be combined with a change of newline convention.
|
sequences" below. A change of \R setting can be combined with a change
|
||||||
|
of newline convention.
|
||||||
|
|
||||||
Specifying what \R matches
|
Specifying what \R matches
|
||||||
|
|
||||||
It is possible to restrict \R to match only CR, LF, or CRLF (instead of
|
It is possible to restrict \R to match only CR, LF, or CRLF (instead of
|
||||||
the complete set of Unicode line endings) by setting the option
|
the complete set of Unicode line endings) by setting the option
|
||||||
PCRE2_BSR_ANYCRLF at compile time. This effect can also be achieved by
|
PCRE2_BSR_ANYCRLF at compile time. This effect can also be achieved by
|
||||||
starting a pattern with (*BSR_ANYCRLF). For completeness, (*BSR_UNI-
|
starting a pattern with (*BSR_ANYCRLF). For completeness, (*BSR_UNI-
|
||||||
CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE.
|
CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE.
|
||||||
|
|
||||||
|
|
||||||
EBCDIC CHARACTER CODES
|
EBCDIC CHARACTER CODES
|
||||||
|
|
||||||
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
|
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
|
||||||
character code instead of ASCII or Unicode (typically a mainframe sys-
|
character code instead of ASCII or Unicode (typically a mainframe sys-
|
||||||
tem). In the sections below, character code values are ASCII or Uni-
|
tem). In the sections below, character code values are ASCII or Uni-
|
||||||
code; in an EBCDIC environment these characters may have different code
|
code; in an EBCDIC environment these characters may have different code
|
||||||
values, and there are no code points greater than 255.
|
values, and there are no code points greater than 255.
|
||||||
|
|
||||||
|
|
||||||
CHARACTERS AND METACHARACTERS
|
CHARACTERS AND METACHARACTERS
|
||||||
|
|
||||||
A regular expression is a pattern that is matched against a subject
|
A regular expression is a pattern that is matched against a subject
|
||||||
string from left to right. Most characters stand for themselves in a
|
string from left to right. Most characters stand for themselves in a
|
||||||
pattern, and match the corresponding characters in the subject. As a
|
pattern, and match the corresponding characters in the subject. As a
|
||||||
trivial example, the pattern
|
trivial example, the pattern
|
||||||
|
|
||||||
The quick brown fox
|
The quick brown fox
|
||||||
|
@ -6053,14 +6054,14 @@ CHARACTERS AND METACHARACTERS
|
||||||
caseless matching is specified (the PCRE2_CASELESS option), letters are
|
caseless matching is specified (the PCRE2_CASELESS option), letters are
|
||||||
matched independently of case.
|
matched independently of case.
|
||||||
|
|
||||||
The power of regular expressions comes from the ability to include
|
The power of regular expressions comes from the ability to include
|
||||||
alternatives and repetitions in the pattern. These are encoded in the
|
alternatives and repetitions in the pattern. These are encoded in the
|
||||||
pattern by the use of metacharacters, which do not stand for themselves
|
pattern by the use of metacharacters, which do not stand for themselves
|
||||||
but instead are interpreted in some special way.
|
but instead are interpreted in some special way.
|
||||||
|
|
||||||
There are two different sets of metacharacters: those that are recog-
|
There are two different sets of metacharacters: those that are recog-
|
||||||
nized anywhere in the pattern except within square brackets, and those
|
nized anywhere in the pattern except within square brackets, and those
|
||||||
that are recognized within square brackets. Outside square brackets,
|
that are recognized within square brackets. Outside square brackets,
|
||||||
the metacharacters are as follows:
|
the metacharacters are as follows:
|
||||||
|
|
||||||
\ general escape character with several uses
|
\ general escape character with several uses
|
||||||
|
@ -6079,7 +6080,7 @@ CHARACTERS AND METACHARACTERS
|
||||||
also "possessive quantifier"
|
also "possessive quantifier"
|
||||||
{ start min/max quantifier
|
{ start min/max quantifier
|
||||||
|
|
||||||
Part of a pattern that is in square brackets is called a "character
|
Part of a pattern that is in square brackets is called a "character
|
||||||
class". In a character class the only metacharacters are:
|
class". In a character class the only metacharacters are:
|
||||||
|
|
||||||
\ general escape character
|
\ general escape character
|
||||||
|
@ -6096,30 +6097,30 @@ BACKSLASH
|
||||||
|
|
||||||
The backslash character has several uses. Firstly, if it is followed by
|
The backslash character has several uses. Firstly, if it is followed by
|
||||||
a character that is not a number or a letter, it takes away any special
|
a character that is not a number or a letter, it takes away any special
|
||||||
meaning that character may have. This use of backslash as an escape
|
meaning that character may have. This use of backslash as an escape
|
||||||
character applies both inside and outside character classes.
|
character applies both inside and outside character classes.
|
||||||
|
|
||||||
For example, if you want to match a * character, you must write \* in
|
For example, if you want to match a * character, you must write \* in
|
||||||
the pattern. This escaping action applies whether or not the following
|
the pattern. This escaping action applies whether or not the following
|
||||||
character would otherwise be interpreted as a metacharacter, so it is
|
character would otherwise be interpreted as a metacharacter, so it is
|
||||||
always safe to precede a non-alphanumeric with backslash to specify
|
always safe to precede a non-alphanumeric with backslash to specify
|
||||||
that it stands for itself. In particular, if you want to match a back-
|
that it stands for itself. In particular, if you want to match a back-
|
||||||
slash, you write \\.
|
slash, you write \\.
|
||||||
|
|
||||||
In a UTF mode, only ASCII numbers and letters have any special meaning
|
In a UTF mode, only ASCII numbers and letters have any special meaning
|
||||||
after a backslash. All other characters (in particular, those whose
|
after a backslash. All other characters (in particular, those whose
|
||||||
code points are greater than 127) are treated as literals.
|
code points are greater than 127) are treated as literals.
|
||||||
|
|
||||||
If a pattern is compiled with the PCRE2_EXTENDED option, most white
|
If a pattern is compiled with the PCRE2_EXTENDED option, most white
|
||||||
space in the pattern (other than in a character class), and characters
|
space in the pattern (other than in a character class), and characters
|
||||||
between a # outside a character class and the next newline, inclusive,
|
between a # outside a character class and the next newline, inclusive,
|
||||||
are ignored. An escaping backslash can be used to include a white space
|
are ignored. An escaping backslash can be used to include a white space
|
||||||
or # character as part of the pattern.
|
or # character as part of the pattern.
|
||||||
|
|
||||||
If you want to remove the special meaning from a sequence of charac-
|
If you want to remove the special meaning from a sequence of charac-
|
||||||
ters, you can do so by putting them between \Q and \E. This is differ-
|
ters, you can do so by putting them between \Q and \E. This is differ-
|
||||||
ent from Perl in that $ and @ are handled as literals in \Q...\E
|
ent from Perl in that $ and @ are handled as literals in \Q...\E
|
||||||
sequences in PCRE2, whereas in Perl, $ and @ cause variable interpola-
|
sequences in PCRE2, whereas in Perl, $ and @ cause variable interpola-
|
||||||
tion. Note the following examples:
|
tion. Note the following examples:
|
||||||
|
|
||||||
Pattern PCRE2 matches Perl matches
|
Pattern PCRE2 matches Perl matches
|
||||||
|
@ -6129,36 +6130,42 @@ BACKSLASH
|
||||||
\Qabc\$xyz\E abc\$xyz abc\$xyz
|
\Qabc\$xyz\E abc\$xyz abc\$xyz
|
||||||
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz
|
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz
|
||||||
|
|
||||||
The \Q...\E sequence is recognized both inside and outside character
|
The \Q...\E sequence is recognized both inside and outside character
|
||||||
classes. An isolated \E that is not preceded by \Q is ignored. If \Q
|
classes. An isolated \E that is not preceded by \Q is ignored. If \Q
|
||||||
is not followed by \E later in the pattern, the literal interpretation
|
is not followed by \E later in the pattern, the literal interpretation
|
||||||
continues to the end of the pattern (that is, \E is assumed at the
|
continues to the end of the pattern (that is, \E is assumed at the
|
||||||
end). If the isolated \Q is inside a character class, this causes an
|
end). If the isolated \Q is inside a character class, this causes an
|
||||||
error, because the character class is not terminated by a closing
|
error, because the character class is not terminated by a closing
|
||||||
square bracket.
|
square bracket.
|
||||||
|
|
||||||
Non-printing characters
|
Non-printing characters
|
||||||
|
|
||||||
A second use of backslash provides a way of encoding non-printing char-
|
A second use of backslash provides a way of encoding non-printing char-
|
||||||
acters in patterns in a visible manner. There is no restriction on the
|
acters in patterns in a visible manner. There is no restriction on the
|
||||||
appearance of non-printing characters in a pattern, but when a pattern
|
appearance of non-printing characters in a pattern, but when a pattern
|
||||||
is being prepared by text editing, it is often easier to use one of the
|
is being prepared by text editing, it is often easier to use one of the
|
||||||
following escape sequences than the binary character it represents. In
|
following escape sequences than the binary character it represents. In
|
||||||
an ASCII or Unicode environment, these escapes are as follows:
|
an ASCII or Unicode environment, these escapes are as follows:
|
||||||
|
|
||||||
\a alarm, that is, the BEL character (hex 07)
|
\a alarm, that is, the BEL character (hex 07)
|
||||||
\cx "control-x", where x is any printable ASCII character
|
\cx "control-x", where x is any printable ASCII character
|
||||||
\e escape (hex 1B)
|
\e escape (hex 1B)
|
||||||
\f form feed (hex 0C)
|
\f form feed (hex 0C)
|
||||||
\n linefeed (hex 0A)
|
\n linefeed (hex 0A)
|
||||||
\r carriage return (hex 0D)
|
\r carriage return (hex 0D)
|
||||||
\t tab (hex 09)
|
\t tab (hex 09)
|
||||||
\0dd character with octal code 0dd
|
\0dd character with octal code 0dd
|
||||||
\ddd character with octal code ddd, or backreference
|
\ddd character with octal code ddd, or backreference
|
||||||
\o{ddd..} character with octal code ddd..
|
\o{ddd..} character with octal code ddd..
|
||||||
\xhh character with hex code hh
|
\xhh character with hex code hh
|
||||||
\x{hhh..} character with hex code hhh.. (default mode)
|
\x{hhh..} character with hex code hhh.. (default mode)
|
||||||
\uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
|
\N{U+hhh..} character with Unicode code point hhh..
|
||||||
|
\uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
|
||||||
|
|
||||||
|
Note that when \N is not followed by an opening brace (curly bracket)
|
||||||
|
it has an entirely different meaning, matching any character that is
|
||||||
|
not a newline. Perl also uses \N{name} to specify characters by Uni-
|
||||||
|
code name; PCRE2 does not support this.
|
||||||
|
|
||||||
The precise effect of \cx on ASCII characters is as follows: if x is a
|
The precise effect of \cx on ASCII characters is as follows: if x is a
|
||||||
lower case letter, it is converted to upper case. Then bit 6 of the
|
lower case letter, it is converted to upper case. Then bit 6 of the
|
||||||
|
@ -6167,15 +6174,15 @@ BACKSLASH
|
||||||
hex 7B (; is 3B). If the code unit following \c has a value less than
|
hex 7B (; is 3B). If the code unit following \c has a value less than
|
||||||
32 or greater than 126, a compile-time error occurs.
|
32 or greater than 126, a compile-time error occurs.
|
||||||
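The \cx rule for ASCII environments described above (fold lower case to upper case, invert bit 6, reject code units below 32 or above 126) can be sketched in a few lines of Python. This is an illustration of the documented arithmetic only, not PCRE2 source code; `control_escape` is a hypothetical name.

```python
def control_escape(x: str) -> int:
    # Return the character code that \cx denotes in an ASCII environment.
    code = ord(x)
    if code < 32 or code > 126:
        # The documentation says this case is a compile-time error.
        raise ValueError("character after \\c must be printable ASCII")
    if "a" <= x <= "z":
        # Lower case letters are converted to upper case first.
        code = ord(x.upper())
    # Invert bit 6 (hex 40).
    return code ^ 0x40
```

For example, `control_escape("A")` is hex 01 and `control_escape("Z")` is hex 1A, while `control_escape(";")` is hex 7B, which agrees with the values quoted in the text.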
|
|
||||||
When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t gen-
|
When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported.
|
||||||
erate the appropriate EBCDIC code values. The \c escape is processed as
|
\a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values.
|
||||||
specified for Perl in the perlebcdic document. The only characters that
|
The \c escape is processed as specified for Perl in the perlebcdic doc-
|
||||||
are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?.
|
ument. The only characters that are allowed after \c are A-Z, a-z, or
|
||||||
Any other character provokes a compile-time error. The sequence \c@
|
one of @, [, \, ], ^, _, or ?. Any other character provokes a compile-
|
||||||
encodes character code 0; after \c the letters (in either case) encode
|
time error. The sequence \c@ encodes character code 0; after \c the
|
||||||
characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters
|
letters (in either case) encode characters 1-26 (hex 01 to hex 1A); [,
|
||||||
27-31 (hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95
|
\, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c?
|
||||||
(hex 5F).
|
becomes either 255 (hex FF) or 95 (hex 5F).
|
||||||
|
|
||||||
Thus, apart from \c?, these escapes generate the same character code
|
Thus, apart from \c?, these escapes generate the same character code
|
||||||
values as they do in an ASCII environment, though the meanings of the
|
values as they do in an ASCII environment, though the meanings of the
|
||||||
|
@@ -6203,9 +6210,9 @@ BACKSLASH
|
||||||
numbers and backreferences to be unambiguously specified.
|
numbers and backreferences to be unambiguously specified.
|
||||||
|
|
||||||
For greater clarity and unambiguity, it is best to avoid following \ by
|
For greater clarity and unambiguity, it is best to avoid following \ by
|
||||||
a digit greater than zero. Instead, use \o{} or \x{} to specify charac-
|
a digit greater than zero. Instead, use \o{} or \x{} to specify numeri-
|
||||||
ter numbers, and \g{} to specify backreferences. The following para-
|
cal character code points, and \g{} to specify backreferences. The fol-
|
||||||
graphs describe the old, ambiguous syntax.
|
lowing paragraphs describe the old, ambiguous syntax.
|
||||||
|
|
||||||
The handling of a backslash followed by a digit other than 0 is compli-
|
The handling of a backslash followed by a digit other than 0 is compli-
|
||||||
cated, and Perl has changed over time, causing PCRE2 also to change.
|
cated, and Perl has changed over time, causing PCRE2 also to change.
|
||||||
|
@@ -6281,10 +6288,10 @@ BACKSLASH
|
||||||
inside and outside character classes. In addition, inside a character
|
inside and outside character classes. In addition, inside a character
|
||||||
class, \b is interpreted as the backspace character (hex 08).
|
class, \b is interpreted as the backspace character (hex 08).
|
||||||
|
|
||||||
\N is not allowed in a character class. \B, \R, and \X are not special
|
When not followed by an opening brace, \N is not allowed in a character
|
||||||
inside a character class. Like other unrecognized alphabetic escape
|
class. \B, \R, and \X are not special inside a character class. Like
|
||||||
sequences, they cause an error. Outside a character class, these
|
other unrecognized alphabetic escape sequences, they cause an error.
|
||||||
sequences have different meanings.
|
Outside a character class, these sequences have different meanings.
|
||||||
|
|
||||||
Unsupported escape sequences
|
Unsupported escape sequences
|
||||||
|
|
||||||
|
@@ -6318,6 +6325,7 @@ BACKSLASH
|
||||||
\D any character that is not a decimal digit
|
\D any character that is not a decimal digit
|
||||||
\h any horizontal white space character
|
\h any horizontal white space character
|
||||||
\H any character that is not a horizontal white space character
|
\H any character that is not a horizontal white space character
|
||||||
|
\N any character that is not a newline
|
||||||
\s any white space character
|
\s any white space character
|
||||||
\S any character that is not a white space character
|
\S any character that is not a white space character
|
||||||
\v any vertical white space character
|
\v any vertical white space character
|
||||||
|
@@ -6325,10 +6333,12 @@ BACKSLASH
|
||||||
\w any "word" character
|
\w any "word" character
|
||||||
\W any "non-word" character
|
\W any "non-word" character
|
||||||
|
|
||||||
There is also the single sequence \N, which matches a non-newline char-
|
The \N escape sequence has the same meaning as the "." metacharacter
|
||||||
acter. This is the same as the "." metacharacter when PCRE2_DOTALL is
|
when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change
|
||||||
not set. Perl also uses \N to match characters by name; PCRE2 does not
|
the meaning of \N. Note that when \N is followed by an opening brace it
|
||||||
support this.
|
has a different meaning. See the section entitled "Non-printing charac-
|
||||||
|
ters" above for details. Perl also uses \N{name} to specify characters
|
||||||
|
by Unicode name; PCRE2 does not support this.
|
||||||
|
|
||||||
Each pair of lower and upper case escape sequences partitions the com-
|
Each pair of lower and upper case escape sequences partitions the com-
|
||||||
plete set of characters into two disjoint sets. Any given character
|
plete set of characters into two disjoint sets. Any given character
|
||||||
|
@@ -6867,49 +6877,54 @@ FULL STOP (PERIOD, DOT) AND \N
|
||||||
flex and dollar, the only relationship being that they both involve
|
flex and dollar, the only relationship being that they both involve
|
||||||
newlines. Dot has no special meaning in a character class.
|
newlines. Dot has no special meaning in a character class.
|
||||||
|
|
||||||
The escape sequence \N behaves like a dot, except that it is not
|
The escape sequence \N when not followed by an opening brace behaves
|
||||||
affected by the PCRE2_DOTALL option. In other words, it matches any
|
like a dot, except that it is not affected by the PCRE2_DOTALL option.
|
||||||
character except one that signifies the end of a line. Perl also uses
|
In other words, it matches any character except one that signifies the
|
||||||
\N to match characters by name; PCRE2 does not support this.
|
end of a line.
|
||||||
|
|
||||||
|
When \N is followed by an opening brace it has a different meaning. See
|
||||||
|
the section entitled "Non-printing characters" above for details. Perl
|
||||||
|
also uses \N{name} to specify characters by Unicode name; PCRE2 does
|
||||||
|
not support this.
|
||||||
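The difference between dot and a bare \N can be illustrated with a stand-in. Python's `re` module has no bare \N escape, so the sketch below uses `[^\n]` in its place; that substitution assumes a newline convention of LF only and is an analogy, not PCRE2 behaviour verified against the library.

```python
import re

subject = "ab\ncd"

# Dot without DOTALL skips the newline.
dot_default = re.findall(r".", subject)

# Dot with DOTALL also matches the newline.
dot_dotall = re.findall(r".", subject, re.DOTALL)

# The stand-in for \N is unaffected by DOTALL: it never matches the newline.
not_newline = re.findall(r"[^\n]", subject, re.DOTALL)
```

Here `dot_default` and `not_newline` contain the same four letters, while `dot_dotall` additionally contains the newline.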
|
|
||||||
|
|
||||||
MATCHING A SINGLE CODE UNIT
|
MATCHING A SINGLE CODE UNIT
|
||||||
|
|
||||||
Outside a character class, the escape sequence \C matches any one code
|
Outside a character class, the escape sequence \C matches any one code
|
||||||
unit, whether or not a UTF mode is set. In the 8-bit library, one code
|
unit, whether or not a UTF mode is set. In the 8-bit library, one code
|
||||||
unit is one byte; in the 16-bit library it is a 16-bit unit; in the
|
unit is one byte; in the 16-bit library it is a 16-bit unit; in the
|
||||||
32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
|
32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
|
||||||
line-ending characters. The feature is provided in Perl in order to
|
line-ending characters. The feature is provided in Perl in order to
|
||||||
match individual bytes in UTF-8 mode, but it is unclear how it can use-
|
match individual bytes in UTF-8 mode, but it is unclear how it can use-
|
||||||
fully be used.
|
fully be used.
|
||||||
|
|
||||||
Because \C breaks up characters into individual code units, matching
|
Because \C breaks up characters into individual code units, matching
|
||||||
one unit with \C in UTF-8 or UTF-16 mode means that the rest of the
|
one unit with \C in UTF-8 or UTF-16 mode means that the rest of the
|
||||||
string may start with a malformed UTF character. This has undefined
|
string may start with a malformed UTF character. This has undefined
|
||||||
results, because PCRE2 assumes that it is matching character by charac-
|
results, because PCRE2 assumes that it is matching character by charac-
|
||||||
ter in a valid UTF string (by default it checks the subject string's
|
ter in a valid UTF string (by default it checks the subject string's
|
||||||
validity at the start of processing unless the PCRE2_NO_UTF_CHECK
|
validity at the start of processing unless the PCRE2_NO_UTF_CHECK
|
||||||
option is used).
|
option is used).
|
||||||
|
|
||||||
An application can lock out the use of \C by setting the
|
An application can lock out the use of \C by setting the
|
||||||
PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also
|
PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also
|
||||||
possible to build PCRE2 with the use of \C permanently disabled.
|
possible to build PCRE2 with the use of \C permanently disabled.
|
||||||
|
|
||||||
PCRE2 does not allow \C to appear in lookbehind assertions (described
|
PCRE2 does not allow \C to appear in lookbehind assertions (described
|
||||||
below) in UTF-8 or UTF-16 modes, because this would make it impossible
|
below) in UTF-8 or UTF-16 modes, because this would make it impossible
|
||||||
to calculate the length of the lookbehind. Neither the alternative
|
to calculate the length of the lookbehind. Neither the alternative
|
||||||
matching function pcre2_dfa_match() nor the JIT optimizer support \C in
|
matching function pcre2_dfa_match() nor the JIT optimizer support \C in
|
||||||
these UTF modes. The former gives a match-time error; the latter fails
|
these UTF modes. The former gives a match-time error; the latter fails
|
||||||
to optimize and so the match is always run using the interpreter.
|
to optimize and so the match is always run using the interpreter.
|
||||||
|
|
||||||
In the 32-bit library, however, \C is always supported (when not
|
In the 32-bit library, however, \C is always supported (when not
|
||||||
explicitly locked out) because it always matches a single code unit,
|
explicitly locked out) because it always matches a single code unit,
|
||||||
whether or not UTF-32 is specified.
|
whether or not UTF-32 is specified.
|
||||||
|
|
||||||
In general, the \C escape sequence is best avoided. However, one way of
|
In general, the \C escape sequence is best avoided. However, one way of
|
||||||
using it that avoids the problem of malformed UTF-8 or UTF-16 charac-
|
using it that avoids the problem of malformed UTF-8 or UTF-16 charac-
|
||||||
ters is to use a lookahead to check the length of the next character,
|
ters is to use a lookahead to check the length of the next character,
|
||||||
as in this pattern, which could be used with a UTF-8 string (ignore
|
as in this pattern, which could be used with a UTF-8 string (ignore
|
||||||
white space and line breaks):
|
white space and line breaks):
|
||||||
|
|
||||||
(?| (?=[\x00-\x7f])(\C) |
|
(?| (?=[\x00-\x7f])(\C) |
|
||||||
|
@@ -6917,10 +6932,10 @@ MATCHING A SINGLE CODE UNIT
|
||||||
(?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
|
(?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
|
||||||
(?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
|
(?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
|
||||||
|
|
||||||
In this example, a group that starts with (?| resets the capturing
|
In this example, a group that starts with (?| resets the capturing
|
||||||
parentheses numbers in each alternative (see "Duplicate Subpattern Num-
|
parentheses numbers in each alternative (see "Duplicate Subpattern Num-
|
||||||
bers" below). The assertions at the start of each branch check the next
|
bers" below). The assertions at the start of each branch check the next
|
||||||
UTF-8 character for values whose encoding uses 1, 2, 3, or 4 bytes,
|
UTF-8 character for values whose encoding uses 1, 2, 3, or 4 bytes,
|
||||||
respectively. The character's individual bytes are then captured by the
|
respectively. The character's individual bytes are then captured by the
|
||||||
appropriate number of \C groups.
|
appropriate number of \C groups.
|
||||||
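The lookahead ranges in the pattern above correspond to the number of bytes UTF-8 uses for a code point. A minimal sketch of that classification follows; the function name `utf8_length` is hypothetical, and the upper bound of hex 1FFFFF simply mirrors the range written in the pattern (Unicode itself stops at hex 10FFFF).

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes (and so how many \\C groups) a code point needs in UTF-8."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        return 1   # branch guarded by (?=[\x00-\x7f])
    if code_point <= 0x7FF:
        return 2   # branch guarded by (?=[\x80-\x{7ff}])
    if code_point <= 0xFFFF:
        return 3   # branch guarded by (?=[\x{800}-\x{ffff}])
    if code_point <= 0x1FFFFF:
        return 4   # branch guarded by (?=[\x{10000}-\x{1fffff}])
    raise ValueError("code point outside the ranges used by the pattern")
```

For any character Python can encode, the result should agree with `len(ch.encode("utf-8"))`.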
|
|
||||||
|
@@ -6929,50 +6944,53 @@ SQUARE BRACKETS AND CHARACTER CLASSES
|
||||||
|
|
||||||
An opening square bracket introduces a character class, terminated by a
|
An opening square bracket introduces a character class, terminated by a
|
||||||
closing square bracket. A closing square bracket on its own is not spe-
|
closing square bracket. A closing square bracket on its own is not spe-
|
||||||
cial by default. If a closing square bracket is required as a member
|
cial by default. If a closing square bracket is required as a member
|
||||||
of the class, it should be the first data character in the class (after
|
of the class, it should be the first data character in the class (after
|
||||||
an initial circumflex, if present) or escaped with a backslash. This
|
an initial circumflex, if present) or escaped with a backslash. This
|
||||||
means that, by default, an empty class cannot be defined. However, if
|
means that, by default, an empty class cannot be defined. However, if
|
||||||
the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at
|
the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at
|
||||||
the start does end the (empty) class.
|
the start does end the (empty) class.
|
||||||
|
|
||||||
A character class matches a single character in the subject. A matched
|
A character class matches a single character in the subject. A matched
|
||||||
character must be in the set of characters defined by the class, unless
|
character must be in the set of characters defined by the class, unless
|
||||||
the first character in the class definition is a circumflex, in which
|
the first character in the class definition is a circumflex, in which
|
||||||
case the subject character must not be in the set defined by the class.
|
case the subject character must not be in the set defined by the class.
|
||||||
If a circumflex is actually required as a member of the class, ensure
|
If a circumflex is actually required as a member of the class, ensure
|
||||||
it is not the first character, or escape it with a backslash.
|
it is not the first character, or escape it with a backslash.
|
||||||
|
|
||||||
For example, the character class [aeiou] matches any lower case vowel,
|
For example, the character class [aeiou] matches any lower case vowel,
|
||||||
while [^aeiou] matches any character that is not a lower case vowel.
|
while [^aeiou] matches any character that is not a lower case vowel.
|
||||||
Note that a circumflex is just a convenient notation for specifying the
|
Note that a circumflex is just a convenient notation for specifying the
|
||||||
characters that are in the class by enumerating those that are not. A
|
characters that are in the class by enumerating those that are not. A
|
||||||
class that starts with a circumflex is not an assertion; it still con-
|
class that starts with a circumflex is not an assertion; it still con-
|
||||||
sumes a character from the subject string, and therefore it fails if
|
sumes a character from the subject string, and therefore it fails if
|
||||||
the current pointer is at the end of the string.
|
the current pointer is at the end of the string.
|
||||||
|
|
||||||
When caseless matching is set, any letters in a class represent both
|
Characters in a class may be specified by their code points using \o,
|
||||||
their upper case and lower case versions, so for example, a caseless
|
\x, or \N{U+hh..} in the usual way. When caseless matching is set, any
|
||||||
[aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
|
letters in a class represent both their upper case and lower case ver-
|
||||||
match "A", whereas a caseful version would.
|
sions, so for example, a caseless [aeiou] matches "A" as well as "a",
|
||||||
|
and a caseless [^aeiou] does not match "A", whereas a caseful version
|
||||||
|
would.
|
||||||
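The caseless class behaviour described here can be tried out with Python's `re` module, used purely as a stand-in for PCRE2 (its `re.IGNORECASE` flag plays the role of caseless matching in this sketch).

```python
import re

# Caseless [aeiou] matches "A" as well as "a".
caseless_vowel = re.fullmatch(r"[aeiou]", "A", re.IGNORECASE)

# Caseless [^aeiou] does not match "A" ...
caseless_non_vowel = re.fullmatch(r"[^aeiou]", "A", re.IGNORECASE)

# ... whereas the caseful version does.
caseful_non_vowel = re.fullmatch(r"[^aeiou]", "A")
```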
|
|
||||||
Characters that might indicate line breaks are never treated in any
|
Characters that might indicate line breaks are never treated in any
|
||||||
special way when matching character classes, whatever line-ending
|
special way when matching character classes, whatever line-ending
|
||||||
sequence is in use, and whatever setting of the PCRE2_DOTALL and
|
sequence is in use, and whatever setting of the PCRE2_DOTALL and
|
||||||
PCRE2_MULTILINE options is used. A class such as [^a] always matches
|
PCRE2_MULTILINE options is used. A class such as [^a] always matches
|
||||||
one of these characters.
|
one of these characters.
|
||||||
|
|
||||||
The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
|
The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
|
||||||
\w, and \W may appear in a character class, and add the characters that
|
\S, \v, \V, \w, and \W may appear in a character class, and add the
|
||||||
they match to the class. For example, [\dABCDEF] matches any hexadeci-
|
characters that they match to the class. For example, [\dABCDEF]
|
||||||
mal digit. In UTF modes, the PCRE2_UCP option affects the meanings of
|
matches any hexadecimal digit. In UTF modes, the PCRE2_UCP option
|
||||||
\d, \s, \w and their upper case partners, just as it does when they
|
affects the meanings of \d, \s, \w and their upper case partners, just
|
||||||
appear outside a character class, as described in the section entitled
|
as it does when they appear outside a character class, as described in
|
||||||
"Generic character types" above. The escape sequence \b has a different
|
the section entitled "Generic character types" above. The escape
|
||||||
meaning inside a character class; it matches the backspace character.
|
sequence \b has a different meaning inside a character class; it
|
||||||
The sequences \B, \N, \R, and \X are not special inside a character
|
matches the backspace character. The sequences \B, \R, and \X are not
|
||||||
class. Like any other unrecognized escape sequences, they cause an
|
special inside a character class. Like any other unrecognized escape
|
||||||
error.
|
sequences, they cause an error. The same is true for \N when not fol-
|
||||||
|
lowed by an opening brace.
|
||||||
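Two of the points in this paragraph, a generic type escape adding its characters to a class and \b meaning backspace inside a class, can be sketched with Python's `re` module as a stand-in for PCRE2. The `re.ASCII` flag is used so that \d matches only ASCII digits, which is the PCRE2 default described elsewhere in this document; the pattern names are hypothetical.

```python
import re

# [\dABCDEF] accepts decimal digits plus the upper case letters A-F.
HEX_DIGITS = re.compile(r"[\dABCDEF]+", re.ASCII)
accepts_hex = HEX_DIGITS.fullmatch("09AF") is not None
rejects_non_hex = HEX_DIGITS.fullmatch("0G") is None

# Inside a class, \b is the backspace character (hex 08), not a word boundary.
BACKSPACE = re.compile(r"[\b]")
matches_backspace = BACKSPACE.fullmatch("\x08") is not None
```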
|
|
||||||
The minus (hyphen) character can be used to specify a range of charac-
|
The minus (hyphen) character can be used to specify a range of charac-
|
||||||
ters in a character class. For example, [d-m] matches any letter
|
ters in a character class. For example, [d-m] matches any letter
|
||||||
|
@@ -9012,7 +9030,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 20 July 2018
|
Last updated: 27 July 2018
|
||||||
Copyright (c) 1997-2018 University of Cambridge.
|
Copyright (c) 1997-2018 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
@@ -9873,19 +9891,23 @@ ESCAPED CHARACTERS
|
||||||
\ddd character with octal code ddd, or backreference
|
\ddd character with octal code ddd, or backreference
|
||||||
\o{ddd..} character with octal code ddd..
|
\o{ddd..} character with octal code ddd..
|
||||||
\U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
|
\U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
|
||||||
|
\N{U+hh..} character with Unicode code point hh..
|
||||||
\uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
|
\uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
|
||||||
\xhh character with hex code hh
|
\xhh character with hex code hh
|
||||||
\x{hhh..} character with hex code hhh..
|
\x{hh..} character with hex code hh..
|
||||||
|
|
||||||
Note that \0dd is always an octal code. The treatment of backslash fol-
|
Note that \0dd is always an octal code. The treatment of backslash fol-
|
||||||
lowed by a non-zero digit is complicated; for details see the section
|
lowed by a non-zero digit is complicated; for details see the section
|
||||||
"Non-printing characters" in the pcre2pattern documentation, where
|
"Non-printing characters" in the pcre2pattern documentation, where
|
||||||
details of escape processing in EBCDIC environments are also given.
|
details of escape processing in EBCDIC environments are also given.
|
||||||
|
\N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not supported in
|
||||||
|
EBCDIC environments. Note that \N not followed by an opening curly
|
||||||
|
bracket has a different meaning (see below).
|
||||||
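Since the text states that \N{U+hh..} is synonymous with \x{hh..}, the relationship can be sketched as a small translation helper. This is not how PCRE2 parses the escape; the helper names are hypothetical, and the sketch assumes the braces contain only "U+" followed by hexadecimal digits.

```python
import re

# Matches a complete \N{U+hh..} escape and captures the hexadecimal digits.
_CODE_POINT_ESCAPE = re.compile(r"\\N\{U\+([0-9A-Fa-f]+)\}")


def code_point_of(escape: str) -> int:
    """Return the code point named by a \\N{U+hh..} escape."""
    match = _CODE_POINT_ESCAPE.fullmatch(escape)
    if match is None:
        raise ValueError("not a \\N{U+hh..} escape")
    return int(match.group(1), 16)


def as_hex_escape(escape: str) -> str:
    """Rewrite \\N{U+hh..} as the equivalent \\x{hh..} escape."""
    return "\\x{%x}" % code_point_of(escape)
```

A helper like this might be useful when a pattern written with \N{U+hh..} has to run in an environment that lacks the escape, such as the EBCDIC builds mentioned above.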
|
|
||||||
When \x is not followed by {, from zero to two hexadecimal digits are
|
When \x is not followed by {, from zero to two hexadecimal digits are
|
||||||
read, but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadec-
|
read, but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadec-
|
||||||
imal digits to be recognized as a hexadecimal escape; otherwise it
|
imal digits to be recognized as a hexadecimal escape; otherwise it
|
||||||
matches a literal "x". Likewise, if \u (in ALT_BSUX mode) is not fol-
|
matches a literal "x". Likewise, if \u (in ALT_BSUX mode) is not fol-
|
||||||
lowed by four hexadecimal digits, it matches a literal "u".
|
lowed by four hexadecimal digits, it matches a literal "u".
|
||||||
|
|
||||||
|
|
||||||
|
@@ -9910,14 +9932,14 @@ CHARACTER TYPES
|
||||||
\W a "non-word" character
|
\W a "non-word" character
|
||||||
\X a Unicode extended grapheme cluster
|
\X a Unicode extended grapheme cluster
|
||||||
|
|
||||||
\C is dangerous because it may leave the current matching point in the
|
\C is dangerous because it may leave the current matching point in the
|
||||||
middle of a UTF-8 or UTF-16 character. The application can lock out the
|
middle of a UTF-8 or UTF-16 character. The application can lock out the
|
||||||
use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is also
|
use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is also
|
||||||
possible to build PCRE2 with the use of \C permanently disabled.
|
possible to build PCRE2 with the use of \C permanently disabled.
|
||||||
|
|
||||||
By default, \d, \s, and \w match only ASCII characters, even in UTF-8
|
By default, \d, \s, and \w match only ASCII characters, even in UTF-8
|
||||||
mode or in the 16-bit and 32-bit libraries. However, if locale-specific
|
mode or in the 16-bit and 32-bit libraries. However, if locale-specific
|
||||||
matching is happening, \s and \w may also match characters with code
|
matching is happening, \s and \w may also match characters with code
|
||||||
points in the range 128-255. If the PCRE2_UCP option is set, the behav-
|
points in the range 128-255. If the PCRE2_UCP option is set, the behav-
|
||||||
iour of these escape sequences is changed to use Unicode properties and
|
iour of these escape sequences is changed to use Unicode properties and
|
||||||
they match many more characters.
|
they match many more characters.
|
||||||
|
@@ -9986,28 +10008,28 @@ PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P
|
||||||
|
|
||||||
SCRIPT NAMES FOR \p AND \P
|
SCRIPT NAMES FOR \p AND \P
|
||||||
|
|
||||||
Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Bali-
|
Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Bali-
|
||||||
nese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi,
|
nese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi,
|
||||||
Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Alba-
|
Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Alba-
|
||||||
nian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot,
|
nian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot,
|
||||||
Cyrillic, Deseret, Devanagari, Dogra, Duployan, Egyptian_Hieroglyphs,
|
Cyrillic, Deseret, Devanagari, Dogra, Duployan, Egyptian_Hieroglyphs,
|
||||||
Elbasan, Ethiopic, Georgian, Glagolitic, Gothic, Grantha, Greek,
|
Elbasan, Ethiopic, Georgian, Glagolitic, Gothic, Grantha, Greek,
|
||||||
Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul, Hanifi_Rohingya,
|
Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul, Hanifi_Rohingya,
|
||||||
Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
|
Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
|
||||||
Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kan-
|
Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kan-
|
||||||
nada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
|
nada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
|
||||||
Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Maha-
|
Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Maha-
|
||||||
jani, Makasar, Malayalam, Mandaic, Manichaean, Marchen, Masaram_Gondi,
|
jani, Makasar, Malayalam, Mandaic, Manichaean, Marchen, Masaram_Gondi,
|
||||||
Medefaidrin, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive,
|
Medefaidrin, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive,
|
||||||
Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Multani, Myanmar,
|
Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Multani, Myanmar,
|
||||||
Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki, Old_Hungar-
|
Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki, Old_Hungar-
|
||||||
ian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_Sog-
|
ian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_Sog-
|
||||||
dian, Old_South_Arabian, Old_Turkic, Oriya, Osage, Osmanya,
|
dian, Old_South_Arabian, Old_Turkic, Oriya, Osage, Osmanya,
|
||||||
Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician,
|
Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician,
|
||||||
Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha-
|
Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha-
|
||||||
vian, Siddham, SignWriting, Sinhala, Sogdian, Sora_Sompeng, Soyombo,
|
vian, Siddham, SignWriting, Sinhala, Sogdian, Sora_Sompeng, Soyombo,
|
||||||
Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
|
Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
|
||||||
Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan, Tifi-
|
Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan, Tifi-
|
||||||
nagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi, Zanabazar_Square.
|
nagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi, Zanabazar_Square.
|
||||||
|
|
||||||
|
|
||||||
|
@@ -10034,8 +10056,8 @@ CHARACTER CLASSES
|
||||||
word same as \w
|
word same as \w
|
||||||
xdigit hexadecimal digit
|
xdigit hexadecimal digit
|
||||||
|
|
||||||
In PCRE2, POSIX character set names recognize only ASCII characters by
|
In PCRE2, POSIX character set names recognize only ASCII characters by
|
||||||
default, but some of them use Unicode properties if PCRE2_UCP is set.
|
default, but some of them use Unicode properties if PCRE2_UCP is set.
|
||||||
You can use \Q...\E inside a character class.
|
You can use \Q...\E inside a character class.
|
||||||
|
|
||||||
|
|
||||||
|
@@ -10121,8 +10143,8 @@ OPTION SETTING
|
||||||
(?xx) as (?x) but also ignore space and tab in classes
|
(?xx) as (?x) but also ignore space and tab in classes
|
||||||
(?-...) unset option(s)
|
(?-...) unset option(s)
|
||||||
|
|
||||||
The following are recognized only at the very start of a pattern or
|
The following are recognized only at the very start of a pattern or
|
||||||
after one of the newline or \R options with similar syntax. More than
|
after one of the newline or \R options with similar syntax. More than
|
||||||
one of them may appear. For the first three, d is a decimal number.
|
one of them may appear. For the first three, d is a decimal number.
|
||||||
|
|
||||||
(*LIMIT_DEPTH=d) set the backtracking limit to d
|
(*LIMIT_DEPTH=d) set the backtracking limit to d
|
||||||
|
@@ -10137,17 +10159,17 @@ OPTION SETTING
|
||||||
(*UTF) set appropriate UTF mode for the library in use
|
(*UTF) set appropriate UTF mode for the library in use
|
||||||
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
|
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
|
||||||
|
|
||||||
Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the
|
Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the
|
||||||
value of the limits set by the caller of pcre2_match() or
|
value of the limits set by the caller of pcre2_match() or
|
||||||
pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
|
pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
|
||||||
synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
|
synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
|
||||||
and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
|
and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
|
||||||
respectively, at compile time.
|
respectively, at compile time.
|
||||||
|
|
||||||
|
|
||||||
NEWLINE CONVENTION
|
NEWLINE CONVENTION
|
||||||
|
|
||||||
These are recognized only at the very start of the pattern or after
|
These are recognized only at the very start of the pattern or after
|
||||||
option settings with a similar syntax.
|
option settings with a similar syntax.
|
||||||
|
|
||||||
(*CR) carriage return only
|
(*CR) carriage return only
|
||||||
|
@@ -10160,7 +10182,7 @@ NEWLINE CONVENTION
|
||||||
|
|
||||||
WHAT \R MATCHES
|
WHAT \R MATCHES
|
||||||
|
|
||||||
These are recognized only at the very start of the pattern or after
|
These are recognized only at the very start of the pattern or after
|
||||||
option setting with a similar syntax.
|
option setting with a similar syntax.
|
||||||
|
|
||||||
(*BSR_ANYCRLF) CR, LF, or CRLF
|
(*BSR_ANYCRLF) CR, LF, or CRLF
|
||||||
|
@@ -10229,16 +10251,16 @@ CONDITIONAL PATTERNS
|
||||||
(?(VERSION[>]=n.m) test PCRE2 version
|
(?(VERSION[>]=n.m) test PCRE2 version
|
||||||
(?(assert) assertion condition
|
(?(assert) assertion condition
|
||||||
|
|
||||||
Note the ambiguity of (?(R) and (?(Rn) which might be named reference
|
Note the ambiguity of (?(R) and (?(Rn) which might be named reference
|
||||||
conditions or recursion tests. Such a condition is interpreted as a
|
conditions or recursion tests. Such a condition is interpreted as a
|
||||||
reference condition if the relevant named group exists.
|
reference condition if the relevant named group exists.
|
||||||
|
|
||||||
|
|
||||||
BACKTRACKING CONTROL
|
BACKTRACKING CONTROL
|
||||||
|
|
||||||
All backtracking control verbs may be in the form (*VERB:NAME). For
|
All backtracking control verbs may be in the form (*VERB:NAME). For
|
||||||
(*MARK) the name is mandatory, for the others it is optional. (*SKIP)
|
(*MARK) the name is mandatory, for the others it is optional. (*SKIP)
|
||||||
changes its behaviour if :NAME is present. The others just set a name
|
changes its behaviour if :NAME is present. The others just set a name
|
||||||
for passing back to the caller, but this is not a name that (*SKIP) can
|
for passing back to the caller, but this is not a name that (*SKIP) can
|
||||||
see. The following act immediately they are reached:
|
see. The following act immediately they are reached:
|
||||||
|
|
||||||
|
@@ -10246,7 +10268,7 @@ BACKTRACKING CONTROL
|
||||||
(*FAIL) force backtrack; synonym (*F)
|
(*FAIL) force backtrack; synonym (*F)
|
||||||
(*MARK:NAME) set name to be passed back; synonym (*:NAME)
|
(*MARK:NAME) set name to be passed back; synonym (*:NAME)
|
||||||
|
|
||||||
The following act only when a subsequent match failure causes a back-
|
The following act only when a subsequent match failure causes a back-
|
||||||
track to reach them. They all force a match failure, but they differ in
|
track to reach them. They all force a match failure, but they differ in
|
||||||
what happens afterwards. Those that advance the start-of-match point do
|
what happens afterwards. Those that advance the start-of-match point do
|
||||||
so only if the pattern is not anchored.
|
so only if the pattern is not anchored.
|
||||||
|
@@ -10258,7 +10280,7 @@ BACKTRACKING CONTROL
|
||||||
(*MARK:NAME); if not found, the (*SKIP) is ignored
|
(*MARK:NAME); if not found, the (*SKIP) is ignored
|
||||||
(*THEN) local failure, backtrack to next alternation
|
(*THEN) local failure, backtrack to next alternation
|
||||||
|
|
||||||
The effect of one of these verbs in a group called as a subroutine is
|
The effect of one of these verbs in a group called as a subroutine is
|
||||||
confined to the subroutine call.
|
confined to the subroutine call.
|
||||||
|
|
||||||
|
|
||||||
|
@@ -10269,14 +10291,14 @@ CALLOUTS
|
||||||
(?C"text") callout with string data
|
(?C"text") callout with string data
|
||||||
|
|
||||||
The allowed string delimiters are ` ' " ^ % # $ (which are the same for
|
The allowed string delimiters are ` ' " ^ % # $ (which are the same for
|
||||||
the start and the end), and the starting delimiter { matched with the
|
the start and the end), and the starting delimiter { matched with the
|
||||||
ending delimiter }. To encode the ending delimiter within the string,
|
ending delimiter }. To encode the ending delimiter within the string,
|
||||||
double it.
|
double it.
|
||||||
|
|
||||||
|
|
||||||
SEE ALSO
|
SEE ALSO
|
||||||
|
|
||||||
pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
|
pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
|
||||||
pcre2(3).
|
pcre2(3).
|
||||||
|
|
||||||
|
|
||||||
|
@ -10289,7 +10311,7 @@ AUTHOR
|
||||||
|
|
||||||
REVISION
|
REVISION
|
||||||
|
|
||||||
Last updated: 21 July 2018
|
Last updated: 27 July 2018
|
||||||
Copyright (c) 1997-2018 University of Cambridge.
|
Copyright (c) 1997-2018 University of Cambridge.
|
||||||
------------------------------------------------------------------------------
|
------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2API 3 "02 July 2018" "PCRE2 10.32"
|
.TH PCRE2API 3 "27 July 2018" "PCRE2 10.32"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.sp
|
.sp
|
||||||
|
@ -1400,7 +1400,8 @@ character, even if newlines are coded as CRLF. Without this option, a dot does
|
||||||
not match when the current position in the subject is at a newline. This option
|
not match when the current position in the subject is at a newline. This option
|
||||||
is equivalent to Perl's /s option, and it can be changed within a pattern by a
|
is equivalent to Perl's /s option, and it can be changed within a pattern by a
|
||||||
(?s) option setting. A negative class such as [^a] always matches newline
|
(?s) option setting. A negative class such as [^a] always matches newline
|
||||||
characters, independent of the setting of this option.
|
characters, and the \eN escape sequence always matches a non-newline character,
|
||||||
|
independent of the setting of PCRE2_DOTALL.
|
||||||
.sp
|
.sp
|
||||||
PCRE2_DUPNAMES
|
PCRE2_DUPNAMES
|
||||||
.sp
|
.sp
|
||||||
|
@ -3640,6 +3641,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 02 July 2018
|
Last updated: 27 July 2018
|
||||||
Copyright (c) 1997-2018 University of Cambridge.
|
Copyright (c) 1997-2018 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2PATTERN 3 "20 July 2018" "PCRE2 10.32"
|
.TH PCRE2PATTERN 3 "27 July 2018" "PCRE2 10.32"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||||
|
@ -218,10 +218,11 @@ is used.
|
||||||
.P
|
.P
|
||||||
The newline convention affects where the circumflex and dollar assertions are
|
The newline convention affects where the circumflex and dollar assertions are
|
||||||
true. It also affects the interpretation of the dot metacharacter when
|
true. It also affects the interpretation of the dot metacharacter when
|
||||||
PCRE2_DOTALL is not set, and the behaviour of \eN. However, it does not affect
|
PCRE2_DOTALL is not set, and the behaviour of \eN when not followed by an
|
||||||
what the \eR escape sequence matches. By default, this is any Unicode newline
|
opening brace. However, it does not affect what the \eR escape sequence
|
||||||
sequence, for Perl compatibility. However, this can be changed; see the next
|
matches. By default, this is any Unicode newline sequence, for Perl
|
||||||
section and the description of \eR in the section entitled
|
compatibility. However, this can be changed; see the next section and the
|
||||||
|
description of \eR in the section entitled
|
||||||
.\" HTML <a href="#newlineseq">
|
.\" HTML <a href="#newlineseq">
|
||||||
.\" </a>
|
.\" </a>
|
||||||
"Newline sequences"
|
"Newline sequences"
|
||||||
|
@ -359,20 +360,26 @@ text editing, it is often easier to use one of the following escape sequences
|
||||||
than the binary character it represents. In an ASCII or Unicode environment,
|
than the binary character it represents. In an ASCII or Unicode environment,
|
||||||
these escapes are as follows:
|
these escapes are as follows:
|
||||||
.sp
|
.sp
|
||||||
\ea alarm, that is, the BEL character (hex 07)
|
\ea alarm, that is, the BEL character (hex 07)
|
||||||
\ecx "control-x", where x is any printable ASCII character
|
\ecx "control-x", where x is any printable ASCII character
|
||||||
\ee escape (hex 1B)
|
\ee escape (hex 1B)
|
||||||
\ef form feed (hex 0C)
|
\ef form feed (hex 0C)
|
||||||
\en linefeed (hex 0A)
|
\en linefeed (hex 0A)
|
||||||
\er carriage return (hex 0D)
|
\er carriage return (hex 0D)
|
||||||
\et tab (hex 09)
|
\et tab (hex 09)
|
||||||
\e0dd character with octal code 0dd
|
\e0dd character with octal code 0dd
|
||||||
\eddd character with octal code ddd, or backreference
|
\eddd character with octal code ddd, or backreference
|
||||||
\eo{ddd..} character with octal code ddd..
|
\eo{ddd..} character with octal code ddd..
|
||||||
\exhh character with hex code hh
|
\exhh character with hex code hh
|
||||||
\ex{hhh..} character with hex code hhh.. (default mode)
|
\ex{hhh..} character with hex code hhh.. (default mode)
|
||||||
\euhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
|
\eN{U+hhh..} character with Unicode code point hhh..
|
||||||
|
\euhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
|
||||||
.sp
|
.sp
|
||||||
|
Note that when \eN is not followed by an opening brace (curly bracket) it has
|
||||||
|
an entirely different meaning, matching any character that is not a newline.
|
||||||
|
Perl also uses \eN{name} to specify characters by Unicode name; PCRE2 does not
|
||||||
|
support this.
|
||||||
|
.P
|
||||||
The precise effect of \ecx on ASCII characters is as follows: if x is a lower
|
The precise effect of \ecx on ASCII characters is as follows: if x is a lower
|
||||||
case letter, it is converted to upper case. Then bit 6 of the character (hex
|
case letter, it is converted to upper case. Then bit 6 of the character (hex
|
||||||
40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
|
40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
|
||||||
|
@ -380,14 +387,14 @@ but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the
|
||||||
code unit following \ec has a value less than 32 or greater than 126, a
|
code unit following \ec has a value less than 32 or greater than 126, a
|
||||||
compile-time error occurs.
|
compile-time error occurs.
|
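The \cx rule for ASCII described above (fold a lower case letter to upper case, then invert bit 6) can be sketched in a few lines of C. This is only an illustration of the documented behaviour, not the code PCRE2 uses; the function name control_escape and the -1 error return are assumptions made for the example.

```c
/* Hypothetical helper: returns the character code that \cx produces in an
   ASCII environment, or -1 where the documentation says a compile-time
   error occurs (code unit below 32 or above 126). */
int control_escape(int x)
{
    if (x < 32 || x > 126)
        return -1;                 /* not a printable ASCII character */
    if (x >= 'a' && x <= 'z')
        x -= 32;                   /* convert lower case to upper case */
    return x ^ 0x40;               /* invert bit 6 (hex 40) */
}
```

With this sketch, control_escape('A') gives hex 01, control_escape('z') gives hex 1A, control_escape('{') gives hex 3B and control_escape(';') gives hex 7B, matching the values quoted in the text.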
||||||
.P
|
.P
|
||||||
When PCRE2 is compiled in EBCDIC mode, \ea, \ee, \ef, \en, \er, and \et
|
When PCRE2 is compiled in EBCDIC mode, \eN{U+hhh..} is not supported. \ea, \ee,
|
||||||
generate the appropriate EBCDIC code values. The \ec escape is processed
|
\ef, \en, \er, and \et generate the appropriate EBCDIC code values. The \ec
|
||||||
as specified for Perl in the \fBperlebcdic\fP document. The only characters
|
escape is processed as specified for Perl in the \fBperlebcdic\fP document. The
|
||||||
that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ], ^, _, or ?. Any
|
only characters that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ],
|
||||||
other character provokes a compile-time error. The sequence \ec@ encodes
|
^, _, or ?. Any other character provokes a compile-time error. The sequence
|
||||||
character code 0; after \ec the letters (in either case) encode characters 1-26
|
\ec@ encodes character code 0; after \ec the letters (in either case) encode
|
||||||
(hex 01 to hex 1A); [, \e, ], ^, and _ encode characters 27-31 (hex 1B to hex
|
characters 1-26 (hex 01 to hex 1A); [, \e, ], ^, and _ encode characters 27-31
|
||||||
1F), and \ec? becomes either 255 (hex FF) or 95 (hex 5F).
|
(hex 1B to hex 1F), and \ec? becomes either 255 (hex FF) or 95 (hex 5F).
|
||||||
.P
|
.P
|
||||||
Thus, apart from \ec?, these escapes generate the same character code values as
|
Thus, apart from \ec?, these escapes generate the same character code values as
|
||||||
they do in an ASCII environment, though the meanings of the values mostly
|
they do in an ASCII environment, though the meanings of the values mostly
|
||||||
|
@ -414,9 +421,9 @@ numbers greater than 0777, and it also allows octal numbers and backreferences
|
||||||
to be unambiguously specified.
|
to be unambiguously specified.
|
||||||
.P
|
.P
|
||||||
For greater clarity and unambiguity, it is best to avoid following \e by a
|
For greater clarity and unambiguity, it is best to avoid following \e by a
|
||||||
digit greater than zero. Instead, use \eo{} or \ex{} to specify character
|
digit greater than zero. Instead, use \eo{} or \ex{} to specify numerical
|
||||||
numbers, and \eg{} to specify backreferences. The following paragraphs
|
character code points, and \eg{} to specify backreferences. The following
|
||||||
describe the old, ambiguous syntax.
|
paragraphs describe the old, ambiguous syntax.
|
||||||
.P
|
.P
|
||||||
The handling of a backslash followed by a digit other than 0 is complicated,
|
The handling of a backslash followed by a digit other than 0 is complicated,
|
||||||
and Perl has changed over time, causing PCRE2 also to change.
|
and Perl has changed over time, causing PCRE2 also to change.
|
||||||
|
@ -507,10 +514,10 @@ All the sequences that define a single character value can be used both inside
|
||||||
and outside character classes. In addition, inside a character class, \eb is
|
and outside character classes. In addition, inside a character class, \eb is
|
||||||
interpreted as the backspace character (hex 08).
|
interpreted as the backspace character (hex 08).
|
||||||
.P
|
.P
|
||||||
\eN is not allowed in a character class. \eB, \eR, and \eX are not special
|
When not followed by an opening brace, \eN is not allowed in a character class.
|
||||||
inside a character class. Like other unrecognized alphabetic escape sequences,
|
\eB, \eR, and \eX are not special inside a character class. Like other
|
||||||
they cause an error. Outside a character class, these sequences have different
|
unrecognized alphabetic escape sequences, they cause an error. Outside a
|
||||||
meanings.
|
character class, these sequences have different meanings.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
.SS "Unsupported escape sequences"
|
.SS "Unsupported escape sequences"
|
||||||
|
@ -569,6 +576,7 @@ Another use of backslash is for specifying generic character types:
|
||||||
\eD any character that is not a decimal digit
|
\eD any character that is not a decimal digit
|
||||||
\eh any horizontal white space character
|
\eh any horizontal white space character
|
||||||
\eH any character that is not a horizontal white space character
|
\eH any character that is not a horizontal white space character
|
||||||
|
\eN any character that is not a newline
|
||||||
\es any white space character
|
\es any white space character
|
||||||
\eS any character that is not a white space character
|
\eS any character that is not a white space character
|
||||||
\ev any vertical white space character
|
\ev any vertical white space character
|
||||||
|
@ -576,14 +584,20 @@ Another use of backslash is for specifying generic character types:
|
||||||
\ew any "word" character
|
\ew any "word" character
|
||||||
\eW any "non-word" character
|
\eW any "non-word" character
|
||||||
.sp
|
.sp
|
||||||
There is also the single sequence \eN, which matches a non-newline character.
|
The \eN escape sequence has the same meaning as
|
||||||
This is the same as
|
|
||||||
.\" HTML <a href="#fullstopdot">
|
.\" HTML <a href="#fullstopdot">
|
||||||
.\" </a>
|
.\" </a>
|
||||||
the "." metacharacter
|
the "." metacharacter
|
||||||
.\"
|
.\"
|
||||||
when PCRE2_DOTALL is not set. Perl also uses \eN to match characters by name;
|
when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the
|
||||||
PCRE2 does not support this.
|
meaning of \eN. Note that when \eN is followed by an opening brace it has a
|
||||||
|
different meaning. See the section entitled
|
||||||
|
.\" HTML <a href="#digitsafterbackslash">
|
||||||
|
.\" </a>
|
||||||
|
"Non-printing characters"
|
||||||
|
.\"
|
||||||
|
above for details. Perl also uses \eN{name} to specify characters by Unicode
|
||||||
|
name; PCRE2 does not support this.
|
||||||
.P
|
.P
|
||||||
Each pair of lower and upper case escape sequences partitions the complete set
|
Each pair of lower and upper case escape sequences partitions the complete set
|
||||||
of characters into two disjoint sets. Any given character matches one, and only
|
of characters into two disjoint sets. Any given character matches one, and only
|
||||||
|
@ -1289,9 +1303,17 @@ The handling of dot is entirely independent of the handling of circumflex and
|
||||||
dollar, the only relationship being that they both involve newlines. Dot has no
|
dollar, the only relationship being that they both involve newlines. Dot has no
|
||||||
special meaning in a character class.
|
special meaning in a character class.
|
||||||
.P
|
.P
|
||||||
The escape sequence \eN behaves like a dot, except that it is not affected by
|
The escape sequence \eN when not followed by an opening brace behaves like a
|
||||||
the PCRE2_DOTALL option. In other words, it matches any character except one
|
dot, except that it is not affected by the PCRE2_DOTALL option. In other words,
|
||||||
that signifies the end of a line. Perl also uses \eN to match characters by
|
it matches any character except one that signifies the end of a line.
|
||||||
|
.P
|
||||||
|
When \eN is followed by an opening brace it has a different meaning. See the
|
||||||
|
section entitled
|
||||||
|
.\" HTML <a href="#digitsafterbackslash">
|
||||||
|
.\" </a>
|
||||||
|
"Non-printing characters"
|
||||||
|
.\"
|
||||||
|
above for details. Perl also uses \eN{name} to specify characters by Unicode
|
||||||
name; PCRE2 does not support this.
|
name; PCRE2 does not support this.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
|
@ -1380,30 +1402,32 @@ circumflex is not an assertion; it still consumes a character from the subject
|
||||||
string, and therefore it fails if the current pointer is at the end of the
|
string, and therefore it fails if the current pointer is at the end of the
|
||||||
string.
|
string.
|
||||||
.P
|
.P
|
||||||
When caseless matching is set, any letters in a class represent both their
|
Characters in a class may be specified by their code points using \eo, \ex, or
|
||||||
upper case and lower case versions, so for example, a caseless [aeiou] matches
|
\eN{U+hh..} in the usual way. When caseless matching is set, any letters in a
|
||||||
"A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a
|
class represent both their upper case and lower case versions, so for example,
|
||||||
caseful version would.
|
a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
|
||||||
|
match "A", whereas a caseful version would.
|
||||||
.P
|
.P
|
||||||
Characters that might indicate line breaks are never treated in any special way
|
Characters that might indicate line breaks are never treated in any special way
|
||||||
when matching character classes, whatever line-ending sequence is in use, and
|
when matching character classes, whatever line-ending sequence is in use, and
|
||||||
whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
|
whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
|
||||||
class such as [^a] always matches one of these characters.
|
class such as [^a] always matches one of these characters.
|
||||||
.P
|
.P
|
||||||
The character escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es, \eS, \ev,
|
The generic character type escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es,
|
||||||
\eV, \ew, and \eW may appear in a character class, and add the characters that
|
\eS, \ev, \eV, \ew, and \eW may appear in a character class, and add the
|
||||||
they match to the class. For example, [\edABCDEF] matches any hexadecimal
|
characters that they match to the class. For example, [\edABCDEF] matches any
|
||||||
digit. In UTF modes, the PCRE2_UCP option affects the meanings of \ed, \es, \ew
|
hexadecimal digit. In UTF modes, the PCRE2_UCP option affects the meanings of
|
||||||
and their upper case partners, just as it does when they appear outside a
|
\ed, \es, \ew and their upper case partners, just as it does when they appear
|
||||||
character class, as described in the section entitled
|
outside a character class, as described in the section entitled
|
||||||
.\" HTML <a href="#genericchartypes">
|
.\" HTML <a href="#genericchartypes">
|
||||||
.\" </a>
|
.\" </a>
|
||||||
"Generic character types"
|
"Generic character types"
|
||||||
.\"
|
.\"
|
||||||
above. The escape sequence \eb has a different meaning inside a character
|
above. The escape sequence \eb has a different meaning inside a character
|
||||||
class; it matches the backspace character. The sequences \eB, \eN, \eR, and \eX
|
class; it matches the backspace character. The sequences \eB, \eR, and \eX are
|
||||||
are not special inside a character class. Like any other unrecognized escape
|
not special inside a character class. Like any other unrecognized escape
|
||||||
sequences, they cause an error.
|
sequences, they cause an error. The same is true for \eN when not followed by
|
||||||
|
an opening brace.
|
||||||
.P
|
.P
|
||||||
The minus (hyphen) character can be used to specify a range of characters in a
|
The minus (hyphen) character can be used to specify a range of characters in a
|
||||||
character class. For example, [d-m] matches any letter between d and m,
|
character class. For example, [d-m] matches any letter between d and m,
|
||||||
|
@ -3580,6 +3604,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 20 July 2018
|
Last updated: 27 July 2018
|
||||||
Copyright (c) 1997-2018 University of Cambridge.
|
Copyright (c) 1997-2018 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2SYNTAX 3 "21 July 2018" "PCRE2 10.32"
|
.TH PCRE2SYNTAX 3 "27 July 2018" "PCRE2 10.32"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
||||||
|
@ -35,9 +35,10 @@ This table applies to ASCII and Unicode environments.
|
||||||
\eddd character with octal code ddd, or backreference
|
\eddd character with octal code ddd, or backreference
|
||||||
\eo{ddd..} character with octal code ddd..
|
\eo{ddd..} character with octal code ddd..
|
||||||
\eU "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
|
\eU "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
|
||||||
|
\eN{U+hh..} character with Unicode code point hh..
|
||||||
\euhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
|
\euhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
|
||||||
\exhh character with hex code hh
|
\exhh character with hex code hh
|
||||||
\ex{hhh..} character with hex code hhh..
|
\ex{hh..} character with hex code hh..
|
||||||
.sp
|
.sp
|
||||||
Note that \e0dd is always an octal code. The treatment of backslash followed by
|
Note that \e0dd is always an octal code. The treatment of backslash followed by
|
||||||
a non-zero digit is complicated; for details see the section
|
a non-zero digit is complicated; for details see the section
|
||||||
|
@ -50,7 +51,9 @@ in the
|
||||||
\fBpcre2pattern\fP
|
\fBpcre2pattern\fP
|
||||||
.\"
|
.\"
|
||||||
documentation, where details of escape processing in EBCDIC environments are
|
documentation, where details of escape processing in EBCDIC environments are
|
||||||
also given.
|
also given. \eN{U+hh..} is synonymous with \ex{hh..} in PCRE2 but is not
|
||||||
|
supported in EBCDIC environments. Note that \eN not followed by an opening
|
||||||
|
curly bracket has a different meaning (see below).
|
||||||
.P
|
.P
|
||||||
When \ex is not followed by {, from zero to two hexadecimal digits are read,
|
When \ex is not followed by {, from zero to two hexadecimal digits are read,
|
||||||
but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
|
but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
|
||||||
|
@ -609,6 +612,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 21 July 2018
|
Last updated: 27 July 2018
|
||||||
Copyright (c) 1997-2018 University of Cambridge.
|
Copyright (c) 1997-2018 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -316,6 +316,7 @@ pcre2_pattern_convert(). */
|
||||||
#define PCRE2_ERROR_INTERNAL_BAD_CODE_IN_SKIP 190
|
#define PCRE2_ERROR_INTERNAL_BAD_CODE_IN_SKIP 190
|
||||||
#define PCRE2_ERROR_NO_SURROGATES_IN_UTF16 191
|
#define PCRE2_ERROR_NO_SURROGATES_IN_UTF16 191
|
||||||
#define PCRE2_ERROR_BAD_LITERAL_OPTIONS 192
|
#define PCRE2_ERROR_BAD_LITERAL_OPTIONS 192
|
||||||
|
#define PCRE2_ERROR_NOT_SUPPORTED_IN_EBCDIC 193
|
||||||
|
|
||||||
|
|
||||||
/* "Expected" matching error codes: no match and partial match. */
|
/* "Expected" matching error codes: no match and partial match. */
|
||||||
|
|
|
@ -731,7 +731,7 @@ enum { ERR0 = COMPILE_ERROR_BASE,
|
||||||
ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69, ERR70,
|
ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69, ERR70,
|
||||||
ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79, ERR80,
|
ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79, ERR80,
|
||||||
ERR81, ERR82, ERR83, ERR84, ERR85, ERR86, ERR87, ERR88, ERR89, ERR90,
|
ERR81, ERR82, ERR83, ERR84, ERR85, ERR86, ERR87, ERR88, ERR89, ERR90,
|
||||||
ERR91, ERR92};
|
ERR91, ERR92, ERR93 };
|
||||||
|
|
||||||
/* This is a table of start-of-pattern options such as (*UTF) and settings such
|
/* This is a table of start-of-pattern options such as (*UTF) and settings such
|
||||||
as (*LIMIT_MATCH=nnnn) and (*CRLF). For completeness and backward
|
as (*LIMIT_MATCH=nnnn) and (*CRLF). For completeness and backward
|
||||||
|
@ -1441,6 +1441,42 @@ else if ((i = escapes[c - ESCAPES_FIRST]) != 0)
|
||||||
escape = -i; /* Else return a special escape */
|
escape = -i; /* Else return a special escape */
|
||||||
if (cb != NULL && (escape == ESC_P || escape == ESC_p || escape == ESC_X))
|
if (cb != NULL && (escape == ESC_P || escape == ESC_p || escape == ESC_X))
|
||||||
cb->external_flags |= PCRE2_HASBKPORX; /* Note \P, \p, or \X */
|
cb->external_flags |= PCRE2_HASBKPORX; /* Note \P, \p, or \X */
|
||||||
|
|
||||||
|
/* Perl supports \N{name} for character names and \N{U+dddd} for numerical
|
||||||
|
Unicode code points, as well as plain \N for "not newline". PCRE does not
|
||||||
|
support \N{name}. However, it does support quantification such as \N{2,3},
|
||||||
|
so if \N{ is not followed by U+dddd we check for a quantifier. */
|
||||||
|
|
||||||
|
if (escape == ESC_N && ptr < ptrend && *ptr == CHAR_LEFT_CURLY_BRACKET)
|
||||||
|
{
|
||||||
|
PCRE2_SPTR p = ptr + 1;
|
||||||
|
|
||||||
|
/* \N{U+ can be handled by the \x{ code. However, this construction is
|
||||||
|
not valid in EBCDIC environments because it specifies a Unicode
|
||||||
|
character, not a codepoint in the local code. For example \N{U+0041}
|
||||||
|
must be "A" in all environments. */
|
||||||
|
|
||||||
|
if (ptrend - p > 1 && *p == CHAR_U && p[1] == CHAR_PLUS)
|
||||||
|
{
|
||||||
|
#ifdef EBCDIC
|
||||||
|
*errorcodeptr = ERR93;
|
||||||
|
#else
|
||||||
|
ptr = p + 1;
|
||||||
|
escape = 0; /* Not a fancy escape after all */
|
||||||
|
goto COME_FROM_NU;
|
||||||
|
#endif
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Give an error if what follows is not a quantifier, but don't override
|
||||||
|
an error set by the quantifier reader (e.g. number overflow). */
|
||||||
|
|
||||||
|
else
|
||||||
|
{
|
||||||
|
if (!read_repeat_counts(&p, ptrend, NULL, NULL, errorcodeptr) &&
|
||||||
|
*errorcodeptr == 0)
|
||||||
|
*errorcodeptr = ERR37;
|
||||||
|
}
|
||||||
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
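The decision made in the hunk above (plain \N, \N{U+hh..}, or a quantified \N such as \N{2,3}) can be illustrated with a small standalone classifier. This is a simplified sketch under stated assumptions, not the PCRE2 implementation: it works on a NUL-terminated string instead of a pointer/end pair, the names n_kind and classify_after_N are hypothetical, a single N_ERROR result stands in for the distinct error codes, and code points above 10FFFF are rejected purely to keep the arithmetic bounded.

```c
#include <ctype.h>

/* Hypothetical result type for what follows "\N" in a pattern. */
typedef enum {
    N_NOT_NEWLINE,   /* plain \N: any character that is not a newline */
    N_CODE_POINT,    /* \N{U+hh..}: a Unicode code point */
    N_QUANTIFIED,    /* \N{n}, \N{n,} or \N{n,m}: plain \N with a quantifier */
    N_ERROR          /* anything else, e.g. \N{name}, \N{U+} or \N{U} */
} n_kind;

/* s points at the character immediately after "\N". On N_CODE_POINT the
   parsed value is stored in *code_point. */
n_kind classify_after_N(const char *s, unsigned long *code_point)
{
    const char *p;

    if (*s != '{')
        return N_NOT_NEWLINE;
    p = s + 1;

    if (p[0] == 'U' && p[1] == '+') {
        unsigned long value = 0;
        int digits = 0;

        for (p += 2; isxdigit((unsigned char)*p); p++, digits++) {
            int c = tolower((unsigned char)*p);
            value = value * 16 +
                    (unsigned long)(isdigit(c) ? c - '0' : c - 'a' + 10);
            if (value > 0x10FFFFUL)
                return N_ERROR;    /* assumption: cap at the Unicode maximum */
        }
        if (digits == 0 || *p != '}')
            return N_ERROR;        /* digits missing or stray character */
        *code_point = value;
        return N_CODE_POINT;
    }

    /* Not U+: accept only a quantifier that starts with a digit. */
    if (!isdigit((unsigned char)*p))
        return N_ERROR;
    while (isdigit((unsigned char)*p))
        p++;
    if (*p == ',') {
        p++;
        while (isdigit((unsigned char)*p))
            p++;
    }
    return (*p == '}') ? N_QUANTIFIED : N_ERROR;
}
```

In this sketch "{U+1234}" is classified as a code point with value hex 1234, "{2,3}" as a quantifier, and both "{U+}" and "{U}" as errors, which mirrors the two failing cases added to testinput5 in this commit.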
||||||
|
@ -1725,6 +1761,9 @@ else
|
||||||
{
|
{
|
||||||
if (ptr < ptrend && *ptr == CHAR_LEFT_CURLY_BRACKET)
|
if (ptr < ptrend && *ptr == CHAR_LEFT_CURLY_BRACKET)
|
||||||
{
|
{
|
||||||
|
#ifndef EBCDIC
|
||||||
|
COME_FROM_NU:
|
||||||
|
#endif
|
||||||
if (++ptr >= ptrend || *ptr == CHAR_RIGHT_CURLY_BRACKET)
|
if (++ptr >= ptrend || *ptr == CHAR_RIGHT_CURLY_BRACKET)
|
||||||
{
|
{
|
||||||
*errorcodeptr = ERR78;
|
*errorcodeptr = ERR78;
|
||||||
|
@ -1858,19 +1897,6 @@ else
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
/* Perl supports \N{name} for character names, as well as plain \N for "not
|
|
||||||
newline". PCRE does not support \N{name}. However, it does support
|
|
||||||
quantification such as \N{2,3}. */
|
|
||||||
|
|
||||||
if (escape == ESC_N && ptr < ptrend && *ptr == CHAR_LEFT_CURLY_BRACKET &&
|
|
||||||
ptrend - ptr > 2)
|
|
||||||
{
|
|
||||||
PCRE2_SPTR p = ptr + 1;
|
|
||||||
if (!read_repeat_counts(&p, ptrend, NULL, NULL, errorcodeptr) &&
|
|
||||||
*errorcodeptr == 0)
|
|
||||||
*errorcodeptr = ERR37;
|
|
||||||
}
|
|
||||||
|
|
||||||
/* Set the pointer to the next character before returning. */
|
/* Set the pointer to the next character before returning. */
|
||||||
|
|
||||||
*ptrptr = ptr;
|
*ptrptr = ptr;
|
||||||
|
@ -3223,7 +3249,6 @@ while (ptr < ptrend)
|
||||||
tempptr = ptr;
|
tempptr = ptr;
|
||||||
escape = PRIV(check_escape)(&ptr, ptrend, &c, &errorcode,
|
escape = PRIV(check_escape)(&ptr, ptrend, &c, &errorcode,
|
||||||
options, TRUE, cb);
|
options, TRUE, cb);
|
||||||
|
|
||||||
if (errorcode != 0)
|
if (errorcode != 0)
|
||||||
{
|
{
|
||||||
CLASS_ESCAPE_FAILED:
|
CLASS_ESCAPE_FAILED:
|
||||||
|
|
|
@ -161,7 +161,7 @@ static const unsigned char compile_error_texts[] =
|
||||||
"using UCP is disabled by the application\0"
|
"using UCP is disabled by the application\0"
|
||||||
"name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)\0"
|
"name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)\0"
|
||||||
"character code point value in \\u.... sequence is too large\0"
|
"character code point value in \\u.... sequence is too large\0"
|
||||||
"digits missing in \\x{} or \\o{}\0"
|
"digits missing in \\x{} or \\o{} or \\N{U+}\0"
|
||||||
"syntax error or number too big in (?(VERSION condition\0"
|
"syntax error or number too big in (?(VERSION condition\0"
|
||||||
/* 80 */
|
/* 80 */
|
||||||
"internal error: unknown opcode in auto_possessify()\0"
|
"internal error: unknown opcode in auto_possessify()\0"
|
||||||
|
@ -179,6 +179,7 @@ static const unsigned char compile_error_texts[] =
|
||||||
"internal error: bad code value in parsed_skip()\0"
|
"internal error: bad code value in parsed_skip()\0"
|
||||||
"PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is not allowed in UTF-16 mode\0"
|
"PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is not allowed in UTF-16 mode\0"
|
||||||
"invalid option bits with PCRE2_LITERAL\0"
|
"invalid option bits with PCRE2_LITERAL\0"
|
||||||
|
"\\N{U+dddd} is not supported in EBCDIC mode\0"
|
||||||
;
|
;
|
||||||
|
|
||||||
/* Match-time and UTF error texts are in the same format. */
|
/* Match-time and UTF error texts are in the same format. */
|
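The error texts above are stored as one array of NUL-separated strings, which is why the new message is simply appended before the closing semicolon. A minimal sketch of how the nth message could be located in such a table follows. The table demo_error_texts, its three entries and their indexes, and the function nth_message are all hypothetical and exist only for this illustration; they do not reproduce the real table or the library's lookup code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical three-entry table in the same NUL-separated layout. The
   compiler appends one more NUL, so the table ends with an empty string. */
static const char demo_error_texts[] =
    "no error\0"
    "digits missing in \\x{} or \\o{} or \\N{U+}\0"
    "\\N{U+dddd} is not supported in EBCDIC mode\0"
    ;

/* Returns a pointer to message number n (counting from 0), or NULL if the
   table holds fewer than n + 1 messages. */
const char *nth_message(const char *table, size_t table_size, unsigned n)
{
    const char *p = table;
    const char *end = table + table_size;

    while (n-- > 0) {
        if (p >= end || *p == '\0')
            return NULL;           /* ran off the end of the table */
        p += strlen(p) + 1;        /* skip this message and its NUL */
    }
    return (p < end && *p != '\0') ? p : NULL;
}
```

Called as nth_message(demo_error_texts, sizeof demo_error_texts, 2), the sketch returns the EBCDIC message; index 3 returns NULL because only the terminating empty string remains.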
||||||
|
|
|
@ -2287,5 +2287,11 @@
|
||||||
\x{123}\x{122}\x{123}
|
\x{123}\x{122}\x{123}
|
||||||
\= Expect no match
|
\= Expect no match
|
||||||
\x{123}\x{124}\x{123}
|
\x{123}\x{124}\x{123}
|
||||||
|
|
||||||
|
/\N{U+1234}/utf
|
||||||
|
\x{1234}
|
||||||
|
|
||||||
|
/[\N{U+1234}]/utf
|
||||||
|
\x{1234}
|
||||||
|
|
||||||
# End of testinput4
|
# End of testinput4
|
||||||
|
|
|
@ -2087,4 +2087,8 @@
|
||||||
\x{655}
|
\x{655}
|
||||||
\x{1D1AA}
|
\x{1D1AA}
|
||||||
|
|
||||||
|
/\N{U+}/
|
||||||
|
|
||||||
|
/\N{U}/
|
||||||
|
|
||||||
# End of testinput5
|
# End of testinput5
|
||||||
|
|
|
@ -13194,7 +13194,7 @@ Failed: error 167 at offset 5: non-hex character in \x{} (closing brace missing?
|
||||||
Failed: error 167 at offset 7: non-hex character in \x{} (closing brace missing?)
|
Failed: error 167 at offset 7: non-hex character in \x{} (closing brace missing?)
|
||||||
|
|
||||||
/^A\x{/
|
/^A\x{/
|
||||||
Failed: error 178 at offset 5: digits missing in \x{} or \o{}
|
Failed: error 178 at offset 5: digits missing in \x{} or \o{} or \N{U+}
|
||||||
|
|
||||||
/[ab]++/B,no_auto_possess
|
/[ab]++/B,no_auto_possess
|
||||||
------------------------------------------------------------------
|
------------------------------------------------------------------
|
||||||
|
@ -13408,7 +13408,7 @@ Failed: error 133 at offset 7: parentheses are too deeply nested (stack check)
|
||||||
Failed: error 155 at offset 2: missing opening brace after \o
|
Failed: error 155 at offset 2: missing opening brace after \o
|
||||||
|
|
||||||
/\o{}/
|
/\o{}/
|
||||||
Failed: error 178 at offset 3: digits missing in \x{} or \o{}
|
Failed: error 178 at offset 3: digits missing in \x{} or \o{} or \N{U+}
|
||||||
|
|
||||||
/\o{whatever}/
|
/\o{whatever}/
|
||||||
Failed: error 164 at offset 3: non-octal character in \o{} (closing brace missing?)
|
Failed: error 164 at offset 3: non-octal character in \o{} (closing brace missing?)
|
||||||
|
@ -13416,7 +13416,7 @@ Failed: error 164 at offset 3: non-octal character in \o{} (closing brace missin
|
||||||
/\xthing/
|
/\xthing/
|
||||||
|
|
||||||
/\x{}/
|
/\x{}/
|
||||||
Failed: error 178 at offset 3: digits missing in \x{} or \o{}
|
Failed: error 178 at offset 3: digits missing in \x{} or \o{} or \N{U+}
|
||||||
|
|
||||||
/\x{whatever}/
|
/\x{whatever}/
|
||||||
Failed: error 167 at offset 3: non-hex character in \x{} (closing brace missing?)
|
Failed: error 167 at offset 3: non-hex character in \x{} (closing brace missing?)
|
||||||
|
|
|
@ -3703,5 +3703,13 @@ No match
|
||||||
\= Expect no match
|
\= Expect no match
|
||||||
\x{123}\x{124}\x{123}
|
\x{123}\x{124}\x{123}
|
||||||
No match
|
No match
|
||||||
|
|
||||||
|
/\N{U+1234}/utf
|
||||||
|
\x{1234}
|
||||||
|
0: \x{1234}
|
||||||
|
|
||||||
|
/[\N{U+1234}]/utf
|
||||||
|
\x{1234}
|
||||||
|
0: \x{1234}
|
||||||
|
|
||||||
# End of testinput4
|
# End of testinput4
|
||||||
|
|
|
@ -4750,4 +4750,10 @@ No match
|
||||||
\x{1D1AA}
|
\x{1D1AA}
|
||||||
0: \x{1d1aa}
|
0: \x{1d1aa}
|
||||||
|
|
||||||
|
/\N{U+}/
|
||||||
|
Failed: error 178 at offset 5: digits missing in \x{} or \o{} or \N{U+}
|
||||||
|
|
||||||
|
/\N{U}/
|
||||||
|
Failed: error 137 at offset 2: PCRE does not support \L, \l, \N{name}, \U, or \u
|
||||||
|
|
||||||
# End of testinput5
|
# End of testinput5
|
||||||
|
|
|
@ -1,3 +1,4 @@
|
||||||
|
PCRE2 version 10.32-RC1 2018-02-19
|
||||||
# This is a specialized test for checking, when PCRE2 is compiled with the
|
# This is a specialized test for checking, when PCRE2 is compiled with the
|
||||||
# EBCDIC option but in an ASCII environment, that newline, white space, and \c
|
# EBCDIC option but in an ASCII environment, that newline, white space, and \c
|
||||||
# functionality is working. It catches cases where explicit values such as 0x0a
|
# functionality is working. It catches cases where explicit values such as 0x0a
|
||||||
|
@ -200,6 +201,6 @@ No match
|
||||||
0: \xff
|
0: \xff
|
||||||
|
|
||||||
/\ƒ&/
|
/\ƒ&/
|
||||||
Failed: error 168 at offset 2: \c\x20must\x20be\x20followed\x20by\x20a\x20letter\x20or\x20one\x20of\x20[\]^_\x3f
|
Failed: error 168 at offset 3: \c\x20must\x20be\x20followed\x20by\x20a\x20letter\x20or\x20one\x20of\x20[\]^_\x3f
|
||||||
|
|
||||||
# End
|
# End
|
||||||
|
|