Add support for \N{U+dd...}, for ASCII and Unicode modes only.

Philip.Hazel 2018-07-27 16:30:40 +00:00
parent 775481293a
commit e9aa3c0a21
16 changed files with 449 additions and 322 deletions
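
The documentation changes in this commit distinguish three forms of \N: \N{U+hh..} (a character given by its hexadecimal code point), \N not followed by an opening brace (any non-newline character), and Perl's \N{name} (not supported by PCRE2). The sketch below illustrates that distinction in Python. It is not PCRE2's parser: `classify_backslash_n` is a hypothetical helper, and reporting every other brace form as unsupported is a simplifying assumption based only on the wording in this commit.

```python
import re

# Illustrative only: a hypothetical classifier, not PCRE2's own code.
# Matches the new form \N{U+hh..} with one or more hexadecimal digits.
_CODE_POINT = re.compile(r"\\N\{U\+([0-9A-Fa-f]+)\}")


def classify_backslash_n(pattern: str, pos: int):
    """Classify the \\N escape that starts at pattern[pos].

    Returns a tuple (kind, code_point, end_index).
    """
    if not pattern.startswith("\\N", pos):
        raise ValueError("no \\N escape at this position")

    match = _CODE_POINT.match(pattern, pos)
    if match:
        # \N{U+hh..}: a single character specified by its code point.
        return ("code_point", int(match.group(1), 16), match.end())

    if pattern.startswith("{", pos + 2):
        # Assumption: any other brace form is treated like Perl's \N{name},
        # which the documentation says PCRE2 does not support.
        raise ValueError("\\N{...} is only supported as \\N{U+hh..}")

    # Bare \N: matches any character that is not a newline.
    return ("non_newline", None, pos + 2)
```

For example, `classify_backslash_n(r"a\N{U+20AC}b", 1)` reports a code point of hex 20AC, while `classify_backslash_n(r"\Nx", 0)` reports the non-newline form.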

@ -130,6 +130,8 @@ present.
28. A (*MARK) name was not being passed back for positive assertions that were 28. A (*MARK) name was not being passed back for positive assertions that were
terminated by (*ACCEPT). terminated by (*ACCEPT).
29. Add support for \N{U+dddd}, but not in EBCDIC environments.
Version 10.31 12-February-2018 Version 10.31 12-February-2018
------------------------------ ------------------------------

@ -249,10 +249,11 @@ is used.
<P> <P>
The newline convention affects where the circumflex and dollar assertions are The newline convention affects where the circumflex and dollar assertions are
true. It also affects the interpretation of the dot metacharacter when true. It also affects the interpretation of the dot metacharacter when
PCRE2_DOTALL is not set, and the behaviour of \N. However, it does not affect PCRE2_DOTALL is not set, and the behaviour of \N when not followed by an
what the \R escape sequence matches. By default, this is any Unicode newline opening brace. However, it does not affect what the \R escape sequence
sequence, for Perl compatibility. However, this can be changed; see the next matches. By default, this is any Unicode newline sequence, for Perl
section and the description of \R in the section entitled compatibility. However, this can be changed; see the next section and the
description of \R in the section entitled
<a href="#newlineseq">"Newline sequences"</a> <a href="#newlineseq">"Newline sequences"</a>
below. A change of \R setting can be combined with a change of newline below. A change of \R setting can be combined with a change of newline
convention. convention.
@ -382,20 +383,27 @@ text editing, it is often easier to use one of the following escape sequences
than the binary character it represents. In an ASCII or Unicode environment, than the binary character it represents. In an ASCII or Unicode environment,
these escapes are as follows: these escapes are as follows:
<pre> <pre>
\a alarm, that is, the BEL character (hex 07) \a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any printable ASCII character \cx "control-x", where x is any printable ASCII character
\e escape (hex 1B) \e escape (hex 1B)
\f form feed (hex 0C) \f form feed (hex 0C)
\n linefeed (hex 0A) \n linefeed (hex 0A)
\r carriage return (hex 0D) \r carriage return (hex 0D)
\t tab (hex 09) \t tab (hex 09)
\0dd character with octal code 0dd \0dd character with octal code 0dd
\ddd character with octal code ddd, or backreference \ddd character with octal code ddd, or backreference
\o{ddd..} character with octal code ddd.. \o{ddd..} character with octal code ddd..
\xhh character with hex code hh \xhh character with hex code hh
\x{hhh..} character with hex code hhh.. (default mode) \x{hhh..} character with hex code hhh.. (default mode)
\uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set) \N{U+hhh..} character with Unicode code point hhh..
\uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
</pre> </pre>
Note that when \N is not followed by an opening brace (curly bracket) it has
an entirely different meaning, matching any character that is not a newline.
Perl also uses \N{name} to specify characters by Unicode name; PCRE2 does not
support this.
</P>
<P>
The precise effect of \cx on ASCII characters is as follows: if x is a lower The precise effect of \cx on ASCII characters is as follows: if x is a lower
case letter, it is converted to upper case. Then bit 6 of the character (hex case letter, it is converted to upper case. Then bit 6 of the character (hex
40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A), 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A),
@ -404,14 +412,14 @@ code unit following \c has a value less than 32 or greater than 126, a
compile-time error occurs. compile-time error occurs.
</P> </P>
<P> <P>
When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported. \a, \e,
generate the appropriate EBCDIC code values. The \c escape is processed \f, \n, \r, and \t generate the appropriate EBCDIC code values. The \c
as specified for Perl in the <b>perlebcdic</b> document. The only characters escape is processed as specified for Perl in the <b>perlebcdic</b> document. The
that are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?. Any only characters that are allowed after \c are A-Z, a-z, or one of @, [, \, ],
other character provokes a compile-time error. The sequence \c@ encodes ^, _, or ?. Any other character provokes a compile-time error. The sequence
character code 0; after \c the letters (in either case) encode characters 1-26 \c@ encodes character code 0; after \c the letters (in either case) encode
(hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31
1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F). (hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F).
</P> </P>
<P> <P>
Thus, apart from \c?, these escapes generate the same character code values as Thus, apart from \c?, these escapes generate the same character code values as
@ -443,9 +451,9 @@ to be unambiguously specified.
</P> </P>
<P> <P>
For greater clarity and unambiguity, it is best to avoid following \ by a For greater clarity and unambiguity, it is best to avoid following \ by a
digit greater than zero. Instead, use \o{} or \x{} to specify character digit greater than zero. Instead, use \o{} or \x{} to specify numerical
numbers, and \g{} to specify backreferences. The following paragraphs character code points, and \g{} to specify backreferences. The following
describe the old, ambiguous syntax. paragraphs describe the old, ambiguous syntax.
</P> </P>
<P> <P>
The handling of a backslash followed by a digit other than 0 is complicated, The handling of a backslash followed by a digit other than 0 is complicated,
@ -528,10 +536,10 @@ and outside character classes. In addition, inside a character class, \b is
interpreted as the backspace character (hex 08). interpreted as the backspace character (hex 08).
</P> </P>
<P> <P>
\N is not allowed in a character class. \B, \R, and \X are not special When not followed by an opening brace, \N is not allowed in a character class.
inside a character class. Like other unrecognized alphabetic escape sequences, \B, \R, and \X are not special inside a character class. Like other
they cause an error. Outside a character class, these sequences have different unrecognized alphabetic escape sequences, they cause an error. Outside a
meanings. character class, these sequences have different meanings.
</P> </P>
<br><b> <br><b>
Unsupported escape sequences Unsupported escape sequences
@ -577,6 +585,7 @@ Another use of backslash is for specifying generic character types:
\D any character that is not a decimal digit \D any character that is not a decimal digit
\h any horizontal white space character \h any horizontal white space character
\H any character that is not a horizontal white space character \H any character that is not a horizontal white space character
\N any character that is not a newline
\s any white space character \s any white space character
\S any character that is not a white space character \S any character that is not a white space character
\v any vertical white space character \v any vertical white space character
@ -584,11 +593,14 @@ Another use of backslash is for specifying generic character types:
\w any "word" character \w any "word" character
\W any "non-word" character \W any "non-word" character
</pre> </pre>
There is also the single sequence \N, which matches a non-newline character. The \N escape sequence has the same meaning as
This is the same as
<a href="#fullstopdot">the "." metacharacter</a> <a href="#fullstopdot">the "." metacharacter</a>
when PCRE2_DOTALL is not set. Perl also uses \N to match characters by name; when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the
PCRE2 does not support this. meaning of \N. Note that when \N is followed by an opening brace it has a
different meaning. See the section entitled
<a href="#digitsafterbackslash">"Non-printing characters"</a>
above for details. Perl also uses \N{name} to specify characters by Unicode
name; PCRE2 does not support this.
</P> </P>
<P> <P>
Each pair of lower and upper case escape sequences partitions the complete set Each pair of lower and upper case escape sequences partitions the complete set
@ -1297,9 +1309,15 @@ dollar, the only relationship being that they both involve newlines. Dot has no
special meaning in a character class. special meaning in a character class.
</P> </P>
<P> <P>
The escape sequence \N behaves like a dot, except that it is not affected by The escape sequence \N when not followed by an opening brace behaves like a
the PCRE2_DOTALL option. In other words, it matches any character except one dot, except that it is not affected by the PCRE2_DOTALL option. In other words,
that signifies the end of a line. Perl also uses \N to match characters by it matches any character except one that signifies the end of a line.
</P>
<P>
When \N is followed by an opening brace it has a different meaning. See the
section entitled
<a href="#digitsafterbackslash">"Non-printing characters"</a>
above for details. Perl also uses \N{name} to specify characters by Unicode
name; PCRE2 does not support this. name; PCRE2 does not support this.
</P> </P>
<br><a name="SEC8" href="#TOC1">MATCHING A SINGLE CODE UNIT</a><br> <br><a name="SEC8" href="#TOC1">MATCHING A SINGLE CODE UNIT</a><br>
@ -1385,10 +1403,11 @@ string, and therefore it fails if the current pointer is at the end of the
string. string.
</P> </P>
<P> <P>
When caseless matching is set, any letters in a class represent both their Characters in a class may be specified by their code points using \o, \x, or
upper case and lower case versions, so for example, a caseless [aeiou] matches \N{U+hh..} in the usual way. When caseless matching is set, any letters in a
"A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a class represent both their upper case and lower case versions, so for example,
caseful version would. a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would.
</P> </P>
<P> <P>
Characters that might indicate line breaks are never treated in any special way Characters that might indicate line breaks are never treated in any special way
@ -1397,17 +1416,18 @@ whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
class such as [^a] always matches one of these characters. class such as [^a] always matches one of these characters.
</P> </P>
<P> <P>
The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
\V, \w, and \W may appear in a character class, and add the characters that \S, \v, \V, \w, and \W may appear in a character class, and add the
they match to the class. For example, [\dABCDEF] matches any hexadecimal characters that they match to the class. For example, [\dABCDEF] matches any
digit. In UTF modes, the PCRE2_UCP option affects the meanings of \d, \s, \w hexadecimal digit. In UTF modes, the PCRE2_UCP option affects the meanings of
and their upper case partners, just as it does when they appear outside a \d, \s, \w and their upper case partners, just as it does when they appear
character class, as described in the section entitled outside a character class, as described in the section entitled
<a href="#genericchartypes">"Generic character types"</a> <a href="#genericchartypes">"Generic character types"</a>
above. The escape sequence \b has a different meaning inside a character above. The escape sequence \b has a different meaning inside a character
class; it matches the backspace character. The sequences \B, \N, \R, and \X class; it matches the backspace character. The sequences \B, \R, and \X are
are not special inside a character class. Like any other unrecognized escape not special inside a character class. Like any other unrecognized escape
sequences, they cause an error. sequences, they cause an error. The same is true for \N when not followed by
an opening brace.
</P> </P>
<P> <P>
The minus (hyphen) character can be used to specify a range of characters in a The minus (hyphen) character can be used to specify a range of characters in a
@ -3559,7 +3579,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br> <br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 20 July 2018 Last updated: 27 July 2018
<br> <br>
Copyright &copy; 1997-2018 University of Cambridge. Copyright &copy; 1997-2018 University of Cambridge.
<br> <br>
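
The \cx rule quoted in the context lines of the hunks above (convert a lower case letter to upper case, then invert bit 6, hex 40) can be illustrated with a short Python sketch. `control_escape` is a hypothetical name, not a PCRE2 function, and the sketch covers only the ASCII rule, not the EBCDIC handling.

```python
def control_escape(x: str) -> int:
    """Return the code produced by \\cx for a single ASCII character x."""
    code = ord(x)
    # The documentation states that a code unit below 32 or above 126
    # after \c is a compile-time error; a ValueError stands in for that here.
    if code < 32 or code > 126:
        raise ValueError("character after \\c must be printable ASCII")
    # Lower case letters are converted to upper case first.
    if "a" <= x <= "z":
        code = ord(x.upper())
    # Then bit 6 (hex 40) is inverted.
    return code ^ 0x40
```

With this rule, `control_escape("A")` and `control_escape("a")` both give hex 01, and `control_escape(";")` gives hex 7B, matching the examples in the documentation text.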

@ -70,9 +70,10 @@ This table applies to ASCII and Unicode environments.
\ddd character with octal code ddd, or backreference \ddd character with octal code ddd, or backreference
\o{ddd..} character with octal code ddd.. \o{ddd..} character with octal code ddd..
\U "U" if PCRE2_ALT_BSUX is set (otherwise is an error) \U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
\N{U+hh..} character with Unicode code point hh..
\uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set) \uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
\xhh character with hex code hh \xhh character with hex code hh
\x{hhh..} character with hex code hhh.. \x{hh..} character with hex code hh..
</pre> </pre>
Note that \0dd is always an octal code. The treatment of backslash followed by Note that \0dd is always an octal code. The treatment of backslash followed by
a non-zero digit is complicated; for details see the section a non-zero digit is complicated; for details see the section
@ -80,7 +81,9 @@ a non-zero digit is complicated; for details see the section
in the in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a> <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation, where details of escape processing in EBCDIC environments are documentation, where details of escape processing in EBCDIC environments are
also given. also given. \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not
supported in EBCDIC environments. Note that \N not followed by an opening
curly bracket has a different meaning (see below).
</P> </P>
<P> <P>
When \x is not followed by {, from zero to two hexadecimal digits are read, When \x is not followed by {, from zero to two hexadecimal digits are read,
@ -621,7 +624,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br> <br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 21 July 2018 Last updated: 27 July 2018
<br> <br>
Copyright &copy; 1997-2018 University of Cambridge. Copyright &copy; 1997-2018 University of Cambridge.
<br> <br>

@ -6015,36 +6015,37 @@ SPECIAL START-OF-PATTERN ITEMS
The newline convention affects where the circumflex and dollar asser- The newline convention affects where the circumflex and dollar asser-
tions are true. It also affects the interpretation of the dot metachar- tions are true. It also affects the interpretation of the dot metachar-
acter when PCRE2_DOTALL is not set, and the behaviour of \N. However, acter when PCRE2_DOTALL is not set, and the behaviour of \N when not
it does not affect what the \R escape sequence matches. By default, followed by an opening brace. However, it does not affect what the \R
this is any Unicode newline sequence, for Perl compatibility. However, escape sequence matches. By default, this is any Unicode newline
this can be changed; see the next section and the description of \R in sequence, for Perl compatibility. However, this can be changed; see the
the section entitled "Newline sequences" below. A change of \R setting next section and the description of \R in the section entitled "Newline
can be combined with a change of newline convention. sequences" below. A change of \R setting can be combined with a change
of newline convention.
Specifying what \R matches Specifying what \R matches
It is possible to restrict \R to match only CR, LF, or CRLF (instead of It is possible to restrict \R to match only CR, LF, or CRLF (instead of
the complete set of Unicode line endings) by setting the option the complete set of Unicode line endings) by setting the option
PCRE2_BSR_ANYCRLF at compile time. This effect can also be achieved by PCRE2_BSR_ANYCRLF at compile time. This effect can also be achieved by
starting a pattern with (*BSR_ANYCRLF). For completeness, (*BSR_UNI- starting a pattern with (*BSR_ANYCRLF). For completeness, (*BSR_UNI-
CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE. CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE.
EBCDIC CHARACTER CODES EBCDIC CHARACTER CODES
PCRE2 can be compiled to run in an environment that uses EBCDIC as its PCRE2 can be compiled to run in an environment that uses EBCDIC as its
character code instead of ASCII or Unicode (typically a mainframe sys- character code instead of ASCII or Unicode (typically a mainframe sys-
tem). In the sections below, character code values are ASCII or Uni- tem). In the sections below, character code values are ASCII or Uni-
code; in an EBCDIC environment these characters may have different code code; in an EBCDIC environment these characters may have different code
values, and there are no code points greater than 255. values, and there are no code points greater than 255.
CHARACTERS AND METACHARACTERS CHARACTERS AND METACHARACTERS
A regular expression is a pattern that is matched against a subject A regular expression is a pattern that is matched against a subject
string from left to right. Most characters stand for themselves in a string from left to right. Most characters stand for themselves in a
pattern, and match the corresponding characters in the subject. As a pattern, and match the corresponding characters in the subject. As a
trivial example, the pattern trivial example, the pattern
The quick brown fox The quick brown fox
@ -6053,14 +6054,14 @@ CHARACTERS AND METACHARACTERS
caseless matching is specified (the PCRE2_CASELESS option), letters are caseless matching is specified (the PCRE2_CASELESS option), letters are
matched independently of case. matched independently of case.
The power of regular expressions comes from the ability to include The power of regular expressions comes from the ability to include
alternatives and repetitions in the pattern. These are encoded in the alternatives and repetitions in the pattern. These are encoded in the
pattern by the use of metacharacters, which do not stand for themselves pattern by the use of metacharacters, which do not stand for themselves
but instead are interpreted in some special way. but instead are interpreted in some special way.
There are two different sets of metacharacters: those that are recog- There are two different sets of metacharacters: those that are recog-
nized anywhere in the pattern except within square brackets, and those nized anywhere in the pattern except within square brackets, and those
that are recognized within square brackets. Outside square brackets, that are recognized within square brackets. Outside square brackets,
the metacharacters are as follows: the metacharacters are as follows:
\ general escape character with several uses \ general escape character with several uses
@ -6079,7 +6080,7 @@ CHARACTERS AND METACHARACTERS
also "possessive quantifier" also "possessive quantifier"
{ start min/max quantifier { start min/max quantifier
Part of a pattern that is in square brackets is called a "character Part of a pattern that is in square brackets is called a "character
class". In a character class the only metacharacters are: class". In a character class the only metacharacters are:
\ general escape character \ general escape character
@ -6096,30 +6097,30 @@ BACKSLASH
The backslash character has several uses. Firstly, if it is followed by The backslash character has several uses. Firstly, if it is followed by
a character that is not a number or a letter, it takes away any special a character that is not a number or a letter, it takes away any special
meaning that character may have. This use of backslash as an escape meaning that character may have. This use of backslash as an escape
character applies both inside and outside character classes. character applies both inside and outside character classes.
For example, if you want to match a * character, you must write \* in For example, if you want to match a * character, you must write \* in
the pattern. This escaping action applies whether or not the following the pattern. This escaping action applies whether or not the following
character would otherwise be interpreted as a metacharacter, so it is character would otherwise be interpreted as a metacharacter, so it is
always safe to precede a non-alphanumeric with backslash to specify always safe to precede a non-alphanumeric with backslash to specify
that it stands for itself. In particular, if you want to match a back- that it stands for itself. In particular, if you want to match a back-
slash, you write \\. slash, you write \\.
In a UTF mode, only ASCII numbers and letters have any special meaning In a UTF mode, only ASCII numbers and letters have any special meaning
after a backslash. All other characters (in particular, those whose after a backslash. All other characters (in particular, those whose
code points are greater than 127) are treated as literals. code points are greater than 127) are treated as literals.
If a pattern is compiled with the PCRE2_EXTENDED option, most white If a pattern is compiled with the PCRE2_EXTENDED option, most white
space in the pattern (other than in a character class), and characters space in the pattern (other than in a character class), and characters
between a # outside a character class and the next newline, inclusive, between a # outside a character class and the next newline, inclusive,
are ignored. An escaping backslash can be used to include a white space are ignored. An escaping backslash can be used to include a white space
or # character as part of the pattern. or # character as part of the pattern.
If you want to remove the special meaning from a sequence of charac- If you want to remove the special meaning from a sequence of charac-
ters, you can do so by putting them between \Q and \E. This is differ- ters, you can do so by putting them between \Q and \E. This is differ-
ent from Perl in that $ and @ are handled as literals in \Q...\E ent from Perl in that $ and @ are handled as literals in \Q...\E
sequences in PCRE2, whereas in Perl, $ and @ cause variable interpola- sequences in PCRE2, whereas in Perl, $ and @ cause variable interpola-
tion. Note the following examples: tion. Note the following examples:
Pattern PCRE2 matches Perl matches Pattern PCRE2 matches Perl matches
@ -6129,36 +6130,42 @@ BACKSLASH
\Qabc\$xyz\E abc\$xyz abc\$xyz \Qabc\$xyz\E abc\$xyz abc\$xyz
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz \Qabc\E\$\Qxyz\E abc$xyz abc$xyz
The \Q...\E sequence is recognized both inside and outside character The \Q...\E sequence is recognized both inside and outside character
classes. An isolated \E that is not preceded by \Q is ignored. If \Q classes. An isolated \E that is not preceded by \Q is ignored. If \Q
is not followed by \E later in the pattern, the literal interpretation is not followed by \E later in the pattern, the literal interpretation
continues to the end of the pattern (that is, \E is assumed at the continues to the end of the pattern (that is, \E is assumed at the
end). If the isolated \Q is inside a character class, this causes an end). If the isolated \Q is inside a character class, this causes an
error, because the character class is not terminated by a closing error, because the character class is not terminated by a closing
square bracket. square bracket.
Non-printing characters Non-printing characters
A second use of backslash provides a way of encoding non-printing char- A second use of backslash provides a way of encoding non-printing char-
acters in patterns in a visible manner. There is no restriction on the acters in patterns in a visible manner. There is no restriction on the
appearance of non-printing characters in a pattern, but when a pattern appearance of non-printing characters in a pattern, but when a pattern
is being prepared by text editing, it is often easier to use one of the is being prepared by text editing, it is often easier to use one of the
following escape sequences than the binary character it represents. In following escape sequences than the binary character it represents. In
an ASCII or Unicode environment, these escapes are as follows: an ASCII or Unicode environment, these escapes are as follows:
\a alarm, that is, the BEL character (hex 07) \a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any printable ASCII character \cx "control-x", where x is any printable ASCII character
\e escape (hex 1B) \e escape (hex 1B)
\f form feed (hex 0C) \f form feed (hex 0C)
\n linefeed (hex 0A) \n linefeed (hex 0A)
\r carriage return (hex 0D) \r carriage return (hex 0D)
\t tab (hex 09) \t tab (hex 09)
\0dd character with octal code 0dd \0dd character with octal code 0dd
\ddd character with octal code ddd, or backreference \ddd character with octal code ddd, or backreference
\o{ddd..} character with octal code ddd.. \o{ddd..} character with octal code ddd..
\xhh character with hex code hh \xhh character with hex code hh
\x{hhh..} character with hex code hhh.. (default mode) \x{hhh..} character with hex code hhh.. (default mode)
\uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set) \N{U+hhh..} character with Unicode code point hhh..
\uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
Note that when \N is not followed by an opening brace (curly bracket)
it has an entirely different meaning, matching any character that is
not a newline. Perl also uses \N{name} to specify characters by Uni-
code name; PCRE2 does not support this.
The precise effect of \cx on ASCII characters is as follows: if x is a The precise effect of \cx on ASCII characters is as follows: if x is a
lower case letter, it is converted to upper case. Then bit 6 of the lower case letter, it is converted to upper case. Then bit 6 of the
@ -6167,15 +6174,15 @@ BACKSLASH
hex 7B (; is 3B). If the code unit following \c has a value less than hex 7B (; is 3B). If the code unit following \c has a value less than
32 or greater than 126, a compile-time error occurs. 32 or greater than 126, a compile-time error occurs.
When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t gen- When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported.
erate the appropriate EBCDIC code values. The \c escape is processed as \a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values.
specified for Perl in the perlebcdic document. The only characters that The \c escape is processed as specified for Perl in the perlebcdic doc-
are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?. ument. The only characters that are allowed after \c are A-Z, a-z, or
Any other character provokes a compile-time error. The sequence \c@ one of @, [, \, ], ^, _, or ?. Any other character provokes a compile-
encodes character code 0; after \c the letters (in either case) encode time error. The sequence \c@ encodes character code 0; after \c the
characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters letters (in either case) encode characters 1-26 (hex 01 to hex 1A); [,
27-31 (hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95 \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c?
(hex 5F). becomes either 255 (hex FF) or 95 (hex 5F).
Thus, apart from \c?, these escapes generate the same character code Thus, apart from \c?, these escapes generate the same character code
values as they do in an ASCII environment, though the meanings of the values as they do in an ASCII environment, though the meanings of the
@ -6203,9 +6210,9 @@ BACKSLASH
numbers and backreferences to be unambiguously specified. numbers and backreferences to be unambiguously specified.
For greater clarity and unambiguity, it is best to avoid following \ by For greater clarity and unambiguity, it is best to avoid following \ by
a digit greater than zero. Instead, use \o{} or \x{} to specify charac- a digit greater than zero. Instead, use \o{} or \x{} to specify numeri-
ter numbers, and \g{} to specify backreferences. The following para- cal character code points, and \g{} to specify backreferences. The fol-
graphs describe the old, ambiguous syntax. lowing paragraphs describe the old, ambiguous syntax.
The handling of a backslash followed by a digit other than 0 is compli- The handling of a backslash followed by a digit other than 0 is compli-
cated, and Perl has changed over time, causing PCRE2 also to change. cated, and Perl has changed over time, causing PCRE2 also to change.
@ -6281,10 +6288,10 @@ BACKSLASH
inside and outside character classes. In addition, inside a character inside and outside character classes. In addition, inside a character
class, \b is interpreted as the backspace character (hex 08). class, \b is interpreted as the backspace character (hex 08).
\N is not allowed in a character class. \B, \R, and \X are not special When not followed by an opening brace, \N is not allowed in a character
inside a character class. Like other unrecognized alphabetic escape class. \B, \R, and \X are not special inside a character class. Like
sequences, they cause an error. Outside a character class, these other unrecognized alphabetic escape sequences, they cause an error.
sequences have different meanings. Outside a character class, these sequences have different meanings.
Unsupported escape sequences Unsupported escape sequences
@ -6318,6 +6325,7 @@ BACKSLASH
\D any character that is not a decimal digit \D any character that is not a decimal digit
\h any horizontal white space character \h any horizontal white space character
\H any character that is not a horizontal white space character \H any character that is not a horizontal white space character
\N any character that is not a newline
\s any white space character \s any white space character
\S any character that is not a white space character \S any character that is not a white space character
\v any vertical white space character \v any vertical white space character
@ -6325,10 +6333,12 @@ BACKSLASH
\w any "word" character \w any "word" character
\W any "non-word" character \W any "non-word" character
There is also the single sequence \N, which matches a non-newline char- The \N escape sequence has the same meaning as the "." metacharacter
acter. This is the same as the "." metacharacter when PCRE2_DOTALL is when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change
not set. Perl also uses \N to match characters by name; PCRE2 does not the meaning of \N. Note that when \N is followed by an opening brace it
support this. has a different meaning. See the section entitled "Non-printing charac-
ters" above for details. Perl also uses \N{name} to specify characters
by Unicode name; PCRE2 does not support this.
Each pair of lower and upper case escape sequences partitions the com- Each pair of lower and upper case escape sequences partitions the com-
plete set of characters into two disjoint sets. Any given character plete set of characters into two disjoint sets. Any given character
@ -6867,49 +6877,54 @@ FULL STOP (PERIOD, DOT) AND \N
flex and dollar, the only relationship being that they both involve flex and dollar, the only relationship being that they both involve
newlines. Dot has no special meaning in a character class. newlines. Dot has no special meaning in a character class.
The escape sequence \N behaves like a dot, except that it is not The escape sequence \N when not followed by an opening brace behaves
affected by the PCRE2_DOTALL option. In other words, it matches any like a dot, except that it is not affected by the PCRE2_DOTALL option.
character except one that signifies the end of a line. Perl also uses In other words, it matches any character except one that signifies the
\N to match characters by name; PCRE2 does not support this. end of a line.
When \N is followed by an opening brace it has a different meaning. See
the section entitled "Non-printing characters" above for details. Perl
also uses \N{name} to specify characters by Unicode name; PCRE2 does
not support this.
MATCHING A SINGLE CODE UNIT MATCHING A SINGLE CODE UNIT
Outside a character class, the escape sequence \C matches any one code Outside a character class, the escape sequence \C matches any one code
unit, whether or not a UTF mode is set. In the 8-bit library, one code unit, whether or not a UTF mode is set. In the 8-bit library, one code
unit is one byte; in the 16-bit library it is a 16-bit unit; in the unit is one byte; in the 16-bit library it is a 16-bit unit; in the
32-bit library it is a 32-bit unit. Unlike a dot, \C always matches 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
line-ending characters. The feature is provided in Perl in order to line-ending characters. The feature is provided in Perl in order to
match individual bytes in UTF-8 mode, but it is unclear how it can use- match individual bytes in UTF-8 mode, but it is unclear how it can use-
fully be used. fully be used.
Because \C breaks up characters into individual code units, matching Because \C breaks up characters into individual code units, matching
one unit with \C in UTF-8 or UTF-16 mode means that the rest of the one unit with \C in UTF-8 or UTF-16 mode means that the rest of the
string may start with a malformed UTF character. This has undefined string may start with a malformed UTF character. This has undefined
results, because PCRE2 assumes that it is matching character by charac- results, because PCRE2 assumes that it is matching character by charac-
ter in a valid UTF string (by default it checks the subject string's ter in a valid UTF string (by default it checks the subject string's
validity at the start of processing unless the PCRE2_NO_UTF_CHECK validity at the start of processing unless the PCRE2_NO_UTF_CHECK
option is used). option is used).
An application can lock out the use of \C by setting the An application can lock out the use of \C by setting the
PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also
possible to build PCRE2 with the use of \C permanently disabled. possible to build PCRE2 with the use of \C permanently disabled.
PCRE2 does not allow \C to appear in lookbehind assertions (described PCRE2 does not allow \C to appear in lookbehind assertions (described
below) in UTF-8 or UTF-16 modes, because this would make it impossible below) in UTF-8 or UTF-16 modes, because this would make it impossible
to calculate the length of the lookbehind. Neither the alternative to calculate the length of the lookbehind. Neither the alternative
matching function pcre2_dfa_match() nor the JIT optimizer support \C in matching function pcre2_dfa_match() nor the JIT optimizer support \C in
these UTF modes. The former gives a match-time error; the latter fails these UTF modes. The former gives a match-time error; the latter fails
to optimize and so the match is always run using the interpreter. to optimize and so the match is always run using the interpreter.
In the 32-bit library, however, \C is always supported (when not In the 32-bit library, however, \C is always supported (when not
explicitly locked out) because it always matches a single code unit, explicitly locked out) because it always matches a single code unit,
whether or not UTF-32 is specified. whether or not UTF-32 is specified.
In general, the \C escape sequence is best avoided. However, one way of In general, the \C escape sequence is best avoided. However, one way of
using it that avoids the problem of malformed UTF-8 or UTF-16 charac- using it that avoids the problem of malformed UTF-8 or UTF-16 charac-
ters is to use a lookahead to check the length of the next character, ters is to use a lookahead to check the length of the next character,
as in this pattern, which could be used with a UTF-8 string (ignore as in this pattern, which could be used with a UTF-8 string (ignore
white space and line breaks): white space and line breaks):
(?| (?=[\x00-\x7f])(\C) | (?| (?=[\x00-\x7f])(\C) |
@ -6917,10 +6932,10 @@ MATCHING A SINGLE CODE UNIT
(?=[\x{800}-\x{ffff}])(\C)(\C)(\C) | (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
(?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C)) (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
In this example, a group that starts with (?| resets the capturing In this example, a group that starts with (?| resets the capturing
parentheses numbers in each alternative (see "Duplicate Subpattern Num- parentheses numbers in each alternative (see "Duplicate Subpattern Num-
bers" below). The assertions at the start of each branch check the next bers" below). The assertions at the start of each branch check the next
UTF-8 character for values whose encoding uses 1, 2, 3, or 4 bytes, UTF-8 character for values whose encoding uses 1, 2, 3, or 4 bytes,
respectively. The character's individual bytes are then captured by the respectively. The character's individual bytes are then captured by the
appropriate number of \C groups. appropriate number of \C groups.
@ -6929,50 +6944,53 @@ SQUARE BRACKETS AND CHARACTER CLASSES
An opening square bracket introduces a character class, terminated by a An opening square bracket introduces a character class, terminated by a
closing square bracket. A closing square bracket on its own is not spe- closing square bracket. A closing square bracket on its own is not spe-
cial by default. If a closing square bracket is required as a member cial by default. If a closing square bracket is required as a member
of the class, it should be the first data character in the class (after of the class, it should be the first data character in the class (after
an initial circumflex, if present) or escaped with a backslash. This an initial circumflex, if present) or escaped with a backslash. This
means that, by default, an empty class cannot be defined. However, if means that, by default, an empty class cannot be defined. However, if
the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at
the start does end the (empty) class. the start does end the (empty) class.
A character class matches a single character in the subject. A matched A character class matches a single character in the subject. A matched
character must be in the set of characters defined by the class, unless character must be in the set of characters defined by the class, unless
the first character in the class definition is a circumflex, in which the first character in the class definition is a circumflex, in which
case the subject character must not be in the set defined by the class. case the subject character must not be in the set defined by the class.
If a circumflex is actually required as a member of the class, ensure If a circumflex is actually required as a member of the class, ensure
it is not the first character, or escape it with a backslash. it is not the first character, or escape it with a backslash.
For example, the character class [aeiou] matches any lower case vowel, For example, the character class [aeiou] matches any lower case vowel,
while [^aeiou] matches any character that is not a lower case vowel. while [^aeiou] matches any character that is not a lower case vowel.
Note that a circumflex is just a convenient notation for specifying the Note that a circumflex is just a convenient notation for specifying the
characters that are in the class by enumerating those that are not. A characters that are in the class by enumerating those that are not. A
class that starts with a circumflex is not an assertion; it still con- class that starts with a circumflex is not an assertion; it still con-
sumes a character from the subject string, and therefore it fails if sumes a character from the subject string, and therefore it fails if
the current pointer is at the end of the string. the current pointer is at the end of the string.
When caseless matching is set, any letters in a class represent both Characters in a class may be specified by their code points using \o,
their upper case and lower case versions, so for example, a caseless \x, or \N{U+hh..} in the usual way. When caseless matching is set, any
[aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not letters in a class represent both their upper case and lower case ver-
match "A", whereas a caseful version would. sions, so for example, a caseless [aeiou] matches "A" as well as "a",
and a caseless [^aeiou] does not match "A", whereas a caseful version
would.
Characters that might indicate line breaks are never treated in any Characters that might indicate line breaks are never treated in any
special way when matching character classes, whatever line-ending special way when matching character classes, whatever line-ending
sequence is in use, and whatever setting of the PCRE2_DOTALL and sequence is in use, and whatever setting of the PCRE2_DOTALL and
PCRE2_MULTILINE options is used. A class such as [^a] always matches PCRE2_MULTILINE options is used. A class such as [^a] always matches
one of these characters. one of these characters.
The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V, The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
\w, and \W may appear in a character class, and add the characters that \S, \v, \V, \w, and \W may appear in a character class, and add the
they match to the class. For example, [\dABCDEF] matches any hexadeci- characters that they match to the class. For example, [\dABCDEF]
mal digit. In UTF modes, the PCRE2_UCP option affects the meanings of matches any hexadecimal digit. In UTF modes, the PCRE2_UCP option
\d, \s, \w and their upper case partners, just as it does when they affects the meanings of \d, \s, \w and their upper case partners, just
appear outside a character class, as described in the section entitled as it does when they appear outside a character class, as described in
"Generic character types" above. The escape sequence \b has a different the section entitled "Generic character types" above. The escape
meaning inside a character class; it matches the backspace character. sequence \b has a different meaning inside a character class; it
The sequences \B, \N, \R, and \X are not special inside a character matches the backspace character. The sequences \B, \R, and \X are not
class. Like any other unrecognized escape sequences, they cause an special inside a character class. Like any other unrecognized escape
error. sequences, they cause an error. The same is true for \N when not fol-
lowed by an opening brace.
The minus (hyphen) character can be used to specify a range of charac- The minus (hyphen) character can be used to specify a range of charac-
ters in a character class. For example, [d-m] matches any letter ters in a character class. For example, [d-m] matches any letter
@ -9012,7 +9030,7 @@ AUTHOR
REVISION REVISION
Last updated: 20 July 2018 Last updated: 27 July 2018
Copyright (c) 1997-2018 University of Cambridge. Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -9873,19 +9891,23 @@ ESCAPED CHARACTERS
\ddd character with octal code ddd, or backreference \ddd character with octal code ddd, or backreference
\o{ddd..} character with octal code ddd.. \o{ddd..} character with octal code ddd..
\U "U" if PCRE2_ALT_BSUX is set (otherwise is an error) \U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
\N{U+hh..} character with Unicode code point hh..
\uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set) \uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
\xhh character with hex code hh \xhh character with hex code hh
\x{hhh..} character with hex code hhh.. \x{hh..} character with hex code hh..
Note that \0dd is always an octal code. The treatment of backslash fol- Note that \0dd is always an octal code. The treatment of backslash fol-
lowed by a non-zero digit is complicated; for details see the section lowed by a non-zero digit is complicated; for details see the section
"Non-printing characters" in the pcre2pattern documentation, where "Non-printing characters" in the pcre2pattern documentation, where
details of escape processing in EBCDIC environments are also given. details of escape processing in EBCDIC environments are also given.
\N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not supported in
EBCDIC environments. Note that \N not followed by an opening curly
bracket has a different meaning (see below).
When \x is not followed by {, from zero to two hexadecimal digits are When \x is not followed by {, from zero to two hexadecimal digits are
read, but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadec- read, but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadec-
imal digits to be recognized as a hexadecimal escape; otherwise it imal digits to be recognized as a hexadecimal escape; otherwise it
matches a literal "x". Likewise, if \u (in ALT_BSUX mode) is not fol- matches a literal "x". Likewise, if \u (in ALT_BSUX mode) is not fol-
lowed by four hexadecimal digits, it matches a literal "u". lowed by four hexadecimal digits, it matches a literal "u".
@ -9910,14 +9932,14 @@ CHARACTER TYPES
\W a "non-word" character \W a "non-word" character
\X a Unicode extended grapheme cluster \X a Unicode extended grapheme cluster
\C is dangerous because it may leave the current matching point in the \C is dangerous because it may leave the current matching point in the
middle of a UTF-8 or UTF-16 character. The application can lock out the middle of a UTF-8 or UTF-16 character. The application can lock out the
use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is also use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is also
possible to build PCRE2 with the use of \C permanently disabled. possible to build PCRE2 with the use of \C permanently disabled.
By default, \d, \s, and \w match only ASCII characters, even in UTF-8 By default, \d, \s, and \w match only ASCII characters, even in UTF-8
mode or in the 16-bit and 32-bit libraries. However, if locale-specific mode or in the 16-bit and 32-bit libraries. However, if locale-specific
matching is happening, \s and \w may also match characters with code matching is happening, \s and \w may also match characters with code
points in the range 128-255. If the PCRE2_UCP option is set, the behav- points in the range 128-255. If the PCRE2_UCP option is set, the behav-
iour of these escape sequences is changed to use Unicode properties and iour of these escape sequences is changed to use Unicode properties and
they match many more characters. they match many more characters.
@ -9986,28 +10008,28 @@ PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P
SCRIPT NAMES FOR \p AND \P SCRIPT NAMES FOR \p AND \P
Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Bali- Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Bali-
nese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi, nese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi,
Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Alba- Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Alba-
nian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, nian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot,
Cyrillic, Deseret, Devanagari, Dogra, Duployan, Egyptian_Hieroglyphs, Cyrillic, Deseret, Devanagari, Dogra, Duployan, Egyptian_Hieroglyphs,
Elbasan, Ethiopic, Georgian, Glagolitic, Gothic, Grantha, Greek, Elbasan, Ethiopic, Georgian, Glagolitic, Gothic, Grantha, Greek,
Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul, Hanifi_Rohingya, Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul, Hanifi_Rohingya,
Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kan- Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kan-
nada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao, nada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Maha- Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Maha-
jani, Makasar, Malayalam, Mandaic, Manichaean, Marchen, Masaram_Gondi, jani, Makasar, Malayalam, Mandaic, Manichaean, Marchen, Masaram_Gondi,
Medefaidrin, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive, Medefaidrin, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive,
Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Multani, Myanmar, Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Multani, Myanmar,
Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki, Old_Hungar- Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki, Old_Hungar-
ian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_Sog- ian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_Sog-
dian, Old_South_Arabian, Old_Turkic, Oriya, Osage, Osmanya, dian, Old_South_Arabian, Old_Turkic, Oriya, Osage, Osmanya,
Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician, Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician,
Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha- Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha-
vian, Siddham, SignWriting, Sinhala, Sogdian, Sora_Sompeng, Soyombo, vian, Siddham, SignWriting, Sinhala, Sogdian, Sora_Sompeng, Soyombo,
Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan, Tifi- Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan, Tifi-
nagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi, Zanabazar_Square. nagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi, Zanabazar_Square.
@ -10034,8 +10056,8 @@ CHARACTER CLASSES
word same as \w word same as \w
xdigit hexadecimal digit xdigit hexadecimal digit
In PCRE2, POSIX character set names recognize only ASCII characters by In PCRE2, POSIX character set names recognize only ASCII characters by
default, but some of them use Unicode properties if PCRE2_UCP is set. default, but some of them use Unicode properties if PCRE2_UCP is set.
You can use \Q...\E inside a character class. You can use \Q...\E inside a character class.
@ -10121,8 +10143,8 @@ OPTION SETTING
(?xx) as (?x) but also ignore space and tab in classes (?xx) as (?x) but also ignore space and tab in classes
(?-...) unset option(s) (?-...) unset option(s)
The following are recognized only at the very start of a pattern or The following are recognized only at the very start of a pattern or
after one of the newline or \R options with similar syntax. More than after one of the newline or \R options with similar syntax. More than
one of them may appear. For the first three, d is a decimal number. one of them may appear. For the first three, d is a decimal number.
(*LIMIT_DEPTH=d) set the backtracking limit to d (*LIMIT_DEPTH=d) set the backtracking limit to d
@ -10137,17 +10159,17 @@ OPTION SETTING
(*UTF) set appropriate UTF mode for the library in use (*UTF) set appropriate UTF mode for the library in use
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc) (*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the
value of the limits set by the caller of pcre2_match() or value of the limits set by the caller of pcre2_match() or
pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF) synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
respectively, at compile time. respectively, at compile time.
NEWLINE CONVENTION NEWLINE CONVENTION
These are recognized only at the very start of the pattern or after These are recognized only at the very start of the pattern or after
option settings with a similar syntax. option settings with a similar syntax.
(*CR) carriage return only (*CR) carriage return only
@ -10160,7 +10182,7 @@ NEWLINE CONVENTION
WHAT \R MATCHES WHAT \R MATCHES
These are recognized only at the very start of the pattern or after These are recognized only at the very start of the pattern or after
option setting with a similar syntax. option setting with a similar syntax.
(*BSR_ANYCRLF) CR, LF, or CRLF (*BSR_ANYCRLF) CR, LF, or CRLF
@ -10229,16 +10251,16 @@ CONDITIONAL PATTERNS
(?(VERSION[>]=n.m) test PCRE2 version (?(VERSION[>]=n.m) test PCRE2 version
(?(assert) assertion condition (?(assert) assertion condition
Note the ambiguity of (?(R) and (?(Rn) which might be named reference Note the ambiguity of (?(R) and (?(Rn) which might be named reference
conditions or recursion tests. Such a condition is interpreted as a conditions or recursion tests. Such a condition is interpreted as a
reference condition if the relevant named group exists. reference condition if the relevant named group exists.
BACKTRACKING CONTROL BACKTRACKING CONTROL
All backtracking control verbs may be in the form (*VERB:NAME). For All backtracking control verbs may be in the form (*VERB:NAME). For
(*MARK) the name is mandatory, for the others it is optional. (*SKIP) (*MARK) the name is mandatory, for the others it is optional. (*SKIP)
changes its behaviour if :NAME is present. The others just set a name changes its behaviour if :NAME is present. The others just set a name
for passing back to the caller, but this is not a name that (*SKIP) can for passing back to the caller, but this is not a name that (*SKIP) can
see. The following act immediately they are reached: see. The following act immediately they are reached:
@ -10246,7 +10268,7 @@ BACKTRACKING CONTROL
(*FAIL) force backtrack; synonym (*F) (*FAIL) force backtrack; synonym (*F)
(*MARK:NAME) set name to be passed back; synonym (*:NAME) (*MARK:NAME) set name to be passed back; synonym (*:NAME)
The following act only when a subsequent match failure causes a back- The following act only when a subsequent match failure causes a back-
track to reach them. They all force a match failure, but they differ in track to reach them. They all force a match failure, but they differ in
what happens afterwards. Those that advance the start-of-match point do what happens afterwards. Those that advance the start-of-match point do
so only if the pattern is not anchored. so only if the pattern is not anchored.
@ -10258,7 +10280,7 @@ BACKTRACKING CONTROL
(*MARK:NAME); if not found, the (*SKIP) is ignored (*MARK:NAME); if not found, the (*SKIP) is ignored
(*THEN) local failure, backtrack to next alternation (*THEN) local failure, backtrack to next alternation
The effect of one of these verbs in a group called as a subroutine is The effect of one of these verbs in a group called as a subroutine is
confined to the subroutine call. confined to the subroutine call.
@ -10269,14 +10291,14 @@ CALLOUTS
(?C"text") callout with string data (?C"text") callout with string data
The allowed string delimiters are ` ' " ^ % # $ (which are the same for The allowed string delimiters are ` ' " ^ % # $ (which are the same for
the start and the end), and the starting delimiter { matched with the the start and the end), and the starting delimiter { matched with the
ending delimiter }. To encode the ending delimiter within the string, ending delimiter }. To encode the ending delimiter within the string,
double it. double it.
SEE ALSO SEE ALSO
pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
pcre2(3). pcre2(3).
@ -10289,7 +10311,7 @@ AUTHOR
REVISION REVISION
Last updated: 21 July 2018 Last updated: 27 July 2018
Copyright (c) 1997-2018 University of Cambridge. Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------


@ -1,4 +1,4 @@
.TH PCRE2API 3 "02 July 2018" "PCRE2 10.32" .TH PCRE2API 3 "27 July 2018" "PCRE2 10.32"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.sp .sp
@ -1400,7 +1400,8 @@ character, even if newlines are coded as CRLF. Without this option, a dot does
not match when the current position in the subject is at a newline. This option not match when the current position in the subject is at a newline. This option
is equivalent to Perl's /s option, and it can be changed within a pattern by a is equivalent to Perl's /s option, and it can be changed within a pattern by a
(?s) option setting. A negative class such as [^a] always matches newline (?s) option setting. A negative class such as [^a] always matches newline
characters, independent of the setting of this option. characters, and the \eN escape sequence always matches a non-newline character,
independent of the setting of PCRE2_DOTALL.
.sp .sp
PCRE2_DUPNAMES PCRE2_DUPNAMES
.sp .sp
@ -3640,6 +3641,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 02 July 2018 Last updated: 27 July 2018
Copyright (c) 1997-2018 University of Cambridge. Copyright (c) 1997-2018 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "20 July 2018" "PCRE2 10.32" .TH PCRE2PATTERN 3 "27 July 2018" "PCRE2 10.32"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS" .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -218,10 +218,11 @@ is used.
.P .P
The newline convention affects where the circumflex and dollar assertions are The newline convention affects where the circumflex and dollar assertions are
true. It also affects the interpretation of the dot metacharacter when true. It also affects the interpretation of the dot metacharacter when
PCRE2_DOTALL is not set, and the behaviour of \eN. However, it does not affect PCRE2_DOTALL is not set, and the behaviour of \eN when not followed by an
what the \eR escape sequence matches. By default, this is any Unicode newline opening brace. However, it does not affect what the \eR escape sequence
sequence, for Perl compatibility. However, this can be changed; see the next matches. By default, this is any Unicode newline sequence, for Perl
section and the description of \eR in the section entitled compatibility. However, this can be changed; see the next section and the
description of \eR in the section entitled
.\" HTML <a href="#newlineseq"> .\" HTML <a href="#newlineseq">
.\" </a> .\" </a>
"Newline sequences" "Newline sequences"
@ -359,20 +360,26 @@ text editing, it is often easier to use one of the following escape sequences
than the binary character it represents. In an ASCII or Unicode environment, than the binary character it represents. In an ASCII or Unicode environment,
these escapes are as follows: these escapes are as follows:
.sp .sp
\ea alarm, that is, the BEL character (hex 07) \ea alarm, that is, the BEL character (hex 07)
\ecx "control-x", where x is any printable ASCII character \ecx "control-x", where x is any printable ASCII character
\ee escape (hex 1B) \ee escape (hex 1B)
\ef form feed (hex 0C) \ef form feed (hex 0C)
\en linefeed (hex 0A) \en linefeed (hex 0A)
\er carriage return (hex 0D) \er carriage return (hex 0D)
\et tab (hex 09) \et tab (hex 09)
\e0dd character with octal code 0dd \e0dd character with octal code 0dd
\eddd character with octal code ddd, or backreference \eddd character with octal code ddd, or backreference
\eo{ddd..} character with octal code ddd.. \eo{ddd..} character with octal code ddd..
\exhh character with hex code hh \exhh character with hex code hh
\ex{hhh..} character with hex code hhh.. (default mode) \ex{hhh..} character with hex code hhh.. (default mode)
\euhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set) \eN{U+hhh..} character with Unicode code point hhh..
\euhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
.sp .sp
Note that when \eN is not followed by an opening brace (curly bracket) it has
an entirely different meaning, matching any character that is not a newline.
Perl also uses \eN{name} to specify characters by Unicode name; PCRE2 does not
support this.
.P
The precise effect of \ecx on ASCII characters is as follows: if x is a lower The precise effect of \ecx on ASCII characters is as follows: if x is a lower
case letter, it is converted to upper case. Then bit 6 of the character (hex case letter, it is converted to upper case. Then bit 6 of the character (hex
40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A), 40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
@ -380,14 +387,14 @@ but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the
code unit following \ec has a value less than 32 or greater than 126, a code unit following \ec has a value less than 32 or greater than 126, a
compile-time error occurs. compile-time error occurs.
.P .P
When PCRE2 is compiled in EBCDIC mode, \ea, \ee, \ef, \en, \er, and \et When PCRE2 is compiled in EBCDIC mode, \eN{U+hhh..} is not supported. \ea, \ee,
generate the appropriate EBCDIC code values. The \ec escape is processed \ef, \en, \er, and \et generate the appropriate EBCDIC code values. The \ec
as specified for Perl in the \fBperlebcdic\fP document. The only characters escape is processed as specified for Perl in the \fBperlebcdic\fP document. The
that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ], ^, _, or ?. Any only characters that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ],
other character provokes a compile-time error. The sequence \ec@ encodes ^, _, or ?. Any other character provokes a compile-time error. The sequence
character code 0; after \ec the letters (in either case) encode characters 1-26 \ec@ encodes character code 0; after \ec the letters (in either case) encode
(hex 01 to hex 1A); [, \e, ], ^, and _ encode characters 27-31 (hex 1B to hex characters 1-26 (hex 01 to hex 1A); [, \e, ], ^, and _ encode characters 27-31
1F), and \ec? becomes either 255 (hex FF) or 95 (hex 5F). (hex 1B to hex 1F), and \ec? becomes either 255 (hex FF) or 95 (hex 5F).
.P .P
Thus, apart from \ec?, these escapes generate the same character code values as Thus, apart from \ec?, these escapes generate the same character code values as
they do in an ASCII environment, though the meanings of the values mostly they do in an ASCII environment, though the meanings of the values mostly
@ -414,9 +421,9 @@ numbers greater than 0777, and it also allows octal numbers and backreferences
to be unambiguously specified. to be unambiguously specified.
.P .P
For greater clarity and unambiguity, it is best to avoid following \e by a For greater clarity and unambiguity, it is best to avoid following \e by a
digit greater than zero. Instead, use \eo{} or \ex{} to specify character digit greater than zero. Instead, use \eo{} or \ex{} to specify numerical
numbers, and \eg{} to specify backreferences. The following paragraphs character code points, and \eg{} to specify backreferences. The following
describe the old, ambiguous syntax. paragraphs describe the old, ambiguous syntax.
.P .P
The handling of a backslash followed by a digit other than 0 is complicated, The handling of a backslash followed by a digit other than 0 is complicated,
and Perl has changed over time, causing PCRE2 also to change. and Perl has changed over time, causing PCRE2 also to change.
@ -507,10 +514,10 @@ All the sequences that define a single character value can be used both inside
and outside character classes. In addition, inside a character class, \eb is and outside character classes. In addition, inside a character class, \eb is
interpreted as the backspace character (hex 08). interpreted as the backspace character (hex 08).
.P .P
\eN is not allowed in a character class. \eB, \eR, and \eX are not special When not followed by an opening brace, \eN is not allowed in a character class.
inside a character class. Like other unrecognized alphabetic escape sequences, \eB, \eR, and \eX are not special inside a character class. Like other
they cause an error. Outside a character class, these sequences have different unrecognized alphabetic escape sequences, they cause an error. Outside a
meanings. character class, these sequences have different meanings.
. .
. .
.SS "Unsupported escape sequences" .SS "Unsupported escape sequences"
@ -569,6 +576,7 @@ Another use of backslash is for specifying generic character types:
\eD any character that is not a decimal digit \eD any character that is not a decimal digit
\eh any horizontal white space character \eh any horizontal white space character
\eH any character that is not a horizontal white space character \eH any character that is not a horizontal white space character
\eN any character that is not a newline
\es any white space character \es any white space character
\eS any character that is not a white space character \eS any character that is not a white space character
\ev any vertical white space character \ev any vertical white space character
@ -576,14 +584,20 @@ Another use of backslash is for specifying generic character types:
\ew any "word" character \ew any "word" character
\eW any "non-word" character \eW any "non-word" character
.sp .sp
There is also the single sequence \eN, which matches a non-newline character. The \eN escape sequence has the same meaning as
This is the same as
.\" HTML <a href="#fullstopdot"> .\" HTML <a href="#fullstopdot">
.\" </a> .\" </a>
the "." metacharacter the "." metacharacter
.\" .\"
when PCRE2_DOTALL is not set. Perl also uses \eN to match characters by name; when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the
PCRE2 does not support this. meaning of \eN. Note that when \eN is followed by an opening brace it has a
different meaning. See the section entitled
.\" HTML <a href="#digitsafterbackslash">
.\" </a>
"Non-printing characters"
.\"
above for details. Perl also uses \eN{name} to specify characters by Unicode
name; PCRE2 does not support this.
.P .P
Each pair of lower and upper case escape sequences partitions the complete set Each pair of lower and upper case escape sequences partitions the complete set
of characters into two disjoint sets. Any given character matches one, and only of characters into two disjoint sets. Any given character matches one, and only
@ -1289,9 +1303,17 @@ The handling of dot is entirely independent of the handling of circumflex and
dollar, the only relationship being that they both involve newlines. Dot has no dollar, the only relationship being that they both involve newlines. Dot has no
special meaning in a character class. special meaning in a character class.
.P .P
The escape sequence \eN behaves like a dot, except that it is not affected by The escape sequence \eN when not followed by an opening brace behaves like a
the PCRE2_DOTALL option. In other words, it matches any character except one dot, except that it is not affected by the PCRE2_DOTALL option. In other words,
that signifies the end of a line. Perl also uses \eN to match characters by it matches any character except one that signifies the end of a line.
.P
When \eN is followed by an opening brace it has a different meaning. See the
section entitled
.\" HTML <a href="#digitsafterbackslash">
.\" </a>
"Non-printing characters"
.\"
above for details. Perl also uses \eN{name} to specify characters by Unicode
name; PCRE2 does not support this. name; PCRE2 does not support this.
. .
. .
@ -1380,30 +1402,32 @@ circumflex is not an assertion; it still consumes a character from the subject
string, and therefore it fails if the current pointer is at the end of the string, and therefore it fails if the current pointer is at the end of the
string. string.
.P .P
When caseless matching is set, any letters in a class represent both their Characters in a class may be specified by their code points using \eo, \ex, or
upper case and lower case versions, so for example, a caseless [aeiou] matches \eN{U+hh..} in the usual way. When caseless matching is set, any letters in a
"A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a class represent both their upper case and lower case versions, so for example,
caseful version would. a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would.
.P .P
Characters that might indicate line breaks are never treated in any special way Characters that might indicate line breaks are never treated in any special way
when matching character classes, whatever line-ending sequence is in use, and when matching character classes, whatever line-ending sequence is in use, and
whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
class such as [^a] always matches one of these characters. class such as [^a] always matches one of these characters.
.P .P
The character escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es, \eS, \ev, The generic character type escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es,
\eV, \ew, and \eW may appear in a character class, and add the characters that \eS, \ev, \eV, \ew, and \eW may appear in a character class, and add the
they match to the class. For example, [\edABCDEF] matches any hexadecimal characters that they match to the class. For example, [\edABCDEF] matches any
digit. In UTF modes, the PCRE2_UCP option affects the meanings of \ed, \es, \ew hexadecimal digit. In UTF modes, the PCRE2_UCP option affects the meanings of
and their upper case partners, just as it does when they appear outside a \ed, \es, \ew and their upper case partners, just as it does when they appear
character class, as described in the section entitled outside a character class, as described in the section entitled
.\" HTML <a href="#genericchartypes"> .\" HTML <a href="#genericchartypes">
.\" </a> .\" </a>
"Generic character types" "Generic character types"
.\" .\"
above. The escape sequence \eb has a different meaning inside a character above. The escape sequence \eb has a different meaning inside a character
class; it matches the backspace character. The sequences \eB, \eN, \eR, and \eX class; it matches the backspace character. The sequences \eB, \eR, and \eX are
are not special inside a character class. Like any other unrecognized escape not special inside a character class. Like any other unrecognized escape
sequences, they cause an error. sequences, they cause an error. The same is true for \eN when not followed by
an opening brace.
.P .P
The minus (hyphen) character can be used to specify a range of characters in a The minus (hyphen) character can be used to specify a range of characters in a
character class. For example, [d-m] matches any letter between d and m, character class. For example, [d-m] matches any letter between d and m,
@ -3580,6 +3604,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 20 July 2018 Last updated: 27 July 2018
Copyright (c) 1997-2018 University of Cambridge. Copyright (c) 1997-2018 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "21 July 2018" "PCRE2 10.32" .TH PCRE2SYNTAX 3 "27 July 2018" "PCRE2 10.32"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -35,9 +35,10 @@ This table applies to ASCII and Unicode environments.
\eddd character with octal code ddd, or backreference \eddd character with octal code ddd, or backreference
\eo{ddd..} character with octal code ddd.. \eo{ddd..} character with octal code ddd..
\eU "U" if PCRE2_ALT_BSUX is set (otherwise is an error) \eU "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
\eN{U+hh..} character with Unicode code point hh..
\euhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set) \euhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
\exhh character with hex code hh \exhh character with hex code hh
\ex{hhh..} character with hex code hhh.. \ex{hh..} character with hex code hh..
.sp .sp
Note that \e0dd is always an octal code. The treatment of backslash followed by Note that \e0dd is always an octal code. The treatment of backslash followed by
a non-zero digit is complicated; for details see the section a non-zero digit is complicated; for details see the section
@ -50,7 +51,9 @@ in the
\fBpcre2pattern\fP \fBpcre2pattern\fP
.\" .\"
documentation, where details of escape processing in EBCDIC environments are documentation, where details of escape processing in EBCDIC environments are
also given. also given. \eN{U+hh..} is synonymous with \ex{hh..} in PCRE2 but is not
supported in EBCDIC environments. Note that \eN not followed by an opening
curly bracket has a different meaning (see below).
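The distinction drawn here between \N{U+hh..}, a quantified \N such as \N{2,3}, and the unsupported \N{name} form can be illustrated with a small stand-alone classifier. This is a rough sketch, not the PCRE2 implementation: the enum, the function names, and the NUL-terminated input are all assumptions made for the example, and neither code point range checking nor EBCDIC handling is modelled.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical result codes for this sketch only. */
enum n_kind {
  N_PLAIN,            /* \N not followed by '{': "not newline" */
  N_CODEPOINT,        /* \N{U+hh..}: a literal code point */
  N_QUANTIFIED,       /* \N{n}, \N{n,} or \N{n,m}: quantified "not newline" */
  N_ERR_NO_DIGITS,    /* \N{U+} with no hex digits (compare error 178) */
  N_ERR_BAD_HEX,      /* non-hex character or missing closing brace */
  N_ERR_UNSUPPORTED   /* anything else, e.g. \N{name} (compare error 137) */
};

/* Returns 1 if p (pointing at '{') starts {n}, {n,} or {n,m}. */
static int is_quantifier(const char *p)
{
  p++;
  if (!isdigit((unsigned char)*p)) return 0;
  while (isdigit((unsigned char)*p)) p++;
  if (*p == '}') return 1;
  if (*p != ',') return 0;
  p++;
  while (isdigit((unsigned char)*p)) p++;
  return *p == '}';
}

/* s points at the text immediately after "\N" in a NUL-terminated pattern.
   On N_CODEPOINT the value is stored in *cp. */
enum n_kind classify_backslash_N(const char *s, unsigned long *cp)
{
  if (s[0] != '{') return N_PLAIN;
  if (s[1] == 'U' && s[2] == '+')
  {
    const char *p = s + 3;
    unsigned long v = 0;
    if (*p == '}' || *p == '\0') return N_ERR_NO_DIGITS;
    while (isxdigit((unsigned char)*p))
    {
      int c = toupper((unsigned char)*p++);
      v = v * 16 + (unsigned long)(isdigit(c) ? c - '0' : c - 'A' + 10);
    }
    if (*p != '}') return N_ERR_BAD_HEX;
    *cp = v;
    return N_CODEPOINT;
  }
  return is_quantifier(s) ? N_QUANTIFIED : N_ERR_UNSUPPORTED;
}
```

In this sketch "{U+1234}" yields the same code point that \x{1234} would specify, which is the sense in which the two forms are synonymous in ASCII and Unicode environments.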
.P .P
When \ex is not followed by {, from zero to two hexadecimal digits are read, When \ex is not followed by {, from zero to two hexadecimal digits are read,
but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
@ -609,6 +612,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 21 July 2018 Last updated: 27 July 2018
Copyright (c) 1997-2018 University of Cambridge. Copyright (c) 1997-2018 University of Cambridge.
.fi .fi


@ -316,6 +316,7 @@ pcre2_pattern_convert(). */
#define PCRE2_ERROR_INTERNAL_BAD_CODE_IN_SKIP 190 #define PCRE2_ERROR_INTERNAL_BAD_CODE_IN_SKIP 190
#define PCRE2_ERROR_NO_SURROGATES_IN_UTF16 191 #define PCRE2_ERROR_NO_SURROGATES_IN_UTF16 191
#define PCRE2_ERROR_BAD_LITERAL_OPTIONS 192 #define PCRE2_ERROR_BAD_LITERAL_OPTIONS 192
#define PCRE2_ERROR_NOT_SUPPORTED_IN_EBCDIC 193
/* "Expected" matching error codes: no match and partial match. */ /* "Expected" matching error codes: no match and partial match. */


@ -731,7 +731,7 @@ enum { ERR0 = COMPILE_ERROR_BASE,
ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69, ERR70, ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69, ERR70,
ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79, ERR80, ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79, ERR80,
ERR81, ERR82, ERR83, ERR84, ERR85, ERR86, ERR87, ERR88, ERR89, ERR90, ERR81, ERR82, ERR83, ERR84, ERR85, ERR86, ERR87, ERR88, ERR89, ERR90,
ERR91, ERR92}; ERR91, ERR92, ERR93 };
/* This is a table of start-of-pattern options such as (*UTF) and settings such /* This is a table of start-of-pattern options such as (*UTF) and settings such
as (*LIMIT_MATCH=nnnn) and (*CRLF). For completeness and backward as (*LIMIT_MATCH=nnnn) and (*CRLF). For completeness and backward
@ -1441,6 +1441,42 @@ else if ((i = escapes[c - ESCAPES_FIRST]) != 0)
escape = -i; /* Else return a special escape */ escape = -i; /* Else return a special escape */
if (cb != NULL && (escape == ESC_P || escape == ESC_p || escape == ESC_X)) if (cb != NULL && (escape == ESC_P || escape == ESC_p || escape == ESC_X))
cb->external_flags |= PCRE2_HASBKPORX; /* Note \P, \p, or \X */ cb->external_flags |= PCRE2_HASBKPORX; /* Note \P, \p, or \X */
/* Perl supports \N{name} for character names and \N{U+dddd} for numerical
Unicode code points, as well as plain \N for "not newline". PCRE does not
support \N{name}. However, it does support quantification such as \N{2,3},
so if \N{ is not followed by U+dddd we check for a quantifier. */
if (escape == ESC_N && ptr < ptrend && *ptr == CHAR_LEFT_CURLY_BRACKET)
{
PCRE2_SPTR p = ptr + 1;
/* \N{U+ can be handled by the \x{ code. However, this construction is
not valid in EBCDIC environments because it specifies a Unicode
character, not a codepoint in the local code. For example \N{U+0041}
must be "A" in all environments. */
if (ptrend - p > 1 && *p == CHAR_U && p[1] == CHAR_PLUS)
{
#ifdef EBCDIC
*errorcodeptr = ERR93;
#else
ptr = p + 1;
escape = 0; /* Not a fancy escape after all */
goto COME_FROM_NU;
#endif
}
/* Give an error if what follows is not a quantifier, but don't override
an error set by the quantifier reader (e.g. number overflow). */
else
{
if (!read_repeat_counts(&p, ptrend, NULL, NULL, errorcodeptr) &&
*errorcodeptr == 0)
*errorcodeptr = ERR37;
}
}
} }
} }
@ -1725,6 +1761,9 @@ else
{ {
if (ptr < ptrend && *ptr == CHAR_LEFT_CURLY_BRACKET) if (ptr < ptrend && *ptr == CHAR_LEFT_CURLY_BRACKET)
{ {
#ifndef EBCDIC
COME_FROM_NU:
#endif
if (++ptr >= ptrend || *ptr == CHAR_RIGHT_CURLY_BRACKET) if (++ptr >= ptrend || *ptr == CHAR_RIGHT_CURLY_BRACKET)
{ {
*errorcodeptr = ERR78; *errorcodeptr = ERR78;
@ -1858,19 +1897,6 @@ else
} }
} }
/* Perl supports \N{name} for character names, as well as plain \N for "not
newline". PCRE does not support \N{name}. However, it does support
quantification such as \N{2,3}. */
if (escape == ESC_N && ptr < ptrend && *ptr == CHAR_LEFT_CURLY_BRACKET &&
ptrend - ptr > 2)
{
PCRE2_SPTR p = ptr + 1;
if (!read_repeat_counts(&p, ptrend, NULL, NULL, errorcodeptr) &&
*errorcodeptr == 0)
*errorcodeptr = ERR37;
}
/* Set the pointer to the next character before returning. */ /* Set the pointer to the next character before returning. */
*ptrptr = ptr; *ptrptr = ptr;
@ -3223,7 +3249,6 @@ while (ptr < ptrend)
tempptr = ptr; tempptr = ptr;
escape = PRIV(check_escape)(&ptr, ptrend, &c, &errorcode, escape = PRIV(check_escape)(&ptr, ptrend, &c, &errorcode,
options, TRUE, cb); options, TRUE, cb);
if (errorcode != 0) if (errorcode != 0)
{ {
CLASS_ESCAPE_FAILED: CLASS_ESCAPE_FAILED:


@ -161,7 +161,7 @@ static const unsigned char compile_error_texts[] =
"using UCP is disabled by the application\0" "using UCP is disabled by the application\0"
"name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)\0" "name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)\0"
"character code point value in \\u.... sequence is too large\0" "character code point value in \\u.... sequence is too large\0"
"digits missing in \\x{} or \\o{}\0" "digits missing in \\x{} or \\o{} or \\N{U+}\0"
"syntax error or number too big in (?(VERSION condition\0" "syntax error or number too big in (?(VERSION condition\0"
/* 80 */ /* 80 */
"internal error: unknown opcode in auto_possessify()\0" "internal error: unknown opcode in auto_possessify()\0"
@ -179,6 +179,7 @@ static const unsigned char compile_error_texts[] =
"internal error: bad code value in parsed_skip()\0" "internal error: bad code value in parsed_skip()\0"
"PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is not allowed in UTF-16 mode\0" "PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is not allowed in UTF-16 mode\0"
"invalid option bits with PCRE2_LITERAL\0" "invalid option bits with PCRE2_LITERAL\0"
"\\N{U+dddd} is not supported in EBCDIC mode\0"
; ;
/* Match-time and UTF error texts are in the same format. */ /* Match-time and UTF error texts are in the same format. */

testdata/testinput4 vendored

@ -2288,4 +2288,10 @@
\= Expect no match \= Expect no match
\x{123}\x{124}\x{123} \x{123}\x{124}\x{123}
/\N{U+1234}/utf
\x{1234}
/[\N{U+1234}]/utf
\x{1234}
# End of testinput4 # End of testinput4

testdata/testinput5 vendored

@ -2087,4 +2087,8 @@
\x{655} \x{655}
\x{1D1AA} \x{1D1AA}
/\N{U+}/
/\N{U}/
# End of testinput5 # End of testinput5


@ -13194,7 +13194,7 @@ Failed: error 167 at offset 5: non-hex character in \x{} (closing brace missing?
Failed: error 167 at offset 7: non-hex character in \x{} (closing brace missing?) Failed: error 167 at offset 7: non-hex character in \x{} (closing brace missing?)
/^A\x{/ /^A\x{/
Failed: error 178 at offset 5: digits missing in \x{} or \o{} Failed: error 178 at offset 5: digits missing in \x{} or \o{} or \N{U+}
/[ab]++/B,no_auto_possess /[ab]++/B,no_auto_possess
------------------------------------------------------------------ ------------------------------------------------------------------
@ -13408,7 +13408,7 @@ Failed: error 133 at offset 7: parentheses are too deeply nested (stack check)
Failed: error 155 at offset 2: missing opening brace after \o Failed: error 155 at offset 2: missing opening brace after \o
/\o{}/ /\o{}/
Failed: error 178 at offset 3: digits missing in \x{} or \o{} Failed: error 178 at offset 3: digits missing in \x{} or \o{} or \N{U+}
/\o{whatever}/ /\o{whatever}/
Failed: error 164 at offset 3: non-octal character in \o{} (closing brace missing?) Failed: error 164 at offset 3: non-octal character in \o{} (closing brace missing?)
@ -13416,7 +13416,7 @@ Failed: error 164 at offset 3: non-octal character in \o{} (closing brace missin
/\xthing/ /\xthing/
/\x{}/ /\x{}/
Failed: error 178 at offset 3: digits missing in \x{} or \o{} Failed: error 178 at offset 3: digits missing in \x{} or \o{} or \N{U+}
/\x{whatever}/ /\x{whatever}/
Failed: error 167 at offset 3: non-hex character in \x{} (closing brace missing?) Failed: error 167 at offset 3: non-hex character in \x{} (closing brace missing?)


@ -3704,4 +3704,12 @@ No match
\x{123}\x{124}\x{123} \x{123}\x{124}\x{123}
No match No match
/\N{U+1234}/utf
\x{1234}
0: \x{1234}
/[\N{U+1234}]/utf
\x{1234}
0: \x{1234}
# End of testinput4 # End of testinput4


@ -4750,4 +4750,10 @@ No match
\x{1D1AA} \x{1D1AA}
0: \x{1d1aa} 0: \x{1d1aa}
/\N{U+}/
Failed: error 178 at offset 5: digits missing in \x{} or \o{} or \N{U+}
/\N{U}/
Failed: error 137 at offset 2: PCRE does not support \L, \l, \N{name}, \U, or \u
# End of testinput5 # End of testinput5


@ -1,3 +1,4 @@
PCRE2 version 10.32-RC1 2018-02-19
# This is a specialized test for checking, when PCRE2 is compiled with the # This is a specialized test for checking, when PCRE2 is compiled with the
# EBCDIC option but in an ASCII environment, that newline, white space, and \c # EBCDIC option but in an ASCII environment, that newline, white space, and \c
# functionality is working. It catches cases where explicit values such as 0x0a # functionality is working. It catches cases where explicit values such as 0x0a
@ -200,6 +201,6 @@ No match
0: \xff 0: \xff
/\ƒ&/ /\ƒ&/
Failed: error 168 at offset 2: \c\x20must\x20be\x20followed\x20by\x20a\x20letter\x20or\x20one\x20of\x20[\]^_\x3f Failed: error 168 at offset 3: \c\x20must\x20be\x20followed\x20by\x20a\x20letter\x20or\x20one\x20of\x20[\]^_\x3f
# End # End