Add support for \N{U+dd...}, for ASCII and Unicode modes only.

2018-07-27 16:30:40 +00:00 · 2018-07-27 16:30:40 +00:00 · e9aa3c0a21
parent 775481293a
commit e9aa3c0a21
16 changed files with 449 additions and 322 deletions
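The documentation changes in this commit describe \N{U+hh..} as synonymous with \x{hh..}. As a rough illustration only (this is not how PCRE2 implements the escape), the Python sketch below rewrites one spelling into the other. The helper name `n_escape_to_x_escape` is hypothetical, and the sketch assumes no \Q...\E quoting and does no range checking on the code point.

```python
import re

# Matches either a \N{U+hh..} escape (group 1 = the hex digits) or any other
# backslash-escaped character. Consuming other escapes as pairs means an
# escaped backslash is never mistaken for the start of \N{U+...}.
_ESCAPE = re.compile(r"\\N\{U\+([0-9A-Fa-f]+)\}|\\.", re.DOTALL)


def n_escape_to_x_escape(pattern):
    # Hypothetical helper for illustration; not part of PCRE2.
    def rewrite(match):
        hex_digits = match.group(1)
        if hex_digits is None:
            return match.group(0)  # some other escape: leave it untouched
        return "\\x{" + hex_digits + "}"

    return _ESCAPE.sub(rewrite, pattern)


if __name__ == "__main__":
    print(n_escape_to_x_escape(r"[\N{U+100}-\N{U+17f}]"))
```

A bare \N (the non-newline escape) and an escaped backslash followed by N{U+..} are both left unchanged, which mirrors the distinction the documentation draws between \N with and without an opening brace.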
--- a/2
+++ b/2
@@ -130,6 +130,8 @@ present.
 28. A (*MARK) name was not being passed back for positive assertions that were 
 terminated by (*ACCEPT).
+29. Add support for \N{U+dddd}, but not in EBCDIC environments.
 Version 10.31 12-February-2018
 ------------------------------
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@@ -249,10 +249,11 @@ is used.
 <P>
 The newline convention affects where the circumflex and dollar assertions are
 true. It also affects the interpretation of the dot metacharacter when
-PCRE2_DOTALL is not set, and the behaviour of \N. However, it does not affect
+PCRE2_DOTALL is not set, and the behaviour of \N when not followed by an 
-what the \R escape sequence matches. By default, this is any Unicode newline
+opening brace. However, it does not affect what the \R escape sequence
-sequence, for Perl compatibility. However, this can be changed; see the next
+matches. By default, this is any Unicode newline sequence, for Perl
-section and the description of \R in the section entitled
+compatibility. However, this can be changed; see the next section and the
+description of \R in the section entitled
 <a href="#newlineseq">"Newline sequences"</a>
 below. A change of \R setting can be combined with a change of newline
 convention.
@@ -394,8 +395,15 @@ these escapes are as follows:
  \o{ddd..}   character with octal code ddd..
  \xhh        character with hex code hh
  \x{hhh..}   character with hex code hhh.. (default mode)
+  \N{U+hhh..} character with Unicode code point hhh..
  \uhhhh      character with hex code hhhh (when PCRE2_ALT_BSUX is set)
 </pre>
+Note that when \N is not followed by an opening brace (curly bracket) it has
+an entirely different meaning, matching any character that is not a newline.
+Perl also uses \N{name} to specify characters by Unicode name; PCRE2 does not
+support this.
+</P>
+<P>
 The precise effect of \cx on ASCII characters is as follows: if x is a lower
 case letter, it is converted to upper case. Then bit 6 of the character (hex
 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A),
@@ -404,14 +412,14 @@ code unit following \c has a value less than 32 or greater than 126, a
 compile-time error occurs.
 </P>
 <P>
-When PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t
+When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported. \a, \e,
-generate the appropriate EBCDIC code values. The \c escape is processed
+\f, \n, \r, and \t generate the appropriate EBCDIC code values. The \c
-as specified for Perl in the <b>perlebcdic</b> document. The only characters
+escape is processed as specified for Perl in the <b>perlebcdic</b> document. The
-that are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?. Any
+only characters that are allowed after \c are A-Z, a-z, or one of @, [, \, ],
-other character provokes a compile-time error. The sequence \c@ encodes
+^, _, or ?. Any other character provokes a compile-time error. The sequence
-character code 0; after \c the letters (in either case) encode characters 1-26
+\c@ encodes character code 0; after \c the letters (in either case) encode
-(hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex
+characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31
-1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F).
+(hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F).
 </P>
 <P>
 Thus, apart from \c?, these escapes generate the same character code values as
@@ -443,9 +451,9 @@ to be unambiguously specified.
 </P>
 <P>
 For greater clarity and unambiguity, it is best to avoid following \ by a
-digit greater than zero. Instead, use \o{} or \x{} to specify character
+digit greater than zero. Instead, use \o{} or \x{} to specify numerical
-numbers, and \g{} to specify backreferences. The following paragraphs
+character code points, and \g{} to specify backreferences. The following
-describe the old, ambiguous syntax.
+paragraphs describe the old, ambiguous syntax.
 </P>
 <P>
 The handling of a backslash followed by a digit other than 0 is complicated,
@@ -528,10 +536,10 @@ and outside character classes. In addition, inside a character class, \b is
 interpreted as the backspace character (hex 08).
 </P>
 <P>
-\N is not allowed in a character class. \B, \R, and \X are not special
+When not followed by an opening brace, \N is not allowed in a character class.
-inside a character class. Like other unrecognized alphabetic escape sequences,
+\B, \R, and \X are not special inside a character class. Like other
-they cause an error. Outside a character class, these sequences have different
+unrecognized alphabetic escape sequences, they cause an error. Outside a
-meanings.
+character class, these sequences have different meanings.
 </P>
 <br><b>
 Unsupported escape sequences
@@ -577,6 +585,7 @@ Another use of backslash is for specifying generic character types:
  \D     any character that is not a decimal digit
  \h     any horizontal white space character
  \H     any character that is not a horizontal white space character
+  \N     any character that is not a newline
  \s     any white space character
  \S     any character that is not a white space character
  \v     any vertical white space character
@@ -584,11 +593,14 @@ Another use of backslash is for specifying generic character types:
  \w     any "word" character
  \W     any "non-word" character
 </pre>
-There is also the single sequence \N, which matches a non-newline character.
+The \N escape sequence has the same meaning as
-This is the same as
 <a href="#fullstopdot">the "." metacharacter</a>
-when PCRE2_DOTALL is not set. Perl also uses \N to match characters by name;
+when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the 
-PCRE2 does not support this.
+meaning of \N. Note that when \N is followed by an opening brace it has a 
+different meaning. See the section entitled
+<a href="#digitsafterbackslash">"Non-printing characters"</a>
+above for details. Perl also uses \N{name} to specify characters by Unicode
+name; PCRE2 does not support this.
 </P>
 <P>
 Each pair of lower and upper case escape sequences partitions the complete set
@@ -1297,9 +1309,15 @@ dollar, the only relationship being that they both involve newlines. Dot has no
 special meaning in a character class.
 </P>
 <P>
-The escape sequence \N behaves like a dot, except that it is not affected by
+The escape sequence \N when not followed by an opening brace behaves like a
-the PCRE2_DOTALL option. In other words, it matches any character except one
+dot, except that it is not affected by the PCRE2_DOTALL option. In other words,
-that signifies the end of a line. Perl also uses \N to match characters by
+it matches any character except one that signifies the end of a line.
+</P>
+<P>
+When \N is followed by an opening brace it has a different meaning. See the
+section entitled
+<a href="digitsafterbackslash">"Non-printing characters"</a>
+above for details. Perl also uses \N{name} to specify characters by Unicode
 name; PCRE2 does not support this.
 </P>
 <br><a name="SEC8" href="#TOC1">MATCHING A SINGLE CODE UNIT</a><br>
@@ -1385,10 +1403,11 @@ string, and therefore it fails if the current pointer is at the end of the
 string.
 </P>
 <P>
-When caseless matching is set, any letters in a class represent both their
+Characters in a class may be specified by their code points using \o, \x, or
-upper case and lower case versions, so for example, a caseless [aeiou] matches
+\N{U+hh..} in the usual way. When caseless matching is set, any letters in a
-"A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a
+class represent both their upper case and lower case versions, so for example,
-caseful version would.
+a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
+match "A", whereas a caseful version would.
 </P>
 <P>
 Characters that might indicate line breaks are never treated in any special way
@@ -1397,17 +1416,18 @@ whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
 class such as [^a] always matches one of these characters.
 </P>
 <P>
-The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,
+The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
-\V, \w, and \W may appear in a character class, and add the characters that
+\S, \v, \V, \w, and \W may appear in a character class, and add the
-they match to the class. For example, [\dABCDEF] matches any hexadecimal
+characters that they match to the class. For example, [\dABCDEF] matches any
-digit. In UTF modes, the PCRE2_UCP option affects the meanings of \d, \s, \w
+hexadecimal digit. In UTF modes, the PCRE2_UCP option affects the meanings of
-and their upper case partners, just as it does when they appear outside a
+\d, \s, \w and their upper case partners, just as it does when they appear
-character class, as described in the section entitled
+outside a character class, as described in the section entitled
 <a href="#genericchartypes">"Generic character types"</a>
 above. The escape sequence \b has a different meaning inside a character
-class; it matches the backspace character. The sequences \B, \N, \R, and \X
+class; it matches the backspace character. The sequences \B, \R, and \X are
-are not special inside a character class. Like any other unrecognized escape
+not special inside a character class. Like any other unrecognized escape
-sequences, they cause an error.
+sequences, they cause an error. The same is true for \N when not followed by
+an opening brace.
 </P>
 <P>
 The minus (hyphen) character can be used to specify a range of characters in a
@@ -3559,7 +3579,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC30" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 20 July 2018
+Last updated: 27 July 2018
 <br>
 Copyright &copy; 1997-2018 University of Cambridge.
 <br>
--- a/doc/html/pcre2syntax.html
+++ b/doc/html/pcre2syntax.html
@@ -70,9 +70,10 @@ This table applies to ASCII and Unicode environments.
  \ddd       character with octal code ddd, or backreference
  \o{ddd..}  character with octal code ddd..
  \U         "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
+  \N{U+hh..} character with Unicode code point hh..
  \uhhhh     character with hex code hhhh (if PCRE2_ALT_BSUX is set)
  \xhh       character with hex code hh
-  \x{hhh..}  character with hex code hhh..
+  \x{hh..}   character with hex code hh..
 </pre>
 Note that \0dd is always an octal code. The treatment of backslash followed by
 a non-zero digit is complicated; for details see the section
@@ -80,7 +81,9 @@ a non-zero digit is complicated; for details see the section
 in the
 <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
 documentation, where details of escape processing in EBCDIC environments are
-also given.
+also given. \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not
+supported in EBCDIC environments. Note that \N not followed by an opening
+curly bracket has a different meaning (see below).
 </P>
 <P>
 When \x is not followed by {, from zero to two hexadecimal digits are read,
@@ -621,7 +624,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC27" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 21 July 2018
+Last updated: 27 July 2018
 <br>
 Copyright &copy; 1997-2018 University of Cambridge.
 <br>
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@@ -6015,12 +6015,13 @@ SPECIAL START-OF-PATTERN ITEMS
       The newline convention affects where the circumflex and  dollar  asser-
       tions are true. It also affects the interpretation of the dot metachar-
-       acter when PCRE2_DOTALL is not set, and the behaviour of  \N.  However,
+       acter when PCRE2_DOTALL is not set, and the behaviour of  \N  when  not
-       it  does  not  affect  what the \R escape sequence matches. By default,
+       followed  by  an opening brace. However, it does not affect what the \R
-       this is any Unicode newline sequence, for Perl compatibility.  However,
+       escape sequence matches.  By  default,  this  is  any  Unicode  newline
-       this  can be changed; see the next section and the description of \R in
+       sequence, for Perl compatibility. However, this can be changed; see the
-       the section entitled "Newline sequences" below. A change of \R  setting
+       next section and the description of \R in the section entitled "Newline
-       can be combined with a change of newline convention.
+       sequences"  below. A change of \R setting can be combined with a change
+       of newline convention.
   Specifying what \R matches
@@ -6158,8 +6159,14 @@ BACKSLASH
         \o{ddd..}   character with octal code ddd..
         \xhh        character with hex code hh
         \x{hhh..}   character with hex code hhh.. (default mode)
+         \N{U+hhh..} character with Unicode code point hhh..
         \uhhhh      character with hex code hhhh (when PCRE2_ALT_BSUX is set)
+       Note  that  when \N is not followed by an opening brace (curly bracket)
+       it has an entirely different meaning, matching any  character  that  is
+       not  a  newline.  Perl also uses \N{name} to specify characters by Uni-
+       code name; PCRE2 does not support this.
       The precise effect of \cx on ASCII characters is as follows: if x is  a
       lower  case  letter,  it  is converted to upper case. Then bit 6 of the
       character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
@@ -6167,15 +6174,15 @@ BACKSLASH
       hex 7B (; is 3B). If the code unit following \c has a value  less  than
       32 or greater than 126, a compile-time error occurs.
-       When  PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t gen-
+       When  PCRE2  is  compiled in EBCDIC mode, \N{U+hhh..} is not supported.
-       erate the appropriate EBCDIC code values. The \c escape is processed as
+       \a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values.
-       specified for Perl in the perlebcdic document. The only characters that
+       The \c escape is processed as specified for Perl in the perlebcdic doc-
-       are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^,  _,  or  ?.
+       ument. The only characters that are allowed after \c are A-Z,  a-z,  or
-       Any  other  character  provokes  a compile-time error. The sequence \c@
+       one  of @, [, \, ], ^, _, or ?. Any other character provokes a compile-
-       encodes character code 0; after \c the letters (in either case)  encode
+       time error. The sequence \c@ encodes character code  0;  after  \c  the
-       characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters
+       letters  (in either case) encode characters 1-26 (hex 01 to hex 1A); [,
-       27-31 (hex 1B to hex 1F), and \c? becomes either 255  (hex  FF)  or  95
+       \, ], ^, and _ encode characters 27-31 (hex 1B  to  hex  1F),  and  \c?
-       (hex 5F).
+       becomes either 255 (hex FF) or 95 (hex 5F).
       Thus,  apart  from  \c?, these escapes generate the same character code
       values as they do in an ASCII environment, though the meanings  of  the
@@ -6203,9 +6210,9 @@ BACKSLASH
       numbers and backreferences to be unambiguously specified.
       For greater clarity and unambiguity, it is best to avoid following \ by
-       a digit greater than zero. Instead, use \o{} or \x{} to specify charac-
+       a digit greater than zero. Instead, use \o{} or \x{} to specify numeri-
-       ter numbers, and \g{} to specify backreferences.  The  following  para-
+       cal character code points, and \g{} to specify backreferences. The fol-
-       graphs describe the old, ambiguous syntax.
+       lowing paragraphs describe the old, ambiguous syntax.
       The handling of a backslash followed by a digit other than 0 is compli-
       cated, and Perl has changed over time, causing PCRE2 also to change.
@@ -6281,10 +6288,10 @@ BACKSLASH
       inside  and  outside character classes. In addition, inside a character
       class, \b is interpreted as the backspace character (hex 08).
-       \N is not allowed in a character class. \B, \R, and \X are not  special
+       When not followed by an opening brace, \N is not allowed in a character
-       inside  a  character  class.  Like other unrecognized alphabetic escape
+       class.   \B,  \R, and \X are not special inside a character class. Like
-       sequences, they cause  an  error.  Outside  a  character  class,  these
+       other unrecognized alphabetic escape sequences, they  cause  an  error.
-       sequences have different meanings.
+       Outside a character class, these sequences have different meanings.
   Unsupported escape sequences
@@ -6318,6 +6325,7 @@ BACKSLASH
         \D     any character that is not a decimal digit
         \h     any horizontal white space character
         \H     any character that is not a horizontal white space character
+         \N     any character that is not a newline
         \s     any white space character
         \S     any character that is not a white space character
         \v     any vertical white space character
@@ -6325,10 +6333,12 @@ BACKSLASH
         \w     any "word" character
         \W     any "non-word" character
-       There is also the single sequence \N, which matches a non-newline char-
+       The  \N  escape  sequence has the same meaning as the "." metacharacter
-       acter.  This is the same as the "." metacharacter when PCRE2_DOTALL  is
+       when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not  change
-       not  set. Perl also uses \N to match characters by name; PCRE2 does not
+       the meaning of \N. Note that when \N is followed by an opening brace it
-       support this.
+       has a different meaning. See the section entitled "Non-printing charac-
+       ters"  above for details. Perl also uses \N{name} to specify characters
+       by Unicode name; PCRE2 does not support this.
       Each pair of lower and upper case escape sequences partitions the  com-
       plete  set  of  characters  into two disjoint sets. Any given character
@@ -6867,10 +6877,15 @@ FULL STOP (PERIOD, DOT) AND \N
       flex and dollar, the only relationship being  that  they  both  involve
       newlines. Dot has no special meaning in a character class.
-       The  escape  sequence  \N  behaves  like  a  dot, except that it is not
+       The  escape  sequence  \N when not followed by an opening brace behaves
-       affected by the PCRE2_DOTALL option. In other  words,  it  matches  any
+       like a dot, except that it is not affected by the PCRE2_DOTALL  option.
-       character  except  one that signifies the end of a line. Perl also uses
+       In  other words, it matches any character except one that signifies the
-       \N to match characters by name; PCRE2 does not support this.
+       end of a line.
+       When \N is followed by an opening brace it has a different meaning. See
+       the  section entitled "Non-printing characters" above for details. Perl
+       also uses \N{name} to specify characters by Unicode  name;  PCRE2  does
+       not support this.
 MATCHING A SINGLE CODE UNIT
@@ -6951,10 +6966,12 @@ SQUARE BRACKETS AND CHARACTER CLASSES
       sumes a character from the subject string, and therefore  it  fails  if
       the current pointer is at the end of the string.
-       When caseless matching is set, any letters in a  class  represent  both
+       Characters  in  a class may be specified by their code points using \o,
-       their  upper  case  and lower case versions, so for example, a caseless
+       \x, or \N{U+hh..} in the usual way. When caseless matching is set,  any
-       [aeiou] matches "A" as well as "a", and a caseless  [^aeiou]  does  not
+       letters  in a class represent both their upper case and lower case ver-
-       match "A", whereas a caseful version would.
+       sions, so for example, a caseless [aeiou] matches "A" as well  as  "a",
+       and  a  caseless [^aeiou] does not match "A", whereas a caseful version
+       would.
       Characters that might indicate line breaks are  never  treated  in  any
       special  way  when  matching  character  classes,  whatever line-ending
@@ -6962,17 +6979,18 @@ SQUARE BRACKETS AND CHARACTER CLASSES
       PCRE2_MULTILINE  options  is  used. A class such as [^a] always matches
       one of these characters.
-       The  character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
+       The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
-       \w, and \W may appear in a character class, and add the characters that
+       \S,  \v,  \V,  \w,  and \W may appear in a character class, and add the
-       they  match to the class. For example, [\dABCDEF] matches any hexadeci-
+       characters that they  match  to  the  class.  For  example,  [\dABCDEF]
-       mal digit. In UTF modes, the PCRE2_UCP option affects the  meanings  of
+       matches  any  hexadecimal  digit.  In  UTF  modes, the PCRE2_UCP option
-       \d,  \s,  \w  and  their upper case partners, just as it does when they
+       affects the meanings of \d, \s, \w and their upper case partners,  just
-       appear outside a character class, as described in the section  entitled
+       as  it does when they appear outside a character class, as described in
-       "Generic character types" above. The escape sequence \b has a different
+       the section  entitled  "Generic  character  types"  above.  The  escape
-       meaning inside a character class; it matches the  backspace  character.
+       sequence  \b  has  a  different  meaning  inside  a character class; it
-       The  sequences  \B,  \N,  \R, and \X are not special inside a character
+       matches the backspace character. The sequences \B, \R, and \X  are  not
-       class. Like any other unrecognized  escape  sequences,  they  cause  an
+       special  inside  a  character class. Like any other unrecognized escape
-       error.
+       sequences, they cause an error. The same is true for \N when  not  fol-
+       lowed by an opening brace.
       The  minus (hyphen) character can be used to specify a range of charac-
       ters in a character  class.  For  example,  [d-m]  matches  any  letter
@@ -9012,7 +9030,7 @@ AUTHOR
 REVISION
-       Last updated: 20 July 2018
+       Last updated: 27 July 2018
       Copyright (c) 1997-2018 University of Cambridge.
 ------------------------------------------------------------------------------
@@ -9873,14 +9891,18 @@ ESCAPED CHARACTERS
         \ddd       character with octal code ddd, or backreference
         \o{ddd..}  character with octal code ddd..
         \U         "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
+         \N{U+hh..} character with Unicode code point hh..
         \uhhhh     character with hex code hhhh (if PCRE2_ALT_BSUX is set)
         \xhh       character with hex code hh
-         \x{hhh..}  character with hex code hhh..
+         \x{hh..}   character with hex code hh..
       Note that \0dd is always an octal code. The treatment of backslash fol-
       lowed by a non-zero digit is complicated; for details see  the  section
       "Non-printing  characters"  in  the  pcre2pattern  documentation, where
       details of escape processing in EBCDIC  environments  are  also  given.
+       \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not supported in
+       EBCDIC environments. Note that \N not  followed  by  an  opening  curly
+       bracket has a different meaning (see below).
       When  \x  is not followed by {, from zero to two hexadecimal digits are
       read, but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadec-
@@ -10289,7 +10311,7 @@ AUTHOR
 REVISION
-       Last updated: 21 July 2018
+       Last updated: 27 July 2018
       Copyright (c) 1997-2018 University of Cambridge.
 ------------------------------------------------------------------------------
--- a/doc/pcre2api.3
+++ b/doc/pcre2api.3
@@ -1,4 +1,4 @@
-.TH PCRE2API 3 "02 July 2018" "PCRE2 10.32"
+.TH PCRE2API 3 "27 July 2018" "PCRE2 10.32"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .sp
@@ -1400,7 +1400,8 @@ character, even if newlines are coded as CRLF. Without this option, a dot does
 not match when the current position in the subject is at a newline. This option
 is equivalent to Perl's /s option, and it can be changed within a pattern by a
 (?s) option setting. A negative class such as [^a] always matches newline
-characters, independent of the setting of this option.
+characters, and the \eN escape sequence always matches a non-newline character,
+independent of the setting of PCRE2_DOTALL.
 .sp
  PCRE2_DUPNAMES
 .sp
@@ -3640,6 +3641,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 02 July 2018
+Last updated: 27 July 2018
 Copyright (c) 1997-2018 University of Cambridge.
 .fi
--- a/doc/pcre2pattern.3
+++ b/doc/pcre2pattern.3
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "20 July 2018" "PCRE2 10.32"
+.TH PCRE2PATTERN 3 "27 July 2018" "PCRE2 10.32"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -218,10 +218,11 @@ is used.
 .P
 The newline convention affects where the circumflex and dollar assertions are
 true. It also affects the interpretation of the dot metacharacter when
-PCRE2_DOTALL is not set, and the behaviour of \eN. However, it does not affect
+PCRE2_DOTALL is not set, and the behaviour of \eN when not followed by an 
-what the \eR escape sequence matches. By default, this is any Unicode newline
+opening brace. However, it does not affect what the \eR escape sequence
-sequence, for Perl compatibility. However, this can be changed; see the next
+matches. By default, this is any Unicode newline sequence, for Perl
-section and the description of \eR in the section entitled
+compatibility. However, this can be changed; see the next section and the
+description of \eR in the section entitled
 .\" HTML <a href="#newlineseq">
 .\" </a>
 "Newline sequences"
@@ -371,8 +372,14 @@ these escapes are as follows:
  \eo{ddd..}   character with octal code ddd..
  \exhh        character with hex code hh
  \ex{hhh..}   character with hex code hhh.. (default mode)
+  \eN{U+hhh..} character with Unicode code point hhh..
  \euhhhh      character with hex code hhhh (when PCRE2_ALT_BSUX is set)
 .sp
+Note that when \eN is not followed by an opening brace (curly bracket) it has
+an entirely different meaning, matching any character that is not a newline.
+Perl also uses \eN{name} to specify characters by Unicode name; PCRE2 does not
+support this.
+.P
 The precise effect of \ecx on ASCII characters is as follows: if x is a lower
 case letter, it is converted to upper case. Then bit 6 of the character (hex
 40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
@@ -380,14 +387,14 @@ but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the
 code unit following \ec has a value less than 32 or greater than 126, a
 compile-time error occurs.
 .P
-When PCRE2 is compiled in EBCDIC mode, \ea, \ee, \ef, \en, \er, and \et
+When PCRE2 is compiled in EBCDIC mode, \eN{U+hhh..} is not supported. \ea, \ee,
-generate the appropriate EBCDIC code values. The \ec escape is processed
+\ef, \en, \er, and \et generate the appropriate EBCDIC code values. The \ec
-as specified for Perl in the \fBperlebcdic\fP document. The only characters
+escape is processed as specified for Perl in the \fBperlebcdic\fP document. The
-that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ], ^, _, or ?. Any
+only characters that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ],
-other character provokes a compile-time error. The sequence \ec@ encodes
+^, _, or ?. Any other character provokes a compile-time error. The sequence
-character code 0; after \ec the letters (in either case) encode characters 1-26
+\ec@ encodes character code 0; after \ec the letters (in either case) encode
-(hex 01 to hex 1A); [, \e, ], ^, and _ encode characters 27-31 (hex 1B to hex
+characters 1-26 (hex 01 to hex 1A); [, \e, ], ^, and _ encode characters 27-31
-1F), and \ec? becomes either 255 (hex FF) or 95 (hex 5F).
+(hex 1B to hex 1F), and \ec? becomes either 255 (hex FF) or 95 (hex 5F).
 .P
 Thus, apart from \ec?, these escapes generate the same character code values as
 they do in an ASCII environment, though the meanings of the values mostly
@@ -414,9 +421,9 @@ numbers greater than 0777, and it also allows octal numbers and backreferences
 to be unambiguously specified.
 .P
 For greater clarity and unambiguity, it is best to avoid following \e by a
-digit greater than zero. Instead, use \eo{} or \ex{} to specify character
+digit greater than zero. Instead, use \eo{} or \ex{} to specify numerical
-numbers, and \eg{} to specify backreferences. The following paragraphs
+character code points, and \eg{} to specify backreferences. The following
-describe the old, ambiguous syntax.
+paragraphs describe the old, ambiguous syntax.
 .P
 The handling of a backslash followed by a digit other than 0 is complicated,
 and Perl has changed over time, causing PCRE2 also to change.
@@ -507,10 +514,10 @@ All the sequences that define a single character value can be used both inside
 and outside character classes. In addition, inside a character class, \eb is
 interpreted as the backspace character (hex 08).
 .P
-\eN is not allowed in a character class. \eB, \eR, and \eX are not special
+When not followed by an opening brace, \eN is not allowed in a character class.
-inside a character class. Like other unrecognized alphabetic escape sequences,
+\eB, \eR, and \eX are not special inside a character class. Like other
-they cause an error. Outside a character class, these sequences have different
+unrecognized alphabetic escape sequences, they cause an error. Outside a
-meanings.
+character class, these sequences have different meanings.
 .
 .
 .SS "Unsupported escape sequences"
@@ -569,6 +576,7 @@ Another use of backslash is for specifying generic character types:
  \eD     any character that is not a decimal digit
  \eh     any horizontal white space character
  \eH     any character that is not a horizontal white space character
+  \eN     any character that is not a newline
  \es     any white space character
  \eS     any character that is not a white space character
  \ev     any vertical white space character
@@ -576,14 +584,20 @@ Another use of backslash is for specifying generic character types:
  \ew     any "word" character
  \eW     any "non-word" character
 .sp
-There is also the single sequence \eN, which matches a non-newline character.
+The \eN escape sequence has the same meaning as
-This is the same as
 .\" HTML <a href="#fullstopdot">
 .\" </a>
 the "." metacharacter
 .\"
-when PCRE2_DOTALL is not set. Perl also uses \eN to match characters by name;
+when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the 
-PCRE2 does not support this.
+meaning of \eN. Note that when \eN is followed by an opening brace it has a 
+different meaning. See the section entitled
+.\" HTML <a href="#digitsafterbackslash">
+.\" </a>
+"Non-printing characters"
+.\"
+above for details. Perl also uses \eN{name} to specify characters by Unicode
+name; PCRE2 does not support this.
 .P
 Each pair of lower and upper case escape sequences partitions the complete set
 of characters into two disjoint sets. Any given character matches one, and only
@@ -1289,9 +1303,17 @@ The handling of dot is entirely independent of the handling of circumflex and
 dollar, the only relationship being that they both involve newlines. Dot has no
 special meaning in a character class.
 .P
-The escape sequence \eN behaves like a dot, except that it is not affected by
+The escape sequence \eN when not followed by an opening brace behaves like a
-the PCRE2_DOTALL option. In other words, it matches any character except one
+dot, except that it is not affected by the PCRE2_DOTALL option. In other words,
-that signifies the end of a line. Perl also uses \eN to match characters by
+it matches any character except one that signifies the end of a line. 
 .P
 When \eN is followed by an opening brace it has a different meaning. See the
 section entitled
 .\" HTML <a href="#digitsafterbackslash">
 .\" </a>
 "Non-printing characters"
 .\"
 above for details. Perl also uses \eN{name} to specify characters by Unicode
 name; PCRE2 does not support this.
 .
 .
@ -1380,30 +1402,32 @@ circumflex is not an assertion; it still consumes a character from the subject
 string, and therefore it fails if the current pointer is at the end of the
 string.
 .P
-When caseless matching is set, any letters in a class represent both their
+Characters in a class may be specified by their code points using \eo, \ex, or
-upper case and lower case versions, so for example, a caseless [aeiou] matches
+\eN{U+hh..} in the usual way. When caseless matching is set, any letters in a
-"A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a
+class represent both their upper case and lower case versions, so for example,
-caseful version would.
+a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
 match "A", whereas a caseful version would.
 .P
 Characters that might indicate line breaks are never treated in any special way
 when matching character classes, whatever line-ending sequence is in use, and
 whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
 class such as [^a] always matches one of these characters.
 .P
-The character escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es, \eS, \ev,
+The generic character type escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es,
-\eV, \ew, and \eW may appear in a character class, and add the characters that
+\eS, \ev, \eV, \ew, and \eW may appear in a character class, and add the
-they match to the class. For example, [\edABCDEF] matches any hexadecimal
+characters that they match to the class. For example, [\edABCDEF] matches any
-digit. In UTF modes, the PCRE2_UCP option affects the meanings of \ed, \es, \ew
+hexadecimal digit. In UTF modes, the PCRE2_UCP option affects the meanings of
-and their upper case partners, just as it does when they appear outside a
+\ed, \es, \ew and their upper case partners, just as it does when they appear
-character class, as described in the section entitled
+outside a character class, as described in the section entitled
 .\" HTML <a href="#genericchartypes">
 .\" </a>
 "Generic character types"
 .\"
 above. The escape sequence \eb has a different meaning inside a character
-class; it matches the backspace character. The sequences \eB, \eN, \eR, and \eX
+class; it matches the backspace character. The sequences \eB, \eR, and \eX are
-are not special inside a character class. Like any other unrecognized escape
+not special inside a character class. Like any other unrecognized escape
-sequences, they cause an error.
+sequences, they cause an error. The same is true for \eN when not followed by
 an opening brace.
 .P
 The minus (hyphen) character can be used to specify a range of characters in a
 character class. For example, [d-m] matches any letter between d and m,
@ -3580,6 +3604,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 20 July 2018
+Last updated: 27 July 2018
 Copyright (c) 1997-2018 University of Cambridge.
 .fi
--- a/doc/pcre2syntax.3
+++ b/doc/pcre2syntax.3
@ -1,4 +1,4 @@
-.TH PCRE2SYNTAX 3 "21 July 2018" "PCRE2 10.32"
+.TH PCRE2SYNTAX 3 "27 July 2018" "PCRE2 10.32"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -35,9 +35,10 @@ This table applies to ASCII and Unicode environments.
  \eddd       character with octal code ddd, or backreference
  \eo{ddd..}  character with octal code ddd..
  \eU         "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
  \eN{U+hh..} character with Unicode code point hh.. 
  \euhhhh     character with hex code hhhh (if PCRE2_ALT_BSUX is set)
  \exhh       character with hex code hh
-  \ex{hhh..}  character with hex code hhh..
+  \ex{hh..}   character with hex code hh..
 .sp
 Note that \e0dd is always an octal code. The treatment of backslash followed by
 a non-zero digit is complicated; for details see the section
@ -50,7 +51,9 @@ in the
 \fBpcre2pattern\fP
 .\"
 documentation, where details of escape processing in EBCDIC environments are
-also given.
+also given. \eN{U+hh..} is synonymous with \ex{hh..} in PCRE2 but is not
 supported in EBCDIC environments. Note that \eN not followed by an opening
 curly bracket has a different meaning (see below).
 .P
 When \ex is not followed by {, from zero to two hexadecimal digits are read,
 but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
@ -609,6 +612,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 21 July 2018
+Last updated: 27 July 2018
 Copyright (c) 1997-2018 University of Cambridge.
 .fi
--- a/src/pcre2.h.in
+++ b/src/pcre2.h.in
@ -316,6 +316,7 @@ pcre2_pattern_convert(). */
 #define PCRE2_ERROR_INTERNAL_BAD_CODE_IN_SKIP      190
 #define PCRE2_ERROR_NO_SURROGATES_IN_UTF16         191
 #define PCRE2_ERROR_BAD_LITERAL_OPTIONS            192
 #define PCRE2_ERROR_NOT_SUPPORTED_IN_EBCDIC        193
 /* "Expected" matching error codes: no match and partial match. */
--- a/src/pcre2_compile.c
+++ b/src/pcre2_compile.c
@ -731,7 +731,7 @@ enum { ERR0 = COMPILE_ERROR_BASE,
       ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69, ERR70,
       ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79, ERR80,
       ERR81, ERR82, ERR83, ERR84, ERR85, ERR86, ERR87, ERR88, ERR89, ERR90,
-       ERR91, ERR92};
+       ERR91, ERR92, ERR93 };
 /* This is a table of start-of-pattern options such as (*UTF) and settings such
 as (*LIMIT_MATCH=nnnn) and (*CRLF). For completeness and backward
@ -1441,6 +1441,42 @@ else if ((i = escapes[c - ESCAPES_FIRST]) != 0)
    escape = -i;                    /* Else return a special escape */
    if (cb != NULL && (escape == ESC_P || escape == ESC_p || escape == ESC_X))
      cb->external_flags |= PCRE2_HASBKPORX;   /* Note \P, \p, or \X */
    /* Perl supports \N{name} for character names and \N{U+dddd} for numerical
    Unicode code points, as well as plain \N for "not newline". PCRE does not
    support \N{name}. However, it does support quantification such as \N{2,3}, 
    so if \N{ is not followed by U+dddd we check for a quantifier. */
    if (escape == ESC_N && ptr < ptrend && *ptr == CHAR_LEFT_CURLY_BRACKET)
      {
      PCRE2_SPTR p = ptr + 1;
      /* \N{U+ can be handled by the \x{ code. However, this construction is 
      not valid in EBCDIC environments because it specifies a Unicode 
      character, not a codepoint in the local code. For example \N{U+0041} 
      must be "A" in all environments. */
      if (ptrend - p > 1 && *p == CHAR_U && p[1] == CHAR_PLUS)
        {
 #ifdef EBCDIC
        *errorcodeptr = ERR93;
 #else        
        ptr = p + 1;
        escape = 0;   /* Not a fancy escape after all */ 
        goto COME_FROM_NU;
 #endif 
        }  
      /* Give an error if what follows is not a quantifier, but don't override 
      an error set by the quantifier reader (e.g. number overflow). */
      else
        { 
        if (!read_repeat_counts(&p, ptrend, NULL, NULL, errorcodeptr) &&
             *errorcodeptr == 0)
          *errorcodeptr = ERR37;
        }   
      }
    }
  }
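The decision made in the hunk above (what to do with the text that follows \N) can be illustrated with a small standalone sketch. This is not the PCRE2 code: the enum, the function names, and the simplified quantifier check are all hypothetical, it assumes a NUL-terminated ASCII pattern, and it omits the EBCDIC branch and any range check on the code point.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch; they are not PCRE2 identifiers. */
typedef enum {
  N_NOT_NEWLINE,    /* plain \N: any character that is not a newline */
  N_CODE_POINT,     /* \N{U+hh..}: a single Unicode code point */
  N_QUANTIFIED,     /* \N{n}, \N{n,} or \N{n,m}: quantified "not newline" */
  N_MISSING_DIGITS, /* \N{U+}: nothing after the plus sign */
  N_BAD_HEX,        /* \N{U+..} containing a non-hex character */
  N_UNSUPPORTED     /* anything else in braces, such as \N{name} or \N{U} */
} n_kind;

/* Value of a hex digit; the caller has already checked isxdigit(). */
static unsigned hex_value(int c)
{
  return isdigit(c) ? (unsigned)(c - '0') : (unsigned)(tolower(c) - 'a' + 10);
}

/* Returns 1 if s (the text after '{') has the shape n}  n,}  or n,m} */
static int looks_like_quantifier(const char *s)
{
  if (!isdigit((unsigned char)*s)) return 0;
  while (isdigit((unsigned char)*s)) s++;
  if (*s == ',')
    {
    s++;
    while (isdigit((unsigned char)*s)) s++;
    }
  return *s == '}';
}

/* Classify the text that follows "\N"; after points just past the 'N'.
On N_CODE_POINT the parsed value is stored in *cp. */
static n_kind classify_after_backslash_N(const char *after, unsigned long *cp)
{
  const char *p;
  unsigned long value = 0;

  assert(after != NULL && cp != NULL);
  if (*after != '{') return N_NOT_NEWLINE;
  p = after + 1;

  if (p[0] == 'U' && p[1] == '+')
    {
    p += 2;
    if (*p == '}' || *p == '\0') return N_MISSING_DIGITS;
    while (isxdigit((unsigned char)*p))
      {
      value = value * 16 + hex_value((unsigned char)*p);
      p++;
      }
    if (*p != '}') return N_BAD_HEX;
    *cp = value;
    return N_CODE_POINT;
    }

  return looks_like_quantifier(p) ? N_QUANTIFIED : N_UNSUPPORTED;
}
```

With this sketch, "{U+1234}" is classified as a code point with value 0x1234, "{2,3}" as a quantifier, and "{U}" as unsupported, which mirrors the outcomes recorded in the testoutput4 and testoutput5 hunks of this commit.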
@ -1725,6 +1761,9 @@ else
      {
      if (ptr < ptrend && *ptr == CHAR_LEFT_CURLY_BRACKET)
        {
 #ifndef EBCDIC         
        COME_FROM_NU: 
 #endif         
        if (++ptr >= ptrend || *ptr == CHAR_RIGHT_CURLY_BRACKET)
          {
          *errorcodeptr = ERR78;
@ -1858,19 +1897,6 @@ else
    }
  }
 /* Perl supports \N{name} for character names, as well as plain \N for "not
 newline". PCRE does not support \N{name}. However, it does support
 quantification such as \N{2,3}. */
 if (escape == ESC_N && ptr < ptrend && *ptr == CHAR_LEFT_CURLY_BRACKET &&
    ptrend - ptr > 2)
  {
  PCRE2_SPTR p = ptr + 1;
  if (!read_repeat_counts(&p, ptrend, NULL, NULL, errorcodeptr) &&
       *errorcodeptr == 0)
    *errorcodeptr = ERR37;
  }
 /* Set the pointer to the next character before returning. */
 *ptrptr = ptr;
@ -3223,7 +3249,6 @@ while (ptr < ptrend)
        tempptr = ptr;
        escape = PRIV(check_escape)(&ptr, ptrend, &c, &errorcode,
          options, TRUE, cb);
        if (errorcode != 0)
          {
          CLASS_ESCAPE_FAILED:
--- a/src/pcre2_error.c
+++ b/src/pcre2_error.c
@ -161,7 +161,7 @@ static const unsigned char compile_error_texts[] =
  "using UCP is disabled by the application\0"
  "name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)\0"
  "character code point value in \\u.... sequence is too large\0"
-  "digits missing in \\x{} or \\o{}\0"
+  "digits missing in \\x{} or \\o{} or \\N{U+}\0"
  "syntax error or number too big in (?(VERSION condition\0"
  /* 80 */
  "internal error: unknown opcode in auto_possessify()\0"
@ -179,6 +179,7 @@ static const unsigned char compile_error_texts[] =
  "internal error: bad code value in parsed_skip()\0"
  "PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is not allowed in UTF-16 mode\0"
  "invalid option bits with PCRE2_LITERAL\0"
  "\\N{U+dddd} is not supported in EBCDIC mode\0" 
  ;
 /* Match-time and UTF error texts are in the same format. */
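The error texts above are stored as one long string in which every message ends with a NUL byte. As a rough illustration of how such a table can be indexed, the sketch below walks a three-entry demo table; demo_texts and nth_message are hypothetical names, and this is not the lookup code that PCRE2 itself uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Demo table only. Each message is terminated by \0, and the compiler adds
one more \0 after the last one, so an empty string marks the end. */
static const char demo_texts[] =
  "no error\0"
  "digits missing in \\x{} or \\o{} or \\N{U+}\0"
  "\\N{U+dddd} is not supported in EBCDIC mode\0"
  ;

/* Return a pointer to message number n (counting from 0), or NULL if the
table holds fewer than n+1 messages. */
static const char *nth_message(const char *table, int n)
{
  const char *p = table;

  assert(table != NULL && n >= 0);
  for (; n > 0; n--)
    {
    p += strlen(p) + 1;            /* step over this message and its \0 */
    if (*p == '\0') return NULL;   /* reached the end-of-table marker */
    }
  return p;
}
```

Appending a new message at the end of the string, as this commit does for error 193, leaves the positions of all earlier messages unchanged.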
--- a/testdata/testinput4
+++ b/testdata/testinput4
@ -2288,4 +2288,10 @@
 \= Expect no match     
    \x{123}\x{124}\x{123}
 /\N{U+1234}/utf
    \x{1234}
 /[\N{U+1234}]/utf
    \x{1234}
 # End of testinput4
--- a/testdata/testinput5
+++ b/testdata/testinput5
@ -2087,4 +2087,8 @@
    \x{655}
    \x{1D1AA} 
 /\N{U+}/
 /\N{U}/
 # End of testinput5
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@ -13194,7 +13194,7 @@ Failed: error 167 at offset 5: non-hex character in \x{} (closing brace missing?
 Failed: error 167 at offset 7: non-hex character in \x{} (closing brace missing?)
 /^A\x{/
-Failed: error 178 at offset 5: digits missing in \x{} or \o{}
+Failed: error 178 at offset 5: digits missing in \x{} or \o{} or \N{U+}
 /[ab]++/B,no_auto_possess
 ------------------------------------------------------------------
@ -13408,7 +13408,7 @@ Failed: error 133 at offset 7: parentheses are too deeply nested (stack check)
 Failed: error 155 at offset 2: missing opening brace after \o
 /\o{}/
-Failed: error 178 at offset 3: digits missing in \x{} or \o{}
+Failed: error 178 at offset 3: digits missing in \x{} or \o{} or \N{U+}
 /\o{whatever}/
 Failed: error 164 at offset 3: non-octal character in \o{} (closing brace missing?)
@ -13416,7 +13416,7 @@ Failed: error 164 at offset 3: non-octal character in \o{} (closing brace missin
 /\xthing/
 /\x{}/
-Failed: error 178 at offset 3: digits missing in \x{} or \o{}
+Failed: error 178 at offset 3: digits missing in \x{} or \o{} or \N{U+}
 /\x{whatever}/
 Failed: error 167 at offset 3: non-hex character in \x{} (closing brace missing?)
--- a/testdata/testoutput4
+++ b/testdata/testoutput4
@ -3704,4 +3704,12 @@ No match
    \x{123}\x{124}\x{123}
 No match
 /\N{U+1234}/utf
    \x{1234}
 0: \x{1234}
 /[\N{U+1234}]/utf
    \x{1234}
 0: \x{1234}
 # End of testinput4
--- a/testdata/testoutput5
+++ b/testdata/testoutput5
@ -4750,4 +4750,10 @@ No match
    \x{1D1AA} 
 0: \x{1d1aa}
 /\N{U+}/
 Failed: error 178 at offset 5: digits missing in \x{} or \o{} or \N{U+}
 /\N{U}/
 Failed: error 137 at offset 2: PCRE does not support \L, \l, \N{name}, \U, or \u
 # End of testinput5
--- a/testdata/testoutputEBC
+++ b/testdata/testoutputEBC
@ -1,3 +1,4 @@
 PCRE2 version 10.32-RC1 2018-02-19
 # This is a specialized test for checking, when PCRE2 is compiled with the
 # EBCDIC option but in an ASCII environment, that newline, white space, and \c
 # functionality is working. It catches cases where explicit values such as 0x0a
@ -200,6 +201,6 @@ No match
 0: \xff
 /\ƒ&/
-Failed: error 168 at offset 2: \c\x20must\x20be\x20followed\x20by\x20a\x20letter\x20or\x20one\x20of\x20[\]^_\x3f
+Failed: error 168 at offset 3: \c\x20must\x20be\x20followed\x20by\x20a\x20letter\x20or\x20one\x20of\x20[\]^_\x3f
 # End