Make /x more Perl-compatible by recognizing all of Unicode's "Pattern White
Space" characters, not just the ASCII ones.
2018-08-03 09:38:36 +00:00 · b196143523
parent 6e245572b8
commit b196143523
15 changed files with 1374 additions and 1205 deletions
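The `(c|1)` comparisons in the src/pcre2_compile.c hunks of this commit are compact, so here is a minimal standalone sketch of the same membership test. The function name `is_pattern_white_space` is hypothetical and does not exist in PCRE2. The sketch also calls `isspace()` directly where PCRE2 consults its `ctypes` table, which assumes the default "C" locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2: true if code point c is one of the
   characters that /x discards when Unicode support is compiled in. */
static bool is_pattern_white_space(uint32_t c)
{
  /* Low characters: PCRE2 uses a lookup table built with isspace(); calling
     isspace() here is a stand-in that assumes the "C" locale. */
  if (c < 256 && isspace((int)c)) return true;

  /* U+0085 (next line), which isspace() does not report in the "C" locale. */
  if (c == 0x85) return true;

  /* Setting the low bit folds each even/odd pair onto its odd member:
     U+200E and U+200F both give 0x200f, U+2028 and U+2029 both give 0x2029. */
  return (c | 1) == 0x200f || (c | 1) == 0x2029;
}
```

Under these assumptions the function accepts tab, space, U+0085, U+200E, U+200F, U+2028 and U+2029, and rejects U+2002 (en space), which matches \h but is not "Pattern White Space".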
--- a/7
+++ b/7
@@ -133,6 +133,13 @@ terminated by (*ACCEPT).
 29. Add support for \N{U+dddd}, but not in EBCDIC environments.

 30. Add support for (?^) for unsetting all imnsx options.
+
+31. The PCRE2_EXTENDED (/x) option only ever discarded space characters whose
+code point was less than 256 and that were recognized by the lookup table
+generated by pcre2_maketables(), which uses isspace() to identify white space.
+Now, when Unicode support is compiled, PCRE2_EXTENDED also discards U+0085,
+U+200E, U+200F, U+2028, and U+2029, which are additional characters defined by
+Unicode as "Pattern White Space". This makes PCRE2 compatible with Perl.
      

 Version 10.31 12-February-2018
--- a/doc/html/pcre2api.html
+++ b/doc/html/pcre2api.html
@@ -837,10 +837,10 @@ page for details.
 </P>
 <P>
 When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
-option, the newline convention affects the recognition of white space and the
-end of internal comments starting with #. The value is saved with the compiled
-pattern for subsequent use by the JIT compiler and by the two interpreted
-matching functions, <i>pcre2_match()</i> and <i>pcre2_dfa_match()</i>.
+option, the newline convention affects the recognition of the end of internal
+comments starting with #. The value is saved with the compiled pattern for
+subsequent use by the JIT compiler and by the two interpreted matching
+functions, <i>pcre2_match()</i> and <i>pcre2_dfa_match()</i>.
 <br>
 <br>
 <b>int pcre2_set_parens_nest_limit(pcre2_compile_context *<i>ccontext</i>,</b>
@@ -1424,9 +1424,9 @@ include a closing parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES
 option is set, normal backslash processing is applied to verb names and only an
 unescaped closing parenthesis terminates the name. A closing parenthesis can be
 included in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED
-or PCRE2_EXTENDED_MORE option is set, unescaped whitespace in verb names is
-skipped and #-comments are recognized in this mode, exactly as in the rest of
-the pattern.
+or PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
+whitespace in verb names is skipped and #-comments are recognized, exactly as
+in the rest of the pattern.
 <pre>
  PCRE2_AUTO_CALLOUT
 </pre>
@@ -1510,15 +1510,36 @@ is not allowed within sequences such as (?&#62; that introduce various
 parenthesized subpatterns, nor within numerical quantifiers such as {1,3}.
 Ignorable white space is permitted between an item and a following quantifier
 and between a quantifier and a following + that indicates possessiveness.
+PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be changed within
+a pattern by a (?x) option setting.
 </P>
 <P>
-PCRE2_EXTENDED also causes characters between an unescaped # outside a
-character class and the next newline, inclusive, to be ignored, which makes it
-possible to include comments inside complicated patterns. Note that the end of
-this type of comment is a literal newline sequence in the pattern; escape
-sequences that happen to represent a newline do not count. PCRE2_EXTENDED is
-equivalent to Perl's /x option, and it can be changed within a pattern by a
-(?x) option setting.
+When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recognizes as 
+white space only those characters with code points less than 256 that are 
+flagged as white space in its low-character table. The table is normally 
+created by 
+<a href="pcre2_maketables.html"><b>pcre2_maketables()</b>,</a>
+which uses the <b>isspace()</b> function to identify space characters. In most
+ASCII environments, the relevant characters are those with code points 0x0009
+(tab), 0x000A (linefeed), 0x000B (vertical tab), 0x000C (formfeed), 0x000D
+(carriage return), and 0x0020 (space). 
+</P>
+<P>
+When PCRE2 is compiled with Unicode support, in addition to these characters,
+five more Unicode "Pattern White Space" characters are recognized by
+PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-right mark),
+U+200F (right-to-left mark), U+2028 (line separator), and U+2029 (paragraph
+separator). This set of characters is the same as recognized by Perl's /x
+option. Note that the horizontal and vertical space characters that are matched
+by the \h and \v escapes in patterns are a much bigger set.
+</P>
+<P>
+As well as ignoring most white space, PCRE2_EXTENDED also causes characters
+between an unescaped # outside a character class and the next newline,
+inclusive, to be ignored, which makes it possible to include comments inside
+complicated patterns. Note that the end of this type of comment is a literal
+newline sequence in the pattern; escape sequences that happen to represent a
+newline do not count.
 </P>
 <P>
 Which characters are interpreted as newlines can be specified by a setting in
@@ -1531,9 +1552,11 @@ built.
  PCRE2_EXTENDED_MORE
 </pre>
 This option has the effect of PCRE2_EXTENDED, but, in addition, unescaped space
-and horizontal tab characters are ignored inside a character class.
-PCRE2_EXTENDED_MORE is equivalent to Perl's 5.26 /xx option, and it can be
-changed within a pattern by a (?xx) option setting.
+and horizontal tab characters are ignored inside a character class. Note: only 
+these two characters are ignored, not the full set of pattern white space 
+characters that are ignored outside a character class. PCRE2_EXTENDED_MORE is
+equivalent to Perl's /xx option, and it can be changed within a pattern by a
+(?xx) option setting.
 <pre>
  PCRE2_FIRSTLINE
 </pre>
@@ -3635,7 +3658,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC42" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 27 July 2018
+Last updated: 03 August 2018
 <br>
 Copyright &copy; 1997-2018 University of Cambridge.
 <br>
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@@ -1628,9 +1628,11 @@ alternative in the subpattern.
 <br><a name="SEC13" href="#TOC1">INTERNAL OPTION SETTING</a><br>
 <P>
 The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
-PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options (which
-are Perl-compatible) can be changed from within the pattern by a sequence of
-Perl option letters enclosed between "(?" and ")". The option letters are
+PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options can be
+changed from within the pattern by a sequence of letters enclosed between "(?"
+and ")". These options are Perl-compatible, and are described in detail in the
+<a href="pcre2api.html"><b>pcre2api</b></a>
+documentation. The option letters are:
 <pre>
  i  for PCRE2_CASELESS
  m  for PCRE2_MULTILINE
@@ -2275,8 +2277,9 @@ unset value matches an empty string.
 Because there may be many capturing parentheses in a pattern, all digits
 following a backslash are taken as part of a potential backreference number.
 If the pattern continues with a digit character, some delimiter must be used to
-terminate the backreference. If the PCRE2_EXTENDED option is set, this can be
-white space. Otherwise, the \g{ syntax or an empty comment (see
+terminate the backreference. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
+option is set, this can be white space. Otherwise, the \g{ syntax or an empty
+comment (see
 <a href="#comments">"Comments"</a>
 below) can be used.
 </P>
@@ -2744,12 +2747,12 @@ no part in the pattern matching.
 <P>
 The sequence (?# marks the start of a comment that continues up to the next
 closing parenthesis. Nested parentheses are not permitted. If the
-PCRE2_EXTENDED option is set, an unescaped # character also introduces a
-comment, which in this case continues to immediately after the next newline
-character or character sequence in the pattern. Which characters are
-interpreted as newlines is controlled by an option passed to the compiling
-function or by a special sequence at the start of the pattern, as described in
-the section entitled
+PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # character
+also introduces a comment, which in this case continues to immediately after
+the next newline character or character sequence in the pattern. Which
+characters are interpreted as newlines is controlled by an option passed to the
+compiling function or by a special sequence at the start of the pattern, as
+described in the section entitled
 <a href="#newlines">"Newline conventions"</a>
 above. Note that the end of this type of comment is a literal newline sequence
 in the pattern; escape sequences that happen to represent a newline do not
@@ -3108,10 +3111,11 @@ are faulted.
 </P>
 <P>
 A closing parenthesis can be included in a name either as \) or between \Q
-and \E. In addition to backslash processing, if the PCRE2_EXTENDED option is
-also set, unescaped whitespace in verb names is skipped, and #-comments are
-recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED does not
-affect verb names unless PCRE2_ALT_VERBNAMES is also set.
+and \E. In addition to backslash processing, if the PCRE2_EXTENDED or 
+PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb names is
+skipped, and #-comments are recognized, exactly as in the rest of the pattern.
+PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect verb names unless
+PCRE2_ALT_VERBNAMES is also set.
 </P>
 <P>
 The maximum length of a name is 255 in the 8-bit library and 65535 in the
@@ -3590,7 +3594,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC30" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 28 July 2018
+Last updated: 03 August 2018
 <br>
 Copyright &copy; 1997-2018 University of Cambridge.
 <br>
--- a/doc/html/pcre2syntax.html
+++ b/doc/html/pcre2syntax.html
@@ -446,6 +446,8 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
 </P>
 <br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
 <P>
+Changes of these options within a group are automatically cancelled at the end 
+of the group.
 <pre>
  (?i)            caseless
  (?J)            allow duplicate names
@@ -632,7 +634,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC27" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 28 July 2018
+Last updated: 01 August 2018
 <br>
 Copyright &copy; 1997-2018 University of Cambridge.
 <br>
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
--- a/doc/pcre2api.3
+++ b/doc/pcre2api.3
@@ -1,4 +1,4 @@
-.TH PCRE2API 3 "27 July 2018" "PCRE2 10.32"
+.TH PCRE2API 3 "03 August 2018" "PCRE2 10.32"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .sp
@@ -775,10 +775,10 @@ sequence such as (*CRLF). See the
 page for details.
 .P
 When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
-option, the newline convention affects the recognition of white space and the
-end of internal comments starting with #. The value is saved with the compiled
-pattern for subsequent use by the JIT compiler and by the two interpreted
-matching functions, \fIpcre2_match()\fP and \fIpcre2_dfa_match()\fP.
+option, the newline convention affects the recognition of the end of internal
+comments starting with #. The value is saved with the compiled pattern for
+subsequent use by the JIT compiler and by the two interpreted matching
+functions, \fIpcre2_match()\fP and \fIpcre2_dfa_match()\fP.
 .sp
 .nf
 .B int pcre2_set_parens_nest_limit(pcre2_compile_context *\fIccontext\fP,
@@ -1356,9 +1356,9 @@ include a closing parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES
 option is set, normal backslash processing is applied to verb names and only an
 unescaped closing parenthesis terminates the name. A closing parenthesis can be
 included in a name either as \e) or between \eQ and \eE. If the PCRE2_EXTENDED
-or PCRE2_EXTENDED_MORE option is set, unescaped whitespace in verb names is
-skipped and #-comments are recognized in this mode, exactly as in the rest of
-the pattern.
+or PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
+whitespace in verb names is skipped and #-comments are recognized, exactly as
+in the rest of the pattern.
 .sp
  PCRE2_AUTO_CALLOUT
 .sp
@@ -1445,14 +1445,35 @@ is not allowed within sequences such as (?> that introduce various
 parenthesized subpatterns, nor within numerical quantifiers such as {1,3}.
 Ignorable white space is permitted between an item and a following quantifier
 and between a quantifier and a following + that indicates possessiveness.
+PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be changed within
+a pattern by a (?x) option setting.
 .P
-PCRE2_EXTENDED also causes characters between an unescaped # outside a
-character class and the next newline, inclusive, to be ignored, which makes it
-possible to include comments inside complicated patterns. Note that the end of
-this type of comment is a literal newline sequence in the pattern; escape
-sequences that happen to represent a newline do not count. PCRE2_EXTENDED is
-equivalent to Perl's /x option, and it can be changed within a pattern by a
-(?x) option setting.
+When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recognizes as 
+white space only those characters with code points less than 256 that are 
+flagged as white space in its low-character table. The table is normally 
+created by 
+.\" HREF
+\fBpcre2_maketables()\fP, 
+.\"
+which uses the \fBisspace()\fP function to identify space characters. In most
+ASCII environments, the relevant characters are those with code points 0x0009
+(tab), 0x000A (linefeed), 0x000B (vertical tab), 0x000C (formfeed), 0x000D
+(carriage return), and 0x0020 (space). 
+.P
+When PCRE2 is compiled with Unicode support, in addition to these characters,
+five more Unicode "Pattern White Space" characters are recognized by
+PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-right mark),
+U+200F (right-to-left mark), U+2028 (line separator), and U+2029 (paragraph
+separator). This set of characters is the same as recognized by Perl's /x
+option. Note that the horizontal and vertical space characters that are matched
+by the \eh and \ev escapes in patterns are a much bigger set.
+.P
+As well as ignoring most white space, PCRE2_EXTENDED also causes characters
+between an unescaped # outside a character class and the next newline,
+inclusive, to be ignored, which makes it possible to include comments inside
+complicated patterns. Note that the end of this type of comment is a literal
+newline sequence in the pattern; escape sequences that happen to represent a
+newline do not count.
 .P
 Which characters are interpreted as newlines can be specified by a setting in
 the compile context that is passed to \fBpcre2_compile()\fP or by a special
@@ -1467,9 +1488,11 @@ built.
  PCRE2_EXTENDED_MORE
 .sp
 This option has the effect of PCRE2_EXTENDED, but, in addition, unescaped space
-and horizontal tab characters are ignored inside a character class.
-PCRE2_EXTENDED_MORE is equivalent to Perl's 5.26 /xx option, and it can be
-changed within a pattern by a (?xx) option setting.
+and horizontal tab characters are ignored inside a character class. Note: only 
+these two characters are ignored, not the full set of pattern white space 
+characters that are ignored outside a character class. PCRE2_EXTENDED_MORE is
+equivalent to Perl's /xx option, and it can be changed within a pattern by a
+(?xx) option setting.
 .sp
  PCRE2_FIRSTLINE
 .sp
@@ -3641,6 +3664,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 27 July 2018
+Last updated: 03 August 2018
 Copyright (c) 1997-2018 University of Cambridge.
 .fi
--- a/doc/pcre2pattern.3
+++ b/doc/pcre2pattern.3
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "28 July 2018" "PCRE2 10.32"
+.TH PCRE2PATTERN 3 "03 August 2018" "PCRE2 10.32"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -1627,9 +1627,13 @@ alternative in the subpattern.
 .rs
 .sp
 The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
-PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options (which
-are Perl-compatible) can be changed from within the pattern by a sequence of
-Perl option letters enclosed between "(?" and ")". The option letters are
+PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options can be
+changed from within the pattern by a sequence of letters enclosed between "(?"
+and ")". These options are Perl-compatible, and are described in detail in the
+.\" HREF
+\fBpcre2api\fP
+.\"
+documentation. The option letters are:
 .sp
  i  for PCRE2_CASELESS
  m  for PCRE2_MULTILINE
@@ -2273,8 +2277,9 @@ unset value matches an empty string.
 Because there may be many capturing parentheses in a pattern, all digits
 following a backslash are taken as part of a potential backreference number.
 If the pattern continues with a digit character, some delimiter must be used to
-terminate the backreference. If the PCRE2_EXTENDED option is set, this can be
-white space. Otherwise, the \eg{ syntax or an empty comment (see
+terminate the backreference. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
+option is set, this can be white space. Otherwise, the \eg{ syntax or an empty
+comment (see
 .\" HTML <a href="#comments">
 .\" </a>
 "Comments"
@@ -2762,12 +2767,12 @@ no part in the pattern matching.
 .P
 The sequence (?# marks the start of a comment that continues up to the next
 closing parenthesis. Nested parentheses are not permitted. If the
-PCRE2_EXTENDED option is set, an unescaped # character also introduces a
-comment, which in this case continues to immediately after the next newline
-character or character sequence in the pattern. Which characters are
-interpreted as newlines is controlled by an option passed to the compiling
-function or by a special sequence at the start of the pattern, as described in
-the section entitled
+PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # character
+also introduces a comment, which in this case continues to immediately after
+the next newline character or character sequence in the pattern. Which
+characters are interpreted as newlines is controlled by an option passed to the
+compiling function or by a special sequence at the start of the pattern, as
+described in the section entitled
 .\" HTML <a href="#newlines">
 .\" </a>
 "Newline conventions"
@@ -3132,10 +3137,11 @@ only backslash items that are permitted are \eQ, \eE, and sequences such as
 are faulted.
 .P
 A closing parenthesis can be included in a name either as \e) or between \eQ
-and \eE. In addition to backslash processing, if the PCRE2_EXTENDED option is
-also set, unescaped whitespace in verb names is skipped, and #-comments are
-recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED does not
-affect verb names unless PCRE2_ALT_VERBNAMES is also set.
+and \eE. In addition to backslash processing, if the PCRE2_EXTENDED or 
+PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb names is
+skipped, and #-comments are recognized, exactly as in the rest of the pattern.
+PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect verb names unless
+PCRE2_ALT_VERBNAMES is also set.
 .P
 The maximum length of a name is 255 in the 8-bit library and 65535 in the
 16-bit and 32-bit libraries. If the name is empty, that is, if the closing
@@ -3614,6 +3620,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 28 July 2018
+Last updated: 03 August 2018
 Copyright (c) 1997-2018 University of Cambridge.
 .fi
--- a/doc/pcre2syntax.3
+++ b/doc/pcre2syntax.3
@@ -1,4 +1,4 @@
-.TH PCRE2SYNTAX 3 "28 July 2018" "PCRE2 10.32"
+.TH PCRE2SYNTAX 3 "01 August 2018" "PCRE2 10.32"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -421,6 +421,8 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
 .
 .SH "OPTION SETTING"
 .rs
+Changes of these options within a group are automatically cancelled at the end 
+of the group.
 .sp
  (?i)            caseless
  (?J)            allow duplicate names
@@ -619,6 +621,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 28 July 2018
+Last updated: 01 August 2018
 Copyright (c) 1997-2018 University of Cambridge.
 .fi
--- a/src/pcre2_compile.c
+++ b/src/pcre2_compile.c
@@ -2468,11 +2468,17 @@ while (ptr < ptrend)
        /* EITHER: not both options set */
        ((options & (PCRE2_EXTENDED | PCRE2_ALT_VERBNAMES)) !=
                    (PCRE2_EXTENDED | PCRE2_ALT_VERBNAMES)) ||
-        /* OR: character > 255 */
-        c > 255 ||
-        /* OR: not a # comment or white space */
-        (c != CHAR_NUMBER_SIGN && (cb->ctypes[c] & ctype_space) == 0)
-       ))
+#ifdef SUPPORT_UNICODE                     
+        /* OR: character > 255 AND not Unicode Pattern White Space */
+        (c > 255 && (c|1) != 0x200f && (c|1) != 0x2029) ||
+#endif         
+        /* OR: not a # comment or isspace() white space */
+        (c < 256 && c != CHAR_NUMBER_SIGN && (cb->ctypes[c] & ctype_space) == 0
+#ifdef SUPPORT_UNICODE
+        /* and not CHAR_NEL when Unicode is supported */
+          && c != CHAR_NEL
+#endif                     
+       )))
    {
    PCRE2_SIZE verbnamelength;

@@ -2554,11 +2560,18 @@ while (ptr < ptrend)

  /* Skip over whitespace and # comments in extended mode. Note that c is a
  character, not a code unit, so we must not use MAX_255 to test its size
-  because MAX_255 tests code units and is assumed TRUE in 8-bit mode. */
+  because MAX_255 tests code units and is assumed TRUE in 8-bit mode. The
+  whitespace characters are those designated as "Pattern White Space" by
+  Unicode, which are the isspace() characters plus CHAR_NEL (newline), which is 
+  U+0085 in Unicode, plus U+200E, U+200F, U+2028, and U+2029. These are a 
+  subset of space characters that match \h and \v. */

  if ((options & PCRE2_EXTENDED) != 0)
    {
    if (c < 256 && (cb->ctypes[c] & ctype_space) != 0) continue;
+#ifdef SUPPORT_UNICODE     
+    if (c == CHAR_NEL || (c|1) == 0x200f || (c|1) == 0x2029) continue;
+#endif     
    if (c == CHAR_NUMBER_SIGN)
      {
      while (ptr < ptrend)
--- a/testdata/testinput1
+++ b/testdata/testinput1
@@ -6257,5 +6257,5 @@ ef) x/x,mark
 \= Expect no match
    aBCDEF
    AbCDe f
-    
+
 # End of testinput1 
--- a/testdata/testinput4
+++ b/testdata/testinput4
@@ -2293,5 +2293,20 @@

 /[\N{U+1234}]/utf
    \x{1234}
+    
+# Test the full list of Unicode "Pattern White Space" characters that are to
+# be ignored by /x. The pattern lines below may show up oddly in text editors
+# or when listed to the screen. Note that characters such as U+2002, which are
+# matched as space by \h and \v are *not* "Pattern White Space".
+
+/A‎‏  B/x,utf
+    AB
+
+/A B/x,utf
+    A\x{2002}B
+\= Expect no match
+    AB
+    
+# ------- 

 # End of testinput4
--- a/testdata/testinput5
+++ b/testdata/testinput5
@@ -2091,4 +2091,18 @@

 /\N{U}/

+# This tests the non-UTF Unicode NEL pattern whitespace character, only
+# recognized by PCRE2 with /x when there is Unicode support.
+
+/A      
+
…B/x
+    AB 
+    
+# This tests Unicode Pattern White Space characters in verb names when they
+# are being processed with PCRE2_EXTENDED. Note: there are UTF-8 characters
+# with code points greater than 255 between A, B, and C in the pattern.
+
+/(*: Aâ€ŽBâ€¨C)abc/x,utf,mark,alt_verbnames
+    abc
+
 # End of testinput5
--- a/testdata/testoutput1
+++ b/testdata/testoutput1
@ -9920,5 +9920,5 @@ No match, mark = X
 No match
    AbCDe f
 No match
-    
+
 # End of testinput1 
--- a/testdata/testoutput4
+++ b/testdata/testoutput4
@ -3711,5 +3711,23 @@ No match
 /[\N{U+1234}]/utf
    \x{1234}
 0: \x{1234}
+    
+# Test the full list of Unicode "Pattern White Space" characters that are to
+# be ignored by /x. The pattern lines below may show up oddly in text editors
+# or when listed to the screen. Note that characters such as U+2002, which are
+# matched as space by \h and \v are *not* "Pattern White Space".
+
+/A‎‏  B/x,utf
+    AB
+ 0: AB
+
+/A B/x,utf
+    A\x{2002}B
+ 0: A\x{2002}B
+\= Expect no match
+    AB
+No match
+    
+# ------- 

 # End of testinput4
--- a/testdata/testoutput5
+++ b/testdata/testoutput5
@ -4756,4 +4756,21 @@ Failed: error 178 at offset 5: digits missing in \x{} or \o{} or \N{U+}
 /\N{U}/
 Failed: error 137 at offset 2: PCRE2 does not support \F, \L, \l, \N{name}, \U, or \u

+# This tests the non-UTF Unicode NEL pattern whitespace character, only
+# recognized by PCRE2 with /x when there is Unicode support.
+
+/A      
+
…B/x
+    AB 
+ 0: AB
+    
+# This tests Unicode Pattern White Space characters in verb names when they
+# are being processed with PCRE2_EXTENDED. Note: there are UTF-8 characters
+# with code points greater than 255 between A, B, and C in the pattern.
+
+/(*: Aâ€ŽBâ€¨C)abc/x,utf,mark,alt_verbnames
+    abc
+ 0: abc
+MK: ABC
+
 # End of testinput5
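
The new testinput4 cases (U+200E and U+200F are dropped by /x, U+2002 is not) can be mimicked outside PCRE2 with the rough sketch below. It is not how PCRE2 parses patterns. All three function names are hypothetical, the ASCII white space set is hard-coded instead of coming from an isspace() table, the input is assumed to be valid UTF-8, and # comments are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the white space set described in entry 31 above
   (ASCII white space, U+0085, U+200E, U+200F, U+2028, U+2029). */
static int is_pattern_ws(uint32_t c)
{
  return c == 0x20 || (c >= 0x09 && c <= 0x0D) || c == 0x85 ||
         (c | 1) == 0x200F || (c | 1) == 0x2029;
}

/* Decode one character from valid UTF-8; returns the number of bytes used. */
static size_t decode_utf8(const unsigned char *s, uint32_t *cp)
{
  if (s[0] < 0x80) { *cp = s[0]; return 1; }
  if ((s[0] & 0xE0) == 0xC0)
    {
    *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
    return 2;
    }
  if ((s[0] & 0xF0) == 0xE0)
    {
    *cp = ((uint32_t)(s[0] & 0x0F) << 12) |
          ((uint32_t)(s[1] & 0x3F) << 6) | (uint32_t)(s[2] & 0x3F);
    return 3;
    }
  *cp = ((uint32_t)(s[0] & 0x07) << 18) | ((uint32_t)(s[1] & 0x3F) << 12) |
        ((uint32_t)(s[2] & 0x3F) << 6) | (uint32_t)(s[3] & 0x3F);
  return 4;
}

/* Copy `in` to `out`, dropping Pattern White Space characters. `out` must
   have room for strlen(in) + 1 bytes. Returns the length of the result. */
static size_t strip_pattern_ws(const char *in, char *out)
{
  const unsigned char *p = (const unsigned char *)in;
  char *o = out;
  while (*p != 0)
    {
    uint32_t c;
    size_t n = decode_utf8(p, &c);
    if (!is_pattern_ws(c)) { memcpy(o, p, n); o += n; }
    p += n;
    }
  *o = 0;
  return (size_t)(o - out);
}
```

Given the UTF-8 bytes for A, U+200E, U+200F, two spaces and B, `strip_pattern_ws` leaves "AB", which is why the first new test matches the subject AB. With U+2002 between A and B the bytes are copied unchanged, so that pattern only matches a subject containing U+2002.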