Make /x more Perl-compatible by recognizing all of Unicode's "Pattern White Space" characters, not just the ASCII ones.
2018-08-03 09:38:36 +00:00 · b196143523
parent 6e245572b8
commit b196143523
15 changed files with 1374 additions and 1205 deletions
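As a reading aid for the diff that follows, here is a minimal sketch of the character set this commit makes /x ignore when Unicode support is compiled in. The function names `is_pattern_white_space` and `pattern_white_space_self_check` are hypothetical and not part of PCRE2. The ASCII range is hard-coded here as an assumption, whereas PCRE2 itself consults a table built with isspace(). The `(c|1)` comparisons mirror the ones in the pcre2_compile.c hunks.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2: returns nonzero if code point c is
   one of the characters that /x discards after this commit (Unicode build).
   Assumption: the ASCII set is hard-coded instead of using a ctypes table. */
int is_pattern_white_space(uint32_t c)
{
  /* Tab, linefeed, vertical tab, formfeed, carriage return, and space. */
  if ((c >= 0x0009 && c <= 0x000D) || c == 0x0020) return 1;

  /* U+0085 (next line). */
  if (c == 0x0085) return 1;

  /* Setting the low bit maps each even/odd pair onto its odd member, so one
     comparison covers U+200E and U+200F, and another U+2028 and U+2029. */
  return (c | 1) == 0x200F || (c | 1) == 0x2029;
}

/* Usage example: the five new characters are ignored, while U+2002 (en space,
   matched by \h) and U+00A0 (no-break space) are not. */
void pattern_white_space_self_check(void)
{
  assert(is_pattern_white_space(0x0085));
  assert(is_pattern_white_space(0x200E) && is_pattern_white_space(0x200F));
  assert(is_pattern_white_space(0x2028) && is_pattern_white_space(0x2029));
  assert(!is_pattern_white_space(0x2002));
  assert(!is_pattern_white_space(0x00A0));
}
```

The `(c|1)` form saves two comparisons, and it works only because the characters come in adjacent even/odd pairs.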
--- a/7
+++ b/7
@ -133,6 +133,13 @@ terminated by (*ACCEPT).
 29. Add support for \N{U+dddd}, but not in EBCDIC environments.
 30. Add support for (?^) for unsetting all imnsx options.
+31. The PCRE2_EXTENDED (/x) option only ever discarded space characters whose
+code point was less than 256 and that were recognized by the lookup table
+generated by pcre2_maketables(), which uses isspace() to identify white space.
+Now, when Unicode support is compiled, PCRE2_EXTENDED also discards U+0085,
+U+200E, U+200F, U+2028, and U+2029, which are additional characters defined by
+Unicode as "Pattern White Space". This makes PCRE2 compatible with Perl.
 Version 10.31 12-February-2018
--- a/doc/html/pcre2api.html
+++ b/doc/html/pcre2api.html
@ -837,10 +837,10 @@ page for details.
 </P>
 <P>
 When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
-option, the newline convention affects the recognition of white space and the
-end of internal comments starting with #. The value is saved with the compiled
-pattern for subsequent use by the JIT compiler and by the two interpreted
-matching functions, <i>pcre2_match()</i> and <i>pcre2_dfa_match()</i>.
+option, the newline convention affects the recognition of the end of internal
+comments starting with #. The value is saved with the compiled pattern for
+subsequent use by the JIT compiler and by the two interpreted matching
+functions, <i>pcre2_match()</i> and <i>pcre2_dfa_match()</i>.
 <br>
 <br>
 <b>int pcre2_set_parens_nest_limit(pcre2_compile_context *<i>ccontext</i>,</b>
@ -1424,9 +1424,9 @@ include a closing parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES
 option is set, normal backslash processing is applied to verb names and only an
 unescaped closing parenthesis terminates the name. A closing parenthesis can be
 included in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED
-or PCRE2_EXTENDED_MORE option is set, unescaped whitespace in verb names is
-skipped and #-comments are recognized in this mode, exactly as in the rest of
-the pattern.
+or PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
+whitespace in verb names is skipped and #-comments are recognized, exactly as
+in the rest of the pattern.
 <pre>
  PCRE2_AUTO_CALLOUT
 </pre>
@ -1510,15 +1510,36 @@ is not allowed within sequences such as (?&#62; that introduce various
 parenthesized subpatterns, nor within numerical quantifiers such as {1,3}.
 Ignorable white space is permitted between an item and a following quantifier
 and between a quantifier and a following + that indicates possessiveness.
+PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be changed within
+a pattern by a (?x) option setting.
 </P>
 <P>
-PCRE2_EXTENDED also causes characters between an unescaped # outside a
-character class and the next newline, inclusive, to be ignored, which makes it
-possible to include comments inside complicated patterns. Note that the end of
-this type of comment is a literal newline sequence in the pattern; escape
-sequences that happen to represent a newline do not count. PCRE2_EXTENDED is
-equivalent to Perl's /x option, and it can be changed within a pattern by a
-(?x) option setting.
+When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recognizes as
+white space only those characters with code points less than 256 that are
+flagged as white space in its low-character table. The table is normally
+created by
+<a href="pcre2_maketables.html"><b>pcre2_maketables()</b>,</a>
+which uses the <b>isspace()</b> function to identify space characters. In most
+ASCII environments, the relevant characters are those with code points 0x0009
+(tab), 0x000A (linefeed), 0x000B (vertical tab), 0x000C (formfeed), 0x000D
+(carriage return), and 0x0020 (space).
+</P>
+<P>
+When PCRE2 is compiled with Unicode support, in addition to these characters,
+five more Unicode "Pattern White Space" characters are recognized by
+PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-right mark),
+U+200F (right-to-left mark), U+2028 (line separator), and U+2029 (paragraph
+separator). This set of characters is the same as recognized by Perl's /x
+option. Note that the horizontal and vertical space characters that are matched
+by the \h and \v escapes in patterns are a much bigger set.
+</P>
+<P>
+As well as ignoring most white space, PCRE2_EXTENDED also causes characters
+between an unescaped # outside a character class and the next newline,
+inclusive, to be ignored, which makes it possible to include comments inside
+complicated patterns. Note that the end of this type of comment is a literal
+newline sequence in the pattern; escape sequences that happen to represent a
+newline do not count.
 </P>
 <P>
 Which characters are interpreted as newlines can be specified by a setting in
@ -1531,9 +1552,11 @@ built.
  PCRE2_EXTENDED_MORE
 </pre>
 This option has the effect of PCRE2_EXTENDED, but, in addition, unescaped space
-and horizontal tab characters are ignored inside a character class.
-PCRE2_EXTENDED_MORE is equivalent to Perl's 5.26 /xx option, and it can be
-changed within a pattern by a (?xx) option setting.
+and horizontal tab characters are ignored inside a character class. Note: only
+these two characters are ignored, not the full set of pattern white space
+characters that are ignored outside a character class. PCRE2_EXTENDED_MORE is
+equivalent to Perl's /xx option, and it can be changed within a pattern by a
+(?xx) option setting.
 <pre>
  PCRE2_FIRSTLINE
 </pre>
@ -3635,7 +3658,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC42" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 27 July 2018
+Last updated: 03 August 2018
 <br>
 Copyright &copy; 1997-2018 University of Cambridge.
 <br>
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@ -1628,9 +1628,11 @@ alternative in the subpattern.
 <br><a name="SEC13" href="#TOC1">INTERNAL OPTION SETTING</a><br>
 <P>
 The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
-PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options (which
-are Perl-compatible) can be changed from within the pattern by a sequence of
-Perl option letters enclosed between "(?" and ")". The option letters are
+PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options can be
+changed from within the pattern by a sequence of letters enclosed between "(?"
+and ")". These options are Perl-compatible, and are described in detail in the
+<a href="pcre2api.html"><b>pcre2api</b></a>
+documentation. The option letters are:
 <pre>
  i  for PCRE2_CASELESS
  m  for PCRE2_MULTILINE
@ -2275,8 +2277,9 @@ unset value matches an empty string.
 Because there may be many capturing parentheses in a pattern, all digits
 following a backslash are taken as part of a potential backreference number.
 If the pattern continues with a digit character, some delimiter must be used to
-terminate the backreference. If the PCRE2_EXTENDED option is set, this can be
-white space. Otherwise, the \g{ syntax or an empty comment (see
+terminate the backreference. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
+option is set, this can be white space. Otherwise, the \g{ syntax or an empty
+comment (see
 <a href="#comments">"Comments"</a>
 below) can be used.
 </P>
@ -2744,12 +2747,12 @@ no part in the pattern matching.
 <P>
 The sequence (?# marks the start of a comment that continues up to the next
 closing parenthesis. Nested parentheses are not permitted. If the
-PCRE2_EXTENDED option is set, an unescaped # character also introduces a
-comment, which in this case continues to immediately after the next newline
-character or character sequence in the pattern. Which characters are
-interpreted as newlines is controlled by an option passed to the compiling
-function or by a special sequence at the start of the pattern, as described in
-the section entitled
+PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # character
+also introduces a comment, which in this case continues to immediately after
+the next newline character or character sequence in the pattern. Which
+characters are interpreted as newlines is controlled by an option passed to the
+compiling function or by a special sequence at the start of the pattern, as
+described in the section entitled
 <a href="#newlines">"Newline conventions"</a>
 above. Note that the end of this type of comment is a literal newline sequence
 in the pattern; escape sequences that happen to represent a newline do not
@ -3108,10 +3111,11 @@ are faulted.
 </P>
 <P>
 A closing parenthesis can be included in a name either as \) or between \Q
-and \E. In addition to backslash processing, if the PCRE2_EXTENDED option is
-also set, unescaped whitespace in verb names is skipped, and #-comments are
-recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED does not
-affect verb names unless PCRE2_ALT_VERBNAMES is also set.
+and \E. In addition to backslash processing, if the PCRE2_EXTENDED or
+PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb names is
+skipped, and #-comments are recognized, exactly as in the rest of the pattern.
+PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect verb names unless
+PCRE2_ALT_VERBNAMES is also set.
 </P>
 <P>
 The maximum length of a name is 255 in the 8-bit library and 65535 in the
@ -3590,7 +3594,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC30" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 28 July 2018
+Last updated: 03 August 2018
 <br>
 Copyright &copy; 1997-2018 University of Cambridge.
 <br>
--- a/doc/html/pcre2syntax.html
+++ b/doc/html/pcre2syntax.html
@ -446,6 +446,8 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
 </P>
 <br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
 <P>
+Changes of these options within a group are automatically cancelled at the end
+of the group.
 <pre>
  (?i)            caseless
  (?J)            allow duplicate names
@ -632,7 +634,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC27" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 28 July 2018
+Last updated: 01 August 2018
 <br>
 Copyright &copy; 1997-2018 University of Cambridge.
 <br>
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
--- a/doc/pcre2api.3
+++ b/doc/pcre2api.3
@ -1,4 +1,4 @@
-.TH PCRE2API 3 "27 July 2018" "PCRE2 10.32"
+.TH PCRE2API 3 "03 August 2018" "PCRE2 10.32"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .sp
@ -775,10 +775,10 @@ sequence such as (*CRLF). See the
 page for details.
 .P
 When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
-option, the newline convention affects the recognition of white space and the
-end of internal comments starting with #. The value is saved with the compiled
-pattern for subsequent use by the JIT compiler and by the two interpreted
-matching functions, \fIpcre2_match()\fP and \fIpcre2_dfa_match()\fP.
+option, the newline convention affects the recognition of the end of internal
+comments starting with #. The value is saved with the compiled pattern for
+subsequent use by the JIT compiler and by the two interpreted matching
+functions, \fIpcre2_match()\fP and \fIpcre2_dfa_match()\fP.
 .sp
 .nf
 .B int pcre2_set_parens_nest_limit(pcre2_compile_context *\fIccontext\fP,
@ -1356,9 +1356,9 @@ include a closing parenthesis in the name. However, if the PCRE2_ALT_VERBNAMES
 option is set, normal backslash processing is applied to verb names and only an
 unescaped closing parenthesis terminates the name. A closing parenthesis can be
 included in a name either as \e) or between \eQ and \eE. If the PCRE2_EXTENDED
-or PCRE2_EXTENDED_MORE option is set, unescaped whitespace in verb names is
-skipped and #-comments are recognized in this mode, exactly as in the rest of
-the pattern.
+or PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
+whitespace in verb names is skipped and #-comments are recognized, exactly as
+in the rest of the pattern.
 .sp
  PCRE2_AUTO_CALLOUT
 .sp
@ -1445,14 +1445,35 @@ is not allowed within sequences such as (?> that introduce various
 parenthesized subpatterns, nor within numerical quantifiers such as {1,3}.
 Ignorable white space is permitted between an item and a following quantifier
 and between a quantifier and a following + that indicates possessiveness.
+PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be changed within
+a pattern by a (?x) option setting.
 .P
-PCRE2_EXTENDED also causes characters between an unescaped # outside a
-character class and the next newline, inclusive, to be ignored, which makes it
-possible to include comments inside complicated patterns. Note that the end of
-this type of comment is a literal newline sequence in the pattern; escape
-sequences that happen to represent a newline do not count. PCRE2_EXTENDED is
-equivalent to Perl's /x option, and it can be changed within a pattern by a
-(?x) option setting.
+When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recognizes as
+white space only those characters with code points less than 256 that are
+flagged as white space in its low-character table. The table is normally
+created by
+.\" HREF
+\fBpcre2_maketables()\fP,
+.\"
+which uses the \fBisspace()\fP function to identify space characters. In most
+ASCII environments, the relevant characters are those with code points 0x0009
+(tab), 0x000A (linefeed), 0x000B (vertical tab), 0x000C (formfeed), 0x000D
+(carriage return), and 0x0020 (space).
+.P
+When PCRE2 is compiled with Unicode support, in addition to these characters,
+five more Unicode "Pattern White Space" characters are recognized by
+PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-right mark),
+U+200F (right-to-left mark), U+2028 (line separator), and U+2029 (paragraph
+separator). This set of characters is the same as recognized by Perl's /x
+option. Note that the horizontal and vertical space characters that are matched
+by the \eh and \ev escapes in patterns are a much bigger set.
+.P
+As well as ignoring most white space, PCRE2_EXTENDED also causes characters
+between an unescaped # outside a character class and the next newline,
+inclusive, to be ignored, which makes it possible to include comments inside
+complicated patterns. Note that the end of this type of comment is a literal
+newline sequence in the pattern; escape sequences that happen to represent a
+newline do not count.
 .P
 Which characters are interpreted as newlines can be specified by a setting in
 the compile context that is passed to \fBpcre2_compile()\fP or by a special
@ -1467,9 +1488,11 @@ built.
  PCRE2_EXTENDED_MORE
 .sp
 This option has the effect of PCRE2_EXTENDED, but, in addition, unescaped space
-and horizontal tab characters are ignored inside a character class.
-PCRE2_EXTENDED_MORE is equivalent to Perl's 5.26 /xx option, and it can be
-changed within a pattern by a (?xx) option setting.
+and horizontal tab characters are ignored inside a character class. Note: only
+these two characters are ignored, not the full set of pattern white space
+characters that are ignored outside a character class. PCRE2_EXTENDED_MORE is
+equivalent to Perl's /xx option, and it can be changed within a pattern by a
+(?xx) option setting.
 .sp
  PCRE2_FIRSTLINE
 .sp
@ -3641,6 +3664,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 27 July 2018
+Last updated: 03 August 2018
 Copyright (c) 1997-2018 University of Cambridge.
 .fi
--- a/doc/pcre2pattern.3
+++ b/doc/pcre2pattern.3
@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "28 July 2018" "PCRE2 10.32"
+.TH PCRE2PATTERN 3 "03 August 2018" "PCRE2 10.32"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -1627,9 +1627,13 @@ alternative in the subpattern.
 .rs
 .sp
 The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
-PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options (which
-are Perl-compatible) can be changed from within the pattern by a sequence of
-Perl option letters enclosed between "(?" and ")". The option letters are
+PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options can be
+changed from within the pattern by a sequence of letters enclosed between "(?"
+and ")". These options are Perl-compatible, and are described in detail in the
+.\" HREF
+\fBpcre2api\fP
+.\"
+documentation. The option letters are:
 .sp
  i  for PCRE2_CASELESS
  m  for PCRE2_MULTILINE
@ -2273,8 +2277,9 @@ unset value matches an empty string.
 Because there may be many capturing parentheses in a pattern, all digits
 following a backslash are taken as part of a potential backreference number.
 If the pattern continues with a digit character, some delimiter must be used to
-terminate the backreference. If the PCRE2_EXTENDED option is set, this can be
-white space. Otherwise, the \eg{ syntax or an empty comment (see
+terminate the backreference. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
+option is set, this can be white space. Otherwise, the \eg{ syntax or an empty
+comment (see
 .\" HTML <a href="#comments">
 .\" </a>
 "Comments"
@ -2762,12 +2767,12 @@ no part in the pattern matching.
 .P
 The sequence (?# marks the start of a comment that continues up to the next
 closing parenthesis. Nested parentheses are not permitted. If the
-PCRE2_EXTENDED option is set, an unescaped # character also introduces a
-comment, which in this case continues to immediately after the next newline
-character or character sequence in the pattern. Which characters are
-interpreted as newlines is controlled by an option passed to the compiling
-function or by a special sequence at the start of the pattern, as described in
-the section entitled
+PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # character
+also introduces a comment, which in this case continues to immediately after
+the next newline character or character sequence in the pattern. Which
+characters are interpreted as newlines is controlled by an option passed to the
+compiling function or by a special sequence at the start of the pattern, as
+described in the section entitled
 .\" HTML <a href="#newlines">
 .\" </a>
 "Newline conventions"
@ -3132,10 +3137,11 @@ only backslash items that are permitted are \eQ, \eE, and sequences such as
 are faulted.
 .P
 A closing parenthesis can be included in a name either as \e) or between \eQ
-and \eE. In addition to backslash processing, if the PCRE2_EXTENDED option is
-also set, unescaped whitespace in verb names is skipped, and #-comments are
-recognized, exactly as in the rest of the pattern. PCRE2_EXTENDED does not
-affect verb names unless PCRE2_ALT_VERBNAMES is also set.
+and \eE. In addition to backslash processing, if the PCRE2_EXTENDED or
+PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb names is
+skipped, and #-comments are recognized, exactly as in the rest of the pattern.
+PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect verb names unless
+PCRE2_ALT_VERBNAMES is also set.
 .P
 The maximum length of a name is 255 in the 8-bit library and 65535 in the
 16-bit and 32-bit libraries. If the name is empty, that is, if the closing
@ -3614,6 +3620,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 28 July 2018
+Last updated: 03 August 2018
 Copyright (c) 1997-2018 University of Cambridge.
 .fi
--- a/doc/pcre2syntax.3
+++ b/doc/pcre2syntax.3
@ -1,4 +1,4 @@
-.TH PCRE2SYNTAX 3 "28 July 2018" "PCRE2 10.32"
+.TH PCRE2SYNTAX 3 "01 August 2018" "PCRE2 10.32"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -421,6 +421,8 @@ but some of them use Unicode properties if PCRE2_UCP is set. You can use
 .
 .SH "OPTION SETTING"
 .rs
+Changes of these options within a group are automatically cancelled at the end
+of the group.
 .sp
  (?i)            caseless
  (?J)            allow duplicate names
@ -619,6 +621,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 28 July 2018
+Last updated: 01 August 2018
 Copyright (c) 1997-2018 University of Cambridge.
 .fi
--- a/src/pcre2_compile.c
+++ b/src/pcre2_compile.c
@ -2468,11 +2468,17 @@ while (ptr < ptrend)
        /* EITHER: not both options set */
        ((options & (PCRE2_EXTENDED | PCRE2_ALT_VERBNAMES)) !=
                    (PCRE2_EXTENDED | PCRE2_ALT_VERBNAMES)) ||
-        /* OR: character > 255 */
-        c > 255 ||
-        /* OR: not a # comment or white space */
-        (c != CHAR_NUMBER_SIGN && (cb->ctypes[c] & ctype_space) == 0)
-       ))
+#ifdef SUPPORT_UNICODE
+        /* OR: character > 255 AND not Unicode Pattern White Space */
+        (c > 255 && (c|1) != 0x200f && (c|1) != 0x2029) ||
+#endif
+        /* OR: not a # comment or isspace() white space */
+        (c < 256 && c != CHAR_NUMBER_SIGN && (cb->ctypes[c] & ctype_space) == 0
+#ifdef SUPPORT_UNICODE
+        /* and not CHAR_NEL when Unicode is supported */
+          && c != CHAR_NEL
+#endif
+       )))
    {
    PCRE2_SIZE verbnamelength;
@ -2554,11 +2560,18 @@ while (ptr < ptrend)
  /* Skip over whitespace and # comments in extended mode. Note that c is a
  character, not a code unit, so we must not use MAX_255 to test its size
-  because MAX_255 tests code units and is assumed TRUE in 8-bit mode. */
+  because MAX_255 tests code units and is assumed TRUE in 8-bit mode. The
+  whitespace characters are those designated as "Pattern White Space" by
+  Unicode, which are the isspace() characters plus CHAR_NEL (newline), which is
+  U+0085 in Unicode, plus U+200E, U+200F, U+2028, and U+2029. These are a
+  subset of space characters that match \h and \v. */
   if ((options & PCRE2_EXTENDED) != 0)
     {
     if (c < 256 && (cb->ctypes[c] & ctype_space) != 0) continue;
+#ifdef SUPPORT_UNICODE
+    if (c == CHAR_NEL || (c|1) == 0x200f || (c|1) == 0x2029) continue;
+#endif
    if (c == CHAR_NUMBER_SIGN)
      {
      while (ptr < ptrend)
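The extended-mode skip in the pcre2_compile.c hunk above can be illustrated with a simplified, standalone sketch. It is not the PCRE2 implementation: `pws_char` and `strip_extended` are hypothetical names, and the pattern is assumed to be an array of code points already. Only a literal linefeed ends a #-comment, whereas PCRE2 honours the configured newline convention. Escapes, character classes, and \Q...\E are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical predicate for the characters /x ignores in a Unicode build.
   Assumption: ASCII white space is hard-coded rather than table-driven. */
int pws_char(uint32_t c)
{
  if ((c >= 0x0009 && c <= 0x000D) || c == 0x0020 || c == 0x0085) return 1;
  return (c | 1) == 0x200F || (c | 1) == 0x2029;
}

/* Copies src to dst, dropping pattern white space and #-comments. Returns the
   number of code points written; dst must have room for len entries. */
size_t strip_extended(const uint32_t *src, size_t len, uint32_t *dst)
{
  size_t out = 0;
  assert(src != NULL && dst != NULL);

  for (size_t i = 0; i < len; i++)
    {
    uint32_t c = src[i];
    if (pws_char(c)) continue;            /* ignorable white space */
    if (c == '#')
      {
      /* Advance to the terminating linefeed or the end of the pattern; the
         loop increment then steps past the linefeed itself. */
      while (i < len && src[i] != '\n') i++;
      continue;
      }
    dst[out++] = c;                       /* significant pattern character */
    }
  return out;
}
```

Under these assumptions, the code points `A`, U+200E, space, `#`, `x`, linefeed, U+2028, `B`, U+2002, `C` reduce to `A`, `B`, U+2002, `C`. U+2002 survives because it is not "Pattern White Space", which matches the second test in testinput4.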
--- a/testdata/testinput1
+++ b/testdata/testinput1
@ -6257,5 +6257,5 @@ ef) x/x,mark
 \= Expect no match
    aBCDEF
    AbCDe f
-    
+
 # End of testinput1 
--- a/testdata/testinput4
+++ b/testdata/testinput4
@ -2293,5 +2293,20 @@
 /[\N{U+1234}]/utf
    \x{1234}
 # Test the full list of Unicode "Pattern White Space" characters that are to
 # be ignored by /x. The pattern lines below may show up oddly in text editors
 # or when listed to the screen. Note that characters such as U+2002, which are
 # matched as space by \h and \v are *not* "Pattern White Space".
 /A‎‏  B/x,utf
    AB
 /A B/x,utf
    A\x{2002}B
 \= Expect no match
    AB
 # ------- 
 # End of testinput4
--- a/testdata/testinput5
+++ b/testdata/testinput5
@ -2091,4 +2091,18 @@
 /\N{U}/
 # This tests the non-UTF Unicode NEL pattern whitespace character, only
 # recognized by PCRE2 with /x when there is Unicode support.
 /A      
 
…B/x
    AB 
 # This tests Unicode Pattern White Space characters in verb names when they
 # are being processed with PCRE2_EXTENDED. Note: there are UTF-8 characters
 # with code points greater than 255 between A, B, and C in the pattern.
 /(*: Aâ€ŽBâ€¨C)abc/x,utf,mark,alt_verbnames
    abc
 # End of testinput5
--- a/testdata/testoutput1
+++ b/testdata/testoutput1
@ -9920,5 +9920,5 @@ No match, mark = X
 No match
    AbCDe f
 No match
-    
+
 # End of testinput1 
--- a/testdata/testoutput4
+++ b/testdata/testoutput4
@ -3711,5 +3711,23 @@ No match
 /[\N{U+1234}]/utf
    \x{1234}
 0: \x{1234}
 # Test the full list of Unicode "Pattern White Space" characters that are to
 # be ignored by /x. The pattern lines below may show up oddly in text editors
 # or when listed to the screen. Note that characters such as U+2002, which are
 # matched as space by \h and \v are *not* "Pattern White Space".
 /A‎‏  B/x,utf
    AB
 0: AB
 /A B/x,utf
    A\x{2002}B
 0: A\x{2002}B
 \= Expect no match
    AB
 No match
 # ------- 
 # End of testinput4
--- a/testdata/testoutput5
+++ b/testdata/testoutput5
@ -4756,4 +4756,21 @@ Failed: error 178 at offset 5: digits missing in \x{} or \o{} or \N{U+}
 /\N{U}/
 Failed: error 137 at offset 2: PCRE2 does not support \F, \L, \l, \N{name}, \U, or \u
 # This tests the non-UTF Unicode NEL pattern whitespace character, only
 # recognized by PCRE2 with /x when there is Unicode support.
 /A      
 
…B/x
    AB 
 0: AB
 # This tests Unicode Pattern White Space characters in verb names when they
 # are being processed with PCRE2_EXTENDED. Note: there are UTF-8 characters
 # with code points greater than 255 between A, B, and C in the pattern.
 /(*: Aâ€ŽBâ€¨C)abc/x,utf,mark,alt_verbnames
    abc
 0: abc
 MK: ABC
 # End of testinput5