Documentation and tests update for script runs.

2018-10-12 17:02:34 +00:00 · 2018-10-12 17:02:34 +00:00 · 0fc5cda13b
parent 4e7a204d18
commit 0fc5cda13b
14 changed files with 1697 additions and 1126 deletions
--- a/3
+++ b/3
@ -31,8 +31,7 @@ hexadecimal digit" bit was removed. The default tables in
 src/pcre2_chartables.c.dist are updated.

 8. Implement the new Perl "script run" features (*script_run:...) and 
-(*atomic_script_run:...) aka (*sr:...) and (*asr:...). At present, this is 
-not yet documented.
+(*atomic_script_run:...) aka (*sr:...) and (*asr:...).


 Version 10.32 10-September-2018
--- a/doc/html/pcre2matching.html
+++ b/doc/html/pcre2matching.html
@ -134,7 +134,8 @@ do want multiple matches in such cases, either use an ungreedy repeat
 </P>
 <P>
 There are a number of features of PCRE2 regular expressions that are not
-supported by the alternative matching algorithm. They are as follows:
+supported or behave differently in the alternative matching function. Those
+that are not supported cause an error if encountered.
 </P>
 <P>
 1. Because the algorithm finds all possible matches, the greedy or ungreedy
@ -159,29 +160,32 @@ do this. This means that no captured substrings are available.
 </P>
 <P>
 3. Because no substrings are captured, backreferences within the pattern are
-not supported, and cause errors if encountered.
+not supported.
 </P>
 <P>
 4. For the same reason, conditional expressions that use a backreference as the
 condition or test for a specific group recursion are not supported.
 </P>
 <P>
-5. Because many paths through the tree may be active, the \K escape sequence,
-which resets the start of the match when encountered (but may be on some paths
-and not on others), is not supported. It causes an error if encountered.
+5. Again for the same reason, script runs are not supported.
 </P>
 <P>
-6. Callouts are supported, but the value of the <i>capture_top</i> field is
+6. Because many paths through the tree may be active, the \K escape sequence,
+which resets the start of the match when encountered (but may be on some paths
+and not on others), is not supported.
+</P>
+<P>
+7. Callouts are supported, but the value of the <i>capture_top</i> field is
 always 1, and the value of the <i>capture_last</i> field is always 0.
 </P>
 <P>
-7. The \C escape sequence, which (in the standard algorithm) always matches a
+8. The \C escape sequence, which (in the standard algorithm) always matches a
 single code unit, even in a UTF mode, is not supported in these modes, because
 the alternative algorithm moves through the subject string one character (not
 code unit) at a time, for all active paths through the tree.
 </P>
 <P>
-8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
+9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
 supported. (*FAIL) is supported, and behaves like a failing negative assertion.
 </P>
 <br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
@ -215,7 +219,7 @@ because it has to search for all possible matches, but is also because it is
 less susceptible to optimization.
 </P>
 <P>
-2. Capturing parentheses and backreferences are not supported.
+2. Capturing parentheses, backreferences, and script runs are not supported.
 </P>
 <P>
 3. Although atomic groups are supported, their use does not provide the
@ -232,9 +236,9 @@ Cambridge, England.
 </P>
 <br><a name="SEC8" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 29 September 2014
+Last updated: 10 October 2018
 <br>
-Copyright &copy; 1997-2014 University of Cambridge.
+Copyright &copy; 1997-2018 University of Cambridge.
 <br>
 <p>
 Return to the <a href="index.html">PCRE2 index page</a>.
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@ -33,16 +33,17 @@ please consult the man page, in case the conversion went wrong.
 <li><a name="TOC18" href="#SEC18">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
 <li><a name="TOC19" href="#SEC19">BACKREFERENCES</a>
 <li><a name="TOC20" href="#SEC20">ASSERTIONS</a>
-<li><a name="TOC21" href="#SEC21">CONDITIONAL SUBPATTERNS</a>
-<li><a name="TOC22" href="#SEC22">COMMENTS</a>
-<li><a name="TOC23" href="#SEC23">RECURSIVE PATTERNS</a>
-<li><a name="TOC24" href="#SEC24">SUBPATTERNS AS SUBROUTINES</a>
-<li><a name="TOC25" href="#SEC25">ONIGURUMA SUBROUTINE SYNTAX</a>
-<li><a name="TOC26" href="#SEC26">CALLOUTS</a>
-<li><a name="TOC27" href="#SEC27">BACKTRACKING CONTROL</a>
-<li><a name="TOC28" href="#SEC28">SEE ALSO</a>
-<li><a name="TOC29" href="#SEC29">AUTHOR</a>
-<li><a name="TOC30" href="#SEC30">REVISION</a>
+<li><a name="TOC21" href="#SEC21">SCRIPT RUNS</a>
+<li><a name="TOC22" href="#SEC22">CONDITIONAL SUBPATTERNS</a>
+<li><a name="TOC23" href="#SEC23">COMMENTS</a>
+<li><a name="TOC24" href="#SEC24">RECURSIVE PATTERNS</a>
+<li><a name="TOC25" href="#SEC25">SUBPATTERNS AS SUBROUTINES</a>
+<li><a name="TOC26" href="#SEC26">ONIGURUMA SUBROUTINE SYNTAX</a>
+<li><a name="TOC27" href="#SEC27">CALLOUTS</a>
+<li><a name="TOC28" href="#SEC28">BACKTRACKING CONTROL</a>
+<li><a name="TOC29" href="#SEC29">SEE ALSO</a>
+<li><a name="TOC30" href="#SEC30">AUTHOR</a>
+<li><a name="TOC31" href="#SEC31">REVISION</a>
 </ul>
 <br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION DETAILS</a><br>
 <P>
@ -756,7 +757,7 @@ sequences that match characters with specific properties are available. In
 8-bit non-UTF-8 mode, these sequences are of course limited to testing
 characters whose code points are less than 256, but they do work in this mode.
 In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode limit)
-may be encountered. These are all treated as being in the Common script and
+may be encountered. These are all treated as being in the Unknown script and
 with an unassigned type. The extra escape sequences are:
 <pre>
  \p{<i>xx</i>}   a character with the <i>xx</i> property
@ -780,8 +781,10 @@ example:
  \p{Greek}
  \P{Han}
 </pre>
-Those that are not part of an identified script are lumped together as
-"Common". The current list of scripts is:
+Unassigned characters (and in non-UTF 32-bit mode, characters with code points
+greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
+part of an identified script are lumped together as "Common". The current list
+of scripts is:
 </P>
 <P>
 Adlam,
@ -928,6 +931,7 @@ Tibetan,
 Tifinagh,
 Tirhuta,
 Ugaritic,
+Unknown,
 Vai,
 Warang_Citi,
 Yi,
@ -2589,8 +2593,70 @@ preceded by "foo", while
 </pre>
 is another pattern that matches "foo" preceded by three digits and any three
 characters that are not "999".
+</P>
+<br><a name="SEC21" href="#TOC1">SCRIPT RUNS</a><br>
+<P>
+In concept, a script run is a sequence of characters that are all from the same
+Unicode script such as Latin or Greek. However, because some scripts are
+commonly used together, and because some diacritical and other marks are used
+with multiple scripts, it is not that simple. There is a full description of
+the rules that PCRE2 uses in the section entitled
+<a href="pcre2unicode.html#scriptruns">"Script Runs"</a>
+in the
+<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
+documentation.
+</P>
+<P>
+If part of a pattern is enclosed between (*script_run: or (*sr: and a closing
+parenthesis, it fails if the sequence of characters that it matches is not a
+script run. After a failure, normal backtracking occurs. Script runs can be
+used to detect spoofing attacks using characters that look the same, but are
+from different scripts. The string "paypal.com" is an infamous example, where
+the letters could be a mixture of Latin and Cyrillic. This pattern ensures that
+the matched characters in a sequence of non-spaces that follow white space are
+a script run:
+<pre>
+  \s+(*sr:\S+)
+</pre>
+To be sure that they are all from the Latin script (for example), a lookahead
+can be used:
+<pre>
+  \s+(?=\p{Latin})(*sr:\S+)
+</pre>
+This works as long as the first character is expected to be a character in that 
+script, and not (for example) punctuation, which is allowed with any script. If
+this is not the case, a more creative lookahead is needed. For example, if 
+digits, underscore, and dots are permitted at the start:
+<pre>
+  \s+(?=[0-9_.]*\p{Latin})(*sr:\S+)
+
+</PRE>
+</P>
+<P>
+In many cases, backtracking into a script run pattern fragment is not
+desirable. The script run can employ an atomic group to prevent this. Because
+this is a common requirement, a shorthand notation is provided by
+(*atomic_script_run: or (*asr:
+<pre>
+  (*asr:...) is the same as (*sr:(?&#62;...))
+</pre>
+Note that the atomic group is inside the script run. Putting it outside would
+not prevent backtracking into the script run pattern.
+</P>
+<P>
+Support for script runs is not available if PCRE2 is compiled without Unicode
+support. A compile-time error is given if any of the above constructs is
+encountered. Script runs are not supported by the alternative matching function,
+<b>pcre2_dfa_match()</b>, because they use the same mechanism as capturing
+parentheses.
+</P>
+<P>
+<b>Warning:</b> The (*ACCEPT) control verb
+<a href="#acceptverb">(see below)</a>
+should not be used within a script run subpattern, because it causes an
+immediate exit from the subpattern, bypassing the script run checking.
 <a name="conditions"></a></P>
-<br><a name="SEC21" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br>
+<br><a name="SEC22" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br>
 <P>
 It is possible to cause the matching process to obey a subpattern
 conditionally or to choose between two alternative subpatterns, depending on
@ -2790,7 +2856,7 @@ positive and negative assertions, because matching always continues after the
 assertion, whether it succeeds or fails. (Compare non-conditional assertions,
 when captures are retained only for positive assertions that succeed.)
 <a name="comments"></a></P>
-<br><a name="SEC22" href="#TOC1">COMMENTS</a><br>
+<br><a name="SEC23" href="#TOC1">COMMENTS</a><br>
 <P>
 There are two ways of including comments in patterns that are processed by
 PCRE2. In both cases, the start of the comment must not be in a character
@ -2820,7 +2886,7 @@ a newline in the pattern. The sequence \n is still literal at this stage, so
 it does not terminate the comment. Only an actual character with the code value
 0x0a (the default newline) does so.
 <a name="recursion"></a></P>
-<br><a name="SEC23" href="#TOC1">RECURSIVE PATTERNS</a><br>
+<br><a name="SEC24" href="#TOC1">RECURSIVE PATTERNS</a><br>
 <P>
 Consider the problem of matching a string in parentheses, allowing for
 unlimited nested parentheses. Without the use of recursion, the best that can
@ -3008,7 +3074,7 @@ alternative matches "a" and then recurses. In the recursion, \1 does now match
 "b" and so the whole match succeeds. This match used to fail in Perl, but in
 later versions (I tried 5.024) it now works.
 <a name="subpatternsassubroutines"></a></P>
-<br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
+<br><a name="SEC25" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
 <P>
 If the syntax for a recursive subpattern call (either by number or by
 name) is used outside the parentheses to which it refers, it operates a bit
@ -3057,7 +3123,7 @@ in subpatterns when called as subroutines is described in the section entitled
 <a href="#btsub">"Backtracking verbs in subroutines"</a>
 below.
 <a name="onigurumasubroutines"></a></P>
-<br><a name="SEC25" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
+<br><a name="SEC26" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
 <P>
 For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
 a number enclosed either in angle brackets or single quotes, is an alternative
@ -3075,7 +3141,7 @@ plus or a minus sign it is taken as a relative reference. For example:
 Note that \g{...} (Perl syntax) and \g&#60;...&#62; (Oniguruma syntax) are <i>not</i>
 synonymous. The former is a backreference; the latter is a subroutine call.
 </P>
-<br><a name="SEC26" href="#TOC1">CALLOUTS</a><br>
+<br><a name="SEC27" href="#TOC1">CALLOUTS</a><br>
 <P>
 Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
 code to be obeyed in the middle of matching a regular expression. This makes it
@ -3151,7 +3217,7 @@ example:
 </pre>
 The doubling is removed before the string is passed to the callout function.
 <a name="backtrackcontrol"></a></P>
-<br><a name="SEC27" href="#TOC1">BACKTRACKING CONTROL</a><br>
+<br><a name="SEC28" href="#TOC1">BACKTRACKING CONTROL</a><br>
 <P>
 There are a number of special "Backtracking Control Verbs" (to use Perl's
 terminology) that modify the behaviour of backtracking during matching. They
@ -3222,7 +3288,7 @@ documentation.
 <P>
 Experiments with Perl suggest that it too has similar optimizations, and like
 PCRE2, turning them off can change the result of a match.
-</P>
+<a name="acceptverb"></a></P>
 <br><b>
 Verbs that act immediately
 </b><br>
@ -3245,6 +3311,11 @@ example:
 </pre>
 This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
 the outer parentheses.
+</P>
+<P>
+<b>Warning:</b> (*ACCEPT) should not be used within a script run subpattern,
+because it causes an immediate exit from the subpattern, bypassing the script
+run checking.
 <pre>
  (*FAIL) or (*FAIL:NAME)
 </pre>
@ -3644,12 +3715,12 @@ behaviour). However, if there is no such group within the subroutine
 subpattern, the subroutine match fails and there is a backtrack at the outer
 level.
 </P>
-<br><a name="SEC28" href="#TOC1">SEE ALSO</a><br>
+<br><a name="SEC29" href="#TOC1">SEE ALSO</a><br>
 <P>
 <b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2matching</b>(3),
 <b>pcre2syntax</b>(3), <b>pcre2</b>(3).
 </P>
-<br><a name="SEC29" href="#TOC1">AUTHOR</a><br>
+<br><a name="SEC30" href="#TOC1">AUTHOR</a><br>
 <P>
 Philip Hazel
 <br>
@ -3658,9 +3729,9 @@ University Computing Service
 Cambridge, England.
 <br>
 </P>
-<br><a name="SEC30" href="#TOC1">REVISION</a><br>
+<br><a name="SEC31" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 24 September 2018
+Last updated: 12 October 2018
 <br>
 Copyright &copy; 1997-2018 University of Cambridge.
 <br>
--- a/doc/html/pcre2syntax.html
+++ b/doc/html/pcre2syntax.html
@ -32,14 +32,15 @@ please consult the man page, in case the conversion went wrong.
 <li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
 <li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
 <li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
-<li><a name="TOC20" href="#SEC20">BACKREFERENCES</a>
-<li><a name="TOC21" href="#SEC21">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
-<li><a name="TOC22" href="#SEC22">CONDITIONAL PATTERNS</a>
-<li><a name="TOC23" href="#SEC23">BACKTRACKING CONTROL</a>
-<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
-<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
-<li><a name="TOC26" href="#SEC26">AUTHOR</a>
-<li><a name="TOC27" href="#SEC27">REVISION</a>
+<li><a name="TOC20" href="#SEC20">SCRIPT RUNS</a>
+<li><a name="TOC21" href="#SEC21">BACKREFERENCES</a>
+<li><a name="TOC22" href="#SEC22">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
+<li><a name="TOC23" href="#SEC23">CONDITIONAL PATTERNS</a>
+<li><a name="TOC24" href="#SEC24">BACKTRACKING CONTROL</a>
+<li><a name="TOC25" href="#SEC25">CALLOUTS</a>
+<li><a name="TOC26" href="#SEC26">SEE ALSO</a>
+<li><a name="TOC27" href="#SEC27">AUTHOR</a>
+<li><a name="TOC28" href="#SEC28">REVISION</a>
 </ul>
 <br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
 <P>
@ -533,7 +534,17 @@ setting with a similar syntax.
 </pre>
 Each top-level branch of a lookbehind must be of a fixed length.
 </P>
-<br><a name="SEC20" href="#TOC1">BACKREFERENCES</a><br>
+<br><a name="SEC20" href="#TOC1">SCRIPT RUNS</a><br>
+<P>
+<pre>
+  (*script_run:...)           ) script run, can be backtracked into
+  (*sr:...)                   )
+
+  (*atomic_script_run:...)    ) atomic script run
+  (*asr:...)                  )
+</PRE>
+</P>
+<br><a name="SEC21" href="#TOC1">BACKREFERENCES</a><br>
 <P>
 <pre>
  \n              reference by number (can be ambiguous)
@ -550,7 +561,7 @@ Each top-level branch of a lookbehind must be of a fixed length.
  (?P=name)       reference by name (Python)
 </PRE>
 </P>
-<br><a name="SEC21" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
+<br><a name="SEC22" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
 <P>
 <pre>
  (?R)            recurse whole pattern
@ -569,7 +580,7 @@ Each top-level branch of a lookbehind must be of a fixed length.
  \g'-n'          call subpattern by relative number (PCRE2 extension)
 </PRE>
 </P>
-<br><a name="SEC22" href="#TOC1">CONDITIONAL PATTERNS</a><br>
+<br><a name="SEC23" href="#TOC1">CONDITIONAL PATTERNS</a><br>
 <P>
 <pre>
  (?(condition)yes-pattern)
@ -592,7 +603,7 @@ Note the ambiguity of (?(R) and (?(Rn) which might be named reference
 conditions or recursion tests. Such a condition is interpreted as a reference
 condition if the relevant named group exists.
 </P>
-<br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
+<br><a name="SEC24" href="#TOC1">BACKTRACKING CONTROL</a><br>
 <P>
 All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
 name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
@ -619,7 +630,7 @@ pattern is not anchored.
 The effect of one of these verbs in a group called as a subroutine is confined
 to the subroutine call.
 </P>
-<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
+<br><a name="SEC25" href="#TOC1">CALLOUTS</a><br>
 <P>
 <pre>
  (?C)            callout (assumed number 0)
@ -630,12 +641,12 @@ The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
 start and the end), and the starting delimiter { matched with the ending
 delimiter }. To encode the ending delimiter within the string, double it.
 </P>
-<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
+<br><a name="SEC26" href="#TOC1">SEE ALSO</a><br>
 <P>
 <b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
 <b>pcre2matching</b>(3), <b>pcre2</b>(3).
 </P>
-<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
+<br><a name="SEC27" href="#TOC1">AUTHOR</a><br>
 <P>
 Philip Hazel
 <br>
@ -644,9 +655,9 @@ University Computing Service
 Cambridge, England.
 <br>
 </P>
-<br><a name="SEC27" href="#TOC1">REVISION</a><br>
+<br><a name="SEC28" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 24 September 2018
+Last updated: 10 October 2018
 <br>
 Copyright &copy; 1997-2018 University of Cambridge.
 <br>
--- a/doc/html/pcre2unicode.html
+++ b/doc/html/pcre2unicode.html
@ -124,6 +124,116 @@ for characters whose code points are less than 128 and that have at most two
 case-equivalent values. For these, a direct table lookup is used for speed. A
 few Unicode characters such as Greek sigma have more than two code points that
 are case-equivalent, and these are treated as such.
+<a name="scriptruns"></a></P>
+<br><b>
+SCRIPT RUNS
+</b><br>
+<P>
+The pattern constructs (*script_run:...) and (*atomic_script_run:...), with
+synonyms (*sr:...) and (*asr:...), verify that the string matched within the
+parentheses is a script run. In concept, a script run is a sequence of
+characters that are all from the same Unicode script. However, because some
+scripts are commonly used together, and because some diacritical and other
+marks are used with multiple scripts, it is not that simple.
+</P>
+<P>
+Every Unicode character has a Script property, mostly with a value 
+corresponding to the name of a script, such as Latin, Greek, or Cyrillic. There
+are also three special values:
+</P>
+<P>
+"Unknown" is used for code points that have not been assigned, and also for the
+surrogate code points. In the PCRE2 32-bit library, characters whose code
+points are greater than the Unicode maximum (U+10FFFF), which are accessible 
+only in non-UTF mode, are assigned the Unknown script.
+</P>
+<P>
+"Common" is used for characters that are used with many scripts. These include
+punctuation, emoji, mathematical, musical, and currency symbols, and the ASCII
+digits 0 to 9.
+</P>
+<P>
+"Inherited" is used for characters such as diacritical marks that modify a
+previous character. These are considered to take on the script of the character
+that they modify.
+</P>
+<P>
+Some Inherited characters are used with many scripts, but many of them are only 
+normally used with a small number of scripts. For example, U+102E0 (Coptic 
+Epact thousands mark) is used only with Arabic and Coptic. In order to make it 
+possible to check this, a Unicode property called Script Extension exists. Its 
+value is a list of scripts that apply to the character. For the majority of 
+characters, the list contains just one script, the same one as the Script
+property. However, for characters such as U+102E0, more than one script is
+listed. There are also some Common characters that have a single, non-Common
+script in their Script Extension list.
+</P>
+<P>
+The next section describes the basic rules for deciding whether a given string 
+of characters is a script run. Note, however, that there are some special cases 
+involving the Chinese Han script, and an additional constraint for decimal 
+digits. These are covered in subsequent sections.
+</P>
+<br><b>
+Basic script run rules
+</b><br>
+<P>
+A string that is less than two characters long is a script run. This is the
+only case in which an Unknown character can be part of a script run. Longer
+strings are checked using only the Script Extensions property, not the basic
+Script property.
+</P>
+<P>
+If a character's Script Extension property is the single value "Inherited", it
+is always accepted as part of a script run. This is also true for the property
+"Common", subject to the checking of decimal digits described below. All the
+remaining characters in a script run must have at least one script in common in
+their Script Extension lists. In set-theoretic terminology, the intersection of
+all the sets of scripts must not be empty.
+</P>
+<P>
+A simple example is an Internet name such as "google.com". The letters are all
+in the Latin script, and the dot is Common, so this string is a script run.
+However, the Cyrillic letter "o" looks exactly the same as the Latin "o"; a
+string that looks the same, but with Cyrillic "o"s, is not a script run.
+</P>
+<P>
+More interesting examples involve characters with more than one script in their 
+Script Extension. Consider the following characters:
+<pre>
+  U+060C  Arabic comma
+  U+06D4  Arabic full stop
+</pre>
+The first has the Script Extension list Arabic, Hanifi Rohingya, Syriac, and 
+Thaana; the second has just Arabic and Hanifi Rohingya. Both of them could
+appear in script runs of either Arabic or Hanifi Rohingya. The first could also
+appear in Syriac or Thaana script runs, but the second could not.
+</P>
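The basic rules above can be sketched as a set intersection. This is a minimal illustration, not PCRE2's implementation: the table SCRIPT_EXTENSIONS and the function is_script_run are hypothetical names, and the table is written by hand for a few characters. The two punctuation entries follow the lists quoted above; the single-script entries for the two letters are assumptions. The Han special cases and the decimal digit check are deliberately left out.

```python
# Toy Script Extension table, hand-written for this sketch. It is NOT generated
# from the Unicode Character Database; the letter entries are assumptions.
SCRIPT_EXTENSIONS = {
    "\u0628": {"Arabic"},  # ARABIC LETTER BEH (assumed single-script)
    "\u0710": {"Syriac"},  # SYRIAC LETTER ALAPH (assumed single-script)
    "\u060C": {"Arabic", "Hanifi_Rohingya", "Syriac", "Thaana"},  # Arabic comma
    "\u06D4": {"Arabic", "Hanifi_Rohingya"},                      # Arabic full stop
    ".": {"Common"},
}


def is_script_run(text):
    """Apply only the basic rules; characters missing from the toy table raise KeyError."""
    # A string of fewer than two characters is always a script run.
    if len(text) < 2:
        return True
    common = None  # running intersection of the Script Extension lists
    for ch in text:
        scripts = SCRIPT_EXTENSIONS[ch]
        # Unknown characters are only acceptable in the short-string case.
        if scripts == {"Unknown"}:
            return False
        # Inherited and Common characters are accepted with any script.
        if scripts == {"Inherited"} or scripts == {"Common"}:
            continue
        common = set(scripts) if common is None else common & scripts
        # An empty intersection means no single script covers the string.
        if not common:
            return False
    return True
```

With this table, an Arabic letter followed by either punctuation mark is accepted, while a Syriac letter is accepted with the Arabic comma but not with the Arabic full stop.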
+<br><b>
+The Chinese Han script
+</b><br>
+<P>
+The Chinese Han script is commonly used in conjunction with other scripts for 
+writing certain languages. Japanese uses the Hiragana and Katakana scripts 
+together with Han; Korean uses Hangul and Han; Taiwanese Mandarin uses Bopomofo
+and Han. These three combinations are treated as special cases when checking
+script runs and are, in effect, "virtual scripts". Thus, a script run may
+contain a mixture of Hiragana, Katakana, and Han, or a mixture of Hangul and
+Han, or a mixture of Bopomofo and Han, but not, for example, a mixture of
+Hangul and Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical
+Standard 39 ("Unicode Security Mechanisms", http://unicode.org/reports/tr39/)
+in allowing such mixtures.
+</P>
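The three "virtual scripts" can be pictured as a subset test. The sketch below is an illustration only: HAN_VIRTUAL_SCRIPTS and han_mixture_allowed are hypothetical names, and the input is assumed to be the set of script names already found in a string that contains Han characters. How PCRE2 actually performs the check is not shown here.

```python
# The three combinations named in the text, written as sets of script names.
HAN_VIRTUAL_SCRIPTS = (
    frozenset({"Han", "Hiragana", "Katakana"}),  # Japanese
    frozenset({"Han", "Hangul"}),                # Korean
    frozenset({"Han", "Bopomofo"}),              # Taiwanese Mandarin
)


def han_mixture_allowed(scripts):
    """Return True if every script in the mixture fits inside one virtual script."""
    scripts = frozenset(scripts)
    return any(scripts <= virtual for virtual in HAN_VIRTUAL_SCRIPTS)
```

A mixture of Hangul, Bopomofo, and Han fits inside none of the three sets, so it is rejected, which matches the counterexample given in the text.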
+<br><b>
+Decimal digits
+</b><br>
+<P>
+Unicode contains many sets of 10 decimal digits in different scripts, and some
+scripts (including the Common script) contain more than one set. Some of these
+decimal digits are visually indistinguishable from the common ASCII
+digits. In addition to the script checking described above, if a script run
+contains any decimal digits, they must all come from the same set of 10
+adjacent characters.
 </P>
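The decimal digit constraint can be approximated with Python's standard unicodedata module. This is a sketch under one stated assumption: each set of ten digits occupies adjacent code points in the order 0 to 9, so subtracting a digit's numeric value from its code point gives the start of its set. The function name digits_from_one_set is hypothetical, and the sketch checks only the digit rule, not the script rules.

```python
import unicodedata


def digits_from_one_set(text):
    """Return True if all decimal digits in text come from one block of ten."""
    bases = set()
    for ch in text:
        # General category Nd marks decimal digits in every script.
        if unicodedata.category(ch) == "Nd":
            # Code point of the zero digit in this character's set (assumes
            # the ten digits are adjacent and in ascending order).
            bases.add(ord(ch) - unicodedata.decimal(ch))
    # No digits at all, or digits from a single set, both satisfy the rule.
    return len(bases) <= 1
```

For example, a string that mixes the ASCII digit 1 with the Arabic-Indic digit three (U+0663) yields two different base code points and therefore fails the check.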
 <br><b>
 VALIDITY OF UTF STRINGS
@ -300,7 +410,7 @@ Cambridge, England.
 REVISION
 </b><br>
 <P>
-Last updated: 02 September 2018
+Last updated: 12 October 2018
 <br>
 Copyright &copy; 1997-2018 University of Cambridge.
 <br>
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
--- a/doc/pcre2matching.3
+++ b/doc/pcre2matching.3
@ -1,4 +1,4 @@
-.TH PCRE2MATCHING 3 "29 September 2014" "PCRE2 10.00"
+.TH PCRE2MATCHING 3 "10 October 2018" "PCRE2 10.33"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 MATCHING ALGORITHMS"
@ -113,7 +113,8 @@ do want multiple matches in such cases, either use an ungreedy repeat
 ("a\ed+?") or set the PCRE2_NO_AUTO_POSSESS option when compiling.
 .P
 There are a number of features of PCRE2 regular expressions that are not
-supported by the alternative matching algorithm. They are as follows:
+supported or behave differently in the alternative matching function. Those
+that are not supported cause an error if encountered.
 .P
 1. Because the algorithm finds all possible matches, the greedy or ungreedy
 nature of repetition quantifiers is not relevant (though it may affect
@ -135,24 +136,26 @@ possibilities, and PCRE2's implementation of this algorithm does not attempt to
 do this. This means that no captured substrings are available.
 .P
 3. Because no substrings are captured, backreferences within the pattern are
-not supported, and cause errors if encountered.
+not supported.
 .P
 4. For the same reason, conditional expressions that use a backreference as the
 condition or test for a specific group recursion are not supported.
 .P
-5. Because many paths through the tree may be active, the \eK escape sequence,
-which resets the start of the match when encountered (but may be on some paths
-and not on others), is not supported. It causes an error if encountered.
+5. Again for the same reason, script runs are not supported.
 .P
-6. Callouts are supported, but the value of the \fIcapture_top\fP field is
+6. Because many paths through the tree may be active, the \eK escape sequence,
+which resets the start of the match when encountered (but may be on some paths
+and not on others), is not supported.
+.P
+7. Callouts are supported, but the value of the \fIcapture_top\fP field is
 always 1, and the value of the \fIcapture_last\fP field is always 0.
 .P
-7. The \eC escape sequence, which (in the standard algorithm) always matches a
+8. The \eC escape sequence, which (in the standard algorithm) always matches a
 single code unit, even in a UTF mode, is not supported in these modes, because
 the alternative algorithm moves through the subject string one character (not
 code unit) at a time, for all active paths through the tree.
 .P
-8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
+9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
 supported. (*FAIL) is supported, and behaves like a failing negative assertion.
 .
 .
@ -188,7 +191,7 @@ The alternative algorithm suffers from a number of disadvantages:
 because it has to search for all possible matches, but is also because it is
 less susceptible to optimization.
 .P
-2. Capturing parentheses and backreferences are not supported.
+2. Capturing parentheses, backreferences, and script runs are not supported.
 .P
 3. Although atomic groups are supported, their use does not provide the
 performance advantage that it does for the standard algorithm.
@ -208,6 +211,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 29 September 2014
-Copyright (c) 1997-2014 University of Cambridge.
+Last updated: 10 October 2018
+Copyright (c) 1997-2018 University of Cambridge.
 .fi
--- a/doc/pcre2pattern.3
+++ b/doc/pcre2pattern.3
@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "24 September 2018" "PCRE2 10.33"
+.TH PCRE2PATTERN 3 "12 October 2018" "PCRE2 10.33"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -755,7 +755,7 @@ sequences that match characters with specific properties are available. In
 8-bit non-UTF-8 mode, these sequences are of course limited to testing
 characters whose code points are less than 256, but they do work in this mode.
 In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode limit)
-may be encountered. These are all treated as being in the Common script and
+may be encountered. These are all treated as being in the Unknown script and
 with an unassigned type. The extra escape sequences are:
 .sp
  \ep{\fIxx\fP}   a character with the \fIxx\fP property
@ -781,8 +781,10 @@ example:
  \ep{Greek}
  \eP{Han}
 .sp
-Those that are not part of an identified script are lumped together as
-"Common". The current list of scripts is:
+Unassigned characters (and in non-UTF 32-bit mode, characters with code points
+greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
+part of an identified script are lumped together as "Common". The current list
+of scripts is:
 .P
 Adlam,
 Ahom,
@ -928,6 +930,7 @@ Tibetan,
 Tifinagh,
 Tirhuta,
 Ugaritic,
+Unknown,
 Vai,
 Warang_Citi,
 Yi,
@ -2603,6 +2606,73 @@ is another pattern that matches "foo" preceded by three digits and any three
 characters that are not "999".
 .
 .
+.SH "SCRIPT RUNS"
+.rs
+.sp
+In concept, a script run is a sequence of characters that are all from the same
+Unicode script such as Latin or Greek. However, because some scripts are
+commonly used together, and because some diacritical and other marks are used
+with multiple scripts, it is not that simple. There is a full description of
+the rules that PCRE2 uses in the section entitled
+.\" HTML <a href="pcre2unicode.html#scriptruns">
+.\" </a>
+"Script Runs"
+.\"
+in the
+.\" HREF
+\fBpcre2unicode\fP
+.\"
+documentation.
+.P
+If part of a pattern is enclosed between (*script_run: or (*sr: and a closing
+parenthesis, it fails if the sequence of characters that it matches is not a
+script run. After a failure, normal backtracking occurs. Script runs can be
+used to detect spoofing attacks using characters that look the same, but are
+from different scripts. The string "paypal.com" is an infamous example, where
+the letters could be a mixture of Latin and Cyrillic. This pattern ensures that
+the matched characters in a sequence of non-spaces that follow white space are
+a script run:
+.sp
+  \es+(*sr:\eS+)
+.sp
+To be sure that they are all from the Latin script (for example), a lookahead
+can be used:
+.sp
+  \es+(?=\ep{Latin})(*sr:\eS+)
+.sp
+This works as long as the first character is expected to be a character in that 
+script, and not (for example) punctuation, which is allowed with any script. If
+this is not the case, a more creative lookahead is needed. For example, if 
+digits, underscore, and dots are permitted at the start:
+.sp
+  \es+(?=[0-9_.]*\ep{Latin})(*sr:\eS+)
+.sp
+.P
+In many cases, backtracking into a script run pattern fragment is not
+desirable. The script run can employ an atomic group to prevent this. Because
+this is a common requirement, a shorthand notation is provided by
+(*atomic_script_run: or (*asr:
+.sp
+  (*asr:...) is the same as (*sr:(?>...))
+.sp
+Note that the atomic group is inside the script run. Putting it outside would
+not prevent backtracking into the script run pattern.
+.P
+Support for script runs is not available if PCRE2 is compiled without Unicode
+support. A compile-time error is given if any of the above constructs is
+encountered. Script runs are not supported by the alternative matching function,
+\fBpcre2_dfa_match()\fP, because they use the same mechanism as capturing
+parentheses.
+.P
+\fBWarning:\fP The (*ACCEPT) control verb
+.\" HTML <a href="#acceptverb">
+.\" </a>
+(see below)
+.\"
+should not be used within a script run subpattern, because it causes an
+immediate exit from the subpattern, bypassing the script run checking.
+.
+.
 .\" HTML <a name="conditions"></a>
 .SH "CONDITIONAL SUBPATTERNS"
 .rs
@ -3267,6 +3337,7 @@ Experiments with Perl suggest that it too has similar optimizations, and like
 PCRE2, turning them off can change the result of a match.
 .
 .
+.\" HTML <a name="acceptverb"></a>
 .SS "Verbs that act immediately"
 .rs
 .sp
@ -3287,6 +3358,10 @@ example:
 .sp
 This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
 the outer parentheses.
+.P
+\fBWarning:\fP (*ACCEPT) should not be used within a script run subpattern,
+because it causes an immediate exit from the subpattern, bypassing the script
+run checking.
 .sp
  (*FAIL) or (*FAIL:NAME)
 .sp
@ -3692,6 +3767,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 24 September 2018
+Last updated: 12 October 2018
 Copyright (c) 1997-2018 University of Cambridge.
 .fi
--- a/doc/pcre2syntax.3
+++ b/doc/pcre2syntax.3
@ -1,4 +1,4 @@
-.TH PCRE2SYNTAX 3 "24 September 2018" "PCRE2 10.33"
+.TH PCRE2SYNTAX 3 "10 October 2018" "PCRE2 10.33"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -511,6 +511,16 @@ setting with a similar syntax.
 Each top-level branch of a lookbehind must be of a fixed length.
 .
 .
+.SH "SCRIPT RUNS"
+.rs
+.sp
+  (*script_run:...)           ) script run, can be backtracked into
+  (*sr:...)                   )
+.sp
+  (*atomic_script_run:...)    ) atomic script run
+  (*asr:...)                  )
+.
+.
 .SH "BACKREFERENCES"
 .rs
 .sp
@ -633,6 +643,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 24 September 2018
+Last updated: 10 October 2018
 Copyright (c) 1997-2018 University of Cambridge.
 .fi
--- a/doc/pcre2unicode.3
+++ b/doc/pcre2unicode.3
@ -1,4 +1,4 @@
-.TH PCRE2UNICODE 3 "02 September 2018" "PCRE2 10.32"
+.TH PCRE2UNICODE 3 "12 October 2018" "PCRE2 10.33"
 .SH NAME
 PCRE - Perl-compatible regular expressions (revised API)
 .SH "UNICODE AND UTF SUPPORT"
@ -118,6 +118,108 @@ few Unicode characters such as Greek sigma have more than two code points that
 are case-equivalent, and these are treated as such.
 .
 .
+.\" HTML <a name="scriptruns"></a>
+.SH "SCRIPT RUNS"
+.rs
+.sp
+The pattern constructs (*script_run:...) and (*atomic_script_run:...), with
+synonyms (*sr:...) and (*asr:...), verify that the string matched within the
+parentheses is a script run. In concept, a script run is a sequence of
+characters that are all from the same Unicode script. However, because some
+scripts are commonly used together, and because some diacritical and other
+marks are used with multiple scripts, it is not that simple.
+.P
+Every Unicode character has a Script property, mostly with a value 
+corresponding to the name of a script, such as Latin, Greek, or Cyrillic. There
+are also three special values:
+.P
+"Unknown" is used for code points that have not been assigned, and also for the
+surrogate code points. In the PCRE2 32-bit library, characters whose code
+points are greater than the Unicode maximum (U+10FFFF), which are accessible 
+only in non-UTF mode, are assigned the Unknown script.
+.P
+"Common" is used for characters that are used with many scripts. These include
+punctuation, emoji, mathematical, musical, and currency symbols, and the ASCII
+digits 0 to 9.
+.P
+"Inherited" is used for characters such as diacritical marks that modify a
+previous character. These are considered to take on the script of the character
+that they modify.
+.P
+Some Inherited characters are used with many scripts, but many of them are
+normally used with only a small number of scripts. For example, U+102E0 (Coptic
+Epact thousands mark) is used only with Arabic and Coptic. In order to make it 
+possible to check this, a Unicode property called Script Extension exists. Its 
+value is a list of scripts that apply to the character. For the majority of 
+characters, the list contains just one script, the same one as the Script
+property. However, for characters such as U+102E0 more than one Script is
+listed. There are also some Common characters that have a single, non-Common
+script in their Script Extension list.
+.P
+The next section describes the basic rules for deciding whether a given string 
+of characters is a script run. Note, however, that there are some special cases 
+involving the Chinese Han script, and an additional constraint for decimal 
+digits. These are covered in subsequent sections.
+.
+.
+.SS "Basic script run rules"
+.rs
+.sp
+A string that is less than two characters long is a script run. This is the
+only case in which an Unknown character can be part of a script run. Longer
+strings are checked using only the Script Extensions property, not the basic
+Script property.
+.P
+If a character's Script Extension property is the single value "Inherited", it
+is always accepted as part of a script run. This is also true for the property
+"Common", subject to the checking of decimal digits described below. All the
+remaining characters in a script run must have at least one script in common in
+their Script Extension lists. In set-theoretic terminology, the intersection of
+all the sets of scripts must not be empty.
+.P
+A simple example is an Internet name such as "google.com". The letters are all
+in the Latin script, and the dot is Common, so this string is a script run.
+However, the Cyrillic letter "o" looks exactly the same as the Latin "o"; a
+string that looks the same, but with Cyrillic "o"s, is not a script run.
+.P
+More interesting examples involve characters with more than one script in their 
+Script Extension. Consider the following characters:
+.sp
+  U+060C  Arabic comma
+  U+06D4  Arabic full stop
+.sp
+The first has the Script Extension list Arabic, Hanifi Rohingya, Syriac, and 
+Thaana; the second has just Arabic and Hanifi Rohingya. Both of them could
+appear in script runs of either Arabic or Hanifi Rohingya. The first could also
+appear in Syriac or Thaana script runs, but the second could not.
+.
+.
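The basic rule described above can be sketched in Python as a set intersection. This is an illustrative sketch only, not PCRE2's implementation: `SCRIPT_EXTENSIONS` and `is_script_run` are hypothetical names, the table is hand-written from the examples in the text and the test data in this commit, and the Han special cases and the decimal digit check are deliberately left out.

```python
# Hypothetical, hand-written Script Extensions table. It covers only the
# characters used in the examples; a real checker would load the full
# Unicode Script Extensions data instead.
COMMON = frozenset({"Common"})
INHERITED = frozenset({"Inherited"})
UNKNOWN = frozenset({"Unknown"})

SCRIPT_EXTENSIONS = {ch: frozenset({"Latin"}) for ch in "abcdefghijklmnopqrstuvwxyz"}
SCRIPT_EXTENSIONS.update({
    ".": COMMON,
    "\u043e": frozenset({"Cyrillic"}),   # Cyrillic small letter o
    "\u060c": frozenset({"Arabic", "Hanifi Rohingya", "Syriac", "Thaana"}),
    "\u06d4": frozenset({"Arabic", "Hanifi Rohingya"}),
    "\u0600": frozenset({"Arabic"}),
    "\u0700": frozenset({"Syriac"}),
    "\U00010d00": frozenset({"Hanifi Rohingya"}),
})


def is_script_run(text, table=SCRIPT_EXTENSIONS):
    """Apply only the basic rule: the script sets must share a script."""
    if len(text) < 2:
        return True                      # short strings are always script runs
    shared = None                        # scripts common to all characters so far
    for ch in text:
        # Assumption of this sketch: anything missing from the tiny table
        # is treated as Unknown.
        scripts = table.get(ch, UNKNOWN)
        if scripts == UNKNOWN:
            return False                 # Unknown is allowed only in short strings
        if scripts in (COMMON, INHERITED):
            continue                     # accepted alongside any script
        shared = scripts if shared is None else shared & scripts
        if not shared:
            return False                 # the intersection became empty
    return True


print(is_script_run("google.com"))             # Latin letters plus a Common dot
print(is_script_run("g\u043e\u043egle.com"))   # Cyrillic "o" mixed with Latin
print(is_script_run("\u060c\u06d4\u0700"))     # U+06D4 followed by a Syriac character
```

Tracking the running intersection, rather than collecting all sets first, lets the check stop at the first character that shares no script with the ones before it.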
+.SS "The Chinese Han script"
+.rs
+.sp  
+The Chinese Han script is commonly used in conjunction with other scripts for 
+writing certain languages. Japanese uses the Hiragana and Katakana scripts 
+together with Han; Korean uses Hangul and Han; Taiwanese Mandarin uses Bopomofo
+and Han. These three combinations are treated as special cases when checking
+script runs and are, in effect, "virtual scripts". Thus, a script run may
+contain a mixture of Hiragana, Katakana, and Han, or a mixture of Hangul and
+Han, or a mixture of Bopomofo and Han, but not, for example, a mixture of
+Hangul and Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical
+Standard 39 ("Unicode Security Mechanisms", http://unicode.org/reports/tr39/)
+in allowing such mixtures.
+.
+.
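One way to model the three "virtual scripts" is to widen each character's script set before intersecting. The sketch below is an assumption about how the rule could be expressed, not a description of PCRE2's code. The names `VIRTUAL`, `expand`, and `compatible` are hypothetical; Jpan, Kore, and Hanb are the ISO 15924 codes for the three combinations.

```python
# Each real script also counts as a member of the virtual scripts that
# contain it. Han belongs to all three combinations.
VIRTUAL = {
    "Han": {"Jpan", "Kore", "Hanb"},
    "Hiragana": {"Jpan"},
    "Katakana": {"Jpan"},
    "Hangul": {"Kore"},
    "Bopomofo": {"Hanb"},
}


def expand(scripts):
    """Return the script set widened with any virtual scripts it implies."""
    widened = set(scripts)
    for script in scripts:
        widened |= VIRTUAL.get(script, set())
    return widened


def compatible(per_char_scripts):
    """True if the widened script sets of all characters share a script."""
    shared = None
    for scripts in per_char_scripts:
        widened = expand(scripts)
        shared = widened if shared is None else shared & widened
        if not shared:
            return False
    return True


print(compatible([{"Hiragana"}, {"Katakana"}, {"Han"}]))   # Japanese mixture
print(compatible([{"Hangul"}, {"Bopomofo"}, {"Han"}]))     # Korean plus Bopomofo
```

Because Han is the only script that belongs to more than one combination, mixing Hangul with Bopomofo leaves no shared virtual script even when Han is present.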
+.SS "Decimal digits"
+.rs
+.sp
+Unicode contains many sets of 10 decimal digits in different scripts, and some
+scripts (including the Common script) contain more than one set. Some of these
+decimal digits are visually indistinguishable from the common ASCII
+digits. In addition to the script checking described above, if a script run
+contains any decimal digits, they must all come from the same set of 10
+adjacent characters.
+.
+.
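The decimal digit constraint can be illustrated with Python's standard `unicodedata` module. This is a minimal sketch, not PCRE2's implementation: `digits_from_one_set` is a hypothetical name, and the sketch relies on each set of decimal digits (general category Nd) occupying ten adjacent code points in ascending order, so that subtracting a digit's value from its code point gives the code point of that set's zero.

```python
import unicodedata


def digits_from_one_set(text):
    """True if every decimal digit in text belongs to the same set of ten."""
    zero = None                               # code point of the first set's zero
    for ch in text:
        if unicodedata.category(ch) != "Nd":
            continue                          # not a decimal digit
        this_zero = ord(ch) - unicodedata.decimal(ch)
        if zero is None:
            zero = this_zero
        elif this_zero != zero:
            return False                      # digit from a different set
    return True


print(digits_from_one_set("abc123"))          # ASCII digits only
print(digits_from_one_set("12\u0663"))        # ASCII digits plus an Arabic-Indic digit
print(digits_from_one_set("1\uff11"))         # ASCII "1" plus fullwidth "1"
```

The last call shows why this check is separate from the script check: fullwidth and ASCII digits both have the Common script, so only the digit-set comparison tells them apart.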
 .SH "VALIDITY OF UTF STRINGS"
 .rs
 .sp
@ -285,6 +387,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 02 September 2018
+Last updated: 12 October 2018
 Copyright (c) 1997-2018 University of Cambridge.
 .fi
--- a/testdata/testinput4
+++ b/testdata/testinput4
@ -2410,6 +2410,7 @@
    \x{3031}\x{3041}\x{30a1}\x{2e80}   [Hira Kata] Hira Kata Han
    \x{060c}\x{06d4}\x{0600}\x{10d00}\x{0700}  [Arab Rohg Syrc Thaa] [Arab Rohg] Arab Rohg Syrc
    \x{060c}\x{06d4}\x{0700}\x{0600}\x{10d00}  [Arab Rohg Syrc Thaa] [Arab Rohg] Syrc Arab Rohg
+    \x{2e80}\x{3041}\x{3001}\x{3031}\x{2e80}   Han Hira [Bopo, Han, etc] [Hira Kata] Han

 /(?<!)(*sr:)/

--- a/testdata/testinput5
+++ b/testdata/testinput5
@ -2112,6 +2112,9 @@
 /^(*sr:.*)/B,utf 
    paypаl.com   A classic example of why script run checks are a good thing

+/^(*sr:.*(*ACCEPT))/utf 
+    paypаl.com   But *ACCEPT breaks things
+
 /^(*sr:\x{2e80}*)/B,utf

 /^(*sr:\x{2e80}*)\x{2e80}/B,utf
--- a/testdata/testoutput4
+++ b/testdata/testoutput4
@ -3902,6 +3902,8 @@ No match
 0: \x{60c}\x{6d4}\x{600}
    \x{060c}\x{06d4}\x{0700}\x{0600}\x{10d00}  [Arab Rohg Syrc Thaa] [Arab Rohg] Syrc Arab Rohg
 0: \x{60c}\x{6d4}
+    \x{2e80}\x{3041}\x{3001}\x{3031}\x{2e80}   Han Hira [Bopo, Han, etc] [Hira Kata] Han
+ 0: \x{2e80}\x{3041}\x{3001}\x{3031}\x{2e80}

 /(?<!)(*sr:)/

--- a/testdata/testoutput5
+++ b/testdata/testoutput5
@ -4791,6 +4791,10 @@ MK: ABC
    paypаl.com   A classic example of why script run checks are a good thing
 0: payp

+/^(*sr:.*(*ACCEPT))/utf 
+    paypаl.com   But *ACCEPT breaks things
+ 0: payp\x{430}l.com   But *ACCEPT breaks things
+
 /^(*sr:\x{2e80}*)/B,utf
 ------------------------------------------------------------------
        Bra