Documentation and tests update for script runs.
parent
4e7a204d18
commit
0fc5cda13b
|
@ -31,8 +31,7 @@ hexadecimal digit" bit was removed. The default tables in
|
|||
src/pcre2_chartables.c.dist are updated.
|
||||
|
||||
8. Implement the new Perl "script run" features (*script_run:...) and
|
||||
(*atomic_script_run:...) aka (*sr:...) and (*asr:...). At present, this is
|
||||
not yet documented.
|
||||
(*atomic_script_run:...) aka (*sr:...) and (*asr:...).
|
||||
|
||||
|
||||
Version 10.32 10-September-2018
|
||||
|
|
|
@ -134,7 +134,8 @@ do want multiple matches in such cases, either use an ungreedy repeat
|
|||
</P>
|
||||
<P>
|
||||
There are a number of features of PCRE2 regular expressions that are not
|
||||
supported by the alternative matching algorithm. They are as follows:
|
||||
supported or behave differently in the alternative matching function. Those
|
||||
that are not supported cause an error if encountered.
|
||||
</P>
|
||||
<P>
|
||||
1. Because the algorithm finds all possible matches, the greedy or ungreedy
|
||||
|
@ -159,29 +160,32 @@ do this. This means that no captured substrings are available.
|
|||
</P>
|
||||
<P>
|
||||
3. Because no substrings are captured, backreferences within the pattern are
|
||||
not supported, and cause errors if encountered.
|
||||
not supported.
|
||||
</P>
|
||||
<P>
|
||||
4. For the same reason, conditional expressions that use a backreference as the
|
||||
condition or test for a specific group recursion are not supported.
|
||||
</P>
|
||||
<P>
|
||||
5. Because many paths through the tree may be active, the \K escape sequence,
|
||||
which resets the start of the match when encountered (but may be on some paths
|
||||
and not on others), is not supported. It causes an error if encountered.
|
||||
5. Again for the same reason, script runs are not supported.
|
||||
</P>
|
||||
<P>
|
||||
6. Callouts are supported, but the value of the <i>capture_top</i> field is
|
||||
6. Because many paths through the tree may be active, the \K escape sequence,
|
||||
which resets the start of the match when encountered (but may be on some paths
|
||||
and not on others), is not supported.
|
||||
</P>
|
||||
<P>
|
||||
7. Callouts are supported, but the value of the <i>capture_top</i> field is
|
||||
always 1, and the value of the <i>capture_last</i> field is always 0.
|
||||
</P>
|
||||
<P>
|
||||
7. The \C escape sequence, which (in the standard algorithm) always matches a
|
||||
8. The \C escape sequence, which (in the standard algorithm) always matches a
|
||||
single code unit, even in a UTF mode, is not supported in these modes, because
|
||||
the alternative algorithm moves through the subject string one character (not
|
||||
code unit) at a time, for all active paths through the tree.
|
||||
</P>
|
||||
<P>
|
||||
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
|
||||
9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
|
||||
supported. (*FAIL) is supported, and behaves like a failing negative assertion.
|
||||
</P>
|
||||
<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
|
||||
|
@ -215,7 +219,7 @@ because it has to search for all possible matches, but is also because it is
|
|||
less susceptible to optimization.
|
||||
</P>
|
||||
<P>
|
||||
2. Capturing parentheses and backreferences are not supported.
|
||||
2. Capturing parentheses, backreferences, and script runs are not supported.
|
||||
</P>
|
||||
<P>
|
||||
3. Although atomic groups are supported, their use does not provide the
|
||||
|
@ -232,9 +236,9 @@ Cambridge, England.
|
|||
</P>
|
||||
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 29 September 2014
|
||||
Last updated: 10 October 2018
|
||||
<br>
|
||||
Copyright © 1997-2014 University of Cambridge.
|
||||
Copyright © 1997-2018 University of Cambridge.
|
||||
<br>
|
||||
<p>
|
||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||
|
|
|
@ -33,16 +33,17 @@ please consult the man page, in case the conversion went wrong.
|
|||
<li><a name="TOC18" href="#SEC18">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
|
||||
<li><a name="TOC19" href="#SEC19">BACKREFERENCES</a>
|
||||
<li><a name="TOC20" href="#SEC20">ASSERTIONS</a>
|
||||
<li><a name="TOC21" href="#SEC21">CONDITIONAL SUBPATTERNS</a>
|
||||
<li><a name="TOC22" href="#SEC22">COMMENTS</a>
|
||||
<li><a name="TOC23" href="#SEC23">RECURSIVE PATTERNS</a>
|
||||
<li><a name="TOC24" href="#SEC24">SUBPATTERNS AS SUBROUTINES</a>
|
||||
<li><a name="TOC25" href="#SEC25">ONIGURUMA SUBROUTINE SYNTAX</a>
|
||||
<li><a name="TOC26" href="#SEC26">CALLOUTS</a>
|
||||
<li><a name="TOC27" href="#SEC27">BACKTRACKING CONTROL</a>
|
||||
<li><a name="TOC28" href="#SEC28">SEE ALSO</a>
|
||||
<li><a name="TOC29" href="#SEC29">AUTHOR</a>
|
||||
<li><a name="TOC30" href="#SEC30">REVISION</a>
|
||||
<li><a name="TOC21" href="#SEC21">SCRIPT RUNS</a>
|
||||
<li><a name="TOC22" href="#SEC22">CONDITIONAL SUBPATTERNS</a>
|
||||
<li><a name="TOC23" href="#SEC23">COMMENTS</a>
|
||||
<li><a name="TOC24" href="#SEC24">RECURSIVE PATTERNS</a>
|
||||
<li><a name="TOC25" href="#SEC25">SUBPATTERNS AS SUBROUTINES</a>
|
||||
<li><a name="TOC26" href="#SEC26">ONIGURUMA SUBROUTINE SYNTAX</a>
|
||||
<li><a name="TOC27" href="#SEC27">CALLOUTS</a>
|
||||
<li><a name="TOC28" href="#SEC28">BACKTRACKING CONTROL</a>
|
||||
<li><a name="TOC29" href="#SEC29">SEE ALSO</a>
|
||||
<li><a name="TOC30" href="#SEC30">AUTHOR</a>
|
||||
<li><a name="TOC31" href="#SEC31">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION DETAILS</a><br>
|
||||
<P>
|
||||
|
@ -756,7 +757,7 @@ sequences that match characters with specific properties are available. In
|
|||
8-bit non-UTF-8 mode, these sequences are of course limited to testing
|
||||
characters whose code points are less than 256, but they do work in this mode.
|
||||
In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode limit)
|
||||
may be encountered. These are all treated as being in the Common script and
|
||||
may be encountered. These are all treated as being in the Unknown script and
|
||||
with an unassigned type. The extra escape sequences are:
|
||||
<pre>
|
||||
\p{<i>xx</i>} a character with the <i>xx</i> property
|
||||
|
@ -780,8 +781,10 @@ example:
|
|||
\p{Greek}
|
||||
\P{Han}
|
||||
</pre>
|
||||
Those that are not part of an identified script are lumped together as
|
||||
"Common". The current list of scripts is:
|
||||
Unassigned characters (and in non-UTF 32-bit mode, characters with code points
|
||||
greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
|
||||
part of an identified script are lumped together as "Common". The current list
|
||||
of scripts is:
|
||||
</P>
|
||||
<P>
|
||||
Adlam,
|
||||
|
@ -928,6 +931,7 @@ Tibetan,
|
|||
Tifinagh,
|
||||
Tirhuta,
|
||||
Ugaritic,
|
||||
Unknown,
|
||||
Vai,
|
||||
Warang_Citi,
|
||||
Yi,
|
||||
|
@ -2589,8 +2593,70 @@ preceded by "foo", while
|
|||
</pre>
|
||||
is another pattern that matches "foo" preceded by three digits and any three
|
||||
characters that are not "999".
|
||||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">SCRIPT RUNS</a><br>
|
||||
<P>
|
||||
In concept, a script run is a sequence of characters that are all from the same
|
||||
Unicode script such as Latin or Greek. However, because some scripts are
|
||||
commonly used together, and because some diacritical and other marks are used
|
||||
with multiple scripts, it is not that simple. There is a full description of
|
||||
the rules that PCRE2 uses in the section entitled
|
||||
<a href="pcre2unicode.html#scriptruns">"Script Runs"</a>
|
||||
in the
|
||||
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<P>
|
||||
If part of a pattern is enclosed between (*script_run: or (*sr: and a closing
|
||||
parenthesis, it fails if the sequence of characters that it matches is not a
|
||||
script run. After a failure, normal backtracking occurs. Script runs can be
|
||||
used to detect spoofing attacks using characters that look the same, but are
|
||||
from different scripts. The string "paypal.com" is an infamous example, where
|
||||
the letters could be a mixture of Latin and Cyrillic. This pattern ensures that
|
||||
the matched characters in a sequence of non-spaces that follow white space are
|
||||
a script run:
|
||||
<pre>
|
||||
\s+(*sr:\S+)
|
||||
</pre>
|
||||
To be sure that they are all from the Latin script (for example), a lookahead
|
||||
can be used:
|
||||
<pre>
|
||||
\s+(?=\p{Latin})(*sr:\S+)
|
||||
</pre>
|
||||
This works as long as the first character is expected to be a character in that
|
||||
script, and not (for example) punctuation, which is allowed with any script. If
|
||||
this is not the case, a more creative lookahead is needed. For example, if
|
||||
digits, underscore, and dots are permitted at the start:
|
||||
<pre>
|
||||
\s+(?=[0-9_.]*\p{Latin})(*sr:\S+)
|
||||
|
||||
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
In many cases, backtracking into a script run pattern fragment is not
|
||||
desirable. The script run can employ an atomic group to prevent this. Because
|
||||
this is a common requirement, a shorthand notation is provided by
|
||||
(*atomic_script_run: or (*asr:
|
||||
<pre>
|
||||
(*asr:...) is the same as (*sr:(?>...))
|
||||
</pre>
|
||||
Note that the atomic group is inside the script run. Putting it outside would
|
||||
not prevent backtracking into the script run pattern.
|
||||
</P>
|
||||
<P>
|
||||
Support for script runs is not available if PCRE2 is compiled without Unicode
|
||||
support. A compile-time error is given if any of the above constructs is
|
||||
encountered. Script runs are not supported by the alternate matching function,
|
||||
<b>pcre2_dfa_match()</b>, because they use the same mechanism as capturing
|
||||
parentheses.
|
||||
</P>
|
||||
<P>
|
||||
<b>Warning:</b> The (*ACCEPT) control verb
|
||||
<a href="#acceptverb">(see below)</a>
|
||||
should not be used within a script run subpattern, because it causes an
|
||||
immediate exit from the subpattern, bypassing the script run checking.
|
||||
<a name="conditions"></a></P>
|
||||
<br><a name="SEC21" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br>
|
||||
<br><a name="SEC22" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br>
|
||||
<P>
|
||||
It is possible to cause the matching process to obey a subpattern
|
||||
conditionally or to choose between two alternative subpatterns, depending on
|
||||
|
@ -2790,7 +2856,7 @@ positive and negative assertions, because matching always continues after the
|
|||
assertion, whether it succeeds or fails. (Compare non-conditional assertions,
|
||||
when captures are retained only for positive assertions that succeed.)
|
||||
<a name="comments"></a></P>
|
||||
<br><a name="SEC22" href="#TOC1">COMMENTS</a><br>
|
||||
<br><a name="SEC23" href="#TOC1">COMMENTS</a><br>
|
||||
<P>
|
||||
There are two ways of including comments in patterns that are processed by
|
||||
PCRE2. In both cases, the start of the comment must not be in a character
|
||||
|
@ -2820,7 +2886,7 @@ a newline in the pattern. The sequence \n is still literal at this stage, so
|
|||
it does not terminate the comment. Only an actual character with the code value
|
||||
0x0a (the default newline) does so.
|
||||
<a name="recursion"></a></P>
|
||||
<br><a name="SEC23" href="#TOC1">RECURSIVE PATTERNS</a><br>
|
||||
<br><a name="SEC24" href="#TOC1">RECURSIVE PATTERNS</a><br>
|
||||
<P>
|
||||
Consider the problem of matching a string in parentheses, allowing for
|
||||
unlimited nested parentheses. Without the use of recursion, the best that can
|
||||
|
@ -3008,7 +3074,7 @@ alternative matches "a" and then recurses. In the recursion, \1 does now match
|
|||
"b" and so the whole match succeeds. This match used to fail in Perl, but in
|
||||
later versions (I tried 5.024) it now works.
|
||||
<a name="subpatternsassubroutines"></a></P>
|
||||
<br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
|
||||
<br><a name="SEC25" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
|
||||
<P>
|
||||
If the syntax for a recursive subpattern call (either by number or by
|
||||
name) is used outside the parentheses to which it refers, it operates a bit
|
||||
|
@ -3057,7 +3123,7 @@ in subpatterns when called as subroutines is described in the section entitled
|
|||
<a href="#btsub">"Backtracking verbs in subroutines"</a>
|
||||
below.
|
||||
<a name="onigurumasubroutines"></a></P>
|
||||
<br><a name="SEC25" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
|
||||
<br><a name="SEC26" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
|
||||
<P>
|
||||
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
|
||||
a number enclosed either in angle brackets or single quotes, is an alternative
|
||||
|
@ -3075,7 +3141,7 @@ plus or a minus sign it is taken as a relative reference. For example:
|
|||
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are <i>not</i>
|
||||
synonymous. The former is a backreference; the latter is a subroutine call.
|
||||
</P>
|
||||
<br><a name="SEC26" href="#TOC1">CALLOUTS</a><br>
|
||||
<br><a name="SEC27" href="#TOC1">CALLOUTS</a><br>
|
||||
<P>
|
||||
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
|
||||
code to be obeyed in the middle of matching a regular expression. This makes it
|
||||
|
@ -3151,7 +3217,7 @@ example:
|
|||
</pre>
|
||||
The doubling is removed before the string is passed to the callout function.
|
||||
<a name="backtrackcontrol"></a></P>
|
||||
<br><a name="SEC27" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
||||
<br><a name="SEC28" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
||||
<P>
|
||||
There are a number of special "Backtracking Control Verbs" (to use Perl's
|
||||
terminology) that modify the behaviour of backtracking during matching. They
|
||||
|
@ -3222,7 +3288,7 @@ documentation.
|
|||
<P>
|
||||
Experiments with Perl suggest that it too has similar optimizations, and like
|
||||
PCRE2, turning them off can change the result of a match.
|
||||
</P>
|
||||
<a name="acceptverb"></a></P>
|
||||
<br><b>
|
||||
Verbs that act immediately
|
||||
</b><br>
|
||||
|
@ -3245,6 +3311,11 @@ example:
|
|||
</pre>
|
||||
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
|
||||
the outer parentheses.
|
||||
</P>
|
||||
<P>
|
||||
<b>Warning:</b> (*ACCEPT) should not be used within a script run subpattern,
|
||||
because it causes an immediate exit from the subpattern, bypassing the script
|
||||
run checking.
|
||||
<pre>
|
||||
(*FAIL) or (*FAIL:NAME)
|
||||
</pre>
|
||||
|
@ -3644,12 +3715,12 @@ behaviour). However, if there is no such group within the subroutine
|
|||
subpattern, the subroutine match fails and there is a backtrack at the outer
|
||||
level.
|
||||
</P>
|
||||
<br><a name="SEC28" href="#TOC1">SEE ALSO</a><br>
|
||||
<br><a name="SEC29" href="#TOC1">SEE ALSO</a><br>
|
||||
<P>
|
||||
<b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2matching</b>(3),
|
||||
<b>pcre2syntax</b>(3), <b>pcre2</b>(3).
|
||||
</P>
|
||||
<br><a name="SEC29" href="#TOC1">AUTHOR</a><br>
|
||||
<br><a name="SEC30" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
|
@ -3658,9 +3729,9 @@ University Computing Service
|
|||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
||||
<br><a name="SEC31" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 24 September 2018
|
||||
Last updated: 12 October 2018
|
||||
<br>
|
||||
Copyright © 1997-2018 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -32,14 +32,15 @@ please consult the man page, in case the conversion went wrong.
|
|||
<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
|
||||
<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
|
||||
<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
|
||||
<li><a name="TOC20" href="#SEC20">BACKREFERENCES</a>
|
||||
<li><a name="TOC21" href="#SEC21">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
|
||||
<li><a name="TOC22" href="#SEC22">CONDITIONAL PATTERNS</a>
|
||||
<li><a name="TOC23" href="#SEC23">BACKTRACKING CONTROL</a>
|
||||
<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
|
||||
<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
|
||||
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
|
||||
<li><a name="TOC27" href="#SEC27">REVISION</a>
|
||||
<li><a name="TOC20" href="#SEC20">SCRIPT RUNS</a>
|
||||
<li><a name="TOC21" href="#SEC21">BACKREFERENCES</a>
|
||||
<li><a name="TOC22" href="#SEC22">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
|
||||
<li><a name="TOC23" href="#SEC23">CONDITIONAL PATTERNS</a>
|
||||
<li><a name="TOC24" href="#SEC24">BACKTRACKING CONTROL</a>
|
||||
<li><a name="TOC25" href="#SEC25">CALLOUTS</a>
|
||||
<li><a name="TOC26" href="#SEC26">SEE ALSO</a>
|
||||
<li><a name="TOC27" href="#SEC27">AUTHOR</a>
|
||||
<li><a name="TOC28" href="#SEC28">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
|
||||
<P>
|
||||
|
@ -533,7 +534,17 @@ setting with a similar syntax.
|
|||
</pre>
|
||||
Each top-level branch of a lookbehind must be of a fixed length.
|
||||
</P>
|
||||
<br><a name="SEC20" href="#TOC1">BACKREFERENCES</a><br>
|
||||
<br><a name="SEC20" href="#TOC1">SCRIPT RUNS</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(*script_run:...) ) script run, can be backtracked into
|
||||
(*sr:...) )
|
||||
|
||||
(*atomic_script_run:...) ) atomic script run
|
||||
(*asr:...) )
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">BACKREFERENCES</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
\n reference by number (can be ambiguous)
|
||||
|
@ -550,7 +561,7 @@ Each top-level branch of a lookbehind must be of a fixed length.
|
|||
(?P=name) reference by name (Python)
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC21" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
|
||||
<br><a name="SEC22" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(?R) recurse whole pattern
|
||||
|
@ -569,7 +580,7 @@ Each top-level branch of a lookbehind must be of a fixed length.
|
|||
\g'-n' call subpattern by relative number (PCRE2 extension)
|
||||
</PRE>
|
||||
</P>
|
||||
<br><a name="SEC22" href="#TOC1">CONDITIONAL PATTERNS</a><br>
|
||||
<br><a name="SEC23" href="#TOC1">CONDITIONAL PATTERNS</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(?(condition)yes-pattern)
|
||||
|
@ -592,7 +603,7 @@ Note the ambiguity of (?(R) and (?(Rn) which might be named reference
|
|||
conditions or recursion tests. Such a condition is interpreted as a reference
|
||||
condition if the relevant named group exists.
|
||||
</P>
|
||||
<br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
||||
<br><a name="SEC24" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
||||
<P>
|
||||
All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
|
||||
name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
|
||||
|
@ -619,7 +630,7 @@ pattern is not anchored.
|
|||
The effect of one of these verbs in a group called as a subroutine is confined
|
||||
to the subroutine call.
|
||||
</P>
|
||||
<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
|
||||
<br><a name="SEC25" href="#TOC1">CALLOUTS</a><br>
|
||||
<P>
|
||||
<pre>
|
||||
(?C) callout (assumed number 0)
|
||||
|
@ -630,12 +641,12 @@ The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
|
|||
start and the end), and the starting delimiter { matched with the ending
|
||||
delimiter }. To encode the ending delimiter within the string, double it.
|
||||
</P>
|
||||
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
|
||||
<br><a name="SEC26" href="#TOC1">SEE ALSO</a><br>
|
||||
<P>
|
||||
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
|
||||
<b>pcre2matching</b>(3), <b>pcre2</b>(3).
|
||||
</P>
|
||||
<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
|
||||
<br><a name="SEC27" href="#TOC1">AUTHOR</a><br>
|
||||
<P>
|
||||
Philip Hazel
|
||||
<br>
|
||||
|
@ -644,9 +655,9 @@ University Computing Service
|
|||
Cambridge, England.
|
||||
<br>
|
||||
</P>
|
||||
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
||||
<br><a name="SEC28" href="#TOC1">REVISION</a><br>
|
||||
<P>
|
||||
Last updated: 24 September 2018
|
||||
Last updated: 10 October 2018
|
||||
<br>
|
||||
Copyright © 1997-2018 University of Cambridge.
|
||||
<br>
|
||||
|
|
|
@ -124,6 +124,116 @@ for characters whose code points are less than 128 and that have at most two
|
|||
case-equivalent values. For these, a direct table lookup is used for speed. A
|
||||
few Unicode characters such as Greek sigma have more than two code points that
|
||||
are case-equivalent, and these are treated as such.
|
||||
<a name="scriptruns"></a></P>
|
||||
<br><b>
|
||||
SCRIPT RUNS
|
||||
</b><br>
|
||||
<P>
|
||||
The pattern constructs (*script_run:...) and (*atomic_script_run:...), with
|
||||
synonyms (*sr:...) and (*asr:...), verify that the string matched within the
|
||||
parentheses is a script run. In concept, a script run is a sequence of
|
||||
characters that are all from the same Unicode script. However, because some
|
||||
scripts are commonly used together, and because some diacritical and other
|
||||
marks are used with multiple scripts, it is not that simple.
|
||||
</P>
|
||||
<P>
|
||||
Every Unicode character has a Script property, mostly with a value
|
||||
corresponding to the name of a script, such as Latin, Greek, or Cyrillic. There
|
||||
are also three special values:
|
||||
</P>
|
||||
<P>
|
||||
"Unknown" is used for code points that have not been assigned, and also for the
|
||||
surrogate code points. In the PCRE2 32-bit library, characters whose code
|
||||
points are greater than the Unicode maximum (U+10FFFF), which are accessible
|
||||
only in non-UTF mode, are assigned the Unknown script.
|
||||
</P>
|
||||
<P>
|
||||
"Common" is used for characters that are used with many scripts. These include
|
||||
punctuation, emoji, mathematical, musical, and currency symbols, and the ASCII
|
||||
digits 0 to 9.
|
||||
</P>
|
||||
<P>
|
||||
"Inherited" is used for characters such as diacritical marks that modify a
|
||||
previous character. These are considered to take on the script of the character
|
||||
that they modify.
|
||||
</P>
|
||||
<P>
|
||||
Some Inherited characters are used with many scripts, but many of them are only
|
||||
normally used with a small number of scripts. For example, U+102E0 (Coptic
|
||||
Epact thousands mark) is used only with Arabic and Coptic. In order to make it
|
||||
possible to check this, a Unicode property called Script Extension exists. Its
|
||||
value is a list of scripts that apply to the character. For the majority of
|
||||
characters, the list contains just one script, the same one as the Script
|
||||
property. However, for characters such as U+102E0 more than one Script is
|
||||
listed. There are also some Common characters that have a single, non-Common
|
||||
script in their Script Extension list.
|
||||
</P>
|
||||
<P>
|
||||
The next section describes the basic rules for deciding whether a given string
|
||||
of characters is a script run. Note, however, that there are some special cases
|
||||
involving the Chinese Han script, and an additional constraint for decimal
|
||||
digits. These are covered in subsequent sections.
|
||||
</P>
|
||||
<br><b>
|
||||
Basic script run rules
|
||||
</b><br>
|
||||
<P>
|
||||
A string that is less than two characters long is a script run. This is the
|
||||
only case in which an Unknown character can be part of a script run. Longer
|
||||
strings are checked using only the Script Extensions property, not the basic
|
||||
Script property.
|
||||
</P>
|
||||
<P>
|
||||
If a character's Script Extension property is the single value "Inherited", it
|
||||
is always accepted as part of a script run. This is also true for the property
|
||||
"Common", subject to the checking of decimal digits described below. All the
|
||||
remaining characters in a script run must have at least one script in common in
|
||||
their Script Extension lists. In set-theoretic terminology, the intersection of
|
||||
all the sets of scripts must not be empty.
|
||||
</P>
|
||||
<P>
|
||||
A simple example is an Internet name such as "google.com". The letters are all
|
||||
in the Latin script, and the dot is Common, so this string is a script run.
|
||||
However, the Cyrillic letter "o" looks exactly the same as the Latin "o"; a
|
||||
string that looks the same, but with Cyrillic "o"s, is not a script run.
|
||||
</P>
|
||||
<P>
|
||||
More interesting examples involve characters with more than one script in their
|
||||
Script Extension. Consider the following characters:
|
||||
<pre>
|
||||
U+060C Arabic comma
|
||||
U+06D4 Arabic full stop
|
||||
</pre>
|
||||
The first has the Script Extension list Arabic, Hanifi Rohingya, Syriac, and
|
||||
Thaana; the second has just Arabic and Hanifi Rohingya. Both of them could
|
||||
appear in script runs of either Arabic or Hanifi Rohingya. The first could also
|
||||
appear in Syriac or Thaana script runs, but the second could not.
|
||||
</P>
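The basic rule can be illustrated with a minimal sketch. This is not PCRE2's implementation: the table SCRIPT_EXTENSIONS and the function is_script_run() are hypothetical names, the table is hand-written, and only the entries for U+060C and U+06D4 are taken from the lists quoted above. The single-script entries for the two letters are assumptions made for the example. The Han special cases and the decimal digit check are deliberately left out.

```python
# Hypothetical, hand-written Script Extension table (illustration only).
# The lists for U+060C and U+06D4 follow the text above; the letters are
# assumed to have a single script in their Script Extension list.
SCRIPT_EXTENSIONS = {
    "\u060C": {"Arabic", "Hanifi_Rohingya", "Syriac", "Thaana"},  # Arabic comma
    "\u06D4": {"Arabic", "Hanifi_Rohingya"},  # Arabic full stop
    "\u0628": {"Arabic"},  # ARABIC LETTER BEH (assumed single script)
    "\u0710": {"Syriac"},  # SYRIAC LETTER ALAPH (assumed single script)
    ".": {"Common"},
}


def is_script_run(text):
    """Apply the basic rule only: no Han special cases, no digit check."""
    if len(text) < 2:
        return True  # strings shorter than two characters always qualify
    shared = None  # intersection of the Script Extension lists seen so far
    for ch in text:
        # Characters missing from the toy table are treated as Unknown.
        scripts = SCRIPT_EXTENSIONS.get(ch, {"Unknown"})
        if scripts == {"Unknown"}:
            return False  # Unknown is allowed only in the short-string case
        if scripts in ({"Common"}, {"Inherited"}):
            continue  # always accepted
        shared = set(scripts) if shared is None else shared & scripts
        if not shared:
            return False  # the intersection has become empty
    return True
```

With this table, a Syriac letter followed by the Arabic comma is accepted, but a Syriac letter followed by the Arabic full stop is not, because the full stop's list does not include Syriac.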
|
||||
<br><b>
|
||||
The Chinese Han script
|
||||
</b><br>
|
||||
<P>
|
||||
The Chinese Han script is commonly used in conjunction with other scripts for
|
||||
writing certain languages. Japanese uses the Hiragana and Katakana scripts
|
||||
together with Han; Korean uses Hangul and Han; Taiwanese Mandarin uses Bopomofo
|
||||
and Han. These three combinations are treated as special cases when checking
|
||||
script runs and are, in effect, "virtual scripts". Thus, a script run may
|
||||
contain a mixture of Hiragana, Katakana, and Han, or a mixture of Hangul and
|
||||
Han, or a mixture of Bopomofo and Han, but not, for example, a mixture of
|
||||
Hangul and Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical
|
||||
Standard 39 ("Unicode Security Mechanisms", http://unicode.org/reports/tr39/)
|
||||
in allowing such mixtures.
|
||||
</P>
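The three combinations can be pictured as sets of script names. The sketch below is a simplification under stated assumptions: HAN_COMBINATIONS and scripts_may_mix() are hypothetical names, each character is taken to have exactly one script (Script Extension lists are ignored), and any subset of a combination is treated as acceptable.

```python
# The three combinations named in the text, treated as "virtual scripts".
HAN_COMBINATIONS = [
    {"Han", "Hiragana", "Katakana"},  # Japanese
    {"Han", "Hangul"},                # Korean
    {"Han", "Bopomofo"},              # Taiwanese Mandarin
]


def scripts_may_mix(scripts):
    """Return True if the given script names could share one script run."""
    scripts = set(scripts)
    if len(scripts) <= 1:
        return True  # a single script never needs a special case
    # Otherwise every script must fit inside one of the combinations.
    return any(scripts <= combo for combo in HAN_COMBINATIONS)
```

For example, scripts_may_mix({"Han", "Hangul"}) is true, while scripts_may_mix({"Han", "Hangul", "Bopomofo"}) is false, matching the example given above.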
|
||||
<br><b>
|
||||
Decimal digits
|
||||
</b><br>
|
||||
<P>
|
||||
Unicode contains many sets of 10 decimal digits in different scripts, and some
|
||||
scripts (including the Common script) contain more than one set. Some of these
|
||||
decimal digits are visually indistinguishable from the common ASCII
|
||||
digits. In addition to the script checking described above, if a script run
|
||||
contains any decimal digits, they must all come from the same set of 10
|
||||
adjacent characters.
|
||||
</P>
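The digit constraint can be sketched with Python's standard unicodedata module. This illustrates the rule and is not PCRE2's code; the function name digits_from_one_set() is hypothetical. It relies on the fact that each set of decimal digits (general category Nd) occupies 10 adjacent code points, so subtracting a digit's value from its code point gives the code point of that set's zero.

```python
import unicodedata


def digits_from_one_set(text):
    """Return True if all decimal digits in text come from one set of 10."""
    zeros = set()  # code point of the zero digit for each set encountered
    for ch in text:
        if unicodedata.category(ch) == "Nd":
            # decimal() gives the value 0..9, so this is the set's zero.
            zeros.add(ord(ch) - unicodedata.decimal(ch))
    return len(zeros) <= 1
```

A string such as "abc123" passes, whereas "12" followed by U+0663 (Arabic-Indic digit three) does not, because its digits come from two different sets.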
|
||||
<br><b>
|
||||
VALIDITY OF UTF STRINGS
|
||||
|
@ -300,7 +410,7 @@ Cambridge, England.
|
|||
REVISION
|
||||
</b><br>
|
||||
<P>
|
||||
Last updated: 02 September 2018
|
||||
Last updated: 12 October 2018
|
||||
<br>
|
||||
Copyright © 1997-2018 University of Cambridge.
|
||||
<br>
|
||||
|
|
216
doc/pcre2.txt
|
@ -5391,8 +5391,8 @@ THE ALTERNATIVE MATCHING ALGORITHM
|
|||
SESS option when compiling.
|
||||
|
||||
There are a number of features of PCRE2 regular expressions that are
|
||||
not supported by the alternative matching algorithm. They are as fol-
|
||||
lows:
|
||||
not supported or behave differently in the alternative matching func-
|
||||
tion. Those that are not supported cause an error if encountered.
|
||||
|
||||
1. Because the algorithm finds all possible matches, the greedy or
|
||||
ungreedy nature of repetition quantifiers is not relevant (though it
|
||||
|
@ -5417,27 +5417,28 @@ THE ALTERNATIVE MATCHING ALGORITHM
|
|||
strings are available.
|
||||
|
||||
3. Because no substrings are captured, backreferences within the pat-
|
||||
tern are not supported, and cause errors if encountered.
|
||||
tern are not supported.
|
||||
|
||||
4. For the same reason, conditional expressions that use a backrefer-
|
||||
ence as the condition or test for a specific group recursion are not
|
||||
supported.
|
||||
|
||||
5. Because many paths through the tree may be active, the \K escape
|
||||
sequence, which resets the start of the match when encountered (but may
|
||||
be on some paths and not on others), is not supported. It causes an
|
||||
error if encountered.
|
||||
5. Again for the same reason, script runs are not supported.
|
||||
|
||||
6. Callouts are supported, but the value of the capture_top field is
|
||||
6. Because many paths through the tree may be active, the \K escape
|
||||
sequence, which resets the start of the match when encountered (but may
|
||||
be on some paths and not on others), is not supported.
|
||||
|
||||
7. Callouts are supported, but the value of the capture_top field is
|
||||
always 1, and the value of the capture_last field is always 0.
|
||||
|
||||
7. The \C escape sequence, which (in the standard algorithm) always
|
||||
8. The \C escape sequence, which (in the standard algorithm) always
|
||||
matches a single code unit, even in a UTF mode, is not supported in
|
||||
these modes, because the alternative algorithm moves through the sub-
|
||||
ject string one character (not code unit) at a time, for all active
|
||||
paths through the tree.
|
||||
|
||||
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
|
||||
9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
|
||||
are not supported. (*FAIL) is supported, and behaves like a failing
|
||||
negative assertion.
|
||||
|
||||
|
@ -5470,7 +5471,8 @@ DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
|
|||
partly because it has to search for all possible matches, but is also
|
||||
because it is less susceptible to optimization.
|
||||
|
||||
2. Capturing parentheses and backreferences are not supported.
|
||||
2. Capturing parentheses, backreferences, and script runs are not sup-
|
||||
ported.
|
||||
|
||||
3. Although atomic groups are supported, their use does not provide the
|
||||
performance advantage that it does for the standard algorithm.
|
||||
|
@ -5485,8 +5487,8 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 29 September 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
Last updated: 10 October 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
||||
|
@ -6586,7 +6588,7 @@ BACKSLASH
|
|||
limited to testing characters whose code points are less than 256, but
|
||||
they do work in this mode. In 32-bit non-UTF mode, code points greater
|
||||
than 0x10ffff (the Unicode limit) may be encountered. These are all
|
||||
treated as being in the Common script and with an unassigned type. The
|
||||
treated as being in the Unknown script and with an unassigned type. The
|
||||
extra escape sequences are:
|
||||
|
||||
\p{xx} a character with the xx property
|
||||
|
@ -6607,8 +6609,10 @@ BACKSLASH
|
|||
\p{Greek}
|
||||
\P{Han}
|
||||
|
||||
Those that are not part of an identified script are lumped together as
|
||||
"Common". The current list of scripts is:
|
||||
Unassigned characters (and in non-UTF 32-bit mode, characters with code
|
||||
points greater than 0x10FFFF) are assigned the "Unknown" script. Others
|
||||
that are not part of an identified script are lumped together as "Com-
|
||||
mon". The current list of scripts is:
|
||||
|
||||
Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Bali-
|
||||
nese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi,
|
||||
|
@ -6632,7 +6636,8 @@ BACKSLASH
|
|||
vian, Siddham, SignWriting, Sinhala, Sogdian, Sora_Sompeng, Soyombo,
|
||||
Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
|
||||
Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan, Tifi-
|
||||
nagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi, Zanabazar_Square.
|
||||
nagh, Tirhuta, Ugaritic, Unknown, Vai, Warang_Citi, Yi, Zan-
|
||||
abazar_Square.
|
||||
|
||||
Each character has exactly one Unicode general category property, spec-
|
||||
ified by a two-letter abbreviation. For compatibility with Perl, nega-
|
||||
|
@ -8197,6 +8202,62 @@ ASSERTIONS
|
|||
three characters that are not "999".
|
||||
|
||||
|
||||
SCRIPT RUNS
|
||||
|
||||
In concept, a script run is a sequence of characters that are all from
|
||||
the same Unicode script such as Latin or Greek. However, because some
|
||||
scripts are commonly used together, and because some diacritical and
|
||||
other marks are used with multiple scripts, it is not that simple.
|
||||
There is a full description of the rules that PCRE2 uses in the section
|
||||
entitled "Script Runs" in the pcre2unicode documentation.
|
||||
|
||||
If part of a pattern is enclosed between (*script_run: or (*sr: and a
|
||||
closing parenthesis, it fails if the sequence of characters that it
|
||||
matches is not a script run. After a failure, normal backtracking
|
||||
occurs. Script runs can be used to detect spoofing attacks using char-
|
||||
acters that look the same, but are from different scripts. The string
|
||||
"paypal.com" is an infamous example, where the letters could be a mix-
|
||||
ture of Latin and Cyrillic. This pattern ensures that the matched char-
|
||||
acters in a sequence of non-spaces that follow white space are a script
|
||||
run:
|
||||
|
||||
\s+(*sr:\S+)
|
||||
|
||||
To be sure that they are all from the Latin script (for example), a
|
||||
lookahead can be used:
|
||||
|
||||
\s+(?=\p{Latin})(*sr:\S+)
|
||||
|
||||
This works as long as the first character is expected to be a character
|
||||
in that script, and not (for example) punctuation, which is allowed
|
||||
with any script. If this is not the case, a more creative lookahead is
|
||||
needed. For example, if digits, underscore, and dots are permitted at
|
||||
the start:
|
||||
|
||||
\s+(?=[0-9_.]*\p{Latin})(*sr:\S+)
|
||||
|
||||
|
||||
In many cases, backtracking into a script run pattern fragment is not
|
||||
desirable. The script run can employ an atomic group to prevent this.
|
||||
Because this is a common requirement, a shorthand notation is provided
|
||||
by (*atomic_script_run: or (*asr:
|
||||
|
||||
(*asr:...) is the same as (*sr:(?>...))
|
||||
|
||||
Note that the atomic group is inside the script run. Putting it outside
|
||||
would not prevent backtracking into the script run pattern.
|
||||
|
||||
Support for script runs is not available if PCRE2 is compiled without
|
||||
Unicode support. A compile-time error is given if any of the above con-
|
||||
structs is encountered. Script runs are not supported by the alternate
|
||||
matching function, pcre2_dfa_match(), because they use the same mecha-
|
||||
nism as capturing parentheses.
|
||||
|
||||
Warning: The (*ACCEPT) control verb (see below) should not be used
|
||||
within a script run subpattern, because it causes an immediate exit
|
||||
from the subpattern, bypassing the script run checking.
|
||||
|
||||
|
||||
CONDITIONAL SUBPATTERNS
|
||||
|
||||
It is possible to cause the matching process to obey a subpattern con-
|
||||
|
@ -8814,6 +8875,10 @@ BACKTRACKING CONTROL
|
|||
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
|
||||
tured by the outer parentheses.
|
||||
|
||||
Warning: (*ACCEPT) should not be used within a script run subpattern,
|
||||
because it causes an immediate exit from the subpattern, bypassing the
|
||||
script run checking.
|
||||
|
||||
(*FAIL) or (*FAIL:NAME)
|
||||
|
||||
This verb causes a matching failure, forcing backtracking to occur. It
|
||||
|
@ -9205,7 +9270,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 24 September 2018
|
||||
Last updated: 12 October 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
@ -10413,6 +10478,15 @@ LOOKAHEAD AND LOOKBEHIND ASSERTIONS
|
|||
Each top-level branch of a lookbehind must be of a fixed length.
|
||||
|
||||
|
||||
SCRIPT RUNS
|
||||
|
||||
(*script_run:...) ) script run, can be backtracked into
|
||||
(*sr:...) )
|
||||
|
||||
(*atomic_script_run:...) ) atomic script run
|
||||
(*asr:...) )
|
||||
|
||||
|
||||
BACKREFERENCES
|
||||
|
||||
\n reference by number (can be ambiguous)
|
||||
|
@ -10525,7 +10599,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 24 September 2018
|
||||
Last updated: 10 October 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
@ -10633,6 +10707,108 @@ CASE-EQUIVALENCE IN UTF MODES
|
|||
such.
|
||||
|
||||
|
||||
SCRIPT RUNS
|
||||
|
||||
The pattern constructs (*script_run:...) and (*atomic_script_run:...),
|
||||
with synonyms (*sr:...) and (*asr:...), verify that the string matched
|
||||
within the parentheses is a script run. In concept, a script run is a
|
||||
sequence of characters that are all from the same Unicode script. How-
|
||||
ever, because some scripts are commonly used together, and because some
|
||||
diacritical and other marks are used with multiple scripts, it is not
|
||||
that simple.
|
||||
|
||||
Every Unicode character has a Script property, mostly with a value cor-
|
||||
responding to the name of a script, such as Latin, Greek, or Cyrillic.
|
||||
There are also three special values:
|
||||
|
||||
"Unknown" is used for code points that have not been assigned, and also
|
||||
for the surrogate code points. In the PCRE2 32-bit library, characters
|
||||
whose code points are greater than the Unicode maximum (U+10FFFF),
|
||||
which are accessible only in non-UTF mode, are assigned the Unknown
|
||||
script.
|
||||
|
||||
"Common" is used for characters that are used with many scripts. These
|
||||
include punctuation, emoji, mathematical, musical, and currency sym-
|
||||
bols, and the ASCII digits 0 to 9.
|
||||
|
||||
"Inherited" is used for characters such as diacritical marks that mod-
|
||||
ify a previous character. These are considered to take on the script of
|
||||
the character that they modify.
|
||||
|
||||
Some Inherited characters are used with many scripts, but many of them
|
||||
are only normally used with a small number of scripts. For example,
|
||||
U+102E0 (Coptic Epact thousands mark) is used only with Arabic and Cop-
|
||||
tic. In order to make it possible to check this, a Unicode property
|
||||
called Script Extension exists. Its value is a list of scripts that
|
||||
apply to the character. For the majority of characters, the list con-
|
||||
tains just one script, the same one as the Script property. However,
|
||||
for characters such as U+102E0 more than one Script is listed. There
|
||||
are also some Common characters that have a single, non-Common script
|
||||
in their Script Extension list.
|
||||
|
||||
The next section describes the basic rules for deciding whether a given
|
||||
string of characters is a script run. Note, however, that there are
|
||||
some special cases involving the Chinese Han script, and an additional
|
||||
constraint for decimal digits. These are covered in subsequent sec-
|
||||
tions.
|
||||
|
||||
Basic script run rules
|
||||
|
||||
A string that is less than two characters long is a script run. This is
|
||||
the only case in which an Unknown character can be part of a script
|
||||
run. Longer strings are checked using only the Script Extensions prop-
|
||||
erty, not the basic Script property.
|
||||
|
||||
If a character's Script Extension property is the single value "Inher-
|
||||
ited", it is always accepted as part of a script run. This is also true
|
||||
for the property "Common", subject to the checking of decimal digits
|
||||
described below. All the remaining characters in a script run must have
|
||||
at least one script in common in their Script Extension lists. In set-
|
||||
theoretic terminology, the intersection of all the sets of scripts must
|
||||
not be empty.
|
||||
|
||||
A simple example is an Internet name such as "google.com". The letters
|
||||
are all in the Latin script, and the dot is Common, so this string is a
|
||||
script run. However, the Cyrillic letter "o" looks exactly the same as
|
||||
the Latin "o"; a string that looks the same, but with Cyrillic "o"s is
|
||||
not a script run.
|
||||
|
||||
More interesting examples involve characters with more than one script
|
||||
in their Script Extension. Consider the following characters:
|
||||
|
||||
U+060C Arabic comma
|
||||
U+06D4 Arabic full stop
|
||||
|
||||
The first has the Script Extension list Arabic, Hanifi Rohingya, Syr-
|
||||
iac, and Thaana; the second has just Arabic and Hanifi Rohingya. Both
|
||||
of them could appear in script runs of either Arabic or Hanifi
|
||||
Rohingya. The first could also appear in Syriac or Thaana script runs,
|
||||
but the second could not.
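
The basic rule above can be sketched in a few lines of Python. This is only an illustration, not PCRE2's implementation: the SCRIPT_EXTENSIONS table is a small hand-written excerpt (the entries for the letters are assumptions made for this example), the function name is hypothetical, and the Han special cases and the decimal digit check are deliberately left out.

```python
# Hypothetical, hand-written excerpt of Script Extension data. A real checker
# would load the complete table from the Unicode Character Database.
SCRIPT_EXTENSIONS = {ch: {"Latin"} for ch in "abcdefghijklmnopqrstuvwxyz"}
SCRIPT_EXTENSIONS.update({
    ".": {"Common"},
    "\u043e": {"Cyrillic"},  # Cyrillic small letter o
    "\u0628": {"Arabic"},    # Arabic letter beh (assumed single-script)
    "\u0712": {"Syriac"},    # Syriac letter beth (assumed single-script)
    "\u060c": {"Arabic", "Hanifi_Rohingya", "Syriac", "Thaana"},  # Arabic comma
    "\u06d4": {"Arabic", "Hanifi_Rohingya"},                      # Arabic full stop
})


def is_basic_script_run(text, table=SCRIPT_EXTENSIONS):
    """Apply only the basic rule: the Script Extension sets must intersect."""
    if len(text) < 2:
        return True  # strings shorter than two characters always qualify
    common = None
    for ch in text:
        scripts = table.get(ch, {"Unknown"})  # untabulated means Unknown here
        if scripts == {"Inherited"} or scripts == {"Common"}:
            continue  # accepted with any script
        if "Unknown" in scripts:
            return False  # Unknown is allowed only in the short-string case
        common = set(scripts) if common is None else common & scripts
        if not common:
            return False  # the intersection became empty
    return True


print(is_basic_script_run("google.com"))
print(is_basic_script_run("g\u043e\u043egle.com"))
print(is_basic_script_run("\u060c\u0712"))
print(is_basic_script_run("\u06d4\u0712"))
```

The last two calls mirror the Arabic comma and full stop example: the comma can share a run with a Syriac letter, whereas the full stop cannot.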
|
||||
|
||||
The Chinese Han script
|
||||
|
||||
The Chinese Han script is commonly used in conjunction with other
|
||||
scripts for writing certain languages. Japanese uses the Hiragana and
|
||||
Katakana scripts together with Han; Korean uses Hangul and Han; Tai-
|
||||
wanese Mandarin uses Bopomofo and Han. These three combinations are
|
||||
treated as special cases when checking script runs and are, in effect,
|
||||
"virtual scripts". Thus, a script run may contain a mixture of Hira-
|
||||
gana, Katakana, and Han, or a mixture of Hangul and Han, or a mixture
|
||||
of Bopomofo and Han, but not, for example, a mixture of Hangul and
|
||||
Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical Stan-
|
||||
dard 39 ("Unicode Security Mechanisms", http://uni-
|
||||
code.org/reports/tr39/) in allowing such mixtures.
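
As a rough illustration of the "virtual scripts" idea, the following Python sketch tests whether a set of script names may appear together in one run. It simplifies by assuming that each character has already been reduced to a single script name, and it treats any subset of a permitted combination as acceptable; both points are assumptions of this sketch, and the names HAN_COMBINATIONS and scripts_may_mix are hypothetical.

```python
# The three permitted combinations described above.
HAN_COMBINATIONS = (
    {"Han", "Hiragana", "Katakana"},  # Japanese
    {"Han", "Hangul"},                # Korean
    {"Han", "Bopomofo"},              # Taiwanese Mandarin
)


def scripts_may_mix(scripts):
    """Return True if the given script names may share one script run."""
    scripts = set(scripts)
    if len(scripts) <= 1:
        return True  # a single script is always acceptable
    # Otherwise every script must fit inside one permitted combination.
    return any(scripts <= combo for combo in HAN_COMBINATIONS)


print(scripts_may_mix({"Han", "Hiragana", "Katakana"}))
print(scripts_may_mix({"Han", "Hangul", "Bopomofo"}))
```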
|
||||
|
||||
Decimal digits
|
||||
|
||||
Unicode contains many sets of 10 decimal digits in different scripts,
|
||||
and some scripts (including the Common script) contain more than one
|
||||
set. Some of these decimal digits are visually indistinguishable
|
||||
from the common ASCII digits. In addition to the script checking
|
||||
described above, if a script run contains any decimal digits, they must
|
||||
all come from the same set of 10 adjacent characters.
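
The digit constraint can be approximated with Python's standard unicodedata module. This is a sketch of the idea rather than PCRE2's code: it relies on the fact that each set of decimal digits (general category Nd) occupies ten consecutive code points, so subtracting a digit's numeric value from its code point gives the code point of the zero of its set. The function name is hypothetical.

```python
import unicodedata


def digits_from_one_set(text):
    """Return True if all decimal digits in text belong to the same set of ten."""
    zeros = set()
    for ch in text:
        if unicodedata.category(ch) == "Nd":
            # Code point of the zero digit of the set that ch belongs to.
            zeros.add(ord(ch) - unicodedata.decimal(ch))
    return len(zeros) <= 1


print(digits_from_one_set("2018"))
print(digits_from_one_set("20\u0661\u0668"))  # ASCII mixed with Arabic-Indic digits
```

Strings that contain no decimal digits pass trivially, which matches the wording "if a script run contains any decimal digits".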
|
||||
|
||||
|
||||
VALIDITY OF UTF STRINGS
|
||||
|
||||
When the PCRE2_UTF option is set, the strings passed as patterns and
|
||||
|
@ -10788,7 +10964,7 @@ AUTHOR
|
|||
|
||||
REVISION
|
||||
|
||||
Last updated: 02 September 2018
|
||||
Last updated: 12 October 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
------------------------------------------------------------------------------
|
||||
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2MATCHING 3 "29 September 2014" "PCRE2 10.00"
|
||||
.TH PCRE2MATCHING 3 "10 October 2018" "PCRE2 10.33"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 MATCHING ALGORITHMS"
|
||||
|
@ -113,7 +113,8 @@ do want multiple matches in such cases, either use an ungreedy repeat
|
|||
("a\ed+?") or set the PCRE2_NO_AUTO_POSSESS option when compiling.
|
||||
.P
|
||||
There are a number of features of PCRE2 regular expressions that are not
|
||||
supported by the alternative matching algorithm. They are as follows:
|
||||
supported or behave differently in the alternative matching function. Those
|
||||
that are not supported cause an error if encountered.
|
||||
.P
|
||||
1. Because the algorithm finds all possible matches, the greedy or ungreedy
|
||||
nature of repetition quantifiers is not relevant (though it may affect
|
||||
|
@ -135,24 +136,26 @@ possibilities, and PCRE2's implementation of this algorithm does not attempt to
|
|||
do this. This means that no captured substrings are available.
|
||||
.P
|
||||
3. Because no substrings are captured, backreferences within the pattern are
|
||||
not supported, and cause errors if encountered.
|
||||
not supported.
|
||||
.P
|
||||
4. For the same reason, conditional expressions that use a backreference as the
|
||||
condition or test for a specific group recursion are not supported.
|
||||
.P
|
||||
5. Because many paths through the tree may be active, the \eK escape sequence,
|
||||
which resets the start of the match when encountered (but may be on some paths
|
||||
and not on others), is not supported. It causes an error if encountered.
|
||||
5. Again for the same reason, script runs are not supported.
|
||||
.P
|
||||
6. Callouts are supported, but the value of the \fIcapture_top\fP field is
|
||||
6. Because many paths through the tree may be active, the \eK escape sequence,
|
||||
which resets the start of the match when encountered (but may be on some paths
|
||||
and not on others), is not supported.
|
||||
.P
|
||||
7. Callouts are supported, but the value of the \fIcapture_top\fP field is
|
||||
always 1, and the value of the \fIcapture_last\fP field is always 0.
|
||||
.P
|
||||
7. The \eC escape sequence, which (in the standard algorithm) always matches a
|
||||
8. The \eC escape sequence, which (in the standard algorithm) always matches a
|
||||
single code unit, even in a UTF mode, is not supported in these modes, because
|
||||
the alternative algorithm moves through the subject string one character (not
|
||||
code unit) at a time, for all active paths through the tree.
|
||||
.P
|
||||
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
|
||||
9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
|
||||
supported. (*FAIL) is supported, and behaves like a failing negative assertion.
|
||||
.
|
||||
.
|
||||
|
@ -188,7 +191,7 @@ The alternative algorithm suffers from a number of disadvantages:
|
|||
because it has to search for all possible matches, but is also because it is
|
||||
less susceptible to optimization.
|
||||
.P
|
||||
2. Capturing parentheses and backreferences are not supported.
|
||||
2. Capturing parentheses, backreferences, and script runs are not supported.
|
||||
.P
|
||||
3. Although atomic groups are supported, their use does not provide the
|
||||
performance advantage that it does for the standard algorithm.
|
||||
|
@ -208,6 +211,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 29 September 2014
|
||||
Copyright (c) 1997-2014 University of Cambridge.
|
||||
Last updated: 10 October 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2PATTERN 3 "24 September 2018" "PCRE2 10.33"
|
||||
.TH PCRE2PATTERN 3 "12 October 2018" "PCRE2 10.33"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||
|
@ -755,7 +755,7 @@ sequences that match characters with specific properties are available. In
|
|||
8-bit non-UTF-8 mode, these sequences are of course limited to testing
|
||||
characters whose code points are less than 256, but they do work in this mode.
|
||||
In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode limit)
|
||||
may be encountered. These are all treated as being in the Common script and
|
||||
may be encountered. These are all treated as being in the Unknown script and
|
||||
with an unassigned type. The extra escape sequences are:
|
||||
.sp
|
||||
\ep{\fIxx\fP} a character with the \fIxx\fP property
|
||||
|
@ -781,8 +781,10 @@ example:
|
|||
\ep{Greek}
|
||||
\eP{Han}
|
||||
.sp
|
||||
Those that are not part of an identified script are lumped together as
|
||||
"Common". The current list of scripts is:
|
||||
Unassigned characters (and in non-UTF 32-bit mode, characters with code points
|
||||
greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
|
||||
part of an identified script are lumped together as "Common". The current list
|
||||
of scripts is:
|
||||
.P
|
||||
Adlam,
|
||||
Ahom,
|
||||
|
@ -928,6 +930,7 @@ Tibetan,
|
|||
Tifinagh,
|
||||
Tirhuta,
|
||||
Ugaritic,
|
||||
Unknown,
|
||||
Vai,
|
||||
Warang_Citi,
|
||||
Yi,
|
||||
|
@ -2603,6 +2606,73 @@ is another pattern that matches "foo" preceded by three digits and any three
|
|||
characters that are not "999".
|
||||
.
|
||||
.
|
||||
.SH "SCRIPT RUNS"
|
||||
.rs
|
||||
.sp
|
||||
In concept, a script run is a sequence of characters that are all from the same
|
||||
Unicode script such as Latin or Greek. However, because some scripts are
|
||||
commonly used together, and because some diacritical and other marks are used
|
||||
with multiple scripts, it is not that simple. There is a full description of
|
||||
the rules that PCRE2 uses in the section entitled
|
||||
.\" HTML <a href="pcre2unicode.html#scriptruns">
|
||||
.\" </a>
|
||||
"Script Runs"
|
||||
.\"
|
||||
in the
|
||||
.\" HREF
|
||||
\fBpcre2unicode\fP
|
||||
.\"
|
||||
documentation.
|
||||
.P
|
||||
If part of a pattern is enclosed between (*script_run: or (*sr: and a closing
|
||||
parenthesis, it fails if the sequence of characters that it matches is not a
|
||||
script run. After a failure, normal backtracking occurs. Script runs can be
|
||||
used to detect spoofing attacks using characters that look the same, but are
|
||||
from different scripts. The string "paypal.com" is an infamous example, where
|
||||
the letters could be a mixture of Latin and Cyrillic. This pattern ensures that
|
||||
the matched characters in a sequence of non-spaces that follow white space are
|
||||
a script run:
|
||||
.sp
|
||||
\es+(*sr:\eS+)
|
||||
.sp
|
||||
To be sure that they are all from the Latin script (for example), a lookahead
|
||||
can be used:
|
||||
.sp
|
||||
\es+(?=\ep{Latin})(*sr:\eS+)
|
||||
.sp
|
||||
This works as long as the first character is expected to be a character in that
|
||||
script, and not (for example) punctuation, which is allowed with any script. If
|
||||
this is not the case, a more creative lookahead is needed. For example, if
|
||||
digits, underscore, and dots are permitted at the start:
|
||||
.sp
|
||||
\es+(?=[0-9_.]*\ep{Latin})(*sr:\eS+)
|
||||
.sp
|
||||
.P
|
||||
In many cases, backtracking into a script run pattern fragment is not
|
||||
desirable. The script run can employ an atomic group to prevent this. Because
|
||||
this is a common requirement, a shorthand notation is provided by
|
||||
(*atomic_script_run: or (*asr:
|
||||
.sp
|
||||
(*asr:...) is the same as (*sr:(?>...))
|
||||
.sp
|
||||
Note that the atomic group is inside the script run. Putting it outside would
|
||||
not prevent backtracking into the script run pattern.
|
||||
.P
|
||||
Support for script runs is not available if PCRE2 is compiled without Unicode
|
||||
support. A compile-time error is given if any of the above constructs is
|
||||
encountered. Script runs are not supported by the alternate matching function,
|
||||
\fBpcre2_dfa_match()\fP, because they use the same mechanism as capturing
|
||||
parentheses.
|
||||
.P
|
||||
\fBWarning:\fP The (*ACCEPT) control verb
|
||||
.\" HTML <a href="#acceptverb">
|
||||
.\" </a>
|
||||
(see below)
|
||||
.\"
|
||||
should not be used within a script run subpattern, because it causes an
|
||||
immediate exit from the subpattern, bypassing the script run checking.
|
||||
.
|
||||
.
|
||||
.\" HTML <a name="conditions"></a>
|
||||
.SH "CONDITIONAL SUBPATTERNS"
|
||||
.rs
|
||||
|
@ -3267,6 +3337,7 @@ Experiments with Perl suggest that it too has similar optimizations, and like
|
|||
PCRE2, turning them off can change the result of a match.
|
||||
.
|
||||
.
|
||||
.\" HTML <a name="acceptverb"></a>
|
||||
.SS "Verbs that act immediately"
|
||||
.rs
|
||||
.sp
|
||||
|
@ -3287,6 +3358,10 @@ example:
|
|||
.sp
|
||||
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
|
||||
the outer parentheses.
|
||||
.P
|
||||
\fBWarning:\fP (*ACCEPT) should not be used within a script run subpattern,
|
||||
because it causes an immediate exit from the subpattern, bypassing the script
|
||||
run checking.
|
||||
.sp
|
||||
(*FAIL) or (*FAIL:NAME)
|
||||
.sp
|
||||
|
@ -3692,6 +3767,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 24 September 2018
|
||||
Last updated: 12 October 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2SYNTAX 3 "24 September 2018" "PCRE2 10.33"
|
||||
.TH PCRE2SYNTAX 3 "10 October 2018" "PCRE2 10.33"
|
||||
.SH NAME
|
||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
||||
|
@ -511,6 +511,16 @@ setting with a similar syntax.
|
|||
Each top-level branch of a lookbehind must be of a fixed length.
|
||||
.
|
||||
.
|
||||
.SH "SCRIPT RUNS"
|
||||
.rs
|
||||
.sp
|
||||
(*script_run:...) ) script run, can be backtracked into
|
||||
(*sr:...) )
|
||||
.sp
|
||||
(*atomic_script_run:...) ) atomic script run
|
||||
(*asr:...) )
|
||||
.
|
||||
.
|
||||
.SH "BACKREFERENCES"
|
||||
.rs
|
||||
.sp
|
||||
|
@ -633,6 +643,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 24 September 2018
|
||||
Last updated: 10 October 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
.TH PCRE2UNICODE 3 "02 September 2018" "PCRE2 10.32"
|
||||
.TH PCRE2UNICODE 3 "12 October 2018" "PCRE2 10.33"
|
||||
.SH NAME
|
||||
PCRE - Perl-compatible regular expressions (revised API)
|
||||
.SH "UNICODE AND UTF SUPPORT"
|
||||
|
@ -118,6 +118,108 @@ few Unicode characters such as Greek sigma have more than two code points that
|
|||
are case-equivalent, and these are treated as such.
|
||||
.
|
||||
.
|
||||
.\" HTML <a name="scriptruns"></a>
|
||||
.SH "SCRIPT RUNS"
|
||||
.rs
|
||||
.sp
|
||||
The pattern constructs (*script_run:...) and (*atomic_script_run:...), with
|
||||
synonyms (*sr:...) and (*asr:...), verify that the string matched within the
|
||||
parentheses is a script run. In concept, a script run is a sequence of
|
||||
characters that are all from the same Unicode script. However, because some
|
||||
scripts are commonly used together, and because some diacritical and other
|
||||
marks are used with multiple scripts, it is not that simple.
|
||||
.P
|
||||
Every Unicode character has a Script property, mostly with a value
|
||||
corresponding to the name of a script, such as Latin, Greek, or Cyrillic. There
|
||||
are also three special values:
|
||||
.P
|
||||
"Unknown" is used for code points that have not been assigned, and also for the
|
||||
surrogate code points. In the PCRE2 32-bit library, characters whose code
|
||||
points are greater than the Unicode maximum (U+10FFFF), which are accessible
|
||||
only in non-UTF mode, are assigned the Unknown script.
|
||||
.P
|
||||
"Common" is used for characters that are used with many scripts. These include
|
||||
punctuation, emoji, mathematical, musical, and currency symbols, and the ASCII
|
||||
digits 0 to 9.
|
||||
.P
|
||||
"Inherited" is used for characters such as diacritical marks that modify a
|
||||
previous character. These are considered to take on the script of the character
|
||||
that they modify.
|
||||
.P
|
||||
Some Inherited characters are used with many scripts, but many of them are only
|
||||
normally used with a small number of scripts. For example, U+102E0 (Coptic
|
||||
Epact thousands mark) is used only with Arabic and Coptic. In order to make it
|
||||
possible to check this, a Unicode property called Script Extension exists. Its
|
||||
value is a list of scripts that apply to the character. For the majority of
|
||||
characters, the list contains just one script, the same one as the Script
|
||||
property. However, for characters such as U+102E0 more than one Script is
|
||||
listed. There are also some Common characters that have a single, non-Common
|
||||
script in their Script Extension list.
|
||||
.P
|
||||
The next section describes the basic rules for deciding whether a given string
|
||||
of characters is a script run. Note, however, that there are some special cases
|
||||
involving the Chinese Han script, and an additional constraint for decimal
|
||||
digits. These are covered in subsequent sections.
|
||||
.
|
||||
.
|
||||
.SS "Basic script run rules"
|
||||
.rs
|
||||
.sp
|
||||
A string that is less than two characters long is a script run. This is the
|
||||
only case in which an Unknown character can be part of a script run. Longer
|
||||
strings are checked using only the Script Extensions property, not the basic
|
||||
Script property.
|
||||
.P
|
||||
If a character's Script Extension property is the single value "Inherited", it
|
||||
is always accepted as part of a script run. This is also true for the property
|
||||
"Common", subject to the checking of decimal digits described below. All the
|
||||
remaining characters in a script run must have at least one script in common in
|
||||
their Script Extension lists. In set-theoretic terminology, the intersection of
|
||||
all the sets of scripts must not be empty.
|
||||
.P
|
||||
A simple example is an Internet name such as "google.com". The letters are all
|
||||
in the Latin script, and the dot is Common, so this string is a script run.
|
||||
However, the Cyrillic letter "o" looks exactly the same as the Latin "o"; a
|
||||
string that looks the same, but with Cyrillic "o"s is not a script run.
|
||||
.P
|
||||
More interesting examples involve characters with more than one script in their
|
||||
Script Extension. Consider the following characters:
|
||||
.sp
|
||||
U+060C Arabic comma
|
||||
U+06D4 Arabic full stop
|
||||
.sp
|
||||
The first has the Script Extension list Arabic, Hanifi Rohingya, Syriac, and
|
||||
Thaana; the second has just Arabic and Hanifi Rohingya. Both of them could
|
||||
appear in script runs of either Arabic or Hanifi Rohingya. The first could also
|
||||
appear in Syriac or Thaana script runs, but the second could not.
|
||||
.
|
||||
.
|
||||
.SS "The Chinese Han script"
|
||||
.rs
|
||||
.sp
|
||||
The Chinese Han script is commonly used in conjunction with other scripts for
|
||||
writing certain languages. Japanese uses the Hiragana and Katakana scripts
|
||||
together with Han; Korean uses Hangul and Han; Taiwanese Mandarin uses Bopomofo
|
||||
and Han. These three combinations are treated as special cases when checking
|
||||
script runs and are, in effect, "virtual scripts". Thus, a script run may
|
||||
contain a mixture of Hiragana, Katakana, and Han, or a mixture of Hangul and
|
||||
Han, or a mixture of Bopomofo and Han, but not, for example, a mixture of
|
||||
Hangul and Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical
|
||||
Standard 39 ("Unicode Security Mechanisms", http://unicode.org/reports/tr39/)
|
||||
in allowing such mixtures.
|
||||
.
|
||||
.
|
||||
.SS "Decimal digits"
|
||||
.rs
|
||||
.sp
|
||||
Unicode contains many sets of 10 decimal digits in different scripts, and some
|
||||
scripts (including the Common script) contain more than one set. Some of these
|
||||
decimal digits are visually indistinguishable from the common ASCII
|
||||
digits. In addition to the script checking described above, if a script run
|
||||
contains any decimal digits, they must all come from the same set of 10
|
||||
adjacent characters.
|
||||
.
|
||||
.
|
||||
.SH "VALIDITY OF UTF STRINGS"
|
||||
.rs
|
||||
.sp
|
||||
|
@ -285,6 +387,6 @@ Cambridge, England.
|
|||
.rs
|
||||
.sp
|
||||
.nf
|
||||
Last updated: 02 September 2018
|
||||
Last updated: 12 October 2018
|
||||
Copyright (c) 1997-2018 University of Cambridge.
|
||||
.fi
|
||||
|
|
|
@ -2410,6 +2410,7 @@
|
|||
\x{3031}\x{3041}\x{30a1}\x{2e80} [Hira Kata] Hira Kata Han
|
||||
\x{060c}\x{06d4}\x{0600}\x{10d00}\x{0700} [Arab Rohg Syrc Thaa] [Arab Rohg] Arab Rohg Syrc
|
||||
\x{060c}\x{06d4}\x{0700}\x{0600}\x{10d00} [Arab Rohg Syrc Thaa] [Arab Rohg] Syrc Arab Rohg
|
||||
\x{2e80}\x{3041}\x{3001}\x{3031}\x{2e80} Han Hira [Bopo, Han, etc] [Hira Kata] Han
|
||||
|
||||
/(?<!)(*sr:)/
|
||||
|
||||
|
|
|
@ -2112,6 +2112,9 @@
|
|||
/^(*sr:.*)/B,utf
|
||||
paypаl.com A classic example of why script run checks are a good thing
|
||||
|
||||
/^(*sr:.*(*ACCEPT))/utf
|
||||
paypаl.com But *ACCEPT breaks things
|
||||
|
||||
/^(*sr:\x{2e80}*)/B,utf
|
||||
|
||||
/^(*sr:\x{2e80}*)\x{2e80}/B,utf
|
||||
|
|
|
@ -3902,6 +3902,8 @@ No match
|
|||
0: \x{60c}\x{6d4}\x{600}
|
||||
\x{060c}\x{06d4}\x{0700}\x{0600}\x{10d00} [Arab Rohg Syrc Thaa] [Arab Rohg] Syrc Arab Rohg
|
||||
0: \x{60c}\x{6d4}
|
||||
\x{2e80}\x{3041}\x{3001}\x{3031}\x{2e80} Han Hira [Bopo, Han, etc] [Hira Kata] Han
|
||||
0: \x{2e80}\x{3041}\x{3001}\x{3031}\x{2e80}
|
||||
|
||||
/(?<!)(*sr:)/
|
||||
|
||||
|
|
|
@ -4791,6 +4791,10 @@ MK: ABC
|
|||
paypаl.com A classic example of why script run checks are a good thing
|
||||
0: payp
|
||||
|
||||
/^(*sr:.*(*ACCEPT))/utf
|
||||
paypаl.com But *ACCEPT breaks things
|
||||
0: payp\x{430}l.com But *ACCEPT breaks things
|
||||
|
||||
/^(*sr:\x{2e80}*)/B,utf
|
||||
------------------------------------------------------------------
|
||||
Bra
|
||||
|
|