Documentation and tests update for script runs.
parent 4e7a204d18
commit 0fc5cda13b
|
@ -31,8 +31,7 @@ hexadecimal digit" bit was removed. The default tables in
|
||||||
src/pcre2_chartables.c.dist are updated.
|
src/pcre2_chartables.c.dist are updated.
|
||||||
|
|
||||||
8. Implement the new Perl "script run" features (*script_run:...) and
|
8. Implement the new Perl "script run" features (*script_run:...) and
|
||||||
(*atomic_script_run:...) aka (*sr:...) and (*asr:...). At present, this is
|
(*atomic_script_run:...) aka (*sr:...) and (*asr:...).
|
||||||
not yet documented.
|
|
||||||
|
|
||||||
|
|
||||||
Version 10.32 10-September-2018
|
Version 10.32 10-September-2018
|
||||||
|
|
|
@ -134,7 +134,8 @@ do want multiple matches in such cases, either use an ungreedy repeat
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
There are a number of features of PCRE2 regular expressions that are not
|
There are a number of features of PCRE2 regular expressions that are not
|
||||||
supported by the alternative matching algorithm. They are as follows:
|
supported or behave differently in the alternative matching function. Those
|
||||||
|
that are not supported cause an error if encountered.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
1. Because the algorithm finds all possible matches, the greedy or ungreedy
|
1. Because the algorithm finds all possible matches, the greedy or ungreedy
|
||||||
|
@ -159,29 +160,32 @@ do this. This means that no captured substrings are available.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
3. Because no substrings are captured, backreferences within the pattern are
|
3. Because no substrings are captured, backreferences within the pattern are
|
||||||
not supported, and cause errors if encountered.
|
not supported.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
4. For the same reason, conditional expressions that use a backreference as the
|
4. For the same reason, conditional expressions that use a backreference as the
|
||||||
condition or test for a specific group recursion are not supported.
|
condition or test for a specific group recursion are not supported.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
5. Because many paths through the tree may be active, the \K escape sequence,
|
5. Again for the same reason, script runs are not supported.
|
||||||
which resets the start of the match when encountered (but may be on some paths
|
|
||||||
and not on others), is not supported. It causes an error if encountered.
|
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
6. Callouts are supported, but the value of the <i>capture_top</i> field is
|
6. Because many paths through the tree may be active, the \K escape sequence,
|
||||||
|
which resets the start of the match when encountered (but may be on some paths
|
||||||
|
and not on others), is not supported.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
7. Callouts are supported, but the value of the <i>capture_top</i> field is
|
||||||
always 1, and the value of the <i>capture_last</i> field is always 0.
|
always 1, and the value of the <i>capture_last</i> field is always 0.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
7. The \C escape sequence, which (in the standard algorithm) always matches a
|
8. The \C escape sequence, which (in the standard algorithm) always matches a
|
||||||
single code unit, even in a UTF mode, is not supported in these modes, because
|
single code unit, even in a UTF mode, is not supported in these modes, because
|
||||||
the alternative algorithm moves through the subject string one character (not
|
the alternative algorithm moves through the subject string one character (not
|
||||||
code unit) at a time, for all active paths through the tree.
|
code unit) at a time, for all active paths through the tree.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
|
9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
|
||||||
supported. (*FAIL) is supported, and behaves like a failing negative assertion.
|
supported. (*FAIL) is supported, and behaves like a failing negative assertion.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
|
<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
|
||||||
|
@ -215,7 +219,7 @@ because it has to search for all possible matches, but is also because it is
|
||||||
less susceptible to optimization.
|
less susceptible to optimization.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
2. Capturing parentheses and backreferences are not supported.
|
2. Capturing parentheses, backreferences, and script runs are not supported.
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
3. Although atomic groups are supported, their use does not provide the
|
3. Although atomic groups are supported, their use does not provide the
|
||||||
|
@ -232,9 +236,9 @@ Cambridge, England.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 29 September 2014
|
Last updated: 10 October 2018
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2014 University of Cambridge.
|
Copyright © 1997-2018 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
<p>
|
<p>
|
||||||
Return to the <a href="index.html">PCRE2 index page</a>.
|
Return to the <a href="index.html">PCRE2 index page</a>.
|
||||||
|
|
|
@ -33,16 +33,17 @@ please consult the man page, in case the conversion went wrong.
|
||||||
<li><a name="TOC18" href="#SEC18">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
|
<li><a name="TOC18" href="#SEC18">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
|
||||||
<li><a name="TOC19" href="#SEC19">BACKREFERENCES</a>
|
<li><a name="TOC19" href="#SEC19">BACKREFERENCES</a>
|
||||||
<li><a name="TOC20" href="#SEC20">ASSERTIONS</a>
|
<li><a name="TOC20" href="#SEC20">ASSERTIONS</a>
|
||||||
<li><a name="TOC21" href="#SEC21">CONDITIONAL SUBPATTERNS</a>
|
<li><a name="TOC21" href="#SEC21">SCRIPT RUNS</a>
|
||||||
<li><a name="TOC22" href="#SEC22">COMMENTS</a>
|
<li><a name="TOC22" href="#SEC22">CONDITIONAL SUBPATTERNS</a>
|
||||||
<li><a name="TOC23" href="#SEC23">RECURSIVE PATTERNS</a>
|
<li><a name="TOC23" href="#SEC23">COMMENTS</a>
|
||||||
<li><a name="TOC24" href="#SEC24">SUBPATTERNS AS SUBROUTINES</a>
|
<li><a name="TOC24" href="#SEC24">RECURSIVE PATTERNS</a>
|
||||||
<li><a name="TOC25" href="#SEC25">ONIGURUMA SUBROUTINE SYNTAX</a>
|
<li><a name="TOC25" href="#SEC25">SUBPATTERNS AS SUBROUTINES</a>
|
||||||
<li><a name="TOC26" href="#SEC26">CALLOUTS</a>
|
<li><a name="TOC26" href="#SEC26">ONIGURUMA SUBROUTINE SYNTAX</a>
|
||||||
<li><a name="TOC27" href="#SEC27">BACKTRACKING CONTROL</a>
|
<li><a name="TOC27" href="#SEC27">CALLOUTS</a>
|
||||||
<li><a name="TOC28" href="#SEC28">SEE ALSO</a>
|
<li><a name="TOC28" href="#SEC28">BACKTRACKING CONTROL</a>
|
||||||
<li><a name="TOC29" href="#SEC29">AUTHOR</a>
|
<li><a name="TOC29" href="#SEC29">SEE ALSO</a>
|
||||||
<li><a name="TOC30" href="#SEC30">REVISION</a>
|
<li><a name="TOC30" href="#SEC30">AUTHOR</a>
|
||||||
|
<li><a name="TOC31" href="#SEC31">REVISION</a>
|
||||||
</ul>
|
</ul>
|
||||||
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION DETAILS</a><br>
|
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION DETAILS</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -756,7 +757,7 @@ sequences that match characters with specific properties are available. In
|
||||||
8-bit non-UTF-8 mode, these sequences are of course limited to testing
|
8-bit non-UTF-8 mode, these sequences are of course limited to testing
|
||||||
characters whose code points are less than 256, but they do work in this mode.
|
characters whose code points are less than 256, but they do work in this mode.
|
||||||
In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode limit)
|
In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode limit)
|
||||||
may be encountered. These are all treated as being in the Common script and
|
may be encountered. These are all treated as being in the Unknown script and
|
||||||
with an unassigned type. The extra escape sequences are:
|
with an unassigned type. The extra escape sequences are:
|
||||||
<pre>
|
<pre>
|
||||||
\p{<i>xx</i>} a character with the <i>xx</i> property
|
\p{<i>xx</i>} a character with the <i>xx</i> property
|
||||||
|
@ -780,8 +781,10 @@ example:
|
||||||
\p{Greek}
|
\p{Greek}
|
||||||
\P{Han}
|
\P{Han}
|
||||||
</pre>
|
</pre>
|
||||||
Those that are not part of an identified script are lumped together as
|
Unassigned characters (and in non-UTF 32-bit mode, characters with code points
|
||||||
"Common". The current list of scripts is:
|
greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
|
||||||
|
part of an identified script are lumped together as "Common". The current list
|
||||||
|
of scripts is:
|
||||||
</P>
|
</P>
|
||||||
<P>
|
<P>
|
||||||
Adlam,
|
Adlam,
|
||||||
|
@ -928,6 +931,7 @@ Tibetan,
|
||||||
Tifinagh,
|
Tifinagh,
|
||||||
Tirhuta,
|
Tirhuta,
|
||||||
Ugaritic,
|
Ugaritic,
|
||||||
|
Unknown,
|
||||||
Vai,
|
Vai,
|
||||||
Warang_Citi,
|
Warang_Citi,
|
||||||
Yi,
|
Yi,
|
||||||
|
@ -2589,8 +2593,70 @@ preceded by "foo", while
|
||||||
</pre>
|
</pre>
|
||||||
is another pattern that matches "foo" preceded by three digits and any three
|
is another pattern that matches "foo" preceded by three digits and any three
|
||||||
characters that are not "999".
|
characters that are not "999".
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC21" href="#TOC1">SCRIPT RUNS</a><br>
|
||||||
|
<P>
|
||||||
|
In concept, a script run is a sequence of characters that are all from the same
|
||||||
|
Unicode script such as Latin or Greek. However, because some scripts are
|
||||||
|
commonly used together, and because some diacritical and other marks are used
|
||||||
|
with multiple scripts, it is not that simple. There is a full description of
|
||||||
|
the rules that PCRE2 uses in the section entitled
|
||||||
|
<a href="pcre2unicode.html#scriptruns">"Script Runs"</a>
|
||||||
|
in the
|
||||||
|
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
|
||||||
|
documentation.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
If part of a pattern is enclosed between (*script_run: or (*sr: and a closing
|
||||||
|
parenthesis, it fails if the sequence of characters that it matches is not a
|
||||||
|
script run. After a failure, normal backtracking occurs. Script runs can be
|
||||||
|
used to detect spoofing attacks using characters that look the same, but are
|
||||||
|
from different scripts. The string "paypal.com" is an infamous example, where
|
||||||
|
the letters could be a mixture of Latin and Cyrillic. This pattern ensures that
|
||||||
|
the matched characters in a sequence of non-spaces that follow white space are
|
||||||
|
a script run:
|
||||||
|
<pre>
|
||||||
|
\s+(*sr:\S+)
|
||||||
|
</pre>
|
||||||
|
To be sure that they are all from the Latin script (for example), a lookahead
|
||||||
|
can be used:
|
||||||
|
<pre>
|
||||||
|
\s+(?=\p{Latin})(*sr:\S+)
|
||||||
|
</pre>
|
||||||
|
This works as long as the first character is expected to be a character in that
|
||||||
|
script, and not (for example) punctuation, which is allowed with any script. If
|
||||||
|
this is not the case, a more creative lookahead is needed. For example, if
|
||||||
|
digits, underscore, and dots are permitted at the start:
|
||||||
|
<pre>
|
||||||
|
\s+(?=[0-9_.]*\p{Latin})(*sr:\S+)
|
||||||
|
|
||||||
|
</PRE>
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
In many cases, backtracking into a script run pattern fragment is not
|
||||||
|
desirable. The script run can employ an atomic group to prevent this. Because
|
||||||
|
this is a common requirement, a shorthand notation is provided by
|
||||||
|
(*atomic_script_run: or (*asr: as follows:
|
||||||
|
<pre>
|
||||||
|
(*asr:...) is the same as (*sr:(?>...))
|
||||||
|
</pre>
|
||||||
|
Note that the atomic group is inside the script run. Putting it outside would
|
||||||
|
not prevent backtracking into the script run pattern.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
Support for script runs is not available if PCRE2 is compiled without Unicode
|
||||||
|
support. A compile-time error is given if any of the above constructs is
|
||||||
|
encountered. Script runs are not supported by the alternate matching function,
|
||||||
|
<b>pcre2_dfa_match()</b>, because they use the same mechanism as capturing
|
||||||
|
parentheses.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
<b>Warning:</b> The (*ACCEPT) control verb
|
||||||
|
<a href="#acceptverb">(see below)</a>
|
||||||
|
should not be used within a script run subpattern, because it causes an
|
||||||
|
immediate exit from the subpattern, bypassing the script run checking.
|
||||||
<a name="conditions"></a></P>
|
<a name="conditions"></a></P>
|
||||||
<br><a name="SEC21" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br>
|
<br><a name="SEC22" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br>
|
||||||
<P>
|
<P>
|
||||||
It is possible to cause the matching process to obey a subpattern
|
It is possible to cause the matching process to obey a subpattern
|
||||||
conditionally or to choose between two alternative subpatterns, depending on
|
conditionally or to choose between two alternative subpatterns, depending on
|
||||||
|
@ -2790,7 +2856,7 @@ positive and negative assertions, because matching always continues after the
|
||||||
assertion, whether it succeeds or fails. (Compare non-conditional assertions,
|
assertion, whether it succeeds or fails. (Compare non-conditional assertions,
|
||||||
when captures are retained only for positive assertions that succeed.)
|
when captures are retained only for positive assertions that succeed.)
|
||||||
<a name="comments"></a></P>
|
<a name="comments"></a></P>
|
||||||
<br><a name="SEC22" href="#TOC1">COMMENTS</a><br>
|
<br><a name="SEC23" href="#TOC1">COMMENTS</a><br>
|
||||||
<P>
|
<P>
|
||||||
There are two ways of including comments in patterns that are processed by
|
There are two ways of including comments in patterns that are processed by
|
||||||
PCRE2. In both cases, the start of the comment must not be in a character
|
PCRE2. In both cases, the start of the comment must not be in a character
|
||||||
|
@ -2820,7 +2886,7 @@ a newline in the pattern. The sequence \n is still literal at this stage, so
|
||||||
it does not terminate the comment. Only an actual character with the code value
|
it does not terminate the comment. Only an actual character with the code value
|
||||||
0x0a (the default newline) does so.
|
0x0a (the default newline) does so.
|
||||||
<a name="recursion"></a></P>
|
<a name="recursion"></a></P>
|
||||||
<br><a name="SEC23" href="#TOC1">RECURSIVE PATTERNS</a><br>
|
<br><a name="SEC24" href="#TOC1">RECURSIVE PATTERNS</a><br>
|
||||||
<P>
|
<P>
|
||||||
Consider the problem of matching a string in parentheses, allowing for
|
Consider the problem of matching a string in parentheses, allowing for
|
||||||
unlimited nested parentheses. Without the use of recursion, the best that can
|
unlimited nested parentheses. Without the use of recursion, the best that can
|
||||||
|
@ -3008,7 +3074,7 @@ alternative matches "a" and then recurses. In the recursion, \1 does now match
|
||||||
"b" and so the whole match succeeds. This match used to fail in Perl, but in
|
"b" and so the whole match succeeds. This match used to fail in Perl, but in
|
||||||
later versions (I tried 5.024) it now works.
|
later versions (I tried 5.024) it now works.
|
||||||
<a name="subpatternsassubroutines"></a></P>
|
<a name="subpatternsassubroutines"></a></P>
|
||||||
<br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
|
<br><a name="SEC25" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
|
||||||
<P>
|
<P>
|
||||||
If the syntax for a recursive subpattern call (either by number or by
|
If the syntax for a recursive subpattern call (either by number or by
|
||||||
name) is used outside the parentheses to which it refers, it operates a bit
|
name) is used outside the parentheses to which it refers, it operates a bit
|
||||||
|
@ -3057,7 +3123,7 @@ in subpatterns when called as subroutines is described in the section entitled
|
||||||
<a href="#btsub">"Backtracking verbs in subroutines"</a>
|
<a href="#btsub">"Backtracking verbs in subroutines"</a>
|
||||||
below.
|
below.
|
||||||
<a name="onigurumasubroutines"></a></P>
|
<a name="onigurumasubroutines"></a></P>
|
||||||
<br><a name="SEC25" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
|
<br><a name="SEC26" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
|
||||||
<P>
|
<P>
|
||||||
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
|
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
|
||||||
a number enclosed either in angle brackets or single quotes, is an alternative
|
a number enclosed either in angle brackets or single quotes, is an alternative
|
||||||
|
@ -3075,7 +3141,7 @@ plus or a minus sign it is taken as a relative reference. For example:
|
||||||
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are <i>not</i>
|
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are <i>not</i>
|
||||||
synonymous. The former is a backreference; the latter is a subroutine call.
|
synonymous. The former is a backreference; the latter is a subroutine call.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC26" href="#TOC1">CALLOUTS</a><br>
|
<br><a name="SEC27" href="#TOC1">CALLOUTS</a><br>
|
||||||
<P>
|
<P>
|
||||||
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
|
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
|
||||||
code to be obeyed in the middle of matching a regular expression. This makes it
|
code to be obeyed in the middle of matching a regular expression. This makes it
|
||||||
|
@ -3151,7 +3217,7 @@ example:
|
||||||
</pre>
|
</pre>
|
||||||
The doubling is removed before the string is passed to the callout function.
|
The doubling is removed before the string is passed to the callout function.
|
||||||
<a name="backtrackcontrol"></a></P>
|
<a name="backtrackcontrol"></a></P>
|
||||||
<br><a name="SEC27" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
<br><a name="SEC28" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
||||||
<P>
|
<P>
|
||||||
There are a number of special "Backtracking Control Verbs" (to use Perl's
|
There are a number of special "Backtracking Control Verbs" (to use Perl's
|
||||||
terminology) that modify the behaviour of backtracking during matching. They
|
terminology) that modify the behaviour of backtracking during matching. They
|
||||||
|
@ -3222,7 +3288,7 @@ documentation.
|
||||||
<P>
|
<P>
|
||||||
Experiments with Perl suggest that it too has similar optimizations, and like
|
Experiments with Perl suggest that it too has similar optimizations, and like
|
||||||
PCRE2, turning them off can change the result of a match.
|
PCRE2, turning them off can change the result of a match.
|
||||||
</P>
|
<a name="acceptverb"></a></P>
|
||||||
<br><b>
|
<br><b>
|
||||||
Verbs that act immediately
|
Verbs that act immediately
|
||||||
</b><br>
|
</b><br>
|
||||||
|
@ -3245,6 +3311,11 @@ example:
|
||||||
</pre>
|
</pre>
|
||||||
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
|
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
|
||||||
the outer parentheses.
|
the outer parentheses.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
<b>Warning:</b> (*ACCEPT) should not be used within a script run subpattern,
|
||||||
|
because it causes an immediate exit from the subpattern, bypassing the script
|
||||||
|
run checking.
|
||||||
<pre>
|
<pre>
|
||||||
(*FAIL) or (*FAIL:NAME)
|
(*FAIL) or (*FAIL:NAME)
|
||||||
</pre>
|
</pre>
|
||||||
|
@ -3644,12 +3715,12 @@ behaviour). However, if there is no such group within the subroutine
|
||||||
subpattern, the subroutine match fails and there is a backtrack at the outer
|
subpattern, the subroutine match fails and there is a backtrack at the outer
|
||||||
level.
|
level.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC28" href="#TOC1">SEE ALSO</a><br>
|
<br><a name="SEC29" href="#TOC1">SEE ALSO</a><br>
|
||||||
<P>
|
<P>
|
||||||
<b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2matching</b>(3),
|
<b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2matching</b>(3),
|
||||||
<b>pcre2syntax</b>(3), <b>pcre2</b>(3).
|
<b>pcre2syntax</b>(3), <b>pcre2</b>(3).
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC29" href="#TOC1">AUTHOR</a><br>
|
<br><a name="SEC30" href="#TOC1">AUTHOR</a><br>
|
||||||
<P>
|
<P>
|
||||||
Philip Hazel
|
Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
|
@ -3658,9 +3729,9 @@ University Computing Service
|
||||||
Cambridge, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC31" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 24 September 2018
|
Last updated: 12 October 2018
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2018 University of Cambridge.
|
Copyright © 1997-2018 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -32,14 +32,15 @@ please consult the man page, in case the conversion went wrong.
|
||||||
<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
|
<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
|
||||||
<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
|
<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
|
||||||
<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
|
<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
|
||||||
<li><a name="TOC20" href="#SEC20">BACKREFERENCES</a>
|
<li><a name="TOC20" href="#SEC20">SCRIPT RUNS</a>
|
||||||
<li><a name="TOC21" href="#SEC21">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
|
<li><a name="TOC21" href="#SEC21">BACKREFERENCES</a>
|
||||||
<li><a name="TOC22" href="#SEC22">CONDITIONAL PATTERNS</a>
|
<li><a name="TOC22" href="#SEC22">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
|
||||||
<li><a name="TOC23" href="#SEC23">BACKTRACKING CONTROL</a>
|
<li><a name="TOC23" href="#SEC23">CONDITIONAL PATTERNS</a>
|
||||||
<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
|
<li><a name="TOC24" href="#SEC24">BACKTRACKING CONTROL</a>
|
||||||
<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
|
<li><a name="TOC25" href="#SEC25">CALLOUTS</a>
|
||||||
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
|
<li><a name="TOC26" href="#SEC26">SEE ALSO</a>
|
||||||
<li><a name="TOC27" href="#SEC27">REVISION</a>
|
<li><a name="TOC27" href="#SEC27">AUTHOR</a>
|
||||||
|
<li><a name="TOC28" href="#SEC28">REVISION</a>
|
||||||
</ul>
|
</ul>
|
||||||
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
|
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
|
||||||
<P>
|
<P>
|
||||||
|
@ -533,7 +534,17 @@ setting with a similar syntax.
|
||||||
</pre>
|
</pre>
|
||||||
Each top-level branch of a lookbehind must be of a fixed length.
|
Each top-level branch of a lookbehind must be of a fixed length.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC20" href="#TOC1">BACKREFERENCES</a><br>
|
<br><a name="SEC20" href="#TOC1">SCRIPT RUNS</a><br>
|
||||||
|
<P>
|
||||||
|
<pre>
|
||||||
|
(*script_run:...) ) script run, can be backtracked into
|
||||||
|
(*sr:...) )
|
||||||
|
|
||||||
|
(*atomic_script_run:...) ) atomic script run
|
||||||
|
(*asr:...) )
|
||||||
|
</PRE>
|
||||||
|
</P>
|
||||||
|
<br><a name="SEC21" href="#TOC1">BACKREFERENCES</a><br>
|
||||||
<P>
|
<P>
|
||||||
<pre>
|
<pre>
|
||||||
\n reference by number (can be ambiguous)
|
\n reference by number (can be ambiguous)
|
||||||
|
@ -550,7 +561,7 @@ Each top-level branch of a lookbehind must be of a fixed length.
|
||||||
(?P=name) reference by name (Python)
|
(?P=name) reference by name (Python)
|
||||||
</PRE>
|
</PRE>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC21" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
|
<br><a name="SEC22" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
|
||||||
<P>
|
<P>
|
||||||
<pre>
|
<pre>
|
||||||
(?R) recurse whole pattern
|
(?R) recurse whole pattern
|
||||||
|
@ -569,7 +580,7 @@ Each top-level branch of a lookbehind must be of a fixed length.
|
||||||
\g'-n' call subpattern by relative number (PCRE2 extension)
|
\g'-n' call subpattern by relative number (PCRE2 extension)
|
||||||
</PRE>
|
</PRE>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC22" href="#TOC1">CONDITIONAL PATTERNS</a><br>
|
<br><a name="SEC23" href="#TOC1">CONDITIONAL PATTERNS</a><br>
|
||||||
<P>
|
<P>
|
||||||
<pre>
|
<pre>
|
||||||
(?(condition)yes-pattern)
|
(?(condition)yes-pattern)
|
||||||
|
@ -592,7 +603,7 @@ Note the ambiguity of (?(R) and (?(Rn) which might be named reference
|
||||||
conditions or recursion tests. Such a condition is interpreted as a reference
|
conditions or recursion tests. Such a condition is interpreted as a reference
|
||||||
condition if the relevant named group exists.
|
condition if the relevant named group exists.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
<br><a name="SEC24" href="#TOC1">BACKTRACKING CONTROL</a><br>
|
||||||
<P>
|
<P>
|
||||||
All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
|
All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
|
||||||
name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
|
name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
|
||||||
|
@ -619,7 +630,7 @@ pattern is not anchored.
|
||||||
The effect of one of these verbs in a group called as a subroutine is confined
|
The effect of one of these verbs in a group called as a subroutine is confined
|
||||||
to the subroutine call.
|
to the subroutine call.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
|
<br><a name="SEC25" href="#TOC1">CALLOUTS</a><br>
|
||||||
<P>
|
<P>
|
||||||
<pre>
|
<pre>
|
||||||
(?C) callout (assumed number 0)
|
(?C) callout (assumed number 0)
|
||||||
|
@ -630,12 +641,12 @@ The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
|
||||||
start and the end), and the starting delimiter { matched with the ending
|
start and the end), and the starting delimiter { matched with the ending
|
||||||
delimiter }. To encode the ending delimiter within the string, double it.
|
delimiter }. To encode the ending delimiter within the string, double it.
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
|
<br><a name="SEC26" href="#TOC1">SEE ALSO</a><br>
|
||||||
<P>
|
<P>
|
||||||
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
|
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
|
||||||
<b>pcre2matching</b>(3), <b>pcre2</b>(3).
|
<b>pcre2matching</b>(3), <b>pcre2</b>(3).
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
|
<br><a name="SEC27" href="#TOC1">AUTHOR</a><br>
|
||||||
<P>
|
<P>
|
||||||
Philip Hazel
|
Philip Hazel
|
||||||
<br>
|
<br>
|
||||||
|
@ -644,9 +655,9 @@ University Computing Service
|
||||||
Cambridge, England.
|
Cambridge, England.
|
||||||
<br>
|
<br>
|
||||||
</P>
|
</P>
|
||||||
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
|
<br><a name="SEC28" href="#TOC1">REVISION</a><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 24 September 2018
|
Last updated: 10 October 2018
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2018 University of Cambridge.
|
Copyright © 1997-2018 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
|
@ -124,6 +124,116 @@ for characters whose code points are less than 128 and that have at most two
|
||||||
case-equivalent values. For these, a direct table lookup is used for speed. A
|
case-equivalent values. For these, a direct table lookup is used for speed. A
|
||||||
few Unicode characters such as Greek sigma have more than two code points that
|
few Unicode characters such as Greek sigma have more than two code points that
|
||||||
are case-equivalent, and these are treated as such.
|
are case-equivalent, and these are treated as such.
|
||||||
|
<a name="scriptruns"></a></P>
|
||||||
|
<br><b>
|
||||||
|
SCRIPT RUNS
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
The pattern constructs (*script_run:...) and (*atomic_script_run:...), with
|
||||||
|
synonyms (*sr:...) and (*asr:...), verify that the string matched within the
|
||||||
|
parentheses is a script run. In concept, a script run is a sequence of
|
||||||
|
characters that are all from the same Unicode script. However, because some
|
||||||
|
scripts are commonly used together, and because some diacritical and other
|
||||||
|
marks are used with multiple scripts, it is not that simple.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
Every Unicode character has a Script property, mostly with a value
|
||||||
|
corresponding to the name of a script, such as Latin, Greek, or Cyrillic. There
|
||||||
|
are also three special values:
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
"Unknown" is used for code points that have not been assigned, and also for the
|
||||||
|
surrogate code points. In the PCRE2 32-bit library, characters whose code
|
||||||
|
points are greater than the Unicode maximum (U+10FFFF), which are accessible
|
||||||
|
only in non-UTF mode, are assigned the Unknown script.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
"Common" is used for characters that are used with many scripts. These include
|
||||||
|
punctuation, emoji, mathematical, musical, and currency symbols, and the ASCII
|
||||||
|
digits 0 to 9.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
"Inherited" is used for characters such as diacritical marks that modify a
|
||||||
|
previous character. These are considered to take on the script of the character
|
||||||
|
that they modify.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
Some Inherited characters are used with many scripts, but many of them are
|
||||||
|
normally used with only a small number of scripts. For example, U+102E0 (Coptic
|
||||||
|
Epact thousands mark) is used only with Arabic and Coptic. In order to make it
|
||||||
|
possible to check this, a Unicode property called Script Extension exists. Its
|
||||||
|
value is a list of scripts that apply to the character. For the majority of
|
||||||
|
characters, the list contains just one script, the same one as the Script
|
||||||
|
property. However, for characters such as U+102E0, more than one script is
|
||||||
|
listed. There are also some Common characters that have a single, non-Common
|
||||||
|
script in their Script Extension list.
|
||||||
|
</P>
|
||||||
|
<P>
|
||||||
|
The next section describes the basic rules for deciding whether a given string
|
||||||
|
of characters is a script run. Note, however, that there are some special cases
|
||||||
|
involving the Chinese Han script, and an additional constraint for decimal
|
||||||
|
digits. These are covered in subsequent sections.
|
||||||
|
</P>
|
||||||
|
<br><b>
|
||||||
|
Basic script run rules
|
||||||
|
</b><br>
|
||||||
|
<P>
|
||||||
|
A string that is less than two characters long is a script run. This is the
|
||||||
|
only case in which an Unknown character can be part of a script run. Longer
|
||||||
|
strings are checked using only the Script Extensions property, not the basic
|
||||||
|
Script property.
|
||||||
|
</P>
<P>
If a character's Script Extension property is the single value "Inherited", it
is always accepted as part of a script run. This is also true for the property
"Common", subject to the checking of decimal digits described below. All the
remaining characters in a script run must have at least one script in common in
their Script Extension lists. In set-theoretic terminology, the intersection of
all the sets of scripts must not be empty.
</P>
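<P>
The basic rule can be sketched in a few lines of Python. This is only an
illustration of the intersection test described above, not PCRE2's
implementation: the function name is_script_run and the idea of passing one set
of script names per character are assumptions made for the example, and the Han
special cases and the decimal digit constraint are deliberately left out.
</P>

```python
def is_script_run(ext_lists):
    """Apply the basic script run rule to a string.

    ext_lists holds one set of script names per character, standing in for
    that character's Script Extension list (hypothetical input format).
    """
    # A string of fewer than two characters is always a script run.
    if len(ext_lists) < 2:
        return True

    common = None  # running intersection of the Script Extension lists
    for scripts in ext_lists:
        # Inherited and Common characters are accepted with any script.
        if scripts == {"Inherited"} or scripts == {"Common"}:
            continue
        # An Unknown character is only allowed in the short-string case.
        if "Unknown" in scripts:
            return False
        common = set(scripts) if common is None else common & scripts
        if not common:
            return False  # the intersection has become empty
    return True


LATIN, COMMON, CYRILLIC = {"Latin"}, {"Common"}, {"Cyrillic"}

# "google.com": six Latin letters, a Common dot, three Latin letters.
google = [LATIN] * 6 + [COMMON] + [LATIN] * 3
# The same string with each "o" replaced by the Cyrillic look-alike.
spoof = [LATIN, CYRILLIC, CYRILLIC, LATIN, LATIN, LATIN,
         COMMON, LATIN, CYRILLIC, LATIN]

print(is_script_run(google))  # True
print(is_script_run(spoof))   # False
```

<P>
Keeping a running intersection means the check can stop at the first character
that empties the set, which is all the rule requires.
</P>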
<P>
A simple example is an Internet name such as "google.com". The letters are all
in the Latin script, and the dot is Common, so this string is a script run.
However, the Cyrillic letter "o" looks exactly the same as the Latin "o"; a
string that looks the same, but with Cyrillic "o"s is not a script run.
</P>
<P>
More interesting examples involve characters with more than one script in their
Script Extension. Consider the following characters:
<pre>
U+060C Arabic comma
U+06D4 Arabic full stop
</pre>
The first has the Script Extension list Arabic, Hanifi Rohingya, Syriac, and
Thaana; the second has just Arabic and Hanifi Rohingya. Both of them could
appear in script runs of either Arabic or Hanifi Rohingya. The first could also
appear in Syriac or Thaana script runs, but the second could not.
</P>
<br><b>
The Chinese Han script
</b><br>
<P>
The Chinese Han script is commonly used in conjunction with other scripts for
writing certain languages. Japanese uses the Hiragana and Katakana scripts
together with Han; Korean uses Hangul and Han; Taiwanese Mandarin uses Bopomofo
and Han. These three combinations are treated as special cases when checking
script runs and are, in effect, "virtual scripts". Thus, a script run may
contain a mixture of Hiragana, Katakana, and Han, or a mixture of Hangul and
Han, or a mixture of Bopomofo and Han, but not, for example, a mixture of
Hangul and Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical
Standard 39 ("Unicode Security Mechanisms", http://unicode.org/reports/tr39/)
in allowing such mixtures.
</P>
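<P>
One way to picture the three "virtual scripts" is as sets of script names that
are allowed to appear together. The sketch below is a simplification that
assumes every character has a single-script Script Extension list; the names
HAN_COMBINATIONS and scripts_may_mix are invented for the example and do not
correspond to anything in the PCRE2 source.
</P>

```python
# The three combinations named in the text, modelled as "virtual scripts".
HAN_COMBINATIONS = (
    frozenset({"Han", "Hiragana", "Katakana"}),  # Japanese
    frozenset({"Han", "Hangul"}),                # Korean
    frozenset({"Han", "Bopomofo"}),              # Taiwanese Mandarin
)


def scripts_may_mix(scripts):
    """Return True if the given script names may share one script run.

    scripts is the set of scripts seen in a string, one per character
    (hypothetical input; multi-script extension lists are not modelled).
    """
    # Common and Inherited characters are compatible with any script.
    scripts = set(scripts) - {"Common", "Inherited"}
    if len(scripts) <= 1:
        return True
    # Assumption of this sketch: any subset of a combination is accepted,
    # for example Hiragana with Katakana but without Han.
    return any(scripts <= combo for combo in HAN_COMBINATIONS)


print(scripts_may_mix({"Hiragana", "Katakana", "Han"}))  # True
print(scripts_may_mix({"Hangul", "Bopomofo", "Han"}))    # False
```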
<br><b>
Decimal digits
</b><br>
<P>
Unicode contains many sets of 10 decimal digits in different scripts, and some
scripts (including the Common script) contain more than one set. Some of these
decimal digits are visually indistinguishable from the common ASCII
digits. In addition to the script checking described above, if a script run
contains any decimal digits, they must all come from the same set of 10
adjacent characters.
|
||||||
</P>
|
</P>
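<P>
The digit constraint can be illustrated with Python's standard unicodedata
module. The sketch relies on each set of decimal digits occupying 10 adjacent
code points in ascending order, so subtracting a digit's value from its code
point gives the code point of that set's zero. The function name
digits_from_one_set is an assumption of the example, and the result depends on
the Unicode version bundled with the Python interpreter.
</P>

```python
import unicodedata


def digits_from_one_set(text):
    """Return True if all decimal digits in text come from one set of 10."""
    zeros = set()  # code point of the "zero" of each digit set seen
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digits
        if value is not None:
            zeros.add(ord(ch) - value)
    return len(zeros) <= 1


print(digits_from_one_set("2018"))      # True: ASCII digits only
print(digits_from_one_set("12\u0663"))  # False: ASCII plus Arabic-Indic three
```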
|
||||||
<br><b>
|
<br><b>
|
||||||
VALIDITY OF UTF STRINGS
|
VALIDITY OF UTF STRINGS
|
||||||
|
@ -300,7 +410,7 @@ Cambridge, England.
|
||||||
REVISION
|
REVISION
|
||||||
</b><br>
|
</b><br>
|
||||||
<P>
|
<P>
|
||||||
Last updated: 02 September 2018
|
Last updated: 12 October 2018
|
||||||
<br>
|
<br>
|
||||||
Copyright © 1997-2018 University of Cambridge.
|
Copyright © 1997-2018 University of Cambridge.
|
||||||
<br>
|
<br>
|
||||||
|
|
2274
doc/pcre2.txt
File diff suppressed because it is too large
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2MATCHING 3 "29 September 2014" "PCRE2 10.00"
|
.TH PCRE2MATCHING 3 "10 October 2018" "PCRE2 10.33"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 MATCHING ALGORITHMS"
|
.SH "PCRE2 MATCHING ALGORITHMS"
|
||||||
|
@ -113,7 +113,8 @@ do want multiple matches in such cases, either use an ungreedy repeat
|
||||||
("a\ed+?") or set the PCRE2_NO_AUTO_POSSESS option when compiling.
|
("a\ed+?") or set the PCRE2_NO_AUTO_POSSESS option when compiling.
|
||||||
.P
|
.P
|
||||||
There are a number of features of PCRE2 regular expressions that are not
|
There are a number of features of PCRE2 regular expressions that are not
|
||||||
supported by the alternative matching algorithm. They are as follows:
|
supported or behave differently in the alternative matching function. Those
|
||||||
|
that are not supported cause an error if encountered.
|
||||||
.P
|
.P
|
||||||
1. Because the algorithm finds all possible matches, the greedy or ungreedy
|
1. Because the algorithm finds all possible matches, the greedy or ungreedy
|
||||||
nature of repetition quantifiers is not relevant (though it may affect
|
nature of repetition quantifiers is not relevant (though it may affect
|
||||||
|
@ -135,24 +136,26 @@ possibilities, and PCRE2's implementation of this algorithm does not attempt to
|
||||||
do this. This means that no captured substrings are available.
|
do this. This means that no captured substrings are available.
|
||||||
.P
|
.P
|
||||||
3. Because no substrings are captured, backreferences within the pattern are
|
3. Because no substrings are captured, backreferences within the pattern are
|
||||||
not supported, and cause errors if encountered.
|
not supported.
|
||||||
.P
|
.P
|
||||||
4. For the same reason, conditional expressions that use a backreference as the
|
4. For the same reason, conditional expressions that use a backreference as the
|
||||||
condition or test for a specific group recursion are not supported.
|
condition or test for a specific group recursion are not supported.
|
||||||
.P
|
.P
|
||||||
5. Because many paths through the tree may be active, the \eK escape sequence,
|
5. Again for the same reason, script runs are not supported.
|
||||||
which resets the start of the match when encountered (but may be on some paths
|
|
||||||
and not on others), is not supported. It causes an error if encountered.
|
|
||||||
.P
|
.P
|
||||||
6. Callouts are supported, but the value of the \fIcapture_top\fP field is
|
6. Because many paths through the tree may be active, the \eK escape sequence,
|
||||||
|
which resets the start of the match when encountered (but may be on some paths
|
||||||
|
and not on others), is not supported.
|
||||||
|
.P
|
||||||
|
7. Callouts are supported, but the value of the \fIcapture_top\fP field is
|
||||||
always 1, and the value of the \fIcapture_last\fP field is always 0.
|
always 1, and the value of the \fIcapture_last\fP field is always 0.
|
||||||
.P
|
.P
|
||||||
7. The \eC escape sequence, which (in the standard algorithm) always matches a
|
8. The \eC escape sequence, which (in the standard algorithm) always matches a
|
||||||
single code unit, even in a UTF mode, is not supported in these modes, because
|
single code unit, even in a UTF mode, is not supported in these modes, because
|
||||||
the alternative algorithm moves through the subject string one character (not
|
the alternative algorithm moves through the subject string one character (not
|
||||||
code unit) at a time, for all active paths through the tree.
|
code unit) at a time, for all active paths through the tree.
|
||||||
.P
|
.P
|
||||||
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
|
9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
|
||||||
supported. (*FAIL) is supported, and behaves like a failing negative assertion.
|
supported. (*FAIL) is supported, and behaves like a failing negative assertion.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
|
@ -188,7 +191,7 @@ The alternative algorithm suffers from a number of disadvantages:
|
||||||
because it has to search for all possible matches, but is also because it is
|
because it has to search for all possible matches, but is also because it is
|
||||||
less susceptible to optimization.
|
less susceptible to optimization.
|
||||||
.P
|
.P
|
||||||
2. Capturing parentheses and backreferences are not supported.
|
2. Capturing parentheses, backreferences, and script runs are not supported.
|
||||||
.P
|
.P
|
||||||
3. Although atomic groups are supported, their use does not provide the
|
3. Although atomic groups are supported, their use does not provide the
|
||||||
performance advantage that it does for the standard algorithm.
|
performance advantage that it does for the standard algorithm.
|
||||||
|
@ -208,6 +211,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 29 September 2014
|
Last updated: 10 October 2018
|
||||||
Copyright (c) 1997-2014 University of Cambridge.
|
Copyright (c) 1997-2018 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2PATTERN 3 "24 September 2018" "PCRE2 10.33"
|
.TH PCRE2PATTERN 3 "12 October 2018" "PCRE2 10.33"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
|
||||||
|
@ -755,7 +755,7 @@ sequences that match characters with specific properties are available. In
|
||||||
8-bit non-UTF-8 mode, these sequences are of course limited to testing
|
8-bit non-UTF-8 mode, these sequences are of course limited to testing
|
||||||
characters whose code points are less than 256, but they do work in this mode.
|
characters whose code points are less than 256, but they do work in this mode.
|
||||||
In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode limit)
|
In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode limit)
|
||||||
may be encountered. These are all treated as being in the Common script and
|
may be encountered. These are all treated as being in the Unknown script and
|
||||||
with an unassigned type. The extra escape sequences are:
|
with an unassigned type. The extra escape sequences are:
|
||||||
.sp
|
.sp
|
||||||
\ep{\fIxx\fP} a character with the \fIxx\fP property
|
\ep{\fIxx\fP} a character with the \fIxx\fP property
|
||||||
|
@ -781,8 +781,10 @@ example:
|
||||||
\ep{Greek}
|
\ep{Greek}
|
||||||
\eP{Han}
|
\eP{Han}
|
||||||
.sp
|
.sp
|
||||||
Those that are not part of an identified script are lumped together as
|
Unassigned characters (and in non-UTF 32-bit mode, characters with code points
|
||||||
"Common". The current list of scripts is:
|
greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
|
||||||
|
part of an identified script are lumped together as "Common". The current list
|
||||||
|
of scripts is:
|
||||||
.P
|
.P
|
||||||
Adlam,
|
Adlam,
|
||||||
Ahom,
|
Ahom,
|
||||||
|
@ -928,6 +930,7 @@ Tibetan,
|
||||||
Tifinagh,
|
Tifinagh,
|
||||||
Tirhuta,
|
Tirhuta,
|
||||||
Ugaritic,
|
Ugaritic,
|
||||||
|
Unknown,
|
||||||
Vai,
|
Vai,
|
||||||
Warang_Citi,
|
Warang_Citi,
|
||||||
Yi,
|
Yi,
|
||||||
|
@ -2603,6 +2606,73 @@ is another pattern that matches "foo" preceded by three digits and any three
|
||||||
characters that are not "999".
|
characters that are not "999".
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
|
.SH "SCRIPT RUNS"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
In concept, a script run is a sequence of characters that are all from the same
|
||||||
|
Unicode script such as Latin or Greek. However, because some scripts are
|
||||||
|
commonly used together, and because some diacritical and other marks are used
|
||||||
|
with multiple scripts, it is not that simple. There is a full description of
|
||||||
|
the rules that PCRE2 uses in the section entitled
|
||||||
|
.\" HTML <a href="pcre2unicode.html#scriptruns">
|
||||||
|
.\" </a>
|
||||||
|
"Script Runs"
|
||||||
|
.\"
|
||||||
|
in the
|
||||||
|
.\" HREF
|
||||||
|
\fBpcre2unicode\fP
|
||||||
|
.\"
|
||||||
|
documentation.
|
||||||
|
.P
|
||||||
|
If part of a pattern is enclosed between (*script_run: or (*sr: and a closing
|
||||||
|
parenthesis, it fails if the sequence of characters that it matches is not a
|
||||||
|
script run. After a failure, normal backtracking occurs. Script runs can be
|
||||||
|
used to detect spoofing attacks using characters that look the same, but are
|
||||||
|
from different scripts. The string "paypal.com" is an infamous example, where
|
||||||
|
the letters could be a mixture of Latin and Cyrillic. This pattern ensures that
|
||||||
|
the matched characters in a sequence of non-spaces that follow white space are
|
||||||
|
a script run:
|
||||||
|
.sp
|
||||||
|
\es+(*sr:\eS+)
|
||||||
|
.sp
|
||||||
|
To be sure that they are all from the Latin script (for example), a lookahead
|
||||||
|
can be used:
|
||||||
|
.sp
|
||||||
|
\es+(?=\ep{Latin})(*sr:\eS+)
|
||||||
|
.sp
|
||||||
|
This works as long as the first character is expected to be a character in that
|
||||||
|
script, and not (for example) punctuation, which is allowed with any script. If
|
||||||
|
this is not the case, a more creative lookahead is needed. For example, if
|
||||||
|
digits, underscore, and dots are permitted at the start:
|
||||||
|
.sp
|
||||||
|
\es+(?=[0-9_.]*\ep{Latin})(*sr:\eS+)
|
||||||
|
.sp
|
||||||
|
.P
|
||||||
|
In many cases, backtracking into a script run pattern fragment is not
|
||||||
|
desirable. The script run can employ an atomic group to prevent this. Because
|
||||||
|
this is a common requirement, a shorthand notation is provided by
|
||||||
|
(*atomic_script_run: or (*asr:
|
||||||
|
.sp
|
||||||
|
(*asr:...) is the same as (*sr:(?>...))
|
||||||
|
.sp
|
||||||
|
Note that the atomic group is inside the script run. Putting it outside would
|
||||||
|
not prevent backtracking into the script run pattern.
|
||||||
|
.P
|
||||||
|
Support for script runs is not available if PCRE2 is compiled without Unicode
|
||||||
|
support. A compile-time error is given if any of the above constructs is
|
||||||
|
encountered. Script runs are not supported by the alternate matching function,
|
||||||
|
\fBpcre2_dfa_match()\fP because they use the same mechanism as capturing
|
||||||
|
parentheses.
|
||||||
|
.P
|
||||||
|
\fBWarning:\fP The (*ACCEPT) control verb
|
||||||
|
.\" HTML <a href="#acceptverb">
|
||||||
|
.\" </a>
|
||||||
|
(see below)
|
||||||
|
.\"
|
||||||
|
should not be used within a script run subpattern, because it causes an
|
||||||
|
immediate exit from the subpattern, bypassing the script run checking.
|
||||||
|
.
|
||||||
|
.
|
||||||
.\" HTML <a name="conditions"></a>
|
.\" HTML <a name="conditions"></a>
|
||||||
.SH "CONDITIONAL SUBPATTERNS"
|
.SH "CONDITIONAL SUBPATTERNS"
|
||||||
.rs
|
.rs
|
||||||
|
@ -3267,6 +3337,7 @@ Experiments with Perl suggest that it too has similar optimizations, and like
|
||||||
PCRE2, turning them off can change the result of a match.
|
PCRE2, turning them off can change the result of a match.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
|
.\" HTML <a name="acceptverb"></a>
|
||||||
.SS "Verbs that act immediately"
|
.SS "Verbs that act immediately"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
|
@ -3287,6 +3358,10 @@ example:
|
||||||
.sp
|
.sp
|
||||||
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
|
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
|
||||||
the outer parentheses.
|
the outer parentheses.
|
||||||
|
.P
|
||||||
|
\fBWarning:\fP (*ACCEPT) should not be used within a script run subpattern,
|
||||||
|
because it causes an immediate exit from the subpattern, bypassing the script
|
||||||
|
run checking.
|
||||||
.sp
|
.sp
|
||||||
(*FAIL) or (*FAIL:NAME)
|
(*FAIL) or (*FAIL:NAME)
|
||||||
.sp
|
.sp
|
||||||
|
@ -3692,6 +3767,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 24 September 2018
|
Last updated: 12 October 2018
|
||||||
Copyright (c) 1997-2018 University of Cambridge.
|
Copyright (c) 1997-2018 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2SYNTAX 3 "24 September 2018" "PCRE2 10.33"
|
.TH PCRE2SYNTAX 3 "10 October 2018" "PCRE2 10.33"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE2 - Perl-compatible regular expressions (revised API)
|
PCRE2 - Perl-compatible regular expressions (revised API)
|
||||||
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
|
||||||
|
@ -511,6 +511,16 @@ setting with a similar syntax.
|
||||||
Each top-level branch of a lookbehind must be of a fixed length.
|
Each top-level branch of a lookbehind must be of a fixed length.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
|
.SH "SCRIPT RUNS"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
(*script_run:...) ) script run, can be backtracked into
|
||||||
|
(*sr:...) )
|
||||||
|
.sp
|
||||||
|
(*atomic_script_run:...) ) atomic script run
|
||||||
|
(*asr:...) )
|
||||||
|
.
|
||||||
|
.
|
||||||
.SH "BACKREFERENCES"
|
.SH "BACKREFERENCES"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
|
@ -633,6 +643,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 24 September 2018
|
Last updated: 10 October 2018
|
||||||
Copyright (c) 1997-2018 University of Cambridge.
|
Copyright (c) 1997-2018 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
.TH PCRE2UNICODE 3 "02 September 2018" "PCRE2 10.32"
|
.TH PCRE2UNICODE 3 "12 October 2018" "PCRE2 10.33"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
PCRE - Perl-compatible regular expressions (revised API)
|
PCRE - Perl-compatible regular expressions (revised API)
|
||||||
.SH "UNICODE AND UTF SUPPORT"
|
.SH "UNICODE AND UTF SUPPORT"
|
||||||
|
@ -118,6 +118,108 @@ few Unicode characters such as Greek sigma have more than two code points that
|
||||||
are case-equivalent, and these are treated as such.
|
are case-equivalent, and these are treated as such.
|
||||||
.
|
.
|
||||||
.
|
.
|
||||||
|
.\" HTML <a name="scriptruns"></a>
|
||||||
|
.SH "SCRIPT RUNS"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
The pattern constructs (*script_run:...) and (*atomic_script_run:...), with
|
||||||
|
synonyms (*sr:...) and (*asr:...), verify that the string matched within the
|
||||||
|
parentheses is a script run. In concept, a script run is a sequence of
|
||||||
|
characters that are all from the same Unicode script. However, because some
|
||||||
|
scripts are commonly used together, and because some diacritical and other
|
||||||
|
marks are used with multiple scripts, it is not that simple.
|
||||||
|
.P
|
||||||
|
Every Unicode character has a Script property, mostly with a value
|
||||||
|
corresponding to the name of a script, such as Latin, Greek, or Cyrillic. There
|
||||||
|
are also three special values:
|
||||||
|
.P
|
||||||
|
"Unknown" is used for code points that have not been assigned, and also for the
|
||||||
|
surrogate code points. In the PCRE2 32-bit library, characters whose code
|
||||||
|
points are greater than the Unicode maximum (U+10FFFF), which are accessible
|
||||||
|
only in non-UTF mode, are assigned the Unknown script.
|
||||||
|
.P
|
||||||
|
"Common" is used for characters that are used with many scripts. These include
|
||||||
|
punctuation, emoji, mathematical, musical, and currency symbols, and the ASCII
|
||||||
|
digits 0 to 9.
|
||||||
|
.P
|
||||||
|
"Inherited" is used for characters such as diacritical marks that modify a
|
||||||
|
previous character. These are considered to take on the script of the character
|
||||||
|
that they modify.
|
||||||
|
.P
|
||||||
|
Some Inherited characters are used with many scripts, but many of them are only
|
||||||
|
normally used with a small number of scripts. For example, U+102E0 (Coptic
|
||||||
|
Epact thousands mark) is used only with Arabic and Coptic. In order to make it
|
||||||
|
possible to check this, a Unicode property called Script Extension exists. Its
|
||||||
|
value is a list of scripts that apply to the character. For the majority of
|
||||||
|
characters, the list contains just one script, the same one as the Script
|
||||||
|
property. However, for characters such as U+102E0 more than one Script is
|
||||||
|
listed. There are also some Common characters that have a single, non-Common
|
||||||
|
script in their Script Extension list.
|
||||||
|
.P
|
||||||
|
The next section describes the basic rules for deciding whether a given string
|
||||||
|
of characters is a script run. Note, however, that there are some special cases
|
||||||
|
involving the Chinese Han script, and an additional constraint for decimal
|
||||||
|
digits. These are covered in subsequent sections.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SS "Basic script run rules"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
A string that is less than two characters long is a script run. This is the
|
||||||
|
only case in which an Unknown character can be part of a script run. Longer
|
||||||
|
strings are checked using only the Script Extensions property, not the basic
|
||||||
|
Script property.
|
||||||
|
.P
|
||||||
|
If a character's Script Extension property is the single value "Inherited", it
|
||||||
|
is always accepted as part of a script run. This is also true for the property
|
||||||
|
"Common", subject to the checking of decimal digits described below. All the
|
||||||
|
remaining characters in a script run must have at least one script in common in
|
||||||
|
their Script Extension lists. In set-theoretic terminology, the intersection of
|
||||||
|
all the sets of scripts must not be empty.
|
||||||
|
.P
|
||||||
|
A simple example is an Internet name such as "google.com". The letters are all
|
||||||
|
in the Latin script, and the dot is Common, so this string is a script run.
|
||||||
|
However, the Cyrillic letter "o" looks exactly the same as the Latin "o"; a
|
||||||
|
string that looks the same, but with Cyrillic "o"s is not a script run.
|
||||||
|
.P
|
||||||
|
More interesting examples involve characters with more than one script in their
|
||||||
|
Script Extension. Consider the following characters:
|
||||||
|
.sp
|
||||||
|
U+060C Arabic comma
|
||||||
|
U+06D4 Arabic full stop
|
||||||
|
.sp
|
||||||
|
The first has the Script Extension list Arabic, Hanifi Rohingya, Syriac, and
|
||||||
|
Thaana; the second has just Arabic and Hanifi Rohingya. Both of them could
|
||||||
|
appear in script runs of either Arabic or Hanifi Rohingya. The first could also
|
||||||
|
appear in Syriac or Thaana script runs, but the second could not.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SS "The Chinese Han script"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
The Chinese Han script is commonly used in conjunction with other scripts for
|
||||||
|
writing certain languages. Japanese uses the Hiragana and Katakana scripts
|
||||||
|
together with Han; Korean uses Hangul and Han; Taiwanese Mandarin uses Bopomofo
|
||||||
|
and Han. These three combinations are treated as special cases when checking
|
||||||
|
script runs and are, in effect, "virtual scripts". Thus, a script run may
|
||||||
|
contain a mixture of Hiragana, Katakana, and Han, or a mixture of Hangul and
|
||||||
|
Han, or a mixture of Bopomofo and Han, but not, for example, a mixture of
|
||||||
|
Hangul and Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical
|
||||||
|
Standard 39 ("Unicode Security Mechanisms", http://unicode.org/reports/tr39/)
|
||||||
|
in allowing such mixtures.
|
||||||
|
.
|
||||||
|
.
|
||||||
|
.SS "Decimal digits"
|
||||||
|
.rs
|
||||||
|
.sp
|
||||||
|
Unicode contains many sets of 10 decimal digits in different scripts, and some
|
||||||
|
scripts (including the Common script) contain more than one set. Some of these
|
||||||
|
decimal digits are visually indistinguishable from the common ASCII
|
||||||
|
digits. In addition to the script checking described above, if a script run
|
||||||
|
contains any decimal digits, they must all come from the same set of 10
|
||||||
|
adjacent characters.
|
||||||
|
.
|
||||||
|
.
|
||||||
.SH "VALIDITY OF UTF STRINGS"
|
.SH "VALIDITY OF UTF STRINGS"
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
|
@ -285,6 +387,6 @@ Cambridge, England.
|
||||||
.rs
|
.rs
|
||||||
.sp
|
.sp
|
||||||
.nf
|
.nf
|
||||||
Last updated: 02 September 2018
|
Last updated: 12 October 2018
|
||||||
Copyright (c) 1997-2018 University of Cambridge.
|
Copyright (c) 1997-2018 University of Cambridge.
|
||||||
.fi
|
.fi
|
||||||
|
|
|
@ -2410,6 +2410,7 @@
|
||||||
\x{3031}\x{3041}\x{30a1}\x{2e80} [Hira Kata] Hira Kata Han
|
\x{3031}\x{3041}\x{30a1}\x{2e80} [Hira Kata] Hira Kata Han
|
||||||
\x{060c}\x{06d4}\x{0600}\x{10d00}\x{0700} [Arab Rohg Syrc Thaa] [Arab Rohg] Arab Rohg Syrc
|
\x{060c}\x{06d4}\x{0600}\x{10d00}\x{0700} [Arab Rohg Syrc Thaa] [Arab Rohg] Arab Rohg Syrc
|
||||||
\x{060c}\x{06d4}\x{0700}\x{0600}\x{10d00} [Arab Rohg Syrc Thaa] [Arab Rohg] Syrc Arab Rohg
|
\x{060c}\x{06d4}\x{0700}\x{0600}\x{10d00} [Arab Rohg Syrc Thaa] [Arab Rohg] Syrc Arab Rohg
|
||||||
|
\x{2e80}\x{3041}\x{3001}\x{3031}\x{2e80} Han Hira [Bopo, Han, etc] [Hira Kata] Han
|
||||||
|
|
||||||
/(?<!)(*sr:)/
|
/(?<!)(*sr:)/
|
||||||
|
|
||||||
|
|
|
@ -2112,6 +2112,9 @@
|
||||||
/^(*sr:.*)/B,utf
|
/^(*sr:.*)/B,utf
|
||||||
paypаl.com A classic example of why script run checks are a good thing
|
paypаl.com A classic example of why script run checks are a good thing
|
||||||
|
|
||||||
|
/^(*sr:.*(*ACCEPT))/utf
|
||||||
|
paypаl.com But *ACCEPT breaks things
|
||||||
|
|
||||||
/^(*sr:\x{2e80}*)/B,utf
|
/^(*sr:\x{2e80}*)/B,utf
|
||||||
|
|
||||||
/^(*sr:\x{2e80}*)\x{2e80}/B,utf
|
/^(*sr:\x{2e80}*)\x{2e80}/B,utf
|
||||||
|
|
|
@ -3902,6 +3902,8 @@ No match
|
||||||
0: \x{60c}\x{6d4}\x{600}
|
0: \x{60c}\x{6d4}\x{600}
|
||||||
\x{060c}\x{06d4}\x{0700}\x{0600}\x{10d00} [Arab Rohg Syrc Thaa] [Arab Rohg] Syrc Arab Rohg
|
\x{060c}\x{06d4}\x{0700}\x{0600}\x{10d00} [Arab Rohg Syrc Thaa] [Arab Rohg] Syrc Arab Rohg
|
||||||
0: \x{60c}\x{6d4}
|
0: \x{60c}\x{6d4}
|
||||||
|
\x{2e80}\x{3041}\x{3001}\x{3031}\x{2e80} Han Hira [Bopo, Han, etc] [Hira Kata] Han
|
||||||
|
0: \x{2e80}\x{3041}\x{3001}\x{3031}\x{2e80}
|
||||||
|
|
||||||
/(?<!)(*sr:)/
|
/(?<!)(*sr:)/
|
||||||
|
|
||||||
|
|
|
@ -4791,6 +4791,10 @@ MK: ABC
|
||||||
paypаl.com A classic example of why script run checks are a good thing
|
paypаl.com A classic example of why script run checks are a good thing
|
||||||
0: payp
|
0: payp
|
||||||
|
|
||||||
|
/^(*sr:.*(*ACCEPT))/utf
|
||||||
|
paypаl.com But *ACCEPT breaks things
|
||||||
|
0: payp\x{430}l.com But *ACCEPT breaks things
|
||||||
|
|
||||||
/^(*sr:\x{2e80}*)/B,utf
|
/^(*sr:\x{2e80}*)/B,utf
|
||||||
------------------------------------------------------------------
|
------------------------------------------------------------------
|
||||||
Bra
|
Bra
|
||||||
|
|