Documentation and tests update for script runs.

Philip.Hazel 2018-10-12 17:02:34 +00:00
parent 4e7a204d18
commit 0fc5cda13b
14 changed files with 1697 additions and 1126 deletions

View File

@ -31,8 +31,7 @@ hexadecimal digit" bit was removed. The default tables in
src/pcre2_chartables.c.dist are updated.
8. Implement the new Perl "script run" features (*script_run:...) and
(*atomic_script_run:...) aka (*sr:...) and (*asr:...). At present, this is
not yet documented.
(*atomic_script_run:...) aka (*sr:...) and (*asr:...).
Version 10.32 10-September-2018

View File

@ -134,7 +134,8 @@ do want multiple matches in such cases, either use an ungreedy repeat
</P>
<P>
There are a number of features of PCRE2 regular expressions that are not
supported by the alternative matching algorithm. They are as follows:
supported or behave differently in the alternative matching function. Those
that are not supported cause an error if encountered.
</P>
<P>
1. Because the algorithm finds all possible matches, the greedy or ungreedy
@ -159,29 +160,32 @@ do this. This means that no captured substrings are available.
</P>
<P>
3. Because no substrings are captured, backreferences within the pattern are
not supported, and cause errors if encountered.
not supported.
</P>
<P>
4. For the same reason, conditional expressions that use a backreference as the
condition or test for a specific group recursion are not supported.
</P>
<P>
5. Because many paths through the tree may be active, the \K escape sequence,
which resets the start of the match when encountered (but may be on some paths
and not on others), is not supported. It causes an error if encountered.
5. Again for the same reason, script runs are not supported.
</P>
<P>
6. Callouts are supported, but the value of the <i>capture_top</i> field is
6. Because many paths through the tree may be active, the \K escape sequence,
which resets the start of the match when encountered (but may be on some paths
and not on others), is not supported.
</P>
<P>
7. Callouts are supported, but the value of the <i>capture_top</i> field is
always 1, and the value of the <i>capture_last</i> field is always 0.
</P>
<P>
7. The \C escape sequence, which (in the standard algorithm) always matches a
8. The \C escape sequence, which (in the standard algorithm) always matches a
single code unit, even in a UTF mode, is not supported in these modes, because
the alternative algorithm moves through the subject string one character (not
code unit) at a time, for all active paths through the tree.
</P>
<P>
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
supported. (*FAIL) is supported, and behaves like a failing negative assertion.
</P>
<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
@ -215,7 +219,7 @@ because it has to search for all possible matches, but is also because it is
less susceptible to optimization.
</P>
<P>
2. Capturing parentheses and backreferences are not supported.
2. Capturing parentheses, backreferences, and script runs are not supported.
</P>
<P>
3. Although atomic groups are supported, their use does not provide the
@ -232,9 +236,9 @@ Cambridge, England.
</P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P>
Last updated: 29 September 2014
Last updated: 10 October 2018
<br>
Copyright &copy; 1997-2014 University of Cambridge.
Copyright &copy; 1997-2018 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.

View File

@ -33,16 +33,17 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC18" href="#SEC18">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
<li><a name="TOC19" href="#SEC19">BACKREFERENCES</a>
<li><a name="TOC20" href="#SEC20">ASSERTIONS</a>
<li><a name="TOC21" href="#SEC21">CONDITIONAL SUBPATTERNS</a>
<li><a name="TOC22" href="#SEC22">COMMENTS</a>
<li><a name="TOC23" href="#SEC23">RECURSIVE PATTERNS</a>
<li><a name="TOC24" href="#SEC24">SUBPATTERNS AS SUBROUTINES</a>
<li><a name="TOC25" href="#SEC25">ONIGURUMA SUBROUTINE SYNTAX</a>
<li><a name="TOC26" href="#SEC26">CALLOUTS</a>
<li><a name="TOC27" href="#SEC27">BACKTRACKING CONTROL</a>
<li><a name="TOC28" href="#SEC28">SEE ALSO</a>
<li><a name="TOC29" href="#SEC29">AUTHOR</a>
<li><a name="TOC30" href="#SEC30">REVISION</a>
<li><a name="TOC21" href="#SEC21">SCRIPT RUNS</a>
<li><a name="TOC22" href="#SEC22">CONDITIONAL SUBPATTERNS</a>
<li><a name="TOC23" href="#SEC23">COMMENTS</a>
<li><a name="TOC24" href="#SEC24">RECURSIVE PATTERNS</a>
<li><a name="TOC25" href="#SEC25">SUBPATTERNS AS SUBROUTINES</a>
<li><a name="TOC26" href="#SEC26">ONIGURUMA SUBROUTINE SYNTAX</a>
<li><a name="TOC27" href="#SEC27">CALLOUTS</a>
<li><a name="TOC28" href="#SEC28">BACKTRACKING CONTROL</a>
<li><a name="TOC29" href="#SEC29">SEE ALSO</a>
<li><a name="TOC30" href="#SEC30">AUTHOR</a>
<li><a name="TOC31" href="#SEC31">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION DETAILS</a><br>
<P>
@ -756,7 +757,7 @@ sequences that match characters with specific properties are available. In
8-bit non-UTF-8 mode, these sequences are of course limited to testing
characters whose code points are less than 256, but they do work in this mode.
In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode limit)
may be encountered. These are all treated as being in the Common script and
may be encountered. These are all treated as being in the Unknown script and
with an unassigned type. The extra escape sequences are:
<pre>
\p{<i>xx</i>} a character with the <i>xx</i> property
@ -780,8 +781,10 @@ example:
\p{Greek}
\P{Han}
</pre>
Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is:
Unassigned characters (and in non-UTF 32-bit mode, characters with code points
greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
part of an identified script are lumped together as "Common". The current list
of scripts is:
</P>
<P>
Adlam,
@ -928,6 +931,7 @@ Tibetan,
Tifinagh,
Tirhuta,
Ugaritic,
Unknown,
Vai,
Warang_Citi,
Yi,
@ -2589,8 +2593,70 @@ preceded by "foo", while
</pre>
is another pattern that matches "foo" preceded by three digits and any three
characters that are not "999".
</P>
<br><a name="SEC21" href="#TOC1">SCRIPT RUNS</a><br>
<P>
In concept, a script run is a sequence of characters that are all from the same
Unicode script such as Latin or Greek. However, because some scripts are
commonly used together, and because some diacritical and other marks are used
with multiple scripts, it is not that simple. There is a full description of
the rules that PCRE2 uses in the section entitled
<a href="pcre2unicode.html#scriptruns">"Script Runs"</a>
in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
documentation.
</P>
<P>
If part of a pattern is enclosed between (*script_run: or (*sr: and a closing
parenthesis, it fails if the sequence of characters that it matches is not a
script run. After a failure, normal backtracking occurs. Script runs can be
used to detect spoofing attacks using characters that look the same, but are
from different scripts. The string "paypal.com" is an infamous example, where
the letters could be a mixture of Latin and Cyrillic. This pattern ensures that
the matched characters in a sequence of non-spaces that follow white space are
a script run:
<pre>
\s+(*sr:\S+)
</pre>
To be sure that they are all from the Latin script (for example), a lookahead
can be used:
<pre>
\s+(?=\p{Latin})(*sr:\S+)
</pre>
This works as long as the first character is expected to be a character in that
script, and not (for example) punctuation, which is allowed with any script. If
this is not the case, a more creative lookahead is needed. For example, if
digits, underscore, and dots are permitted at the start:
<pre>
\s+(?=[0-9_.]*\p{Latin})(*sr:\S+)
</PRE>
</P>
<P>
In many cases, backtracking into a script run pattern fragment is not
desirable. The script run can employ an atomic group to prevent this. Because
this is a common requirement, a shorthand notation is provided by
(*atomic_script_run: or (*asr:
<pre>
(*asr:...) is the same as (*sr:(?&#62;...))
</pre>
Note that the atomic group is inside the script run. Putting it outside would
not prevent backtracking into the script run pattern.
</P>
<P>
Support for script runs is not available if PCRE2 is compiled without Unicode
support. A compile-time error is given if any of the above constructs is
encountered. Script runs are not supported by the alternate matching function,
<b>pcre2_dfa_match()</b>, because they use the same mechanism as capturing
parentheses.
</P>
<P>
<b>Warning:</b> The (*ACCEPT) control verb
<a href="#acceptverb">(see below)</a>
should not be used within a script run subpattern, because it causes an
immediate exit from the subpattern, bypassing the script run checking.
<a name="conditions"></a></P>
<br><a name="SEC21" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br>
<br><a name="SEC22" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br>
<P>
It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative subpatterns, depending on
@ -2790,7 +2856,7 @@ positive and negative assertions, because matching always continues after the
assertion, whether it succeeds or fails. (Compare non-conditional assertions,
when captures are retained only for positive assertions that succeed.)
<a name="comments"></a></P>
<br><a name="SEC22" href="#TOC1">COMMENTS</a><br>
<br><a name="SEC23" href="#TOC1">COMMENTS</a><br>
<P>
There are two ways of including comments in patterns that are processed by
PCRE2. In both cases, the start of the comment must not be in a character
@ -2820,7 +2886,7 @@ a newline in the pattern. The sequence \n is still literal at this stage, so
it does not terminate the comment. Only an actual character with the code value
0x0a (the default newline) does so.
<a name="recursion"></a></P>
<br><a name="SEC23" href="#TOC1">RECURSIVE PATTERNS</a><br>
<br><a name="SEC24" href="#TOC1">RECURSIVE PATTERNS</a><br>
<P>
Consider the problem of matching a string in parentheses, allowing for
unlimited nested parentheses. Without the use of recursion, the best that can
@ -3008,7 +3074,7 @@ alternative matches "a" and then recurses. In the recursion, \1 does now match
"b" and so the whole match succeeds. This match used to fail in Perl, but in
later versions (I tried 5.024) it now works.
<a name="subpatternsassubroutines"></a></P>
<br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
<br><a name="SEC25" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
<P>
If the syntax for a recursive subpattern call (either by number or by
name) is used outside the parentheses to which it refers, it operates a bit
@ -3057,7 +3123,7 @@ in subpatterns when called as subroutines is described in the section entitled
<a href="#btsub">"Backtracking verbs in subroutines"</a>
below.
<a name="onigurumasubroutines"></a></P>
<br><a name="SEC25" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
<br><a name="SEC26" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
<P>
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
a number enclosed either in angle brackets or single quotes, is an alternative
@ -3075,7 +3141,7 @@ plus or a minus sign it is taken as a relative reference. For example:
Note that \g{...} (Perl syntax) and \g&#60;...&#62; (Oniguruma syntax) are <i>not</i>
synonymous. The former is a backreference; the latter is a subroutine call.
</P>
<br><a name="SEC26" href="#TOC1">CALLOUTS</a><br>
<br><a name="SEC27" href="#TOC1">CALLOUTS</a><br>
<P>
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
code to be obeyed in the middle of matching a regular expression. This makes it
@ -3151,7 +3217,7 @@ example:
</pre>
The doubling is removed before the string is passed to the callout function.
<a name="backtrackcontrol"></a></P>
<br><a name="SEC27" href="#TOC1">BACKTRACKING CONTROL</a><br>
<br><a name="SEC28" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
There are a number of special "Backtracking Control Verbs" (to use Perl's
terminology) that modify the behaviour of backtracking during matching. They
@ -3222,7 +3288,7 @@ documentation.
<P>
Experiments with Perl suggest that it too has similar optimizations, and like
PCRE2, turning them off can change the result of a match.
</P>
<a name="acceptverb"></a></P>
<br><b>
Verbs that act immediately
</b><br>
@ -3245,6 +3311,11 @@ example:
</pre>
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
the outer parentheses.
</P>
<P>
<b>Warning:</b> (*ACCEPT) should not be used within a script run subpattern,
because it causes an immediate exit from the subpattern, bypassing the script
run checking.
<pre>
(*FAIL) or (*FAIL:NAME)
</pre>
@ -3644,12 +3715,12 @@ behaviour). However, if there is no such group within the subroutine
subpattern, the subroutine match fails and there is a backtrack at the outer
level.
</P>
<br><a name="SEC28" href="#TOC1">SEE ALSO</a><br>
<br><a name="SEC29" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2matching</b>(3),
<b>pcre2syntax</b>(3), <b>pcre2</b>(3).
</P>
<br><a name="SEC29" href="#TOC1">AUTHOR</a><br>
<br><a name="SEC30" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
@ -3658,9 +3729,9 @@ University Computing Service
Cambridge, England.
<br>
</P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
<br><a name="SEC31" href="#TOC1">REVISION</a><br>
<P>
Last updated: 24 September 2018
Last updated: 12 October 2018
<br>
Copyright &copy; 1997-2018 University of Cambridge.
<br>

View File

@ -32,14 +32,15 @@ please consult the man page, in case the conversion went wrong.
<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC20" href="#SEC20">BACKREFERENCES</a>
<li><a name="TOC21" href="#SEC21">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC22" href="#SEC22">CONDITIONAL PATTERNS</a>
<li><a name="TOC23" href="#SEC23">BACKTRACKING CONTROL</a>
<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
<li><a name="TOC27" href="#SEC27">REVISION</a>
<li><a name="TOC20" href="#SEC20">SCRIPT RUNS</a>
<li><a name="TOC21" href="#SEC21">BACKREFERENCES</a>
<li><a name="TOC22" href="#SEC22">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC23" href="#SEC23">CONDITIONAL PATTERNS</a>
<li><a name="TOC24" href="#SEC24">BACKTRACKING CONTROL</a>
<li><a name="TOC25" href="#SEC25">CALLOUTS</a>
<li><a name="TOC26" href="#SEC26">SEE ALSO</a>
<li><a name="TOC27" href="#SEC27">AUTHOR</a>
<li><a name="TOC28" href="#SEC28">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P>
@ -533,7 +534,17 @@ setting with a similar syntax.
</pre>
Each top-level branch of a lookbehind must be of a fixed length.
</P>
<br><a name="SEC20" href="#TOC1">BACKREFERENCES</a><br>
<br><a name="SEC20" href="#TOC1">SCRIPT RUNS</a><br>
<P>
<pre>
(*script_run:...) ) script run, can be backtracked into
(*sr:...) )
(*atomic_script_run:...) ) atomic script run
(*asr:...) )
</PRE>
</P>
<br><a name="SEC21" href="#TOC1">BACKREFERENCES</a><br>
<P>
<pre>
\n reference by number (can be ambiguous)
@ -550,7 +561,7 @@ Each top-level branch of a lookbehind must be of a fixed length.
(?P=name) reference by name (Python)
</PRE>
</P>
<br><a name="SEC21" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<br><a name="SEC22" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P>
<pre>
(?R) recurse whole pattern
@ -569,7 +580,7 @@ Each top-level branch of a lookbehind must be of a fixed length.
\g'-n' call subpattern by relative number (PCRE2 extension)
</PRE>
</P>
<br><a name="SEC22" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<br><a name="SEC23" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P>
<pre>
(?(condition)yes-pattern)
@ -592,7 +603,7 @@ Note the ambiguity of (?(R) and (?(Rn) which might be named reference
conditions or recursion tests. Such a condition is interpreted as a reference
condition if the relevant named group exists.
</P>
<br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
<br><a name="SEC24" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
@ -619,7 +630,7 @@ pattern is not anchored.
The effect of one of these verbs in a group called as a subroutine is confined
to the subroutine call.
</P>
<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
<br><a name="SEC25" href="#TOC1">CALLOUTS</a><br>
<P>
<pre>
(?C) callout (assumed number 0)
@ -630,12 +641,12 @@ The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
start and the end), and the starting delimiter { matched with the ending
delimiter }. To encode the ending delimiter within the string, double it.
</P>
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
<br><a name="SEC26" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
<b>pcre2matching</b>(3), <b>pcre2</b>(3).
</P>
<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
<br><a name="SEC27" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
@ -644,9 +655,9 @@ University Computing Service
Cambridge, England.
<br>
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<br><a name="SEC28" href="#TOC1">REVISION</a><br>
<P>
Last updated: 24 September 2018
Last updated: 10 October 2018
<br>
Copyright &copy; 1997-2018 University of Cambridge.
<br>

View File

@ -124,6 +124,116 @@ for characters whose code points are less than 128 and that have at most two
case-equivalent values. For these, a direct table lookup is used for speed. A
few Unicode characters such as Greek sigma have more than two code points that
are case-equivalent, and these are treated as such.
<a name="scriptruns"></a></P>
<br><b>
SCRIPT RUNS
</b><br>
<P>
The pattern constructs (*script_run:...) and (*atomic_script_run:...), with
synonyms (*sr:...) and (*asr:...), verify that the string matched within the
parentheses is a script run. In concept, a script run is a sequence of
characters that are all from the same Unicode script. However, because some
scripts are commonly used together, and because some diacritical and other
marks are used with multiple scripts, it is not that simple.
</P>
<P>
Every Unicode character has a Script property, mostly with a value
corresponding to the name of a script, such as Latin, Greek, or Cyrillic. There
are also three special values:
</P>
<P>
"Unknown" is used for code points that have not been assigned, and also for the
surrogate code points. In the PCRE2 32-bit library, characters whose code
points are greater than the Unicode maximum (U+10FFFF), which are accessible
only in non-UTF mode, are assigned the Unknown script.
</P>
<P>
"Common" is used for characters that are used with many scripts. These include
punctuation, emoji, mathematical, musical, and currency symbols, and the ASCII
digits 0 to 9.
</P>
<P>
"Inherited" is used for characters such as diacritical marks that modify a
previous character. These are considered to take on the script of the character
that they modify.
</P>
<P>
Some Inherited characters are used with many scripts, but many of them are only
normally used with a small number of scripts. For example, U+102E0 (Coptic
Epact thousands mark) is used only with Arabic and Coptic. In order to make it
possible to check this, a Unicode property called Script Extension exists. Its
value is a list of scripts that apply to the character. For the majority of
characters, the list contains just one script, the same one as the Script
property. However, for characters such as U+102E0 more than one Script is
listed. There are also some Common characters that have a single, non-Common
script in their Script Extension list.
</P>
<P>
The next section describes the basic rules for deciding whether a given string
of characters is a script run. Note, however, that there are some special cases
involving the Chinese Han script, and an additional constraint for decimal
digits. These are covered in subsequent sections.
</P>
<br><b>
Basic script run rules
</b><br>
<P>
A string that is less than two characters long is a script run. This is the
only case in which an Unknown character can be part of a script run. Longer
strings are checked using only the Script Extensions property, not the basic
Script property.
</P>
<P>
If a character's Script Extension property is the single value "Inherited", it
is always accepted as part of a script run. This is also true for the property
"Common", subject to the checking of decimal digits described below. All the
remaining characters in a script run must have at least one script in common in
their Script Extension lists. In set-theoretic terminology, the intersection of
all the sets of scripts must not be empty.
</P>
<P>
A simple example is an Internet name such as "google.com". The letters are all
in the Latin script, and the dot is Common, so this string is a script run.
However, the Cyrillic letter "o" looks exactly the same as the Latin "o"; a
string that looks the same, but with Cyrillic "o"s is not a script run.
</P>
<P>
More interesting examples involve characters with more than one script in their
Script Extension. Consider the following characters:
<pre>
U+060C Arabic comma
U+06D4 Arabic full stop
</pre>
The first has the Script Extension list Arabic, Hanifi Rohingya, Syriac, and
Thaana; the second has just Arabic and Hanifi Rohingya. Both of them could
appear in script runs of either Arabic or Hanifi Rohingya. The first could also
appear in Syriac or Thaana script runs, but the second could not.
</P>
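<P>
The basic rule can be sketched in a few lines of Python. This is an
illustration only, not the code that PCRE2 uses. The table is a tiny
hand-written stand-in for the real Script Extensions data, built from the
examples above, and the names SCRIPT_EXTENSIONS and is_script_run are
hypothetical. Any character missing from the table is treated as Unknown. The
Han special cases and the decimal digit check are left out.
</P>

```python
# Illustrative sketch only: a hand-written subset of Script Extensions data.
# The table contents and all names here are assumptions, not PCRE2 internals.
LATIN = frozenset({"Latin"})
CYRILLIC = frozenset({"Cyrillic"})
COMMON = frozenset({"Common"})
INHERITED = frozenset({"Inherited"})

SCRIPT_EXTENSIONS = {
    **{c: LATIN for c in "abcdefghijklmnopqrstuvwxyz"},
    "\u043e": CYRILLIC,               # Cyrillic small letter o
    ".": COMMON,                      # full stop
    "\u0301": INHERITED,              # combining acute, treated as Inherited here
    "\u0627": frozenset({"Arabic"}),  # Arabic letter alef
    "\u0710": frozenset({"Syriac"}),  # Syriac letter alaph
    # The two punctuation characters discussed in the text above:
    "\u060c": frozenset({"Arabic", "Hanifi_Rohingya", "Syriac", "Thaana"}),
    "\u06d4": frozenset({"Arabic", "Hanifi_Rohingya"}),
}


def is_script_run(text):
    """Apply the basic rule: the script lists must have a common member."""
    if len(text) < 2:
        return True  # short strings always pass, even if Unknown
    candidates = None  # None means no constraining character seen yet
    for ch in text:
        exts = SCRIPT_EXTENSIONS.get(ch)
        if exts is None:
            return False  # not in the table: treated as Unknown
        if exts == COMMON or exts == INHERITED:
            continue  # accepted with any script
        candidates = exts if candidates is None else candidates & exts
        if not candidates:
            return False  # intersection became empty
    return True


print(is_script_run("google.com"))        # all Latin plus a Common dot
print(is_script_run("g\u043eogle.com"))   # first "o" is Cyrillic
print(is_script_run("\u0710\u060c"))      # Syriac letter, Arabic comma
print(is_script_run("\u0710\u06d4"))      # Syriac letter, Arabic full stop
```

<P>
Keeping a running intersection means the check can stop at the first character
that empties the set, without looking at the rest of the string.
</P>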
<br><b>
The Chinese Han script
</b><br>
<P>
The Chinese Han script is commonly used in conjunction with other scripts for
writing certain languages. Japanese uses the Hiragana and Katakana scripts
together with Han; Korean uses Hangul and Han; Taiwanese Mandarin uses Bopomofo
and Han. These three combinations are treated as special cases when checking
script runs and are, in effect, "virtual scripts". Thus, a script run may
contain a mixture of Hiragana, Katakana, and Han, or a mixture of Hangul and
Han, or a mixture of Bopomofo and Han, but not, for example, a mixture of
Hangul and Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical
Standard 39 ("Unicode Security Mechanisms", http://unicode.org/reports/tr39/)
in allowing such mixtures.
</P>
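<P>
One way to picture the three special cases is as sets of script names, as in
the Python sketch below. It is an illustration only, with hypothetical names
(VIRTUAL_SCRIPTS, han_mixture_allowed), and it assumes that Common and
Inherited characters have already been set aside so that only the remaining
script names are passed in.
</P>

```python
# Illustrative sketch: the three Han combinations described above, modelled
# as "virtual scripts". All names are hypothetical, not PCRE2 identifiers.
VIRTUAL_SCRIPTS = (
    frozenset({"Han", "Hiragana", "Katakana"}),  # Japanese
    frozenset({"Han", "Hangul"}),                # Korean
    frozenset({"Han", "Bopomofo"}),              # Taiwanese Mandarin
)


def han_mixture_allowed(scripts):
    """Return True if the script names form one script or one virtual script."""
    scripts = frozenset(scripts)
    if len(scripts) <= 1:
        return True  # a single script needs no special case
    return any(scripts <= allowed for allowed in VIRTUAL_SCRIPTS)


print(han_mixture_allowed({"Hiragana", "Katakana", "Han"}))
print(han_mixture_allowed({"Hangul", "Han"}))
print(han_mixture_allowed({"Hangul", "Bopomofo", "Han"}))
```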
<br><b>
Decimal digits
</b><br>
<P>
Unicode contains many sets of 10 decimal digits in different scripts, and some
scripts (including the Common script) contain more than one set. Some of these
decimal digits are visually indistinguishable from the common ASCII
digits. In addition to the script checking described above, if a script run
contains any decimal digits, they must all come from the same set of 10
adjacent characters.
</P>
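<P>
The digit constraint can be tried out with Python's standard
<b>unicodedata</b> module, as sketched below. This is not how PCRE2 implements
the check. It assumes that every set of decimal digits (general category Nd)
occupies ten consecutive code points in ascending order, so that subtracting a
digit's value from its code point gives the zero of its set. The function name
digits_from_one_set is hypothetical.
</P>

```python
import unicodedata


def digits_from_one_set(text):
    """Return True if all decimal digits in text share one set of ten."""
    zero = None  # code point of the zero digit of the first set seen
    for ch in text:
        if unicodedata.category(ch) != "Nd":
            continue  # not a decimal digit, so not constrained here
        # Assumes each digit set is ten adjacent code points, 0 to 9.
        base = ord(ch) - unicodedata.digit(ch)
        if zero is None:
            zero = base
        elif base != zero:
            return False  # digit from a different set
    return True


print(digits_from_one_set("abc 123"))        # ASCII digits only
print(digits_from_one_set("1\u0662"))        # ASCII 1, Arabic-Indic 2
print(digits_from_one_set("\uff11" + "2"))   # fullwidth 1, ASCII 2
```

<P>
The last call mixes two sets that both belong to the Common script, which is
why the digit check is needed in addition to the script checking.
</P>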
<br><b>
VALIDITY OF UTF STRINGS
@ -300,7 +410,7 @@ Cambridge, England.
REVISION
</b><br>
<P>
Last updated: 02 September 2018
Last updated: 12 October 2018
<br>
Copyright &copy; 1997-2018 University of Cambridge.
<br>

File diff suppressed because it is too large.

View File

@ -1,4 +1,4 @@
.TH PCRE2MATCHING 3 "29 September 2014" "PCRE2 10.00"
.TH PCRE2MATCHING 3 "10 October 2018" "PCRE2 10.33"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 MATCHING ALGORITHMS"
@ -113,7 +113,8 @@ do want multiple matches in such cases, either use an ungreedy repeat
("a\ed+?") or set the PCRE2_NO_AUTO_POSSESS option when compiling.
.P
There are a number of features of PCRE2 regular expressions that are not
supported by the alternative matching algorithm. They are as follows:
supported or behave differently in the alternative matching function. Those
that are not supported cause an error if encountered.
.P
1. Because the algorithm finds all possible matches, the greedy or ungreedy
nature of repetition quantifiers is not relevant (though it may affect
@ -135,24 +136,26 @@ possibilities, and PCRE2's implementation of this algorithm does not attempt to
do this. This means that no captured substrings are available.
.P
3. Because no substrings are captured, backreferences within the pattern are
not supported, and cause errors if encountered.
not supported.
.P
4. For the same reason, conditional expressions that use a backreference as the
condition or test for a specific group recursion are not supported.
.P
5. Because many paths through the tree may be active, the \eK escape sequence,
which resets the start of the match when encountered (but may be on some paths
and not on others), is not supported. It causes an error if encountered.
5. Again for the same reason, script runs are not supported.
.P
6. Callouts are supported, but the value of the \fIcapture_top\fP field is
6. Because many paths through the tree may be active, the \eK escape sequence,
which resets the start of the match when encountered (but may be on some paths
and not on others), is not supported.
.P
7. Callouts are supported, but the value of the \fIcapture_top\fP field is
always 1, and the value of the \fIcapture_last\fP field is always 0.
.P
7. The \eC escape sequence, which (in the standard algorithm) always matches a
8. The \eC escape sequence, which (in the standard algorithm) always matches a
single code unit, even in a UTF mode, is not supported in these modes, because
the alternative algorithm moves through the subject string one character (not
code unit) at a time, for all active paths through the tree.
.P
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
supported. (*FAIL) is supported, and behaves like a failing negative assertion.
.
.
@ -188,7 +191,7 @@ The alternative algorithm suffers from a number of disadvantages:
because it has to search for all possible matches, but is also because it is
less susceptible to optimization.
.P
2. Capturing parentheses and backreferences are not supported.
2. Capturing parentheses, backreferences, and script runs are not supported.
.P
3. Although atomic groups are supported, their use does not provide the
performance advantage that it does for the standard algorithm.
@ -208,6 +211,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 29 September 2014
Copyright (c) 1997-2014 University of Cambridge.
Last updated: 10 October 2018
Copyright (c) 1997-2018 University of Cambridge.
.fi

View File

@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "24 September 2018" "PCRE2 10.33"
.TH PCRE2PATTERN 3 "12 October 2018" "PCRE2 10.33"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -755,7 +755,7 @@ sequences that match characters with specific properties are available. In
8-bit non-UTF-8 mode, these sequences are of course limited to testing
characters whose code points are less than 256, but they do work in this mode.
In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode limit)
may be encountered. These are all treated as being in the Common script and
may be encountered. These are all treated as being in the Unknown script and
with an unassigned type. The extra escape sequences are:
.sp
\ep{\fIxx\fP} a character with the \fIxx\fP property
@ -781,8 +781,10 @@ example:
\ep{Greek}
\eP{Han}
.sp
Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is:
Unassigned characters (and in non-UTF 32-bit mode, characters with code points
greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
part of an identified script are lumped together as "Common". The current list
of scripts is:
.P
Adlam,
Ahom,
@ -928,6 +930,7 @@ Tibetan,
Tifinagh,
Tirhuta,
Ugaritic,
Unknown,
Vai,
Warang_Citi,
Yi,
@ -2603,6 +2606,73 @@ is another pattern that matches "foo" preceded by three digits and any three
characters that are not "999".
.
.
.SH "SCRIPT RUNS"
.rs
.sp
In concept, a script run is a sequence of characters that are all from the same
Unicode script such as Latin or Greek. However, because some scripts are
commonly used together, and because some diacritical and other marks are used
with multiple scripts, it is not that simple. There is a full description of
the rules that PCRE2 uses in the section entitled
.\" HTML <a href="pcre2unicode.html#scriptruns">
.\" </a>
"Script Runs"
.\"
in the
.\" HREF
\fBpcre2unicode\fP
.\"
documentation.
.P
If part of a pattern is enclosed between (*script_run: or (*sr: and a closing
parenthesis, it fails if the sequence of characters that it matches is not a
script run. After a failure, normal backtracking occurs. Script runs can be
used to detect spoofing attacks using characters that look the same, but are
from different scripts. The string "paypal.com" is an infamous example, where
the letters could be a mixture of Latin and Cyrillic. This pattern ensures that
the matched characters in a sequence of non-spaces that follow white space are
a script run:
.sp
\es+(*sr:\eS+)
.sp
To be sure that they are all from the Latin script (for example), a lookahead
can be used:
.sp
\es+(?=\ep{Latin})(*sr:\eS+)
.sp
This works as long as the first character is expected to be a character in that
script, and not (for example) punctuation, which is allowed with any script. If
this is not the case, a more creative lookahead is needed. For example, if
digits, underscore, and dots are permitted at the start:
.sp
\es+(?=[0-9_.]*\ep{Latin})(*sr:\eS+)
.sp
.P
In many cases, backtracking into a script run pattern fragment is not
desirable. The script run can employ an atomic group to prevent this. Because
this is a common requirement, a shorthand notation is provided by
(*atomic_script_run: or (*asr:
.sp
(*asr:...) is the same as (*sr:(?>...))
.sp
Note that the atomic group is inside the script run. Putting it outside would
not prevent backtracking into the script run pattern.
.P
Support for script runs is not available if PCRE2 is compiled without Unicode
support. A compile-time error is given if any of the above constructs is
encountered. Script runs are not supported by the alternate matching function,
\fBpcre2_dfa_match()\fP, because they use the same mechanism as capturing
parentheses.
.P
\fBWarning:\fP The (*ACCEPT) control verb
.\" HTML <a href="#acceptverb">
.\" </a>
(see below)
.\"
should not be used within a script run subpattern, because it causes an
immediate exit from the subpattern, bypassing the script run checking.
.
.
.\" HTML <a name="conditions"></a>
.SH "CONDITIONAL SUBPATTERNS"
.rs
@ -3267,6 +3337,7 @@ Experiments with Perl suggest that it too has similar optimizations, and like
PCRE2, turning them off can change the result of a match.
.
.
.\" HTML <a name="acceptverb"></a>
.SS "Verbs that act immediately"
.rs
.sp
@ -3287,6 +3358,10 @@ example:
.sp
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
the outer parentheses.
.P
\fBWarning:\fP (*ACCEPT) should not be used within a script run subpattern,
because it causes an immediate exit from the subpattern, bypassing the script
run checking.
.sp
(*FAIL) or (*FAIL:NAME)
.sp
@ -3692,6 +3767,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 24 September 2018
Last updated: 12 October 2018
Copyright (c) 1997-2018 University of Cambridge.
.fi

View File

@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "24 September 2018" "PCRE2 10.33"
.TH PCRE2SYNTAX 3 "10 October 2018" "PCRE2 10.33"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -511,6 +511,16 @@ setting with a similar syntax.
Each top-level branch of a lookbehind must be of a fixed length.
.
.
.SH "SCRIPT RUNS"
.rs
.sp
  (*script_run:...)           ) script run, can be backtracked into
  (*sr:...)                   )
.sp
  (*atomic_script_run:...)    ) atomic script run
  (*asr:...)                  )
.
.
.SH "BACKREFERENCES"
.rs
.sp
@ -633,6 +643,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 24 September 2018
Last updated: 10 October 2018
Copyright (c) 1997-2018 University of Cambridge.
.fi

View File

@ -1,4 +1,4 @@
.TH PCRE2UNICODE 3 "02 September 2018" "PCRE2 10.32"
.TH PCRE2UNICODE 3 "12 October 2018" "PCRE2 10.33"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "UNICODE AND UTF SUPPORT"
@ -118,6 +118,108 @@ few Unicode characters such as Greek sigma have more than two code points that
are case-equivalent, and these are treated as such.
.
.
.\" HTML <a name="scriptruns"></a>
.SH "SCRIPT RUNS"
.rs
.sp
The pattern constructs (*script_run:...) and (*atomic_script_run:...), with
synonyms (*sr:...) and (*asr:...), verify that the string matched within the
parentheses is a script run. In concept, a script run is a sequence of
characters that are all from the same Unicode script. However, because some
scripts are commonly used together, and because some diacritical and other
marks are used with multiple scripts, it is not that simple.
.P
Every Unicode character has a Script property, mostly with a value
corresponding to the name of a script, such as Latin, Greek, or Cyrillic. There
are also three special values:
.P
"Unknown" is used for code points that have not been assigned, and also for the
surrogate code points. In the PCRE2 32-bit library, characters whose code
points are greater than the Unicode maximum (U+10FFFF), which are accessible
only in non-UTF mode, are assigned the Unknown script.
.P
"Common" is used for characters that are used with many scripts. These include
punctuation, emoji, mathematical, musical, and currency symbols, and the ASCII
digits 0 to 9.
.P
"Inherited" is used for characters such as diacritical marks that modify a
previous character. These are considered to take on the script of the character
that they modify.
.P
Some Inherited characters are used with many scripts, but many of them are
normally used with only a small number of scripts. For example, U+102E0 (Coptic
Epact thousands mark) is used only with Arabic and Coptic. In order to make it
possible to check this, a Unicode property called Script Extension exists. Its
value is a list of scripts that apply to the character. For the majority of
characters, the list contains just one script, the same one as the Script
property. However, for characters such as U+102E0 more than one Script is
listed. There are also some Common characters that have a single, non-Common
script in their Script Extension list.
.P
The next section describes the basic rules for deciding whether a given string
of characters is a script run. Note, however, that there are some special cases
involving the Chinese Han script, and an additional constraint for decimal
digits. These are covered in subsequent sections.
.
.
.SS "Basic script run rules"
.rs
.sp
A string that is less than two characters long is a script run. This is the
only case in which an Unknown character can be part of a script run. Longer
strings are checked using only the Script Extensions property, not the basic
Script property.
.P
If a character's Script Extension property is the single value "Inherited", it
is always accepted as part of a script run. This is also true for the property
"Common", subject to the checking of decimal digits described below. All the
remaining characters in a script run must have at least one script in common in
their Script Extension lists. In set-theoretic terminology, the intersection of
all the sets of scripts must not be empty.
.P
A simple example is an Internet name such as "google.com". The letters are all
in the Latin script, and the dot is Common, so this string is a script run.
However, the Cyrillic letter "o" looks exactly the same as the Latin "o"; a
string that looks the same, but with Cyrillic "o"s, is not a script run.
.P
More interesting examples involve characters with more than one script in their
Script Extension. Consider the following characters:
.sp
U+060C Arabic comma
U+06D4 Arabic full stop
.sp
The first has the Script Extension list Arabic, Hanifi Rohingya, Syriac, and
Thaana; the second has just Arabic and Hanifi Rohingya. Both of them could
appear in script runs of either Arabic or Hanifi Rohingya. The first could also
appear in Syriac or Thaana script runs, but the second could not.
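.P
The intersection rule can be sketched in a few lines of Python. This is only an
illustration, not PCRE2's implementation: the table SCRIPT_EXTENSIONS and the
function is_script_run() are hypothetical names, the table is written by hand
from the lists quoted above and the annotations in this commit's test data, and
the sketch ignores Unknown characters, the Han special cases, and the decimal
digit check.

```python
# Hypothetical, hand-written Script Extension lists for a few code points.
# A real checker would take these from the Unicode Character Database.
SCRIPT_EXTENSIONS = {
    0x060C: {"Arabic", "Hanifi Rohingya", "Syriac", "Thaana"},  # Arabic comma
    0x06D4: {"Arabic", "Hanifi Rohingya"},                      # Arabic full stop
    0x0600: {"Arabic"},
    0x10D00: {"Hanifi Rohingya"},
    0x0700: {"Syriac"},
    0x002E: {"Common"},                                         # ASCII full stop
}


def is_script_run(text):
    """Apply only the basic rule: skip Common and Inherited characters,
    then require a non-empty intersection of the remaining lists."""
    if len(text) < 2:
        return True
    candidates = None
    for ch in text:
        scripts = SCRIPT_EXTENSIONS[ord(ch)]  # KeyError for unlisted code points
        if scripts == {"Common"} or scripts == {"Inherited"}:
            continue
        # Narrow the set of scripts that every character so far can belong to.
        candidates = set(scripts) if candidates is None else candidates & scripts
        if not candidates:
            return False
    return True
```

.P
With this table, the string U+060C U+06D4 U+0600 is accepted, whereas adding
U+10D00 or U+0700 empties the intersection, which is consistent with the
test results shown in this commit.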
.
.
.SS "The Chinese Han script"
.rs
.sp
The Chinese Han script is commonly used in conjunction with other scripts for
writing certain languages. Japanese uses the Hiragana and Katakana scripts
together with Han; Korean uses Hangul and Han; Taiwanese Mandarin uses Bopomofo
and Han. These three combinations are treated as special cases when checking
script runs and are, in effect, "virtual scripts". Thus, a script run may
contain a mixture of Hiragana, Katakana, and Han, or a mixture of Hangul and
Han, or a mixture of Bopomofo and Han, but not, for example, a mixture of
Hangul and Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical
Standard 39 ("Unicode Security Mechanisms", http://unicode.org/reports/tr39/)
in allowing such mixtures.
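.P
As a rough sketch of the "virtual script" idea, the following Python fragment
tests whether a set of scripts may appear together. The names HAN_COMBINATIONS
and scripts_may_mix() are hypothetical, each character is assumed to have
exactly one script (Script Extension lists and Common or Inherited characters
are ignored), and it is an assumption of this sketch that any subset of a
permitted combination is acceptable.

```python
# The three combinations named above, treated as "virtual scripts".
HAN_COMBINATIONS = [
    {"Han", "Hiragana", "Katakana"},  # Japanese
    {"Han", "Hangul"},                # Korean
    {"Han", "Bopomofo"},              # Taiwanese Mandarin
]


def scripts_may_mix(scripts):
    """Return True if characters from the given scripts could share a run
    under the Han special cases (simplified: one script per character)."""
    scripts = set(scripts)
    if len(scripts) <= 1:
        return True  # a single script is always compatible with itself
    # Otherwise every script present must fit inside one permitted combination.
    return any(scripts <= combo for combo in HAN_COMBINATIONS)
```

.P
Under these assumptions, Hangul with Han is accepted, but Hangul with Bopomofo
and Han is rejected because no single combination contains all three.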
.
.
.SS "Decimal digits"
.rs
.sp
Unicode contains many sets of 10 decimal digits in different scripts, and some
scripts (including the Common script) contain more than one set. Some of these
decimal digits are visually indistinguishable from the common ASCII
digits. In addition to the script checking described above, if a script run
contains any decimal digits, they must all come from the same set of 10
adjacent characters.
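.P
The digit constraint on its own might be checked as in the Python sketch below,
which uses the standard unicodedata module. The function name
digits_from_one_set() is hypothetical, and identifying a digit set by the code
point of its zero is an assumption of this sketch rather than a description of
how PCRE2 does it.

```python
import unicodedata


def digits_from_one_set(text):
    """Return True if every decimal digit (general category Nd) in text
    comes from the same set of 10 adjacent code points."""
    base = None
    for ch in text:
        if unicodedata.category(ch) != "Nd":
            continue  # only decimal digits are subject to this check
        # Subtracting the digit's value from its code point gives the code
        # point of the zero in its set, which identifies the set.
        zero = ord(ch) - unicodedata.decimal(ch)
        if base is None:
            base = zero
        elif zero != base:
            return False
    return True
```

.P
A string that mixes ASCII digits with, say, Arabic-Indic or fullwidth digits
fails this check even if its other characters would form a valid script run.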
.
.
.SH "VALIDITY OF UTF STRINGS"
.rs
.sp
@ -285,6 +387,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 02 September 2018
Last updated: 12 October 2018
Copyright (c) 1997-2018 University of Cambridge.
.fi

1
testdata/testinput4 vendored
View File

@ -2410,6 +2410,7 @@
\x{3031}\x{3041}\x{30a1}\x{2e80} [Hira Kata] Hira Kata Han
\x{060c}\x{06d4}\x{0600}\x{10d00}\x{0700} [Arab Rohg Syrc Thaa] [Arab Rohg] Arab Rohg Syrc
\x{060c}\x{06d4}\x{0700}\x{0600}\x{10d00} [Arab Rohg Syrc Thaa] [Arab Rohg] Syrc Arab Rohg
\x{2e80}\x{3041}\x{3001}\x{3031}\x{2e80} Han Hira [Bopo, Han, etc] [Hira Kata] Han
/(?<!)(*sr:)/

3
testdata/testinput5 vendored
View File

@ -2112,6 +2112,9 @@
/^(*sr:.*)/B,utf
paypаl.com A classic example of why script run checks are a good thing
/^(*sr:.*(*ACCEPT))/utf
paypаl.com But *ACCEPT breaks things
/^(*sr:\x{2e80}*)/B,utf
/^(*sr:\x{2e80}*)\x{2e80}/B,utf

View File

@ -3902,6 +3902,8 @@ No match
0: \x{60c}\x{6d4}\x{600}
\x{060c}\x{06d4}\x{0700}\x{0600}\x{10d00} [Arab Rohg Syrc Thaa] [Arab Rohg] Syrc Arab Rohg
0: \x{60c}\x{6d4}
\x{2e80}\x{3041}\x{3001}\x{3031}\x{2e80} Han Hira [Bopo, Han, etc] [Hira Kata] Han
0: \x{2e80}\x{3041}\x{3001}\x{3031}\x{2e80}
/(?<!)(*sr:)/

View File

@ -4791,6 +4791,10 @@ MK: ABC
paypаl.com A classic example of why script run checks are a good thing
0: payp
/^(*sr:.*(*ACCEPT))/utf
paypаl.com But *ACCEPT breaks things
0: payp\x{430}l.com But *ACCEPT breaks things
/^(*sr:\x{2e80}*)/B,utf
------------------------------------------------------------------
Bra