Add short synonyms for Bidi_Control and Bidi_Class
parent 30abd0ac8d
commit 49b29f837d
@@ -783,7 +783,7 @@ escape sequences are:
 \P{<i>xx</i>} a character without the <i>xx</i> property
 \X a Unicode extended grapheme cluster
 </pre>
-The property names represented by <i>xx</i> above are not case-sensitive, and in
+The property names represented by <i>xx</i> above are not case-sensitive, and in
 accordance with Unicode's "loose matching" rules, spaces, hyphens, and
 underscores are ignored. There is support for Unicode script names, Unicode
 general category properties, "Any", which matches any character (including
@@ -1072,10 +1072,13 @@ PCRE2_UCP option or by starting the pattern with (*UCP).
 Bi-directional properties for \p and \P
 </b><br>
 <P>
-Two properties relating to bi-directional text are supported:
+Two properties relating to bi-directional text (each with a shorter synonym)
+are supported:
 <pre>
 \p{Bidi_Control} matches a Bidi control character
+\p{Bidi_C} matches a Bidi control character
 \p{Bidi_Class:<class>} matches a character with the given class
+\p{BC:<class>} matches a character with the given class
 </pre>
 The recognized classes are:
 <pre>
@@ -1088,7 +1091,7 @@ The recognized classes are:
 ES European separator
 ET European terminator
 FSI first strong isolate
-L left-to-right
+L left-to-right
 LRE left-to-right embedding
 LRI left-to-right isolate
 LRO left-to-right override
@@ -1101,11 +1104,10 @@ The recognized classes are:
 RLI right-to-left isolate
 RLO right-to-left override
 S segment separator
-WS which space
+WS which space
 </pre>
-For Bidi_Class, an equals sign may be used instead of a colon. The class names
-are case-insensitive. As for other properties, only the short names are
-recognized.
+For Bidi_Class, an equals sign may be used instead of a colon. The class names
+are case-insensitive; only the short names listed above are recognized.
 </P>
 <br><b>
 Extended grapheme clusters
@@ -3902,7 +3904,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC32" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 08 December 2021
+Last updated: 10 December 2021
 <br>
 Copyright © 1997-2021 University of Cambridge.
 <br>
@@ -137,6 +137,11 @@ happening, \s and \w may also match characters with code points in the range
 sequences is changed to use Unicode properties and they match many more
 characters.
 </P>
+<P>
+Property descriptions in \p and \P are matched caselessly; hyphens,
+underscores, and white space are ignored, in accordance with Unicode's "loose
+matching" rules.
+</P>
 <br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
 <P>
 <pre>
@@ -367,7 +372,9 @@ Zanabazar_Square.
 <P>
 <pre>
 \p{Bidi_Control} matches a Bidi control character
+\p{Bidi_C} matches a Bidi control character
 \p{Bidi_Class:<class>} matches a character with the given class
+\p{BC:<class>} matches a character with the given class
 </pre>
 The recognized classes are:
 <pre>
@@ -731,7 +738,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC30" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 08 December 2021
+Last updated: 10 December 2021
 <br>
 Copyright © 1997-2021 University of Cambridge.
 <br>
2133
doc/pcre2.txt
File diff suppressed because it is too large
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "08 December 2021" "PCRE2 10.40"
+.TH PCRE2PATTERN 3 "10 December 2021" "PCRE2 10.40"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -779,7 +779,7 @@ escape sequences are:
 \eP{\fIxx\fP} a character without the \fIxx\fP property
 \eX a Unicode extended grapheme cluster
 .sp
-The property names represented by \fIxx\fP above are not case-sensitive, and in
+The property names represented by \fIxx\fP above are not case-sensitive, and in
 accordance with Unicode's "loose matching" rules, spaces, hyphens, and
 underscores are ignored. There is support for Unicode script names, Unicode
 general category properties, "Any", which matches any character (including
@@ -1064,10 +1064,13 @@ PCRE2_UCP option or by starting the pattern with (*UCP).
 .SS "Bi-directional properties for \ep and \eP"
 .rs
 .sp
-Two properties relating to bi-directional text are supported:
+Two properties relating to bi-directional text (each with a shorter synonym)
+are supported:
 .sp
 \ep{Bidi_Control} matches a Bidi control character
+\ep{Bidi_C} matches a Bidi control character
 \ep{Bidi_Class:<class>} matches a character with the given class
+\ep{BC:<class>} matches a character with the given class
 .sp
 The recognized classes are:
 .sp
@@ -1080,7 +1083,7 @@ The recognized classes are:
 ES European separator
 ET European terminator
 FSI first strong isolate
-L left-to-right
+L left-to-right
 LRE left-to-right embedding
 LRI left-to-right isolate
 LRO left-to-right override
@@ -1093,11 +1096,10 @@ The recognized classes are:
 RLI right-to-left isolate
 RLO right-to-left override
 S segment separator
-WS which space
+WS which space
 .sp
-For Bidi_Class, an equals sign may be used instead of a colon. The class names
-are case-insensitive. As for other properties, only the short names are
-recognized.
+For Bidi_Class, an equals sign may be used instead of a colon. The class names
+are case-insensitive; only the short names listed above are recognized.
 .
 .
 .SS Extended grapheme clusters
@@ -3950,6 +3952,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 08 December 2021
+Last updated: 10 December 2021
 Copyright (c) 1997-2021 University of Cambridge.
 .fi
@@ -1,4 +1,4 @@
-.TH PCRE2SYNTAX 3 "08 December 2021" "PCRE2 10.40"
+.TH PCRE2SYNTAX 3 "10 December 2021" "PCRE2 10.40"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -102,6 +102,10 @@ happening, \es and \ew may also match characters with code points in the range
 128-255. If the PCRE2_UCP option is set, the behaviour of these escape
 sequences is changed to use Unicode properties and they match many more
 characters.
+.P
+Property descriptions in \ep and \eP are matched caselessly; hyphens,
+underscores, and white space are ignored, in accordance with Unicode's "loose
+matching" rules.
 .
 .
 .SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
@@ -337,7 +341,9 @@ Zanabazar_Square.
 .rs
 .sp
 \ep{Bidi_Control} matches a Bidi control character
+\ep{Bidi_C} matches a Bidi control character
 \ep{Bidi_Class:<class>} matches a character with the given class
+\ep{BC:<class>} matches a character with the given class
 .sp
 The recognized classes are:
 .sp
@@ -717,6 +723,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 08 December 2021
+Last updated: 10 December 2021
 Copyright (c) 1997-2021 University of Cambridge.
 .fi
@@ -688,6 +688,15 @@ static uint32_t chartypeoffset[] = {
   OP_STAR - OP_STAR, OP_STARI - OP_STAR,
   OP_NOTSTAR - OP_STAR, OP_NOTSTARI - OP_STAR };

+/* Table of synonyms for Unicode properties. Each pair has the synonym first,
+followed by the name that's in the UCD table (lower case, no hyphens,
+underscores, or spaces). */
+
+static const char *prop_synonyms[] = {
+  "bc", "bidiclass",
+  "bidic", "bidicontrol"
+};
+
 /* Tables of names of POSIX character classes and their lengths. The names are
 now all in a single string, to reduce the number of relocations when a shared
 library is dynamically loaded. The list of lengths is terminated by a zero
@@ -2101,11 +2110,13 @@ negation. */
 if (c == CHAR_LEFT_CURLY_BRACKET)
   {
   if (ptr >= cb->end_pattern) goto ERROR_RETURN;
+
   if (*ptr == CHAR_CIRCUMFLEX_ACCENT)
     {
     *negptr = TRUE;
     ptr++;
     }
+
   for (i = 0; i < (int)(sizeof(name) / sizeof(PCRE2_UCHAR)) - 1; i++)
     {
     if (ptr >= cb->end_pattern) goto ERROR_RETURN;
@@ -2118,10 +2129,39 @@ if (c == CHAR_LEFT_CURLY_BRACKET)
     }
   if (c != CHAR_RIGHT_CURLY_BRACKET) goto ERROR_RETURN;
   name[i] = 0;
+
+  /* Implement a general synonym feature for class names. */
+
+  if (vptr != NULL) *vptr = 0; /* Terminate class name */
+
+  bot = 0;
+  top = sizeof(prop_synonyms)/sizeof(char *);
+
+  while (top != bot)
+    {
+    size_t mid = ((top + bot)/2) & (-2);
+    int cf = PRIV(strcmp_c8)(name, prop_synonyms[mid]);
+    if (cf == 0)
+      {
+      const char *s = prop_synonyms[mid+1];
+      size_t slen = strlen(s);
+      if (vptr != NULL)
+        {
+        size_t vlen = name + i - vptr;
+        memmove(name + slen + 1, vptr + 1, (vlen + 1) * sizeof(PCRE2_UCHAR));
+        vptr = name + slen;
+        i = slen + vlen + 1;
+        }
+      for (size_t k = 0; k <= slen; k++) name[k] = s[k];
+      break;
+      }
+
+    if (cf > 0) bot = mid + 2; else top = mid;
+    }
   }

-/* Otherwise there is just one following character, which must be an ASCII
-letter. */
+/* If { doesn't follow \p or \P there is just one following character, which
+must be an ASCII letter. */

 else if (MAX_255(c) && (cb->ctypes[c] & ctype_letter) != 0)
   {
@@ -2138,7 +2178,6 @@ property names are "bidi<name>". */

 if (vptr != NULL)
   {
-  *vptr = 0; /* Terminate class name */
   if (PRIV(strcmp_c8)(name, "bidiclass") != 0)
     {
     *errorcodeptr = ERR47;
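For readers following the synonym hunks above, the paired-table lookup and the in-place rewrite of the name buffer can be sketched in isolation. This is a minimal sketch under stated assumptions, not the PCRE2 source: it uses plain `char` and `strcmp` in place of `PCRE2_UCHAR` and `PRIV(strcmp_c8)`, and the helper names `find_synonym` and `expand_synonym` are hypothetical. Only the table contents and the even-index binary search come from the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same layout as in the diff: synonym first, canonical name second.
   The pairs must stay sorted by synonym for the binary search to work. */
static const char *prop_synonyms[] = {
  "bc",    "bidiclass",
  "bidic", "bidicontrol"
};

/* Hypothetical helper: return the canonical name for an already normalized
   (lower case, no separators) property name, or NULL if it is no synonym. */
static const char *find_synonym(const char *name)
{
  size_t bot = 0;
  size_t top = sizeof(prop_synonyms) / sizeof(prop_synonyms[0]);
  assert(top % 2 == 0);                         /* table holds whole pairs */
  while (top != bot)
    {
    size_t mid = ((top + bot) / 2) & ~(size_t)1; /* round down to an even index */
    int cf = strcmp(name, prop_synonyms[mid]);
    if (cf == 0) return prop_synonyms[mid + 1];
    if (cf > 0) bot = mid + 2; else top = mid;
    }
  return NULL;
}

/* Hypothetical helper: the buffer holds "name\0" optionally followed by
   "value\0", with *value pointing at the value. If the name is a synonym,
   rewrite the buffer in place as "canonical\0value\0" and update *value.
   Returns 1 if rewritten, 0 if the name is not a synonym, -1 if the
   buffer (cap bytes) is too small. */
static int expand_synonym(char *name, size_t cap, char **value)
{
  const char *canon = find_synonym(name);
  if (canon == NULL) return 0;

  size_t clen = strlen(canon);
  if (value != NULL && *value != NULL)
    {
    size_t vlen = strlen(*value) + 1;           /* value plus its NUL */
    if (clen + 1 + vlen > cap) return -1;
    /* Source and destination may overlap, so memmove, and move the
       value before the canonical name overwrites it. */
    memmove(name + clen + 1, *value, vlen);
    *value = name + clen + 1;
    }
  else if (clen + 1 > cap) return -1;

  memcpy(name, canon, clen + 1);                /* canonical name plus NUL */
  return 1;
}
```

With a buffer holding `bc\0al\0` and the value pointer at `al`, `expand_synonym` leaves `bidiclass\0al\0`, so code that later compares the name against `"bidiclass"` does not need to know about the short form. Adding another synonym in this layout means inserting a pair at the position that keeps the synonyms in `strcmp` order.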
@@ -2507,15 +2507,15 @@
     -->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<--
     -->\x{2066}\x{2067}\x{2068}\x{2069}<--

-/\p{bidicontrol}+?/utf
+/\p{bidic}+?/utf
     -->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<--
     -->\x{2066}\x{2067}\x{2068}\x{2069}<--

-/\p{bidicontrol}++/utf
+/\p{bidi_control}++/utf
     -->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<--
     -->\x{2066}\x{2067}\x{2068}\x{2069}<--

-/[\p{bidi_control}]/utf
+/[\p{bidi_c}]/utf
     -->\x{202c}<--

 /[\p{bidicontrol}]+/utf
@@ -2545,7 +2545,7 @@
 /\p{bidi class = al}/utf
     -->\x{061D}<--

-/\p{bidi class = al}+/utf
+/\p{bc = al}+/utf
     -->\x{061D}\x{061e}\x{061f}<--

 /\p{bidi_class : AL}+?/utf
@@ -2554,7 +2554,7 @@
 /\p{Bidi_Class : AL}++/utf
     -->\x{061D}\x{061e}\x{061f}<--

-/\p{bidi class = aN}+/utf
+/\p{b_c = aN}+/utf
     -->\x{061D}\x{0602}\x{0604}\x{061f}<--

 /\p{bidi class = B}+/utf
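The test patterns above write the same property name several ways (`bidi class`, `bidi_class`, `Bidi_Class`, `b_c`). The documentation hunks state that matching is caseless and that spaces, hyphens, and underscores are ignored. A minimal sketch of that normalization is shown here; `loose_normalize` is a hypothetical helper, not a PCRE2 function, and it assumes ASCII input in a plain `char` buffer.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src into dst, lower-casing ASCII letters and
   dropping white space, hyphens, and underscores. Returns the number of
   characters written (excluding the NUL), or (size_t)-1 if dst, which has
   room for cap bytes, is too small. */
static size_t loose_normalize(const char *src, char *dst, size_t cap)
{
  size_t n = 0;
  for (; *src != '\0'; src++)
    {
    unsigned char c = (unsigned char)*src;
    if (isspace(c) || c == '-' || c == '_') continue;  /* ignored separators */
    if (n + 1 >= cap) return (size_t)-1;               /* keep room for NUL */
    dst[n++] = (char)tolower(c);
    }
  if (cap == 0) return (size_t)-1;
  dst[n] = '\0';
  return n;
}
```

Under this sketch `Bidi_Class`, `bidi class`, and `BIDI-CLASS` all reduce to `bidiclass`, and `b_c` reduces to `bc`, which is the form a synonym table keyed on normalized names would expect.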
@@ -4047,19 +4047,19 @@ No match
     -->\x{2066}\x{2067}\x{2068}\x{2069}<--
  0: \x{2066}\x{2067}\x{2068}\x{2069}

-/\p{bidicontrol}+?/utf
+/\p{bidic}+?/utf
     -->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<--
  0: \x{61c}
     -->\x{2066}\x{2067}\x{2068}\x{2069}<--
  0: \x{2066}

-/\p{bidicontrol}++/utf
+/\p{bidi_control}++/utf
     -->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<--
  0: \x{61c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}
     -->\x{2066}\x{2067}\x{2068}\x{2069}<--
  0: \x{2066}\x{2067}\x{2068}\x{2069}

-/[\p{bidi_control}]/utf
+/[\p{bidi_c}]/utf
     -->\x{202c}<--
  0: \x{202c}

@@ -4107,7 +4107,7 @@ No match
     -->\x{061D}<--
  0: \x{61d}

-/\p{bidi class = al}+/utf
+/\p{bc = al}+/utf
     -->\x{061D}\x{061e}\x{061f}<--
  0: \x{61d}\x{61e}\x{61f}

@@ -4119,7 +4119,7 @@ No match
     -->\x{061D}\x{061e}\x{061f}<--
  0: \x{61d}\x{61e}\x{61f}

-/\p{bidi class = aN}+/utf
+/\p{b_c = aN}+/utf
     -->\x{061D}\x{0602}\x{0604}\x{061f}<--
  0: \x{602}\x{604}
