Add short synonyms for Bidi_Control and Bidi_Class

Philip Hazel 2021-12-10 16:32:10 +00:00
parent 30abd0ac8d
commit 49b29f837d
8 changed files with 1160 additions and 1095 deletions


@ -783,7 +783,7 @@ escape sequences are:
\P{<i>xx</i>} a character without the <i>xx</i> property
\X a Unicode extended grapheme cluster
</pre>
The property names represented by <i>xx</i> above are not case-sensitive, and in
The property names represented by <i>xx</i> above are not case-sensitive, and in
accordance with Unicode's "loose matching" rules, spaces, hyphens, and
underscores are ignored. There is support for Unicode script names, Unicode
general category properties, "Any", which matches any character (including
@ -1072,10 +1072,13 @@ PCRE2_UCP option or by starting the pattern with (*UCP).
Bi-directional properties for \p and \P
</b><br>
<P>
Two properties relating to bi-directional text are supported:
Two properties relating to bi-directional text (each with a shorter synonym)
are supported:
<pre>
\p{Bidi_Control} matches a Bidi control character
\p{Bidi_C} matches a Bidi control character
\p{Bidi_Class:&#60;class&#62;} matches a character with the given class
\p{BC:&#60;class&#62;} matches a character with the given class
</pre>
The recognized classes are:
<pre>
@ -1088,7 +1091,7 @@ The recognized classes are:
ES European separator
ET European terminator
FSI first strong isolate
L left-to-right
L left-to-right
LRE left-to-right embedding
LRI left-to-right isolate
LRO left-to-right override
@ -1101,11 +1104,10 @@ The recognized classes are:
RLI right-to-left isolate
RLO right-to-left override
S segment separator
WS which space
WS which space
</pre>
For Bidi_Class, an equals sign may be used instead of a colon. The class names
are case-insensitive. As for other properties, only the short names are
recognized.
For Bidi_Class, an equals sign may be used instead of a colon. The class names
are case-insensitive; only the short names listed above are recognized.
</P>
<br><b>
Extended grapheme clusters
@ -3902,7 +3904,7 @@ Cambridge, England.
</P>
<br><a name="SEC32" href="#TOC1">REVISION</a><br>
<P>
Last updated: 08 December 2021
Last updated: 10 December 2021
<br>
Copyright &copy; 1997-2021 University of Cambridge.
<br>


@ -137,6 +137,11 @@ happening, \s and \w may also match characters with code points in the range
sequences is changed to use Unicode properties and they match many more
characters.
</P>
<P>
Property descriptions in \p and \P are matched caselessly; hyphens,
underscores, and white space are ignored, in accordance with Unicode's "loose
matching" rules.
</P>
<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
@ -367,7 +372,9 @@ Zanabazar_Square.
<P>
<pre>
\p{Bidi_Control} matches a Bidi control character
\p{Bidi_C} matches a Bidi control character
\p{Bidi_Class:&#60;class&#62;} matches a character with the given class
\p{BC:&#60;class&#62;} matches a character with the given class
</pre>
The recognized classes are:
<pre>
@ -731,7 +738,7 @@ Cambridge, England.
</P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P>
Last updated: 08 December 2021
Last updated: 10 December 2021
<br>
Copyright &copy; 1997-2021 University of Cambridge.
<br>

File diff suppressed because it is too large


@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "08 December 2021" "PCRE2 10.40"
.TH PCRE2PATTERN 3 "10 December 2021" "PCRE2 10.40"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -779,7 +779,7 @@ escape sequences are:
\eP{\fIxx\fP} a character without the \fIxx\fP property
\eX a Unicode extended grapheme cluster
.sp
The property names represented by \fIxx\fP above are not case-sensitive, and in
The property names represented by \fIxx\fP above are not case-sensitive, and in
accordance with Unicode's "loose matching" rules, spaces, hyphens, and
underscores are ignored. There is support for Unicode script names, Unicode
general category properties, "Any", which matches any character (including
@ -1064,10 +1064,13 @@ PCRE2_UCP option or by starting the pattern with (*UCP).
.SS "Bi-directional properties for \ep and \eP"
.rs
.sp
Two properties relating to bi-directional text are supported:
Two properties relating to bi-directional text (each with a shorter synonym)
are supported:
.sp
\ep{Bidi_Control} matches a Bidi control character
\ep{Bidi_C} matches a Bidi control character
\ep{Bidi_Class:<class>} matches a character with the given class
\ep{BC:<class>} matches a character with the given class
.sp
The recognized classes are:
.sp
@ -1080,7 +1083,7 @@ The recognized classes are:
ES European separator
ET European terminator
FSI first strong isolate
L left-to-right
L left-to-right
LRE left-to-right embedding
LRI left-to-right isolate
LRO left-to-right override
@ -1093,11 +1096,10 @@ The recognized classes are:
RLI right-to-left isolate
RLO right-to-left override
S segment separator
WS which space
WS which space
.sp
For Bidi_Class, an equals sign may be used instead of a colon. The class names
are case-insensitive. As for other properties, only the short names are
recognized.
For Bidi_Class, an equals sign may be used instead of a colon. The class names
are case-insensitive; only the short names listed above are recognized.
.
.
.SS Extended grapheme clusters
@ -3950,6 +3952,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 08 December 2021
Last updated: 10 December 2021
Copyright (c) 1997-2021 University of Cambridge.
.fi


@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "08 December 2021" "PCRE2 10.40"
.TH PCRE2SYNTAX 3 "10 December 2021" "PCRE2 10.40"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -102,6 +102,10 @@ happening, \es and \ew may also match characters with code points in the range
128-255. If the PCRE2_UCP option is set, the behaviour of these escape
sequences is changed to use Unicode properties and they match many more
characters.
.P
Property descriptions in \ep and \eP are matched caselessly; hyphens,
underscores, and white space are ignored, in accordance with Unicode's "loose
matching" rules.
.
.
.SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
@ -337,7 +341,9 @@ Zanabazar_Square.
.rs
.sp
\ep{Bidi_Control} matches a Bidi control character
\ep{Bidi_C} matches a Bidi control character
\ep{Bidi_Class:<class>} matches a character with the given class
\ep{BC:<class>} matches a character with the given class
.sp
The recognized classes are:
.sp
@ -717,6 +723,6 @@ Cambridge, England.
.rs
.sp
.nf
Last updated: 08 December 2021
Last updated: 10 December 2021
Copyright (c) 1997-2021 University of Cambridge.
.fi
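
The documentation change above says that property descriptions in \p and \P are matched caselessly, with hyphens, underscores, and white space ignored. A minimal sketch of that normalization in C follows. The function name loose_normalize is hypothetical and is not part of PCRE2; the real code works on PCRE2_UCHAR pattern text, whereas this sketch assumes plain ASCII char strings.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not a PCRE2 function): copy src into dst in lower
case, dropping white space, hyphens, and underscores. Returns the number of
characters written, or (size_t)-1 if dst is too small. */

static size_t loose_normalize(const char *src, char *dst, size_t dstsize)
{
size_t n = 0;
assert(src != NULL && dst != NULL && dstsize > 0);
for (; *src != '\0'; src++)
  {
  unsigned char c = (unsigned char)*src;
  if (isspace(c) || c == '-' || c == '_') continue;   /* Ignored characters */
  if (n + 1 >= dstsize) return (size_t)-1;            /* Keep room for NUL */
  dst[n++] = (char)tolower(c);
  }
dst[n] = '\0';
return n;
}
```

With this sketch, "Bidi_Control", "bidi control", and "BIDI-CONTROL" all reduce to "bidicontrol", which is the form the synonym table in this commit expects.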


@ -688,6 +688,15 @@ static uint32_t chartypeoffset[] = {
OP_STAR - OP_STAR, OP_STARI - OP_STAR,
OP_NOTSTAR - OP_STAR, OP_NOTSTARI - OP_STAR };
/* Table of synonyms for Unicode properties. Each pair has the synonym first,
followed by the name that's in the UCD table (lower case, no hyphens,
underscores, or spaces). */
static const char *prop_synonyms[] = {
"bc", "bidiclass",
"bidic", "bidicontrol"
};
/* Tables of names of POSIX character classes and their lengths. The names are
now all in a single string, to reduce the number of relocations when a shared
library is dynamically loaded. The list of lengths is terminated by a zero
@ -2101,11 +2110,13 @@ negation. */
if (c == CHAR_LEFT_CURLY_BRACKET)
{
if (ptr >= cb->end_pattern) goto ERROR_RETURN;
if (*ptr == CHAR_CIRCUMFLEX_ACCENT)
{
*negptr = TRUE;
ptr++;
}
for (i = 0; i < (int)(sizeof(name) / sizeof(PCRE2_UCHAR)) - 1; i++)
{
if (ptr >= cb->end_pattern) goto ERROR_RETURN;
@ -2118,10 +2129,39 @@ if (c == CHAR_LEFT_CURLY_BRACKET)
}
if (c != CHAR_RIGHT_CURLY_BRACKET) goto ERROR_RETURN;
name[i] = 0;
/* Implement a general synonym feature for class names. */
if (vptr != NULL) *vptr = 0; /* Terminate class name */
bot = 0;
top = sizeof(prop_synonyms)/sizeof(char *);
while (top != bot)
{
size_t mid = ((top + bot)/2) & (-2);
int cf = PRIV(strcmp_c8)(name, prop_synonyms[mid]);
if (cf == 0)
{
const char *s = prop_synonyms[mid+1];
size_t slen = strlen(s);
if (vptr != NULL)
{
size_t vlen = name + i - vptr;
memmove(name + slen + 1, vptr + 1, (vlen + 1) * sizeof(PCRE2_UCHAR));
vptr = name + slen;
i = slen + vlen + 1;
}
for (size_t k = 0; k <= slen; k++) name[k] = s[k];
break;
}
if (cf > 0) bot = mid + 2; else top = mid;
}
}
/* Otherwise there is just one following character, which must be an ASCII
letter. */
/* If { doesn't follow \p or \P there is just one following character, which
must be an ASCII letter. */
else if (MAX_255(c) && (cb->ctypes[c] & ctype_letter) != 0)
{
@ -2138,7 +2178,6 @@ property names are "bidi<name>". */
if (vptr != NULL)
{
*vptr = 0; /* Terminate class name */
if (PRIV(strcmp_c8)(name, "bidiclass") != 0)
{
*errorcodeptr = ERR47;
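
The synonym lookup added in this file is a binary search over a flat array of (synonym, full name) pairs, where the midpoint is forced to an even index so that it always lands on a synonym. The following standalone sketch shows the same idea in isolation. It assumes plain char strings and the standard strcmp in place of PCRE2_UCHAR and PRIV(strcmp_c8), and the names synonyms and lookup_synonym are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Each pair has the synonym first, then the full name. The pairs must be
sorted by synonym for the binary search to work. */

static const char *const synonyms[] = {
  "bc",    "bidiclass",
  "bidic", "bidicontrol"
};

/* Illustrative helper: return the full name for a synonym, or the name
itself if it is not in the table. */

static const char *lookup_synonym(const char *name)
{
size_t bot = 0;
size_t top = sizeof(synonyms) / sizeof(synonyms[0]);
assert(name != NULL);
while (top != bot)
  {
  size_t mid = ((top + bot) / 2) & ~(size_t)1;   /* Round down to even */
  int cf = strcmp(name, synonyms[mid]);
  if (cf == 0) return synonyms[mid + 1];
  if (cf > 0) bot = mid + 2; else top = mid;
  }
return name;
}
```

Because bot and top are always even, mid always indexes a synonym and mid + 1 its full name. Adding a synonym means inserting a new pair at its sorted position.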

10
testdata/testinput4 vendored

@ -2507,15 +2507,15 @@
-->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<--
-->\x{2066}\x{2067}\x{2068}\x{2069}<--
/\p{bidicontrol}+?/utf
/\p{bidic}+?/utf
-->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<--
-->\x{2066}\x{2067}\x{2068}\x{2069}<--
/\p{bidicontrol}++/utf
/\p{bidi_control}++/utf
-->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<--
-->\x{2066}\x{2067}\x{2068}\x{2069}<--
/[\p{bidi_control}]/utf
/[\p{bidi_c}]/utf
-->\x{202c}<--
/[\p{bidicontrol}]+/utf
@ -2545,7 +2545,7 @@
/\p{bidi class = al}/utf
-->\x{061D}<--
/\p{bidi class = al}+/utf
/\p{bc = al}+/utf
-->\x{061D}\x{061e}\x{061f}<--
/\p{bidi_class : AL}+?/utf
@ -2554,7 +2554,7 @@
/\p{Bidi_Class : AL}++/utf
-->\x{061D}\x{061e}\x{061f}<--
/\p{bidi class = aN}+/utf
/\p{b_c = aN}+/utf
-->\x{061D}\x{0602}\x{0604}\x{061f}<--
/\p{bidi class = B}+/utf

10
testdata/testoutput4 vendored

@ -4047,19 +4047,19 @@ No match
-->\x{2066}\x{2067}\x{2068}\x{2069}<--
0: \x{2066}\x{2067}\x{2068}\x{2069}
/\p{bidicontrol}+?/utf
/\p{bidic}+?/utf
-->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<--
0: \x{61c}
-->\x{2066}\x{2067}\x{2068}\x{2069}<--
0: \x{2066}
/\p{bidicontrol}++/utf
/\p{bidi_control}++/utf
-->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<--
0: \x{61c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}
-->\x{2066}\x{2067}\x{2068}\x{2069}<--
0: \x{2066}\x{2067}\x{2068}\x{2069}
/[\p{bidi_control}]/utf
/[\p{bidi_c}]/utf
-->\x{202c}<--
0: \x{202c}
@ -4107,7 +4107,7 @@ No match
-->\x{061D}<--
0: \x{61d}
/\p{bidi class = al}+/utf
/\p{bc = al}+/utf
-->\x{061D}\x{061e}\x{061f}<--
0: \x{61d}\x{61e}\x{61f}
@ -4119,7 +4119,7 @@ No match
-->\x{061D}\x{061e}\x{061f}<--
0: \x{61d}\x{61e}\x{61f}
/\p{bidi class = aN}+/utf
/\p{b_c = aN}+/utf
-->\x{061D}\x{0602}\x{0604}\x{061f}<--
0: \x{602}\x{604}
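
The test patterns above, such as \p{bc = al} and \p{bidi_class : AL}, rely on the property name being separated from its class by either a colon or an equals sign. A rough sketch of that split, applied to text that has already been lower-cased and stripped of spaces, hyphens, and underscores (for example "bc=al"), might look like this. The function split_property is hypothetical; it only mirrors the way the compile code terminates the class name at the separator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split "name:value" or "name=value" in place by
overwriting the separator with NUL. Returns a pointer to the value, or NULL
if there is no separator. */

static char *split_property(char *text)
{
char *sep;
assert(text != NULL);
sep = strpbrk(text, ":=");
if (sep == NULL) return NULL;
*sep = '\0';          /* Terminate the property name */
return sep + 1;
}
```

After the split, the name part ("bc") can be passed through the synonym table, and the value part ("al") compared against the recognized class names.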