Add short synonyms for Bidi_Control and Bidi_Class

Philip Hazel 2021-12-10 16:32:10 +00:00
parent 30abd0ac8d
commit 49b29f837d
8 changed files with 1160 additions and 1095 deletions

@ -783,7 +783,7 @@ escape sequences are:
\P{<i>xx</i>} a character without the <i>xx</i> property \P{<i>xx</i>} a character without the <i>xx</i> property
\X a Unicode extended grapheme cluster \X a Unicode extended grapheme cluster
</pre> </pre>
The property names represented by <i>xx</i> above are not case-sensitive, and in The property names represented by <i>xx</i> above are not case-sensitive, and in
accordance with Unicode's "loose matching" rules, spaces, hyphens, and accordance with Unicode's "loose matching" rules, spaces, hyphens, and
underscores are ignored. There is support for Unicode script names, Unicode underscores are ignored. There is support for Unicode script names, Unicode
general category properties, "Any", which matches any character (including general category properties, "Any", which matches any character (including
@ -1072,10 +1072,13 @@ PCRE2_UCP option or by starting the pattern with (*UCP).
Bi-directional properties for \p and \P Bi-directional properties for \p and \P
</b><br> </b><br>
<P> <P>
Two properties relating to bi-directional text are supported: Two properties relating to bi-directional text (each with a shorter synonym)
are supported:
<pre> <pre>
\p{Bidi_Control} matches a Bidi control character \p{Bidi_Control} matches a Bidi control character
\p{Bidi_C} matches a Bidi control character
\p{Bidi_Class:&#60;class&#62;} matches a character with the given class \p{Bidi_Class:&#60;class&#62;} matches a character with the given class
\p{BC:&#60;class&#62;} matches a character with the given class
</pre> </pre>
The recognized classes are: The recognized classes are:
<pre> <pre>
@ -1088,7 +1091,7 @@ The recognized classes are:
ES European separator ES European separator
ET European terminator ET European terminator
FSI first strong isolate FSI first strong isolate
L left-to-right L left-to-right
LRE left-to-right embedding LRE left-to-right embedding
LRI left-to-right isolate LRI left-to-right isolate
LRO left-to-right override LRO left-to-right override
@ -1101,11 +1104,10 @@ The recognized classes are:
RLI right-to-left isolate RLI right-to-left isolate
RLO right-to-left override RLO right-to-left override
S segment separator S segment separator
WS white space WS white space
</pre> </pre>
For Bidi_Class, an equals sign may be used instead of a colon. The class names For Bidi_Class, an equals sign may be used instead of a colon. The class names
are case-insensitive. As for other properties, only the short names are are case-insensitive; only the short names listed above are recognized.
recognized.
</P> </P>
<br><b> <br><b>
Extended grapheme clusters Extended grapheme clusters
@ -3902,7 +3904,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC32" href="#TOC1">REVISION</a><br> <br><a name="SEC32" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 08 December 2021 Last updated: 10 December 2021
<br> <br>
Copyright &copy; 1997-2021 University of Cambridge. Copyright &copy; 1997-2021 University of Cambridge.
<br> <br>

@ -137,6 +137,11 @@ happening, \s and \w may also match characters with code points in the range
sequences is changed to use Unicode properties and they match many more sequences is changed to use Unicode properties and they match many more
characters. characters.
</P> </P>
<P>
Property descriptions in \p and \P are matched caselessly; hyphens,
underscores, and white space are ignored, in accordance with Unicode's "loose
matching" rules.
</P>
<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br> <br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P> <P>
<pre> <pre>
@ -367,7 +372,9 @@ Zanabazar_Square.
<P> <P>
<pre> <pre>
\p{Bidi_Control} matches a Bidi control character \p{Bidi_Control} matches a Bidi control character
\p{Bidi_C} matches a Bidi control character
\p{Bidi_Class:&#60;class&#62;} matches a character with the given class \p{Bidi_Class:&#60;class&#62;} matches a character with the given class
\p{BC:&#60;class&#62;} matches a character with the given class
</pre> </pre>
The recognized classes are: The recognized classes are:
<pre> <pre>
@ -731,7 +738,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br> <br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 08 December 2021 Last updated: 10 December 2021
<br> <br>
Copyright &copy; 1997-2021 University of Cambridge. Copyright &copy; 1997-2021 University of Cambridge.
<br> <br>
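
The paragraph added to the syntax summary says that property descriptions are matched caselessly and that hyphens, underscores, and white space are ignored. A minimal sketch of that normalization step is shown here, assuming plain ASCII `char` strings. The function name `loose_normalize` is hypothetical and is not part of PCRE2, which works on `PCRE2_UCHAR` code units inside its own parser.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a PCRE2 function): copy src into dst,
   lower-casing letters and dropping white space, hyphens, and underscores.
   At most dst_size - 1 characters are written, followed by a NUL.
   Returns the number of characters written. */
size_t loose_normalize(const char *src, char *dst, size_t dst_size)
{
  size_t n = 0;
  assert(src != NULL && dst != NULL);
  if (dst_size == 0) return 0;
  for (; *src != '\0' && n + 1 < dst_size; src++)
    {
    unsigned char c = (unsigned char)*src;
    if (strchr(" \t\n\v\f\r-_", c) != NULL) continue;  /* ignored characters */
    dst[n++] = (char)tolower(c);
    }
  dst[n] = '\0';
  return n;
}
```

Under this sketch, `Bidi_Class`, `bidi class`, and `BIDI-CLASS` all reduce to `bidiclass`, and `b_c` reduces to `bc`, which is the form the synonym table in the C change is keyed on.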

File diff suppressed because it is too large

@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "08 December 2021" "PCRE2 10.40" .TH PCRE2PATTERN 3 "10 December 2021" "PCRE2 10.40"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS" .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -779,7 +779,7 @@ escape sequences are:
\eP{\fIxx\fP} a character without the \fIxx\fP property \eP{\fIxx\fP} a character without the \fIxx\fP property
\eX a Unicode extended grapheme cluster \eX a Unicode extended grapheme cluster
.sp .sp
The property names represented by \fIxx\fP above are not case-sensitive, and in The property names represented by \fIxx\fP above are not case-sensitive, and in
accordance with Unicode's "loose matching" rules, spaces, hyphens, and accordance with Unicode's "loose matching" rules, spaces, hyphens, and
underscores are ignored. There is support for Unicode script names, Unicode underscores are ignored. There is support for Unicode script names, Unicode
general category properties, "Any", which matches any character (including general category properties, "Any", which matches any character (including
@ -1064,10 +1064,13 @@ PCRE2_UCP option or by starting the pattern with (*UCP).
.SS "Bi-directional properties for \ep and \eP" .SS "Bi-directional properties for \ep and \eP"
.rs .rs
.sp .sp
Two properties relating to bi-directional text are supported: Two properties relating to bi-directional text (each with a shorter synonym)
are supported:
.sp .sp
\ep{Bidi_Control} matches a Bidi control character \ep{Bidi_Control} matches a Bidi control character
\ep{Bidi_C} matches a Bidi control character
\ep{Bidi_Class:<class>} matches a character with the given class \ep{Bidi_Class:<class>} matches a character with the given class
\ep{BC:<class>} matches a character with the given class
.sp .sp
The recognized classes are: The recognized classes are:
.sp .sp
@ -1080,7 +1083,7 @@ The recognized classes are:
ES European separator ES European separator
ET European terminator ET European terminator
FSI first strong isolate FSI first strong isolate
L left-to-right L left-to-right
LRE left-to-right embedding LRE left-to-right embedding
LRI left-to-right isolate LRI left-to-right isolate
LRO left-to-right override LRO left-to-right override
@ -1093,11 +1096,10 @@ The recognized classes are:
RLI right-to-left isolate RLI right-to-left isolate
RLO right-to-left override RLO right-to-left override
S segment separator S segment separator
WS white space WS white space
.sp .sp
For Bidi_Class, an equals sign may be used instead of a colon. The class names For Bidi_Class, an equals sign may be used instead of a colon. The class names
are case-insensitive. As for other properties, only the short names are are case-insensitive; only the short names listed above are recognized.
recognized.
. .
. .
.SS Extended grapheme clusters .SS Extended grapheme clusters
@ -3950,6 +3952,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 08 December 2021 Last updated: 10 December 2021
Copyright (c) 1997-2021 University of Cambridge. Copyright (c) 1997-2021 University of Cambridge.
.fi .fi

@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "08 December 2021" "PCRE2 10.40" .TH PCRE2SYNTAX 3 "10 December 2021" "PCRE2 10.40"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -102,6 +102,10 @@ happening, \es and \ew may also match characters with code points in the range
128-255. If the PCRE2_UCP option is set, the behaviour of these escape 128-255. If the PCRE2_UCP option is set, the behaviour of these escape
sequences is changed to use Unicode properties and they match many more sequences is changed to use Unicode properties and they match many more
characters. characters.
.P
Property descriptions in \ep and \eP are matched caselessly; hyphens,
underscores, and white space are ignored, in accordance with Unicode's "loose
matching" rules.
. .
. .
.SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP" .SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
@ -337,7 +341,9 @@ Zanabazar_Square.
.rs .rs
.sp .sp
\ep{Bidi_Control} matches a Bidi control character \ep{Bidi_Control} matches a Bidi control character
\ep{Bidi_C} matches a Bidi control character
\ep{Bidi_Class:<class>} matches a character with the given class \ep{Bidi_Class:<class>} matches a character with the given class
\ep{BC:<class>} matches a character with the given class
.sp .sp
The recognized classes are: The recognized classes are:
.sp .sp
@ -717,6 +723,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 08 December 2021 Last updated: 10 December 2021
Copyright (c) 1997-2021 University of Cambridge. Copyright (c) 1997-2021 University of Cambridge.
.fi .fi

@ -688,6 +688,15 @@ static uint32_t chartypeoffset[] = {
OP_STAR - OP_STAR, OP_STARI - OP_STAR, OP_STAR - OP_STAR, OP_STARI - OP_STAR,
OP_NOTSTAR - OP_STAR, OP_NOTSTARI - OP_STAR }; OP_NOTSTAR - OP_STAR, OP_NOTSTARI - OP_STAR };
/* Table of synonyms for Unicode properties. Each pair has the synonym first,
followed by the name that's in the UCD table (lower case, no hyphens,
underscores, or spaces). */
static const char *prop_synonyms[] = {
"bc", "bidiclass",
"bidic", "bidicontrol"
};
/* Tables of names of POSIX character classes and their lengths. The names are /* Tables of names of POSIX character classes and their lengths. The names are
now all in a single string, to reduce the number of relocations when a shared now all in a single string, to reduce the number of relocations when a shared
library is dynamically loaded. The list of lengths is terminated by a zero library is dynamically loaded. The list of lengths is terminated by a zero
@ -2101,11 +2110,13 @@ negation. */
if (c == CHAR_LEFT_CURLY_BRACKET) if (c == CHAR_LEFT_CURLY_BRACKET)
{ {
if (ptr >= cb->end_pattern) goto ERROR_RETURN; if (ptr >= cb->end_pattern) goto ERROR_RETURN;
if (*ptr == CHAR_CIRCUMFLEX_ACCENT) if (*ptr == CHAR_CIRCUMFLEX_ACCENT)
{ {
*negptr = TRUE; *negptr = TRUE;
ptr++; ptr++;
} }
for (i = 0; i < (int)(sizeof(name) / sizeof(PCRE2_UCHAR)) - 1; i++) for (i = 0; i < (int)(sizeof(name) / sizeof(PCRE2_UCHAR)) - 1; i++)
{ {
if (ptr >= cb->end_pattern) goto ERROR_RETURN; if (ptr >= cb->end_pattern) goto ERROR_RETURN;
@ -2118,10 +2129,39 @@ if (c == CHAR_LEFT_CURLY_BRACKET)
} }
if (c != CHAR_RIGHT_CURLY_BRACKET) goto ERROR_RETURN; if (c != CHAR_RIGHT_CURLY_BRACKET) goto ERROR_RETURN;
name[i] = 0; name[i] = 0;
/* Implement a general synonym feature for class names. */
if (vptr != NULL) *vptr = 0; /* Terminate class name */
bot = 0;
top = sizeof(prop_synonyms)/sizeof(char *);
while (top != bot)
{
size_t mid = ((top + bot)/2) & (-2);
int cf = PRIV(strcmp_c8)(name, prop_synonyms[mid]);
if (cf == 0)
{
const char *s = prop_synonyms[mid+1];
size_t slen = strlen(s);
if (vptr != NULL)
{
size_t vlen = name + i - vptr;
memmove(name + slen + 1, vptr + 1, (vlen + 1) * sizeof(PCRE2_UCHAR));
vptr = name + slen;
i = slen + vlen + 1;
}
for (size_t k = 0; k <= slen; k++) name[k] = s[k];
break;
}
if (cf > 0) bot = mid + 2; else top = mid;
}
} }
/* Otherwise there is just one following character, which must be an ASCII /* If { doesn't follow \p or \P there is just one following character, which
letter. */ must be an ASCII letter. */
else if (MAX_255(c) && (cb->ctypes[c] & ctype_letter) != 0) else if (MAX_255(c) && (cb->ctypes[c] & ctype_letter) != 0)
{ {
@ -2138,7 +2178,6 @@ property names are "bidi<name>". */
if (vptr != NULL) if (vptr != NULL)
{ {
*vptr = 0; /* Terminate class name */
if (PRIV(strcmp_c8)(name, "bidiclass") != 0) if (PRIV(strcmp_c8)(name, "bidiclass") != 0)
{ {
*errorcodeptr = ERR47; *errorcodeptr = ERR47;
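
The synonym lookup in this change is a binary search over a flat array of pairs, where each synonym sits at an even index and its full name follows at the next odd index. The sketch below isolates that idea under some simplifying assumptions: it uses plain `char` strings and `strcmp` instead of `PCRE2_UCHAR` and `PRIV(strcmp_c8)`, and it returns a pointer instead of rewriting the name buffer in place with `memmove` as the commit does. The function name `resolve_synonym` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Synonym first, full name second. Pairs must be sorted by synonym. */
static const char *const synonyms[] = {
  "bc",    "bidiclass",
  "bidic", "bidicontrol"
};

/* Hypothetical helper: return the full name for a normalized synonym,
   or the input itself when it is not a known synonym. */
const char *resolve_synonym(const char *name)
{
  size_t bot = 0;
  size_t top = sizeof(synonyms) / sizeof(synonyms[0]);
  assert(name != NULL && top % 2 == 0);
  while (top != bot)
    {
    /* Round the midpoint down to an even index so it lands on a synonym. */
    size_t mid = ((top + bot) / 2) & ~(size_t)1;
    int cf = strcmp(name, synonyms[mid]);
    if (cf == 0) return synonyms[mid + 1];
    if (cf > 0) bot = mid + 2; else top = mid;
    }
  return name;
}
```

With this table, `resolve_synonym("bc")` yields `"bidiclass"` and `resolve_synonym("bidic")` yields `"bidicontrol"`, while a name such as `"bidiclass"` falls through unchanged. Because `bot` and `top` stay even throughout, the search only ever compares against synonym slots, which is the purpose of the `& (-2)` mask in the committed code.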

10
testdata/testinput4 vendored

@ -2507,15 +2507,15 @@
-->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<-- -->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<--
-->\x{2066}\x{2067}\x{2068}\x{2069}<-- -->\x{2066}\x{2067}\x{2068}\x{2069}<--
/\p{bidicontrol}+?/utf /\p{bidic}+?/utf
-->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<-- -->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<--
-->\x{2066}\x{2067}\x{2068}\x{2069}<-- -->\x{2066}\x{2067}\x{2068}\x{2069}<--
/\p{bidicontrol}++/utf /\p{bidi_control}++/utf
-->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<-- -->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<--
-->\x{2066}\x{2067}\x{2068}\x{2069}<-- -->\x{2066}\x{2067}\x{2068}\x{2069}<--
/[\p{bidi_control}]/utf /[\p{bidi_c}]/utf
-->\x{202c}<-- -->\x{202c}<--
/[\p{bidicontrol}]+/utf /[\p{bidicontrol}]+/utf
@ -2545,7 +2545,7 @@
/\p{bidi class = al}/utf /\p{bidi class = al}/utf
-->\x{061D}<-- -->\x{061D}<--
/\p{bidi class = al}+/utf /\p{bc = al}+/utf
-->\x{061D}\x{061e}\x{061f}<-- -->\x{061D}\x{061e}\x{061f}<--
/\p{bidi_class : AL}+?/utf /\p{bidi_class : AL}+?/utf
@ -2554,7 +2554,7 @@
/\p{Bidi_Class : AL}++/utf /\p{Bidi_Class : AL}++/utf
-->\x{061D}\x{061e}\x{061f}<-- -->\x{061D}\x{061e}\x{061f}<--
/\p{bidi class = aN}+/utf /\p{b_c = aN}+/utf
-->\x{061D}\x{0602}\x{0604}\x{061f}<-- -->\x{061D}\x{0602}\x{0604}\x{061f}<--
/\p{bidi class = B}+/utf /\p{bidi class = B}+/utf

10
testdata/testoutput4 vendored

@ -4047,19 +4047,19 @@ No match
-->\x{2066}\x{2067}\x{2068}\x{2069}<-- -->\x{2066}\x{2067}\x{2068}\x{2069}<--
0: \x{2066}\x{2067}\x{2068}\x{2069} 0: \x{2066}\x{2067}\x{2068}\x{2069}
/\p{bidicontrol}+?/utf /\p{bidic}+?/utf
-->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<-- -->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<--
0: \x{61c} 0: \x{61c}
-->\x{2066}\x{2067}\x{2068}\x{2069}<-- -->\x{2066}\x{2067}\x{2068}\x{2069}<--
0: \x{2066} 0: \x{2066}
/\p{bidicontrol}++/utf /\p{bidi_control}++/utf
-->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<-- -->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<--
0: \x{61c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d} 0: \x{61c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}
-->\x{2066}\x{2067}\x{2068}\x{2069}<-- -->\x{2066}\x{2067}\x{2068}\x{2069}<--
0: \x{2066}\x{2067}\x{2068}\x{2069} 0: \x{2066}\x{2067}\x{2068}\x{2069}
/[\p{bidi_control}]/utf /[\p{bidi_c}]/utf
-->\x{202c}<-- -->\x{202c}<--
0: \x{202c} 0: \x{202c}
@ -4107,7 +4107,7 @@ No match
-->\x{061D}<-- -->\x{061D}<--
0: \x{61d} 0: \x{61d}
/\p{bidi class = al}+/utf /\p{bc = al}+/utf
-->\x{061D}\x{061e}\x{061f}<-- -->\x{061D}\x{061e}\x{061f}<--
0: \x{61d}\x{61e}\x{61f} 0: \x{61d}\x{61e}\x{61f}
@ -4119,7 +4119,7 @@ No match
-->\x{061D}\x{061e}\x{061f}<-- -->\x{061D}\x{061e}\x{061f}<--
0: \x{61d}\x{61e}\x{61f} 0: \x{61d}\x{61e}\x{61f}
/\p{bidi class = aN}+/utf /\p{b_c = aN}+/utf
-->\x{061D}\x{0602}\x{0604}\x{061f}<-- -->\x{061D}\x{0602}\x{0604}\x{061f}<--
0: \x{602}\x{604} 0: \x{602}\x{604}
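
The updated tests rely on spellings such as `bc = al` and `b_c = aN` being read the same way as `bidi class = al`. As a rough illustration of why, the sketch below splits the text between the braces into a normalized property name and a normalized class value at the first colon or equals sign. It is an assumption-laden simplification: `split_property` is a hypothetical name, it handles only ASCII `char` strings, and it does not reproduce PCRE2's actual parsing or error handling.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not PCRE2 code). Splits text such as "b_c = aN"
   into name ("bc") and value ("an"), lower-casing letters and skipping
   spaces, tabs, hyphens, and underscores. Both output buffers hold `size`
   bytes. Returns 1 if a value was present, 0 if not, and -1 if either
   part does not fit. */
int split_property(const char *text, char *name, char *value, size_t size)
{
  char *out = name;
  size_t n = 0;
  int has_value = 0;
  assert(text != NULL && name != NULL && value != NULL && size > 0);
  name[0] = value[0] = '\0';
  for (; *text != '\0'; text++)
    {
    unsigned char c = (unsigned char)*text;
    if (strchr(" \t-_", c) != NULL) continue;      /* ignored characters */
    if (!has_value && (c == ':' || c == '='))      /* first separator */
      {
      has_value = 1;
      out = value;
      n = 0;
      continue;
      }
    if (n + 1 >= size) return -1;
    out[n++] = (char)tolower(c);
    out[n] = '\0';
    }
  return has_value;
}
```

Under these assumptions, `b_c = aN` splits into `bc` and `an`, and `Bidi_Class : AL` splits into `bidiclass` and `al`. Once the synonym `bc` is mapped to `bidiclass`, both forms name the same property, which is consistent with the unchanged expected output in the tests above.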