Add short synonyms for Bidi_Control and Bidi_Class

Philip Hazel 2021-12-10 16:32:10 +00:00
parent 30abd0ac8d
commit 49b29f837d
8 changed files with 1160 additions and 1095 deletions


@ -1072,10 +1072,13 @@ PCRE2_UCP option or by starting the pattern with (*UCP).
Bi-directional properties for \p and \P Bi-directional properties for \p and \P
</b><br> </b><br>
<P> <P>
Two properties relating to bi-directional text are supported: Two properties relating to bi-directional text (each with a shorter synonym)
are supported:
<pre> <pre>
\p{Bidi_Control} matches a Bidi control character \p{Bidi_Control} matches a Bidi control character
\p{Bidi_C} matches a Bidi control character
\p{Bidi_Class:&#60;class&#62;} matches a character with the given class \p{Bidi_Class:&#60;class&#62;} matches a character with the given class
\p{BC:&#60;class&#62;} matches a character with the given class
</pre> </pre>
The recognized classes are: The recognized classes are:
<pre> <pre>
@ -1104,8 +1107,7 @@ The recognized classes are:
WS white space WS white space
</pre> </pre>
For Bidi_Class, an equals sign may be used instead of a colon. The class names For Bidi_Class, an equals sign may be used instead of a colon. The class names
are case-insensitive. As for other properties, only the short names are are case-insensitive; only the short names listed above are recognized.
recognized.
</P> </P>
<br><b> <br><b>
Extended grapheme clusters Extended grapheme clusters
@ -3902,7 +3904,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC32" href="#TOC1">REVISION</a><br> <br><a name="SEC32" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 08 December 2021 Last updated: 10 December 2021
<br> <br>
Copyright &copy; 1997-2021 University of Cambridge. Copyright &copy; 1997-2021 University of Cambridge.
<br> <br>


@ -137,6 +137,11 @@ happening, \s and \w may also match characters with code points in the range
sequences is changed to use Unicode properties and they match many more sequences is changed to use Unicode properties and they match many more
characters. characters.
</P> </P>
<P>
Property descriptions in \p and \P are matched caselessly; hyphens,
underscores, and white space are ignored, in accordance with Unicode's "loose
matching" rules.
</P>
<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br> <br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P> <P>
<pre> <pre>
@ -367,7 +372,9 @@ Zanabazar_Square.
<P> <P>
<pre> <pre>
\p{Bidi_Control} matches a Bidi control character \p{Bidi_Control} matches a Bidi control character
\p{Bidi_C} matches a Bidi control character
\p{Bidi_Class:&#60;class&#62;} matches a character with the given class \p{Bidi_Class:&#60;class&#62;} matches a character with the given class
\p{BC:&#60;class&#62;} matches a character with the given class
</pre> </pre>
The recognized classes are: The recognized classes are:
<pre> <pre>
@ -731,7 +738,7 @@ Cambridge, England.
</P> </P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br> <br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P> <P>
Last updated: 08 December 2021 Last updated: 10 December 2021
<br> <br>
Copyright &copy; 1997-2021 University of Cambridge. Copyright &copy; 1997-2021 University of Cambridge.
<br> <br>
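The paragraph added to the syntax summary above says that property names in \p and \P are matched caselessly, with hyphens, underscores, and white space ignored. A minimal sketch of that normalization in C, assuming it is done as a single pass over the name before any table lookup; `loose_normalize` is a hypothetical helper for illustration, not a PCRE2 function, and it handles ASCII only:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not part of PCRE2): copy src into dst, lower-casing
   ASCII letters and dropping hyphens, underscores, and white space.
   Returns the number of characters written (excluding the terminating zero),
   or (size_t)-1 if dst is too small to hold the result. */
static size_t loose_normalize(const char *src, char *dst, size_t dstsize)
{
  size_t n = 0;

  assert(src != NULL && dst != NULL);

  for (; *src != '\0'; src++)
    {
    unsigned char c = (unsigned char)*src;
    if (c == '-' || c == '_' || isspace(c)) continue;  /* ignored characters */
    if (n + 1 >= dstsize) return (size_t)-1;           /* keep room for the zero */
    dst[n++] = (char)tolower(c);
    }

  if (dstsize == 0) return (size_t)-1;
  dst[n] = '\0';
  return n;
}
```

With this sketch, "Bidi_Class", "bidi class", and "BIDI-CLASS" all normalize to "bidiclass", and "b_c" normalizes to "bc", which is the form the synonym table in this commit expects.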


@ -7035,10 +7035,13 @@ BACKSLASH
Bi-directional properties for \p and \P Bi-directional properties for \p and \P
Two properties relating to bi-directional text are supported: Two properties relating to bi-directional text (each with a shorter
synonym) are supported:
\p{Bidi_Control} matches a Bidi control character \p{Bidi_Control} matches a Bidi control character
\p{Bidi_C} matches a Bidi control character
\p{Bidi_Class:<class>} matches a character with the given class \p{Bidi_Class:<class>} matches a character with the given class
\p{BC:<class>} matches a character with the given class
The recognized classes are: The recognized classes are:
@ -7067,8 +7070,8 @@ BACKSLASH
WS white space WS white space
For Bidi_Class, an equals sign may be used instead of a colon. The For Bidi_Class, an equals sign may be used instead of a colon. The
class names are case-insensitive. As for other properties, only the class names are case-insensitive; only the short names listed above are
short names are recognized. recognized.
Extended grapheme clusters Extended grapheme clusters
@ -9698,7 +9701,7 @@ AUTHOR
REVISION REVISION
Last updated: 08 December 2021 Last updated: 10 December 2021
Copyright (c) 1997-2021 University of Cambridge. Copyright (c) 1997-2021 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------
@ -10646,6 +10649,10 @@ CHARACTER TYPES
iour of these escape sequences is changed to use Unicode properties and iour of these escape sequences is changed to use Unicode properties and
they match many more characters. they match many more characters.
Property descriptions in \p and \P are matched caselessly; hyphens, un-
derscores, and white space are ignored, in accordance with Unicode's
"loose matching" rules.
GENERAL CATEGORY PROPERTIES FOR \p and \P GENERAL CATEGORY PROPERTIES FOR \p and \P
@ -10740,7 +10747,9 @@ SCRIPT NAMES FOR \p AND \P
BIDI_PROPERTIES FOR \p AND \P BIDI_PROPERTIES FOR \p AND \P
\p{Bidi_Control} matches a Bidi control character \p{Bidi_Control} matches a Bidi control character
\p{Bidi_C} matches a Bidi control character
\p{Bidi_Class:<class>} matches a character with the given class \p{Bidi_Class:<class>} matches a character with the given class
\p{BC:<class>} matches a character with the given class
The recognized classes are: The recognized classes are:
@ -11098,7 +11107,7 @@ AUTHOR
REVISION REVISION
Last updated: 08 December 2021 Last updated: 10 December 2021
Copyright (c) 1997-2021 University of Cambridge. Copyright (c) 1997-2021 University of Cambridge.
------------------------------------------------------------------------------ ------------------------------------------------------------------------------


@ -1,4 +1,4 @@
.TH PCRE2PATTERN 3 "08 December 2021" "PCRE2 10.40" .TH PCRE2PATTERN 3 "10 December 2021" "PCRE2 10.40"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS" .SH "PCRE2 REGULAR EXPRESSION DETAILS"
@ -1064,10 +1064,13 @@ PCRE2_UCP option or by starting the pattern with (*UCP).
.SS "Bi-directional properties for \ep and \eP" .SS "Bi-directional properties for \ep and \eP"
.rs .rs
.sp .sp
Two properties relating to bi-directional text are supported: Two properties relating to bi-directional text (each with a shorter synonym)
are supported:
.sp .sp
\ep{Bidi_Control} matches a Bidi control character \ep{Bidi_Control} matches a Bidi control character
\ep{Bidi_C} matches a Bidi control character
\ep{Bidi_Class:<class>} matches a character with the given class \ep{Bidi_Class:<class>} matches a character with the given class
\ep{BC:<class>} matches a character with the given class
.sp .sp
The recognized classes are: The recognized classes are:
.sp .sp
@ -1096,8 +1099,7 @@ The recognized classes are:
WS white space WS white space
.sp .sp
For Bidi_Class, an equals sign may be used instead of a colon. The class names For Bidi_Class, an equals sign may be used instead of a colon. The class names
are case-insensitive. As for other properties, only the short names are are case-insensitive; only the short names listed above are recognized.
recognized.
. .
. .
.SS Extended grapheme clusters .SS Extended grapheme clusters
@ -3950,6 +3952,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 08 December 2021 Last updated: 10 December 2021
Copyright (c) 1997-2021 University of Cambridge. Copyright (c) 1997-2021 University of Cambridge.
.fi .fi


@ -1,4 +1,4 @@
.TH PCRE2SYNTAX 3 "08 December 2021" "PCRE2 10.40" .TH PCRE2SYNTAX 3 "10 December 2021" "PCRE2 10.40"
.SH NAME .SH NAME
PCRE2 - Perl-compatible regular expressions (revised API) PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@ -102,6 +102,10 @@ happening, \es and \ew may also match characters with code points in the range
128-255. If the PCRE2_UCP option is set, the behaviour of these escape 128-255. If the PCRE2_UCP option is set, the behaviour of these escape
sequences is changed to use Unicode properties and they match many more sequences is changed to use Unicode properties and they match many more
characters. characters.
.P
Property descriptions in \ep and \eP are matched caselessly; hyphens,
underscores, and white space are ignored, in accordance with Unicode's "loose
matching" rules.
. .
. .
.SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP" .SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
@ -337,7 +341,9 @@ Zanabazar_Square.
.rs .rs
.sp .sp
\ep{Bidi_Control} matches a Bidi control character \ep{Bidi_Control} matches a Bidi control character
\ep{Bidi_C} matches a Bidi control character
\ep{Bidi_Class:<class>} matches a character with the given class \ep{Bidi_Class:<class>} matches a character with the given class
\ep{BC:<class>} matches a character with the given class
.sp .sp
The recognized classes are: The recognized classes are:
.sp .sp
@ -717,6 +723,6 @@ Cambridge, England.
.rs .rs
.sp .sp
.nf .nf
Last updated: 08 December 2021 Last updated: 10 December 2021
Copyright (c) 1997-2021 University of Cambridge. Copyright (c) 1997-2021 University of Cambridge.
.fi .fi


@ -688,6 +688,15 @@ static uint32_t chartypeoffset[] = {
OP_STAR - OP_STAR, OP_STARI - OP_STAR, OP_STAR - OP_STAR, OP_STARI - OP_STAR,
OP_NOTSTAR - OP_STAR, OP_NOTSTARI - OP_STAR }; OP_NOTSTAR - OP_STAR, OP_NOTSTARI - OP_STAR };
/* Table of synonyms for Unicode properties. Each pair has the synonym first,
followed by the name that's in the UCD table (lower case, no hyphens,
underscores, or spaces). */
static const char *prop_synonyms[] = {
"bc", "bidiclass",
"bidic", "bidicontrol"
};
/* Tables of names of POSIX character classes and their lengths. The names are /* Tables of names of POSIX character classes and their lengths. The names are
now all in a single string, to reduce the number of relocations when a shared now all in a single string, to reduce the number of relocations when a shared
library is dynamically loaded. The list of lengths is terminated by a zero library is dynamically loaded. The list of lengths is terminated by a zero
@ -2101,11 +2110,13 @@ negation. */
if (c == CHAR_LEFT_CURLY_BRACKET) if (c == CHAR_LEFT_CURLY_BRACKET)
{ {
if (ptr >= cb->end_pattern) goto ERROR_RETURN; if (ptr >= cb->end_pattern) goto ERROR_RETURN;
if (*ptr == CHAR_CIRCUMFLEX_ACCENT) if (*ptr == CHAR_CIRCUMFLEX_ACCENT)
{ {
*negptr = TRUE; *negptr = TRUE;
ptr++; ptr++;
} }
for (i = 0; i < (int)(sizeof(name) / sizeof(PCRE2_UCHAR)) - 1; i++) for (i = 0; i < (int)(sizeof(name) / sizeof(PCRE2_UCHAR)) - 1; i++)
{ {
if (ptr >= cb->end_pattern) goto ERROR_RETURN; if (ptr >= cb->end_pattern) goto ERROR_RETURN;
@ -2118,10 +2129,39 @@ if (c == CHAR_LEFT_CURLY_BRACKET)
} }
if (c != CHAR_RIGHT_CURLY_BRACKET) goto ERROR_RETURN; if (c != CHAR_RIGHT_CURLY_BRACKET) goto ERROR_RETURN;
name[i] = 0; name[i] = 0;
/* Implement a general synonym feature for class names. */
if (vptr != NULL) *vptr = 0; /* Terminate class name */
bot = 0;
top = sizeof(prop_synonyms)/sizeof(char *);
while (top != bot)
{
size_t mid = ((top + bot)/2) & (-2);
int cf = PRIV(strcmp_c8)(name, prop_synonyms[mid]);
if (cf == 0)
{
const char *s = prop_synonyms[mid+1];
size_t slen = strlen(s);
if (vptr != NULL)
{
size_t vlen = name + i - vptr;
memmove(name + slen + 1, vptr + 1, (vlen + 1) * sizeof(PCRE2_UCHAR));
vptr = name + slen;
i = slen + vlen + 1;
}
for (size_t k = 0; k <= slen; k++) name[k] = s[k];
break;
} }
/* Otherwise there is just one following character, which must be an ASCII if (cf > 0) bot = mid + 2; else top = mid;
letter. */ }
}
/* If { doesn't follow \p or \P there is just one following character, which
must be an ASCII letter. */
else if (MAX_255(c) && (cb->ctypes[c] & ctype_letter) != 0) else if (MAX_255(c) && (cb->ctypes[c] & ctype_letter) != 0)
{ {
@ -2138,7 +2178,6 @@ property names are "bidi<name>". */
if (vptr != NULL) if (vptr != NULL)
{ {
*vptr = 0; /* Terminate class name */
if (PRIV(strcmp_c8)(name, "bidiclass") != 0) if (PRIV(strcmp_c8)(name, "bidiclass") != 0)
{ {
*errorcodeptr = ERR47; *errorcodeptr = ERR47;
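The synonym lookup in the hunk above is a binary search over a flat array of (synonym, canonical name) pairs; the `& (-2)` step forces the midpoint onto an even index so that the comparison is always against a synonym and never against a canonical name. A standalone sketch of the same idea, simplified to plain `char` strings and `strcmp` (the commit uses `PRIV(strcmp_c8)` on `PCRE2_UCHAR` data and rewrites the name buffer in place); `resolve_synonym` is a hypothetical name, not a PCRE2 function:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Laid out as in the commit: each pair has the synonym first, then the
   canonical name. Pairs must be sorted by synonym for the search to work. */
static const char *prop_synonyms[] = {
  "bc",    "bidiclass",
  "bidic", "bidicontrol"
};

/* Hypothetical helper: return the canonical name for an already normalized
   property name, or the name itself when it is not a known synonym. */
static const char *resolve_synonym(const char *name)
{
  size_t bot = 0;
  size_t top = sizeof(prop_synonyms) / sizeof(prop_synonyms[0]);

  assert(name != NULL);

  while (top != bot)
    {
    /* Clearing the low bit rounds the midpoint down to an even index,
       which is always the synonym half of a pair. */
    size_t mid = ((top + bot) / 2) & ~(size_t)1;
    int cf = strcmp(name, prop_synonyms[mid]);
    if (cf == 0) return prop_synonyms[mid + 1];
    if (cf > 0) bot = mid + 2; else top = mid;   /* step over whole pairs */
    }

  return name;
}
```

The input is assumed to be in the table's form already (lower case, no hyphens, underscores, or spaces), so `resolve_synonym("bc")` yields "bidiclass" while an unlisted name such as "script" is returned unchanged.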

testdata/testinput4

@ -2507,15 +2507,15 @@
-->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<-- -->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<--
-->\x{2066}\x{2067}\x{2068}\x{2069}<-- -->\x{2066}\x{2067}\x{2068}\x{2069}<--
/\p{bidicontrol}+?/utf /\p{bidic}+?/utf
-->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<-- -->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<--
-->\x{2066}\x{2067}\x{2068}\x{2069}<-- -->\x{2066}\x{2067}\x{2068}\x{2069}<--
/\p{bidicontrol}++/utf /\p{bidi_control}++/utf
-->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<-- -->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<--
-->\x{2066}\x{2067}\x{2068}\x{2069}<-- -->\x{2066}\x{2067}\x{2068}\x{2069}<--
/[\p{bidi_control}]/utf /[\p{bidi_c}]/utf
-->\x{202c}<-- -->\x{202c}<--
/[\p{bidicontrol}]+/utf /[\p{bidicontrol}]+/utf
@ -2545,7 +2545,7 @@
/\p{bidi class = al}/utf /\p{bidi class = al}/utf
-->\x{061D}<-- -->\x{061D}<--
/\p{bidi class = al}+/utf /\p{bc = al}+/utf
-->\x{061D}\x{061e}\x{061f}<-- -->\x{061D}\x{061e}\x{061f}<--
/\p{bidi_class : AL}+?/utf /\p{bidi_class : AL}+?/utf
@ -2554,7 +2554,7 @@
/\p{Bidi_Class : AL}++/utf /\p{Bidi_Class : AL}++/utf
-->\x{061D}\x{061e}\x{061f}<-- -->\x{061D}\x{061e}\x{061f}<--
/\p{bidi class = aN}+/utf /\p{b_c = aN}+/utf
-->\x{061D}\x{0602}\x{0604}\x{061f}<-- -->\x{061D}\x{0602}\x{0604}\x{061f}<--
/\p{bidi class = B}+/utf /\p{bidi class = B}+/utf

testdata/testoutput4

@ -4047,19 +4047,19 @@ No match
-->\x{2066}\x{2067}\x{2068}\x{2069}<-- -->\x{2066}\x{2067}\x{2068}\x{2069}<--
0: \x{2066}\x{2067}\x{2068}\x{2069} 0: \x{2066}\x{2067}\x{2068}\x{2069}
/\p{bidicontrol}+?/utf /\p{bidic}+?/utf
-->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<-- -->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<--
0: \x{61c} 0: \x{61c}
-->\x{2066}\x{2067}\x{2068}\x{2069}<-- -->\x{2066}\x{2067}\x{2068}\x{2069}<--
0: \x{2066} 0: \x{2066}
/\p{bidicontrol}++/utf /\p{bidi_control}++/utf
-->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<-- -->\x{061c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}<--
0: \x{61c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d} 0: \x{61c}\x{200e}\x{200f}\x{202a}\x{202b}\x{202c}\x{202d}
-->\x{2066}\x{2067}\x{2068}\x{2069}<-- -->\x{2066}\x{2067}\x{2068}\x{2069}<--
0: \x{2066}\x{2067}\x{2068}\x{2069} 0: \x{2066}\x{2067}\x{2068}\x{2069}
/[\p{bidi_control}]/utf /[\p{bidi_c}]/utf
-->\x{202c}<-- -->\x{202c}<--
0: \x{202c} 0: \x{202c}
@ -4107,7 +4107,7 @@ No match
-->\x{061D}<-- -->\x{061D}<--
0: \x{61d} 0: \x{61d}
/\p{bidi class = al}+/utf /\p{bc = al}+/utf
-->\x{061D}\x{061e}\x{061f}<-- -->\x{061D}\x{061e}\x{061f}<--
0: \x{61d}\x{61e}\x{61f} 0: \x{61d}\x{61e}\x{61f}
@ -4119,7 +4119,7 @@ No match
-->\x{061D}\x{061e}\x{061f}<-- -->\x{061D}\x{061e}\x{061f}<--
0: \x{61d}\x{61e}\x{61f} 0: \x{61d}\x{61e}\x{61f}
/\p{bidi class = aN}+/utf /\p{b_c = aN}+/utf
-->\x{061D}\x{0602}\x{0604}\x{061f}<-- -->\x{061D}\x{0602}\x{0604}\x{061f}<--
0: \x{602}\x{604} 0: \x{602}\x{604}