diff --git a/ChangeLog b/ChangeLog
index cf87b2d..a068002 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -107,6 +107,10 @@ to an incorrect "lookbehind assertion is not fixed length" error.
23. The VERSION condition test was reading fractional PCRE2 version numbers
such as the 04 in 10.04 incorrectly and hence giving wrong results.
+
+24. Updated to Unicode version 11.0.0. As well as the usual addition of new
+scripts and characters, this involved re-jigging the grapheme break property
+algorithm because Unicode has changed the way emojis are handled.

Version 10.31 12-February-2018

diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html
index 9adc426..9d241b7 100644
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@@ -789,6 +789,7 @@ Cypriot,
Cyrillic,
Deseret,
Devanagari,
+Dogra,
Duployan,
Egyptian_Hieroglyphs,
Elbasan,
@@ -799,9 +800,11 @@ Gothic,
Grantha,
Greek,
Gujarati,
+Gunjala_Gondi,
Gurmukhi,
Han,
Hangul,
+Hanifi_Rohingya,
Hanunoo,
Hatran,
Hebrew,
@@ -829,11 +832,13 @@ Lisu,
Lycian,
Lydian,
Mahajani,
+Makasar,
Malayalam,
Mandaic,
Manichaean,
Marchen,
Masaram_Gondi,
+Medefaidrin,
Meetei_Mayek,
Mende_Kikakui,
Meroitic_Cursive,
@@ -856,6 +861,7 @@ Old_Italic,
Old_North_Arabian,
Old_Permic,
Old_Persian,
+Old_Sogdian,
Old_South_Arabian,
Old_Turkic,
Oriya,
@@ -876,6 +882,7 @@ Shavian,
Siddham,
SignWriting,
Sinhala,
+Sogdian,
Sora_Sompeng,
Soyombo,
Sundanese,
@@ -1006,7 +1013,10 @@ grapheme cluster", and treats the sequence as an atomic group
Unicode supports various kinds of composite character by giving each character
a grapheme breaking property, and having rules that use these properties to
define the boundaries of extended grapheme clusters. The rules are defined in
-Unicode Standard Annex 29, "Unicode Text Segmentation".
+Unicode Standard Annex 29, "Unicode Text Segmentation". Unicode 11.0.0
+abandoned the use of some previous properties that had been used for emojis.
+Instead it introduced various emoji-specific properties. PCRE2 uses only the
+Extended Pictographic property.
\X always matches at least one character. Then it decides whether to add
@@ -1026,27 +1036,24 @@ character; an LVT or T character may be followed only by a T character.
4. Do not end before extending characters or spacing marks or the "zero-width
-joiner" characters. Characters with the "mark" property always have the
+joiner" character. Characters with the "mark" property always have the
"extend" grapheme breaking property.
5. Do not end after prepend characters.
-6. Do not break within emoji modifier sequences (a base character followed by a
-modifier). Extending characters are allowed before the modifier.
+6. Do not break within emoji modifier sequences or emoji zwj sequences. That
+is, do not break between characters with the Extended_Pictographic property.
+Extend and ZWJ characters are allowed between the characters.
-7. Do not break within emoji zwj sequences (zero-width joiner followed by
-"glue after ZWJ" or "base glue after ZWJ").
-
-8. Do not break within emoji flag sequences. That is, do not break between
+7. Do not break within emoji flag sequences. That is, do not break between
regional indicator (RI) characters if there are an odd number of RI characters
before the break point.
-6. Otherwise, end the cluster.
+8. Otherwise, end the cluster.
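The interaction of the revised rules 4 and 6-8 can be sketched in a few lines of Python. This is a toy model, not PCRE2's implementation: the property classifier below covers only the characters used in the demo (the emoji range is a rough assumption), and rules 1-3 and 5 (CR/LF, controls, Hangul, Prepend) are omitted for brevity.

```python
from typing import List

# Toy grapheme-break classifier; PCRE2 uses the full Unicode tables.
def gb_property(ch: str) -> str:
    cp = ord(ch)
    if cp == 0x200D:
        return "ZWJ"
    if cp == 0x0301:                      # COMBINING ACUTE ACCENT
        return "Extend"
    if 0x1F1E6 <= cp <= 0x1F1FF:          # regional indicator symbols
        return "RI"
    if 0x1F300 <= cp <= 0x1FAFF:          # rough emoji block (assumption)
        return "ExtPict"
    return "Other"

def graphemes(text: str) -> List[str]:
    """Split text into clusters using simplified forms of rules 4, 6-8."""
    out, i, n = [], 0, len(text)
    while i < n:
        j = i + 1
        pict = gb_property(text[i]) == "ExtPict"   # state for rule 6
        ri_run = 1 if gb_property(text[i]) == "RI" else 0
        while j < n:
            p = gb_property(text[j])
            if p in ("Extend", "ZWJ"):             # rule 4: never end before these
                pass
            elif p == "ExtPict" and pict and gb_property(text[j - 1]) == "ZWJ":
                pass                               # rule 6: emoji ZWJ sequence
            elif p == "RI" and ri_run % 2 == 1:
                ri_run += 1                        # rule 7: pair RI characters
            else:
                break                              # rule 8: end the cluster
            if p == "ExtPict":
                pict = True
            j += 1
        out.append(text[i:j])
        i = j
    return out
```

With this sketch, "e" plus a combining accent stays one cluster, a woman-ZWJ-woman-ZWJ-girl emoji sequence stays one cluster, and a run of four regional indicators splits only into two flag pairs.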
(?<=\Kfoo)bar
-If the subject is "foobar", a call to pcre2_match() with a starting
-offset of 3 succeeds and reports the matching string as "foobar", that is, the
+If the subject is "foobar", a call to pcre2_match() with a starting
+offset of 3 succeeds and reports the matching string as "foobar", that is, the
start of the reported match is earlier than where the match started.
-Last updated: 30 June 2018
+Last updated: 07 July 2018
Copyright © 1997-2018 University of Cambridge.
diff --git a/doc/html/pcre2syntax.html b/doc/html/pcre2syntax.html
index c0d7b39..dee937e 100644
--- a/doc/html/pcre2syntax.html
+++ b/doc/html/pcre2syntax.html
@@ -188,6 +188,7 @@ at release 5.18.
+Adlam,
Ahom,
Anatolian_Hieroglyphs,
Arabic,
@@ -198,6 +199,7 @@ Bamum,
Bassa_Vah,
Batak,
Bengali,
+Bhaiksuki,
Bopomofo,
Brahmi,
Braille,
@@ -216,6 +218,7 @@ Cypriot,
Cyrillic,
Deseret,
Devanagari,
+Dogra,
Duployan,
Egyptian_Hieroglyphs,
Elbasan,
@@ -226,9 +229,11 @@ Gothic,
Grantha,
Greek,
Gujarati,
+Gunjala_Gondi,
Gurmukhi,
Han,
Hangul,
+Hanifi_Rohingya,
Hanunoo,
Hatran,
Hebrew,
@@ -256,9 +261,13 @@ Lisu,
Lycian,
Lydian,
Mahajani,
+Makasar,
Malayalam,
Mandaic,
Manichaean,
+Marchen,
+Masaram_Gondi,
+Medefaidrin,
Meetei_Mayek,
Mende_Kikakui,
Meroitic_Cursive,
@@ -271,7 +280,9 @@ Multani,
Myanmar,
Nabataean,
New_Tai_Lue,
+Newa,
Nko,
+Nushu,
Ogham,
Ol_Chiki,
Old_Hungarian,
@@ -279,9 +290,11 @@ Old_Italic,
Old_North_Arabian,
Old_Permic,
Old_Persian,
+Old_Sogdian,
Old_South_Arabian,
Old_Turkic,
Oriya,
+Osage,
Osmanya,
Pahawh_Hmong,
Palmyrene,
@@ -298,7 +311,9 @@ Shavian,
Siddham,
SignWriting,
Sinhala,
+Sogdian,
Sora_Sompeng,
+Soyombo,
Sundanese,
Syloti_Nagri,
Syriac,
@@ -309,6 +324,7 @@ Tai_Tham,
Tai_Viet,
Takri,
Tamil,
+Tangut,
Telugu,
Thaana,
Thai,
@@ -318,7 +334,8 @@ Tirhuta,
Ugaritic,
Vai,
Warang_Citi,
-Yi.
+Yi,
+Zanabazar_Square.
@@ -600,7 +617,7 @@ Cambridge, England.
-Last updated: 28 June 2018
+Last updated: 07 July 2018
Copyright © 1997-2018 University of Cambridge.
diff --git a/doc/pcre2.txt b/doc/pcre2.txt
index d8c08a9..553ca20 100644
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@@ -6483,34 +6483,35 @@ BACKSLASH
nese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi,
Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Alba-
nian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot,
- Cyrillic, Deseret, Devanagari, Duployan, Egyptian_Hieroglyphs, Elbasan,
- Ethiopic, Georgian, Glagolitic, Gothic, Grantha, Greek, Gujarati, Gur-
- mukhi, Han, Hangul, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Ara-
- maic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian,
- Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Kho-
- jki, Khudawadi, Lao, Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu,
- Lycian, Lydian, Mahajani, Malayalam, Mandaic, Manichaean, Marchen,
- Masaram_Gondi, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive,
- Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Multani, Myanmar,
- Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki, Old_Hungar-
- ian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian,
- Old_South_Arabian, Old_Turkic, Oriya, Osage, Osmanya, Pahawh_Hmong,
- Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician, Psalter_Pahlavi, Rejang,
- Runic, Samaritan, Saurashtra, Sharada, Shavian, Siddham, SignWriting,
- Sinhala, Sora_Sompeng, Soyombo, Sundanese, Syloti_Nagri, Syriac, Taga-
- log, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Tangut, Tel-
- ugu, Thaana, Thai, Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai,
- Warang_Citi, Yi, Zanabazar_Square.
+ Cyrillic, Deseret, Devanagari, Dogra, Duployan, Egyptian_Hieroglyphs,
+ Elbasan, Ethiopic, Georgian, Glagolitic, Gothic, Grantha, Greek,
+ Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul, Hanifi_Rohingya,
+ Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
+ Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kan-
+ nada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
+ Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Maha-
+ jani, Makasar, Malayalam, Mandaic, Manichaean, Marchen, Masaram_Gondi,
+ Medefaidrin, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive,
+ Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Multani, Myanmar,
+ Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki, Old_Hungar-
+ ian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_Sog-
+ dian, Old_South_Arabian, Old_Turkic, Oriya, Osage, Osmanya,
+ Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician,
+ Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha-
+ vian, Siddham, SignWriting, Sinhala, Sogdian, Sora_Sompeng, Soyombo,
+ Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
+ Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan, Tifi-
+ nagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi, Zanabazar_Square.
Each character has exactly one Unicode general category property, spec-
- ified by a two-letter abbreviation. For compatibility with Perl, nega-
- tion can be specified by including a circumflex between the opening
- brace and the property name. For example, \p{^Lu} is the same as
+ ified by a two-letter abbreviation. For compatibility with Perl, nega-
+ tion can be specified by including a circumflex between the opening
+ brace and the property name. For example, \p{^Lu} is the same as
\P{Lu}.
If only one letter is specified with \p or \P, it includes all the gen-
- eral category properties that start with that letter. In this case, in
- the absence of negation, the curly brackets in the escape sequence are
+ eral category properties that start with that letter. In this case, in
+ the absence of negation, the curly brackets in the escape sequence are
optional; these two examples have the same effect:
\p{L}
@@ -6562,44 +6563,47 @@ BACKSLASH
Zp Paragraph separator
Zs Space separator
- The special property L& is also supported: it matches a character that
- has the Lu, Ll, or Lt property, in other words, a letter that is not
+ The special property L& is also supported: it matches a character that
+ has the Lu, Ll, or Lt property, in other words, a letter that is not
classified as a modifier or "other".
- The Cs (Surrogate) property applies only to characters in the range
- U+D800 to U+DFFF. Such characters are not valid in Unicode strings and
- so cannot be tested by PCRE2, unless UTF validity checking has been
- turned off (see the discussion of PCRE2_NO_UTF_CHECK in the pcre2api
+ The Cs (Surrogate) property applies only to characters in the range
+ U+D800 to U+DFFF. Such characters are not valid in Unicode strings and
+ so cannot be tested by PCRE2, unless UTF validity checking has been
+ turned off (see the discussion of PCRE2_NO_UTF_CHECK in the pcre2api
page). Perl does not support the Cs property.
- The long synonyms for property names that Perl supports (such as
- \p{Letter}) are not supported by PCRE2, nor is it permitted to prefix
+ The long synonyms for property names that Perl supports (such as
+ \p{Letter}) are not supported by PCRE2, nor is it permitted to prefix
any of these properties with "Is".
No character that is in the Unicode table has the Cn (unassigned) prop-
erty. Instead, this property is assumed for any code point that is not
in the Unicode table.
- Specifying caseless matching does not affect these escape sequences.
- For example, \p{Lu} always matches only upper case letters. This is
+ Specifying caseless matching does not affect these escape sequences.
+ For example, \p{Lu} always matches only upper case letters. This is
different from the behaviour of current versions of Perl.
- Matching characters by Unicode property is not fast, because PCRE2 has
- to do a multistage table lookup in order to find a character's prop-
+ Matching characters by Unicode property is not fast, because PCRE2 has
+ to do a multistage table lookup in order to find a character's prop-
erty. That is why the traditional escape sequences such as \d and \w do
- not use Unicode properties in PCRE2 by default, though you can make
- them do so by setting the PCRE2_UCP option or by starting the pattern
+ not use Unicode properties in PCRE2 by default, though you can make
+ them do so by setting the PCRE2_UCP option or by starting the pattern
with (*UCP).
Extended grapheme clusters
- The \X escape matches any number of Unicode characters that form an
+ The \X escape matches any number of Unicode characters that form an
"extended grapheme cluster", and treats the sequence as an atomic group
- (see below). Unicode supports various kinds of composite character by
- giving each character a grapheme breaking property, and having rules
+ (see below). Unicode supports various kinds of composite character by
+ giving each character a grapheme breaking property, and having rules
that use these properties to define the boundaries of extended grapheme
- clusters. The rules are defined in Unicode Standard Annex 29, "Unicode
- Text Segmentation".
+ clusters. The rules are defined in Unicode Standard Annex 29, "Unicode
+ Text Segmentation". Unicode 11.0.0 abandoned the use of some previous
+ properties that had been used for emojis. Instead it introduced vari-
+ ous emoji-specific properties. PCRE2 uses only the Extended Picto-
+ graphic property.
\X always matches at least one character. Then it decides whether to
add additional characters according to the following rules for ending a
@@ -6617,23 +6621,21 @@ BACKSLASH
only by a T character.
4. Do not end before extending characters or spacing marks or the
- "zero-width joiner" characters. Characters with the "mark" property
+ "zero-width joiner" character. Characters with the "mark" property
always have the "extend" grapheme breaking property.
5. Do not end after prepend characters.
- 6. Do not break within emoji modifier sequences (a base character fol-
- lowed by a modifier). Extending characters are allowed before the modi-
- fier.
+ 6. Do not break within emoji modifier sequences or emoji zwj sequences.
+ That is, do not break between characters with the Extended_Pictographic
+ property. Extend and ZWJ characters are allowed between the charac-
+ ters.
- 7. Do not break within emoji zwj sequences (zero-width joiner followed
- by "glue after ZWJ" or "base glue after ZWJ").
-
- 8. Do not break within emoji flag sequences. That is, do not break
+ 7. Do not break within emoji flag sequences. That is, do not break
between regional indicator (RI) characters if there are an odd number
of RI characters before the break point.
- 6. Otherwise, end the cluster.
+ 8. Otherwise, end the cluster.
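Rule 7's pairing of regional indicator characters is easy to illustrate: an emoji flag is simply a pair of RI symbols, so a break is permitted only after an even number of RIs. The helper names below are hypothetical, for illustration only.

```python
# An emoji flag maps a two-letter ASCII country code onto the regional
# indicator symbols U+1F1E6..U+1F1FF ('A' -> U+1F1E6, 'B' -> U+1F1E7, ...).
def flag(country_code: str) -> str:
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in country_code.upper())

# Rule 7: inside a run of n_ri regional indicators, a cluster may end only
# after an even number of them, i.e. between complete flag pairs.
def ri_break_points(n_ri: int) -> list:
    return [i for i in range(1, n_ri) if i % 2 == 0]
```

So "US" becomes U+1F1FA U+1F1F8, and a run of four RIs (two flags) may break only in the middle: `ri_break_points(4)` yields `[2]`.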
PCRE2's additional properties
@@ -8941,7 +8943,7 @@ AUTHOR
REVISION
- Last updated: 30 June 2018
+ Last updated: 07 July 2018
Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------
@@ -9915,26 +9917,29 @@ PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P
SCRIPT NAMES FOR \p AND \P
- Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Balinese,
- Bamum, Bassa_Vah, Batak, Bengali, Bopomofo, Brahmi, Braille, Buginese,
- Buhid, Canadian_Aboriginal, Carian, Caucasian_Albanian, Chakma, Cham,
- Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,
- Devanagari, Duployan, Egyptian_Hieroglyphs, Elbasan, Ethiopic, Geor-
- gian, Glagolitic, Gothic, Grantha, Greek, Gujarati, Gurmukhi, Han,
- Hangul, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
- Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kan-
- nada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
- Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Maha-
- jani, Malayalam, Mandaic, Manichaean, Meetei_Mayek, Mende_Kikakui,
- Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro,
- Multani, Myanmar, Nabataean, New_Tai_Lue, Nko, Ogham, Ol_Chiki,
- Old_Hungarian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian,
- Old_South_Arabian, Old_Turkic, Oriya, Osmanya, Pahawh_Hmong, Palmyrene,
- Pau_Cin_Hau, Phags_Pa, Phoenician, Psalter_Pahlavi, Rejang, Runic,
- Samaritan, Saurashtra, Sharada, Shavian, Siddham, SignWriting, Sinhala,
- Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa,
- Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu, Thaana, Thai,
- Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi.
+ Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Bali-
+ nese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi,
+ Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Alba-
+ nian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot,
+ Cyrillic, Deseret, Devanagari, Dogra, Duployan, Egyptian_Hieroglyphs,
+ Elbasan, Ethiopic, Georgian, Glagolitic, Gothic, Grantha, Greek,
+ Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul, Hanifi_Rohingya,
+ Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
+ Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kan-
+ nada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
+ Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Maha-
+ jani, Makasar, Malayalam, Mandaic, Manichaean, Marchen, Masaram_Gondi,
+ Medefaidrin, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive,
+ Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Multani, Myanmar,
+ Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki, Old_Hungar-
+ ian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_Sog-
+ dian, Old_South_Arabian, Old_Turkic, Oriya, Osage, Osmanya,
+ Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician,
+ Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha-
+ vian, Siddham, SignWriting, Sinhala, Sogdian, Sora_Sompeng, Soyombo,
+ Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
+ Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan, Tifi-
+ nagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi, Zanabazar_Square.
CHARACTER CLASSES
@@ -9960,8 +9965,8 @@ CHARACTER CLASSES
word same as \w
xdigit hexadecimal digit
- In PCRE2, POSIX character set names recognize only ASCII characters by
- default, but some of them use Unicode properties if PCRE2_UCP is set.
+ In PCRE2, POSIX character set names recognize only ASCII characters by
+ default, but some of them use Unicode properties if PCRE2_UCP is set.
You can use \Q...\E inside a character class.
@@ -10047,8 +10052,8 @@ OPTION SETTING
(?xx) as (?x) but also ignore space and tab in classes
(?-...) unset option(s)
- The following are recognized only at the very start of a pattern or
- after one of the newline or \R options with similar syntax. More than
+ The following are recognized only at the very start of a pattern or
+ after one of the newline or \R options with similar syntax. More than
one of them may appear. For the first three, d is a decimal number.
(*LIMIT_DEPTH=d) set the backtracking limit to d
@@ -10063,17 +10068,17 @@ OPTION SETTING
(*UTF) set appropriate UTF mode for the library in use
(*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
- Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the
- value of the limits set by the caller of pcre2_match() or
- pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
+ Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the
+ value of the limits set by the caller of pcre2_match() or
+ pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
- and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
+ and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
respectively, at compile time.
NEWLINE CONVENTION
- These are recognized only at the very start of the pattern or after
+ These are recognized only at the very start of the pattern or after
option settings with a similar syntax.
(*CR) carriage return only
@@ -10086,7 +10091,7 @@ NEWLINE CONVENTION
WHAT \R MATCHES
- These are recognized only at the very start of the pattern or after
+ These are recognized only at the very start of the pattern or after
option setting with a similar syntax.
(*BSR_ANYCRLF) CR, LF, or CRLF
@@ -10155,8 +10160,8 @@ CONDITIONAL PATTERNS
(?(VERSION[>]=n.m) test PCRE2 version
(?(assert) assertion condition
- Note the ambiguity of (?(R) and (?(Rn) which might be named reference
- conditions or recursion tests. Such a condition is interpreted as a
+ Note the ambiguity of (?(R) and (?(Rn) which might be named reference
+ conditions or recursion tests. Such a condition is interpreted as a
reference condition if the relevant named group exists.
@@ -10168,7 +10173,7 @@ BACKTRACKING CONTROL
(*FAIL) force backtrack; synonym (*F)
(*MARK:NAME) set name to be passed back; synonym (*:NAME)
- The following act only when a subsequent match failure causes a back-
+ The following act only when a subsequent match failure causes a back-
track to reach them. They all force a match failure, but they differ in
what happens afterwards. Those that advance the start-of-match point do
so only if the pattern is not anchored.
@@ -10190,14 +10195,14 @@ CALLOUTS
(?C"text") callout with string data
The allowed string delimiters are ` ' " ^ % # $ (which are the same for
- the start and the end), and the starting delimiter { matched with the
- ending delimiter }. To encode the ending delimiter within the string,
+ the start and the end), and the starting delimiter { matched with the
+ ending delimiter }. To encode the ending delimiter within the string,
double it.
SEE ALSO
- pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
+ pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
pcre2(3).
@@ -10210,7 +10215,7 @@ AUTHOR
REVISION
- Last updated: 28 June 2018
+ Last updated: 07 July 2018
Copyright (c) 1997-2018 University of Cambridge.
------------------------------------------------------------------------------
diff --git a/doc/pcre2pattern.3 b/doc/pcre2pattern.3
index 2b534f2..cd9a99c 100644
--- a/doc/pcre2pattern.3
+++ b/doc/pcre2pattern.3
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "30 June 2018" "PCRE2 10.32"
+.TH PCRE2PATTERN 3 "07 July 2018" "PCRE2 10.32"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -788,6 +788,7 @@ Cypriot,
Cyrillic,
Deseret,
Devanagari,
+Dogra,
Duployan,
Egyptian_Hieroglyphs,
Elbasan,
@@ -798,9 +799,11 @@ Gothic,
Grantha,
Greek,
Gujarati,
+Gunjala_Gondi,
Gurmukhi,
Han,
Hangul,
+Hanifi_Rohingya,
Hanunoo,
Hatran,
Hebrew,
@@ -828,11 +831,13 @@ Lisu,
Lycian,
Lydian,
Mahajani,
+Makasar,
Malayalam,
Mandaic,
Manichaean,
Marchen,
Masaram_Gondi,
+Medefaidrin,
Meetei_Mayek,
Mende_Kikakui,
Meroitic_Cursive,
@@ -855,6 +860,7 @@ Old_Italic,
Old_North_Arabian,
Old_Permic,
Old_Persian,
+Old_Sogdian,
Old_South_Arabian,
Old_Turkic,
Oriya,
@@ -875,6 +881,7 @@ Shavian,
Siddham,
SignWriting,
Sinhala,
+Sogdian,
Sora_Sompeng,
Soyombo,
Sundanese,
@@ -1003,7 +1010,10 @@ grapheme cluster", and treats the sequence as an atomic group
Unicode supports various kinds of composite character by giving each character
a grapheme breaking property, and having rules that use these properties to
define the boundaries of extended grapheme clusters. The rules are defined in
-Unicode Standard Annex 29, "Unicode Text Segmentation".
+Unicode Standard Annex 29, "Unicode Text Segmentation". Unicode 11.0.0
+abandoned the use of some previous properties that had been used for emojis.
+Instead it introduced various emoji-specific properties. PCRE2 uses only the
+Extended Pictographic property.
.P
\eX always matches at least one character. Then it decides whether to add
additional characters according to the following rules for ending a cluster:
@@ -1018,22 +1028,20 @@ L, V, LV, or LVT character; an LV or V character may be followed by a V or T
character; an LVT or T character may be followed only by a T character.
.P
4. Do not end before extending characters or spacing marks or the "zero-width
-joiner" characters. Characters with the "mark" property always have the
+joiner" character. Characters with the "mark" property always have the
"extend" grapheme breaking property.
.P
5. Do not end after prepend characters.
.P
-6. Do not break within emoji modifier sequences (a base character followed by a
-modifier). Extending characters are allowed before the modifier.
+6. Do not break within emoji modifier sequences or emoji zwj sequences. That
+is, do not break between characters with the Extended_Pictographic property.
+Extend and ZWJ characters are allowed between the characters.
.P
-7. Do not break within emoji zwj sequences (zero-width joiner followed by
-"glue after ZWJ" or "base glue after ZWJ").
-.P
-8. Do not break within emoji flag sequences. That is, do not break between
+7. Do not break within emoji flag sequences. That is, do not break between
regional indicator (RI) characters if there are an odd number of RI characters
before the break point.
.P
-6. Otherwise, end the cluster.
+8. Otherwise, end the cluster.
.
.
.\" HTML
@@ -1112,8 +1120,8 @@ lead to odd effects. For example, consider this pattern:
.sp
(?<=\eKfoo)bar
.sp
-If the subject is "foobar", a call to \fBpcre2_match()\fP with a starting
-offset of 3 succeeds and reports the matching string as "foobar", that is, the
+If the subject is "foobar", a call to \fBpcre2_match()\fP with a starting
+offset of 3 succeeds and reports the matching string as "foobar", that is, the
start of the reported match is earlier than where the match started.
.
.
@@ -3517,6 +3525,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 30 June 2018
+Last updated: 07 July 2018
Copyright (c) 1997-2018 University of Cambridge.
.fi
diff --git a/doc/pcre2syntax.3 b/doc/pcre2syntax.3
index 4eec552..7e29beb 100644
--- a/doc/pcre2syntax.3
+++ b/doc/pcre2syntax.3
@@ -1,4 +1,4 @@
-.TH PCRE2SYNTAX 3 "28 June 2018" "PCRE2 10.32"
+.TH PCRE2SYNTAX 3 "07 July 2018" "PCRE2 10.32"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -160,6 +160,7 @@ at release 5.18.
.SH "SCRIPT NAMES FOR \ep AND \eP"
.rs
.sp
+Adlam,
Ahom,
Anatolian_Hieroglyphs,
Arabic,
@@ -170,6 +171,7 @@ Bamum,
Bassa_Vah,
Batak,
Bengali,
+Bhaiksuki,
Bopomofo,
Brahmi,
Braille,
@@ -188,6 +190,7 @@ Cypriot,
Cyrillic,
Deseret,
Devanagari,
+Dogra,
Duployan,
Egyptian_Hieroglyphs,
Elbasan,
@@ -198,9 +201,11 @@ Gothic,
Grantha,
Greek,
Gujarati,
+Gunjala_Gondi,
Gurmukhi,
Han,
Hangul,
+Hanifi_Rohingya,
Hanunoo,
Hatran,
Hebrew,
@@ -228,9 +233,13 @@ Lisu,
Lycian,
Lydian,
Mahajani,
+Makasar,
Malayalam,
Mandaic,
Manichaean,
+Marchen,
+Masaram_Gondi,
+Medefaidrin,
Meetei_Mayek,
Mende_Kikakui,
Meroitic_Cursive,
@@ -243,7 +252,9 @@ Multani,
Myanmar,
Nabataean,
New_Tai_Lue,
+Newa,
Nko,
+Nushu,
Ogham,
Ol_Chiki,
Old_Hungarian,
@@ -251,9 +262,11 @@ Old_Italic,
Old_North_Arabian,
Old_Permic,
Old_Persian,
+Old_Sogdian,
Old_South_Arabian,
Old_Turkic,
Oriya,
+Osage,
Osmanya,
Pahawh_Hmong,
Palmyrene,
@@ -270,7 +283,9 @@ Shavian,
Siddham,
SignWriting,
Sinhala,
+Sogdian,
Sora_Sompeng,
+Soyombo,
Sundanese,
Syloti_Nagri,
Syriac,
@@ -281,6 +296,7 @@ Tai_Tham,
Tai_Viet,
Takri,
Tamil,
+Tangut,
Telugu,
Thaana,
Thai,
@@ -290,7 +306,8 @@ Tirhuta,
Ugaritic,
Vai,
Warang_Citi,
-Yi.
+Yi,
+Zanabazar_Square.
.
.
.SH "CHARACTER CLASSES"
@@ -589,6 +606,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 28 June 2018
+Last updated: 07 July 2018
Copyright (c) 1997-2018 University of Cambridge.
.fi
diff --git a/maint/GenerateUtt.py b/maint/GenerateUtt.py
index a152566..54a72e0 100755
--- a/maint/GenerateUtt.py
+++ b/maint/GenerateUtt.py
@@ -24,6 +24,7 @@
# Added script names for Unicode 7.0.0, 20-June-2014.
# Added script names for Unicode 8.0.0, 19-June-2015.
# Added script names for Unicode 10.0.0, 02-July-2017.
+# Added script names for Unicode 11.0.0, 03-July-2018.
script_names = ['Arabic', 'Armenian', 'Bengali', 'Bopomofo', 'Braille', 'Buginese', 'Buhid', 'Canadian_Aboriginal', \
'Cherokee', 'Common', 'Coptic', 'Cypriot', 'Cyrillic', 'Deseret', 'Devanagari', 'Ethiopic', 'Georgian', \
@@ -55,7 +56,10 @@ script_names = ['Arabic', 'Armenian', 'Bengali', 'Bopomofo', 'Braille', 'Bugines
'SignWriting',
# New for Unicode 10.0.0
'Adlam', 'Bhaiksuki', 'Marchen', 'Newa', 'Osage', 'Tangut', 'Masaram_Gondi',
- 'Nushu', 'Soyombo', 'Zanabazar_Square'
+ 'Nushu', 'Soyombo', 'Zanabazar_Square',
+# New for Unicode 11.0.0
+ 'Dogra', 'Gunjala_Gondi', 'Hanifi_Rohingya', 'Makasar', 'Medefaidrin',
+ 'Old_Sogdian', 'Sogdian'
]
category_names = ['Cc', 'Cf', 'Cn', 'Co', 'Cs', 'Ll', 'Lm', 'Lo', 'Lt', 'Lu',
diff --git a/maint/MultiStage2.py b/maint/MultiStage2.py
index f124538..e9ed694 100755
--- a/maint/MultiStage2.py
+++ b/maint/MultiStage2.py
@@ -7,20 +7,26 @@
# This script was submitted to the PCRE project by Peter Kankowski as part of
# the upgrading of Unicode property support. The new code speeds up property
# matching many times. The script is for the use of PCRE maintainers, to
-# generate the pcre_ucd.c file that contains a digested form of the Unicode
+# generate the pcre2_ucd.c file that contains a digested form of the Unicode
# data tables.
#
-# The script has now been upgraded to Python 3 for PCRE2, and should be run in
+# The script has now been upgraded to Python 3 for PCRE2, and should be run in
# the maint subdirectory, using the command
#
# [python3] ./MultiStage2.py >../src/pcre2_ucd.c
#
-# It requires four Unicode data tables, DerivedGeneralCategory.txt,
-# GraphemeBreakProperty.txt, Scripts.txt, and CaseFolding.txt, to be in the
-# Unicode.tables subdirectory. The first of these is found in the "extracted"
-# subdirectory of the Unicode database (UCD) on the Unicode web site; the
-# second is in the "auxiliary" subdirectory; the other two are directly in the
-# UCD directory.
+# It requires five Unicode data tables: DerivedGeneralCategory.txt,
+# GraphemeBreakProperty.txt, Scripts.txt, CaseFolding.txt, and emoji-data.txt.
+# These must be in the maint/Unicode.tables subdirectory.
+#
+# DerivedGeneralCategory.txt is found in the "extracted" subdirectory of the
+# Unicode database (UCD) on the Unicode web site; GraphemeBreakProperty.txt is
+# in the "auxiliary" subdirectory. Scripts.txt and CaseFolding.txt are directly
+# in the UCD directory. The emoji-data.txt file is in files associated with
+# Unicode Technical Standard #51 ("Unicode Emoji"), for example:
+#
+# http://unicode.org/Public/emoji/11.0/emoji-data.txt
+#
#
# Minor modifications made to this script:
# Added #! line at start
@@ -41,7 +47,8 @@
# Added code to search for sets of more than two characters that must match
# each other caselessly. A new table is output containing these sets, and
# offsets into the table are added to the main output records. This new
-# code scans CaseFolding.txt instead of UnicodeData.txt.
+# code scans CaseFolding.txt instead of UnicodeData.txt, which is no longer
+# used.
#
# Update for Python3:
# . Processed with 2to3, but that didn't fix everything
@@ -50,8 +57,13 @@
# . Inserted 'int' before blocksize/ELEMS_PER_LINE because an int is
# required and the result of the division is a float
#
+# Added code to scan the emoji-data.txt file to find the Extended Pictographic
+# property, which is used by PCRE2 as a grapheme breaking property. This was
+# done when updating to Unicode 11.0.0 (July 2018).
+#
+#
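The added emoji-data.txt scan can be sketched as a simplified parser, assuming the file's documented line format of "start..end ; Property # comment". This is illustrative only, not the actual parsing code in MultiStage2.py.

```python
# Collect (lo, hi) code point ranges carrying the Extended_Pictographic
# property from emoji-data.txt-style input lines.
def extended_pictographic_ranges(lines):
    ranges = []
    for line in lines:
        line = line.split("#", 1)[0].strip()      # drop trailing comments
        if not line:
            continue
        fields, prop = (part.strip() for part in line.split(";"))
        if prop != "Extended_Pictographic":
            continue
        if ".." in fields:                        # "1F300..1F320" style range
            lo, hi = (int(x, 16) for x in fields.split(".."))
        else:                                     # single code point
            lo = hi = int(fields, 16)
        ranges.append((lo, hi))
    return ranges
```

For example, a line such as `1F300..1F320 ; Extended_Pictographic# ...` yields the range (0x1F300, 0x1F320), while lines for other properties are skipped.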
# The main tables generated by this script are used by macros defined in
-# pcre2_internal.h. They look up Unicode character properties using short
+# pcre2_internal.h. They look up Unicode character properties using short
# sequences of code that contains no branches, which makes for greater speed.
#
# Conceptually, there is a table of records (of type ucd_record), containing a
@@ -75,43 +87,48 @@
# table of "virtual" blocks; each block is indexed by the offset of a character
# within its own block, and the result is the offset of the required record.
#
+# The following examples are correct for the Unicode 11.0.0 database. Future
+# updates may change the actual lookup values.
+#
# Example: lowercase "a" (U+0061) is in block 0
# lookup 0 in stage1 table yields 0
# lookup 97 in the first table in stage2 yields 16
-# record 17 is { 33, 5, 11, 0, -32 }
+# record 17 is { 33, 5, 12, 0, -32 }
+# record 17 is { 33, 5, 11, 0, -32 }
# 33 = ucp_Latin => Latin script
# 5 = ucp_Ll => Lower case letter
-# 11 = ucp_gbOther => Grapheme break property "Other"
+# 12 = ucp_gbOther => Grapheme break property "Other"
# 0 => not part of a caseless set
# -32 => Other case is U+0041
-#
+#
# Almost all lowercase latin characters resolve to the same record. One or two
# are different because they are part of a multi-character caseless set (for
# example, k, K and the Kelvin symbol are such a set).
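The k / K / Kelvin example of a multi-character caseless set can be checked with Python's built-in case folding, which is derived from the same CaseFolding.txt data the script scans.

```python
# All three characters fold to one value, so they form a single caseless
# set; the long s behaves the same way with s and S.
kelvin_set = ["k", "K", "\u212A"]        # U+212A KELVIN SIGN
assert {c.casefold() for c in kelvin_set} == {"k"}

long_s_set = ["s", "S", "\u017F"]        # U+017F LATIN SMALL LETTER LONG S
assert {c.casefold() for c in long_s_set} == {"s"}
```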
#
# Example: hiragana letter A (U+3042) is in block 96 (0x60)
-# lookup 96 in stage1 table yields 88
-# lookup 66 in the 88th table in stage2 yields 467
-# record 470 is { 26, 7, 11, 0, 0 }
+# lookup 96 in stage1 table yields 90
+# lookup 66 in the 90th table in stage2 yields 515
+# record 515 is { 26, 7, 12, 0, 0 }
# 26 = ucp_Hiragana => Hiragana script
# 7 = ucp_Lo => Other letter
-# 11 = ucp_gbOther => Grapheme break property "Other"
+# 12 = ucp_gbOther => Grapheme break property "Other"
# 0 => not part of a caseless set
-# 0 => No other case
+# 0 => No other case
#
# In these examples, no other blocks resolve to the same "virtual" block, as it
# happens, but plenty of other blocks do share "virtual" blocks.
#
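The two-stage scheme walked through in the comments above can be sketched in a few lines of Python. The block size and the tiny tables below are invented purely for illustration; the real script derives the block size and table contents from the Unicode database, minimizing total table space.

```python
# Illustrative sketch of the branch-free two-stage lookup described above.
# BLOCK_SIZE and these tiny tables are made up for this example; the real
# script computes them from the Unicode data files.
BLOCK_SIZE = 4

stage1 = [0, 1, 0]             # blocks 0 and 2 share "virtual" block 0
stage2 = [5, 5, 5, 6,          # virtual block 0: one record index per offset
          7, 7, 8, 8]          # virtual block 1
records = {5: "Ll", 6: "Lu", 7: "Lo", 8: "Mn"}   # stand-ins for ucd_record

def lookup(c):
    vblock = stage1[c // BLOCK_SIZE]                    # stage 1: which virtual block
    return records[stage2[vblock * BLOCK_SIZE + c % BLOCK_SIZE]]  # stage 2: record
```

Because characters in different blocks often have identical property runs, many stage1 entries point at the same virtual block (here, blocks 0 and 2), which is where the space saving comes from.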
-# There is a fourth table, maintained by hand, which translates from the
+# There is a fourth table, maintained by hand, which translates from the
# individual character types such as ucp_Cc to the general types like ucp_C.
#
# Philip Hazel, 03 July 2008
+# Last Updated: 07 July 2018
+#
#
# 01-March-2010: Updated list of scripts for Unicode 5.2.0
# 30-April-2011: Updated list of scripts for Unicode 6.0.0
# July-2012: Updated list of scripts for Unicode 6.1.0
-# 20-August-2012: Added scan of GraphemeBreakProperty.txt and added a new
-# field in the record to hold the value. Luckily, the
+# 20-August-2012: Added scan of GraphemeBreakProperty.txt and added a new
+# field in the record to hold the value. Luckily, the
# structure had a hole in it, so the resulting table is
# not much bigger than before.
# 18-September-2012: Added code for multiple caseless sets. This uses the
@@ -123,6 +140,9 @@
# 12-August-2014: Updated to put Unicode version into the file
# 19-June-2015: Updated for Unicode 8.0.0
# 02-July-2017: Updated for Unicode 10.0.0
+# 03-July-2018: Updated for Unicode 11.0.0
+# 07-July-2018: Added code to scan emoji-data.txt for the Extended
+# Pictographic property.
##############################################################################
@@ -148,7 +168,7 @@ def get_other_case(chardata):
# Read the whole table in memory, setting/checking the Unicode version
def read_table(file_name, get_value, default_value):
global unicode_version
-
+
f = re.match(r'^[^/]+/([^.]+)\.txt$', file_name)
file_base = f.group(1)
version_pat = r"^# " + re.escape(file_base) + r"-(\d+\.\d+\.\d+)\.txt$"
@@ -159,7 +179,7 @@ def read_table(file_name, get_value, default_value):
unicode_version = version
elif unicode_version != version:
print("WARNING: Unicode version differs in %s", file_name, file=sys.stderr)
-
+
table = [default_value] * MAX_UNICODE
for line in file:
line = re.sub(r'#.*', '', line)
@@ -172,14 +192,14 @@ def read_table(file_name, get_value, default_value):
if m.group(3) is None:
last = char
else:
- last = int(m.group(3), 16)
+ last = int(m.group(3), 16)
for i in range(char, last + 1):
# It is important not to overwrite a previously set
# value because in the CaseFolding file there are lines
- # to be ignored (returning the default value of 0)
- # which often come after a line which has already set
- # data.
- if table[i] == default_value:
+ # to be ignored (returning the default value of 0)
+ # which often come after a line which has already set
+ # data.
+ if table[i] == default_value:
table[i] = value
file.close()
return table
@@ -220,14 +240,14 @@ def compress_table(table, block_size):
stage2 += block
blocks[block] = start
stage1.append(start)
-
+
return stage1, stage2
# Print a table
def print_table(table, table_name, block_size = None):
type, size = get_type_size(table)
ELEMS_PER_LINE = 16
-
+
s = "const %s %s[] = { /* %d bytes" % (type, table_name, size * len(table))
if block_size:
s += ", block = %d" % block_size
@@ -237,7 +257,7 @@ def print_table(table, table_name, block_size = None):
fmt = "%3d," * ELEMS_PER_LINE + " /* U+%04X */"
mult = MAX_UNICODE / len(table)
for i in range(0, len(table), ELEMS_PER_LINE):
- print(fmt % (table[i:i+ELEMS_PER_LINE] +
+ print(fmt % (table[i:i+ELEMS_PER_LINE] +
(int(i * mult),)))
else:
if block_size > ELEMS_PER_LINE:
@@ -274,15 +294,15 @@ def get_record_size_struct(records):
size = (size + slice_size - 1) & -slice_size
size += slice_size
structure += '%s property_%d;\n' % (slice_type, i)
-
+
# round up to the first item of the next structure in array
record_slice = [record[0] for record in records]
slice_type, slice_size = get_type_size(record_slice)
size = (size + slice_size - 1) & -slice_size
-
+
structure += '} ucd_record;\n*/\n\n'
return size, structure
-
+
def test_record_size():
tests = [ \
( [(3,), (6,), (6,), (1,)], 1 ), \
@@ -339,16 +359,23 @@ script_names = ['Arabic', 'Armenian', 'Bengali', 'Bopomofo', 'Braille', 'Bugines
'SignWriting',
# New for Unicode 10.0.0
'Adlam', 'Bhaiksuki', 'Marchen', 'Newa', 'Osage', 'Tangut', 'Masaram_Gondi',
- 'Nushu', 'Soyombo', 'Zanabazar_Square'
+ 'Nushu', 'Soyombo', 'Zanabazar_Square',
+# New for Unicode 11.0.0
+ 'Dogra', 'Gunjala_Gondi', 'Hanifi_Rohingya', 'Makasar', 'Medefaidrin',
+ 'Old_Sogdian', 'Sogdian'
]
-
+
category_names = ['Cc', 'Cf', 'Cn', 'Co', 'Cs', 'Ll', 'Lm', 'Lo', 'Lt', 'Lu',
'Mc', 'Me', 'Mn', 'Nd', 'Nl', 'No', 'Pc', 'Pd', 'Pe', 'Pf', 'Pi', 'Po', 'Ps',
'Sc', 'Sk', 'Sm', 'So', 'Zl', 'Zp', 'Zs' ]
+# The Extended_Pictographic property is not found in the file where all the
+# others are (GraphemeBreakProperty.txt). It comes from the emoji-data.txt
+# file, but we list it here so that the name has the correct index value.
+
break_property_names = ['CR', 'LF', 'Control', 'Extend', 'Prepend',
'SpacingMark', 'L', 'V', 'T', 'LV', 'LVT', 'Regional_Indicator', 'Other',
- 'E_Base', 'E_Modifier', 'E_Base_GAZ', 'ZWJ', 'Glue_After_Zwj' ]
+ 'ZWJ', 'Extended_Pictographic' ]
test_record_size()
unicode_version = ""
@@ -358,21 +385,50 @@ category = read_table('Unicode.tables/DerivedGeneralCategory.txt', make_get_name
break_props = read_table('Unicode.tables/GraphemeBreakProperty.txt', make_get_names(break_property_names), break_property_names.index('Other'))
other_case = read_table('Unicode.tables/CaseFolding.txt', get_other_case, 0)
+# The grapheme breaking rules were changed for Unicode 11.0.0 (June 2018). Now
+# we need to find the Extended_Pictographic property for emoji characters. This
+# can be set as an additional grapheme break property, because the default for
+# all the emojis is "other". We scan the emoji-data.txt file and modify the
+# break-props table.
-# This block of code was added by PH in September 2012. I am not a Python
-# programmer, so the style is probably dreadful, but it does the job. It scans
-# the other_case table to find sets of more than two characters that must all
-# match each other caselessly. Later in this script a table of these sets is
-# written out. However, we have to do this work here in order to compute the
+file = open('Unicode.tables/emoji-data.txt', 'r', encoding='utf-8')
+for line in file:
+ line = re.sub(r'#.*', '', line)
+ chardata = list(map(str.strip, line.split(';')))
+ if len(chardata) <= 1:
+ continue
+
+ if chardata[1] != "Extended_Pictographic":
+ continue
+
+ m = re.match(r'([0-9a-fA-F]+)(\.\.([0-9a-fA-F]+))?$', chardata[0])
+ char = int(m.group(1), 16)
+ if m.group(3) is None:
+ last = char
+ else:
+ last = int(m.group(3), 16)
+ for i in range(char, last + 1):
+ if break_props[i] != break_property_names.index('Other'):
+ print("WARNING: Emoji 0x%x has break property %s, not 'Other'" %
+ (i, break_property_names[break_props[i]]), file=sys.stderr)
+ break_props[i] = break_property_names.index('Extended_Pictographic')
+file.close()
+
+
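The range handling in the scan above can be exercised in isolation. This sketch repeats the same comment-stripping, field-splitting, and `XXXX..YYYY` range parsing on sample lines in the emoji-data.txt format; the sample lines used in the test are illustrative, not quoted from the file.

```python
import re

# Parse one line of the "XXXX ; Prop" or "XXXX..YYYY ; Prop" form used by
# emoji-data.txt, mirroring the scan above. Returns (first, last, property)
# or None for comment/blank lines.
def parse_line(line):
    line = re.sub(r'#.*', '', line)                  # strip trailing comment
    chardata = list(map(str.strip, line.split(';')))
    if len(chardata) <= 1:
        return None
    m = re.match(r'([0-9a-fA-F]+)(\.\.([0-9a-fA-F]+))?$', chardata[0])
    first = int(m.group(1), 16)
    last = first if m.group(3) is None else int(m.group(3), 16)
    return first, last, chardata[1]
```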
+# This block of code was added by PH in September 2012. I am not a Python
+# programmer, so the style is probably dreadful, but it does the job. It scans
+# the other_case table to find sets of more than two characters that must all
+# match each other caselessly. Later in this script a table of these sets is
+# written out. However, we have to do this work here in order to compute the
# offsets in the table that are inserted into the main table.
# The CaseFolding.txt file lists pairs, but the common logic for reading data
-# sets only one value, so first we go through the table and set "return"
+# sets only one value, so first we go through the table and set "return"
# offsets for those that are not already set.
for c in range(0x10ffff):
if other_case[c] != 0 and other_case[c + other_case[c]] == 0:
- other_case[c + other_case[c]] = -other_case[c]
+ other_case[c + other_case[c]] = -other_case[c]
# Now scan again and create equivalence sets.
@@ -382,25 +438,25 @@ for c in range(0x10ffff):
o = c + other_case[c]
# Trigger when this character's other case does not point back here. We
- # now have three characters that are case-equivalent.
-
+ # now have three characters that are case-equivalent.
+
if other_case[o] != -other_case[c]:
t = o + other_case[o]
-
- # Scan the existing sets to see if any of the three characters are already
+
+ # Scan the existing sets to see if any of the three characters are already
# part of a set. If so, unite the existing set with the new set.
-
- appended = 0
+
+ appended = 0
for s in sets:
- found = 0
+ found = 0
for x in s:
if x == c or x == o or x == t:
found = 1
-
+
# Add new characters to an existing set
-
+
if found:
- found = 0
+ found = 0
for y in [c, o, t]:
for x in s:
if x == y:
@@ -408,10 +464,10 @@ for c in range(0x10ffff):
if not found:
s.append(y)
appended = 1
-
+
# If we have not added to an existing set, create a new one.
- if not appended:
+ if not appended:
sets.append([c, o, t])
# End of loop looking for caseless sets.
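As a concrete instance of the multi-character sets this loop collects: "k" (U+006B), "K" (U+004B), and the Kelvin sign (U+212A) must all match each other caselessly. A minimal sketch of the grouping idea (not the script's actual algorithm) is to bucket characters by their common case-folding target, since CaseFolding.txt maps every upper-case form to a single folded character:

```python
# Toy illustration of finding caseless sets of more than two characters.
# The folds mapping below is a hand-picked excerpt, not read from the file:
# 'K' and KELVIN SIGN both fold to 'k'; 'S' and LATIN SMALL LETTER LONG S
# both fold to 's'.
folds = {0x4B: 0x6B, 0x212A: 0x6B,   # K, Kelvin sign -> k
         0x53: 0x73, 0x17F: 0x73}    # S, long s      -> s

sets = {}
for ch, folded in folds.items():
    sets.setdefault(folded, {folded}).add(ch)

# Only groups of more than two characters need an entry in the caseless-sets
# table; simple two-way pairs are handled by the other_case offset alone.
multi = [sorted(s) for s in sets.values() if len(s) > 2]
```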
@@ -422,7 +478,7 @@ caseless_offsets = [0] * MAX_UNICODE
offset = 1;
for s in sets:
- for x in s:
+ for x in s:
caseless_offsets[x] = offset
offset += len(s) + 1
@@ -431,7 +487,7 @@ for s in sets:
# Combine the tables
-table, records = combine_tables(script, category, break_props,
+table, records = combine_tables(script, category, break_props,
caseless_offsets, other_case)
record_size, record_struct = get_record_size_struct(list(records.keys()))
@@ -473,7 +529,7 @@ print("/* This file was autogenerated by the MultiStage2.py script. */")
print("/* Total size: %d bytes, block size: %d. */" % (min_size, min_block_size))
print()
print("/* The tables herein are needed only when UCP support is built,")
-print("and in PCRE2 that happens automatically with UTF support.")
+print("and in PCRE2 that happens automatically with UTF support.")
print("This module should not be referenced otherwise, so")
print("it should not matter whether it is compiled or not. However")
print("a comment was received about space saving - maybe the guy linked")
@@ -484,7 +540,7 @@ print("Instead, just supply small dummy tables. */")
print()
print("#ifndef SUPPORT_UNICODE")
print("const ucd_record PRIV(ucd_records)[] = {{0,0,0,0,0 }};")
-print("const uint8_t PRIV(ucd_stage1)[] = {0};")
+print("const uint16_t PRIV(ucd_stage1)[] = {0};")
print("const uint16_t PRIV(ucd_stage2)[] = {0};")
print("const uint32_t PRIV(ucd_caseless_sets)[] = {0};")
print("#else")
@@ -515,7 +571,7 @@ for s in sets:
s = sorted(s)
for x in s:
print(' 0x%04x,' % x, end=' ')
- print(' NOTACHAR,')
+ print(' NOTACHAR,')
print('};')
print()
diff --git a/maint/README b/maint/README
index fb9b7ee..d2de188 100644
--- a/maint/README
+++ b/maint/README
@@ -23,7 +23,7 @@ GenerateUtt.py A Python script to generate part of the pcre2_tables.c file
ManyConfigTests A shell script that runs "configure, make, test" a number of
times with different configuration settings.
-MultiStage2.py A Python script that generates the file pcre2_ucd.c from three
+MultiStage2.py A Python script that generates the file pcre2_ucd.c from five
Unicode data tables, which are themselves downloaded from the
Unicode web site. Run this script in the "maint" directory.
The generated file contains the tables for a 2-stage lookup
@@ -37,11 +37,17 @@ pcre2_chartables.c.non-standard
README This file.
-Unicode.tables The files in this directory (CaseFolding.txt,
- DerivedGeneralCategory.txt, GraphemeBreakProperty.txt,
- Scripts.txt and UnicodeData.txt) were downloaded from the
- Unicode web site. They contain information about Unicode
- characters and scripts.
+Unicode.tables The files in this directory were downloaded from the Unicode
+ web site. They contain information about Unicode characters
+ and scripts. The ones used by the MultiStage2.py script are
+ CaseFolding.txt, DerivedGeneralCategory.txt, Scripts.txt,
+ GraphemeBreakProperty.txt, and emoji-data.txt. I've kept
+ UnicodeData.txt (which is no longer used by the script)
+ because it is useful occasionally for manually looking up the
+ details of certain characters. However, note that character
+ names in this file such as "Arabic sign sanah" do NOT mean
+ that the character is in a particular script (in this case,
+ Arabic). Scripts.txt is where to look for script information.
ucptest.c A short C program for testing the Unicode property macros
that do lookups in the pcre2_ucd.c data, mainly useful after
@@ -359,4 +365,4 @@ very sensible; some are rather wacky. Some have been on this list for years.
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
-Last updated: 20 May 2017
+Last updated: 07 July 2018
diff --git a/maint/Unicode.tables/CaseFolding.txt b/maint/Unicode.tables/CaseFolding.txt
index efdf18e..cce350f 100644
--- a/maint/Unicode.tables/CaseFolding.txt
+++ b/maint/Unicode.tables/CaseFolding.txt
@@ -1,6 +1,6 @@
-# CaseFolding-10.0.0.txt
-# Date: 2017-04-14, 05:40:18 GMT
-# © 2017 Unicode®, Inc.
+# CaseFolding-11.0.0.txt
+# Date: 2018-01-31, 08:20:09 GMT
+# © 2018 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use, see http://www.unicode.org/terms_of_use.html
#
@@ -603,6 +603,52 @@
1C86; C; 044A; # CYRILLIC SMALL LETTER TALL HARD SIGN
1C87; C; 0463; # CYRILLIC SMALL LETTER TALL YAT
1C88; C; A64B; # CYRILLIC SMALL LETTER UNBLENDED UK
+1C90; C; 10D0; # GEORGIAN MTAVRULI CAPITAL LETTER AN
+1C91; C; 10D1; # GEORGIAN MTAVRULI CAPITAL LETTER BAN
+1C92; C; 10D2; # GEORGIAN MTAVRULI CAPITAL LETTER GAN
+1C93; C; 10D3; # GEORGIAN MTAVRULI CAPITAL LETTER DON
+1C94; C; 10D4; # GEORGIAN MTAVRULI CAPITAL LETTER EN
+1C95; C; 10D5; # GEORGIAN MTAVRULI CAPITAL LETTER VIN
+1C96; C; 10D6; # GEORGIAN MTAVRULI CAPITAL LETTER ZEN
+1C97; C; 10D7; # GEORGIAN MTAVRULI CAPITAL LETTER TAN
+1C98; C; 10D8; # GEORGIAN MTAVRULI CAPITAL LETTER IN
+1C99; C; 10D9; # GEORGIAN MTAVRULI CAPITAL LETTER KAN
+1C9A; C; 10DA; # GEORGIAN MTAVRULI CAPITAL LETTER LAS
+1C9B; C; 10DB; # GEORGIAN MTAVRULI CAPITAL LETTER MAN
+1C9C; C; 10DC; # GEORGIAN MTAVRULI CAPITAL LETTER NAR
+1C9D; C; 10DD; # GEORGIAN MTAVRULI CAPITAL LETTER ON
+1C9E; C; 10DE; # GEORGIAN MTAVRULI CAPITAL LETTER PAR
+1C9F; C; 10DF; # GEORGIAN MTAVRULI CAPITAL LETTER ZHAR
+1CA0; C; 10E0; # GEORGIAN MTAVRULI CAPITAL LETTER RAE
+1CA1; C; 10E1; # GEORGIAN MTAVRULI CAPITAL LETTER SAN
+1CA2; C; 10E2; # GEORGIAN MTAVRULI CAPITAL LETTER TAR
+1CA3; C; 10E3; # GEORGIAN MTAVRULI CAPITAL LETTER UN
+1CA4; C; 10E4; # GEORGIAN MTAVRULI CAPITAL LETTER PHAR
+1CA5; C; 10E5; # GEORGIAN MTAVRULI CAPITAL LETTER KHAR
+1CA6; C; 10E6; # GEORGIAN MTAVRULI CAPITAL LETTER GHAN
+1CA7; C; 10E7; # GEORGIAN MTAVRULI CAPITAL LETTER QAR
+1CA8; C; 10E8; # GEORGIAN MTAVRULI CAPITAL LETTER SHIN
+1CA9; C; 10E9; # GEORGIAN MTAVRULI CAPITAL LETTER CHIN
+1CAA; C; 10EA; # GEORGIAN MTAVRULI CAPITAL LETTER CAN
+1CAB; C; 10EB; # GEORGIAN MTAVRULI CAPITAL LETTER JIL
+1CAC; C; 10EC; # GEORGIAN MTAVRULI CAPITAL LETTER CIL
+1CAD; C; 10ED; # GEORGIAN MTAVRULI CAPITAL LETTER CHAR
+1CAE; C; 10EE; # GEORGIAN MTAVRULI CAPITAL LETTER XAN
+1CAF; C; 10EF; # GEORGIAN MTAVRULI CAPITAL LETTER JHAN
+1CB0; C; 10F0; # GEORGIAN MTAVRULI CAPITAL LETTER HAE
+1CB1; C; 10F1; # GEORGIAN MTAVRULI CAPITAL LETTER HE
+1CB2; C; 10F2; # GEORGIAN MTAVRULI CAPITAL LETTER HIE
+1CB3; C; 10F3; # GEORGIAN MTAVRULI CAPITAL LETTER WE
+1CB4; C; 10F4; # GEORGIAN MTAVRULI CAPITAL LETTER HAR
+1CB5; C; 10F5; # GEORGIAN MTAVRULI CAPITAL LETTER HOE
+1CB6; C; 10F6; # GEORGIAN MTAVRULI CAPITAL LETTER FI
+1CB7; C; 10F7; # GEORGIAN MTAVRULI CAPITAL LETTER YN
+1CB8; C; 10F8; # GEORGIAN MTAVRULI CAPITAL LETTER ELIFI
+1CB9; C; 10F9; # GEORGIAN MTAVRULI CAPITAL LETTER TURNED GAN
+1CBA; C; 10FA; # GEORGIAN MTAVRULI CAPITAL LETTER AIN
+1CBD; C; 10FD; # GEORGIAN MTAVRULI CAPITAL LETTER AEN
+1CBE; C; 10FE; # GEORGIAN MTAVRULI CAPITAL LETTER HARD SIGN
+1CBF; C; 10FF; # GEORGIAN MTAVRULI CAPITAL LETTER LABIAL SIGN
1E00; C; 1E01; # LATIN CAPITAL LETTER A WITH RING BELOW
1E02; C; 1E03; # LATIN CAPITAL LETTER B WITH DOT ABOVE
1E04; C; 1E05; # LATIN CAPITAL LETTER B WITH DOT BELOW
@@ -1180,6 +1226,7 @@ A7B2; C; 029D; # LATIN CAPITAL LETTER J WITH CROSSED-TAIL
A7B3; C; AB53; # LATIN CAPITAL LETTER CHI
A7B4; C; A7B5; # LATIN CAPITAL LETTER BETA
A7B6; C; A7B7; # LATIN CAPITAL LETTER OMEGA
+A7B8; C; A7B9; # LATIN CAPITAL LETTER U WITH STROKE
AB70; C; 13A0; # CHEROKEE SMALL LETTER A
AB71; C; 13A1; # CHEROKEE SMALL LETTER E
AB72; C; 13A2; # CHEROKEE SMALL LETTER I
@@ -1457,6 +1504,38 @@ FF3A; C; FF5A; # FULLWIDTH LATIN CAPITAL LETTER Z
118BD; C; 118DD; # WARANG CITI CAPITAL LETTER SSUU
118BE; C; 118DE; # WARANG CITI CAPITAL LETTER SII
118BF; C; 118DF; # WARANG CITI CAPITAL LETTER VIYO
+16E40; C; 16E60; # MEDEFAIDRIN CAPITAL LETTER M
+16E41; C; 16E61; # MEDEFAIDRIN CAPITAL LETTER S
+16E42; C; 16E62; # MEDEFAIDRIN CAPITAL LETTER V
+16E43; C; 16E63; # MEDEFAIDRIN CAPITAL LETTER W
+16E44; C; 16E64; # MEDEFAIDRIN CAPITAL LETTER ATIU
+16E45; C; 16E65; # MEDEFAIDRIN CAPITAL LETTER Z
+16E46; C; 16E66; # MEDEFAIDRIN CAPITAL LETTER KP
+16E47; C; 16E67; # MEDEFAIDRIN CAPITAL LETTER P
+16E48; C; 16E68; # MEDEFAIDRIN CAPITAL LETTER T
+16E49; C; 16E69; # MEDEFAIDRIN CAPITAL LETTER G
+16E4A; C; 16E6A; # MEDEFAIDRIN CAPITAL LETTER F
+16E4B; C; 16E6B; # MEDEFAIDRIN CAPITAL LETTER I
+16E4C; C; 16E6C; # MEDEFAIDRIN CAPITAL LETTER K
+16E4D; C; 16E6D; # MEDEFAIDRIN CAPITAL LETTER A
+16E4E; C; 16E6E; # MEDEFAIDRIN CAPITAL LETTER J
+16E4F; C; 16E6F; # MEDEFAIDRIN CAPITAL LETTER E
+16E50; C; 16E70; # MEDEFAIDRIN CAPITAL LETTER B
+16E51; C; 16E71; # MEDEFAIDRIN CAPITAL LETTER C
+16E52; C; 16E72; # MEDEFAIDRIN CAPITAL LETTER U
+16E53; C; 16E73; # MEDEFAIDRIN CAPITAL LETTER YU
+16E54; C; 16E74; # MEDEFAIDRIN CAPITAL LETTER L
+16E55; C; 16E75; # MEDEFAIDRIN CAPITAL LETTER Q
+16E56; C; 16E76; # MEDEFAIDRIN CAPITAL LETTER HP
+16E57; C; 16E77; # MEDEFAIDRIN CAPITAL LETTER NY
+16E58; C; 16E78; # MEDEFAIDRIN CAPITAL LETTER X
+16E59; C; 16E79; # MEDEFAIDRIN CAPITAL LETTER D
+16E5A; C; 16E7A; # MEDEFAIDRIN CAPITAL LETTER OE
+16E5B; C; 16E7B; # MEDEFAIDRIN CAPITAL LETTER N
+16E5C; C; 16E7C; # MEDEFAIDRIN CAPITAL LETTER R
+16E5D; C; 16E7D; # MEDEFAIDRIN CAPITAL LETTER O
+16E5E; C; 16E7E; # MEDEFAIDRIN CAPITAL LETTER AI
+16E5F; C; 16E7F; # MEDEFAIDRIN CAPITAL LETTER Y
1E900; C; 1E922; # ADLAM CAPITAL LETTER ALIF
1E901; C; 1E923; # ADLAM CAPITAL LETTER DAALI
1E902; C; 1E924; # ADLAM CAPITAL LETTER LAAM
diff --git a/maint/Unicode.tables/DerivedGeneralCategory.txt b/maint/Unicode.tables/DerivedGeneralCategory.txt
index bc7f5e8..38c95e2 100644
--- a/maint/Unicode.tables/DerivedGeneralCategory.txt
+++ b/maint/Unicode.tables/DerivedGeneralCategory.txt
@@ -1,6 +1,6 @@
-# DerivedGeneralCategory-10.0.0.txt
-# Date: 2017-03-08, 08:41:49 GMT
-# © 2017 Unicode®, Inc.
+# DerivedGeneralCategory-11.0.0.txt
+# Date: 2018-02-21, 05:34:04 GMT
+# © 2018 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use, see http://www.unicode.org/terms_of_use.html
#
@@ -22,25 +22,23 @@
03A2 ; Cn #