Documentation update for Script Extensions property coding.
parent 04ba4bce0f
commit 83726c359d
@@ -8,26 +8,28 @@
 # the upgrading of Unicode property support. The new code speeds up property
 # matching many times. The script is for the use of PCRE maintainers, to
 # generate the pcre2_ucd.c file that contains a digested form of the Unicode
-# data tables.
+# data tables. A number of extensions have been added to the original script.
 #
 # The script has now been upgraded to Python 3 for PCRE2, and should be run in
 # the maint subdirectory, using the command
 #
 # [python3] ./MultiStage2.py >../src/pcre2_ucd.c
 #
-# It requires five Unicode data tables: DerivedGeneralCategory.txt,
-# GraphemeBreakProperty.txt, Scripts.txt, CaseFolding.txt, and emoji-data.txt.
-# These must be in the maint/Unicode.tables subdirectory.
+# It requires six Unicode data tables: DerivedGeneralCategory.txt,
+# GraphemeBreakProperty.txt, Scripts.txt, ScriptExtensions.txt,
+# CaseFolding.txt, and emoji-data.txt. These must be in the
+# maint/Unicode.tables subdirectory.
 #
 # DerivedGeneralCategory.txt is found in the "extracted" subdirectory of the
 # Unicode database (UCD) on the Unicode web site; GraphemeBreakProperty.txt is
-# in the "auxiliary" subdirectory. Scripts.txt and CaseFolding.txt are directly
-# in the UCD directory. The emoji-data.txt file is in files associated with
-# Unicode Technical Standard #51 ("Unicode Emoji"), for example:
-#
-# http://unicode.org/Public/emoji/11.0/emoji-data.txt
+# in the "auxiliary" subdirectory. Scripts.txt, ScriptExtensions.txt, and
+# CaseFolding.txt are directly in the UCD directory. The emoji-data.txt file is
+# in files associated with Unicode Technical Standard #51 ("Unicode Emoji"),
+# for example:
 #
+# http://unicode.org/Public/emoji/11.0/emoji-data.txt
 #
+# -----------------------------------------------------------------------------
 # Minor modifications made to this script:
 # Added #! line at start
 # Removed tabs
@@ -61,78 +63,8 @@
 # property, which is used by PCRE2 as a grapheme breaking property. This was
 # done when updating to Unicode 11.0.0 (July 2018).
 #
-# Added code to add a Script Extensions field to records.
-#
-#
-# The main tables generated by this script are used by macros defined in
-# pcre2_internal.h. They look up Unicode character properties using short
-# sequences of code that contains no branches, which makes for greater speed.
-#
-# Conceptually, there is a table of records (of type ucd_record), containing a
-# script number, script extension value, character type, grapheme break type,
-# offset to caseless matching set, offset to the character's other case, for
-# every character. However, a real table covering all Unicode characters would
-# be far too big. It can be efficiently compressed by observing that many
-# characters have the same record, and many blocks of characters (taking 128
-# characters in a block) have the same set of records as other blocks. This
-# leads to a 2-stage lookup process.
-#
-# This script constructs six tables. The ucd_caseless_sets table contains
-# lists of characters that all match each other caselessly. Each list is
-# in order, and is terminated by NOTACHAR (0xffffffff), which is larger than
-# any valid character. The first list is empty; this is used for characters
-# that are not part of any list.
-#
-# The ucd_digit_sets table contains the code points of the '9' characters in
-# each set of 10 decimal digits in Unicode. This is used to ensure that digits
-# in script runs all come from the same set. The first element in the vector
-# contains the number of subsequent elements, which are in ascending order.
-#
-# The ucd_script_sets vector contains lists of script numbers that are the
-# Script Extensions properties of certain characters. Each list is terminated
-# by zero (ucp_Unknown). A character with more than one script listed for its
-# Script Extension property has a negative value in its record. This is the
-# negated offset to the start of the relevant list.
-#
-# The ucd_records table contains one instance of every unique record that is
-# required. The ucd_stage1 table is indexed by a character's block number, and
-# yields what is in effect a "virtual" block number. The ucd_stage2 table is a
-# table of "virtual" blocks; each block is indexed by the offset of a character
-# within its own block, and the result is the offset of the required record.
-#
-# The following examples are correct for the Unicode 11.0.0 database. Future
-# updates may make change the actual lookup values.
-#
-# Example: lowercase "a" (U+0061) is in block 0
-# lookup 0 in stage1 table yields 0
-# lookup 97 in the first table in stage2 yields 16
-# record 17 is { 33, 5, 11, 0, -32 }
-# 33 = ucp_Latin => Latin script
-# 5 = ucp_Ll => Lower case letter
-# 12 = ucp_gbOther => Grapheme break property "Other"
-# 0 => not part of a caseless set
-# -32 => Other case is U+0041
-#
-# Almost all lowercase latin characters resolve to the same record. One or two
-# are different because they are part of a multi-character caseless set (for
-# example, k, K and the Kelvin symbol are such a set).
-#
-# Example: hiragana letter A (U+3042) is in block 96 (0x60)
-# lookup 96 in stage1 table yields 90
-# lookup 66 in the 90th table in stage2 yields 515
-# record 515 is { 26, 7, 11, 0, 0 }
-# 26 = ucp_Hiragana => Hiragana script
-# 7 = ucp_Lo => Other letter
-# 12 = ucp_gbOther => Grapheme break property "Other"
-# 0 => not part of a caseless set
-# 0 => No other case
-#
-# In these examples, no other blocks resolve to the same "virtual" block, as it
-# happens, but plenty of other blocks do share "virtual" blocks.
-#
-# Philip Hazel, 03 July 2008
-# Last Updated: 03 October 2018
-#
+# Added code to add a Script Extensions field to records. This has increased
+# their size from 8 to 12 bytes, only 10 of which are currently used.
 #
 # 01-March-2010: Updated list of scripts for Unicode 5.2.0
 # 30-April-2011: Updated list of scripts for Unicode 6.0.0
@@ -155,6 +87,98 @@
 # Pictographic property.
 # 01-October-2018: Added the 'Unknown' script name
 # 03-October-2018: Added new field for Script Extensions
+# ----------------------------------------------------------------------------
+#
+#
+# The main tables generated by this script are used by macros defined in
+# pcre2_internal.h. They look up Unicode character properties using short
+# sequences of code that contains no branches, which makes for greater speed.
+#
+# Conceptually, there is a table of records (of type ucd_record), containing a
+# script number, script extension value, character type, grapheme break type,
+# offset to caseless matching set, offset to the character's other case, for
+# every Unicode character. However, a real table covering all Unicode
+# characters would be far too big. It can be efficiently compressed by
+# observing that many characters have the same record, and many blocks of
+# characters (taking 128 characters in a block) have the same set of records as
+# other blocks. This leads to a 2-stage lookup process.
+#
+# This script constructs six tables. The ucd_caseless_sets table contains
+# lists of characters that all match each other caselessly. Each list is
+# in order, and is terminated by NOTACHAR (0xffffffff), which is larger than
+# any valid character. The first list is empty; this is used for characters
+# that are not part of any list.
+#
+# The ucd_digit_sets table contains the code points of the '9' characters in
+# each set of 10 decimal digits in Unicode. This is used to ensure that digits
+# in script runs all come from the same set. The first element in the vector
+# contains the number of subsequent elements, which are in ascending order.
+#
+# The ucd_script_sets vector contains lists of script numbers that are the
+# Script Extensions properties of certain characters. Each list is terminated
+# by zero (ucp_Unknown). A character with more than one script listed for its
+# Script Extension property has a negative value in its record. This is the
+# negated offset to the start of the relevant list in the ucd_script_sets
+# vector.
+#
+# The ucd_records table contains one instance of every unique record that is
+# required. The ucd_stage1 table is indexed by a character's block number,
+# which is the character's code point divided by 128, since 128 is the size of
+# each block. The result of a lookup in ucd_stage1 is a "virtual" block number.
+#
+# The ucd_stage2 table is a table of "virtual" blocks; each block is indexed by
+# the offset of a character within its own block, and the result is the index
+# number of the required record in the ucd_records vector.
+#
+# The following examples are correct for the Unicode 11.0.0 database. Future
+# updates may change the actual lookup values.
+#
+# Example: lowercase "a" (U+0061) is in block 0
+# lookup 0 in stage1 table yields 0
+# lookup 97 (0x61) in the first table in stage2 yields 17
+# record 17 is { 34, 5, 12, 0, -32, 34, 0 }
+# 34 = ucp_Latin => Latin script
+# 5 = ucp_Ll => Lower case letter
+# 12 = ucp_gbOther => Grapheme break property "Other"
+# 0 => Not part of a caseless set
+# -32 (-0x20) => Other case is U+0041
+# 34 = ucp_Latin => No special Script Extension property
+# 0 => Dummy value, unused at present
+#
+# Almost all lowercase latin characters resolve to the same record. One or two
+# are different because they are part of a multi-character caseless set (for
+# example, k, K and the Kelvin symbol are such a set).
+#
+# Example: hiragana letter A (U+3042) is in block 96 (0x60)
+# lookup 96 in stage1 table yields 90
+# lookup 66 (0x42) in table 90 in stage2 yields 564
+# record 564 is { 27, 7, 12, 0, 0, 27, 0 }
+# 27 = ucp_Hiragana => Hiragana script
+# 7 = ucp_Lo => Other letter
+# 12 = ucp_gbOther => Grapheme break property "Other"
+# 0 => Not part of a caseless set
+# 0 => No other case
+# 27 = ucp_Hiragana => No special Script Extension property
+# 0 => Dummy value, unused at present
+#
+# Example: vedic tone karshana (U+1CD0) is in block 57 (0x39)
+# lookup 57 in stage1 table yields 55
+# lookup 80 (0x50) in table 55 in stage2 yields 458
+# record 458 is { 28, 12, 3, 0, 0, -101, 0 }
+# 28 = ucp_Inherited => Script inherited from predecessor
+# 12 = ucp_Mn => Non-spacing mark
+# 3 = ucp_gbExtend => Grapheme break property "Extend"
+# 0 => Not part of a caseless set
+# 0 => No other case
+# -101 => Script Extension list offset = 101
+# 0 => Dummy value, unused at present
+#
+# At offset 101 in the ucd_script_sets vector we find the list 3, 15, 107, 29,
+# and terminator 0. This means that this character is expected to be used with
+# any of those scripts, which are Bengali, Devanagari, Grantha, and Kannada.
+#
+# Philip Hazel, 03 July 2008
+# Last Updated: 07 October 2018
 ##############################################################################
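
The two-stage lookup and the Script Extensions encoding described in the new
comment block can be sketched in a few lines of Python. This is an illustration
only, not part of the commit: the table names mirror the generated C arrays in
pcre2_ucd.c, and the record field order is assumed from the examples above.

BLOCK_SIZE = 128   # characters per block

def get_record(c, ucd_stage1, ucd_stage2, ucd_records):
    """Return the record for code point c via the 2-stage lookup."""
    virtual_block = ucd_stage1[c // BLOCK_SIZE]          # "virtual" block number
    record_index = ucd_stage2[virtual_block * BLOCK_SIZE + c % BLOCK_SIZE]
    return ucd_records[record_index]

def script_extensions(record, ucd_script_sets):
    """Decode the Script Extensions field (assumed to be field 5 of a record).

    A non-negative value is a single script number; a negative value is the
    negated offset of a zero-terminated list in ucd_script_sets.
    """
    scriptx = record[5]
    if scriptx >= 0:
        return [scriptx]
    scripts = []
    i = -scriptx
    while ucd_script_sets[i] != 0:       # 0 (ucp_Unknown) terminates each list
        scripts.append(ucd_script_sets[i])
        i += 1
    return scripts

With the tables from the examples above, get_record(0x3042, ...) would yield
record 564 and script_extensions would return [27] (Hiragana).
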
@@ -175,13 +199,13 @@ def get_other_case(chardata):
   if chardata[1] == 'C' or chardata[1] == 'S':
     return int(chardata[2], 16) - int(chardata[0], 16)
   return 0
 
 # Parse a line of ScriptExtensions.txt
 def get_script_extension(chardata):
   this_script_list = list(chardata[1].split(' '))
   if len(this_script_list) == 1:
     return script_abbrevs.index(this_script_list[0])
 
   script_numbers = []
   for d in this_script_list:
     script_numbers.append(script_abbrevs.index(d))
@@ -190,18 +214,18 @@ def get_script_extension(chardata):
 
   for i in range(1, len(script_lists) - script_numbers_length + 1):
     for j in range(0, script_numbers_length):
       found = True
       if script_lists[i+j] != script_numbers[j]:
         found = False
         break
     if found:
       return -i
 
 # Not found in existing lists
 
   return_value = len(script_lists)
   script_lists.extend(script_numbers)
   return -return_value
 
 # Read the whole table in memory, setting/checking the Unicode version
 def read_table(file_name, get_value, default_value):
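
The search loop above reuses an existing sub-list of script_lists when one
matches, so identical Script Extensions sets share a single offset. A toy
illustration of the idea (with made-up data, not the script's real tables):

script_abbrevs = ['Zzzz', 'Arab', 'Armn', 'Beng', 'Deva']   # toy abbreviation list
script_lists = [0]                                          # element 0 is a dummy

def add_script_list(abbrevs):
    numbers = [script_abbrevs.index(a) for a in abbrevs] + [0]   # 0 terminates the list
    n = len(numbers)
    for i in range(1, len(script_lists) - n + 1):
        if script_lists[i:i + n] == numbers:
            return -i                      # an identical list already exists
    offset = len(script_lists)
    script_lists.extend(numbers)
    return -offset                         # negated offset of the new list

print(add_script_list(['Beng', 'Deva']))   # -1: first list starts at offset 1
print(add_script_list(['Beng', 'Deva']))   # -1 again: the list is shared
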
@@ -402,7 +426,7 @@ script_names = ['Unknown', 'Arabic', 'Armenian', 'Bengali', 'Bopomofo', 'Braille
  'Dogra', 'Gunjala_Gondi', 'Hanifi_Rohingya', 'Makasar', 'Medefaidrin',
  'Old_Sogdian', 'Sogdian'
  ]
 
 script_abbrevs = [
  'Zzzz', 'Arab', 'Armn', 'Beng', 'Bopo', 'Brai', 'Bugi', 'Buhd', 'Cans',
  'Cher', 'Zyyy', 'Copt', 'Cprt', 'Cyrl', 'Dsrt', 'Deva', 'Ethi', 'Geor',
@@ -434,7 +458,7 @@ script_abbrevs = [
  'Zanb',
 #New for Unicode 11.0.0
  'Dogr', 'Gong', 'Rohg', 'Maka', 'Medf', 'Sogo', 'Sogd'
  ]
 
 category_names = ['Cc', 'Cf', 'Cn', 'Co', 'Cs', 'Ll', 'Lm', 'Lo', 'Lt', 'Lu',
  'Mc', 'Me', 'Mn', 'Nd', 'Nl', 'No', 'Pc', 'Pd', 'Pe', 'Pf', 'Pi', 'Po', 'Ps',
@@ -499,10 +523,10 @@ scriptx = read_table('Unicode.tables/ScriptExtensions.txt', get_script_extension
 
 for i in range(0, MAX_UNICODE):
   if scriptx[i] == script_abbrevs_default:
     scriptx[i] = script[i]
 
 # With the addition of the new Script Extensions field, we need some padding
 # to get the Unicode records up to 12 bytes (multiple of 4). Set a value
 # greater than 255 to make the field 16 bits.
 
 padding_dummy = [0] * MAX_UNICODE
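
The 12-byte layout that this padding produces can be pictured with Python's
struct module. The field order and exact integer types here are assumptions for
illustration; the authoritative definition is the ucd_record structure in
pcre2_internal.h.

import struct

# script, chartype, gbprop, caseset (1 byte each), other_case (4 bytes),
# scriptx (2 bytes, signed so it can hold a negated offset), 2 bytes padding.
UCD_RECORD = struct.Struct('<BBBBihh')
assert UCD_RECORD.size == 12         # 10 bytes of data plus 2 bytes of padding

packed = UCD_RECORD.pack(34, 5, 12, 0, -32, 34, 0)   # the U+0061 example record
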
@@ -690,11 +714,11 @@ for line in file:
   m = re.match(r'([0-9a-fA-F]+)\.\.([0-9a-fA-F]+)\s+;\s+\S+\s+#\s+Nd\s+', line)
   if m is None:
     continue
   first = int(m.group(1),16)
   last = int(m.group(2),16)
   if ((last - first + 1) % 10) != 0:
     print("ERROR: %04x..%04x does not contain a multiple of 10 characters" % (first, last),
       file=sys.stderr)
   while first < last:
     digitsets.append(first + 9)
     first += 10
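
The resulting table of '9' code points can be used to check that two digits
come from the same set of 10. A small illustration of the lookup (not the
library's C code, and using only three sample sets):

from bisect import bisect_left

def same_digit_set(c1, c2, nines):
    """nines: sorted code points of the '9' in every 10-digit set."""
    return nines[bisect_left(nines, c1)] == nines[bisect_left(nines, c2)]

nines = [0x0039, 0x0669, 0x06F9]   # ASCII, Arabic-Indic, Extended Arabic-Indic
print(same_digit_set(ord('3'), ord('7'), nines))   # True  - both ASCII digits
print(same_digit_set(ord('3'), 0x0663, nines))     # False - different sets
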
@@ -724,9 +748,9 @@ count = 0
 print(" /* 0 */", end='')
 for d in script_lists:
   print(" %3d," % d, end='')
   count += 1
   if d == 0:
     print("\n /* %3d */" % count, end='')
 print("\n};\n")
 
 # Output the main UCD tables.

maint/README  (82 changed lines)

@@ -23,11 +23,12 @@ GenerateUtt.py A Python script to generate part of the pcre2_tables.c file
 ManyConfigTests A shell script that runs "configure, make, test" a number of
                 times with different configuration settings.
 
-MultiStage2.py  A Python script that generates the file pcre2_ucd.c from five
-                Unicode data tables, which are themselves downloaded from the
+MultiStage2.py  A Python script that generates the file pcre2_ucd.c from six
+                Unicode data files, which are themselves downloaded from the
                 Unicode web site. Run this script in the "maint" directory.
-                The generated file contains the tables for a 2-stage lookup
-                of Unicode properties.
+                The generated file is written to stdout. It contains the
+                tables for a 2-stage lookup of Unicode properties, along with
+                some auxiliary tables.
 
 pcre2_chartables.c.non-standard
                 This is a set of character tables that came from a Windows
@@ -40,14 +41,15 @@ README This file.
 Unicode.tables  The files in this directory were downloaded from the Unicode
                 web site. They contain information about Unicode characters
                 and scripts. The ones used by the MultiStage2.py script are
                 CaseFolding.txt, DerivedGeneralCategory.txt, Scripts.txt,
-                GraphemeBreakProperty.txt, and emoji-data.txt. I've kept
-                UnicodeData.txt (which is no longer used by the script)
-                because it is useful occasionally for manually looking up the
-                details of certain characters. However, note that character
-                names in this file such as "Arabic sign sanah" do NOT mean
-                that the character is in a particular script (in this case,
-                Arabic). Scripts.txt is where to look for script information.
+                ScriptExtensions.txt, GraphemeBreakProperty.txt, and
+                emoji-data.txt. I've kept UnicodeData.txt (which is no longer
+                used by the script) because it is useful occasionally for
+                manually looking up the details of certain characters.
+                However, note that character names in this file such as
+                "Arabic sign sanah" do NOT mean that the character is in a
+                particular script (in this case, Arabic). Scripts.txt and
+                ScriptExtensions.txt are where to look for script information.
 
 ucptest.c       A short C program for testing the Unicode property macros
                 that do lookups in the pcre2_ucd.c data, mainly useful after
@@ -61,7 +63,7 @@ utf8.c A short, freestanding C program for converting a Unicode code
                 point into a sequence of bytes in the UTF-8 encoding, and vice
                 versa. If its argument is a hex number such as 0x1234, it
                 outputs a list of the equivalent UTF-8 bytes. If its argument
-                is sequence of concatenated UTF-8 bytes (e.g. e188b4) it
+                is a sequence of concatenated UTF-8 bytes (e.g. e188b4) it
                 treats them as a UTF-8 character and outputs the equivalent
                 code point in hex.
 
@@ -72,25 +74,31 @@ Updating to a new Unicode release
 When there is a new release of Unicode, the files in Unicode.tables must be
 refreshed from the web site. If the new version of Unicode adds new character
 scripts, the source file pcre2_ucp.h and both the MultiStage2.py and the
-GenerateUtt.py scripts must be edited to add the new names. Then MultiStage2.py
-can be run to generate a new version of pcre2_ucd.c, and GenerateUtt.py can be
-run to generate the tricky tables for inclusion in pcre2_tables.c.
+GenerateUtt.py scripts must be edited to add the new names. I have been adding
+each new group at the end of the relevant list, with a comment. Note also that
+both the pcre2syntax.3 and pcre2pattern.3 man pages contain lists of Unicode
+script names.
 
-If MultiStage2.py gives the error "ValueError: list.index(x): x not in list",
-the cause is usually a missing (or misspelt) name in the list of scripts. I
-couldn't find a straightforward list of scripts on the Unicode site, but
-there's a useful Wikipedia page that lists them, and notes the Unicode version
-in which they were introduced:
+MultiStage2.py has two lists: the full names and the abbreviations that are
+found in the ScriptExtensions.txt file. A list of script names and their
+abbreviations can be found in the PropertyValueAliases.txt file on the
+Unicode web site. There is also a Wikipedia page that lists them, and notes the
+Unicode version in which they were introduced:
 
   http://en.wikipedia.org/wiki/Unicode_scripts#Table_of_Unicode_scripts
 
+Once the script name lists have been updated, MultiStage2.py can be run to
+generate a new version of pcre2_ucd.c, and GenerateUtt.py can be run to
+generate the tricky tables for inclusion in pcre2_tables.c (which must be
+hand-edited). If MultiStage2.py gives the error "ValueError: list.index(x): x
+not in list", the cause is usually a missing (or misspelt) name in one of the
+lists of scripts.
 
 The ucptest program can be compiled and used to check that the new tables in
 pcre2_ucd.c work properly, using the data files in ucptestdata to check a
-number of test characters. The source file ucptest.c must be updated whenever
-new Unicode script names are added.
-Note also that both the pcre2syntax.3 and pcre2pattern.3 man pages contain
-lists of Unicode script names.
+number of test characters. The source file ucptest.c should also be updated
+whenever new Unicode script names are added, and adding a few tests for new
+scripts is a good idea.
 
 
 Preparing for a PCRE2 release
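
A minimal illustration of the ValueError mentioned in the hunk above, using a
toy abbreviation list: looking up a name that is missing from the list is what
produces the message.

script_abbrevs = ['Zzzz', 'Arab', 'Armn']   # toy list, deliberately missing 'Sogd'
try:
    script_abbrevs.index('Sogd')
except ValueError as err:
    print(err)                              # 'Sogd' is not in list
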
@@ -401,26 +409,6 @@ very sensible; some are rather wacky. Some have been on this list for years.
   strings, at least one of which must be present for a match, efficient
   pre-searching of large datasets could be implemented.
 
-. There's a Perl proposal for some new (* things, including alpha synonyms for
-  the lookaround assertions:
-
-    (*pla: …)
-    (*plb: …)
-    (*nla: …)
-    (*nlb: …)
-    (*atomic: …)
-    (*positive_look_ahead:...)
-    (*negative_look_ahead:...)
-    (*positive_look_behind:...)
-    (*negative_look_behind:...)
-
-  Also a new one (with synonyms):
-
-    (*script_run: ...) Ensure all captured chars are in the same script
-    (*sr: …)
-    (*atomic_script_run: …) A combination of script_run and atomic
-    (*asr:...)
-
 . If pcre2grep had --first-line (match only in the first line) it could be
   efficiently used to find files "starting with xxx". What about --last-line?
 
@@ -441,4 +429,4 @@ very sensible; some are rather wacky. Some have been on this list for years.
 Philip Hazel
 Email local part: ph10
 Email domain: cam.ac.uk
-Last updated: 21 August 2018
+Last updated: 07 October 2018