From 83726c359d6a2fb2f65a06c7d65a6b76a390850e Mon Sep 17 00:00:00 2001 From: "Philip.Hazel" Date: Sun, 7 Oct 2018 16:29:51 +0000 Subject: [PATCH] Documentation update for Script Extensions property coding. --- maint/MultiStage2.py | 222 ++++++++++++++++++++++++------------------- maint/README | 82 +++++++--------- 2 files changed, 158 insertions(+), 146 deletions(-) diff --git a/maint/MultiStage2.py b/maint/MultiStage2.py index 2765a81..0c2e50a 100755 --- a/maint/MultiStage2.py +++ b/maint/MultiStage2.py @@ -8,26 +8,28 @@ # the upgrading of Unicode property support. The new code speeds up property # matching many times. The script is for the use of PCRE maintainers, to # generate the pcre2_ucd.c file that contains a digested form of the Unicode -# data tables. +# data tables. A number of extensions have been added to the original script. # # The script has now been upgraded to Python 3 for PCRE2, and should be run in # the maint subdirectory, using the command # # [python3] ./MultiStage2.py >../src/pcre2_ucd.c # -# It requires five Unicode data tables: DerivedGeneralCategory.txt, -# GraphemeBreakProperty.txt, Scripts.txt, CaseFolding.txt, and emoji-data.txt. -# These must be in the maint/Unicode.tables subdirectory. +# It requires six Unicode data tables: DerivedGeneralCategory.txt, +# GraphemeBreakProperty.txt, Scripts.txt, ScriptExtensions.txt, +# CaseFolding.txt, and emoji-data.txt. These must be in the +# maint/Unicode.tables subdirectory. # # DerivedGeneralCategory.txt is found in the "extracted" subdirectory of the # Unicode database (UCD) on the Unicode web site; GraphemeBreakProperty.txt is -# in the "auxiliary" subdirectory. Scripts.txt and CaseFolding.txt are directly -# in the UCD directory. The emoji-data.txt file is in files associated with -# Unicode Technical Standard #51 ("Unicode Emoji"), for example: -# -# http://unicode.org/Public/emoji/11.0/emoji-data.txt +# in the "auxiliary" subdirectory. 
Scripts.txt, ScriptExtensions.txt, and +# CaseFolding.txt are directly in the UCD directory. The emoji-data.txt file is +# in files associated with Unicode Technical Standard #51 ("Unicode Emoji"), +# for example: # +# http://unicode.org/Public/emoji/11.0/emoji-data.txt # +# ----------------------------------------------------------------------------- # Minor modifications made to this script: # Added #! line at start # Removed tabs @@ -61,78 +63,8 @@ # property, which is used by PCRE2 as a grapheme breaking property. This was # done when updating to Unicode 11.0.0 (July 2018). # -# Added code to add a Script Extensions field to records. -# -# -# The main tables generated by this script are used by macros defined in -# pcre2_internal.h. They look up Unicode character properties using short -# sequences of code that contains no branches, which makes for greater speed. -# -# Conceptually, there is a table of records (of type ucd_record), containing a -# script number, script extension value, character type, grapheme break type, -# offset to caseless matching set, offset to the character's other case, for -# every character. However, a real table covering all Unicode characters would -# be far too big. It can be efficiently compressed by observing that many -# characters have the same record, and many blocks of characters (taking 128 -# characters in a block) have the same set of records as other blocks. This -# leads to a 2-stage lookup process. -# -# This script constructs six tables. The ucd_caseless_sets table contains -# lists of characters that all match each other caselessly. Each list is -# in order, and is terminated by NOTACHAR (0xffffffff), which is larger than -# any valid character. The first list is empty; this is used for characters -# that are not part of any list. -# -# The ucd_digit_sets table contains the code points of the '9' characters in -# each set of 10 decimal digits in Unicode. 
This is used to ensure that digits -# in script runs all come from the same set. The first element in the vector -# contains the number of subsequent elements, which are in ascending order. -# -# The ucd_script_sets vector contains lists of script numbers that are the -# Script Extensions properties of certain characters. Each list is terminated -# by zero (ucp_Unknown). A character with more than one script listed for its -# Script Extension property has a negative value in its record. This is the -# negated offset to the start of the relevant list. -# -# The ucd_records table contains one instance of every unique record that is -# required. The ucd_stage1 table is indexed by a character's block number, and -# yields what is in effect a "virtual" block number. The ucd_stage2 table is a -# table of "virtual" blocks; each block is indexed by the offset of a character -# within its own block, and the result is the offset of the required record. -# -# The following examples are correct for the Unicode 11.0.0 database. Future -# updates may make change the actual lookup values. -# -# Example: lowercase "a" (U+0061) is in block 0 -# lookup 0 in stage1 table yields 0 -# lookup 97 in the first table in stage2 yields 16 -# record 17 is { 33, 5, 11, 0, -32 } -# 33 = ucp_Latin => Latin script -# 5 = ucp_Ll => Lower case letter -# 12 = ucp_gbOther => Grapheme break property "Other" -# 0 => not part of a caseless set -# -32 => Other case is U+0041 -# -# Almost all lowercase latin characters resolve to the same record. One or two -# are different because they are part of a multi-character caseless set (for -# example, k, K and the Kelvin symbol are such a set). 
-# -# Example: hiragana letter A (U+3042) is in block 96 (0x60) -# lookup 96 in stage1 table yields 90 -# lookup 66 in the 90th table in stage2 yields 515 -# record 515 is { 26, 7, 11, 0, 0 } -# 26 = ucp_Hiragana => Hiragana script -# 7 = ucp_Lo => Other letter -# 12 = ucp_gbOther => Grapheme break property "Other" -# 0 => not part of a caseless set -# 0 => No other case -# -# In these examples, no other blocks resolve to the same "virtual" block, as it -# happens, but plenty of other blocks do share "virtual" blocks. -# -# Philip Hazel, 03 July 2008 -# Last Updated: 03 October 2018 -# +# Added code to add a Script Extensions field to records. This has increased +# their size from 8 to 12 bytes, only 10 of which are currently used. # # 01-March-2010: Updated list of scripts for Unicode 5.2.0 # 30-April-2011: Updated list of scripts for Unicode 6.0.0 @@ -155,6 +87,98 @@ # Pictographic property. # 01-October-2018: Added the 'Unknown' script name # 03-October-2018: Added new field for Script Extensions +# ---------------------------------------------------------------------------- +# +# +# The main tables generated by this script are used by macros defined in +# pcre2_internal.h. They look up Unicode character properties using short +# sequences of code that contains no branches, which makes for greater speed. +# +# Conceptually, there is a table of records (of type ucd_record), containing a +# script number, script extension value, character type, grapheme break type, +# offset to caseless matching set, offset to the character's other case, for +# every Unicode character. However, a real table covering all Unicode +# characters would be far too big. It can be efficiently compressed by +# observing that many characters have the same record, and many blocks of +# characters (taking 128 characters in a block) have the same set of records as +# other blocks. This leads to a 2-stage lookup process. +# +# This script constructs six tables. 
The ucd_caseless_sets table contains +# lists of characters that all match each other caselessly. Each list is +# in order, and is terminated by NOTACHAR (0xffffffff), which is larger than +# any valid character. The first list is empty; this is used for characters +# that are not part of any list. +# +# The ucd_digit_sets table contains the code points of the '9' characters in +# each set of 10 decimal digits in Unicode. This is used to ensure that digits +# in script runs all come from the same set. The first element in the vector +# contains the number of subsequent elements, which are in ascending order. +# +# The ucd_script_sets vector contains lists of script numbers that are the +# Script Extensions properties of certain characters. Each list is terminated +# by zero (ucp_Unknown). A character with more than one script listed for its +# Script Extension property has a negative value in its record. This is the +# negated offset to the start of the relevant list in the ucd_script_sets +# vector. +# +# The ucd_records table contains one instance of every unique record that is +# required. The ucd_stage1 table is indexed by a character's block number, +# which is the character's code point divided by 128, since 128 is the size +# of each block. The result of a lookup in ucd_stage1 is a "virtual" block number. +# +# The ucd_stage2 table is a table of "virtual" blocks; each block is indexed by +# the offset of a character within its own block, and the result is the index +# number of the required record in the ucd_records vector. +# +# The following examples are correct for the Unicode 11.0.0 database. Future +# updates may change the actual lookup values. 
+# +# Example: lowercase "a" (U+0061) is in block 0 +# lookup 0 in stage1 table yields 0 +# lookup 97 (0x61) in the first table in stage2 yields 17 +# record 17 is { 34, 5, 12, 0, -32, 34, 0 } +# 34 = ucp_Latin => Latin script +# 5 = ucp_Ll => Lower case letter +# 12 = ucp_gbOther => Grapheme break property "Other" +# 0 => Not part of a caseless set +# -32 (-0x20) => Other case is U+0041 +# 34 = ucp_Latin => No special Script Extension property +# 0 => Dummy value, unused at present +# +# Almost all lowercase Latin characters resolve to the same record. One or two +# are different because they are part of a multi-character caseless set (for +# example, k, K and the Kelvin symbol are such a set). +# +# Example: hiragana letter A (U+3042) is in block 96 (0x60) +# lookup 96 in stage1 table yields 90 +# lookup 66 (0x42) in table 90 in stage2 yields 564 +# record 564 is { 27, 7, 12, 0, 0, 27, 0 } +# 27 = ucp_Hiragana => Hiragana script +# 7 = ucp_Lo => Other letter +# 12 = ucp_gbOther => Grapheme break property "Other" +# 0 => Not part of a caseless set +# 0 => No other case +# 27 = ucp_Hiragana => No special Script Extension property +# 0 => Dummy value, unused at present +# +# Example: vedic tone karshana (U+1CD0) is in block 57 (0x39) +# lookup 57 in stage1 table yields 55 +# lookup 80 (0x50) in table 55 in stage2 yields 458 +# record 458 is { 28, 12, 3, 0, 0, -101, 0 } +# 28 = ucp_Inherited => Script inherited from predecessor +# 12 = ucp_Mn => Non-spacing mark +# 3 = ucp_gbExtend => Grapheme break property "Extend" +# 0 => Not part of a caseless set +# 0 => No other case +# -101 => Script Extension list offset = 101 +# 0 => Dummy value, unused at present +# +# At offset 101 in the ucd_script_sets vector we find the list 3, 15, 107, 29, +# and terminator 0. This means that this character is expected to be used with +# any of those scripts, which are Bengali, Devanagari, Grantha, and Kannada. 
+# +# Philip Hazel, 03 July 2008 +# Last Updated: 07 October 2018 ############################################################################## @@ -175,13 +199,13 @@ def get_other_case(chardata): if chardata[1] == 'C' or chardata[1] == 'S': return int(chardata[2], 16) - int(chardata[0], 16) return 0 - + # Parse a line of ScriptExtensions.txt def get_script_extension(chardata): this_script_list = list(chardata[1].split(' ')) if len(this_script_list) == 1: return script_abbrevs.index(this_script_list[0]) - + script_numbers = [] for d in this_script_list: script_numbers.append(script_abbrevs.index(d)) @@ -190,18 +214,18 @@ def get_script_extension(chardata): for i in range(1, len(script_lists) - script_numbers_length + 1): for j in range(0, script_numbers_length): - found = True + found = True if script_lists[i+j] != script_numbers[j]: - found = False + found = False break if found: return -i - - # Not found in existing lists - + + # Not found in existing lists + return_value = len(script_lists) script_lists.extend(script_numbers) - return -return_value + return -return_value # Read the whole table in memory, setting/checking the Unicode version def read_table(file_name, get_value, default_value): @@ -402,7 +426,7 @@ script_names = ['Unknown', 'Arabic', 'Armenian', 'Bengali', 'Bopomofo', 'Braille 'Dogra', 'Gunjala_Gondi', 'Hanifi_Rohingya', 'Makasar', 'Medefaidrin', 'Old_Sogdian', 'Sogdian' ] - + script_abbrevs = [ 'Zzzz', 'Arab', 'Armn', 'Beng', 'Bopo', 'Brai', 'Bugi', 'Buhd', 'Cans', 'Cher', 'Zyyy', 'Copt', 'Cprt', 'Cyrl', 'Dsrt', 'Deva', 'Ethi', 'Geor', @@ -434,7 +458,7 @@ script_abbrevs = [ 'Zanb', #New for Unicode 11.0.0 'Dogr', 'Gong', 'Rohg', 'Maka', 'Medf', 'Sogo', 'Sogd' - ] + ] category_names = ['Cc', 'Cf', 'Cn', 'Co', 'Cs', 'Ll', 'Lm', 'Lo', 'Lt', 'Lu', 'Mc', 'Me', 'Mn', 'Nd', 'Nl', 'No', 'Pc', 'Pd', 'Pe', 'Pf', 'Pi', 'Po', 'Ps', @@ -499,10 +523,10 @@ scriptx = read_table('Unicode.tables/ScriptExtensions.txt', get_script_extension for i in range(0, 
MAX_UNICODE): if scriptx[i] == script_abbrevs_default: - scriptx[i] = script[i] + scriptx[i] = script[i] -# With the addition of the new Script Extensions field, we need some padding -# to get the Unicode records up to 12 bytes (multiple of 4). Set a value +# With the addition of the new Script Extensions field, we need some padding +# to get the Unicode records up to 12 bytes (multiple of 4). Set a value # greater than 255 to make the field 16 bits. padding_dummy = [0] * MAX_UNICODE @@ -690,11 +714,11 @@ for line in file: m = re.match(r'([0-9a-fA-F]+)\.\.([0-9a-fA-F]+)\s+;\s+\S+\s+#\s+Nd\s+', line) if m is None: continue - first = int(m.group(1),16) - last = int(m.group(2),16) + first = int(m.group(1),16) + last = int(m.group(2),16) if ((last - first + 1) % 10) != 0: print("ERROR: %04x..%04x does not contain a multiple of 10 characters" % (first, last), - file=sys.stderr) + file=sys.stderr) while first < last: digitsets.append(first + 9) first += 10 @@ -724,9 +748,9 @@ count = 0 print(" /* 0 */", end='') for d in script_lists: print(" %3d," % d, end='') - count += 1 + count += 1 if d == 0: - print("\n /* %3d */" % count, end='') + print("\n /* %3d */" % count, end='') print("\n};\n") # Output the main UCD tables. diff --git a/maint/README b/maint/README index 816b001..eb12561 100644 --- a/maint/README +++ b/maint/README @@ -23,11 +23,12 @@ GenerateUtt.py A Python script to generate part of the pcre2_tables.c file ManyConfigTests A shell script that runs "configure, make, test" a number of times with different configuration settings. -MultiStage2.py A Python script that generates the file pcre2_ucd.c from five - Unicode data tables, which are themselves downloaded from the +MultiStage2.py A Python script that generates the file pcre2_ucd.c from six + Unicode data files, which are themselves downloaded from the Unicode web site. Run this script in the "maint" directory. - The generated file contains the tables for a 2-stage lookup - of Unicode properties. 
+ The generated file is written to stdout. It contains the + tables for a 2-stage lookup of Unicode properties, along with + some auxiliary tables. pcre2_chartables.c.non-standard This is a set of character tables that came from a Windows @@ -40,14 +41,15 @@ README This file. Unicode.tables The files in this directory were downloaded from the Unicode web site. They contain information about Unicode characters and scripts. The ones used by the MultiStage2.py script are - CaseFolding.txt, DerivedGeneralCategory.txt, Scripts.txt, - GraphemeBreakProperty.txt, and emoji-data.txt. I've kept - UnicodeData.txt (which is no longer used by the script) - because it is useful occasionally for manually looking up the - details of certain characters. However, note that character - names in this file such as "Arabic sign sanah" do NOT mean - that the character is in a particular script (in this case, - Arabic). Scripts.txt is where to look for script information. + CaseFolding.txt, DerivedGeneralCategory.txt, Scripts.txt, + ScriptExtensions.txt, GraphemeBreakProperty.txt, and + emoji-data.txt. I've kept UnicodeData.txt (which is no longer + used by the script) because it is useful occasionally for + manually looking up the details of certain characters. + However, note that character names in this file such as + "Arabic sign sanah" do NOT mean that the character is in a + particular script (in this case, Arabic). Scripts.txt and + ScriptExtensions.txt are where to look for script information. ucptest.c A short C program for testing the Unicode property macros that do lookups in the pcre2_ucd.c data, mainly useful after @@ -61,7 +63,7 @@ utf8.c A short, freestanding C program for converting a Unicode code point into a sequence of bytes in the UTF-8 encoding, and vice versa. If its argument is a hex number such as 0x1234, it outputs a list of the equivalent UTF-8 bytes. If its argument - is sequence of concatenated UTF-8 bytes (e.g. 
e188b4) it + is a sequence of concatenated UTF-8 bytes (e.g. e188b4) it treats them as a UTF-8 character and outputs the equivalent code point in hex. @@ -72,25 +74,31 @@ Updating to a new Unicode release When there is a new release of Unicode, the files in Unicode.tables must be refreshed from the web site. If the new version of Unicode adds new character scripts, the source file pcre2_ucp.h and both the MultiStage2.py and the -GenerateUtt.py scripts must be edited to add the new names. Then MultiStage2.py -can be run to generate a new version of pcre2_ucd.c, and GenerateUtt.py can be -run to generate the tricky tables for inclusion in pcre2_tables.c. +GenerateUtt.py scripts must be edited to add the new names. I have been adding +each new group at the end of the relevant list, with a comment. Note also that +both the pcre2syntax.3 and pcre2pattern.3 man pages contain lists of Unicode +script names. -If MultiStage2.py gives the error "ValueError: list.index(x): x not in list", -the cause is usually a missing (or misspelt) name in the list of scripts. I -couldn't find a straightforward list of scripts on the Unicode site, but -there's a useful Wikipedia page that lists them, and notes the Unicode version -in which they were introduced: +MultiStage2.py has two lists: the full names and the abbreviations that are +found in the ScriptExtensions.txt file. A list of script names and their +abbreviations can be found in the PropertyValueAliases.txt file on the +Unicode web site. There is also a Wikipedia page that lists them, and notes the +Unicode version in which they were introduced: http://en.wikipedia.org/wiki/Unicode_scripts#Table_of_Unicode_scripts +Once the script name lists have been updated, MultiStage2.py can be run to +generate a new version of pcre2_ucd.c, and GenerateUtt.py can be run to +generate the tricky tables for inclusion in pcre2_tables.c (which must be +hand-edited). 
If MultiStage2.py gives the error "ValueError: list.index(x): x +not in list", the cause is usually a missing (or misspelt) name in one of the +lists of scripts. + The ucptest program can be compiled and used to check that the new tables in pcre2_ucd.c work properly, using the data files in ucptestdata to check a -number of test characters. The source file ucptest.c must be updated whenever -new Unicode script names are added. - -Note also that both the pcre2syntax.3 and pcre2pattern.3 man pages contain -lists of Unicode script names. +number of test characters. The source file ucptest.c should also be updated +whenever new Unicode script names are added, and adding a few tests for new +scripts is a good idea. Preparing for a PCRE2 release @@ -401,26 +409,6 @@ very sensible; some are rather wacky. Some have been on this list for years. strings, at least one of which must be present for a match, efficient pre-searching of large datasets could be implemented. -. There's a Perl proposal for some new (* things, including alpha synonyms for - the lookaround assertions: - - (*pla: …) - (*plb: …) - (*nla: …) - (*nlb: …) - (*atomic: …) - (*positive_look_ahead:...) - (*negative_look_ahead:...) - (*positive_look_behind:...) - (*negative_look_behind:...) - - Also a new one (with synonyms): - - (*script_run: ...) Ensure all captured chars are in the same script - (*sr: …) - (*atomic_script_run: …) A combination of script_run and atomic - (*asr:...) - . If pcre2grep had --first-line (match only in the first line) it could be efficiently used to find files "starting with xxx". What about --last-line? @@ -441,4 +429,4 @@ very sensible; some are rather wacky. Some have been on this list for years. Philip Hazel Email local part: ph10 Email domain: cam.ac.uk -Last updated: 21 August 2018 +Last updated: 07 October 2018
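The 2-stage table compression and the negative Script Extensions offset that the patch comments describe can be sketched in miniature. The code below is an illustrative toy, not the real generator or the real pcre2_ucd.c data: the table contents are invented, and names such as `build_tables`, `lookup`, and `script_extensions` are hypothetical. It shows why identical 128-character blocks need to be stored only once, and how a negative record field is interpreted as a negated offset into a zero-terminated script list.

```python
# Miniature sketch (with made-up data) of the 2-stage lookup scheme used by
# MultiStage2.py. The block size of 128 matches the script's scheme.

BLOCK_SIZE = 128

def build_tables(records_per_char):
    """Compress a flat per-character list of record indices into a stage1
    table (block number -> "virtual" block number) and a stage2 table
    (concatenated unique blocks). Identical blocks are stored only once."""
    stage1 = []
    stage2 = []
    seen = {}  # block contents -> virtual block number
    for i in range(0, len(records_per_char), BLOCK_SIZE):
        block = tuple(records_per_char[i:i + BLOCK_SIZE])
        if block not in seen:
            seen[block] = len(seen)
            stage2.extend(block)
        stage1.append(seen[block])
    return stage1, stage2

def lookup(stage1, stage2, code_point):
    """Two short, branchless steps: stage1 is indexed by the block number
    (code point / 128), stage2 by the offset within the virtual block."""
    virtual_block = stage1[code_point // BLOCK_SIZE]
    return stage2[virtual_block * BLOCK_SIZE + code_point % BLOCK_SIZE]

def script_extensions(field, script_sets):
    """A non-negative script field is the script itself; a negative one is
    the negated offset of a zero-terminated list in the script-sets vector
    (as in the U+1CD0 example, where -101 points at offset 101)."""
    if field >= 0:
        return [field]
    scripts, i = [], -field
    while script_sets[i] != 0:
        scripts.append(script_sets[i])
        i += 1
    return scripts

# Toy data: 4 blocks of characters, where blocks 0 and 2 happen to contain
# identical records and therefore share one virtual block.
flat = [1] * 128 + [2] * 128 + [1] * 128 + [3] * 128
stage1, stage2 = build_tables(flat)
assert stage1 == [0, 1, 0, 2]          # blocks 0 and 2 share virtual block 0
assert len(stage2) == 3 * BLOCK_SIZE   # only 3 unique blocks are stored

# A toy script-sets vector: the list {3, 15, 107, 29} starts at offset 1.
sets = [0, 3, 15, 107, 29, 0]
assert script_extensions(-1, sets) == [3, 15, 107, 29]
```

The real tables differ only in scale: stage2 blocks hold indices into ucd_records rather than the toy values used here, and the virtual-block sharing is what shrinks the full Unicode range to a manageable size.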