37 Commits

Author SHA1 Message Date
Guldoman 2d8d39c7c0 Skip patterns matching nothing in `tokenizer` (#1743)
These patterns cause infinite loops, so warn about them and skip them.
2024-04-24 21:01:05 +01:00
Guldoman 234dd40e49 Fix patterns starting with `^` in `tokenizer` (#1645)
Previously the "dirty" version of the pattern was used, which could 
result in trying to match with multiple `^`, which failed valid matches.
2023-12-26 13:16:33 +00:00
Guldoman 0ebf3c0393 Return state when tokenizing plaintext syntaxes 2023-08-07 14:51:14 +01:00
Guldoman 1c8c569fae Allow `tokenizer` to pause and resume in the middle of a line (#1444) 2023-08-07 14:50:58 +01:00
Guldoman d9925b7d44 Allow groups to be used in end delimiter patterns in tokenizer (#1317)
* Allow empty groups as first match in tokenizer
* Avoid pushing tokens with empty strings
* Allow groups to be used in end delimiter in tokenizer
* Use the first entry of the type table for the middle part of a subsyntax
This applies to delimited matches with a table for `type` and without a 
`syntax` field.
* Match only once if using `at_start` in tokenizer `find_text`
* Check if match is escaped in the "close" case too
Also allow continuing matching if the match was escaped.
2023-08-07 14:50:58 +01:00
xwii 271a804986
Fix popping subsyntaxes that end consecutively (#1246) 2022-12-27 20:24:52 -04:00
Guldoman 9d48441685
Add `regex.find_offsets`, `regex.find`, improve `regex.match` (#1232)
`regex.match` now behaves like `string.match`.
This required changes in the `tokenizer` and in the `detectindent` 
plugin.
2022-12-11 22:25:42 -04:00
Guldoman 0a1b8b6bb1
Set initial tokenizer state to a `NULL` byte 2022-11-15 16:01:04 +01:00
Guldoman e147a6cb9b
Add `tokenizer.extract_subsyntaxes` 2022-11-15 16:00:48 +01:00
Jefferson González b8a4f729df
tokenizer: remove the limit of 3 subsyntaxes depth (#1186)
* tokenizer: remove the limit of 3 subsyntaxes depth

Make the state a string of bytes instead of a 32-bit integer to be able
to have deeper subsyntax support. Fixes issues with syntax files like
the one for PHP that was already hitting more than 3 subsyntaxes depth.

* remove unnecessary call to set_subsyntax_pattern_idx

* fixed wrong word on comments
2022-11-03 18:56:20 -04:00
Jefferson González 880e6e4f0f
Merge pull request #1040 from Guldoman/PR_tokenizer_errors_alert
Add more tokenizer errors/warnings
2022-06-22 19:43:51 -04:00
Jefferson González d2fd5c9df7
Merge pull request #1034 from Guldoman/PR_escape_start_patterns
Check if "open" pattern is escaped
2022-06-15 16:51:34 -04:00
Guldoman d169619f69
Warn if token type is a table when not needed 2022-06-15 21:31:16 +02:00
Guldoman 2e37e85a48
Add helper function to report bad patterns in tokenizer 2022-06-15 21:28:46 +02:00
Guldoman 5027a0f12b
Fix malformed pattern check for group patterns in tokenizer
If the token type was a simple string (and not a table), the size of the 
string was used instead of `1`.
2022-06-15 19:33:58 +02:00
Guldoman 5b6b48320f
Check if "open" pattern is escaped
Previously this check was only done for "close" patterns.
2022-06-12 04:19:05 +02:00
Guldoman c947e8a4d1
Convert more byte offsets to utf-8 pos in regex tokenizer 2022-06-12 02:55:36 +02:00
Guldoman d8efb1ab53
Show error if language plugin pattern has mismatching number of groups
The number of results from a pattern with groups must never be greater
than the number of token types for that pattern.

Also if a token type was undefined, it's now pushed as a `normal` one.
2022-05-31 02:05:37 +02:00
Guldoman 7ac776bef6
Fix UTF-8 matches in regex group `tokenizer` 2022-05-31 01:59:14 +02:00
Guldoman 2a41002355
Allow using regex groups to split tokens
Before, this was only supported by Lua patterns.

This expects the regex to use the same syntax used for patterns. That 
is, the token should be split by empty groups.
2022-05-28 01:38:22 +02:00
jgmdev 94430bcbd2 tokenizer: fix next utf8 char retrieval bug 2022-05-13 11:21:46 -04:00
Jefferson González e572c58f24
Add utf8 support to tokenizer (#945)
* add utf8 support to tokenizer

* wrap utf8 functions in string table using a 'u' prefix

* document new utf8 functions
2022-04-26 09:42:02 -04:00
Guldoman caefc9112a
Force syntax patterns starting with `^` to match with the whole line
Before, syntax patterns/regexes that started with `^` didn't have the 
desired effect of matching with the start of the line.

Now those patterns are used only when matching the whole line.
2022-03-04 11:27:01 +01:00
Guldoman 51975472a9
Add bit32 polyfill globally 2022-01-12 00:07:53 +01:00
Jan200101 99ddf1fb92
Migrate to Lua 5.4 2021-12-31 13:53:01 +01:00
Guldoman 4faaf089ef
Consume unmatched character correctly
We must consume the whole UTF-8 character, not just a single byte.
2021-12-11 03:43:33 +01:00
Adam Harrison 96db380c73 Manual merge of into . 2021-11-23 15:57:22 -05:00
Francesco Abbate 5cdd800910 Fix problem checking utf-8 cont at end of string 2021-10-23 15:03:09 +02:00
Guldoman 8a516d35ce
Correctly identify the start of the next character in `tokenizer`
When moving to the next character, we have to consider that the current 
one might be multi-byte.
2021-10-11 22:37:31 +02:00
takase1121 30ccde896d
replace unpack() with table.unpack()
I have no idea why unpack() is still used or how it still worked.
2021-08-29 09:14:12 +08:00
Adam 248d70a8ca
Add PCRE to support regular expressions
Use regular expressions instead of Lua patterns for find and replace editor commands.

Syntax files can now use regex or Lua patterns as before keeping backward compatibility for plugins.
2021-06-02 21:27:00 +02:00
Adam 949692860e
Tokenizer cleanup (#198)
* Cleaned up tokenizer to make subsyntax operations more clear.

* Explanatory comments.

* Made it so push_subsyntax could be safely called elsewhere.

* Unified terminology.

* Minor bug fix.

* State is an incredibly vaguely named variable. Changed convention to represent what it actually is.

* Also changed function name.

* Fixed bug.
2021-05-20 21:58:27 +02:00
liquidev 86a7037ed9
support for multiple groups in one pattern (#196) 2021-05-19 22:35:28 +02:00
lqdev ba4fbde33d fixed mixed indentation 2021-05-18 17:52:18 +02:00
adamharrison 3fe6665b9a
Nested Syntax Highlighting (#160) 2021-05-01 11:45:30 +02:00
rxi 6525269386 Made tokenizer skip parsing process on plain-text files
This, along with the earlier rencache changes should resolve #64
2020-05-14 10:10:50 +01:00
rxi f5025efbb8 Moved highlighter code from `DocView` to `Doc`
* Only one highlighter state is kept per-document as opposed
  to one per-docview
* Fixes a bug with retaining older highlighter state as a
  DocView wasn't able to detect lines changing above its viewport
* Renames `highlighter` module to more descriptive `tokenizer`
2020-05-07 21:14:46 +01:00