Note character encoding in README, note cvt2utf

To help people out, note the potential character encoding issue
in the README (pointing to the documentation for more details)
and note the "cvt2utf" Python program.

Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
David A. Wheeler 2019-10-24 08:22:59 -04:00
parent 578c99cc17
commit 293ca17d82
2 changed files with 30 additions and 1 deletions

@ -55,6 +55,32 @@ flawfinder (including its various options) and related information
(such as how it supports CWE). For example, the `--html` option generates
output in HTML format. The `--help` option gives a brief list of options.
# Character Encoding Errors
Flawfinder must be able to correctly interpret your source code's
character encoding.
In the vast majority of cases this is not a problem, especially
if the source code is correctly encoded using UTF-8 and your system
is configured to use UTF-8 (the most common situation by far).
However, it's possible for flawfinder to halt if there is a
character encoding problem and you're running Python3.
The usual symptom is error messages like this:
`Error: encoding error in FILENAME 'ENCODING' codec can't decode byte ... in position ...: invalid start byte`
Unfortunately, Python3 fails to provide useful built-ins to deal with this.
Thus, it's non-trivial to deal with this problem without depending on external
libraries (which we're trying to avoid).
If you have this problem, see the flawfinder manual page for a collection
of various solutions.
One of the simplest is to convert the source code and system
configuration to UTF-8.
You can convert source code to UTF-8 using tools such as the
system tool `iconv` or the Python program
[`cvt2utf`](https://pypi.org/project/cvt2utf/);
you can install `cvt2utf` using `pip install cvt2utf`.
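
If you would rather not install anything, the conversion described above can be
sketched with the Python standard library alone. This is only an illustrative
sketch, not part of flawfinder: the function names `is_valid_utf8` and
`convert_to_utf8` are hypothetical, and it assumes you already know (or can
guess) the encoding a file currently uses, since it makes no attempt to detect it.

```python
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(path, source_encoding):
    """Rewrite the file at `path` in place as UTF-8.

    `source_encoding` is the encoding you believe the file currently
    uses (for example "latin-1" or "cp1252"); it is NOT auto-detected.
    """
    raw = Path(path).read_bytes()
    # Raises UnicodeDecodeError if the bytes are invalid in source_encoding.
    text = raw.decode(source_encoding)
    # Work on bytes throughout so line endings are left untouched.
    Path(path).write_bytes(text.encode("utf-8"))
```

A cautious workflow would be to call `is_valid_utf8` on each source file first,
convert only the files that fail, and keep a backup, because a wrong guess for
`source_encoding` can silently produce incorrect characters.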
# Under the hood
More technically, flawfinder uses lexical scanning to find tokens

@ -664,7 +664,10 @@ You may be able to replace "C" after LC_ALL= with your real language locale
Option #4: Convert the encoding of the files to be analyzed so that it's
a single encoding - it's highly recommended to convert to UTF-8.
For example, the program "iconv" can be used to convert encodings.
For example, the system program "iconv"
or the Python program cvt2utf
can be used to convert encodings.
(You can install cvt2utf with "pip install cvt2utf").
This works well if some files have one encoding, and some have another,
but they are consistent within a single file.
If the files have encoding errors, you'll have to fix them.