Note character encoding in README, note cvt2utf

To help people out, note the potential character encoding issue in the README (pointing to the documentation for more details) and note the "cvt2utf" Python program.

Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
2019-10-24 08:22:59 -04:00 · 2019-10-24 08:22:59 -04:00 · 293ca17d82
parent 578c99cc17
commit 293ca17d82
2 changed files with 30 additions and 1 deletion
--- a/README.md
+++ b/README.md
@@ -55,6 +55,32 @@ flawfinder (including its various options) and related information
 (such as how it supports CWE).  For example, the `--html` option generates
 output in HTML format. The `--help` option gives a brief list of options.

+# Character Encoding Errors
+
+Flawfinder must be able to correctly interpret your source code's
+character encoding.
+In the vast majority of cases this is not a problem, especially
+if the source code is correctly encoded using UTF-8 and your system
+is configured to use UTF-8 (the most common situation by far).
+
+However, it's possible for flawfinder to halt if there is a
+character encoding problem and you're running Python3.
+The usual symptom is error messages like this:
+`Error: encoding error in FILENAME 'ENCODING' codec can't decode byte ... in position ...: invalid start byte`
+
+Unfortunately, Python3 fails to provide useful built-ins to deal with this.
+Thus, it's non-trivial to deal with this problem without depending on external
+libraries (which we're trying to avoid).
+
+If you have this problem, see the flawfinder manual page for a collection
+of various solutions.
+One of the simplest is to convert the source code and system
+configuration to UTF-8.
+You can convert source code to UTF-8 using tools such as the
+system tool `iconv` or the Python program
+[`cvt2utf`](https://pypi.org/project/cvt2utf/);
+you can install `cvt2utf` using `pip install cvt2utf`.
+
 # Under the hood

 More technically, flawfinder uses lexical scanning to find tokens
--- a/flawfinder.1
+++ b/flawfinder.1
@@ -664,7 +664,10 @@ You may be able to replace "C" after LC_ALL= with your real language locale

 Option #4: Convert the encoding of the files to be analyzed so that it's
 a single encoding - it's highly recommended to convert to UTF-8.
-For example, the program "iconv" can be used to convert encodings.
+For example, the system program "iconv"
+or the Python program cvt2utf
+can be used to convert encodings.
+(You can install cvt2utf with "pip install cvt2utf").
 This works well if some files have one encoding, and some have another,
 but they are consistent within a single file.
 If the files have encoding errors, you'll have to fix them.
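
Note on Option #4: the conversion that `iconv` or `cvt2utf` performs can be sketched with the Python standard library alone. This is a minimal illustration, not code from flawfinder or cvt2utf; the function names `is_valid_utf8` and `convert_to_utf8` are hypothetical, and the sketch assumes the caller already knows each file's real source encoding (it does not try to detect it).

```python
from pathlib import Path


def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(path, source_encoding):
    """Rewrite a file as UTF-8, decoding it from source_encoding first.

    source_encoding is an assumption supplied by the caller.
    Returns True if the file was rewritten, False if it was already UTF-8.
    """
    path = Path(path)
    if is_valid_utf8(path):
        return False
    # May raise UnicodeDecodeError if source_encoding is the wrong guess.
    text = path.read_bytes().decode(source_encoding)
    # Work on bytes so that line endings are left untouched.
    path.write_bytes(text.encode("utf-8"))
    return True
```

Calling `convert_to_utf8("legacy.c", "latin-1")` on a hypothetical Latin-1 file rewrites it in place, so keep a backup or work on a copy. A single-byte encoding such as Latin-1 accepts any byte sequence, so a wrong guess there produces wrong characters rather than an error.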