From 293ca17d8212905c7788aca1df7837d4716bd456 Mon Sep 17 00:00:00 2001
From: "David A. Wheeler"
Date: Thu, 24 Oct 2019 08:22:59 -0400
Subject: [PATCH] Note character encoding in README, note cvt2utf

To help people out, note the potential character encoding issue
in the README (pointing to the documentation for more details)
and note the "cvt2utf" Python program.

Signed-off-by: David A. Wheeler
---
 README.md    | 26 ++++++++++++++++++++++++++
 flawfinder.1 |  5 ++++-
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 0a3aea8..d8f0fbf 100644
--- a/README.md
+++ b/README.md
@@ -55,6 +55,32 @@ flawfinder (including its various options) and related information
 (such as how it supports CWE).
 For example, the `--html` option generates output in HTML format.
 The `--help` option gives a brief list of options.
+# Character Encoding Errors
+
+Flawfinder must be able to correctly interpret your source code's
+character encoding.
+In the vast majority of cases this is not a problem, especially
+if the source code is correctly encoded using UTF-8 and your system
+is configured to use UTF-8 (the most common situation by far).
+
+However, it's possible for flawfinder to halt if there is a
+character encoding problem and you're running Python3.
+The usual symptom is error messages like this:
+`Error: encoding error in FILENAME 'ENCODING' codec can't decode byte ... in position ...: invalid start byte`
+
+Unfortunately, Python3 fails to provide useful built-ins to deal with this.
+Thus, it's non-trivial to deal with this problem without depending on external
+libraries (which we're trying to avoid).
+
+If you have this problem, see the flawfinder manual page for a collection
+of various solutions.
+One of the simplest is to simply convert the source code and system
+configuration to UTF-8.
+You can convert source code to UTF-8 using tools such as the
+system tool `iconv` or the Python program
+[`cvt2utf`](https://pypi.org/project/cvt2utf/);
+you can install `cvt2utf` using `pip install cvt2utf`.
+
 # Under the hood
 
 More technically, flawfinder uses lexical scanning to find tokens
diff --git a/flawfinder.1 b/flawfinder.1
index 992e789..07acc1e 100644
--- a/flawfinder.1
+++ b/flawfinder.1
@@ -664,7 +664,10 @@ You may be able to replace "C" after LC_ALL= with your real language locale
 Option #4: Convert the encoding of the files to be analyzed
 so that it's a single encoding -
 it's highly recommended to convert to UTF-8.
-For example, the program "iconv" can be used to convert encodings.
+For example, the system program "iconv"
+or the Python program cvt2utf
+can be used to convert encodings.
+(You can install cvt2utf with "pip install cvt2utf").
 This works well if some files have one encoding, and some have another,
 but they are consistent within a single file.
 If the files have encoding errors, you'll have to fix them.
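
Note on the conversion step: the patch recommends converting source files to UTF-8 with `iconv` or `cvt2utf`. For a file whose current encoding is already known, the same idea can be sketched with only the Python standard library. This is a minimal illustration, not part of flawfinder or `cvt2utf`; the function name `convert_to_utf8` and its parameters are hypothetical, and the caller is assumed to know the source encoding.

```python
from pathlib import Path


def convert_to_utf8(path, source_encoding):
    """Rewrite the file at `path` in place as UTF-8.

    Hypothetical helper: `source_encoding` must be the file's actual
    current encoding (for example "latin-1" or "cp1252"); this sketch
    does not try to detect it.
    """
    raw = Path(path).read_bytes()
    # bytes.decode raises UnicodeDecodeError when the bytes are not valid
    # in `source_encoding`. Single-byte codecs such as latin-1 accept any
    # byte sequence, so a wrong guess there will not be reported.
    text = raw.decode(source_encoding)
    # Working on bytes keeps the original line endings untouched.
    Path(path).write_bytes(text.encode("utf-8"))
```

For example, `convert_to_utf8("legacy.c", "latin-1")` would rewrite a hypothetical `legacy.c` as UTF-8. Because the file is overwritten, it is safer to run this on a copy or on a tree under version control.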