Improve docs on character encoding

Provide better info on how to handle character encoding problems. As more people use Python3 this is more likely to be a problem.

Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
2019-09-22 15:21:11 -04:00 · f1fdd59da5
parent a3ff9a89d6
commit f1fdd59da5
1 changed file with 61 additions and 23 deletions
--- a/flawfinder.1
+++ b/flawfinder.1
@@ -598,17 +598,22 @@ The difference algorithm is conservative;
 hits are only considered the ``same'' if they have the same
 filename, line number, column position, function name, and risk level.

-.SS "Character Encoding"
+.SS "Character Encoding Errors"

-Flawfinder presumes that the character encoding your system uses is
-also the character encoding used by your source files.
-Even if this isn't correct, if you run flawfinder with Python 2
+Flawfinder uses the character encoding rules set by Python.
+Sometimes source code does not perfectly follow some encoding rules.
+If you run flawfinder with Python 2
 these non-conformities often do not impact processing in practice.

 However, if you run flawfinder with Python 3, this can be a problem.
-Python 3 wants the world to always use encodings perfectly correctly,
-everywhere, even though the world often doesn't care what Python 3 wants.
-This is a problem even if the non-conforming text is in comments or strings
+Python 3 developers want the world to always use encodings perfectly correctly,
+everywhere, and in general want everyone to only use UTF-8.
+UTF-8 is a great encoding, and it is very popular, but
+the world often doesn't care what the Python 3 developers want.
+
+When running flawfinder using Python 3, the program will crash hard if
+\fIany\fR source file has \fIany\fR non-conforming text.
+It will do this even if the non-conforming text is in comments or strings
 (where it often doesn't matter).
 Python 3 fails to provide useful built-ins to deal with
 the messiness of the real world, so it's
@@ -618,31 +623,64 @@ libraries (which we're trying to avoid).
 A symptom of this problem
 is if you run flawfinder and you see an error message like this:

-\fIUnicodeDecodeError: 'utf-8' codec can't decode byte ... in position ...:
-invalid continuation byte\fR
+\fIError: encoding error in ,1.c\fR

-If this happens to you, there are several options.
+\fI'utf-8' codec can't decode byte 0xff in position 45: invalid start byte\fR

-The first option is to
-convert the encoding of the files to be analyzed so that it's
-a single encoding (usually the system encoding).
+What you are seeing is the result of an internal UnicodeDecodeError.
+
+If this happens to you, there are several options:
+
+Option #1 (special case):
+if your system normally uses an encoding other than UTF-8,
+is properly set up to use that encoding (using LC_ALL and maybe LC_CTYPE),
+and the input files are in that non-UTF-8 encoding,
+it may be that Python 3 is (incorrectly) ignoring your configuration.
+In that case, simply tell Python 3 to use your
+configuration by setting the environment variable PYTHONUTF8=0, e.g.,
+run flawfinder as:
+"PYTHONUTF8=0 python3 flawfinder ...".
+
+Option #2 (special case): If you know what the encoding of the files is,
+you can force use of that encoding. E.g., if the encoding
+is BLAH, run flawfinder as:
+"PYTHONUTF8=0 LC_ALL=C.BLAH python3 flawfinder ...".
+You can replace "C" after LC_ALL= with your real language locale
+(e.g., "en_US").
+
+Option #3: If you don't know what the encoding is, or the encoding is
+inconsistent (e.g., the common case of UTF-8 files with some
+characters encoded using Windows-1252 instead),
+then you can force the system to use the
+ISO-8859-1 (Latin-1) encoding in which all bytes are allowed.
+If the inconsistencies are only in comments and strings, and the
+underlying character set is "close enough" to ASCII, this can get you
+going in a hurry.
+You can do this by running:
+"PYTHONUTF8=0 LC_ALL=C.ISO-8859-1 python3 flawfinder ...".
+In some cases you may not need the "PYTHONUTF8=0".
+You may be able to replace "C" after LC_ALL= with your real language locale
+(e.g., "en_US").
+
+Option #4: Convert the encoding of the files to be analyzed so that it's
+a single encoding - it's highly recommended to convert to UTF-8.
 For example, the program "iconv" can be used to convert encodings.
 This works well if some files have one encoding, and some have another,
 but they are consistent within a single file.
 If the files have encoding errors, you'll have to fix them.
-I strongly recommend using the UTF-8 encoding for all source code
-and in the system itself; if you do that, many problems disappear.

-The second option is to
-tell flawfinder what the encoding of the files is.
-E.G., you can set the LANG environment variable.
-You can set PYTHONIOENCODING to
-the encoding you want your output to be in, if that's different.
-This in theory would work, but I haven't had much success with this.
-
-The third option is to run flawfinder using Python 2 instead of Python 3.
+Option #5: Run flawfinder using Python 2 instead of Python 3.
 E.g., "python2 flawfinder ...".

+To be clear:
+I strongly recommend using the UTF-8 encoding for all source code,
+and using continuous integration tests to ensure that the source code
+is always valid UTF-8.
+If you do that, many problems disappear.
+But in the real world this is not always the situation.
+Hopefully
+this information will help you deal with real-world encoding problems.
+
 .SH EXAMPLES

 Here are various examples of how to invoke flawfinder.
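
Note on the encoding options documented in this patch: the following is a minimal Python sketch, not code from flawfinder, illustrating why Option #3 works and how the recommended continuous integration check might look. It shows that decoding as UTF-8 rejects a stray 0xff byte, while ISO-8859-1 (Latin-1) accepts every byte value. The names decode_error and find_invalid_utf8 are hypothetical helpers invented for this illustration.

```python
# Illustrative sketch only; these helpers are assumptions, not part of flawfinder.


def decode_error(data, encoding="utf-8"):
    """Return the UnicodeDecodeError raised while decoding, or None on success."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc
    return None


def find_invalid_utf8(paths):
    """Return (path, error) pairs for files that are not valid UTF-8.

    Hypothetical helper for a CI check. Files are read as raw bytes,
    so locale settings such as LC_ALL or PYTHONUTF8 do not affect the result.
    """
    problems = []
    for path in paths:
        with open(path, "rb") as handle:
            err = decode_error(handle.read())
        if err is not None:
            problems.append((path, err))
    return problems


# A stray 0xff byte inside an otherwise ASCII comment.
SAMPLE = b"/* \xff */ int main(void) { return 0; }\n"

if __name__ == "__main__":
    # UTF-8 rejects the byte, which is what triggers the error described above.
    print(decode_error(SAMPLE))
    # Latin-1 maps every byte value to a character, so decoding cannot fail.
    print(decode_error(SAMPLE, "latin-1"))
```

A CI job could pass the source tree's file names to find_invalid_utf8 and fail the build if the returned list is non-empty.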