Improve docs on character encoding

Provide better info on how to handle character encoding problems. As more people use Python3 this is more likely to be a problem.

Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
2019-09-22 15:21:11 -04:00
parent a3ff9a89d6
commit f1fdd59da5
1 changed file with 61 additions and 23 deletions
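Note on the failure mode documented in this change: the following is a minimal Python 3 sketch, not flawfinder's own code, of why a strict UTF-8 decode fails on a stray byte while ISO-8859-1 (the fallback described as Option #3 in the diff) accepts every byte. The sample bytes and the helper name try_decode are hypothetical and exist only for illustration.

```python
# Hypothetical sample: ASCII C source with one stray 0xff byte inside a comment.
data = b"int main(void) { return 0; } /* caf\xff */\n"


def try_decode(raw, encoding):
    """Return (True, text) on success, or (False, reason) on a decode error."""
    try:
        return True, raw.decode(encoding)
    except UnicodeDecodeError as err:
        # err.reason carries the short description shown at the end of
        # Python's UnicodeDecodeError message.
        return False, err.reason


# Strict UTF-8 rejects the file because 0xff can never start a UTF-8 sequence.
utf8_ok, utf8_detail = try_decode(data, "utf-8")

# ISO-8859-1 maps each of the 256 byte values to one character, so it cannot fail.
latin1_ok, latin1_text = try_decode(data, "iso-8859-1")

print("utf-8 ok:", utf8_ok, "-", utf8_detail)
print("iso-8859-1 ok:", latin1_ok, "- characters:", len(latin1_text))
```

Decoding as ISO-8859-1 does not recover the intended characters; it only guarantees that decoding succeeds, which is why the man page text limits this approach to cases where the problem bytes sit in comments or strings.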
--- a/flawfinder.1
+++ b/flawfinder.1
@@ -598,17 +598,22 @@ The difference algorithm is conservative;
 hits are only considered the ``same'' if they have the same
 filename, line number, column position, function name, and risk level.
-.SS "Character Encoding"
+.SS "Character Encoding Errors"
-Flawfinder presumes that the character encoding your system uses is
+Flawfinder uses the character encoding rules set by Python.
-also the character encoding used by your source files.
+Sometimes source code does not perfectly follow some encoding rules.
-Even if this isn't correct, if you run flawfinder with Python 2
+If you run flawfinder with Python 2
 these non-conformities often do not impact processing in practice.
 However, if you run flawfinder with Python 3, this can be a problem.
-Python 3 wants the world to always use encodings perfectly correctly,
+Python 3 developers wants the world to always use encodings perfectly correctly,
-everywhere, even though the world often doesn't care what Python 3 wants.
+everywhere, and in general wants everyone to only use UTF-8.
-This is a problem even if the non-conforming text is in comments or strings
+UTF-8 is a great encoding, and it is very popular, but
 the world often doesn't care what the Python 3 developers want.
 When running flawfinder using Python 3, the program will crash hard if
 \fIany\fR source file has \fIany\fR non-conforming text.
 It will do this even if the non-conforming text is in comments or strings
 (where it often doesn't matter).
 Python 3 fails to provide useful built-ins to deal with
 the messiness of the real world, so it's
@@ -618,31 +623,64 @@ libraries (which we're trying to avoid).
 A symptom of this problem
 is if you run flawfinder and you see an error message like this:
-\fIUnicodeDecodeError: 'utf-8' codec can't decode byte ... in position ...:
+\fIError: encoding error in ,1.c\fR
 invalid continuation byte\fR
-If this happens to you, there are several options.
+\fI'utf-8' codec can't decode byte 0xff in position 45: invalid start byte\fR
-The first option is to
+What you are seeing is the result of an internal UnicodeDecodeError.
-convert the encoding of the files to be analyzed so that it's
+
-a single encoding (usually the system encoding).
+If this happens to you, there are several options:
 Option #1 (special case):
 if your system normally uses an encoding other than UTF-8,
 is properly set up to use that encoding (using LC_ALL and maybe LC_CTYPE),
 and the input files are in that non-UTF-8 encoding,
 it may be that Python3 is (incorrectly) ignoring your configuration.
 In that case, simply tell Python3 to use your
 configuration by setting the environment variable PYTHONUTF8=0, e.g.,
 run flawfinder as:
 "PYTHONUTF8=0 python3 flawfinder ...".
 Option #2 (special case): If you know what the encoding of the files is,
 you can force use of that encoding. E.g., if the encoding
 is BLAH, run flawfinder as:
 "PYTHONUTF8=0 LC_ALL=C.BLAH python3 flawfinder ...".
 You can replace "C" after LC_ALL= with your real language locale
 (e.g., "en_US").
 Option #3: If you don't know what the encoding is, or the encoding is
 inconsistent (e.g., the common case of UTF-8 files with some
 characters encoded using Windows-1252 instead),
 then you can force the system to use the
 ISO-8859-1 (Latin-1) encoding in which all bytes are allowed.
 If the inconsistencies are only in comments and strings, and the
 underlying character set is "close enough" to ASCII, this can get you
 going in a hurry.
 You can do this by running:
 "PYTHONUTF8=0 LC_ALL=C.ISO-8859-1 python3 flawfinder ...".
 In some cases you may not need the "PYTHONUTF8=0".
 You may be able to replace "C" after LC_ALL= with your real language locale
 (e.g., "en_US").
 Option #4: Convert the encoding of the files to be analyzed so that it's
 a single encoding - it's highly recommended to convert to UTF-8.
 For example, the program "iconv" can be used to convert encodings.
 This works well if some files have one encoding, and some have another,
 but they are consistent within a single file.
 If the files have encoding errors, you'll have to fix them.
 I strongly recommend using the UTF-8 encoding for all source code
 and in the system itself; if you do that, many problems disappear.
-The second option is to
+Option #5: Run flawfinder using Python 2 instead of Python 3.
 tell flawfinder what the encoding of the files is.
 E.G., you can set the LANG environment variable.
 You can set PYTHONIOENCODING to
 the encoding you want your output to be in, if that's different.
 This in theory would work, but I haven't had much success with this.
 The third option is to run flawfinder using Python 2 instead of Python 3.
 E.g., "python2 flawfinder ...".
 To be clear:
 I strongly recommend using the UTF-8 encoding for all source code,
 and use continuous integration tests to ensure that the source code
 is always valid UTF-8.
 If you do that, many problems disappear.
 But in the real world this is not always the situation.
 Hopefully
 this information will help you deal with real-world encoding problems.
 .SH EXAMPLES
 Here are various examples of how to invoke flawfinder.