Improve docs on character encoding
Provide better info on how to handle character encoding problems. As more people use Python3 this is more likely to be a problem.

Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
parent
a3ff9a89d6
commit
f1fdd59da5
84
flawfinder.1
|
@@ -598,17 +598,22 @@ The difference algorithm is conservative;
|
||||||
hits are only considered the ``same'' if they have the same
|
hits are only considered the ``same'' if they have the same
|
||||||
filename, line number, column position, function name, and risk level.
|
filename, line number, column position, function name, and risk level.
|
||||||
|
|
||||||
.SS "Character Encoding"
|
.SS "Character Encoding Errors"
|
||||||
|
|
||||||
Flawfinder presumes that the character encoding your system uses is
|
Flawfinder uses the character encoding rules set by Python.
|
||||||
also the character encoding used by your source files.
|
Sometimes source code does not perfectly follow some encoding rules.
|
||||||
Even if this isn't correct, if you run flawfinder with Python 2
|
If you run flawfinder with Python 2
|
||||||
these non-conformities often do not impact processing in practice.
|
these non-conformities often do not impact processing in practice.
|
||||||
|
|
||||||
However, if you run flawfinder with Python 3, this can be a problem.
|
However, if you run flawfinder with Python 3, this can be a problem.
|
||||||
Python 3 wants the world to always use encodings perfectly correctly,
|
Python 3 developers want the world to always use encodings perfectly correctly,
|
||||||
everywhere, even though the world often doesn't care what Python 3 wants.
|
everywhere, and in general want everyone to only use UTF-8.
|
||||||
This is a problem even if the non-conforming text is in comments or strings
|
UTF-8 is a great encoding, and it is very popular, but
|
||||||
|
the world often doesn't care what the Python 3 developers want.
|
||||||
|
|
||||||
|
When running flawfinder using Python 3, the program will crash hard if
|
||||||
|
\fIany\fR source file has \fIany\fR non-conforming text.
|
||||||
|
It will do this even if the non-conforming text is in comments or strings
|
||||||
(where it often doesn't matter).
|
(where it often doesn't matter).
|
||||||
Python 3 fails to provide useful built-ins to deal with
|
Python 3 fails to provide useful built-ins to deal with
|
||||||
the messiness of the real world, so it's
|
the messiness of the real world, so it's
|
||||||
|
@@ -618,31 +623,64 @@ libraries (which we're trying to avoid).
|
||||||
A symptom of this problem
|
A symptom of this problem
|
||||||
is if you run flawfinder and you see an error message like this:
|
is if you run flawfinder and you see an error message like this:
|
||||||
|
|
||||||
\fIUnicodeDecodeError: 'utf-8' codec can't decode byte ... in position ...:
|
\fIError: encoding error in ,1.c\fR
|
||||||
invalid continuation byte\fR
|
|
||||||
|
|
||||||
If this happens to you, there are several options.
|
\fI'utf-8' codec can't decode byte 0xff in position 45: invalid start byte\fR
|
||||||
|
|
||||||
The first option is to
|
What you are seeing is the result of an internal UnicodeDecodeError.
|
||||||
convert the encoding of the files to be analyzed so that it's
|
|
||||||
a single encoding (usually the system encoding).
|
If this happens to you, there are several options:
|
||||||
|
|
||||||
|
Option #1 (special case):
|
||||||
|
if your system normally uses an encoding other than UTF-8,
|
||||||
|
is properly set up to use that encoding (using LC_ALL and maybe LC_CTYPE),
|
||||||
|
and the input files are in that non-UTF-8 encoding,
|
||||||
|
it may be that Python3 is (incorrectly) ignoring your configuration.
|
||||||
|
In that case, simply tell Python3 to use your
|
||||||
|
configuration by setting the environment variable PYTHONUTF8=0, e.g.,
|
||||||
|
run flawfinder as:
|
||||||
|
"PYTHONUTF8=0 python3 flawfinder ...".
|
||||||
|
|
||||||
|
Option #2 (special case): If you know what the encoding of the files is,
|
||||||
|
you can force use of that encoding. E.g., if the encoding
|
||||||
|
is BLAH, run flawfinder as:
|
||||||
|
"PYTHONUTF8=0 LC_ALL=C.BLAH python3 flawfinder ...".
|
||||||
|
You can replace "C" after LC_ALL= with your real language locale
|
||||||
|
(e.g., "en_US").
|
||||||
|
|
||||||
|
Option #3: If you don't know what the encoding is, or the encoding is
|
||||||
|
inconsistent (e.g., the common case of UTF-8 files with some
|
||||||
|
characters encoded using Windows-1252 instead),
|
||||||
|
then you can force the system to use the
|
||||||
|
ISO-8859-1 (Latin-1) encoding in which all bytes are allowed.
|
||||||
|
If the inconsistencies are only in comments and strings, and the
|
||||||
|
underlying character set is "close enough" to ASCII, this can get you
|
||||||
|
going in a hurry.
|
||||||
|
You can do this by running:
|
||||||
|
"PYTHONUTF8=0 LC_ALL=C.ISO-8859-1 python3 flawfinder ...".
|
||||||
|
In some cases you may not need the "PYTHONUTF8=0".
|
||||||
|
You may be able to replace "C" after LC_ALL= with your real language locale
|
||||||
|
(e.g., "en_US").
|
||||||
|
|
||||||
|
Option #4: Convert the encoding of the files to be analyzed so that it's
|
||||||
|
a single encoding - it's highly recommended to convert to UTF-8.
|
||||||
For example, the program "iconv" can be used to convert encodings.
|
For example, the program "iconv" can be used to convert encodings.
|
||||||
This works well if some files have one encoding, and some have another,
|
This works well if some files have one encoding, and some have another,
|
||||||
but they are consistent within a single file.
|
but they are consistent within a single file.
|
||||||
If the files have encoding errors, you'll have to fix them.
|
If the files have encoding errors, you'll have to fix them.
|
||||||
I strongly recommend using the UTF-8 encoding for all source code
|
|
||||||
and in the system itself; if you do that, many problems disappear.
|
|
||||||
|
|
||||||
The second option is to
|
Option #5: Run flawfinder using Python 2 instead of Python 3.
|
||||||
tell flawfinder what the encoding of the files is.
|
|
||||||
E.G., you can set the LANG environment variable.
|
|
||||||
You can set PYTHONIOENCODING to
|
|
||||||
the encoding you want your output to be in, if that's different.
|
|
||||||
This in theory would work, but I haven't had much success with this.
|
|
||||||
|
|
||||||
The third option is to run flawfinder using Python 2 instead of Python 3.
|
|
||||||
E.g., "python2 flawfinder ...".
|
E.g., "python2 flawfinder ...".
|
||||||
|
|
||||||
|
To be clear:
|
||||||
|
I strongly recommend using the UTF-8 encoding for all source code,
|
||||||
|
and use continuous integration tests to ensure that the source code
|
||||||
|
is always valid UTF-8.
|
||||||
|
If you do that, many problems disappear.
|
||||||
|
But in the real world this is not always the situation.
|
||||||
|
Hopefully
|
||||||
|
this information will help you deal with real-world encoding problems.
|
||||||
|
|
||||||
.SH EXAMPLES
|
.SH EXAMPLES
|
||||||
|
|
||||||
Here are various examples of how to invoke flawfinder.
|
Here are various examples of how to invoke flawfinder.
|
||||||
|
|
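The behavior described in the new "Character Encoding Errors" text can be reproduced outside flawfinder. The following is a minimal Python 3 sketch, not flawfinder's own code: the names find_invalid_utf8 and demo and the sample file contents are hypothetical. It shows that a single 0xff byte in a comment is enough to make strict UTF-8 decoding fail with "invalid start byte", that ISO-8859-1 accepts every byte (the basis of Option #3), and how a continuous integration job might check that source files are valid UTF-8.

```python
import os
import tempfile


def find_invalid_utf8(paths):
    """Return (path, error) pairs for files that are not valid UTF-8.

    Hypothetical helper for a CI check; it is not part of flawfinder.
    """
    problems = []
    for path in paths:
        # Read raw bytes so that no locale-dependent decoding happens.
        with open(path, "rb") as handle:
            data = handle.read()
        try:
            data.decode("utf-8")  # strict decoding, raises on bad bytes
        except UnicodeDecodeError as err:
            problems.append((path, err))
    return problems


def demo():
    """Write a small C file with one stray 0xff byte and inspect it."""
    # The stray byte sits inside a comment, where it does not affect the code.
    raw = b"/* caf\xff */ int main(void) { return 0; }\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.c")
        with open(path, "wb") as handle:
            handle.write(raw)
        problems = find_invalid_utf8([path])
    # ISO-8859-1 maps each of the 256 byte values to one character,
    # so this decode cannot fail, whatever the input bytes are.
    latin1_text = raw.decode("iso-8859-1")
    return problems, latin1_text


if __name__ == "__main__":
    found, text = demo()
    for name, error in found:
        print(f"{os.path.basename(name)}: {error}")
    print(f"ISO-8859-1 decoded {len(text)} characters without error")
```

In a CI job, a script along these lines could exit with a non-zero status whenever find_invalid_utf8 returns a non-empty list. Note that decoding as ISO-8859-1 only avoids the crash; any non-ASCII text that was really in another encoding will be shown as the wrong characters.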