Add documentation about encoding

Signed-off-by: David A. Wheeler <dwheeler@dwheeler.com>
2017-08-26 17:51:27 -04:00 · cead0828ef
parent b1d1b2e74d
commit cead0828ef
1 changed files with 46 additions and 1 deletions
--- a/flawfinder.1
+++ b/flawfinder.1
@@ -413,7 +413,7 @@ This will often work, but the line numbers will be relative
 to the beginning of the patch file, not the positions in the
 source code.
 Note that you \fBmust\fR also provide the actual files to analyze,
-and not just the patch file; when using \f\-P\fR files are only reported
+and not just the patch file; when using \fB\-P\fR files are only reported
 if they are both listed in the patch and also listed (directly or indirectly)
 in the list of files to analyze.

@@ -585,6 +585,51 @@ The difference algorithm is conservative;
 hits are only considered the ``same'' if they have the same
 filename, line number, column position, function name, and risk level.

+.SS "Character Encoding"
+
+Flawfinder presumes that the character encoding your system uses is
+also the character encoding used by your source files.
+Even if this presumption is wrong, when you run flawfinder with Python 2
+such non-conformities often do not affect processing in practice.
+
+However, if you run flawfinder with Python 3, this can be a problem.
+Python 3 wants the world to always use encodings perfectly correctly,
+everywhere, even though the world often doesn't care what Python 3 wants.
+This is a problem even if the non-conforming text is in comments or strings
+(where it often doesn't matter).
+Python 3 fails to provide useful built-ins to deal with
+the messiness of the real world, so it's
+non-trivial to deal with this problem without depending on external
+libraries (which we're trying to avoid).
+
+A symptom of this problem is that
+running flawfinder produces an error message like this:
+
+\fIUnicodeDecodeError: 'utf-8' codec can't decode byte ... in position ...:
+invalid continuation byte\fR
+
+If this happens to you, there are several options.
+
+The first option is to
+convert the encoding of the files to be analyzed so that it's
+a single encoding (usually the system encoding).
+For example, the program "iconv" can be used to convert encodings.
+This works well if some files have one encoding and some have another,
+as long as each file is internally consistent.
+If the files have encoding errors, you'll have to fix them.
+I strongly recommend using the UTF-8 encoding for any source code;
+if you do that, many problems disappear.
+
+The second option is to
+tell flawfinder what the encoding of the files is.
+For example, you can set the LANG environment variable.
+You can also set PYTHONIOENCODING to
+the encoding you want your output to be in, if that's different.
+In theory this should work well, but I haven't had much success with it.
+
+The third option is to run flawfinder using Python 2 instead of Python 3.
+E.g., "python2 flawfinder ...".
+
 .SH EXAMPLES

 Here are various examples of how to invoke flawfinder.
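
The first option described in the new "Character Encoding" section (converting files to a single encoding) can be sketched in Python for readers who do not have iconv at hand. This is only an illustration and not part of flawfinder: the helper names `find_undecodable`, `to_utf8`, and `convert_file` are hypothetical, and the `latin-1` fallback is an assumption that you would replace with whatever encoding your files really use.

```python
from pathlib import Path


def find_undecodable(data: bytes, encoding: str = "utf-8"):
    """Return the decode error message for data, or None if it decodes cleanly."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # Same kind of message that Python 3 prints in the symptom above.
        return str(exc)
    return None


def to_utf8(data: bytes, source_encoding: str = "latin-1") -> bytes:
    """Return data as UTF-8 bytes.

    Data that is already valid UTF-8 is returned unchanged. Otherwise it is
    decoded with source_encoding (an assumption supplied by the caller) and
    re-encoded as UTF-8.
    """
    if find_undecodable(data) is None:
        return data
    return data.decode(source_encoding).encode("utf-8")


def convert_file(path, source_encoding: str = "latin-1") -> bool:
    """Rewrite one file in place as UTF-8; return True if it was changed."""
    file_path = Path(path)
    original = file_path.read_bytes()
    converted = to_utf8(original, source_encoding)
    if converted != original:
        file_path.write_bytes(converted)
        return True
    return False
```

Because the sketch rewrites files in place, it is best tried on a copy of the source tree. It only helps when each file is consistent within itself; a file that mixes encodings still has to be fixed by hand, as the section notes.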