Item12027: statistics/logfile iterator: not respecting SiteCharset

pencil
Priority: Normal
Current State: Closed
Released In: 2.0.0
Target Release: major
Applies To: Engine
Component: FoswikiLogger, FoswikiUIStatistics, I18N
Branches: trunk
Reported By: JanKrueger
Waiting For:
Last Change By: GeorgeClark
Event log messages are saved according to the SiteCharset but read back in using Perl's default charset. This causes statistics to misrender topic names. For example, "Tüdeldü" turns into "Tüdeldü".

My (currently untested) attempt at solving this simply involved copying the code from log to eachEventSince like this:

diff --git a/lib/Foswiki/Logger/PlainFile.pm b/lib/Foswiki/Logger/PlainFile.pm
index 821dfe3..24df920 100644
--- a/lib/Foswiki/Logger/PlainFile.pm
+++ b/lib/Foswiki/Logger/PlainFile.pm
@@ -216,7 +216,15 @@ sub eachEventSince {
     foreach my $logfile (@logs) {
         next unless -r $logfile;
         my $fh;
-        if ( open( $fh, '<', $logfile ) ) {
+        my $mode = '<';
+        # Item10764, SMELL UNICODE: actually, perhaps we should open the stream this
+        # way for any encoding, not just utf8. Babar says: check what Catalyst does.
+        if (   $Foswiki::cfg{Site}{CharSet}
+            && $Foswiki::cfg{Site}{CharSet} =~ /^utf-?8$/ )
+        {
+            $mode .= ":encoding($Foswiki::cfg{Site}{CharSet})";
+        }
+        if ( open( $fh, $mode, $logfile ) ) {
             push(
                 @iterators,
                 new Foswiki::Logger::PlainFile::EventIterator(

-- JanKrueger - 02 Aug 2012

Okay, no, that was exactly the wrong approach. I actually needed to fix the code in log. I've explained the mechanics of what was going on in the commit message.

-- JanKrueger - 02 Jan 2014

Reopening this. We need a unit test, and need to verify that the CompatibilityLogger and LogDispatchContrib don't suffer from the same problems.

-- GeorgeClark - 07 Jan 2014

Jan,

Compatibility.pm seems to fit your PlainFile patch just fine. I'll get that merged over. However LogDispatch is a lot more complex.

The LogDispatch.pm parent class establishes a binmode() method which is used to pass the encoding down to the output classes that support it. It's set when the class is instantiated. When foswiki writes a log, it calls Logger->log(), which is handled by LogDispatch->log(). It in turn drives the Log::Dispatch->dispatch() routine to cause each candidate logger type to be executed. If a logger output class wants to handle the record, it then calls the _flattenLog() callback to build the record.

So the message is actually built in LogDispatch::_flattenLog(). So I believe the fix is to make LogDispatch::binmode() support both set & get operations. Then in _flattenLog, do the appropriate isUtf8 tests and set the binmode.

The thing that is a bit confusing to me - is this a record-by-record decision? The file is opened for each write, and then closed. So changing binmode per-log-record ... does that make sense. Is there a risk of a mix of utf-8 and non-utf8 records in the same file?

And to make things more complicated, the same _flatten routine is used for the non-file based loggers - Screen (STDERR) and Syslog for example. Do then need to pass the message through the encoding? (Or will it hurt to pass it through regardless of the type of output).

-- GeorgeClark - 07 Jan 2014

LogDispatch::_flattenLog currently always makes sure that the output string is a byte string... unless the site charset is UTF-8. In that case, the string possibly remains a Unicode string, but of course that depends on the original data that is passed to _flattenLog. So, with site charset being UTF-8, _flattenLog may generate either a Unicode string or a byte string, depending on what parameters it receives. It's very likely that all callers pass byte strings, so the result would most likely be a byte string in all cases. UTF-8 byte strings are exactly the thing that was handled incorrectly by the function I changed. UTF-8 byte strings will always be output incorrectly if you use a PerlIO :encoding(utf-8) layer.

Once we make an effort to Unicode-ify all of Foswiki, it makes sense to revisit this – from that point we can set a default encoding for all filehandles and just use Unicode strings everywhere. That would eliminate pretty much all calls to Encode::encode, and many calls to Encode::decode, too. For now, though, I don't see an alternative to doing what we're doing right now.

By the way, as far as I can tell, the binmode field in LogDispatch objects is not actually used by anything...

-- JanKrueger - 07 Jan 2014

It's a bit tricky to follow, but binmode is indeed used.

Foswiki::Logger::LogDisplatch::File passes is as an argument to $logd->{dispatch}->add(  ... binmode   => $logd->binmode(),  ...

And then Log::Dispatch::File::_open_file() calls it against the file handle: ...    if ( $self->{binmode} ) {         binmode $fh, $self->{binmode};     }

So if I actually followed that correctly, the binmode is set, possibly differently, for each file open. which happens at every log event.

-- GeorgeClark - 07 Jan 2014

Okay, I've written a unit test, and seem to have found some issues with your fix, and bigger issues with LogDispatch. It all depends on what the {Site}{CharSet} is set to:

If the CharSet is un-set, then it defaults to iso-8859-1. Any unicode logged is corrupted on replay. If CharSet is utf8, then your fixes work.

Log::Dispatch doesn't give me any way to change the binmode on the fly, so it gets set based on the LocalSite.cfg CharSet setting during initialization. I think the solution here is to just declare that LogDispatchContrib will be hard-coded to utf8 for logging. Everything passes if I do that.

So I'm checking in changes:
  • Force LogDispatchContrib into utf8 for all file I/O
  • Add a unit test to write a bytestring and unicode string and play them back
  • Add new test fixtures, CharsetUtf8 and Charset8859, so that all of the logger tests are run with both settings.

That will leave us with 3 unit test failures, for the Compat, and Plain loggers, where they corrupt unicode if site charset is set to 8858.

-- GeorgeClark - 07 Jan 2014

Changed Compatibilty and PlainFile loggers to also log always as utf8. This makes all the loggers consistent and all the unit tests pass regardless of the site character set.

-- GeorgeClark - 17 Jun 2014
 
Topic revision: r18 - 05 Jul 2015, GeorgeClark
The copyright of the content on this website is held by the contributing authors, except where stated elsewhere. See Copyright Statement. Creative Commons License    Legal Imprint    Privacy Policy