You are here: Foswiki>Tasks Web>Item10635 (05 Jul 2015, GeorgeClark)Edit Attach

Item10635: Wrong international character display.

pencil
Priority: Urgent
Current State: Closed
Released In: 2.0.0
Target Release: major
Applies To: Engine
Component: I18N, Unicode
Branches:
Reported By: JozefMojzis
Waiting For:
Last Change By: GeorgeClark
My config:
$Foswiki::cfg{UserInterfaceInternationalisation} = 1;
$Foswiki::cfg{Languages}{cs}{Enabled} = 1;   #all enabled
$Foswiki::cfg{UseLocale} = 1;
$Foswiki::cfg{Site}{Locale} = 'sk_SK.UTF-8';
$Foswiki::cfg{Site}{CharSet} = 'UTF-8';

When displaying any page, the character 'ž' in the same menu WebLeftBarExample displayed differently.

#: PatternSkin/data/System/WebLeftBarExample.txt:16
msgid "User Reference"
msgstr "Uživatelské reference"
#"ž" displayed correctly in the browser

in the
#: PatternSkin/data/System/WebLeftBarExample.txt:26
msgid "Admin Maintenance"
msgstr "Administrátorská údržba"
# the "ž" is displayed wrong in the browser

Foswiki RC1, you can reproduce it:
  • configure for utf8 as above
  • select interface language czech (what has a wrong translation: _jméno_jazyka)
  • show any page a watch the sidebar (default pattern skin)

In addition: With the above settings, text rendered differently in verbatim block.

Administrátorská údržba - displayed wrong

Administrátorská údržba - the same text displayed OK

When disable UseLocale in cfg:
$Foswiki::cfg{UseLocale} = 0;
this bug disappear, so using real SiteLocale is not recommended.

The dirty hack, what solving the issue - in the first try is in Render.pm
        $text = Encode::decode("utf-8", $text); #here
        $text =~ s($STARTWW
            (?:($Foswiki::regex{webNameRegex})\.)?
            ($Foswiki::regex{wikiWordRegex}|
                $Foswiki::regex{abbrevRegex})
            ($Foswiki::regex{anchorRegex})?)
           (_handleWikiWord( $this, $topicObject, $1, $2, $3))gexom;
        $text = Encode::encode("utf-8", $text); #and here
later will try adding a diff file...

-- JozefMojzis - 14 Apr 2011

So, Encode::decode converts octets into an unicode-aware "character" string (with utf flag set), and encode does the opposite.

I am kind of disappointed that we treat $text as octets by default. We should really be treating $text as character data rather than octets for as much of our code as possible, right?

-- PaulHarvey - 19 May 2011

This explains why I've been having issues with MongoDB enabled. MongoDB returns character strings, with utf flag set. See Item10748 where we added a similar fix at a much later stage in the rendering pipeline.

-- PaulHarvey - 19 May 2011

Unfortunately, the "quick fix" above solving the main problem from this Task, but somewhere deeper cause double encoding or something like - and therefore it is not correct.

Encoding problems like this are the main reason, why is good encode/decode from/to utf8 as close to IO as possible... (e.g. for example in the Foswiki::readFile). IMHO, Foswiki should internally use only and only utf-8 characters and not octets/bytes.

One possible solution - not ideal, but usable: Foswiki::readFile (or the new Store::VC interface) should read all files from the filesystem in binary mode (as Store::VC it does). If the SIteCharset is not utf8, should convert (recode) the text files from the SiteCharset to utf8 and return utf8-flagged string to Foswiki. From this moment, everything is done in utf8 . At the write, from utf8 convert to SiteCharset and write binary. (here can be problem of lost characters,, e.g. when someone want convert greek omega into 8859-1, but possible workaround with html entities.). If the SiteCharset is UTF8 - no conversions are needed.

At the DB level alike. If the DB is not utf8 - convert before read/write. If the DB is utf8 (like Mongo) = no conversions. And this allow having multiple DB's with different encodings and so on.

This way will not break anything at Topic level (so old 88591 Topics remain untouched). Unfortunately, this will break much things internally. (Will need revisit every regex, especially like /\177/ or so.). Will probably break much of plugins. and so on. But doing it sooner is better than later. If here is no another reason: it is easier optimize a debugged code as debugging the (badly) optimized one. ($0.02) smile

And if someone understand my broken english - it's a miracle. smile

-- JozefMojzis - 20 May 2011

Jozef, your English is just fine; we just invent stuff to go in the broken bits wink

Yes, using perl unicode (which is implemented using utf8) internally is the master plan. Unfortunately there are a number of problems, such as badly behaved plugins that assume 1 byte=1 char, that make this harder than you might think. We have been waiting for enough people to get together with the common goal of making this work before embarking on major changes; it's too much for one person to tackle.

-- CrawfordCurrie - 21 May 2011

Just installed Foswiki-1.1.4-beta2, Mon, 31 Oct 2011, build 12966, Plugin API version 2.1 and this bug not shown anymore... So something was changed in the core what repaired this bug too... cool... wink

-- JozefMojzis - 08 Nov 2011

Hi Jozef, I've also found that running the latest CGI.pm (version 3.43 shipped with Ubuntu 10.04 LTS and I think Debian Lenny is broken, see Support.Faq25) and FCGI.pm (If you're using ModFastCGIEngineContrib) really helps a lot.

-- PaulHarvey - 08 Nov 2011

checked in 1.1.5 - everything is OK, so "No action required" wink

-- JozefMojzis - 03 May 2012

The above is true for Freebsd. Not for OS X. In OS X the problem remaining. So reopen...

-- JozefMojzis - 25 May 2012

Deferring to 1.2.0

-- GeorgeClark - 27 Nov 2012

Sorry, but this problem is only going to be solved when we move to unicode in the core, and the constraint on that is needing a common perl version that has working unicode. With redhat still on 5.10, that's a way away. Deferred until 2.0.0.

-- CrawfordCurrie - 14 Mar 2014

Solved in utf8 branch (unicode core)

-- Main.CrawfordCurrie - 17 May 2015 - 10:03
 

  • screenshot:
    foswiki error.jpg

  • screenshot showing different behavior in verbatim block:
    foswiki error2.jpg

ItemTemplate edit

Summary Wrong international character display.
ReportedBy JozefMojzis
Codebase 1.1.3 beta1, trunk
SVN Range
AppliesTo Engine
Component I18N, Unicode
Priority Urgent
CurrentState Closed
WaitingFor
Checkins
TargetRelease major
ReleasedIn 2.0.0
CheckinsOnBranches
trunkCheckins
masterCheckins
ItemBranchCheckins
Release01x01Checkins
I Attachment Action Size Date Who Comment
TestTopic0.txttxt TestTopic0.txt manage 222 bytes 14 Apr 2011 - 15:31 JozefMojzis the topic source text
foswiki_error.jpgjpg foswiki_error.jpg manage 100 K 14 Apr 2011 - 15:06 JozefMojzis screenshot
foswiki_error2.jpgjpg foswiki_error2.jpg manage 39 K 14 Apr 2011 - 15:30 JozefMojzis screenshot showing different behavior in verbatim block
Topic revision: r16 - 05 Jul 2015, GeorgeClark
The copyright of the content on this website is held by the contributing authors, except where stated elsewhere. See Copyright Statement. Creative Commons License    Legal Imprint    Privacy Policy