Item10995: Wysiwyg destroys umlaute in text.

Priority: Normal
Current State: Closed
Released In: 2.0.0
Target Release: major
Applies To: Engine
Component: I18N, Unicode, WysiwygPlugin
Branches:
Reported By: MichaelDaum
Waiting For:
Last Change By: GeorgeClark
My settings

  • CGI-3.43
  • perl-5.10.1
  • HTML::Entity 3.64
  • Encode 2.35
  • {Site}{CharSet} = iso-8859-15
  • {Site}{LocaleRegexes} = 1
  • {UseLocale} = 0
  • {UserInterfaceInternationalisation} = 1
  • fcgi
  • reproducible using apache2 and lighttpd
  • reproducible using pattern, pattern+natedit, natskin

The error can be reproduced like this:

  • edit a topic with wysiwyg enabled
  • add grüße
  • save
  • (things are fine up to here)
  • edit again
  • editor displays gr\xFC\xDFe

This is a very serious bug affecting 1.1.3 as well as trunk.


The editor destroys html entities as well during multiple edit-save cycles. Tested with

Hüte

Setting TINYMCEPLUGIN_ENTITY_ENCODING to various values does not fix it.

On the contrary: TINYMCEPLUGIN_ENTITY_ENCODING = raw adds random utf8 chars before and after unordered lists. I get random \xA0 here and there.


Paul mentioned on IRC that there's some double decoding going on in some combinations of perl package versions. Here's a patch that at least works for me fixing the first issue reported above:

Index: lib/Foswiki/Plugins/WysiwygPlugin/Handlers.pm
===================================================================
--- lib/Foswiki/Plugins/WysiwygPlugin/Handlers.pm       (revision 12180)
+++ lib/Foswiki/Plugins/WysiwygPlugin/Handlers.pm       (working copy)
@@ -612,7 +612,7 @@
     # Encode::FB_PERLQQ makes decode_utf8 convert invalid octet sequences
     # into a perl escape sequence, octet for octet (e.g. \xFF\x80),
     # instead of throwing an exception. This defuses the invalid sequence.
-    $text = Encode::decode_utf8( $text, Encode::FB_PERLQQ );
+    $text = Encode::decode_utf8( $text, Encode::FB_PERLQQ ) if utf8::is_utf8($text);
 
     # $text now contains unicode characters
     #print STDERR "as utf-8  [". WC::debugEncode($text). "]\n\n";

-- MichaelDaum - 22 Jul 2011

That patch looks exceptionally dangerous, as IME utf8 sequences can occur in plain text resulting in bogus decodings.

Without the patch, my system works fine.
  • CGI-3.49 (note that the most recent version is 3.55 - I don't know how you managed to install 4.34!)
  • perl 5.10.1
  • HTML::Entity not installed
  • Encode 2.35
  • {Site}{CharSet} = iso-8859-15
  • {Site}{LocaleRegexes} = 1
  • {UseLocale} = 0
  • {UserInterfaceInternationalisation} = 0
Not using fcgid or mod_perl.

Note that the code you patched is involved in interpreting an AJAX request from the client. AJAX requests are always encoded using UTF-8. As the comment states, Encode::FB_PERLQQ makes decode_utf8 convert invalid octet sequences into perl escape sequences. Since you are getting a conversion to \x escapes, it must think that the UTF-8 sequences in your text are invalid. The most likely cause of this is double-decoding, i.e. the request coming in has already been converted to another charset, perhaps even unicode. Without being able to reproduce this problem, it's anybody's guess.

If we had an answer to this, we would have fixed it years ago. The encoding problem is a nightmare.

-- CrawfordCurrie - 23 Jul 2011
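
The symptom in the report is consistent with the double-decoding described above. The following is a minimal sketch in Python (not the Perl code in Handlers.pm), assuming the request text has already been converted to iso-8859-15 before a second UTF-8 decode runs. Python's backslashreplace error handler is used here as a rough stand-in for Encode::FB_PERLQQ; it emits lowercase hex where Perl emits uppercase.

```python
# "grüße", written with escapes so the example does not depend on file encoding.
text = "gr\u00fc\u00dfe"

# Assumption: something upstream has already converted the request to the
# site charset, so the handler receives iso-8859-15 octets, not UTF-8.
site_octets = text.encode("iso-8859-15")   # b'gr\xfc\xdfe'

# Decoding those octets as UTF-8 fails on 0xFC and 0xDF. The escaping
# fallback turns each invalid octet into a literal \x.. sequence.
mangled = site_octets.decode("utf-8", errors="backslashreplace")

# For comparison: genuine UTF-8 octets decode cleanly.
utf8_octets = text.encode("utf-8")         # b'gr\xc3\xbc\xc3\x9fe'
clean = utf8_octets.decode("utf-8", errors="backslashreplace")

print(mangled)  # gr\xfc\xdfe
print(clean)
```

The mangled result matches the string the editor displays on the second edit, apart from the case of the hex digits.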

Argh, Crawford is right. Michael: the if utf8... check is probably working for the wrong reason. I suspect you've got a unicode "tainted" string which (somehow) contains un-decoded utf8 octets. In other words, double-encoded.

Michael said on IRC he was using CGI 3.43. This is known in the changelogs of subsequent releases to double-utf8-encode URL param values (I forget the circumstances), and I know it is impossible to work with when Foswiki is using {Site}{CharSet} = 'utf-8'.

I intend to spend some time on unicode issues shortly.

-- PaulHarvey - 23 Jul 2011

So it would be very valuable if Michael could test upgrading CGI.pm via libcgi-pm-perl on his buggy system. Or we can try to reproduce on Ubuntu 10.10 (the buggy OS).

-- PaulHarvey - 23 Jul 2011

Ubuntu 10.10 comes with two versions of CGI.pm, one part of perl-modules containing 3.43, the other part of libcgi-pm-perl which is 3.49. configure reports using the latter. Still the error persists. Could be the problem is more in FCGI.pm.

Untainting the string before decoding it did not make a difference.

Crawford, how come you don't have HTML::Entities on your system? It is required by WysiwygPlugin. It comes as part of libhtml-parser-perl on Ubuntu. (Removing this package takes lots of other required perl packages with it.)

I tried to install the latest CGI using cpan CGI which fails as also reported elsewhere.

CGI or FCGI doesn't make a difference either. Both show the error.

-- MichaelDaum - 23 Jul 2011

I've found the error. It is related to Item10825 where all url params of a REST call are already decoded as part of building up the Request object. While this is a good idea in general because now plugins don't have to reinvent toSiteCharSet() on their own, the changes to Request.pm induce a plugin incompatibility like we see now here as both - the core as well as the WysiwygPlugin - try to decode the request.

Any ideas how to break out of this maze?

-- MichaelDaum - 23 Jul 2011

Yeah. We're going to have to do the unicode properly. No more octet strings in core (or plugins) :-(

-- PaulHarvey - 23 Jul 2011

I have HTML::Entities, but not HTML::Entity.

I also had problems installing CGI 3.55.

Can you be more specific about where the error is?

-- CrawfordCurrie - 24 Jul 2011

More specific about which error? The error installing CGI 3.55? See the link above. Otherwise the problem being reported here has been analyzed sufficiently afaics.

-- MichaelDaum - 25 Jul 2011

The issue I'm having with this is knowing which - if any - versions of the core do not re-encode incoming params to the site charset. There must be at least one, otherwise we wouldn't have had to decode them in the rest handler :-(

-- CrawfordCurrie - 27 Jul 2011

Spot on: that's the maze we are in now. It could be a $Foswiki::Plugins::VERSION thingy, but it would be better to add a flag to the Request object indicating that params have already been decoded. The flag would default to "not decoded yet", which keeps it backwards compatible with old foswiki engines.

-- MichaelDaum - 27 Jul 2011
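
The flag idea above can be sketched as follows. This is an illustration in Python, not Foswiki code: the Request class, the params_decoded attribute, and all function names are hypothetical, and the assumption that the newer core hands plugins site-charset octets is taken from the description of Item10825 in this thread.

```python
SITE_CHARSET = "iso-8859-15"


class Request:
    """Hypothetical stand-in for the request object of an OLD engine.

    It has no params_decoded attribute at all, which is the case the
    default must cover.
    """

    def __init__(self, params):
        self.params = dict(params)


class DecodingRequest(Request):
    """Hypothetical newer engine: converts UTF-8 params to the site charset
    while building the request, and says so via a flag."""

    def __init__(self, params):
        converted = {
            name: value.decode("utf-8").encode(SITE_CHARSET)
            for name, value in params.items()
        }
        super().__init__(converted)
        self.params_decoded = True


def plugin_get_text(request, name):
    """Plugin-side accessor that decodes exactly once."""
    octets = request.params[name]
    # A missing flag means "not decoded yet", so old engines keep working.
    if getattr(request, "params_decoded", False):
        return octets.decode(SITE_CHARSET)
    return octets.decode("utf-8", errors="backslashreplace")


# The browser sends UTF-8 octets in both cases.
raw = {"text": "gr\u00fc\u00dfe".encode("utf-8")}

from_old_engine = plugin_get_text(Request(raw), "text")
from_new_engine = plugin_get_text(DecodingRequest(raw), "text")
print(from_old_engine == from_new_engine)
```

The design choice here is that the plugin never guesses from the data itself whether it was decoded; it relies only on what the engine declares, which avoids the bogus decodings that content sniffing can produce.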

But where is this decoding done? AFAICT the code that builds the Request object doesn't do any conversion to the site charset.

-- CrawfordCurrie - 27 Jul 2011

The decoding should be done as early as possible as part of the construction of the Request object as outlined in Item10825 (see my red comment above)

-- MichaelDaum - 27 Jul 2011

We can't fix it until all of Foswiki decodes to an internal string representation. That is nothing we want to do on foswiki-1.1.x, as it affects lots of code that needs fixing. So I am downgrading this bug item to normal so that it no longer blocks 1.1.4.

-- MichaelDaum - 18 Oct 2011

I concur. Changed ReleaseTarget to minor

-- PaulHarvey - 19 Oct 2011

Solved in utf8 branch. Awaiting merge.

-- Main.CrawfordCurrie - 17 May 2015 - 10:04
 

ItemTemplate

Summary: Wysiwyg destroys umlaute in text.
ReportedBy: MichaelDaum
Codebase: 1.1.3, trunk
SVN Range:
AppliesTo: Engine
Component: I18N, Unicode, WysiwygPlugin
Priority: Normal
CurrentState: Closed
WaitingFor:
Checkins:
TargetRelease: major
ReleasedIn: 2.0.0
CheckinsOnBranches:
trunkCheckins:
masterCheckins:
ItemBranchCheckins:
Release01x01Checkins:
Topic revision: r20 - 05 Jul 2015, GeorgeClark