
Item5437: UTF-8 fixes for Foswiki 2.0 (was Foswiki 1.1 but deferred, was Foswiki 1.0 but deferred, was T4.2 but deferred)

Priority: Urgent
Current State: Closed
Released In: 2.0.0
Target Release: major
Applies To: Engine
Component: I18N, Unicode
Branches: Release01x01 trunk
Reported By: TWiki:Main.PeterThoeny
Waiting For:
Last Change By: GeorgeClark
TWiki:Main/IvanBaktsheev writes this in his blog at http://dot-and-thing.blogspot.com/2008/03/twiki-utf8.html :
two years ago i made (successfully) twiki 4 installation with utf-8 (one source modification required). but last week i couldn't make another utf-8 installation from twiki 4.2.

unsuccessies:

  1. changing configuration parameters is not enough for unicode twiki installation. in my case, for correct handling utf-8 data, received from user, three .pm files required changes: lib/TWiki/{Save,View,Edit}.pm

  1. for correct view within TinyMCE editor i had to make another change for correct handling javascript escaped unicode (from module URI::Escape::JavaScript)

[snip] twiki and similar systems should be configured to utf-8 as default, because it allows any people write in any language. [snip]

I appreciate his patch (get it from his blog; I could not upload it because it just hangs on upload); tracked in Item5438. He also said some not so nice things about TWiki 4.2.

-- TWiki:Main/PeterThoeny - 13 Mar 2008

From TWiki:Codev/GeorgetownReleaseMeeting2008x03x17: This is of urgent nature but should not block 4.2.1 release. We need an owner of I18N.

-- TWiki:Main.PeterThoeny - 17 Mar 2008

I have attached the diff.

I wonder why the reporter chose to put this in a blog posting instead of here???

-- TWiki:Main.KennethLavrsen - 31 Mar 2008

See also http://twiki.org/cgi-bin/view/Codev/UseUTF8

I haven't been through Ivan's patches in detail, but on a cursory inspection they look correct. One thing to watch out for is the problem with the accept-charset in forms that Harald noted.

Confirmed.

Please note: contrary to the release meeting minutes I am not working on this; I have too much on my plate ATM. I was just trying to be helpful.

CC

Warning: based on my new understanding of encodings, the patch here is seriously incomplete. It does some of what is necessary, but not all.

-- TWiki:Main.CrawfordCurrie - 25 May 2008

A few weeks ago I upgraded our multi-lingual UTF-8 Cairo installation to 4.2.0. After tweaking some localisation settings I got it to display everything correctly, but unfortunately editing a page destroys all special characters. I've applied Ivan's patches. I needed to set the character set to "UTF-8", because any other setting breaks display on either Firefox or IE. Editing now seems to preserve characters, but as Crawford implied it does not work in all cases. Raw edit breaks characters, and special characters in links are also not preserved in TinyMCE.

-- TWiki:Main.LevienVanZon - 10 Jun 2008

A correction to my previous comment: the old site was running Dakar, not Cairo. And while display of many UTF-8 characters now works in TinyMCE with Ivan's patch, saving a page still destroys special characters. So effectively, this leaves me with no way to edit pages with special characters (e.g. also anything in French, Spanish, Portuguese, etc.). I'm going to revert to the situation before the patch.

-- TWiki:Main.LevienVanZon - 17 Jun 2008

I've just spent a few hours in an attempt to further analyse the issue. This is what I found so far:

  • With the correct setting for locale (en_US.utf8) and character set (utf-8), TWiki 4.2.0 has no problems displaying and raw-editing UTF-8 topics. This is basically the same as with Dakar, except...
  • When editing a page using the WysiwygPlugin (i.e. either in TinyMCE or with the raw-edit link from TinyMCE), any multi-byte UTF-8 characters get converted to questionmark-icons.
  • If I force UTF-8 I/O by changing the top-line of the TWiki edit script to "#!/usr/bin/perl -wTCS", UTF-8 characters are correctly displayed in TinyMCE. However, on reloading, switching to raw edit from TinyMCE or saving, UTF-8 characters still get converted to questionmark-icons. Moreover, direct raw-edit (using ?nowysiwyg=on) displays special characters incorrectly (they seem to be double-encoded).
  • Forcing UTF-8 mode by adding the CS switches to view script also seems to give double-encoding problems. Using Ivan's patches instead (and twiddling with the locale and charset settings a bit) gives more or less the same results on viewing and editing.
  • So it seems that most code handles UTF-8 correctly (or at least leaves it untouched), except the TML<->HTML conversions in WysiwygPlugin. Furthermore, it seems that TML->HTML can be made to work by forcing UTF-8, but HTML->TML always seems to break UTF-8 in our case.
  • I am somewhat mystified as to why forcing UTF-8 leads to double-encoding problems though. Our hosting provider runs a (Debian-based?) Linux system with Perl 5.8.8. I've checked that the topic-datafiles are really valid UTF-8, and the locale used is present in the "locale -a" listing.
  • The urlEncode/urlDecode roundtrip (in TWiki.pm, but called from WysiwygPlugin) correctly escapes UTF-8 characters to %uXXXX, but does not restore them. (Ivan's patch for TWiki.pm already fixes this.)
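
The %uXXXX roundtrip in the last bullet can be illustrated with a minimal Python sketch. This is not the TWiki.pm code: the function names `escape_u` and `unescape_u` are hypothetical, only characters in the Basic Multilingual Plane are handled, and literal `%` signs in the input are deliberately ignored to keep the example short.

```python
import re

def escape_u(text: str) -> str:
    """Replace every non-ASCII character with a %uXXXX escape (BMP only)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)            # plain ASCII passes through unchanged
        elif cp <= 0xFFFF:
            out.append("%%u%04X" % cp)
        else:
            raise ValueError("characters above U+FFFF are not handled in this sketch")
    return "".join(out)

def unescape_u(text: str) -> str:
    """Restore %uXXXX escapes; this is the step that was reported as missing."""
    return re.sub(r"%u([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)),
                  text)
```

A roundtrip such as `unescape_u(escape_u(s)) == s` is the property the bullet says was broken before Ivan's patch.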

I'm afraid I'm not much of a perl-wizard. So far I have been unable to locate in which step of the WysiwygPlugin conversion process the UTF-8 characters get clobbered.
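
One common cause of the double-encoding symptom mentioned above is that bytes which are already UTF-8 get read as single-byte characters and are then encoded to UTF-8 a second time. Whether that is what happened in this installation is an assumption; the Python sketch below only reproduces the general effect, with Latin-1 standing in for the misread charset and hypothetical function names.

```python
def double_encode(text: str) -> bytes:
    """Encode to UTF-8, misread the bytes as Latin-1, then encode again."""
    utf8_bytes = text.encode("utf-8")
    misread = utf8_bytes.decode("latin-1")   # each byte becomes one character
    return misread.encode("utf-8")

def repair(data: bytes) -> str:
    """Undo exactly one round of the double encoding above."""
    return data.decode("utf-8").encode("latin-1").decode("utf-8")
```

With this sketch a single "é" turns into four bytes, and a UTF-8 viewer shows two characters where one was stored.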

-- TWiki:Main.LevienVanZon - 19 Jun 2008

Per TWiki:Codev.GeorgetownReleaseMeeting2008x07x21 and by recommendation from Richard Donkin, UTF8 support requires more work than 4.2.1 allows. Deferred to 5.0.


I'm responding to OlivierRaginel's email now - basically we should target Perl 5.8 only, and look at UseUTF8 as the best outline of what to do, perhaps doing a subset initially to get going. There is quite a lot of work here, but we don't need a gold plated solution. As usual, I will be too busy to do any real coding but I'm happy to provide my opinion - best to email me when a change is made here.

Is there some way to get an alert for changes to this page only?

-- RichardDonkin - 12 Dec 2008

changing priority back to Enhancement (one could view it as an enhancement as TWiki was not designed for UTF8) to remove it from the Release Blockers for Foswiki 1.0.0. however, after Foswiki 1.0.0 is released, this should become Urgent again, as one of the main themes of Foswiki 1.1 is to finally support UTF8.

-- WillNorris - 05 Jan 2009

Changing it back to Urgent, and assigning it to me.

-- OlivierRaginel - 13 Jan 2009

Changed target release to major; we've done as much as we're going to for 1.1

-- CrawfordCurrie - 13 Jul 2010

WysiwygPlugin & TinyMCEPlugin are a bit confused about how to encode URIs - specifically path/filenames. I think the fixes I did in Tasks.Item9973 made things work a little better but actually left the code more confused (extra escaping steps), and I think what I did there was a mistake.

Now: other than Tasks.Item10230 (which is trying to finish off, i.e. complete, the Tasks.Item9973 fixes), I'm just thinking out loud here:
  • Consider a Foswiki installation with {CharSet} = 'cp-1251'.
  • We make an AJAX request to WysiwygPlugin's TML2HTML
  • The responding HTML is correctly encoded cp-1251, and the http headers match.
  • We're not using viewfile rewrites. Access to /pub goes directly to apache.
  • There's an image tag to a topic + attachment both of which contain high-bit characters from cp-1251 (imagine that this part of this topic is cp-1251 native chars):
    <img src="http://ogg.lan/foswiki/a/pub/Sandbox/TеЙeЦДstеF/TеЙeЦ.png"/>
  • The browser must make an HTTP request for the image. But as per rfc3986, the high-bit characters must be percent-escaped first. According to the apache logs, the request URI looks like this:
    http://ogg.lan/foswiki/a/pub/Sandbox/T%D0%B5%D0%99e%D0%A6%D0%94st%D0%B5F/T%D0%B5%D0%99e%D0%A6.png
    - which is the previous string, in unicode/utf-8.
  • This is because there's no way of encoding what charset the escaped octets come from, in the URI itself. So the convention is that the browser converts to unicode/utf-8 first, and escapes that.
  • Apache saw a request for a filename that doesn't exist on-disk. On-disk, we only have the cp-1251 encoded filename, because that's what Foswiki saved it as. If apache somehow knew that the URI was utf-8/unicode (it cannot), and apache knew that the on-disk filenames are cp-1251, there might be a chance there could be some conversion.... however as-is, we suffer a file-not-found error.
  • Instead, what works is escaping "early" (with the initial rendered HTML). This works:
    <img src="http://ogg.lan/foswiki/a/pub/Sandbox/T%e5%c9e%d6%c4st%e5F/T%e5%c9e%d6.png"/>
    - the browser didn't have to escape anything, because there were no high-bit chars; and apache didn't bother trying to understand what charset the URI was in, it just un-escaped the octets which happened to match byte-for-byte the cp-1251 filename on disk.
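
The difference between the two request URIs above can be reproduced with a short Python sketch. It uses only the standard `urllib.parse` module; the helper name `escape_path` is hypothetical, the web segment is taken from the example, and Python emits upper-case hex digits where the example shows lower-case ones (the two are equivalent under rfc3986).

```python
from urllib.parse import quote, unquote_to_bytes

# The web name from the example, written with escapes so the source stays ASCII.
WEB = "T\u0435\u0419e\u0426\u0414st\u0435F"

def escape_path(path: str, charset: str) -> str:
    """Percent-escape a path after encoding it in the given charset."""
    return quote(path, safe="/", encoding=charset)

late = escape_path(WEB, "utf-8")     # what the browser sends on its own
early = escape_path(WEB, "cp1251")   # what the server can emit in the HTML

on_disk = WEB.encode("cp1251")       # the filename bytes Foswiki saved

# Apache only un-escapes octets; it does not convert between charsets.
print(unquote_to_bytes(early) == on_disk)
print(unquote_to_bytes(late) == on_disk)
```

Only the early-escaped form decodes to the bytes that exist on disk, which is why escaping in the rendered HTML avoids the file-not-found error.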

-- PaulHarvey - 16 Apr 2011

My local hacks to avoid corruption with MongoDBPlugin seem sensitive to the exact version of CGI.pm and FCGI.pm.

Current plan:
  • Adding selenium-based tests. The Foswiki::Engine in use, and the webserver config, and the versions of CGI.pm etc., the store/listeners that are in use, all seem to be significant variables in a successful utf-8 Foswiki. And I keep breaking mine.
  • So I'm making a test generator. Checkins soon.
  • I hope to make it verify over all available Foswiki::Engines
  • To simulate MongoDBPlugin and (I assume) DBIStoreContrib behaviour, I want to add a special experimental configure setting to RcsLite/RcsWrap which will return (decoded) "unicode" perl character data, rather than octets.
  • Need to act on Item1344, add charset to TOPICINFO
  • Needs a new configure setting, for which charset should be assumed when TOPICINFO.CharSet is missing. But it's not that simple:
    • What about externally generated Topic.txt which omit the TOPICINFO completely
    • What about topics which now have the TOPICINFO.CharSet, but older revisions don't
    • Can we say: "When TOPICINFO.CharSet is absent, BUT TOPICINFO.date exists, AND that TOPICINFO.date is older than (some $Foswiki::cfg{Site}{Default}{CharSet}{OlderThan}{$epoch} setting), then assume the encoding is the value of $Foswiki::cfg{Site}{Default}{CharSet}{OlderThan}{$epoch}"?
    • Can we also say: "Assume the topic to be in $Foswiki::cfg{Site}{CharSet} encoding otherwise"
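
The fallback rule proposed in the last two bullets could look roughly like the Python sketch below. Everything here is hypothetical: the function name, the `charset` and `date` dictionary keys, and the three parameters stand in for the configure settings being discussed, none of which exist yet.

```python
from typing import Mapping, Optional

def assumed_charset(topicinfo: Optional[Mapping[str, str]],
                    site_charset: str,
                    legacy_charset: str,
                    older_than_epoch: int) -> str:
    """Pick the charset to assume for one topic revision."""
    if topicinfo:
        declared = topicinfo.get("charset")
        if declared:
            return declared                      # TOPICINFO says it explicitly
        date = topicinfo.get("date", "")
        if date.isdigit() and int(date) < older_than_epoch:
            return legacy_charset                # old revision, pre-dates the switch
    # No TOPICINFO at all, no usable date, or a recent revision.
    return site_charset
```

Externally generated topics without any TOPICINFO fall through to the site charset in this sketch, which is only one possible answer to the first open question in the list.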

-- PaulHarvey - 30 Jun 2011

I fear that trying to be too clever and convert on the fly is hopeless. I think we need a bulk convertor, that will take a Foswiki DB and a site charset and convert it to UTF8, including all histories.
  • No need for a TOPICINFO.charset, because the charset is always UTF8.
  • Histories are converted to UTF8.
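
The core of such a bulk converter is small. The Python sketch below is not the script that was checked in; it assumes plain topic files matching `*.txt` under a data directory, a hypothetical function name, and it does not touch RCS histories, which would need the same treatment per revision.

```python
from pathlib import Path
from typing import List

def convert_tree(root: str, site_charset: str, pattern: str = "*.txt") -> List[Path]:
    """Re-encode every matching file from site_charset to UTF-8, in place."""
    converted = []
    for path in sorted(Path(root).rglob(pattern)):
        raw = path.read_bytes()
        # Raises UnicodeDecodeError if the file is not valid in site_charset.
        text = raw.decode(site_charset)
        utf8 = text.encode("utf-8")
        if utf8 != raw:              # pure-ASCII files need no rewrite
            path.write_bytes(utf8)
            converted.append(path)
    return converted
```

The sketch is not idempotent for single-byte charsets: running it twice over the same tree would double-encode the files, so a real tool would need to record that the conversion has been done.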

Checked in and waiting for test.

-- CrawfordCurrie - 29 Sep 2011

Crawford forgot to remove his script from the core MANIFEST, which prevented the build from succeeding. Fixed.

-- OlivierRaginel - 11 Oct 2011

See UnicodeSupport

-- PaulHarvey - 15 Nov 2011

Because Sven asked for it in IRC, I attached some patchfiles from our utf8 experiments with fw 1.0.{4|5}. We inserted a new cfg variable, so every patch is surrounded by if ($Foswiki::cfg{UseUTF8}) {

We also started making changes to several extensions (like FormPlugin), but those are less important...

happy digging - if you want to do it...

-- JozefMojzis - 11 Dec 2011

I've pushed an update to https://github.com/cdot/foswiki - unicode branch now runs to completion, no more (insane) memory leaks.

This is after merging unicode branch with all the latest code from trunk, up to distro:204a9cbf3cfa

---++ Module Failure summary
FormattingTests has 1 unexpected results (of 90):
   * F: FormattingTests::test_Item11671
SemiAutomaticTestCaseTests has 1 unexpected results (of 24):
   * F: SemiAutomaticTestCaseTests::test_TestCaseAutoInternalTags
TWikiFuncTests has 1 unexpected results (of 24):
   * F: TWikiFuncTests::test_getExternalResource
ViewFileScriptTests has 1 unexpected results (of 14):
   * F: ViewFileScriptTests::test_simple_textfile
ResponseTests has 2 unexpected results (of 9):
   * F: ResponseTests::test_body
   * F: ResponseTests::test_empty_new
ExpandMacrosTests has 2 unexpected results (of 11):
   * F: ExpandMacrosTests::test_delayedExpansionInline
   * F: ExpandMacrosTests::test_delayedExpansionInlineTypeString
NetTests has 2 unexpected results (of 5):
   * F: NetTests::verify_getExternalResource_Sockets_HTTPResponse
   * F: NetTests::verify_getExternalResource_Sockets_noHTTPResponse
RenderFormTests has 2 unexpected results (of 8):
   * F: RenderFormTests::test_render_for_edit
   * F: RenderFormTests::test_render_formfield_with_form
UIFnCompileTests has 2 unexpected results (of 78):
   * F: UIFnCompileTests::verify_switchboard_function_compare
   * F: UIFnCompileTests::verify_switchboard_function_compareauth
RenameTests has 4 unexpected results (of 29):
   * F: RenameTests::test_referringTopicsThisWeb
   * F: RenameTests::test_renameTopic_find_referring_topics_in_all_webs
   * F: RenameTests::test_renameTopic_new_web_same_topic_name
   * F: RenameTests::test_renameTopic_same_web_new_topic_name
ConfigureTests has 8 unexpected results (of 22):
   * F: ConfigureTests::test_Package_install
   * F: ConfigureTests::test_Package_makeBackup
   * F: ConfigureTests::test_Package_sub_install
   * F: ConfigureTests::test_UI
   * F: ConfigureTests::test_conflict
   * F: ConfigureTests::test_loadpluggables
   * F: ConfigureTests::test_parseSave
   * F: ConfigureTests::test_resection
TableParserTests has 8 unexpected results (of 8):
InitFormTests has 10 unexpected results (of 10):
WysiwygPluginTests has 10 unexpected results (of 18):
EngineTests has 10 unexpected results (of 12):
ExtendedTranslatorTests has 22 unexpected results (of 63):
HTMLValidationTests has 32 unexpected results (of 106):
TranslatorTests has 98 unexpected results (of 390):


2644 of 2868 test cases passed(2640)+failed(4) ok from 2881 total, 13 skipped
0 + 224 = 224 incorrect results from unexpected passes + failures
1..70907

I have added MichaelDaum and JozefMojzis, who earlier expressed interest in helping out or leading the charge for a UTF-8 Foswiki.

-- PaulHarvey - 16 Jun 2012 - 10:24

I won't be able to check this out until after 1.2.0 is out and you've merged to trunk.

-- MichaelDaum - 07 Nov 2012

At the FoswikiCamp2014 we agreed to defer (!) this to 2.0.0. Some day, some day.

-- CrawfordCurrie - 13 Mar 2014

Fully implemented in utf8 branch (off master). Waiting for RM's approval to merge.

-- CrawfordCurrie - 15 May 2015
 


Summary UTF-8 fixes for Foswiki 2.0 (was Foswiki 1.1 but deferred, was Foswiki 1.0 but deferred, was T4.2 but deferred)
ReportedBy TWiki:Main.PeterThoeny
Codebase trunk
SVN Range TWiki-5.0.0, Sun, 09 Mar 2008, build 16496
AppliesTo Engine
Component I18N, Unicode
Priority Urgent
CurrentState Closed
WaitingFor
Checkins distro:96ad6f119b4d Rev 12691 not found Rev 12692 not found Rev 12696 not found Rev 12697 not found distro:390769313963 distro:09e598b486d0 distro:67e1685de277 distro:6325bc09b81e distro:0920c368b336 distro:b01720dd8499 distro:c6c29ea52600 distro:5b39b20f5f70
TargetRelease major
ReleasedIn 2.0.0
CheckinsOnBranches Release01x01 trunk
trunkCheckins distro:96ad6f119b4d Rev 12691 not found Rev 12692 not found Rev 12696 not found Rev 12697 not found distro:390769313963 distro:09e598b486d0 distro:67e1685de277 distro:0920c368b336 distro:b01720dd8499 distro:c6c29ea52600
masterCheckins
ItemBranchCheckins
Release01x01Checkins distro:6325bc09b81e distro:5b39b20f5f70
| *Attachment* | *Size* | *Date* | *Who* | *Comment* |
| fwpatch-core-1.0.5-utf8-v1.04-20090506 | 17 K | 11 Dec 2011 - 12:20 | JozefMojzis | patchfile0 for 1.0.5 - our experiments with utf8 fw |
| fwpatch-core-1.0.5-utf8-v1.04-20090506-fix1-20090617 | 1 K | 11 Dec 2011 - 12:21 | JozefMojzis | patchfile1 for 1.0.5 our experiments with utf8 fw |
| fwpatch-core-1.0.5-utf8-v1.04-20090506-fix1a-20090626 | 2 K | 11 Dec 2011 - 12:21 | JozefMojzis | patchfile1a for 1.0.5 our experiments with utf8 fw |
| fwpatch-core-1.0.5-utf8-v1.04-20090506-fix2-20090303 | 867 bytes | 11 Dec 2011 - 12:21 | JozefMojzis | patchfile2 for 1.0.5 our experiments with utf8 fw |
| fwpatch-core-1.0.5-utf8-v1.04-20090506-fix3-20090806 | 3 K | 11 Dec 2011 - 12:22 | JozefMojzis | patchfile3 for 1.0.5 our experiments with utf8 fw |
| fwpatch-core-1.0.5-utf8-v1.04-20090506-fix4-20090826 | 1009 bytes | 11 Dec 2011 - 12:23 | JozefMojzis | patchfile4 for 1.0.5 our experiments with utf8 fw |
| twikiutf8.diff | 3 K | 31 Mar 2008 - 16:36 | KennethLavrsen | |
Topic revision: r48 - 05 Jul 2015, GeorgeClark