Understanding Encodings

Why do we have such a hard time with international character sets? As more of us start to understand how they work, the implications of support for UTF-8 (and other encodings) are becoming more apparent.

Some or all of the following may be wrong; I'm sure you'll correct it silently by editing the text, and adding your name to the contributors, rather than adding discussion.

Primer

Character Sets and Encodings

When talking about character sets, we have to bear in mind that the basic storage element in a file on disk is a byte (8 bits). We have to be able to represent all possible characters in all the world's character sets using bytes alone, so they can be stored in files.

If you put your imagination into gear, you can picture all the characters in all the languages in the world laid out in one big long line. If you then assign an integer code to each character, you have a way of addressing every character with a single integer. One such assignment of integers to characters is unicode. The code for an individual character is called its codepoint. (There are different flavours of unicode, but for simplicity we'll treat it as one concept.)

The ASCII character set defines the first 128 (0..127) of these integer codes. Characters from the ASCII character set are referred to as 7-bit characters because they can all be represented using the lower 7 bits of a byte.

When bit 8 of a byte is set, another 128 characters (128..255) are available within a single byte. These are referred to as the high bit characters. In Western European 8-bit character sets (iso-8859-*), high bit characters are used for accented characters (umlauts, cedillas etc) and some standard symbols. When added to ASCII, this single-byte range covers many Western alphabets.

Some other character sets also use 8-bit characters, but redefine the high-bit codepoints (128..255) to refer to different characters (e.g. koi8-r does this for the Cyrillic alphabet).

Many character sets, notably those from Asia, require many more characters than are available in a single byte. You could of course move to wider characters - 16-bit, 32-bit, 64-bit - but if you do that, you waste an awful lot of memory; a one megabyte, ASCII-encoded document would be four times the size in a 32-bit encoding, without adding very much new information. So instead of fixed-width characters, it makes sense to move to a multi-byte encoding, or character encoding scheme, that lets us map a codepoint to a sequence of one or more bytes.

There are several different multi-byte encodings, but the one we are most interested in is UTF-8. UTF-8 is a variable-width encoding, meaning that single encoded characters can be anything from 1 to 4 bytes in length.

For example, the character represented by the codepoint 32765 is way outside the range that can be represented by a byte. So the number 32765 is encoded into 3 byte values as: 231 191 189. See the Wikipedia article on UTF-8 for all the gory details.

Note that correct interpretation of this encoding depends on our knowing in advance that this string of bytes encodes a single character, rather than three separate characters. There is nothing inherent in the string to say whether these bytes represent character 32765, or the three high-bit characters 231, 191 and 189 - or even some other completely different character in a different encoding. Hence the importance of knowing the precise encoding used for any piece of text. All the common character encodings have been given names (e.g. utf-8) which can be used to label text so you know what encoding was used (though this in itself can create a problem when you don't know what encoding was used to encode the characters "utf-8").
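
To make this concrete, here is a minimal Perl sketch (using the core Encode module; the values are just the example above) of both the encoding step and the ambiguity it creates:

  use Encode qw(encode decode);

  my $char  = chr(32765);                  # one character, codepoint 32765
  my $bytes = encode('UTF-8', $char);      # the three bytes 0xE7 0xBF 0xBD
  printf "%vd\n", $bytes;                  # prints 231.191.189

  # The same bytes mean different things under different encodings:
  my $as_utf8   = decode('UTF-8',      $bytes);   # 1 character (codepoint 32765)
  my $as_latin1 = decode('iso-8859-1', $bytes);   # 3 characters (231, 191, 189)
  printf "%d character vs %d characters\n", length($as_utf8), length($as_latin1);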

Note that a string of bytes that is perfectly legal in one encoding may represent an illegal character in another encoding, and even cause Perl to crash (segfault) when text containing that string is manipulated.

Note also that it's sometimes necessary to make a distinction between the character encoding form, and the character encoding scheme. The form describes how characters are represented as integer values, and the scheme describes how those values are encoded as bytes. A large number of different character encoding forms have been identified, and their unique names registered with the IANA. The character encoding form is usually combined with the character encoding scheme to create a single identifier, the character encoding. For example, iso-8859-1 defines both a character encoding form (latin1) and a character encoding scheme (8-bit). utf-8 defines a character encoding form (unicode) and a character encoding scheme (variable width multi-byte).

Even if we are able to determine the character encoding of a random string of bytes, we still have a final problem. How do we compare two character strings? In theory we would just compare the integers that represent each character. Unfortunately, some encodings use different codepoints for the "same" character. This may sound like an abstract problem, but it bites even in "simple" character encodings. A single character may have multiple codepoints within a single character encoding. Anyone who has tried to paste a Microsoft Word document into HTML is familiar with this problem, as single quote characters get mapped to a high-bit character instead of the usual ASCII character.

So for compatibility, certain codepoints (and sequences of codepoints) in unicode are treated as equivalent. To support matching of these sequences, they can be normalised to a canonical form. There are a number of these normalisation forms. In general you won't need to know about normalisation; it is described here to explain why two apparently identical strings can be sorted differently, depending on the character encoding chosen.
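
For example, the character é can be written either as the single codepoint U+00E9 or as an 'e' followed by the combining acute accent U+0301. A minimal sketch using the core Unicode::Normalize module shows why normalisation matters for comparison:

  use Unicode::Normalize qw(NFC);

  my $precomposed = "\x{00E9}";       # é as one codepoint
  my $decomposed  = "e\x{0301}";      # 'e' plus a combining acute accent

  print $precomposed eq $decomposed           ? "equal\n" : "not equal\n";  # not equal
  print NFC($precomposed) eq NFC($decomposed) ? "equal\n" : "not equal\n";  # equal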

A last twist of the knife is the effect of the locale. The locale affects the collating sequence of codepoints - i.e. the order that results when characters are sorted. It also affects the mapping of case (upper, lower), if such a concept exists in the locale. See #Locales below for more on locales.

So in summary, a string of characters is represented using a list of integer numbers, called codepoints. Each codepoint refers to a single character in a character set, via a mapping called a character encoding form. Visually identical characters may be repeated at several codepoints. Codepoints are reduced to sequences of bytes for ease of use by computers using a character encoding scheme. Together, a character encoding form and a character encoding scheme are called a character encoding (and sometimes - wrongly - a character set). Codepoints can only be compared and sorted after taking into account the locale. When stored in a file, codepoints are converted to bytes by encoding them (e.g. using utf-8).

Browsers

All the protocols used on the web support the specification of the character encoding used in the content being transferred. For text-based media types like HTML, HTTP defaults to an encoding of ISO-8859-1 (RFC:2616, section 3.7.1). A different encoding needs to be explicitly specified in the HTTP Content-Type header. Other protocols (e.g. MIME) have equivalent headers.

Note that the encoding cannot be reliably specified within the content, because applications (like web browsers) need to know the encoding in advance to understand the content. Most browsers accept a meta element in HTML to specify a nonstandard encoding, but this only works if the encoding agrees with ASCII (and hence ISO-8859-1) for the characters <, m, e, t, a, and so on.

Computer languages may have their own defaults for encoding. For example, XML, and by inference XHTML, defines UTF-8 as the default encoding.

In any case, relying on defaults is not very useful, which is why Foswiki specifies the character set in the HTTP Content-Type header. This helps ensure that the browser makes the right decision rather than guessing.
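
For instance, a CGI script can label its output like this (a minimal sketch using the CGI module; the charset value is just an example):

  use CGI;

  my $q = CGI->new;
  # Emits "Content-Type: text/html; charset=iso-8859-1" so the browser
  # does not have to guess the encoding of the body.
  print $q->header(-type => 'text/html', -charset => 'iso-8859-1');
  print "<html><body>...</body></html>\n";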

URLs

URLs are generally submitted in an HTTP request (GET, POST, etc) in one of two formats, both of them URL-encoded. The URL format used is completely dependent on the browser, and is independent of the web server or the web page's character encoding:
  • UTF-8 - this is used regardless of the web page's encoding.
  • Site charset (e.g. ISO-8859-1, KOI8-R for Russian, etc) - this is the default in older browsers such as Lynx, and also in Firefox 2.0 and earlier.
In current (<= 1.1.4) versions of Foswiki, all incoming URLs are mapped back to the site charset. Foswiki must dynamically detect the character encoding of the URL. While detecting charset encodings is a hard problem in general, it's not too hard to discriminate between UTF-8 and 'other', as UTF-8 has a distinctive format. Hence the code uses a simple test (a regular expression) to detect valid UTF-8, ensuring security by rejecting 'overlong' UTF-8 encodings.
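
One way to make that discrimination (a sketch under stated assumptions, not the actual Foswiki code) is to attempt a strict UTF-8 decode and fall back to the site charset when it fails; strict decoding also rejects the 'overlong' forms discussed under Security below:

  use Encode qw(decode FB_CROAK);

  # $bytes is a URL-decoded parameter; $site_charset might be 'iso-8859-1'.
  sub decode_url_param {
      my ($bytes, $site_charset) = @_;
      my $copy  = $bytes;                            # decode() may modify its input
      my $chars = eval { decode('UTF-8', $copy, FB_CROAK) };
      return $chars if defined $chars;               # it was valid (strict) UTF-8
      return decode($site_charset, $bytes);          # otherwise assume the site charset
  }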

Forms

Internally, most browsers represent the characters being displayed using unicode. When a user types into an input field, those characters are captured as unicode characters. All JavaScript operations on characters assume unicode.

When displaying the web page, the browser will either:
  • heuristically guess the encoding for the data, from the Content-Type header, meta tags, and the form tag (which may have an accept-charset attribute), or
  • use the encoding explicitly configured by the user (generally when the heuristics are wrong).
When an HTML form is submitted, the data for that form is generally submitted using the encoding specified for the web page. However, there are no standards in this area (see InternationalisationUTF8 for links), so the browser could (in theory) do whatever it wants. In practice the browser can usually be relied on to convert its internal Unicode characters to the web page's encoding before sending them as POST or GET data to the web server.

For example, let's say the server generates a web page containing a form, and sets the encoding in that page to iso-8859-15. A € symbol is represented in iso-8859-15 by code point 0xA4. When the page is loaded, the browser maps this to the unicode code point 0x20AC, which is the correct unicode codepoint for a euro character. If the user then types a euro character in an input field, it is the unicode codepoint that is captured by the browser. However when the form is submitted, the browser silently converts that codepoint back to 0xA4 before sending the data to the server, because that was the encoding on the web page.

Thus a server using forms can assume that if it imposes a character encoding on a web page, then form data returned to it will also use that encoding.
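
Continuing the euro example, the server-side half of that assumption looks something like this sketch (the byte value is the iso-8859-15 euro; Encode is a core module):

  use Encode qw(decode);

  # The page containing the form was served as iso-8859-15, so the submitted
  # bytes are assumed to use that encoding too.  0xA4 is the euro in iso-8859-15.
  my $posted_bytes = "\xA4";                       # as received in the POST body
  my $text         = decode('iso-8859-15', $posted_bytes);
  printf "U+%04X\n", ord($text);                   # prints U+20AC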

Since the Foswiki raw editor defines the encoding in the page and uses a textarea in a form, this means that the site character set appears to be used throughout.

XmlHttpRequest (XHR)

Unfortunately nothing is ever that simple. If you use XHR from Javascript to send the same form data, then it will always send it UTF-8 encoded, irrespective of the encoding specified for the web page. Thus our euro symbol gets sent as the UTF-8 encoding of codepoint 0x20AC. It is left to the server to know that this 0x20AC should be mapped back to 0xA4 (or to whatever the site charset uses for that character, if anything).

Fortunately there are only a few characters that need to be mapped this way. The table at http://www.alanwood.net/demos/ansi.html shows the mapping from unicode to Windows-1252, which is one Microsoft ANSI character set (though it varies from the ISO-8859-* standards). This character set considerably overlaps code points with iso-8859-1 but is not identical. Any server that uses iso-8859-1 has to know that these characters (128..159) will appear as unicode code points in data submitted via XHR, and convert them back to the equivalent iso-8859-1 code points if necessary.
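
One possible shape for that server-side conversion (a sketch, not what Foswiki actually does; FB_HTMLCREF is Encode's fall-back that emits numeric entities for unrepresentable characters):

  use Encode qw(decode encode FB_HTMLCREF);

  # XHR always sends UTF-8, whatever the page charset was.
  my $utf8_bytes = "\xE2\x82\xAC";                 # UTF-8 for the euro, U+20AC
  my $chars      = decode('UTF-8', $utf8_bytes);   # one character, U+20AC

  # Re-encode into the site charset.  The euro has no slot in iso-8859-1,
  # so here it becomes an HTML numeric entity instead of being mangled.
  my $site_bytes = encode('iso-8859-1', $chars, FB_HTMLCREF);
  print $site_bytes, "\n";                         # prints &#8364;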

Security

Since UTF-8 provides another way to get data into the server, it provides the potential for new security holes. In particular, it's possible to construct 'overlong' encodings that are illegal UTF-8, but which a careless decoder will interpret as if they were the equivalent shorter encoding.
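
For example, the two-byte sequence 0xC0 0xAF is an overlong (and therefore illegal) encoding of '/', a classic trick for smuggling path separators past naive filters. Perl's strict 'UTF-8' decoder rejects it (a minimal sketch):

  use Encode qw(decode FB_CROAK);

  my $overlong = "\xC0\xAF";                           # overlong encoding of '/'
  my $ok = eval { decode('UTF-8', $overlong, FB_CROAK); 1 };
  print $ok ? "accepted\n" : "rejected: $@";           # rejected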

Perl

Starting from Perl 5.6.0, Perl was able to handle unicode characters, but the unicode support has changed (improved) significantly with almost every version of Perl since. Perl will silently do some magic, but there are limitations.
  • When reading from an external source like a file or a socket, Perl has no knowledge of the encoding used in the data it receives. As long as Perl just moves the data around, this does not matter at all, because it is just shifting the encoded bytes. However problems can arise if:
    • Perl is forced to interpret data as one encoding when it is in fact encoded using a different encoding,
    • data using one encoding are compared with data using another encoding, or
    • Perl is asked to output multibyte unicode characters to output streams which have not been set up for the correct encoding ("Wide character in print" is the typical warning).
So, it is the job of the "environment" to know how data are encoded. Tough luck if all you have is a bunch of files on a disk: all you can do is either make assumptions or intelligent guesses. (Note that utf8::is_utf8 only reports whether Perl's internal UTF8 flag is set on a string already in memory - it says nothing about bytes read from a file; for guessing you need something like Encode::Guess, or a trial strict UTF-8 decode.)
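
A sketch of the last of those three problems, and its usual fix (binmode with an encoding layer is a core Perl feature):

  my $text = "\x{20AC}100";                  # a character string containing the euro

  print $text, "\n";                         # warns: "Wide character in print"

  # Tell Perl how the output stream should be encoded and the warning goes away:
  binmode(STDOUT, ':encoding(UTF-8)');
  print $text, "\n";                         # the euro is written as UTF-8 bytes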

Perl has extensive (scary) documentation about its unicode support in its perluniintro and perlunicode man pages. For future guidance to Foswiki development, two documents shipped with 5.10 (but applicable to current versions of Perl) are helpful: a tutorial, perlunitut, and a FAQ, perlunifaq.

Perl also has a number of modules that can help with character set conversions.
  • Encode supports transforming strings between the Perl internal unicode representation and the various byte encodings (though be warned that it doesn't handle remapping out-of-range characters, such as the euro example discussed above).
  • HTML::Entities can be used to convert characters to a 7-bit representation (though users should be aware that decoding HTML entities always decodes back to unicode).
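
A sketch of both modules in action (the sample text is arbitrary):

  use Encode qw(decode encode);
  use HTML::Entities qw(encode_entities decode_entities);

  # Encode: bytes on the outside, characters on the inside.
  my $chars = decode('iso-8859-1', "caf\xE9");     # "café" as characters
  my $bytes = encode('UTF-8', $chars);             # "caf\xC3\xA9" as bytes

  # HTML::Entities: a 7-bit representation for transport in HTML ...
  my $html = encode_entities($chars);              # "caf&eacute;"
  # ... and decoding brings back unicode characters, not site-charset bytes.
  my $back = decode_entities($html);               # "café" as characters again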

Locales

Another important concept to get under your belt is the locale. http://en.wikipedia.org/wiki/Locale provides a good definition of a locale, but what does this mean to Foswiki?

Foswiki makes heavy use of regular expressions for analysing the text in your topics. Among the things that Foswiki uses regular expressions for are:
  • distinguishing text and punctuation
  • distinguishing upper and lower case (for CamelCase) and converting between them
  • identifying numbers

You can tell Foswiki to {UseLocale}, which will cause it to access the locale definitions on the server when it has to perform any of these operations ({Site}{LocaleRegexes} provides finer-grained control over the use of locale regexes).

A second effect of {UseLocale} is that it overrides the LC_CTYPE and LC_COLLATE settings defined on the server with the Foswiki {Site}{Locale} setting. This in turn affects programs such as grep and rcs, which are called by Foswiki.
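
A minimal sketch of the kind of effect this has (the locale name is only an example and must actually be installed on the server - check the output of locale -a):

  use POSIX qw(setlocale LC_ALL);
  use locale;

  setlocale(LC_ALL, 'de_DE.ISO-8859-1')
      or warn "locale not available on this server\n";

  print uc("stra\xDFe \xE4rger"), "\n";          # case mapping follows LC_CTYPE
  my @words = ("Zebra", "\xC4pfel", "Apfel");    # 0xC4 is Ä in iso-8859-1
  print join(' ', sort @words), "\n";            # sort order follows LC_COLLATE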

Foswiki

Foswiki assumes that:
  1. All topics (names and content) on a Foswiki site will use the same character encoding,
  2. Once you have selected an encoding, you will never change to a different character encoding,
  3. If you don't select an encoding, you are going to be happy with iso-8859-1,
  4. All tools used on topics support all possible character encodings,
  5. The same character encoding is used for storing topics on disk as for transport via HTTP.

These assumptions are inherent in the fact that Foswiki uses global settings (in configure) to determine the encoding to use.

Note that Foswiki is known to be broken for XmlHttpRequests due to point 5; Foswiki tries to decode the UTF8-encoded requests sent by XHR using the site character set.

Foswiki provides the encoding, as given by the configure setting, to the browser in both a Content-Type header and a meta element. However it does not use unicode internally. Foswiki instead relies on the existence of predefined character classes that match the byte sequences representing "interesting" characters. Thus the use of Perl functions such as lc, uc and sort is fraught with risk; it is up to the code to ensure that the correct collating sequence is used.

ALERT! Coders note: while setting the Foswiki site character set will tell Foswiki to interpret byte data (filenames and content) as a specific character set, it will not tell Perl's built-in string functions about it. So, for example, uc() and lc() will operate on data as if it were encoded in iso-8859-1, and in practice only the ASCII range is handled reliably. So long as you are only working with ASCII characters this should be no problem, but don't expect any character-set-sensitive functions to work outside the ASCII range unless you force Perl to use a different character set, e.g. by setting LC_CTYPE. See the perl unicode documentation for more detail.
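
A sketch of the pitfall (byte values are iso-8859-1 examples; the behaviour shown is for a plain Perl script without use locale or the unicode_strings feature):

  use Encode qw(decode);

  # 0xE9 is é in iso-8859-1.  On a raw byte string Perl does not know it is
  # a letter, so uc() leaves it alone.
  my $bytes = "caf\xE9";
  print uc($bytes), "\n";                          # "CAF\xE9" - the é is untouched

  # Once the bytes are decoded to characters, the unicode case mapping applies:
  print uc(decode('iso-8859-1', $bytes)), "\n";    # "CAFÉ"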

{Site}{Locale}

Locales should be largely irrelevant to Foswiki's use of encodings. However, you still need to consider them because locales specify national collation orders (and case mappings), which differ from locale to locale.

Note that locales were found to create very interesting bugs when combined with utf8 mode in older versions of Perl, hence Richard decided to abandon trying to combine locales with Perl's utf8 mode in earlier work on unicode support.

{Site}{CharSet}

WORK IN PROGRESS - WHY DO WE STILL DISCOURAGE UTF-8 ?

This is the critical setting for deciding what encoding Foswiki is going to use. You are strongly discouraged from setting this to utf-8 when you first install Foswiki for most sites (the only exceptions are sites using languages such as Japanese and Chinese where you don't care about WikiWords, or sites where you must mix languages with different 8-bit character sets). Note: you must specify utf-8 and not utf8, which doesn't work on Windows operating systems.

  • RD: Complete change above - utf-8 is NOT recommended for most current Foswiki installations until UseUTF8 / UnicodeSupport is implemented. See InstallationWithI18N for more details.
  • RD: Actually this is intended to define only the browser charset (in original code) with the locale defining the charset/encoding that Foswiki (and grep) uses internally on server, and of course they should match. I think this is still the case.
  • RD: This was originally derived from the locale if the CharSet was empty, and it would be much better if it had remained that way, as there's no point having this differ from the encoding in 99% of cases. I only created this setting in order to cover cases where the spelling of the encoding differs between the configured locales on the server (locale -a output) and the browser - e.g. where it's utf8 on server and utf-8 in browser. Some such cases are covered in the code. This caused quite a lot of confusion, but in 99% of the cases, at least in the original Feb 2003 code, this setting was not needed - a later change broke the derivation of CharSet from locale setting. See InstallationWithI18N for some more detail here.

Issues, problems and directions

There are a number of major problems with encoding support in Foswiki. It has to be borne in mind that the current code was never intended to support Unicode, so we shouldn't be surprised that there are problems. In addition, lack of regression testing has resulted in some code rot over the years.
  1. Foswiki is difficult to set up for a specific encoding. The guides are inadequate.
  2. The default encoding, iso-8859-1, was the right choice in 2003; it's used by the vast majority of European languages, UTF-8 is much more complex, and there was no way we could go straight to Unicode, due to time available and the immature state of Unicode in Perl.
  3. It was never intended for there to be any conversion to/from character sets within TWiki/Foswiki (with the exception of the later EncodeURLsWithUTF8 feature), i.e. Foswiki simply works in one character set for files, internal processing and web browser interactions, whether this is ISO-8859-1, KOI8-R (Russian) or EUC-JP (Japanese). However, later developers have tried to convert to and from site character sets, for reasons that are not clear but may include partial support for UTF-8 - this has resulted in a lot of broken code.
  4. Foswiki doesn't remember the encoding used for a topic, which means that you can't simply trade topics between Foswiki installations that might be using different character encodings.

    If you try to read a topic written on an ISO-8859-1 site on a site that uses UTF-8, Foswiki (or rather Perl) may crash (actually, it just exits with an uncaught Perl error). If you try to use a client-side tool (such as a WYSIWYG editor) without telling it the correct encoding, the tool may crash (which is the tool's fault). A crash isn't guaranteed because only a few 8-bit characters are valid as the first byte in a UTF-8 sequence, and this very pseudo-randomness has been the cause of many mysterious bugs.

    I think this is a non-problem although it would be nice to include the encoding in the topic metadata. Crashes aren't a big surprise given that it was never a goal to swap topic data between installations like this. Foswiki was only intended to work in a single character set - if the Foswiki administrator wants to swap topics they are expected to convert pathnames and file contents to the right character sets.

  5. Foswiki support for character sets has been focused on using a single site character encoding since it was first introduced in Feb 2003, not on any conversion to/from other encodings. The more recent introduction of WYSIWYG editors such as TinyMCE, which work natively in Unicode, has somewhat broken this assumption. This is no problem if the site encoding can represent Unicode (although UTF-8 on the server is somewhat broken as mentioned), but the content will be silently mangled if it can't. Thus if you use Unicode characters in a TinyMCE session and try to save them on a site that doesn't support Unicode, the characters may not be saved correctly.
  6. Plugin authors don't know (or don't care) about character encodings, and some plugins can damage encoded characters, mis-interpret encodings, or other such nasties. This won't normally happen if they operate exclusively via the Func interface, but if they get content from elsewhere, it can fail. This won't go away without some concerted effort on adopting InternationalisationGuidelines and helping plugin developers somehow. Use of the default [a-z] type regexes will break with Unicode just as much as with ISO-8859-1.
So, what can be done about all this? Well, there are a number of possible approaches:
  1. Only store HTML entities. HTML has its own encoding for characters that are outside standard ASCII. Integer entities such as 翽 can be used to normalise all content to 7-bit ASCII, removing the requirement for a site character set.
    • ISO8859-1 plus Entities would be crash-proof, but require search patterns in many places to be entity-escaped where they are not today.
    • RD: This would break quite a few things such as server-side search engines that index the topic files directly - they may well support UTF-8 but not Entities.
  2. Fix the guides and the code. There are a lot of places where Foswiki can break due to character set differences, but these can probably be tested for with no more than a doubling in the number of unit tests.
    • RD: Not a good option, makes everything messier.
  3. Standardise on UTF-8. If all topics were stored using a UTF-8 encoding, then all characters in unicode would be available all of the time.
    • Would need a (batch or silently behind the scenes?) conversion from legacy charsets to UTF-8 to avoid crashes (a sketch of such a conversion is given after this list)
    • RD: Agree with this as the core approach, but we need to consider whether we provide backward compatibility at all. (See my comments on UnicodeProblemsAndSolutionCandidates)
  4. Record the encoding used in a topic, in meta-data, to allow topics to be moved.
    • This can only be done reliably for new topics.
    • RD: This is a good idea in addition to moving to UTF-8 as the default. If we are supporting backward compatibility i.e. still allow a native ISO-8859-15 site, then it would be very useful to have this in the metadata. Getting a canonical name for encoding may be an issue - check the IANA list.
AFAICT from looking at other wiki implementations, standardisation on UTF-8 is the most popular approach.
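
To make option 3 a little more concrete, here is a sketch of what a batch conversion could look like; the file layout, source charset and the idea of keeping a backup copy are illustrative assumptions, not an agreed design:

  use Encode qw(decode encode FB_CROAK);
  use File::Copy qw(move);

  # Convert one topic file from a legacy charset to UTF-8 (illustrative only).
  sub convert_topic {
      my ($path, $from_charset) = @_;          # e.g. ('WebHome.txt', 'iso-8859-1')

      open my $in, '<:raw', $path or die "read $path: $!";
      my $bytes = do { local $/; <$in> };
      close $in;

      # Skip files that are already valid (strict) UTF-8, so the conversion
      # can safely be run more than once.
      my $copy = $bytes;
      return if eval { decode('UTF-8', $copy, FB_CROAK); 1 };

      my $utf8 = encode('UTF-8', decode($from_charset, $bytes));

      move($path, "$path.pre-utf8") or die "backup $path: $!";
      open my $out, '>:raw', $path or die "write $path: $!";
      print {$out} $utf8;
      close $out or die "close $path: $!";
  }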

See http://develop.twiki.org/~twiki4/cgi-bin/view/LitterTray/TestEncodings for a script that should help developers fighting with these problems.

Resources

Please add any useful external links on this topic here - InternationalisationUTF8, UnicodeSupport and InternationalisationEnhancements have some good links. -- Contributors: CrawfordCurrie, HaraldJoerg, RichardDonkin

Discussion

I could not find a place where TWiki tries to convert from the site charset to UTF-8. Standardising on UTF-8 would seem like a good idea, but would need a (batch or silently behind the scenes?) conversion from legacy charsets to UTF-8 to avoid crashes. ISO8859-1 plus Entities would be crash-proof, but require search patterns in many places to be entity-escaped where they are not today. And finally, recording the encoding in a topic can only be done reliably for new topics.

-- HaraldJoerg - 09 Apr 2008

Good content. I added related links at the bottom. It might be better to update InternationalisationGuidelines instead of creating this topic?

-- PeterThoeny - 09 Apr 2008

It is the most pressing problem in TWiki. Many people have moved from TWiki to MediaWiki for exactly this reason. It seems to be the right choice to recode all existing topics to utf-8 and forget forever about storing topics in other encodings. The problem is that user names, wikiwords in utf8 and generated internal links should be well tested beforehand and all the bugs eliminated. It is really hard work.

-- SergejZnamenskij - 09 Apr 2008

This started out as a blog post capturing my learning about encodings, but it turned into something more over time. InternationalisationGuidelines is a cookbook, and should be updated when we have decided how to handle encodings in the future. For now, I changed this to a brainstorming topic.

An on-the-fly conversion to UTF-8 would work by assuming the {Site}{CharSet} is the encoding for topics if the contents are not valid UTF-8 strings. There is an existing regex that appears to be designed to detect UTF-8 strings, $regex{validUtf8StringRegex}, though such a test may be expensive.

The trick is (I think) to make TWiki normalise all strings as soon as possible, so that internal strings are always perl strings, irrespective of the encoding used in the source. The store should take care of converting the encoding on topic content and topic names. Because the encoding always accompanies an HTTP request, it should be possible to decode URL parameters at point of entry too - if this isn't already done in CGI. Encoded byte strings should never be allowed to bleed into the core. Testing is then a case of throwing encoded strings at TWiki from all angles, and making sure they all get converted to perl strings correctly.

One thing that bugs me is that RichardDonkin has alluded to performance issues, I think related to non-ISO8859, and we have to understand these issues before making any decisions regarding encodings. There's also been the implication that UTF-8 support in perl is incomplete; again, I'm not clear if this is still the case.

-- CrawfordCurrie - 09 Apr 2008

The more I think about it, the more I am convinced that the only sensible approach is to use UTF-8 exclusively in the core. Accordingly I am raising UseUTF8 as a feature proposal.

-- CrawfordCurrie - 12 Apr 2008

I'm VERY glad that SergejZnamenskij voiced his opinion. While I strongly believe there are many who are silent with the same opinion, I think it's a good moment for decision-makers to seriously consider the future of TWiki. I remember many months ago, many were afraid of UTF-8 simply because it might cause breakage to existing setups. I urge these people to seriously consider and evaluate the implications of moving to utf-8 by setting up a test environment and reporting the issues.

Sure, TWiki's internals are not 100% UTF-8, but we must start somewhere.

That said, many thanks to Crawford for embarking on this (very challenging) journey!

-- KwangErnLiew - 25 May 2008

I think migrating TWiki to UTF-8 support only is the right way to go.

It ensures TWiki will work in all languages. It gives ONE localization config to test.

But there is one TASK involved in this. If you take topics created in a locale like iso-8859-X then those topics will be pure garbage when viewed in utf-8 unless they are purely English.

So we need to design the right migration code from anything to utf-8.

You cannot search in a mix of utf-8 and non-utf-8 content. At least not in practice. This means that most twiki applications that are using formatted searches will break unless you do a total conversion of topics from non-utf-8 to utf-8.

Nothing prevents us from implementing something that does that. We have our configure in which we can build in service tasks including walking through all topics and convert anything that is not utf-8.

It can be tricky to know if something has been converted once. You do not want to convert the same file twice! But with the META:TOPICINFO format we can control this in a safe way, so a topic is only converted once and so that you can run a conversion again when needed.

But it is the right thing to do. Everything in utf-8 starting from TWiki 5.0.

For 4.2.1 getting a decent utf-8 is the best we can get. We are nearly there with the great work done by Crawford.

Item5529 and Item5566 are the few bugs that are still not closed that are preventing us from claiming utf-8 support. And both seem to be a matter of perl not knowing that specific variables contain utf-8 strings. Probably simple to fix but complicated to debug. Everyone can participate in analysing these two bugs.

-- KennethLavrsen - 26 May 2008

I wrote an outline plan in UseUTF8 some time ago. Might be an idea to focus specifics, such as online/offline/on-the-fly conversion, in that feature request.

-- CrawfordCurrie - 27 May 2008

Another interesting topic that I missed! Glad to see some links to InternationalisationEnhancements etc to avoid re-inventing the wheel.

I don't think it's a matter of closing a couple of bugs before we get UTF-8 support. I think UseUTF8 and my comments on UnicodeProblemsAndSolutionCandidates need some consideration before this is started, as they cover topics such as migration of topic/filename data.

I agree this should be done in a major version, and ideally on a separate branch if we are breaking backward compatibility with pre-Unicode character encodings such as ISO-8859-1.

Many changes above in various places, check the diffs. Windows-1252 is not quite the same as ISO-8859-1, in characters that are sometimes used - Google:demoronizer+iso has some links on this. Also fixed the bug where you said that ISO-8859-1 includes the Euro - this is only in -15.

Setting {Site}{Locale} to utf-8 on current TWiki versions only half works and is really only for sites that don't require WikiWords (e.g. Japanese, Chinese, etc) but no good for European languages that want accents to be included in WikiWords. So I don't agree with that recommendation above, and this also conflicts with the advice in InstallationWithI18N.

I also disagree with some of the issues analysis above - ISO-8859-* is and was a good choice for pre-Unicode encodings, since Perl in 2003 wasn't ready for TWiki to go UTF-8.

On performance - I saw a 3 times slowdown when I ran my own TWiki in Unicode mode (i.e. enabling Perl utf8 mode not just handling utf8 as bytes), which is pretty large. I suggest we consider an ASCII only mode for performance on English-only sites unless Perl has dramatically improved recently (or maybe hardware is faster these days, but on hosting sites CPU is still scarce and I don't think anyone wants a big slowdown...) This means some real optimisation after getting the basic UnicodeSupport going, focusing on the key bottlenecks for Unicode processing. Given that the system is processing 3 times as many bytes in some cases, it's reasonable to have a slowdown on those bytes, but it seems that Unicode makes Perl operations go slower even on the characters that are just ASCII. Needs some investigation.

-- RichardDonkin - 14 Jun 2008

I refactored most of your comments into the text, to try and keep the flow of the document. Richard, you are the TWiki God of Encodings, so I for one take what you write as gospel. smile

-- CrawfordCurrie - 14 Jun 2008

I have added quite a lot more material above, including a section on URLs, and completely reversed your recommendation to use UTF-8 in current TWiki versions - this breaks many things and is a bad idea. It's only a good idea if you like a broken TWiki!

Also made quite a few additions above under {Site}{Locale} and {Site}{CharSet} to try and clear up a few things.

I now understand better where this is coming from, i.e. the fact that TinyMCE uses Unicode internally and so it's painful to convert to/from the site character set.

-- RichardDonkin - 15 Jun 2008

It's worse than that. Browsers use Unicode internally in the sense that Javascript always uses unicode. So this isn't a problem just for Legacy.TinyMCE, it's a problem for all client-side applications as well. AJAX nicely sidesteps the problem because XML is UTF8 encoded by default, but if you are not using AJAX - and most TWiki authors aren't - then it's a serious problem.

-- CrawfordCurrie - 15 Jun 2008

Good point about client-side apps generally; I was mostly thinking about server-side as is traditional TWiki, but clearly AJAX and so on are increasingly demanded.

On $regex{validUtf8StringRegex} this is only used in EncodeURLsWithUTF8 (described in URL section above) to determine if a string is truly valid UTF8 (avoiding overlong encodings in UTF8 which can be used in security exploits) or the site charset. Not very efficient on a large amount of data but not too bad on a small amount - e.g. you might try searching for the first high-bit-set byte in a page, grabbing the next 50 bytes, and using that as a heuristic. Since UTF-8 has a very distinctive encoding it's turned out to be quite reliable, and it's based on what an IBM mainframe web server does - see InternationalisationUTF8 for background.

-- RichardDonkin - 15 Jun 2008

Don't know where else to place this: are all regular expressions used by TWiki Unicode Regular Expressions, or should they be?

-- FranzJosefGigler - 20 Oct 2008

Yes and no. The core regexes are compiled to reflect the site character set, so if you select UTF8 on your site, you will get unicode regexes. If we move to unicode in the core, then this complexity disappears.

-- CrawfordCurrie - 24 Oct 2008

Correction; if you select UTF8 on your site, you will get utf8 regexes, not unicode regexes.

-- CrawfordCurrie - 29 Nov 2011