SEARCHes of type word do not work if word is non-English and the wiki is set up for UTF8

This Danish text: Danmarks måske kommende statsminister Lars Løkke Rasmussen er ikke så indlysende en kandidat til posten som for blot et par uger siden, skriver Morgenavisen Jyllands-Posten.

followed by these searches

First regex, which works

Searched: Løkke

Results from Tasks web retrieved at 19:26 (GMT)

Item5529
SEARCHes of type word do not work if word is non English and the wiki is setup for UTF8 This Danish text: Danmarks måske kommende statsminister Lars Løkke Rasmuss...
Number of topics: 1

Then query

Searched: text ~ '*Løkke*'

Results from Tasks web retrieved at 19:26 (GMT)

Item5529
SEARCHes of type word do not work if word is non English and the wiki is setup for UTF8 This Danish text: Danmarks måske kommende statsminister Lars Løkke Rasmuss...
Number of topics: 1

And finally word

Searched: Løkke

Results from Tasks web retrieved at 19:26 (GMT)

Item5529
kommende statsminister Lars Løkke Rasmussen er ikke så
Number of topics: 1

Note that here on Bugs we do not run UTF8. You have to copy the examples to a UTF8 Foswiki.

It also seems that the query does not really work at all with text ~ here on Bugs, which runs iso-8859. It does work on my T42 with utf-8.

-- KennethLavrsen - 13 Apr 2008

The problem is still there after SVN16656.

Regex works. But both word and query searches fail if the word you search for contains non-English characters and Foswiki runs UTF8.

-- KennethLavrsen - 13 Apr 2008

Would this be a simple (workaround) fix: scan for punctuation and whitespace instead of Perl word boundaries?

-- PeterThoeny - 12 May 2008

I do not understand the idea behind this workaround.

We are talking about searching for plain simple words.

If you cannot search for plain words in languages that do not use only A-Z (that is the majority of this world), then the search is in practice totally worthless. This needs to be fixed if people are to be able to use UTF8.

-- KennethLavrsen - 13 May 2008

Search also does not work when the searched word contains a single quote, like Foswiki's.

-- ArthurClemens - 24 May 2008

Query using text ~ "something" does not work with English words either, and it does not work in ISO-8859-1. It seems Query is simply broken now.

-- KennethLavrsen - 26 May 2008

After having fixed 5529 (Sven still needs to check in the fix on SVN) I have been able to debug this one further and I know the exact root cause.

The problem is in lib\Foswiki\Store\SearchAlgorithms\Forking.pm.

The problem only occurs in a search where we are looking for word boundaries, but it is not the \b that is the problem.

These are the code lines

    if ($options->{wordboundaries} ) {
        $searchString = '\b'.quotemeta( $searchString ).'\b';
    }

and the problem is the quotemeta( $searchString ), which screws up the string when it contains Unicode characters.

Crawford, you added this code originally. What is the quotemeta supposed to do? We obviously need to do a similar operation in a different way, but before I just remove the function I need to understand what it is doing and what to watch out for.

-- KennethLavrsen - 30 May 2008

I'm really surprised to hear that quotemeta fails with UTF-8 encoding. quotemeta is a standard Perl function used to escape regular expression meta-characters in the search string. However, when you read the doc in detail, you can see that it is absolute shit. I quote: "all characters not matching /[A-Za-z_0-9]/ will be preceded by a backslash in the returned string, regardless of any locale settings." Note the "regardless of any locale settings" bit, which ensures it won't work for any multibyte character encoding.
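To make the effect concrete, here is a minimal sketch, assuming the search string reaches this code as raw UTF-8 bytes rather than as a decoded Perl character string. The variable names are made up for illustration; only quotemeta and Encode::decode are real.

```perl
use strict;
use warnings;
use Encode ();

# "Løkke" as raw UTF-8 bytes (an assumption about how the search string
# arrives); ø is the two bytes 0xC3 0xB8.
my $bytes = "L\xC3\xB8kke";

# On a byte string, quotemeta puts a backslash in front of every byte that
# is not [A-Za-z_0-9], so the two bytes of ø are escaped separately.
my $quoted_bytes = quotemeta($bytes);

# Decoded to a character string, ø is a single word character and is left alone.
my $chars        = Encode::decode( 'UTF-8', $bytes );
my $quoted_chars = quotemeta($chars);

printf "bytes: length %d before, %d after quotemeta\n",
  length($bytes), length($quoted_bytes);
printf "chars: length %d before, %d after quotemeta\n",
  length($chars), length($quoted_chars);
```

The byte version grows by two backslashes, one inside the multibyte character, which is presumably what the forked grep then fails to match.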

The simplest solution I can think of is to replace quotemeta with a method that actually recognises valid meta grep characters.

    if ($options->{wordboundaries} ) {
        $searchString =~ s#([][|/\\$^*()+{};@?.{}])#\\$1#g; # Can't use quotemeta because $searchString may be UTF8 encoded
        $searchString = '\b'.$searchString.'\b';
    }

If the above code doesn't work, try converting the string to unicode first:

    $searchString = Encode::decode($Foswiki::cfg{Site}{CharSet}, $searchString) if $Foswiki::cfg{Site}{CharSet};

as the first line in the condition block. If this causes a Wide character in print error, then add

    $searchString = Encode::encode($Foswiki::cfg{Site}{CharSet}, $searchString) if $Foswiki::cfg{Site}{CharSet};

as the last line in the condition block.
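As a rough, standalone illustration of the "escape only the metacharacters" idea, the sketch below builds the character class from a plain ASCII list so that no hand-written escaping is needed. The helper name escape_search_meta and the exact list of characters are assumptions for this example, not the code that was checked in, and which characters really count as meta depends on the grep flavour in use.

```perl
use strict;
use warnings;

# ASCII-only list of characters treated as regex metacharacters here
# (an assumed list, mirroring the one in the suggestion above).
# quotemeta is safe on this string because it contains no high-bit bytes.
my $META_CLASS = quotemeta('[]|/\\$^*()+{};@?.');
my $META_RE    = qr/([$META_CLASS])/;

# Hypothetical helper: backslash-escape only the listed metacharacters
# and leave every other byte, including UTF-8 sequences, untouched.
sub escape_search_meta {
    my ($string) = @_;
    $string =~ s/$META_RE/\\$1/g;
    return $string;
}

# Same shape as the condition block discussed above.
my $searchString = escape_search_meta("L\xC3\xB8kke");
$searchString = '\b' . $searchString . '\b';
print length($searchString), "\n";
```

Because only listed characters are touched, a word such as Foswiki's passes through unchanged as well.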

Note that all uses of quotemeta in the code that operate on data that is potentially UTF8-encoded will be similarly affected. I think this problem would "just go away" if Foswiki used unicode strings internally - this is a problem specific to multibyte encodings such as UTF8.

-- CrawfordCurrie - 30 May 2008

Working on this.

Tried the first solution and it works.

Tried the 2nd solution with Encode::decode. It also works. I did not need to use the Encode::encode.

I only tried with my test topic and only in utf-8.

I will try different other searches and combinations before I check in a fix.

For the moment I am mostly keen on the 2nd solution because it seems less of a hack.

I cannot help thinking that the $searchString = Encode::decode($Foswiki::cfg{Site}{CharSet}, $searchString) if $Foswiki::cfg{Site}{CharSet}; operation should happen a lot earlier in the code to prevent other bugs that we have not seen reveal themselves yet.

Something for me to investigate a little further this weekend.

Thanks for following up on my questions Crawford.

-- KennethLavrsen - 31 May 2008

Note that you will have to test with at least one multibyte encoding (e.g. UTF-8) with a multibyte search string, at least one high bit encoding such as iso-8859-1 checking high-bit characters, normal 7-bit ascii, and you should also really test all legal meta-characters in regex searches.
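A table-driven check along those lines might look like the sketch below. It only exercises Perl's own regex engine on decoded strings, not the forked grep that Foswiki actually runs, and the helper word_match and the sample cases are hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: push text and word through the given charset as
# bytes, decode them again, then do a word-boundary match.
sub word_match {
    my ( $charset, $haystack, $needle ) = @_;
    my $text = Encode::decode( $charset, Encode::encode( $charset, $haystack ) );
    my $word = Encode::decode( $charset, Encode::encode( $charset, $needle ) );
    return $text =~ /\b\Q$word\E\b/ ? 1 : 0;
}

# [ charset, text, word, expected result ]
my @cases = (
    [ 'UTF-8',      "Lars L\x{F8}kke Rasmussen", "L\x{F8}kke", 1 ],    # multibyte
    [ 'iso-8859-1', "Lars L\x{F8}kke Rasmussen", "L\x{F8}kke", 1 ],    # high-bit
    [ 'ascii',      'plain seven bit text',      'seven',      1 ],    # 7-bit
    [ 'UTF-8',      'see a.c here',              'a.c',        1 ],    # metachar
    [ 'UTF-8',      'see abc here',              'a.c',        0 ],    # dot is literal
);

my $number = 0;
for my $case (@cases) {
    my ( $charset, $haystack, $needle, $expected ) = @$case;
    my $got = word_match( $charset, $haystack, $needle );
    $number++;
    printf "case %d (%s): %s\n", $number, $charset,
      $got == $expected ? 'ok' : 'NOT ok';
}
```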

-- CrawfordCurrie - 31 May 2008

I tried to Encode the $searchString much earlier. It seems to have a negative effect on the non-word types of search, resulting in search results containing garbage. So it is a bit of a can of worms.

I will continue learning all I can, but we probably have to settle for the fix that targets this particular problem for 4.2.1.

-- KennethLavrsen - 31 May 2008

I decided to go for the solution that does not use Encode. When we later want to change Foswiki to general utf-8, additional hidden Encode conversions can do harm, so the regex substitute is a better short-term solution which will still work when we go utf-8.

-- KennethLavrsen - 02 Jun 2008

Imported from TWikibugs:Item5529 by Babar because the regex is faulty. Created Item8657 to fix this fix.

-- OlivierRaginel - 03 Mar 2010


Summary SEARCHes of type word do not work if word is non-English and the wiki is set up for UTF8
ReportedBy Foswiki:Main.OlivierRaginel
Codebase
SVN Range
AppliesTo Engine
Component
Priority Normal
CurrentState Closed
WaitingFor
Checkins
TargetRelease patch
ReleasedIn 1.0.0
Topic revision: r2 - 03 Mar 2010, OlivierRaginel