Feature Proposal: Agree on a minimum syntax for regexes

Motivation

Since time began, Foswiki has been forced to "support" two different regular expression syntaxes, viz. GNU ERE (as used by grep) and Perl. These syntaxes are largely compatible, but there are some key differences between them:
  1. The word matching in searches uses the GNU ERE-specific \< and \> to match word ends. NativeSearchContrib maps these to \b, and this seems to work.
  2. GNU ERE does not support non-greedy matching (*?)
  3. GNU ERE only supports a subset of the Perl regular expression syntax.
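
The mapping described in point 1 can be sketched roughly as follows. This is an illustration only, written in Python rather than Foswiki's Perl, and it is not NativeSearchContrib's actual code; the function name map_word_anchors is hypothetical. Note that \b matches at either end of a word, so it is slightly looser than \< or \> on their own.

```python
import re

# Matches any backslash-escaped character, so an escaped backslash followed
# by a literal "<" is consumed as one unit and never mistaken for \<.
_ESCAPE = re.compile(r"\\(.)", re.DOTALL)


def map_word_anchors(pattern: str) -> str:
    """Replace GNU-style \\< and \\> with the Perl-style \\b (hypothetical helper)."""
    def swap(match):
        # Only the two GNU word anchors are rewritten; every other escape
        # sequence is passed through unchanged.
        return r"\b" if match.group(1) in "<>" else match.group(0)

    return _ESCAPE.sub(swap, pattern)


print(map_word_anchors(r"\<wiki\>"))
```

The sketch does not special-case character classes: inside [...] a \b means backspace in Perl, so a pattern such as [\<] would need separate handling.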

However, there is no good reason for this. GNU grep has a -P option that makes it accept Perl regular expressions. If a platform doesn't have a capable grep (such as older Ubuntus), then NativeSearchContrib (which uses PCRE) can be used instead.

So the proposal is to change the grep invocation to grep -P, and recode the search algorithms to use \b rather than \< and \>.
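
As a rough idea of what a \b-based word search looks like, here is a minimal sketch in Python (chosen for illustration only; the core is Perl, and word_pattern is a hypothetical name, not a Foswiki function). It assumes the search term begins and ends with a word character.

```python
import re


def word_pattern(word: str) -> re.Pattern:
    """Build a whole-word pattern using \\b on both sides (hypothetical helper)."""
    # re.escape keeps any metacharacters in the search term literal.
    return re.compile(r"\b" + re.escape(word) + r"\b")


pattern = word_pattern("wiki")
print(bool(pattern.search("the wiki page")))  # "wiki" stands alone as a word
print(bool(pattern.search("Foswiki")))        # "wiki" is only part of a word
```

If the term starts or ends with a non-word character (for example "C++"), \b no longer marks the edge of the term, so such terms would need a different boundary test.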

Note that this impacts:
  • Anyone who uses %SEARCH{type="regex"}%
  • The new type="query" "match" operator
  • Any extension developer who has used the internal search functions (usually by calling Foswiki::Func::expandCommonVariables with a %SEARCH string) in regex mode.

The main user-visible change will be the dropping of the \< and \> operators.

Implementation

  • Change $Foswiki::cfg{EgrepCmd} to grep -P
  • Eliminate usage of GNU ERE-specific syntax in REs in the core
  • I think the strategic position we need to take is as follows (this is a statement for developers; a simpler version for end users is required):

"Foswiki strives to support the rich Perl regular expression syntax wherever regular expressions are required. However, because Foswiki has to interface with third-party tools and libraries, it is not always possible to support all the features of Perl regular expressions in all places. Anyone who implements an interface to such a third-party tool will make every effort to map all the functionality of Perl regular expressions to the tool. However, it will not always be possible to support everything, so the following list gives the features of regular expressions that must be available. The features are chosen from those described in http://www.regular-expressions.info/refflavors.html, which compares the regular expression support provided in several important environments.
  • Backslash escapes one metacharacter
  • \x00 through \xFF (ASCII character)
  • \n (LF), \r (CR) and \t (tab)
  • [abc] character class
  • [^abc] negated character class
  • [a-z] character class range
  • \d shorthand for digits
  • \w shorthand for word characters
  • \s shorthand for whitespace
  • \D, \W and \S shorthand negated character classes
  • . (dot; any character except line break)
  • ^ (start of string/line)
  • $ (end of string/line)
  • \b (at the beginning or end of a word)
  • \< \> - not Perl-compatible, but their use is deeply embedded in the core
  • | (alternation)
  • ? (0 or 1)
  • * (0 or more)
  • + (1 or more)
  • {n} (exactly n)
  • {n,m} (between n and m)
  • {n,} (n or more)

In the event that an external tool supports regular expression syntax that is not compatible with Perl, the calling code must defuse the regex feature that is not Perl-compatible. This may result in some loss of functionality, but is necessary to avoid confusing users."
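
One way to use the list above is to check an incoming regex against it before handing the regex to an external tool. The following is a minimal sketch under stated assumptions: it is Python for illustration only (Foswiki is Perl), the name unsupported_features is hypothetical, and it only flags a few obvious constructs outside the list (unlisted escapes, (?...) groups, lazy or possessive quantifiers). It is not a full regex parser.

```python
# Letters that may follow a backslash according to the list above:
# \d \w \s \D \W \S \n \r \t \b and \xHH. Escaped punctuation, including
# the tolerated \< and \>, is always accepted.
_ALLOWED_ESCAPE_LETTERS = set("dwsDWSnrtbx")


def unsupported_features(pattern: str) -> list:
    """Return descriptions of constructs outside the minimum syntax."""
    problems = []
    i, n = 0, len(pattern)
    in_class = False
    while i < n:
        ch = pattern[i]
        nxt = pattern[i + 1] if i + 1 < n else ""
        if ch == "\\":
            # Flags backreferences (\1), \A, \z, \B and similar.
            if nxt.isalnum() and nxt not in _ALLOWED_ESCAPE_LETTERS:
                problems.append(f"escape \\{nxt} at offset {i}")
            i += 2
            continue
        if in_class:
            # Quantifier characters are literals inside [...].
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
            # A "]" directly after "[" or "[^" is a literal; step over it.
            j = i + 1
            if pattern[j:j + 1] == "^":
                j += 1
            if pattern[j:j + 1] == "]":
                j += 1
            i = j
            continue
        elif ch == "(" and nxt == "?":
            problems.append(f"extended group (?...) at offset {i}")
            i += 2
            continue
        elif ch in "*+?}" and nxt in ("?", "+"):
            problems.append(f"lazy or possessive quantifier at offset {i}")
            i += 2
            continue
        i += 1
    return problems


for problem in unsupported_features(r"\s*(.*?)\s*"):
    print(problem)
```

A caller could either reject a regex that produces any findings or report them to the user as a warning; which of the two is appropriate is left open here.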

-- Contributors: CrawfordCurrie - 06 Apr 2010

Discussion

Yes, I think it is worthwhile to have a formal declaration of which regex parts we support, and which developers should avoid. This would also mean that we can validate incoming regexes and indicate when one might be a problem.

We should, at the same time, work up a list of the OS and toolchain combinations we will test this change on. The situation on OSX and Windows with respect to case insensitivity is an example where we haven't done our jobs properly, and we need to work out a way to avoid adding more issues like that.

I see this as a step in the right direction.

-- SvenDowideit - 07 Apr 2010

I have often been fighting with regexes where something I thought would work does not work. The most common example is looking for "\s*(.*?)\s*" which does not work in many cases but " (.?) *" does.

I see this more as a bug to be fixed than as a feature proposal.

And we are, to a great extent, making things that do not work today work tomorrow. We are not really taking away anything, as far as I can tell.

I would not see fixing this as breaking the feature freeze if you believe you have time to fix this for 1.1. But it would be good to get it fixed before the last minute. I would like a fix present in the first beta we make, probably right after I branch off Release01x01 at the end of April.

It is also important to get the regex syntax documented for the end users so they know exactly what is supported.

-- KennethLavrsen - 07 Apr 2010

I was torn over whether to treat this as a bug or a feature. The reason I decided to go the feature route was to try to give it as much press as possible, in case people are adversely affected. To further publicise it, I've created RegularExpressions.

-- CrawfordCurrie - 07 Apr 2010

Bug or feature.

The proposal did not trigger any concerns.

And calling it a bug allows it to be FIXED in 1.1 scope.

I put this in accepted, as it is a good place to document the details of the fix.

-- KennethLavrsen - 23 Apr 2010
Topic revision: r6 - 05 Dec 2010, GeorgeClark