Feature Proposal: Support multi-key sorting in SEARCH

Motivation

The current support for sorting SEARCH results is crude at best. Improve it to allow sorting on multiple keys in different directions.

This proposal is a formalisation of a discussion that has been running for years. Sven reckons it's too late for him to implement in 1.1, but maybe it's important enough that we need to reconsider that, and get him some help.

Description and Documentation

Deprecate order and reverse parameters on SEARCH.

Add sort parameter (using the format tokens as a comma separated list) as follows:

sort="keys"
Specify the sort order for results, where keys is a comma-separated list of the following sortable fields: $web, $topic, $createddate, $modified, $editby, $formfield(name). Thus sort="$topic,$createddate,$web" will sort first by topic name, then by created time, and finally by web name.
sort="key[]"
By default sorting is alphabetical. By following the sort key name with "[]" you can specify that a numerical sort order is used instead. This will extract the leftmost sequence of one or more digits [0-9] from the key value, and numerically sort on that e.g. given sort="$topic[]", Topic10 will sort after Topic2 because 10 > 2 (a purely alphabetic sort would place Topic10 first). Where there is no number in the key value (or two topics have the same number), then an alphabetical sort will apply.
sort="-key"
Sort keys can be preceded by a '-' to reverse the order of the sort on that key, so sort="-$web,-$topic" will reverse-sort alphabetically on the web name and reverse-sort on the topic name

If the sort parameter is given, the order and reverse parameters will be ignored.

Add a groupby parameter, which triggers a footer and header to be rendered whenever that 'sorting' specifier changes.

so if you want to reproduce the legacy order output, you can say
%!SEARCH{
   sort="$web,$topic"
   groupby="$web"
]%

or, as a complex and slightly weird example
%SEARCH{
   sort=""%CALC{$SOMETHIGN($date)}%,$formfield(Status)"}%"
   groupby="$formfield(WaitingFor)"
}

which is likely to output many headers and footers - as the result is not sorted by WaitingFor - but if one person has alot of tasks waiting on them (um, like i do) they would show up in an output ordered by last edit date and status.

%WHATDOESITAFFECT%
edit

-- Contributors: Many

Discussion

the reason I think its too late, is that I have to re-write the sorting code already :/ but that said, yes, a spec that is agreed to will improve the chances that it'll be done smile

-- SvenDowideit - 10 Mar 2010

one thing that is missing from the above proposal, is how to do partitioning, or grouping - the historical rendering groups by web, and outputs one header and footer per grouping - imo this is still a useful thing - imagine grouping by date - day/week, or by status..

which brings me back to the original proposal, MakeSEARCHResultPartitioningByWebOptional and adding a partition or groupby parameter, with 'web' as the default..?

-- SvenDowideit - 13 Mar 2010

Yes, I deliberately avoided that here to keep the focus on sorting. Currently the only sorting option you have is: order="key". As you highlighted elsewhere, results are partitioned by web, so this in fact has to be read as order="web,key" - "web" is an implicit major sort key. Let's say you simply extend this with partition with web as the default. and say order="web" partition="topic", you are applying a major sort key topic and minor sort key web. That's workable, but:
  1. is not intention-revealing for the user, and
  2. requires complex interpretation of reverse (and/or support for partition="-topic"? Dunno)
The alternative is to have partition as a boolean, and write sort="topic,web" partition="on" which IMHO is more readable.

-- CrawfordCurrie - 14 Mar 2010

I'll see what happens if we use partition as a bool - though I think that will tell the user that its only the first of the multi-key sort, and it might limit the applicability needlessly - no idea smile

I'm preferring to call the param groupby and have it actually list the field/sto group results by - when there is a grouping, there's one header&footer output per group..

wrt reverse - yeah, basically, the sorting is implicit by the use of the groupby so maybe we just go all nerdy, and do sort="(web,date),topic" to groupby (web,date)....

-- SvenDowideit - 16 Mar 2010

I've changed the proposal to be much more expressive - note that while we're using the format $tokens, I'm not proposing that the sort parameter is made into a string and then sorted (though on some Search Algo's that may be a useful pre-sort.

-- SvenDowideit - 17 Mar 2010

OK, I see the connection of $keys with the format tokens. Makes sense.

re: call the param groupby and have it actually list the field/sto group results again raises the interesting question "should a group impose an implicit sort key"? To illustrate, let's say I have a sequence of results: Aardvark, Cat, Iguana, Jaguar, KingCobra, KomodoDragon, Lemur, Zebra. I say sort="-$topic" groupby="class" (where class divides between "mammals" and "reptiles"). I would expect:
  • Lemur

  • Jaguar

  • Iguana

  • Cat
  • Aardvark
viz. the sort is by topic name, but that is then divided into groups. If "groupby" defines an implicit sort key, then:
  • Zebra
  • Lemur
  • Jaguar
  • Cat
  • Aardvark

which to me is counter-intuitive.

The second interesting question is, how does the visual representation of a group get defined? Are there additional groupheader, groupseparator= and groupfooter params?

While not directly impinging on the definition of the sort parameter, I think these questions are important enough to raise a concern.

-- CrawfordCurrie - 17 Mar 2010

it depends if groups are higher order than results, and I think they are. So your counter-intuitive example is logical, it groups and then sorts within the groups.

-- ArthurClemens - 17 Mar 2010

If we look at the available keys, $web, $topic, $createddate, $modified, $editby, $formfield(name) we can see that the date keys are useless for grouping as they stand. So we are left with grouping by $web, $topic, $editby, $formfield(name). Grouping by topic is interesting - do the result sets treat multiple hits on the same topic as separate, groupable hits? If not, what does "group by topic" mean?

-- CrawfordCurrie - 17 Mar 2010

Grouping by date can theoretically be enhanced with grouping by day, week, month, year.

-- ArthurClemens - 17 Mar 2010

all good points - for after we've gotten 1.1 cooked smile

-- SvenDowideit - 01 Apr 2010

MartinKaufmann raised the need for having different sorting criteria. For example, suppose I want to sort stuff based on typical dotted-version-numbers used for software versions. I would like versions (in this case, perl versions) to sort so that "5.8.8" < "5.10" < "5.10.1"

I would also like this ordering-criterion to be pluggable. Perhaps plugins should be able to register comparison-operators.

-- MichaelTempest - 08 Jul 2010

The more I think about it, the more I think it's essential that we make the sort expression use the query syntax. For example, say I have a topic that has a formfield containing the name of another topic; I want to sort numerically on the value of a formfield in the second topic.

sort="number:OtherTopic/OtherField"

the typename: can be naturally extended to support different collation sequences; number, alpha, date, version etc.

-- CrawfordCurrie - 08 Jul 2010

Great discussion. At the risk of sounding a little crazy/precious/annoying - any chance we avoid "using up" the : character just now? smile

How about wrapping it in parens like a method call: sort="number(OtherTopic/OtherField)"

I have an extremely embryonic, distant vision forming in my head about how one might extend the query syntax to work with hierarchical data/result sets a little easier and it involves the :

On the other hand, I won't be offended at all if we go ahead with numeric:... - don't want to get in the way of progress - just wondering if the parens() idea was at all appealing in this case

(OTOH maybe I missed the boat entirely - are we talking about extending query syntax, or is numeric:... only meant to be pseudo-query syntax?)

-- PaulHarvey - 08 Jul 2010

AFAICT, sort="p,q,r..." means sort first by p, then by q, the by r, etc. Then, Crawford's suggestion is that each of p, q, r, etc is an expression in query syntax for the value to sort by, optionally preceded by the collation method. If present, the collation method is separated from the sort-by expression with :

-- MichaelTempest - 08 Jul 2010

Correct. And if there is no collation method, then the sort-by expression stands on it's own, which doesn't happen with parentheses, as they are legal in a query. The intent was:
sort-exprs ::= sort-expr ( ',' sort-exprs )? ;
sort-expr ::= ( collation-method ':' )? query ;
collation-method ::= /^[A-Za-z]+$/ ;
as long as we parse the sort order rather than trying to use regex analysis on it, we are not at risk of "using up" either ':' or ',' - and I very much appreciate the desire not to. This can be done as a clean derivation of the Query parser - in the same way as the IF parser is derived.

-- CrawfordCurrie - 08 Jul 2010

Sort::Maker is now part of the distributed CPAN libs. See GeneralSortingMechanism (the attached file shows code examples). That is for constructing sort queries. Up to now we have discussed the sort command queries.

Noone has argued against Crawford's idea to use query language. Do we need to decide on that?

-- ArthurClemens - 17 Feb 2011

Actually, I'd vote for re-using Crawfords suggestion of using some deriative of query syntax for specifying sort parameters.

If there is a caveat, it's this: sort=SomeTextField will potentially be able to use an index on a back-end store that supports it. Whereas something simple like sort=uc(SomeTextField) would prevent the use of the index (it's possible to define the index as uc(SomeTextField), but that requires extra user nous).

Note that in general sorting and indexing issues are close bedfellows. i.e. a query for 'LEVENS' = uc(SomeTextField) is likely to lose the benefit of the index.

On my work with VDBI, it has become clear that I may need to store fields two or more times: exactly as-is and indexable version(s). I see that this sort of concept has already been picked up in AllowTypedData.

-- JulianLevens - 27 Feb 2012

Is it not possible to automate generation of new indexes somehow? Perhaps that's asking too much.

See also: Tasks/Item11022 and SearchOrderByTopicElement.

Something that concerns me is sorting on a QuerySearch expression which resolves to something that isn't a scalar, eg. attachments.name, I guess we simply need to not support those (pick attachments[0].name instead), but we need SEARCH macro to be more vocal/informative when these unsupported things are given to it, rather than just silent failure.

-- PaulHarvey - 27 Feb 2012
Topic revision: r20 - 27 Feb 2012, PaulHarvey
The copyright of the content on this website is held by the contributing authors, except where stated elsewhere. See Copyright Statement. Creative Commons License    Legal Imprint    Privacy Policy