Feature Proposal: Make faster backlinks possible

Motivation

Finding backlinks can be quite slow, and some links are often not found at all. Abstracting backlinks out as an operation would allow alternative implementations and would encapsulate this logic in one place. I suspect that this logic is currently scattered around the Foswiki code.

Description and Documentation

Some sort of backlinks search operation added to core, probably via QUERY.

I'm primarily thinking of stores which would be able to capture links on topic save and provide fast lookup. Note that a topic that is a non-wiki word will not benefit from this (it cannot be recognised during topic save in the general case, if the topic is squabbed that would be an exception), and any such search will need to resort to a full topic scan.
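As a rough illustration of "capture links on topic save and provide fast lookup", here is a minimal sketch in Python (Foswiki itself is Perl; this is not store code). The class `LinkIndex` and its methods `on_save` and `backlinks` are hypothetical names invented for this example, and the list of targets is assumed to have been extracted from the topic text already.

```python
from collections import defaultdict


class LinkIndex:
    """Toy in-memory index of outgoing links, maintained on save (hypothetical)."""

    def __init__(self):
        self._outgoing = {}                # topic -> set of link targets
        self._incoming = defaultdict(set)  # target -> set of linking topics

    def on_save(self, topic, targets):
        """Replace the recorded outgoing links of `topic` with `targets`."""
        new = set(targets)
        old = self._outgoing.get(topic, set())
        for gone in old - new:
            self._incoming[gone].discard(topic)
        for added in new - old:
            self._incoming[added].add(topic)
        self._outgoing[topic] = new

    def backlinks(self, target):
        """Return the topics linking to `target`, sorted for stable output."""
        return sorted(self._incoming.get(target, ()))


index = LinkIndex()
index.on_save("Main.WebHome", ["Main.JulianLevens", "System.ManagingTopics"])
index.on_save("Sandbox.TestTopic", ["System.ManagingTopics"])
index.on_save("Main.WebHome", ["Main.JulianLevens"])  # re-save drops one link

print(index.backlinks("System.ManagingTopics"))
```

The lookup is a single dictionary access instead of a scan over every topic. Links that cannot be recognised at save time, such as the non-wiki-word case mentioned above, never reach the index and would still need a full topic scan.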

It will be worth contemplating the material referenced in ImplementingLinkProposals, albeit trying not to boil the ocean.

Various documentation will need updating, e.g. ManagingTopics will need changes; hopefully some of its "Known Issues" can be addressed.

Examples

Impact

Any code that performs a backlink search will need finding and amending to use the new code. There are backlink templates that will need updating to use the new option rather than the current SEARCH and regex. Existing stores will need a default implementation (effectively SEARCH and regex).

%WHATDOESITAFFECT%

Implementation

-- Contributors: JulianLevens - 03 Oct 2012

Discussion

Having just made the stupid regex search that is used to show backlinks worse (I suspect), it would be nice to solve this better.

PaulHarvey has a plugin that adds more META data to track this kind of thing (I've used it and the results were useful to me)

but fundamentally, it should require 'boiling the ocean' because there are some really painful problems in there. The Store can't determine what will become a link - it's sadly necessary to run a full render, plugins included, and then mine the links from there (to get the correct answer) - and if any setting / plugin changes, you really need to run it again (if you really are worried about correctness).

-- SvenDowideit - 03 Oct 2012

SolrPlugin solves this by capturing all outgoing links while indexing a topic as discovered on the non-rendered text and in all META records. Backlinks are displayed in a special backlinks template using something like this:
%SOLRSEARCH{
    "outgoing:%BASEWEB%.%BASETOPIC%"
    format="..."
  }%

-- MichaelDaum - 03 Oct 2012

One problem with boiling the ocean will be handling the various URLPARMs.

I am certainly not against eventually boiling the ocean, but one sea at a time.

Part of the point of this proposal is to standardize back-linking, I'm doing something very similar to SolrPlugin in VersatileStore. Solr and Versatile need to compare notes on fore-link discovery in a topic.

Ideally core will provide a service to enumerate all the fore-links in a topic, which could then be used elsewhere (stores/caches being obvious candidates). Over time this could be enhanced for even greater intelligence (moving from no rendering to partial rendering and possibly even full rendering) - the fundamental trade-off will be how much time you want to allow for fore-link discovery. Is that another part of this proposal, or should I separate it out?

-- JulianLevens - 03 Oct 2012

Problem with gathering links from a rendered page is false positives that the topic itself doesn't account for. These are hard to find out at that stage. I am not convinced that there is much gain pre-rendering the page for that purpose. At the end those that we want are wiki words or bracket links. These are visible in the non-rendered text already.
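To make "visible in the non-rendered text already" concrete, here is a small Python sketch that pulls bracket links and wiki words out of raw topic text. Both regular expressions are deliberately simplified assumptions, not Foswiki's real link definitions, and `fore_links` is a hypothetical name.

```python
import re

# Simplified patterns for illustration only, NOT Foswiki's real definitions.
BRACKET_LINK = re.compile(r"\[\[([^\[\]]+)\](?:\[[^\[\]]*\])?\]")
WIKI_WORD = re.compile(
    r"\b(?:[A-Z][A-Za-z0-9]*\.)?[A-Z]+[a-z0-9]+[A-Z][A-Za-z0-9]*\b"
)


def fore_links(text):
    """Return the sorted set of link targets found in unrendered text."""
    links = set()
    for match in BRACKET_LINK.finditer(text):
        links.add(match.group(1).strip())
    # Blank out bracket links so their contents are not counted twice.
    remainder = BRACKET_LINK.sub(" ", text)
    links.update(WIKI_WORD.findall(remainder))
    return sorted(links)


sample = "See ManagingTopics and [[a plain topic][some label]], or Main.JulianLevens."
print(fore_links(sample))
```

Nothing is rendered here, so links produced by macros or plugins are invisible to this approach; that is the trade-off discussed in this thread.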

-- MichaelDaum - 03 Oct 2012

Tricky. My first reaction was that you really don't want to pre-render on every save, and even if you did, you wouldn't capture all the backlinks. Consider, for example, a transcluded topic; the backlink isn't to the transcluded topic, it's to the topic doing the including. Since there is (currently) no central register of transclusions, you are back to boiling the ocean.

Since similar problems are addressed by the caching code, isn't that the logical place to look to do this?

-- CrawfordCurrie - 04 Oct 2012

Yes, I think I'm siding with "why do you want to bother with rendering?". Some processing is appropriate during save, e.g. %USERSWEB%.JulianLevens would need to pick out USERSWEB from config. That's all I'm doing in Versatile, but maybe some prefs should also be handled. At some point you need to say good enough, indeed hopefully better than now.

This topic is AddBacklinksToQuery and the headline "Make faster Backlinks possible". The comments are (maybe not surprisingly) focussed on faster backlinks. However, this proposal is about abstracting out backlink searches. Once done, someone can always create a contrib (or make it part of one) which provides a different method. So this proposal might deliver something like: Foswiki::Backlinks, Foswiki::Backlinks::RegexSearch then later we could have Foswiki::Backlinks::Versatile; Foswiki::Backlinks::Solr and maybe even Foswiki::Backlinks::RenderTheOcean
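The shape of that abstraction might look something like the following Python sketch (the real thing would be Perl packages such as the Foswiki::Backlinks names above). The method name `find`, its arguments and both concrete classes are assumptions made up for illustration; no such API exists yet.

```python
import re
from abc import ABC, abstractmethod


class Backlinks(ABC):
    """Hypothetical common interface for every backlink search method."""

    @abstractmethod
    def find(self, target, topics):
        """Return sorted names from `topics` (name -> raw text) linking to `target`."""


class RegexSearchBacklinks(Backlinks):
    """Default method: scan the text of every topic with a regex."""

    def find(self, target, topics):
        # ASCII-only word boundaries, mirroring the current style of search.
        pattern = re.compile(
            r"(?<![A-Za-z0-9])" + re.escape(target) + r"(?![A-Za-z0-9])"
        )
        return sorted(name for name, text in topics.items() if pattern.search(text))


class IndexedBacklinks(Backlinks):
    """Alternative method: answer from a prebuilt target -> sources mapping."""

    def __init__(self, incoming):
        self._incoming = incoming

    def find(self, target, topics):
        return sorted(self._incoming.get(target, ()))


topics = {
    "WebHome": "Start at ManagingTopics.",
    "Other": "See ManagingTopicsOld instead.",
    "Third": "nothing here",
}
scan = RegexSearchBacklinks().find("ManagingTopics", topics)
fast = IndexedBacklinks({"ManagingTopics": {"WebHome"}}).find("ManagingTopics", topics)
print(scan, fast)
```

Callers such as rename or the backlink templates would depend only on the common interface, so a contrib could swap in a faster method without touching them.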

To reiterate, the focus of this proposal is to abstract out backlink searches.

This proposal will therefore make faster backlinks possible; actually making them faster is a potential exercise for the reader.

-- JulianLevens - 04 Oct 2012

There are already backlink-specific search "macros", though buried in Foswiki::UI::Rename, called %LOCAL_SEARCH and %GLOBAL_SEARCH, which are placeholders for the appropriate %SEARCH expression going on underneath. I'd really like to see these two go away. At the very least: make them properly registered and documented macros, which they aren't.

Means: time to talk about syntax?

-- MichaelDaum - 04 Oct 2012

Syntax is easy - it's a QuerySearch - I would suggest that backlinks returns a list - that way you can write:
  • QUERY{"backlinks" format=" * $backlink"}
  • SEARCH{"'SomeTopic' IN backlinks"}

and maybe:
  • SEARCH{"backlinks[value='SomeTopic']"}
  • SEARCH{"backlinks[web='Main']"}
  • SEARCH{"backlinks[topic='SomeTopic']"}

and then we let the Store decide how to optimise the query.

-- SvenDowideit - 05 Oct 2012 - 04:12

I had never written down a syntax, but something like Sven's stuff above is very much the sort of thing that's been on my mind. There is always the possibility that it will need some refinement as I better understand the issues involved.

I note that within Versatile I recognise whether a link is embedded within any of the following blocks: pre, literal, noautolink or verbatim.

I'm not really sure what to do with these, I can see arguments for totally ignoring them or returning them as links (albeit marked as em-blocked). What do you think about that? Of course not all stores will implement this, indeed a regex search will find them but not know that they are em-blocked. One possible use would be to allow rename to only request links outside of blocks, if I understand correctly, rename would only want to be bothered with these.
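One way to picture "returning them as links, albeit marked as em-blocked" is the Python sketch below. It is not how Versatile does it; the wiki-word regex is a simplified assumption, nested or unclosed blocks are ignored, and `links_with_block_flag` is a hypothetical name.

```python
import re

# Spans of the four block types; nesting and unclosed tags are not handled.
BLOCK = re.compile(
    r"<(pre|literal|noautolink|verbatim)>.*?</\1>", re.DOTALL | re.IGNORECASE
)
# Simplified wiki-word pattern, an assumption for illustration only.
WIKI_WORD = re.compile(r"\b[A-Z]+[a-z0-9]+[A-Z][A-Za-z0-9]*\b")


def links_with_block_flag(text):
    """Return (link, blocked) pairs; blocked is True inside one of the blocks."""
    spans = [m.span() for m in BLOCK.finditer(text)]
    result = []
    for m in WIKI_WORD.finditer(text):
        blocked = any(start <= m.start() < end for start, end in spans)
        result.append((m.group(0), blocked))
    return result


text = "See WebHome.\n<verbatim>\nOldTopic is only an example\n</verbatim>"
flagged = links_with_block_flag(text)
for_rename = [name for name, blocked in flagged if not blocked]
print(flagged, for_rename)
```

A caller such as rename could then filter on the flag, while a plain regex-search implementation would simply report every link as unflagged.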

-- JulianLevens - 05 Oct 2012

that was a reason I mentioned the 'maybe' style - if you have additional attributes, then you would be able to test for them inside the square brackets

-- SvenDowideit - 05 Oct 2012

Here are the working examples for SemanticLinksPlugin

It uses ordinary QuerySearch to query for backlinks in and out of a page.

It's far from perfect, but for my users, covers the 95% case decently. And I can't have an all-webs search on a 220k topic Foswiki installation timing out all the time.

I honestly don't care if the backlink search in a new implementation isn't perfect (I mean, it's fuzzy enough as it is - see the rename/backlink fixes we've done over the years).

Here is an example of an "all webs" backlink query. It's slow because I don't have any indexes turned on for backlinks, but it's still about 10 minutes faster than the default Foswiki regex-based backlinks query.

-- PaulHarvey - 11 Oct 2012

I'd like to prioritize bringing this to a conclusion for our first 2.x minor release, 2.1.0. Backlinks have become an urgent issue because of the move to Unicode. The current Backlinks search uses A-Za-z0-9 style regex character searches and needs to be replaced. Item9289 reports the I18N issues with backlinks. Hijacking that for this feature.
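The effect of an A-Za-z0-9 style character class on non-ASCII topic names can be shown with a tiny Python analogy (Foswiki's own code is Perl, and the topic name here is invented for the example):

```python
import re

ascii_word = re.compile(r"[A-Za-z0-9]+")  # the A-Za-z0-9 style of matching
unicode_word = re.compile(r"[^\W_]+")     # letters and digits in any script

topic = "ÜbersichtThema"  # hypothetical topic name with a non-ASCII letter

ascii_hits = ascii_word.findall(topic)
unicode_hits = unicode_word.findall(topic)
print(ascii_hits, unicode_hits)
```

The ASCII class drops the leading letter and so sees a different name, which is the kind of mismatch that makes backlinks to such topics go missing.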

-- GeorgeClark - 18 Jul 2015

Julian asked that this be deferred to 2.2. His initial implementation needs some adjustments to be more compatible with the query language.

-- GeorgeClark - 06 Dec 2015

Note also the request QueryCustomCollections, which seems to propose a related feature.

-- GeorgeClark - 13 Feb 2016

At the heart of the issue is the need to perform a sub-query. The current query language (syntax and semantics) is not generically defined here, and I need to consider how that needs to be constructed, rather than building a one-off point solution.

Is overloading the ArrayKeyword[ array-selection ] operator (aka OP_where) the appropriate method? It's a good fit in as much as the [] operator returns an array and a sub-query will do the same.

There are two distinctions with a sub-query:
  1. With an array-selection, the data is readily available in a single topic and fast
  2. A sub query demands looking outside of the current topic or even web

If, for a sub-query, you naively get the whole list (as an array) and then filter this array it will perform very poorly. If instead the sub-query is optimized then much less work can be done. Ideally the query and sub-query can be optimized together.
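The difference can be sketched in Python with a toy mapping of targets to linking topics. The data, the function names `naive` and `pushed_down`, and the "rows examined" counter are all assumptions for illustration; a real store would do this against its own index.

```python
# Toy data: link target -> topics that link to it (made up for the example).
incoming = {
    "Main.WebHome": ["Main.A", "Sandbox.B"],
    "System.Foo": ["Main.C"],
}


def naive(target, web):
    """Materialise every (source, target) pair, then filter the whole list."""
    everything = [(src, tgt) for tgt, srcs in incoming.items() for src in srcs]
    hits = [
        src for src, tgt in everything
        if tgt == target and src.startswith(web + ".")
    ]
    return hits, len(everything)  # second value: rows examined


def pushed_down(target, web):
    """Use the target as a key first, then filter only its candidates."""
    candidates = incoming.get(target, [])
    hits = [src for src in candidates if src.startswith(web + ".")]
    return hits, len(candidates)  # second value: rows examined


print(naive("Main.WebHome", "Main"), pushed_down("Main.WebHome", "Main"))
```

Both return the same topics, but the naive form touches every recorded link in the wiki, which is what would hurt on a large installation.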

This also raises concerns about how sub-queries might interact with a future JOIN operator.

There are in fact 2 immediate use cases, all-backlinks and web-backlinks, and they can be implemented easily enough as special cases. They would actually allow use of the [] operator, but for backlinks that is likely to be slow and not recommended - i.e. just use the standard form.

SQL has been criticized for its many vagaries. One in particular that comes to mind is the WHERE clause, for example

SELECT * from t1, t2
   WHERE t1.size = 900
     AND t2.weight < 2500
     AND t1.corrId = t2.id
     AND t1.colour = 'red'

The phrase AND t1.corrId = t2.id is actually indicating the join, the other phrases are limiting the rows to return. The argument is that it's better to have a distinct JOIN phrase for clearer understanding of what is being done especially for users.
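For comparison, the same query with a distinct JOIN phrase can be run next to the original form using Python's built-in sqlite3 module. The table layouts and rows below are invented purely so that both queries have something to return.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Made-up schema and data, just enough to exercise the two query forms.
conn.executescript("""
    CREATE TABLE t1 (corrId INTEGER, size INTEGER, colour TEXT);
    CREATE TABLE t2 (id INTEGER, weight INTEGER);
    INSERT INTO t1 VALUES (1, 900, 'red'), (2, 900, 'blue'), (3, 500, 'red');
    INSERT INTO t2 VALUES (1, 2000), (2, 2400), (3, 1000);
""")

# Join condition mixed in with the row filters, as in the example above.
implicit = conn.execute("""
    SELECT t1.corrId, t2.weight FROM t1, t2
     WHERE t1.size = 900 AND t2.weight < 2500
       AND t1.corrId = t2.id AND t1.colour = 'red'
""").fetchall()

# Join condition stated separately; WHERE holds only the row filters.
explicit = conn.execute("""
    SELECT t1.corrId, t2.weight FROM t1
      JOIN t2 ON t1.corrId = t2.id
     WHERE t1.size = 900 AND t2.weight < 2500 AND t1.colour = 'red'
""").fetchall()

print(implicit, explicit)
```

Both forms return the same rows; the second simply makes it obvious which condition relates the two tables and which ones restrict the result.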

All I'm basically saying, is that I want to take care to think about the impact of an enhancement to SEARCH/QUERY's user interface (syntax and semantics).

This may need another FP to handle query sub-queries? (maybe even JOINs but pretend I didn't say that)

-- JulianLevens - 29 May 2016

Do you still consider indexing outgoing links (rather than searching for incoming links) as we discussed earlier?

-- MichaelDaum - 30 May 2016
 
Topic revision: r20 - 30 Jan 2018, MichaelDaum