I want a Foswiki::DOM!

Why

From https://github.com/csirac2/core/blob/dom/lib/Foswiki/DOM.pm

The mission is to present a tree of Foswiki::DOM::Node objects from some input (normally a TML string). Evaluators such as Foswiki::DOM::Writer::XHTML then use this tree to produce reliable, well-formed output markup (HTML5, XML, JSON, etc.).
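
In sketch form, the intended flow might look like this (Foswiki::DOM->new() appears later on this page; the writer's evaluate() entry point is an assumption):

use Foswiki::DOM;
use Foswiki::DOM::Writer::XHTML;

# Parse TML into a tree of Foswiki::DOM::Node objects, then hand the
# tree to a writer. evaluate() is a hypothetical method name; only
# Foswiki::DOM->new() is shown elsewhere on this page.
my $dom   = Foswiki::DOM->new('This is *bold* text!');
my $xhtml = Foswiki::DOM::Writer::XHTML->evaluate($dom);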

Discussion

Separately, the TOM (Topic Object Model) is concerned with managing structured data. Merging the TOM and DOM architectures may be possible; on the other hand, their specialisations may prove useful. A special (simplified) 'TOM' DOM view might be possible for any TOM data member's TML content when that content is accessed via the TOM; e.g. the ideal DOM for QuerySearching might look different from the one ideal for (reversible) TML <-> XHTML rendering, but this remains to be proven.

Rationale for a DOM

  • The venerable Foswiki::Render has served well, but untangling and changing this web of regex spaghetti is daunting
  • Extending (and/or re-purposing) TML is full of surprises and edge cases: opaque escaping/de-escaping, re-evaluation, and take-out/put-back tricks. A DOM should make this easier, more consistent and less bug-prone, especially for plugins.
  • Use a common codebase for all TML applications: WYSIWYG, XHTML, HTML5, RTF, XML, JSON etc. so we can fight bugs in one place.
  • Avoid wasted effort in parallel TML rendering implementations of varying completeness/bug-compatibility
  • Most other wiki & CMS platforms have a DOM for their wikitext: Mediawiki, Confluence, X-Wiki, etc.
  • But does using a DOM erode some of Foswiki's approachability, hackability, 'charm'? Hopefully Foswiki::DOM::Parser::Scanner code such as Foswiki::DOM::Parser::TML::Verbatim feels familiar to regex hackers: it claims regions of syntax first, and leaves the complexity of building/optimizing tree structures from these ranges to a separate, syntax/feature-agnostic step (see the sketch under How, below)
  • Could allow native storage of content in markups other than TML
  • Could cache the Foswiki::DOM tree, possibly enabling performance improvements (or making up for lost perf over Foswiki::Render)

TODO

How

  • Mark regions of input, much like a syntax highlighter (sketched below).
  • Resolve the regions into an interval tree.
  • Build a DOM tree.
  • Pass it to an evaluator, such as Foswiki::DOM::Writer::XHTML
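
To illustrate the first step, a minimal verbatim scanner might look something like this (a simplified sketch; the real Foswiki::DOM::Parser::TML::Verbatim is more involved):

sub scan_verbatim {
    my ($tml) = @_;
    my @ranges;
    while ( $tml =~ m{(<verbatim>)(.*?)(</verbatim>)}gs ) {
        push @ranges, {
            syntax        => 'verbatim',
            start_markup  => $-[1],    # offset where <verbatim> begins
            start_content => $-[2],    # offset where the content begins
            end_content   => $+[2],    # offset just past the content
            end_markup    => $+[3],    # offset just past </verbatim>
        };
    }
    return \@ranges;
}

Each range records both the markup and content boundaries, so the later, syntax-agnostic tree-building step can decide what to keep.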

Experiments

Marpa

SvenDowideit says we should try to build a partial parser with Marpa (before we assume we can't derive a benefit from it - http://blogs.perl.org/users/jeffrey_kegler/2011/11/what-is-the-marpa-algorithm.html ).

I always thought that it was "too hard" to create a formal BNF spec for WikiText, but it seems to have been attempted by the MediaWiki guys.

-- PaulHarvey - 18 Jan 2012

Reading a little more, Marpa looks really nifty, but it doesn't seem to help with lexing (well, my code calls it scanning), which is what I'm coding right now.

But it might be a fun way to turn my scanner result into a DOM tree.

-- PaulHarvey - 18 Jan 2012

I don't understand how you plan to handle the dynamic nature of TML:

  • Macros can create syntax, including new macros. There is no record of the relationship between the resulting TML and the originating source. (I am working towards keeping track of, and recording, those relationships. - PH)
  • Some macros may not evaluate fully until after rendering is complete. (Indeed; I will be supporting those. - PH)
  • Plugins can register for access before and after rendering, and can directly modify content. (I don't intend to support commonTagsHandler macros, and perhaps that's a show-stopper for ever adopting my current approach. There will, however, be several alternative methods for performing the same tasks in a DOM-friendly way; in fact, by adding a new syntax you can even re-use the very same regexes, with a different replace-callback function, that were used with traditional commonTagsHandlers. - PH)

The only point at which the content is stable enough to generate a DOM is when the output HTML has been generated... which kind of defeats the object, doesn't it? (I hope to prove you wrong :-) - PH)

IMHO macros are the most important and powerful feature of Foswiki, the feature that sets it apart from other wikis.

-- CrawfordCurrie - 18 Jan 2012

Crawford, I have similar concerns. See IRC logs: http://irclogs.foswiki.org/bin/irclogger_log/foswiki?date=2012-01-18,Wed&sel=421#l417

I could see multiple passes working, but that does of course limit the benefit. I am interested in a more clearly defined structure (i.e. a spec) for the syntax. I had the idea of an 'explain' processor (a la SQL) which would show you, phase by phase, how Foswiki constructed the page you are viewing. It would be a useful debugging and learning tool, and it may even be that this exercise simply helps to make it possible. Ideally the two would combine to reduce maintenance, but the explain processor may prove very important in helping people understand Foswiki processing and build pages much more easily.

Back to the multiple pass process: if a store cached the parse-tree (even if not quite a full parse), that might not help much on its own; but if %INCLUDE brought back a pre-parsed chunk, and macros were extended to be able to return pre-parsed chunks, that would help.

However, you said "Some macros may not evaluate fully until after rendering is complete." I don't get that. A macro is called and returns a string, and then later, once rendering is complete, changes that string behind the scenes? Ah, you must mean that a macro will be called that returns a marker, and then after rendering the plugin can find the marker and perform further transforms - or something like that? Does this stop the multiple pass process in its tracks?
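
Something like this, perhaps (a sketch only; the handler and helper names are hypothetical, not actual Foswiki plugin API):

my %pending;
my $seq = 0;

# Expansion-time handler: returns an opaque marker instead of final text
sub my_macro_handler {
    my ( $session, $params ) = @_;
    my $marker = "\0MYMACRO" . $seq++ . "\0";
    $pending{$marker} = $params;
    return $marker;    # the placeholder survives rendering untouched
}

# Post-render handler: swaps each marker for its final value in the HTML.
# final_value() stands in for whatever late computation the plugin needs.
sub my_post_render_handler {
    my ($html) = @_;
    $html =~ s/(\0MYMACRO\d+\0)/final_value( $pending{$1} )/ge;
    return $html;
}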

As I said, I see much benefit in clarifying the TML spec and building an explain processor. So although this exercise may fail in its ultimate goal, it may still be beneficial.

-- JulianLevens - 18 Jan 2012

My approach has actually been not to modify the input buffer at all. Or at least, not to use it as a place to transform the input into the output markup. Rather, the scanners work only to identify markup and content boundaries - ranges - and these are all gathered before any attempt at producing output markup is made.

Yes, the input buffer is modified by some syntaxes: macros, verbatim, escaped newlines, etc. But I can keep track of these permutations so that already-claimed regions are kept up to date with respect to their locations within the evolving input.
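
For example (a simplified sketch; the offset fields match the ranges described in the list below, and the real bookkeeping is more involved):

# If an earlier pass splices $delta characters into (or out of) the
# buffer at $edit_pos, shift every already-claimed offset at or beyond
# that point so the ranges still line up with the evolving input.
sub shift_regions {
    my ( $regions, $edit_pos, $delta ) = @_;
    for my $range (@$regions) {
        for my $field (qw(start_markup start_content end_content end_markup)) {
            $range->{$field} += $delta if $range->{$field} >= $edit_pos;
        }
    }
    return $regions;
}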

Basically,
  1. 'Scanning' produces a sorted-array-per-syntax of ranges. These have: start_markup, start_content, end_content, end_markup, and a partially instantiated Foswiki::DOM::Node object (although I'm now rethinking this; if we use Marpa to assemble the DOM tree, it's better to have a more lightweight DOM::Token thing, perhaps)
  2. A data structure (some sort of sorted tree, an interval tree perhaps) is built from these ranges (or not, if we use Marpa; see the sketch below)
  3. This tree can then be walked and transformed in-place into a DOM tree (maybe?)
  4. The DOM tree gets passed off to something that will evaluate it into the desired output markup, e.g. DOM::Writer::XHTML
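
A rough sketch of steps 2-3, assuming the ranges are already well-formed and properly nested (real TML needs conflict resolution between competing syntaxes first):

sub ranges_to_tree {
    my (@ranges) = @_;
    @ranges = sort { $a->{start_markup} <=> $b->{start_markup} } @ranges;
    my $root  = { syntax => 'root', children => [], end_markup => ~0 };
    my @stack = ($root);
    for my $range (@ranges) {
        # Anything the current top-of-stack node doesn't enclose is a
        # sibling of some ancestor, so unwind the stack first
        pop @stack while $range->{start_markup} >= $stack[-1]{end_markup};
        my $node = { %$range, children => [] };
        push @{ $stack[-1]{children} }, $node;
        push @stack, $node;
    }
    return $root;
}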

In my approach, there are two kinds of macros: traditional registerTagHandler tags, which are processed early on (after verbatim), and DOM tags, which get a second opportunity for processing after the DOM has already been assembled. These tags can't return simple text strings, though; they need to return DOM nodes/trees. But that might be a small change in habit:
# Traditional tag handlers return plain text:
return 'This is *bold* text!';

# DOM tags return a DOM node/tree instead:
return Foswiki::DOM->new('This is *bold* text!');

-- PaulHarvey - 18 Jan 2012

Julian, the difficulty with parsing TML is that it's a context-sensitive, non-deterministic mess - if we try to fully process macros, then each time we parse, the parse tree will look different.

I am hopeful I have some practical solutions to this mess, but it's only a fun experiment for me at this stage.

I wouldn't have embarked on this if I didn't see all the other wiki engines out there sporting server-side DOMs for their wikitext markup.

-- PaulHarvey - 18 Jan 2012

They can sport server-side DOMs precisely because they don't support that most excellent feature, viz. macros.

One other thing to think about: permissions. Your syntax tree has to know about the permissions that apply to each part of the tree if it is ever to be used to modify content.

-- CrawfordCurrie - 19 Jan 2012

Paul, your eyes do seem to be wide open to the challenge ahead - and Godspeed, I say. I look forward to seeing some great lateral thinking.

As has been mentioned, you may not achieve your ultimate goal, but you will undoubtedly clarify what the TML parser must, must not, should and should not do. You will also need to handle broken TML, and even define good rules/guidelines as to how to handle it.

All in all, you will lay the foundations for a much better TML parser (or even find that you have written one).

I look forward to adding explain processing (but you can add that as you go if you like :-) )

-- JulianLevens - 19 Jan 2012

Sure - I am hopeful that after I feel comfortable with the lexing, a naive Marpa-based parser might be an easy way to produce an explain-type output. Or even a TML-tidy! :-)

-- PaulHarvey - 19 Jan 2012