Discussion: Problem with ACL Performance at Large Sites

Motivation

Low performance at large legacy sites with mandatory ACLs can be a showstopper for upgrading from an old TWiki to a current Foswiki.

Description and Documentation

On a sample environment with about 10K users, most of them in the access control group for a simple page, the page took 3.5s to render on Foswiki. TWiki2 rendered the same page on identical hardware in 650ms. The hardware was an unloaded quad core with 32G of RAM vs about 6G of twikitext.

Examples

apache.conf

Apache's logfile format was modified to also report the total time Apache2 and TWiki/Foswiki took to generate and deliver the HTML. The second-to-last entry is the number of microseconds. The additional latency and range items are at the end of an otherwise normal Apache logformat, so most logfile parsers should need little to no modification.

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %D \"%{Range}i\"" loglatency

CustomLog /var/log/apache2/latency.log loglatency
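
A minimal sketch of how the latency field could be pulled out of such a log, assuming the `loglatency` format above is used unchanged. The function name and regular expressions are illustrative only; the sample line is one of the entries reported further down in this topic.

```python
import re

# Tail of a "loglatency" line: ... "user-agent" <microseconds> "range"
LATENCY_TAIL = re.compile(r'"(?P<agent>[^"]*)"\s+(?P<usec>\d+)\s+"(?P<range>[^"]*)"\s*$')
# The request field, e.g. "GET /path HTTP/1.1"
REQUEST = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*"')

def parse_latency(line):
    """Return (request path, seconds) for one log line, or None if it does not match."""
    tail = LATENCY_TAIL.search(line)
    req = REQUEST.search(line)
    if not tail or not req:
        return None
    # %D is reported in microseconds; convert to seconds.
    return req.group("path"), int(tail.group("usec")) / 1_000_000

sample = (
    'n.n.n.n - user [04/Nov/2009:18:26:50 -0800] '
    '"GET /twiki/bin/view/Main/ProjectPage HTTP/1.1" 200 41692 "-" '
    '"Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.14) '
    'Gecko/2009090216 Ubuntu/8.04 (hardy) Firefox/3.0.14" '
    '1412737 "-"'
)
print(parse_latency(sample))
```

Because the extra fields sit at the end of the line, the expressions only need to anchor on the last two quoted strings and the number between them.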

latency with ACLs

WebHome took 1.5s to render, while a simpler ProjectPage of similar size (no %INCLUDE or %WEBLIST) took 3.3s.

n.n.n.n - user [04/Nov/2009:14:42:21 -0800] "GET /twiki/bin/view/Main/WebHome HTTP/1.1" 200 47255 "-" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.14) Gecko/2009090216 Ubuntu/8.04 (hardy) Firefox/3.0.14" 1524630 "-"

n.n.n.n - user [04/Nov/2009:16:22:24 -0800] "GET /twiki/bin/view/Main/ProjectPage HTTP/1.1" 200 45271 "-" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.14) Gecko/2009090216 Ubuntu/8.04 (hardy) Firefox/3.0.14" 3335258 "-"

latency without ACLs

After bisecting the slow page, removing the ACLs reduced rendering time to 1.4s, and putting them back raised it back to 3.3s. The ACL on this page was a pair of groups that included most of the 10K users.

n.n.n.n - user [04/Nov/2009:18:26:50 -0800] "GET /twiki/bin/view/Main/ProjectPage HTTP/1.1" 200 41692 "-" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.14) Gecko/2009090216 Ubuntu/8.04 (hardy) Firefox/3.0.14" 1412737 "-"
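
The arithmetic behind the two ProjectPage figures can be spelled out as a small sketch. The numbers are the microsecond values from the log lines above; the variable names are illustrative. Note that the response size also changed slightly (45271 vs 41692 bytes), so the difference is an estimate of the ACL cost rather than an exact measurement.

```python
# Latencies in microseconds, copied from the %D field of the log lines above.
with_acl_usec = 3_335_258     # ProjectPage with ACLs
without_acl_usec = 1_412_737  # ProjectPage with the ACLs removed

acl_cost_usec = with_acl_usec - without_acl_usec
acl_share = acl_cost_usec / with_acl_usec

print(f"ACL overhead: {acl_cost_usec / 1e6:.2f}s ({acl_share:.0%} of the request)")
```

On these two samples the ACLs account for roughly 1.9s, or a little over half of the total request time.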

Impact


Implementation

-- Contributors: DrakeDiedrich - 05 Nov 2009

Discussion

Drake, the feature request relates to ACLs. What makes you think it's the ACLs that are killing performance? Have you done any experiments? If so, can you pin down what aspect of ACL processing is taking the time? e.g. reading/parsing topics, memory management, etc.

I'm not saying you are wrong; I know the ACL impl is not as good as it could be. I'm just seeking confirmation that this is the reason for 0.65 becoming 3.5.

-- CrawfordCurrie - 05 Nov 2009

  • Which user mapping and login manager are you using?
  • How many users are in the relevant group?
  • How many groups and users are there overall?
  • How are groups nested?
  • What exactly is the performance difference of the same page with and without ACLs? (I don't get the numbers in the logs.)
-- MichaelDaum - 05 Nov 2009

I've improved my reported examples so I hope they're clearer for the stats with and without ACLs. I'm using a modified Apache2 logformat to get the times. SvenDowideit had me try to set up nytprof last night, but the machine it is being tested on is running an ancient OS, so it will take me some time to backport everything to get better stats. The effort looks comparable in size and scope to upgrading. I'll report back with more as I get better information - probably a script to generate a similar test set that has the same performance issues on a modern system.

The login manager is ApacheLogin. Each user has a user page, and each group is a normal wiki group listing all members (generated hourly by scripts from other sources, but Foswiki/TWiki just see normal wiki groups). There are about 500 groups in total, most of them small, but a handful contain many or most of the wiki users. The ACL for this page consisted of one large and one small group, and they weren't nested. Other pages have ACLs with a group that nests these two, though I haven't yet tested that performance.

The issue reported here is the gap from 1.4s to 3.5s on Foswiki. I'll raise a separate issue for the TWiki2-to-Foswiki gap of 0.65s to 1.4s when I have at least a suspect for that gap.

-- DrakeDiedrich - 05 Nov 2009

For completeness, what version of Foswiki are you running?

-- CrawfordCurrie - 05 Nov 2009

foswiki 1.0.7-auto5254.deb at the moment, tracking a couple of days back from the head of the latest release branch usually. I'll get a test machine up that I can profile trunk with (lesser hardware though), and add some new data when I have it going.

-- DrakeDiedrich - 05 Nov 2009

Additional bits from IRC: Drake indicated his test system is a quad core with 32GB of RAM, and he was the only user.

-- SvenDowideit - 06 Nov 2009

The TopicUserMapper is not intended as a large scale user and groups system - it simply is too primitive, and while we could add caching to mask the effects, it is better to replace it with a more suitable solution. I wrote HTTPDUserAdminContrib specifically to start the ball rolling on this.

Could you (please) import your users and groups into a database?

Even so, I think you're seeing a convergence of 2 problems - massive GROUP definitions, and inefficient re-re-re-testing of topic and web permissions - so whatever happens, it's an important issue.
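
To illustrate the second problem, here is a minimal sketch of memoizing permission checks for the duration of one request, so the same (user, mode, web, topic) question is only answered once. This is not Foswiki code: `slow_check`, `make_request_checker` and the call counter are hypothetical stand-ins, and the access rule inside `slow_check` is invented purely for the demonstration.

```python
from functools import lru_cache

# Hypothetical stand-in for an expensive check that reads and parses ACL settings.
CALLS = {"count": 0}

def slow_check(user, mode, web, topic):
    CALLS["count"] += 1
    # Invented rule for the demo only: topics named Secret* are denied.
    return not topic.startswith("Secret")

def make_request_checker():
    """Return a checker whose cache lives only as long as one request."""
    @lru_cache(maxsize=None)
    def check(user, mode, web, topic):
        return slow_check(user, mode, web, topic)
    return check

check = make_request_checker()
for _ in range(5):
    check("WikiGuest", "VIEW", "Main", "WebHome")
print(CALLS["count"])  # the expensive check ran once, not five times
```

Creating a fresh checker per request keeps the cache from outliving ACL edits, at the cost of starting cold each time.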

-- SvenDowideit - 30 Nov 2009

Having read the user code (in various stages) multiple times, it seems the problem goes deeper than any XYZUserContrib could mend. User mapping code only sits on top of standard core operations such as "is member of group", which expand lists recursively, and that of course is not an optimal approach. There are a couple of factors whose impact on the complexity of the algorithms must be decreased or even removed from the equation, i.e. the number of users and the number of groups per standard operation. This doesn't go without YAUCR (yet another user code rewrite).

-- MichaelDaum - 01 Dec 2009

If you use nested groups that cannot be expanded in O(1), then yes, you're always going to get a slowdown - that isn't going to change, no matter what you do. On the other hand, as in most scalable data designs, there is nothing that will prevent a good UserMapper from making "is member of group" O(1) - there is no need to change the core for that.

In the Perl code, recursive expansion of nested groups is only the default implementation - we are free (and I think I already have in one or more of my mappers) to re-implement that in a more efficient way.

When I did the usermapper refactorings I was careful to make the topic-based, re-interpret-on-every-CGI-call assumptions and implementations replaceable by mapper implementations, so no, I don't think a YAUCR is needed - especially not for this.
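
A minimal sketch of the idea of an O(1) "is member of group" test, written in Python rather than Perl and not taken from any existing mapper. It assumes the group definitions are available as a plain mapping from group name to member names; `flatten_groups` and the sample names are hypothetical. Nested groups are expanded once up front, so each later membership test is a single hash lookup.

```python
def flatten_groups(groups):
    """Expand nested groups once; return {group: frozenset of user names}.

    `groups` maps a group name to its members, which may be users or
    other group names. Cyclic nesting is tolerated via the `seen` set.
    """
    flat = {}
    for name in groups:
        users, seen, stack = set(), set(), [name]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            for member in groups[current]:
                if member in groups:
                    stack.append(member)   # nested group: expand later
                else:
                    users.add(member)      # plain user
        flat[name] = frozenset(users)
    return flat

groups = {
    "AllStaffGroup": ["EngineeringGroup", "AliceSmith"],
    "EngineeringGroup": ["BobJones", "CarolWhite"],
    "ProjectGroup": ["AllStaffGroup"],
}
flat = flatten_groups(groups)
print("BobJones" in flat["ProjectGroup"])  # one set lookup, no recursion
```

The flattened table would have to be rebuilt whenever a group definition changes, for example after an hourly regeneration of the group topics.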

-- SvenDowideit - 01 Dec 2009

I have changed the format of this topic from feature proposal to basic form Brain storm.

To be a feature proposal, there should be a proposed spec for a new feature or change of feature.

This topic describes a problem with performance and the discussion is about tracing the root cause.

If someone comes up with a proposal to resolve this in core, by all means raise a feature proposal. It is an important subject. Very important. But I would like the feature proposal wiki application to be the tool for making decisions on spec changes and enhancements to core and default plugins.

-- KennethLavrsen - 24 Mar 2010

The recursive group expansion is just a "best guess" as to what's going on here. Drake, do you have any new information regarding this problem?

-- CrawfordCurrie - 25 Mar 2010

I do not know if this is your problem, but our pages took 20s to load because there was a call to %WEBLIST in the left margin of the skin, and we had 250 webs, most of them read-protected. Just listing all the webs and checking each of the 250 for access rights took 20s. We removed the %WEBLIST.

-- ColasNahaboo - 25 Mar 2010

For what it's worth, we had the problem that even with MongoDBPlugin, a particular query - to get the first page of results (there were ~4 hits total, pagesize=25) that a WikiGuest user had permission to see (in a web with ~35,000 topics in it), took minutes because every single topic had to be shuttled over the network from the database to Foswiki, which parsed out prefs to learn that the user running the SEARCH shouldn't see it in their results.

So the current version uses Foswiki::Access API to allow MongoDB to pre-filter the result set. It basically tacks on an ... AND (<ACL ID> IN <ACL Profiles>) onto the end of every query.

This has some remarkable benefits:
  • You never have to run some async job to keep a database-side ACL cache up-to-date. It's always up-to-date, because each topic has a list of the ACL IDs that apply to it.
  • It avoids having to resolve nested/hierarchical set questions at the database.
  • Pre-filtering the query to hide topics from the result set is just a matter of ... AND (this OR that OR other), which is easy for the DB to resolve.

Query times went from several minutes down to ~1.6s
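
The pre-filtering idea can be sketched in a few lines of Python. This is an in-memory illustration only, not MongoDBPlugin's actual schema or query syntax: the `acl_ids` field, the profile names and `visible_topics` are assumptions made for the example. Each topic carries the IDs of the ACL profiles that may view it, the user is resolved once to the set of profiles they satisfy, and the search predicate and the ACL test are applied together before any topic is returned.

```python
def visible_topics(topics, query, user_profiles):
    """Apply the search predicate AND the ACL pre-filter in one pass."""
    return [
        t["name"]
        for t in topics
        # Visible if at least one of the topic's ACL IDs is in the user's profiles.
        if query(t) and not user_profiles.isdisjoint(t["acl_ids"])
    ]

# Hypothetical data: topic names, text and ACL IDs are invented.
topics = [
    {"name": "PublicNotes", "text": "release plan", "acl_ids": {"everyone"}},
    {"name": "StaffOnly", "text": "release budget", "acl_ids": {"staff"}},
    {"name": "Lunch", "text": "menu", "acl_ids": {"everyone"}},
]
guest_profiles = {"everyone"}
print(visible_topics(topics, lambda t: "release" in t["text"], guest_profiles))
```

In a real database the same test would be part of the query itself, so protected topics never leave the server.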

There is some explanation here on IRC.

-- PaulHarvey - 28 Nov 2011
Topic revision: r15 - 28 Nov 2011, PaulHarvey