How does one mitigate abuse of a public SPARQL endpoint?

I'd like to make our SPARQL endpoint public, but am a little concerned that it could easily be abused¹ by malicious — or maybe just enthusiastic — parties.

¹ Specifically, be subjected to a denial-of-service attack.

Some queries (like "SELECT * WHERE { ?s ?p ?o }") will bring the thing to its knees. We're running Fuseki, and I'm experimenting with validating the AST of a query by hooking in my own QueryExecutionFactory. However, this is probably a losing battle as there are any number of queries that might be unreasonably greedy.
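For concreteness, the kind of check I'm experimenting with looks roughly like this (a minimal sketch against Jena's syntax API rather than my actual code; both rejection rules are just examples, and the package names differ between Jena releases):

    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryFactory;
    import org.apache.jena.sparql.core.TriplePath;
    import org.apache.jena.sparql.syntax.ElementPathBlock;
    import org.apache.jena.sparql.syntax.ElementVisitorBase;
    import org.apache.jena.sparql.syntax.ElementWalker;

    public class QueryValidator {

        /** Returns false for queries that look unreasonably greedy. */
        public static boolean isAcceptable(String queryString) {
            Query query = QueryFactory.create(queryString);

            // Example rule 1: require a LIMIT, and cap it.
            if (!query.hasLimit() || query.getLimit() > 2000) {
                return false;
            }

            // Example rule 2: refuse triple patterns with an unbound predicate
            // (e.g. "?s ?p ?o"), which tend to scan huge parts of the store.
            if (query.getQueryPattern() == null) {
                return true;   // e.g. a DESCRIBE with no pattern
            }
            final boolean[] unboundPredicate = { false };
            ElementWalker.walk(query.getQueryPattern(), new ElementVisitorBase() {
                @Override
                public void visit(ElementPathBlock el) {
                    for (TriplePath tp : el.getPattern().getList()) {
                        // getPredicate() is null for property paths; treat those
                        // conservatively as unbound too.
                        if (tp.getPredicate() == null || tp.getPredicate().isVariable()) {
                            unboundPredicate[0] = true;
                        }
                    }
                }
            });
            return !unboundPredicate[0];
        }
    }

The walker only sees the surface syntax, which is part of why this feels like a losing battle: a query can be greedy without matching any obvious pattern.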

I've noted https://issues.apache.org/jira/browse/JENA-47, but incessant querying might still be a problem.

Other options include rate limiting and adding delays for overly enthusiastic users.

So, is endpoint abuse a problem you encounter? Do you use any of the above, or something else? What works for you?

On our public endpoints, we use the following to keep them stable:

  1. Caching - we use an HTTP cache for all public queries. We're using nginx for this, but any web cache (e.g. Squid, Varnish) would probably do.

  2. Limit results - we limit queries to returning the first 2000 results (one way to enforce this is sketched after the list).

  3. Soft limit - this might be specific to 4store, but we set a limit on the depth the store itself will search.

  4. Query monitor daemon - we've got a daemon running which watches the CPU, memory and execution time of all query processes. Anything which exceeds certain limits is killed.

  5. Slow query analysis - all queries and their execution times are logged, and reports are generated from these. For us, plenty of slow queries have been due to bad data having been imported, rather than badly written queries.
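To make (2) concrete, the result cap can be enforced by rewriting the parsed query before it reaches the engine. This is a sketch using Jena's API rather than what we actually run, since the question is about Fuseki (the class name and the 2000 figure just follow our setup):

    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryFactory;

    public class LimitEnforcer {
        private static final long MAX_RESULTS = 2000;

        /** Clamp the query's LIMIT to MAX_RESULTS before handing it to the engine. */
        public static Query enforceLimit(String queryString) {
            Query query = QueryFactory.create(queryString);
            if (!query.hasLimit() || query.getLimit() > MAX_RESULTS) {
                query.setLimit(MAX_RESULTS);
            }
            return query;
        }
    }

Rewriting the parsed query avoids fragile string manipulation and composes with whatever limits the store applies internally.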

These all apply to external requests; internal ones bypass the cache and the result/store limits (we could also use API keys to distinguish them).

One thing we're not yet doing that could help: tracking query time per IP address. We could then rate-limit (or block) an IP that makes either a vast number of fast queries or a smaller number of slower ones.
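A minimal sketch of what that accounting might look like, assuming a per-IP time budget over a fixed window (the class and the window/budget numbers are invented for illustration):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /** Tracks cumulative query execution time per client IP over a fixed window. */
    public class QueryTimeBudget {
        private static final long WINDOW_MS = 60_000;   // 1-minute window (arbitrary)
        private static final long BUDGET_MS = 10_000;   // 10s of query time per window (arbitrary)

        private static class Usage {
            long windowStart;
            long usedMs;
        }

        private final Map<String, Usage> usage = new ConcurrentHashMap<>();

        /** Record a completed query; returns false if the IP has exhausted its budget. */
        public boolean record(String ip, long elapsedMs) {
            long now = System.currentTimeMillis();
            Usage u = usage.computeIfAbsent(ip, k -> new Usage());
            synchronized (u) {
                if (now - u.windowStart > WINDOW_MS) {  // start a fresh window
                    u.windowStart = now;
                    u.usedMs = 0;
                }
                u.usedMs += elapsedMs;
                return u.usedMs <= BUDGET_MS;
            }
        }
    }

Both abuse patterns drain the same budget: many fast queries or a few slow ones.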

Some SPARQL store vendors differentiate themselves by offering features that address this (to some extent).

I'm thinking of OpenLink and its Virtuoso product in particular. They run DBpedia, which has perhaps the most (ab)used public SPARQL endpoint at the moment, so we can assume they have some experience with this issue. Their main strategies, AFAICT, are predicting execution time from the query plan and refusing overly complex queries, and returning partial results for queries with long execution times.
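I don't know Virtuoso's internals, but the partial-results idea can be approximated outside the store too: stream rows until a deadline passes, then return whatever has accumulated. A rough Jena-based sketch, purely to illustrate the idea (not how Virtuoso does it):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QuerySolution;
    import org.apache.jena.query.ResultSet;

    public class PartialResults {
        /** Pull rows until the deadline, then return what we have so far. */
        public static List<QuerySolution> executeWithDeadline(Dataset ds, String queryString,
                                                              long deadlineMs) {
            List<QuerySolution> rows = new ArrayList<>();
            long stopAt = System.currentTimeMillis() + deadlineMs;
            try (QueryExecution qexec = QueryExecutionFactory.create(queryString, ds)) {
                ResultSet rs = qexec.execSelect();
                while (rs.hasNext() && System.currentTimeMillis() < stopAt) {
                    rows.add(rs.next());
                }
            }
            return rows;
        }
    }

Note that hasNext() can itself block while the engine computes the next row, so in practice this would be combined with a query timeout.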

I don't think there's all that much that you can really do outside of the store implementation, except requiring an API key, but that would sort of defeat the purpose of a public SPARQL endpoint.

Edit: I should clarify that I don't really know if or how any other stores handle this; I mention Virtuoso just because we have a bit of experience with it.

ARQ is in the process of adding query timeouts (JENA-29 is done, and is a precursor to JENA-47).
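Usage will look roughly like this once it lands (a sketch; exact method names and signatures may differ between releases):

    import java.util.concurrent.TimeUnit;

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.query.ResultSetFormatter;

    public class TimeoutExample {
        public static void run(Dataset ds, String queryString) {
            try (QueryExecution qexec = QueryExecutionFactory.create(queryString, ds)) {
                // Abort the query if it runs longer than 10 seconds.
                qexec.setTimeout(10, TimeUnit.SECONDS);
                ResultSet rs = qexec.execSelect();
                // Iteration throws QueryCancelledException if the timeout fires.
                ResultSetFormatter.out(System.out, rs);
            }
        }
    }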

It does not stop the plain old denial-of-service issues you highlight, but the ways to deal with those are the same as for any web server: a sophisticated front end to balance or exclude requests.

sparql.org gets these occasionally, mainly from errant programs and their bugs. sparql.org is not guaranteed to be up 24x7; it's best effort. It does have some internal limits, such as on the size of data it will load.

In the Talis Platform we've got extensions that support query timeouts. I believe this kind of feature is going to be added to ARQ itself, making it more generally available.

In general, of course, you can use all of the usual techniques to limit access to servers, e.g. IP rate limiting, limiting the number of simultaneous requests from a single client, etc.
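For example, capping simultaneous requests can be as simple as a semaphore in front of the query handler (a generic sketch, not code from any of the systems mentioned; the cap of 8 is arbitrary):

    import java.util.concurrent.Semaphore;

    /** Allows at most MAX_CONCURRENT queries to execute at once; others are refused. */
    public class ConcurrencyGate {
        private static final int MAX_CONCURRENT = 8;   // arbitrary cap
        private final Semaphore slots = new Semaphore(MAX_CONCURRENT);

        public boolean tryExecute(Runnable queryTask) {
            if (!slots.tryAcquire()) {
                return false;                          // over capacity: caller should send 503
            }
            try {
                queryTask.run();
                return true;
            } finally {
                slots.release();
            }
        }
    }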

In Kasabi we're using API keys for all APIs, including SPARQL endpoints. This will provide another point of management, although we've not put anything in place yet.