Semantics of SPARQL aggregates

cygri · June 26, 2012, 10:00pm

I have a question about SPARQL aggregates. Consider the following query:

SELECT ?x (MAX(?value) AS ?max)
WHERE {
    ?x ex:p ?value
} GROUP BY ?x

It's pretty simple: Find all ex:p triples, group them by subject, and for each distinct subject, return the object with the maximum value.

But what happens if there are no matching subjects? Let's say, if we run this on an empty graph?

I would have expected that it would return zero solutions.

However, what Jena does is this:

-----------
| x | max |
===========
|   |     |
-----------

That result seems rather strange to me. It's not what I would have expected. Where does this extra empty result come from?

Now I don't think that Jena gets this wrong; it's probably that my expectations are off. Can someone explain to me why this is the correct result?

Edit: An example that may make even clearer why this behaviour seems odd to me:

SELECT ?x (SAMPLE(?value) AS ?some_value)
WHERE {
    ?x ex:p ?value
} GROUP BY ?x

If there are no ex:p triples, then there are no values, so there's nothing to sample from. But again, the result is a row with variables ?x and ?some_value, both unbound.

AndyS · June 26, 2012, 10:00pm

When there is a GROUP BY and no matches we have:

Group(ExprList, Ω) = { ... | μ in Ω }

and Ω is empty so

Group(ExprList, {}) = { }

An implicit group (the grouping used when there is no explicit GROUP BY) is introduced in the algebra translation at 18.2.4.1:

If Q contains GROUP BY exprlist
   Let G := Group(exprlist, P)
Else If Q contains an aggregate in SELECT, HAVING, ORDER BY
   Let G := Group((1), P)

So where there is no GROUP BY, the key is (1). It's just some value to hang everything off, and it gets removed in AggregateJoin, where keys are used to map to the aggregate evaluations but do not get passed on.
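As a rough illustration of the translation rule quoted above, here is a minimal Python sketch. The names `grouping_exprs`, `key_of` and `CONSTANT_KEY` are made up for this example and do not come from the spec or from any engine; solutions are modelled as plain dicts, which is also an assumption.

```python
# Hypothetical stand-in for the constant grouping expression (1).
CONSTANT_KEY = (lambda mu: 1,)


def grouping_exprs(group_by, uses_aggregate):
    """Pick the expression list handed to Group, mirroring the quoted rule."""
    if group_by is not None:
        return tuple(group_by)   # explicit GROUP BY exprlist
    if uses_aggregate:
        return CONSTANT_KEY      # implicit group: constant key (1)
    return None                  # no grouping step at all


def key_of(exprs, mu):
    """Evaluate the grouping expressions against one solution mu."""
    return tuple(expr(mu) for expr in exprs)


solutions = [{"x": "a", "value": 1}, {"x": "b", "value": 2}]
implicit = grouping_exprs(None, uses_aggregate=True)
keys = {key_of(implicit, mu) for mu in solutions}
print(keys)  # {(1,)}: every solution hangs off the same key
```

With the constant key, all solutions fall into one group, which is why the key carries no information and can be dropped later.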

In both cases:

Group(ExprList, Ω) = { ... | μ in Ω }

so when Ω is empty ...

Group(ExprList, Ω) = { }

Aggregation(exprlist, func, scalarvals, { key1→Ω1, …, keym→Ωm } )
   = { (key, F(Ω)) | key → Ω in { key1→Ω1, …, keym→Ωm } }
   = {}

which makes:

M(Ω) = { ListEval(exprlist, μ) | μ in Ω }
   = {}
F(Ω) = func(M(Ω), scalarvals)
   = func({}, scalarvals)

The count(*) case is count of {} = cardinality of Flatten({}) = 0.

The max(?x) is

Max(M) = Max(ToList(Flatten(M))).
Max({}) = Max(ToList({}))
        = undefined so it's an error.

An error in a select expression means the variable is left unbound.
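The two cases above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: `AggregateError`, `flatten`, `agg_count`, `agg_max` and `select_binding` are hypothetical names, M is modelled as a list of lists, values are compared with Python's own ordering rather than SPARQL's, and `None` stands in for an unbound variable.

```python
class AggregateError(Exception):
    """Stands in for a SPARQL evaluation error (assumption of this sketch)."""


def flatten(m):
    """Flatten a multiset of lists into one list of values."""
    return [value for lst in m for value in lst]


def agg_count(m):
    # Count is the cardinality of the flattened multiset; 0 when m is empty.
    return len(flatten(m))


def agg_max(m):
    values = flatten(m)
    if not values:
        # Max of an empty list is undefined, so it is treated as an error.
        raise AggregateError("Max of an empty list is undefined")
    return max(values)


def select_binding(func, m):
    """An error in a select expression leaves the variable unbound (None)."""
    try:
        return func(m)
    except AggregateError:
        return None


print(select_binding(agg_count, []))  # 0
print(select_binding(agg_max, []))    # None, i.e. unbound
```

So over an empty multiset, COUNT still produces a value, while MAX produces an error that surfaces as an unbound variable.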

RobVesse · June 26, 2012, 10:00pm

I am struggling to find evidence to back this up, but my understanding is that an aggregate always returns a value (which may be unbound), even when operating over empty results. As that value has to go somewhere, engines are obliged to create a single result row to return it.

Intuitively this makes sense. Consider a simple query like the following:

SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }

On an empty dataset I still want to get back a single row containing zero rather than no rows. Even with a GROUP BY involved I should still be able to be told that there were zero results.

I tested with a bunch of implementations and all exhibit the same behavior as Jena. For the record, I tried dotNetRDF (my own implementation), Sesame and Virtuoso.

Signified · June 26, 2012, 10:00pm

In the aggregate algebra for SPARQL 1.1, the Group algebra is defined with:

Group(exprlist, Ω) = { ListEval(exprlist, μ) → { μ' | μ' in Ω, ListEval(exprlist, μ) = ListEval(exprlist, μ') } | μ in Ω }

If the solution set Ω is empty, then the set should be empty. ~~Sounds like a bug.~~
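To make the definition concrete, here is a literal Python transcription of it. It is a sketch, not how any engine implements grouping: `list_eval` and `group` are hypothetical names, solutions are plain dicts, and expression errors are ignored for simplicity.

```python
def list_eval(exprlist, mu):
    """Evaluate each grouping expression against solution mu (errors not modelled)."""
    return tuple(expr(mu) for expr in exprlist)


def group(exprlist, omega):
    """Map each key to the solutions of omega that evaluate to that key."""
    return {
        list_eval(exprlist, mu): [
            other for other in omega
            if list_eval(exprlist, other) == list_eval(exprlist, mu)
        ]
        for mu in omega
    }


by_x = [lambda mu: mu["x"]]
solutions = [
    {"x": "a", "value": 1},
    {"x": "a", "value": 2},
    {"x": "b", "value": 3},
]
print(len(group(by_x, solutions)))  # 2 groups, keyed ("a",) and ("b",)
print(group(by_x, []))              # {}: no solutions, so no keys at all
```

The comprehension ranges over μ in Ω, so an empty Ω cannot produce any key. The transcription is quadratic in the size of Ω, which is fine for illustration only.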

EDIT: As per Andy's answer, it's not a bug. Though the set is empty, aggregates such as MAX and SAMPLE return an error when their argument is an empty list, and the result is a row of UNBOUNDs.

EDIT: The following is unimportant and can be ignored.

I'm not so familiar with the algebra, but if for some reason the solution set Ω contains a tuple of UNBOUNDs (it won't), then the following would apply:

ListEval((expr1, ..., exprn), μ) returns a list (e1, ..., en), where ei = expri(μ) or error.

ListEval retains errors resulting from the evaluation of the list elements.

ListEval essentially just performs the evaluation of variables whose results are used for grouping.

It then states:

Note that, although the result of a ListEval can be an error, and errors may be used to group, solutions containing error values are removed at projection time.

ListEval((unbound), μ) = (error), as the evaluation of an unbound expression is an error.

So it seems that error/unbound can be used to group.

However, it also states that "solutions containing error values are removed at projection time". This last statement is a little ambiguous. The values projected in this case are unbounds, not errors (it's the evaluation of an unbound that is an error, not the unbound itself). It's unclear (to me) whether these can still be projected.