Design Patterns For Scalability

I'm wondering what graph partitioning patterns can assist me with scalability and performance. What ones are commonly used, and when should I choose one over another?

[As a mainstream software engineer,] when I briefly think about the number of ways I could partition my data to assist with scaling, the following come to mind:

  • Hashing: Partition by some completely arbitrary scheme, like alphanumeric ranges on a hashcode
  • Location: Partition into geographical regions - assuming a pattern of spatial affinity between queryers and locations
  • Recency: Partition into temporal regions - a pattern of temporal affinity between queryers and when data is added
  • Activity: Partition by activity - where similar transactions require similar data
  • Similarity: Partition by data affinity - related data lives together

I'm sure there are loads that I haven't thought of that arise out of the unique technology and its capabilities, or out of the problem domain. I can also imagine generic solutions (e.g. involving caching) that might have a big impact on my performance.

Have you tried any of the schemes here? What were your experiences? What are the criteria you used to choose one pattern over another? Are there any good references, or discussions, on the trade-offs to be made? Did any of them make administration of the data (and its integrity) easier or harder? Are there any non-linearities that you witnessed in the performance of your platforms as they grew?

It's something I have been discussing quite frequently over the last few months. In summary, you are talking about a sharding convention or strategy, and the examples you give generally fall into two categories: sharding by physical characteristics (activity, hashing) and sharding by logical characteristics (affinity, location, etc.).

In my eyes, one important factor that influences this is inference and reasoning, specifically in the context of using a triplestore that supports reasoning and implementing a solution that relies on it (in some cases the reasoning tier might be separate from the triplestore).

Another thing to consider is how a query actually gets processed. In the example above, you might have a SPARQL federation tier that sits on top of SPARQL clusters sharded by affinity (starting to like that concept now). From a performance POV it might not make sense to send the query to every shard, so before the query gets executed you can consult a separate registry service and ask 'which shards have information about people?'. There is nothing to prevent the registry tier also acting as the SPARQL service description endpoint, and you can extend it to provide further info about the shards.
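As a toy sketch of that registry idea (every name here is hypothetical, not from any real federation API): shards advertise which terms they hold data about, and the federator asks the registry before fanning a query out.

```python
# Hypothetical shard registry: ask which shards hold data about a term
# before sending a query out to the whole cluster. Illustrative only.

class ShardRegistry:
    def __init__(self):
        # term -> set of shard ids that contain triples mentioning it
        self._index = {}

    def register(self, shard_id, terms):
        for term in terms:
            self._index.setdefault(term, set()).add(shard_id)

    def shards_for(self, term):
        # Unknown terms return an empty set; a real federator would
        # fall back to broadcasting the query in that case.
        return self._index.get(term, set())

registry = ShardRegistry()
registry.register("shard-people", {"foaf:Person", "foaf:name"})
registry.register("shard-places", {"geo:lat", "geo:long"})
registry.register("shard-misc", {"foaf:Person", "dc:title"})

# 'Which shards have information about people?'
targets = registry.shards_for("foaf:Person")
```

The same lookup table is a natural place to hang per-shard service descriptions and statistics.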

I can only comment on one:

Hashing: Partition by some completely arbitrary scheme, like alphanumeric ranges on a hashcode

In terms of data partitioning strategies, beware of hashing schemes for RDF. Simple schemes would hash terms and do something like:

 machineId <- hash(ex:term) % numberOfMachines

to place triples containing ex:term in some position on a given machine. If you receive a query about ex:term, you use the same procedure to find which machine to ask.
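A minimal sketch of that placement scheme, assuming a hash that is stable across runs (Python's built-in hash() is salted per process, so zlib.crc32 is used here instead; the partition-on-subject choice is just one illustrative option):

```python
import zlib

def machine_for(term, number_of_machines):
    # Stable hash so that placement and lookup agree across processes
    return zlib.crc32(term.encode("utf-8")) % number_of_machines

NUM_MACHINES = 4

def place(triple):
    # Illustrative: route an (s, p, o) triple by its subject term
    s, p, o = triple
    return machine_for(s, NUM_MACHINES)

machine = place(("ex:alice", "foaf:knows", "ex:bob"))
# A lookup for ex:alice's triples uses the same function and lands
# on the same machine:
assert machine_for("ex:alice", NUM_MACHINES) == machine
```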

However, the variation in the frequency distribution of terms can cause serious skew. In particular, your data is probably going to contain a lot of triples with rdf:type in the predicate position (roughly 20% of real-world Web data, IIRC). Sending all rdf:type triples to one machine is not good for scalability.

In general, avoid hashing predicates (on their own) at all costs. Be careful hashing objects too, since the values of rdf:type can exhibit the same skew (e.g., you don't want to give one machine all triples with the object foaf:Person if dealing with Web data). Under common patterns of publishing, subjects are generally the safest position to hash, and hashing objects that are not values of rdf:type is often okay.
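To make the skew concrete, here is a toy comparison of predicate- versus subject-hashing over synthetic data (the 50/50 rdf:type proportion is exaggerated for illustration, not a real-world figure):

```python
import zlib
from collections import Counter

def machine_for(term, n):
    return zlib.crc32(term.encode("utf-8")) % n

# Synthetic dataset: rdf:type dominates the predicate position,
# mimicking (and exaggerating) the skew seen in Web data.
triples = [(f"ex:s{i}", "rdf:type", "foaf:Person") for i in range(200)]
triples += [(f"ex:s{i}", f"ex:p{i % 50}", f"ex:o{i}") for i in range(200)]

N = 4
by_predicate = Counter(machine_for(p, N) for _, p, _ in triples)
by_subject = Counter(machine_for(s, N) for s, _, _ in triples)

# Predicate hashing piles all 200 rdf:type triples onto one machine,
# while subject hashing spreads the load far more evenly.
```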

Another option for hashing is to combine terms in the hash, such as hashing over concatenated subject-object pairs. However, this restricts the kinds of lookups you can route directly to a given machine (the lookup must bind, as constants, the same positions used in the hash) and doesn't have the same locality features (joins cannot be done so easily on one machine).
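A sketch of that combined-hash variant, using the same stable-hash assumption as above:

```python
import zlib

def machine_for(key, n):
    return zlib.crc32(key.encode("utf-8")) % n

N = 4

def place_by_subject_object(triple):
    # Hash over the concatenated subject and object terms. A separator
    # avoids ambiguous concatenations like "ab"+"c" vs "a"+"bc".
    s, _, o = triple
    return machine_for(s + "\x00" + o, N)

# Routing a lookup now requires BOTH hashed positions to be bound as
# constants; a lookup with only the subject bound cannot be routed to
# a single machine and must be broadcast to all of them.
m = place_by_subject_object(("ex:alice", "foaf:knows", "ex:bob"))
```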

There are many workarounds and other possible solutions, like range-based partitioning of the lexical space based on frequency (e.g., machine 1 handles terms starting with A-D, machine 2 E, machine 3 F-J, where each range is chosen to carry an even load).
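A minimal sketch of such a range table (the bounds are made up; in practice they would be tuned offline from term-frequency statistics so each range carries similar load):

```python
import bisect

# Hypothetical range table: each entry is (exclusive upper bound,
# machine id). Mirrors the text's example: A-D, E, F-J, rest.
RANGES = [("E", 0), ("F", 1), ("K", 2), ("\uffff", 3)]
_BOUNDS = [upper for upper, _ in RANGES]

def machine_for_term(term):
    # First range whose upper bound is strictly greater than the term
    i = bisect.bisect_right(_BOUNDS, term)
    return RANGES[i][1]
```

Range tables preserve prefix lookups and make rebalancing a matter of moving boundaries, at the cost of keeping the table in sync everywhere.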

On the general topic of linearity, skew will prevent scaling out (adding more machines).

Joins will prevent scaling out and up (more data). Running anything more than the most trivial of joins over the network is painful (unless you can make heavy use of caching). Thus, designing your shards/data placement to keep as much join processing as possible local is crucial.
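To illustrate why placement matters for joins, a sketch assuming subject-hashed shards: a star join on one subject stays on one machine, while a path join may have to cross machines.

```python
import zlib
from collections import defaultdict

def machine_for(term, n):
    return zlib.crc32(term.encode("utf-8")) % n

N = 4
shards = defaultdict(list)
triples = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "rdf:type", "foaf:Person"),
]
for s, p, o in triples:
    shards[machine_for(s, N)].append((s, p, o))

# A star join -- { ?s rdf:type foaf:Person . ?s foaf:name ?n } --
# touches only one shard, because subject hashing co-locates every
# triple about the same subject.
alice_shard = machine_for("ex:alice", N)
local = [t for t in shards[alice_shard] if t[0] == "ex:alice"]

# A path join -- { ?s foaf:knows ?o . ?o rdf:type ?t } -- may cross
# shards, since ex:bob's triples can live on a different machine.
```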

But yeah, a good PhD topic for someone to look into.

Refs re: hash partitioning and skew for RDF data:

Andreas Harth, Jürgen Umbrich, Aidan Hogan, Stefan Decker: YARS2: A Federated Repository for Querying Graph Structured Data from the Web. ISWC/ASWC 2007: 211-224

Spyros Kotoulas, Eyal Oren, Frank van Harmelen: Mind the data skew: distributed inferencing by speeddating in elastic regions. WWW 2010: 531-540

(I would not be as pessimistic about hashing as the latter paper is.)