I'm wondering what graph partitioning patterns can help with scalability and performance. Which ones are commonly used, and when should I choose one over another?
As a mainstream software engineer, when I briefly think about the ways I could partition my data to help with scaling, the following come to mind:
- Hashing: partition by an essentially arbitrary scheme, such as alphanumeric ranges over a hash code
- Location: partition into geographical regions, assuming spatial affinity between queriers and locations
- Recency: partition into temporal regions, assuming temporal affinity between queriers and when data is added
- Activity: partition according to activity, where similar transactions require similar data
- Similarity: partition according to data affinity, so that related data lives together
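To make the first scheme concrete, here's a minimal sketch of hash partitioning (the function name and shard count are illustrative, not from any particular system):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition using a stable hash.

    md5 is used here for its stability across processes (not for
    security) -- Python's built-in hash() is salted per process,
    so it would route the same key to different partitions on
    different machines.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Route a user record to one of 8 shards; the same key always
# lands on the same shard.
shard = partition_for("user:12345", 8)
```

Note the trade-off this exposes: a simple modulo mapping spreads load evenly, but changing `num_partitions` remaps almost every key, which is why systems that expect to add nodes often use consistent hashing instead.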
I'm sure there are plenty I haven't thought of, arising either from the capabilities of a particular technology or from the problem domain itself. I can also imagine generic solutions (e.g. caching) that might have a big impact on performance.
Have you tried any of the schemes here? What were your experiences? What criteria did you use to choose one pattern over another? Are there any good references, or discussions, on the trade-offs to be made? Did any of them make administration of the data (and its integrity) easier or harder? Did you witness any non-linearities in the performance of your platforms as they grew?