Can the economic problem of shared SPARQL endpoints be solved?

One of the most common themes I see on this site is that people are using the DBpedia public endpoint and having trouble with queries that time out. I don't want to single DBpedia out, because the reason people complain is that it's the most compelling product out there. The Talis/Kasabi people have been talking about offering a paid hosting service for RDF data, and I believe they haven't quite made it happen either because they face the same problems.

To fill in the background, I'm currently working on a product which is like DBpedia but a lot better. It will contain information on more than 22 million different topics and, for the most part, have much better recall and precision of facts. It will be aligned with DBpedia, so it can be joined against DBpedia to exploit the areas where DBpedia will remain superior. I'd like to deliver this product to customers in whatever way they find useful, and it's clear that many people don't want the burden of running their own triple store, but...

The usual model of charging people $N for Y API calls just won't work when the "API call" is a SPARQL query, because the cost of a SPARQL query can vary drastically. Even though I can build an elastic system on AWS, I need to know that $N will pay for the hardware to run those Y queries. I'd like to offer a very good SLA, because if people are paying money, they deserve a good service. However, I think some users (particularly people like me) could be very expensive to serve because we like to write very complex and expensive queries; I'd have to quote a very high price and I'm still not sure I could offer a good SLA at that price. On the other hand, that price will be way too high for people who mostly write simple queries, so I lose their business.

The part of me which is a technical guy would just like to sell an RDF dump and let people worry about their own triple store, but the business part of me has reservations about that, and the "voice of the customer" seems to say that people really like having somebody else run a triple store for them.

So what's the way out of this predicament?

The way I see it, you are essentially weighing up two quite different business models. Both have pros and cons, which I'll get into, but I think there is also a hybrid model in there which you should perhaps consider, and which I'll also try to explain.

Shared SPARQL Endpoint

Now, as you've said, there are clear problems with this model, most notably how to ensure QoS for all users when different users may be using the endpoint very differently. There is also the complication of how you charge a user for their query. As a naive starting point, maybe you define the cost of a query purely in raw time terms: the longer a query takes to run, the more you charge the user. Couple this with the ability for the user to control their maximum runtime (maybe also throw in some partial-results capability) and you start to get something that may seem fair.
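To make that concrete, here is a minimal sketch of time-based billing with a user-chosen runtime cap; the run_sparql callable, the per-second rate and the timeout handling are all assumptions for illustration, not how any particular store behaves:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

PRICE_PER_SECOND = 0.002  # assumed per-second rate, purely illustrative

def run_billed_query(run_sparql, query, max_runtime_s):
    """Run a query under the user's chosen runtime cap and bill for elapsed time.

    run_sparql is a hypothetical callable that executes the query against the
    store and returns its results; a real endpoint would supply this.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    start = time.monotonic()
    future = pool.submit(run_sparql, query)
    try:
        results = future.result(timeout=max_runtime_s)
    except TimeoutError:
        results = None  # or partial results, if the store can hand them back
    finally:
        # don't block on a query we've stopped billing for; a real store
        # still needs a server-side kill to actually stop the work
        pool.shutdown(wait=False)
    elapsed = time.monotonic() - start
    charge = round(min(elapsed, max_runtime_s) * PRICE_PER_SECOND, 4)
    return results, charge
```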

Even if you have this kind of charging model, it doesn't solve the basic problem of contention. How do you ensure one hard query doesn't make other users' easy queries slower (and thereby more expensive for them), and vice versa: do lots of easy queries make it difficult for the system to handle a hard query at the same time? Given my earlier point about charging based on runtime, is that fair if contention can cause a normally cheap query to become very expensive?
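One blunt way to stop a single user's hard queries from degrading everyone else is to cap concurrency per user, so their own queries queue behind each other rather than monopolising the shared workers. A toy sketch (the cap of 2 and the helper names are assumptions):

```python
import threading
from collections import defaultdict

MAX_CONCURRENT_PER_USER = 2  # assumed cap; could vary by pricing tier

# one semaphore per user, so one user's heavy queries contend with
# each other rather than with every other user's queries
_user_slots = defaultdict(lambda: threading.BoundedSemaphore(MAX_CONCURRENT_PER_USER))

def run_with_fair_share(user_id, run_sparql, query):
    """Block until this user has a free slot, then run their query."""
    with _user_slots[user_id]:
        return run_sparql(query)
```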

Another issue to consider is which triple store(s) you are going to use to host your endpoint(s). Different stores have different performance characteristics, and some stores lend themselves to certain kinds of queries more than others, so it may be useful to run multiple stores, some of which expose additional features or capabilities to users, e.g. full-text indexing.
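As an example of why the choice of store matters to users, the same text lookup is typically written quite differently against a store with a full-text index (the Virtuoso bif:contains form below) than in portable SPARQL; both query strings are illustrative sketches only:

```python
# Portable SPARQL 1.1: runs on any store, but usually means scanning literals
PORTABLE_TEXT_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s WHERE {
  ?s rdfs:label ?label .
  FILTER(CONTAINS(LCASE(STR(?label)), "semantic web"))
}
"""

# Virtuoso-specific: uses the store's full-text index via bif:contains,
# usually far cheaper on a large dataset but not portable to other stores
VIRTUOSO_TEXT_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s WHERE {
  ?s rdfs:label ?label .
  ?label bif:contains "'semantic web'"
}
"""
```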

RDF Dumps

Now, RDF dumps may seem attractive in that you essentially just become a seller of data, but this poses other problems. Are people paying a one-off cost to get a dump, or are they paying a subscription to always get the latest dump as you make updates to the system? If the latter, then you have the problem of delivery: assuming a dump gigabytes in size, your bandwidth/data transfer costs for your CDN could become very expensive.
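The delivery cost is easy to put rough numbers on. A back-of-the-envelope sketch, where the dump size, subscriber count, release cadence and per-GB rate are all invented for illustration:

```python
# Back-of-the-envelope data-transfer cost for a subscription dump service.
dump_size_gb = 40          # assumed compressed dump size
subscribers = 200          # assumed customers pulling each release
releases_per_month = 4     # assumed weekly updates
cost_per_gb = 0.12         # assumed CDN/egress price per GB

monthly_egress_gb = dump_size_gb * subscribers * releases_per_month
monthly_cost = monthly_egress_gb * cost_per_gb
print(f"{monthly_egress_gb} GB/month -> ${monthly_cost:,.2f}/month in transfer alone")
# prints: 32000 GB/month -> $3,840.00/month in transfer alone
```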

Also, with this model, what post-sale support (if any) do you provide? Is it a case of handing the customer the RDF and washing your hands of them, or do you give them support and guidance to get their triple stores loaded and running?

Hybrid Model

So the hybrid model I propose is to leverage the power of the cloud to make things both affordable and scalable. Rather than building out an endpoint or selling an RDF dump, build out server images for various cloud compute platforms. You could build images that have the data preloaded into a variety of different triple stores and then add some dynamic provisioning capability.

Essentially, a user comes along and says "I need an endpoint using store X for Y amount of time on a machine with Z power" (i.e. machine instance size); you calculate the cost, charge them appropriately, bring up the store ready to use, and say "here's your endpoint". When their Y time runs low you drop them an email: "Do you want to extend your time?" If they don't care what X is, then maybe you have some default store you prefer that they get automatically.
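A minimal sketch of that provisioning step, assuming EC2 via boto3 and pre-baked machine images with the data already loaded; the image IDs and hourly rates are placeholders, not real values:

```python
import boto3

# hypothetical pre-baked images, one per triple store, with the dataset preloaded
AMI_BY_STORE = {"virtuoso": "ami-xxxxxxxx", "jena-tdb": "ami-yyyyyyyy"}
# your hourly price per instance size (cloud cost plus margin) -- placeholders
HOURLY_RATE = {"m1.large": 0.48, "m1.xlarge": 0.96}

def provision_endpoint(store, instance_type, hours):
    """Quote the customer, then bring up a preloaded endpoint for them."""
    price = HOURLY_RATE[instance_type] * hours
    ec2 = boto3.client("ec2")
    resp = ec2.run_instances(
        ImageId=AMI_BY_STORE[store],
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
    )
    instance_id = resp["Instances"][0]["InstanceId"]
    return instance_id, price
```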

Along those lines, you could think of the Y time as credits: if the user wants to buy 10 hours now but only use 1 hour immediately, they should be able to do that. And I'm sure you could come up with a nice pricing model that embodies bulk discounts etc. while still allowing you to make money on top of the actual cost of provisioning with whatever cloud compute provider you prefer.
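A pricing function along those lines might look like this; the tiers and discounts are invented purely to show the shape:

```python
# Illustrative bulk-discount tiers: (minimum hours purchased, discount off list price)
DISCOUNT_TIERS = [(100, 0.25), (50, 0.15), (10, 0.05), (0, 0.0)]
LIST_PRICE_PER_HOUR = 1.20  # assumed list price per endpoint-hour

def price_credits(hours):
    """Price a block of endpoint-hour credits with simple bulk discounts."""
    for threshold, discount in DISCOUNT_TIERS:
        if hours >= threshold:
            return round(hours * LIST_PRICE_PER_HOUR * (1 - discount), 2)

# e.g. price_credits(10) -> 11.4, price_credits(100) -> 90.0
```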

The main downside to this model is probably going to be licensing costs: are you using commercial triple stores, and if so, how does their licensing work when you may be dynamically provisioning many instances of them? On the flip side, are the open source/free versions of some triple stores sufficiently powerful to allow you to run with this model?

Long ago, I worked on a timesharing machine (it was a long time ago).

The charging model was based on the load of the machine at the time and the amount of CPU units used ("mill seconds"). This had the effect of encouraging work at unusual hours; people got a lot done at 3am.

While "$N for Y API calls" is one charging model there are others such as charging for resources used. $N buys you X seconds of elapsed execution time. Run out of credit and your query is killed.

You don't even have to factor in the notional value of the data - the load factor of the system will do that for you dynamically - but you can value-price by including a constant for the value of the data. A real system will want to factor in other limited shared resources; just look at AWS, where network and I/O ops are charged for. Some amount of "$N for Y API calls" can also stop flooding, if that is a concern (effectively charging for query compile time).
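Put concretely, the charge per query might be elapsed time scaled by the machine's load at the time, plus a constant if you want to value-price the data itself; every rate below is a placeholder:

```python
def charge_for_query(elapsed_seconds, current_load, base_rate=0.01, data_premium=0.05):
    """Charge = time used, scaled by how busy the machine was, plus a flat
    premium for the value of the data. All rates are illustrative."""
    load_factor = 1.0 + current_load  # e.g. current_load = running queries / capacity
    return round(elapsed_seconds * base_rate * load_factor + data_premium, 4)

# a 2-second query on an idle machine: charge_for_query(2, 0.0) -> 0.07
# the same query at 80% load:          charge_for_query(2, 0.8) -> 0.086
```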

The advantage of this scheme is that it sits outside the query system: analysing the logs gets you much of the way there, and all it requires is the ability to kill running queries. If you add help from the query engine, such as cost estimation, then you can soften the user experience of having queries killed.

Hi, the scenario you are talking about is really interesting, but I don't clearly understand your question.

If you want to share an ontology which is some kind of more specific model or superset of DBpedia, you could offer both an RDF dump and a SPARQL endpoint for it, with the idea of using the endpoint for testing and development and the dump for the development and production use of third-party libraries.

If you are worried about end users, I think (it's my point of view) you should develop a service with some user experience on top, in order to navigate the data: if the user is a semantic web developer or a domain expert, they have much more interest in dumps and endpoints, but end users are probably not interested in that kind of service.

Finally: having an endpoint with a good SLA for such a huge dataset is not trivial today. I think moving to some cloud deployment could be a good idea, but please also consider the issues around handling concurrent (multi-threaded) requests. Maybe you should consider using BigOWLIM or Virtuoso, which scale very well, or experiment with the Tinkerpop stack over Neo4j or OrientDB.