Caveat: the following are general observations about RL vs. DL, not based on a quantitative comparison of, say, multi-hop reasoning vs. ConvE.
One benefit (or at least distinction) of using RL for a problem is that instead of producing a model, you end up with a policy. The exploration/exploitation trade-off allows the policy to be updated (in many cases) while it is being used, so there's an aspect of continuous learning. RL approaches also tend to carry more of a notion of "resilience" or "robustness" than DL, given their origins in optimal control theory, and because they optimize for long-term rewards they can be tuned toward guarantees of certain behaviors, etc. OTOH, RL can require more computation.
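To make the policy-vs-model point concrete, here is a minimal, purely illustrative epsilon-greedy sketch in Python. None of this is tied to any specific KG library, and names like `reward_fn` are hypothetical; the point is just that the "policy" (the value estimates plus the exploration rule) keeps updating while it is being used.

```python
import random

# Minimal epsilon-greedy sketch (purely illustrative, hypothetical names):
# the policy here is the action-value table plus the exploration rule, and it
# is updated online while it is being used -- the continuous-learning aspect
# mentioned above.
def run_epsilon_greedy(actions, reward_fn, steps=1000, epsilon=0.1):
    value = {a: 0.0 for a in actions}   # estimated long-term value per action
    count = {a: 0 for a in actions}
    for _ in range(steps):
        if random.random() < epsilon:                 # explore
            a = random.choice(actions)
        else:                                         # exploit current estimates
            a = max(actions, key=lambda x: value[x])
        r = reward_fn(a)                              # feedback from the environment
        count[a] += 1
        value[a] += (r - value[a]) / count[a]         # incremental update of the policy
    return value

# Toy usage with a made-up stochastic reward:
# estimates = run_epsilon_greedy(
#     ["a", "b", "c"],
#     lambda a: random.gauss({"a": 0.2, "b": 0.5, "c": 0.4}[a], 0.1))
```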
Also, there may be some advantages in terms of representation for embeddings. In RL, one or more agents interact with an environment. In this case, components of the KG may define the environment, or the environment may also include other aspects of the KG use case outside of the graph itself, whether that's built as a simulation, as offline learning with data, or a combination. RL has two main forms of feedback loops: (1) observations of the effect of agents' actions on the environment, and (2) the reward structure. Relatively simple rules for the latter can lead to quite interesting behaviors, and this might simplify some challenges for embedding. Since there can be multiple agents, they can be structured into hierarchies (learning different portions of the problem, at different rates) or can even communicate with each other, which opens up population-based learning, curriculum learning, etc. [re: my question to Victoria at KnowCon] These have proven to be powerful techniques for learning (see Danny Lange, et al., @ Unity3D AI), and some appear to be difficult or impossible to achieve through DL alone. Michael Jordan @ UC Berkeley RISElab has proposed that some of these aspects of RL suggest markets in general represent a form of evolutionary intelligence, which could broaden the application space considerably.
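For what "components of the KG define the environment" might look like in practice, here is a rough, hypothetical sketch: the agent walks the graph by picking outgoing relations (actions), its observation is the entity it lands on (feedback loop 1), and a simple reward rule fires when it reaches a target entity (feedback loop 2). Everything here (`KGWalkEnv`, the toy edge dict, the reward rule) is made up for illustration; a real multi-hop setup would add embeddings and a learned policy network.

```python
import random

# Hypothetical sketch of a KG as an RL environment: nodes are observations,
# outgoing edges are actions, and a simple reward rule marks success.
class KGWalkEnv:
    def __init__(self, edges, start, target, max_hops=3):
        # edges: dict mapping node -> list of (relation, neighbor) tuples
        self.edges = edges
        self.start, self.target, self.max_hops = start, target, max_hops

    def reset(self):
        self.node, self.hops = self.start, 0
        return self.node                        # observation = current entity

    def step(self, action_index):
        relation, neighbor = self.edges[self.node][action_index]
        self.node, self.hops = neighbor, self.hops + 1
        done = (self.node == self.target) or (self.hops >= self.max_hops)
        reward = 1.0 if self.node == self.target else 0.0   # simple reward rule
        return self.node, reward, done

# Random-policy rollout on a toy graph, just to show the interaction loop:
edges = {"A": [("knows", "B"), ("works_at", "C")],
         "B": [("lives_in", "D")],
         "C": [("located_in", "D")],
         "D": []}
env = KGWalkEnv(edges, start="A", target="D")
obs, done = env.reset(), False
while not done and env.edges[obs]:
    obs, reward, done = env.step(random.randrange(len(env.edges[obs])))
```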
IMO, if the structure of a KG is relatively static, then RL might not offer much advantage. However, if the use case is dynamic (new nodes and relations being added or removed over time), then RL could provide some advantages.
Yesterday, while watching the presentation, the first thing that came to mind was the "Competing Bandits in Matching Markets" paper (Lydia T. Liu, Horia Mania, Michael I. Jordan). We might be able to use it in this context, but maybe I'm just looking for an excuse to apply it!