- For Part 1 (3 Points): Using your own field of expertise, discuss one challenge presented by the complexities inherent in the gap between theory and practical implementation of solutions that benefit individuals, businesses, and society as a whole.
- For Part 2 (2 Points): After submitting your initial response, select a peer’s response. Write a comment or review on your peer’s post, exploring how to turn their challenge into an opportunity to contribute to a more informed and responsible use of data-driven technologies across various fields.
My first crack at this is based on the capstone project, as I don’t have professional experience in this domain. One challenge I’ve encountered in the field of knowledge graph (KG) development, particularly when moving from theory to practical implementation, is the difficulty of aligning complex domain-specific knowledge with existing public datasets. In theory, KGs are powerful tools that enhance interoperability and the ability to integrate disparate data sources. However, when building KGs for highly specialized fields, many entities and relationships are not well represented in widely used public databases like DBpedia or Wikipedia. This creates significant challenges in practical implementations, as it becomes difficult to connect named KGs (those specific to a particular domain or application) to a master KG or broader knowledge graph. Inconsistent naming conventions and entity misalignment often hinder seamless integration between these specialized KGs and the overarching master KG.
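For illustration, here is a minimal sketch (using rdflib) of the kind of alignment link a specialized, named KG needs back to a broader graph. The entity names and the `ess:` namespace are made up rather than taken from the actual project; only the DBpedia resource is real.

```python
# Minimal sketch (rdflib): aligning a domain-specific entity in a named KG
# with its counterpart in a broader graph via owl:sameAs. The entity names
# and the ess: namespace are illustrative.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDFS

ESS = Namespace("http://example.org/energy-storage/")  # hypothetical domain namespace
DBR = Namespace("http://dbpedia.org/resource/")

g = Graph()
g.bind("ess", ESS)
g.bind("owl", OWL)

# Entity as it exists in the specialized, named KG.
g.add((ESS.LiIonCell_A1, RDFS.label, Literal("Lithium-ion cell A1")))

# Explicit alignment to the broader ("master") KG entity, so tools that merge
# or federate the two graphs have a link to follow.
g.add((ESS.LiIonCell_A1, OWL.sameAs, DBR["Lithium-ion_battery"]))

print(g.serialize(format="turtle"))
```

When such links cannot be found or trusted, the naming and misalignment problems above are exactly what surfaces.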
Another challenge I encountered was with the limitations of existing frameworks in supporting reification (or nested triples) to the extent needed to represent complex relationships. During my work on an energy storage system KG development project, I found that most frameworks used for building KGs, including RDF and GraphDB, did not natively support this capability. As a workaround, I had to use blank nodes, but this reduced flexibility and limited one of the primary benefits of KGs—interoperability. This gap between theoretical potential and practical implementation is a significant hurdle when trying to develop solutions that can benefit businesses and society, particularly in specialized domains where comprehensive, standardized data sources are lacking.
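To make that workaround concrete, here is a minimal sketch (again rdflib, with made-up names rather than my actual project data) of the blank-node / n-ary relation pattern I mean: instead of nesting one triple inside another, an intermediate blank node carries the relationship plus its qualifiers.

```python
# Minimal sketch (rdflib) of the blank-node / n-ary relation workaround:
# an intermediate blank node carries the relationship plus its qualifiers.
# Names are illustrative.
from rdflib import BNode, Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

ESS = Namespace("http://example.org/energy-storage/")  # hypothetical namespace

g = Graph()
g.bind("ess", ESS)

measurement = BNode()  # stands in for the "statement about a statement"
g.add((ESS.Battery42, ESS.hasMeasurement, measurement))
g.add((measurement, RDF.type, ESS.CapacityMeasurement))
g.add((measurement, ESS.value, Literal("3.2", datatype=XSD.decimal)))
g.add((measurement, ESS.measuredOn, Literal("2024-01-15", datatype=XSD.date)))

print(g.serialize(format="turtle"))
```

The cost is exactly the one I ran into: blank nodes have no stable, shareable identifiers, which makes the data harder to link across graphs.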
Additionally, designing and implementing an effective ontology for such systems without domain expertise further complicates the practical application of knowledge graphs. In a real-world setting, successful KG deployment requires not only the technical tools but also a deep understanding of the domain to ensure accuracy and utility, which is often overlooked in theoretical models.
Getting past the boundary of an ontology into an instantiated KB requires good data security and privacy protections, but in a business operational context, it’s this instantiated layer that may generate the most insights. I’ve only been working with publicly available data. I’m looking forward to learning how to define a good boundary of what to instantiate and how much detail to annotate within an instance.
I expect that KGs for fintech and biosciences companies extract a ton of value at the instance level while handling equally rigorous privacy and security challenges, so I’m sure there are great examples out there.
The challenge I am facing is in my capstone project, “Hack Climate Models”. In climate science, as described, the theoretical models aim to create a centralized infrastructure that integrates diverse and dispersed datasets, which involves Entity Extraction and Knowledge Graph Construction. In detail, we need to identify climate models and datasets from unstructured sources using AI and then establish interconnections between these entities for meaningful insights (a rough sketch of this step follows the list below).
I feel the challenges are:
- Data Heterogeneity. Climate data comes in various formats and standards. However, I got inspiration from this paper: Automatic Sustainability Objective Detection in Heterogeneous Reports.
- Real-time Updating. Practical implementations must account for constant updates and the influx of new data.
- Usability. Is the resulting system easy for stakeholders to use?
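To make the extraction-and-linking step concrete, here is a rough sketch in which a toy gazetteer stands in for a real AI extractor; the model/dataset names, URIs, and the `coOccursWith` predicate are made up for illustration.

```python
# Rough sketch of the "extract entities, then link them" step. A toy
# gazetteer stands in for an AI-based extractor; names, URIs, and the
# coOccursWith predicate are illustrative.
import re
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

CLIMATE = Namespace("http://example.org/climate/")  # hypothetical namespace

KNOWN_ENTITIES = {
    "CMIP6": CLIMATE.ClimateModelEnsemble,
    "ERA5": CLIMATE.ReanalysisDataset,
}

text = "We compare CMIP6 simulations against the ERA5 reanalysis dataset."

g = Graph()
g.bind("climate", CLIMATE)

matched = []
for name, entity_class in KNOWN_ENTITIES.items():
    if re.search(rf"\b{re.escape(name)}\b", text):
        entity = CLIMATE[name]
        g.add((entity, RDF.type, entity_class))
        g.add((entity, RDFS.label, Literal(name)))
        matched.append(entity)

# Interconnect entities that co-occur in the same document so they can be
# queried together later.
for a in matched:
    for b in matched:
        if a != b:
            g.add((a, CLIMATE.coOccursWith, b))

print(g.serialize(format="turtle"))
```

A production pipeline would replace the gazetteer with a model fine-tuned on climate terminology and would need the updating and usability considerations listed above.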
Your reflections on the challenges in knowledge graph development are insightful. It seems that these challenges can actually be seen as opportunities to push the field forward in meaningful ways. For instance, the limitations you’ve experienced with frameworks for handling nested triples point to an opportunity to advance the technology itself. Developers may need to put this issue on the table.
Regarding your last concern, I feel that multidisciplinary research has advanced a lot these days; there are many multidisciplinary research projects, and researchers know each other well. For example, in my case, I have a computer science background, and after participating in an environmental project, I learned a lot from collaborators and expanded my research interests.
Your observations are highly relatable. Integrating heterogeneous data in an automated and real-time manner is indeed a significant challenge. The paper you shared, which proposes using graph transformer models and fine-tuning them, resonates with my experience. In my experiments following a similar approach, I also found it to be the most effective method so far.
On the note of usability, I believe this is where GraphRAGs can play a pivotal role. With AI-assisted chatbots and well-designed visualization tools, usability for stakeholders can be significantly enhanced. These tools make complex data interactions more intuitive, bridging the gap between technical solutions and practical applications. Very good insights and questions.
I completely agree that the lack of established frameworks and methods presents valuable opportunities for future development. However, as someone at the beginner level, I sometimes wonder if these challenges reflect actual gaps in the field or are simply a result of my current understanding. Either way, it’s an area that seems ripe for exploration and innovation.
As someone who works in open-source technology, a primary challenge is that open-source projects, while beneficial for implementation, often have a higher technical barrier to entry. They require the user to understand, and be able to troubleshoot, incomplete or overly technical documentation. For instance, as a graduate student I wanted to better understand how to map blue carbon efforts for a capstone project. This required a technical understanding of open-source software to map mangrove forests and produce the necessary calculations. While in theory this should be a low barrier to entry with appropriate datasets, R, and QGIS, documentation that was incomplete, or not both free and open source, did not provide a smooth path forward. I was able to understand the technologies and interpret my results, but when I ran into an error, I did not have the proficiency to troubleshoot it myself as a non-expert user. A framework that provides complete documentation, references, and training resources would allow for a more successful transition from theory to practice.
I think that you’ve articulated some real challenges in the climate space, and the usability for stakeholders is an important point. I think that the same framework for knowledge management of climate data models should evolve into data hubs with a UI that allows for stakeholders, politicians, and citizens alike to be able to access and interpret data. I think that even being able to visualize data from last week/month/year might help better “hook” stakeholders into the why of the data and promote efforts that might fund real time data.
One of the graph-analytical products I contributed to, which focused on measuring IT product delivery effectiveness and streamlining SDLC processes, utilized Neo4j Community Edition and open-source tools like Grafana. This platform replaced costly reporting tools, significantly reducing expenses and eliminating the need for manual reporting across teams.
One particularly challenging and less-discussed issue was the varying data accommodation periods across different SDLC tools, which made data correlation a tedious and manual process. I encountered this challenge while working on the aforementioned graph-analytical product and addressed it by designing and implementing fail-safe graph algorithms for data correlation and enrichment—an achievement I am proud of. Automating these processes minimized manual effort, enhanced data integrity, and enabled near real-time dashboards, delivering actionable insights with exceptional efficiency.
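For readers unfamiliar with this kind of correlation, here is a highly simplified sketch of the general idea, using the official neo4j Python driver; it is not the actual algorithms from the product, and the connection details, labels, and property names are hypothetical.

```python
# Hedged sketch (neo4j Python driver): correlating records from two SDLC
# tools on a shared key. NOT the product's actual algorithm; connection
# details, labels, and properties are hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CORRELATE = """
MATCH (c:Commit), (t:Ticket)
WHERE c.ticket_key = t.key            // shared key extracted from commit messages
MERGE (c)-[:IMPLEMENTS]->(t)          // MERGE is idempotent, so re-runs are safe
"""

with driver.session() as session:
    session.run(CORRELATE)
driver.close()
```

The real difficulty described above is that each tool retains and exposes data over a different window, so the correlation step also has to tolerate missing counterparts on either side.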
Agree with you, I’ve also faced challenges multiple times when setting up open-source software on my local machine. What I’ve learned from these experiences is to always look for well-maintained repositories and try to use them. Additionally, it’s helpful to check the “Issues” tab of the repository to see if your issue has already been addressed or to raise a new issue requesting assistance.
Yes, there is always room for exploration and innovation, and I believe there’s no right or wrong way when it comes to organizing data in graph models. I’ve worked on an ECM where we used a hybrid graph model that combined the best aspects of both LPG and RDF models. And on many occasions, I encountered situations where the existing metadata framework wasn’t adaptable to all types of data. In these cases, we had to fit the data to align with the constraints of that metadata framework, knowing that not all data could conform to every level of the framework.
For this question, I will also assume it refers more specifically to knowledge graph technology, even though that is not directly mentioned. I can discuss a few challenges relevant to implementation in knowledge graph building. A number of them involve teaching users how to formalize informal (natural language) image descriptions into graph language.
A major feature of the ImageSnippets system is that it contains a no-code interface for users to build named graphs from image descriptions linked to their images. Images and image regions are stored as subjects, and their descriptions are structured as linked data stored in a large triple-store graph written in the RDF syntax. The triple-editor provides a set of properties that can be used to relate the object entities to the subject. In general, users learn to construct triple statements in the form of annotations using a kind of hybrid RDF-English pidgin language. This work has been published academically in a few places.
There is a lot of value and utility in replacing what would normally be keywords with linked data entities and the triple-editor hides the RDF syntax from the user, but that doesn’t mean it can’t sometimes be difficult to conceptualize how best to model an image description in structured language. When formalizing informal language, it can be quite difficult to decide how to structure it in workable RDF.
For example, it is one thing to say that an image depicts a house; this is pretty straightforward. But it is harder, in triples and with RDF, to say something like: this image depicts a house lived in by Abraham Lincoln in 1849. Part of the reason this is hard is because of a well-known problem in RDF called reification. With reification, you are trying to write RDF statements about RDF statements. In fact, reification is still a problem that is being worked on by RDF working groups. One of the things that makes it even harder is that it is difficult to know exactly what you are reifying and how to notate the statements that are being reified.
An example of this is that you can either be qualifying the contents of a statement with more metadata, OR describing something about the statement itself.
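To illustrate, here is a rough sketch of what reifying the Lincoln statement can look like, using rdflib and the classic rdf:Statement vocabulary. The URIs are made up, and this is not how ImageSnippets itself models descriptions; it only shows why the simple case is easy and the qualified case is not.

```python
# Illustrative sketch (rdflib) of classic RDF reification for the Lincoln
# example: a statement about the statement "this house was lived in by
# Abraham Lincoln", qualified with a year. URIs are made up; this is not
# ImageSnippets' own modeling.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/")
DBR = Namespace("http://dbpedia.org/resource/")

g = Graph()
g.bind("ex", EX)

# The simple case: the image depicts a house.
g.add((EX.image1, EX.depicts, EX.house1))

# The harder case: saying something about the "lived in by Lincoln" statement.
stmt = EX.livedInStatement
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.house1))
g.add((stmt, RDF.predicate, EX.livedInBy))
g.add((stmt, RDF.object, DBR["Abraham_Lincoln"]))
g.add((stmt, EX.year, Literal("1849", datatype=XSD.gYear)))

print(g.serialize(format="turtle"))
```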
We cannot actually use reification in ImageSnippets, which is one thing that makes formalization harder, but even without that consideration, it can also be difficult to understand when you are talking about classes or prototypical instances of things in those classes. For example, am I talking about all kitchens (the class of kitchens), a typical example of a kitchen in a particular culture, or a specific person’s kitchen, i.e., Margaret’s kitchen in her house in Florida? All of these statements need to be formalized in different ways in RDF.
Understanding the nuances of triple construction is a type of data literacy that will be important regardless of whether you are building a personal knowledge graph or a graph for any other kind of domain.
Fortunately, as a company, we have done pretty well in training a number of users in how to construct the basic RDF statements using the ImageSnippets triple editor. In the system we also added a no-code entity search function which can be used quite effectively in experimental ways to help understand and validate the data modeling choices made by those who are building triple statement descriptions.
I can really relate to your frustrations. I also mentioned reification (nested triples) in my answer regarding triple construction in RDF frameworks. You also make some other good points, extended the conversation into blank nodes and interoperability, and raised questions about finding usable entities. I had not elaborated on the challenges of finding usable entities in linked data datasets, but it is a huge consideration in my use case as well. When our users try to describe images in a formalized structure, we have a look-up feature that is very handy for finding entities for object values in any number of datasets like DBpedia and Wikidata. Our look-up system helps a lot with the process of finding usable entities. Many of these entities have been obtained from Wikipedia or constructed by various organizations. We often find entities not present in DBpedia or Wikidata in a dataset like the Getty Art and Architecture Thesaurus. That being said, when a user is trying to describe entities related to local instances, for example local place names, specialty equipment, or domain-specific things, those entities need to be created. We do give users a way to create entities when appropriate entities cannot be found. In many ways, this allows users to build ontologies using the images as the source material, but the entire process of modeling a domain, even with useful tools and techniques for finding usable entities, can be time-consuming. Despite a process that can sometimes be a struggle, there are many use cases in which the value to a domain can ultimately overcome the challenges.
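To give a sense of what such a look-up involves, here is a simplified sketch of finding candidate entities by label against the public DBpedia SPARQL endpoint, using SPARQLWrapper. This is not our actual look-up implementation, just the general idea.

```python
# Simplified sketch: look up candidate entities by label on the public
# DBpedia SPARQL endpoint. Not ImageSnippets' actual look-up feature.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT ?entity WHERE {
        ?entity rdfs:label "Abraham Lincoln"@en .
    } LIMIT 10
""")

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["entity"]["value"])  # candidate URIs to offer the user
```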
This is a great topic, and I agree there is an opportunity to apply knowledge graphs and GraphRAG to increase the accessibility, usefulness, and helpfulness of open-source tools. Much of open source originates in research, so I think a big challenge is that so many individuals and teams work in silos. I think international organizations have a big role to play here, especially when it comes to creating standards and tools related to climate change and global development. I worked with an FAO taxonomy/‘thesaurus’, for example. But in the case of Hugging Face, Python, R, and QGIS, the ‘semantic bar’ for publishing a new library is very low. Generating the kind of information that would make a library more accessible is not yet required.
@Lu I appreciate your insight into multi-disciplinary exchanges among CS and other fields. As a practitioner (software engineer) with no formal CS education, I used to struggle to keep up with the field, and I was always more comfortable with the humanities, arts, and other sciences. Now I find that LinkedIn is a great place to keep up with CS experts as well as content managers and other KG practitioners.
@mmw I agree that thinking in graphs is a type of literacy that is critical for success in any knowledge engineering project.
Builders face conceptual and practical hurdles, e.g., shifting beyond OOP and RDBMS, acquiring new skills, and demonstrating accomplishments (or explaining challenges and failures).
People with little or no understanding of the KG paradigm may be the ones who set expectations and measure progress. Stakeholders and explorers need to understand, at minimum, the reasons for investing in KG technology.
Our problems as KG builders aren’t unique - developers in every field have similar issues - but our teammates and managers may not have the fundamental knowledge and experience to collaborate with us. There’s a need and (I hope) demand for mentors and facilitators.
I’ve worked in banking, insurance, and healthcare using knowledge graphs and semantic web software. In these fields, the gap between the theory and practice of data-driven solutions can be traced to many factors, most of which are not unique to knowledge graph applications.
A major factor slowing down the adoption of KGs is the tech communities’ lack of widespread experience and adoption.
(Yes, this is a circular argument. It’s a positive feedback loop: lack of adoption over there affects consideration here.)
Even when KG projects begin, we struggle to reach a consensus on whether to consider RDF or property graph solutions. Property graph technology has fared better in the market recently than RDF. This trend may be due to a (possibly misleading) similarity between property graphs and more widespread tech, especially relational data models and object-oriented programming.
Explorers and builders (buyers and developers) may struggle to understand RDF and related standards. Other vendors dwarf the vendors of RDF stores and OWL IDEs.
Winds of change blowing in “our” direction:
- Samsung’s recent acquisition of Oxford Semantic Technologies promises to put personal KGs in mobile devices. Will Apple take note?
- Microsoft GraphRAG is shining a light (indirectly) on KG solutions. Large enterprises and software companies are paying attention.
- Amazon Neptune supports RDF natively. AWS shops are trying and buying.