Created by Hevok on Dec. 16, 2012, 1:06 a.m.
Denigma is constructed to break the genetic code of life and thereby pave the way to finding effective interventions that make aging negligible. The vast amount of biological data, however, is hidden in the scientific literature and inaccessible to computation.
The concept is based on the assumption that it is possible to describe what a scientist has been working on in order to support collaboration. Theoretically this can be achieved by:
The first question that arises is how to map the documents she/he reads to ontology terms. The solution is to use document-to-data-entry similarity for the mapping.
The second question is how to aggregate these mappings into a shorter list. The answer is to use a spreading activation algorithm for aggregation.
Denigma data entries and categories are used as ontology terms. Categories, as generalized concepts, are themselves defined by data entries.
What a certain document is about can be approached in two ways:
The first approach is flexible and does not require creating and maintaining an ontology, while the second approach can tie documents to a rich knowledge base and make them accessible for computation.
Using Denigma's data entries as an ontology offers the best of both approaches.
Each data entry is a concept in the ontology.
Terms are linked via Denigma's tag, category and hierarchy system as well as by inter-data entry links and data entry relations.
It is a consensus ontology created, kept current and maintained by a diverse community. The overall content quality is high. Terms have unique IDs (URLs) and are "self-describing" for people as well as machines. The underlying graphs provide the structure of data entry tags, categories, hierarchy, links and relations.
The Denigma data entry graph is a thesaurus. The graph composed of data entry links is similar to the world-wide-web network, but highly systematic and structured in a unified fashion (i.e. easily accessible for computation).
The goal is: given one or more documents, compute a ranked list of the top N data entries and/or categories that describe it.
The basic metric is document similarity between data entries and the document(s).
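As a minimal sketch of this metric, the following computes cosine similarity over simple bag-of-words term-frequency vectors and ranks data entries against a query document. The function names and the plain term-frequency weighting are illustrative assumptions; a real implementation might use TF-IDF.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_entries(document, entries, top_n=5):
    """Rank data entries (name -> text) by similarity to the query document."""
    scored = [(name, cosine_similarity(document, text))
              for name, text in entries.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]
```

The top-ranked entries then serve as the seed activations for the aggregation step.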
Variants to explore are the following:
Associative retrieval means that it is possible to retrieve relevant documents if they are associated with other documents that have been considered relevant by the user.
The documents can be represented as nodes and their associations as links in a network. At each pulse/iteration, activation spreads to adjacent nodes; some nodes will end up with higher activation than others.
The constraints are:
The first method is to use Denigma data entry text and categories to predict concepts:
The output is a ranked list of categories:
The second method is similar to the first but uses spreading activation on the category links graph to get aggregated concepts. The output is a list of ranked concepts based on the final activation score.
It is possible to predict concepts that are NOT present in the category hierarchy by using the data entries themselves. For this, use spreading activation on the data entry links graph.
As a threshold, ignore spreading activation to articles with a cosine similarity score below 0.4. The edge weights are the cosine similarities between linked articles. The output is a list of ranked concepts based on the final activation score.
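A sketch of this aggregation step, assuming the link graph and seed activations are given as plain dictionaries. The decay factor applied per pulse is an assumption (the text does not specify one); the threshold and pulse count follow the values given above.

```python
def spread_activation(graph, seeds, pulses=2, threshold=0.4, decay=0.5):
    """Spread activation over a weighted data-entry link graph.

    graph: {node: [(neighbor, weight), ...]}, where weight is the cosine
           similarity between the linked entries.
    seeds: {node: initial_activation} from document-to-entry similarity.
    Edges with weight below `threshold` are ignored.
    """
    activation = dict(seeds)
    for _ in range(pulses):
        pulse = {}
        for node, act in activation.items():
            for neighbor, weight in graph.get(node, []):
                if weight < threshold:
                    continue  # ignore weakly similar links
                pulse[neighbor] = pulse.get(neighbor, 0.0) + act * weight * decay
        for node, extra in pulse.items():
            activation[node] = activation.get(node, 0.0) + extra
    # Ranked concepts based on final activation score
    return sorted(activation.items(), key=lambda pair: pair[1], reverse=True)
```

More pulses push activation further out into the graph, which is consistent with the observation below that more pulses yield more generalized concepts.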
For an initial informal evaluation, the results are compared against our own judgments: download scientific articles from the Internet and predict concepts, first for single documents and then for groups of related documents.
For a single document, in general, more pulses lead to more generalized concepts.
For the prediction of a set of test documents (e.g. data entries), concepts can be discovered that are not in the category hierarchy.
Select data entries randomly and predict their categories, links, and relations:
Query doc(s) -similar to (Cosine similarity)-> Average Similarity
It is observed that data entries are often linked to both super- and sub-categories.
If the system predicts a category up to three levels higher in the hierarchy than the original category, the prediction is considered correct.
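This scoring rule can be sketched as a walk up the category tree. The hierarchy and category names below are made-up placeholders, not actual Denigma content:

```python
def within_levels(hierarchy, predicted, actual, max_levels=3):
    """True if `predicted` equals `actual` or is an ancestor of it
    at most `max_levels` steps up the category hierarchy.

    hierarchy: {category: parent_category} mapping (a tree of categories).
    """
    node = actual
    for _ in range(max_levels + 1):
        if node == predicted:
            return True
        node = hierarchy.get(node)  # climb one level; None at the root
        if node is None:
            return False
    return False
```

Under this rule, predicting a grandparent category for an entry still counts as a correct prediction, while a category four or more levels up does not.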
Spreading activation with two pulses works best. Only considering data entries with similarity > 0.5 is a good threshold.
Spreading activation with one pulse works best, and again only considering data entries with similarity > 0.5 is a good threshold.
The prediction accuracy is affected by three issues:
Therefore two possible solutions are suggested:
There are two immediately obvious applications for the described approach:
The links in Denigma can be classified with machine learning techniques in order to:
To speed up the computation to an execution time of a few seconds, heterogeneous parallel programming on multiple processors/clusters shall be exploited.
The data entry corpus and ontology should be redefined.
Lastly, the gap between Denigma and formal ontologies needs to be bridged.
Document expansion with Denigma-derived ontology terms.
The fundamental data unit (the data entry) can be used to describe documents via different methods employing the data entry text, tags, categories, links and relations. The average similarity should be used to judge the accuracy of a prediction. The method is easily extendable to other data units.
Ontology deals with what is real in the World. The basic Question therefore is: what really exists, and what can be said to exist? This is a question of general Metaphysics in Philosophy. It stands in contrast to Epistemology, which deals only with the things of our Perception: what we see, what we hear, and so on. We can only experience the world through our Perceptions, but our Perceptions sometimes betray us, so one has to know what is real in the world, i.e. what is True. To define what is really True, independent of our Perception, is what Ontology was originally intended to do.
An Ontology is an explicit, formal specification of a shared conceptualization. The Term is borrowed from Philosophy, where an Ontology is a systematic account of Existence. For Artificial Intelligence Systems, what "exists" is that which can be represented.
A Conceptualization is nothing other than a Model: one tries to form a model of the domain one is talking about. Inside this domain one tries to identify the relevant Concepts and how these Concepts are related to each other. This model (i.e. the conceptualization) has to be explicit, which means the Meanings of all Concepts have to be defined; nothing may be left out. This Definition must be formal, which means it must be Machine-Understandable, not merely Machine-Readable: it must also be interpreted correctly, for only reading and interpreting it correctly amounts to understanding it. Most importantly, the things one refers to must be shared among the communication partners, so this conceptualization must be a shared conceptualization; there must be consensus about the Ontology. This is required, as otherwise one cannot communicate.
For Communication the Semantic Triangle applies. In Language one has a Symbol that stands for a certain Object. However, language is ambiguous: a term might have multiple Meanings. One can only communicate with others if two or more communication partners apply a shared Concept (i.e. the same Concept). Then communication and understanding are possible.
Ontology is the most critical enabling Technology in Semantic Web Applications. Basically, an Ontology describes Terms and the Types of Relationships between Pairs of Terms. Such an Ontology can be expressed/represented by a List of Tuples of the form (term x, relationship, term y), for instance (cat, is-a, mammal).
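A minimal sketch of such a tuple representation (the terms and relations below are illustrative placeholders, not actual Denigma content):

```python
# An ontology as a list of (term x, relationship, term y) tuples.
ontology = [
    ("senescence", "is-a", "aging process"),
    ("aging process", "is-a", "biological process"),
    ("caloric restriction", "slows", "aging process"),
]

def related(term, relation, facts):
    """All terms y such that (term, relation, y) is in the fact list."""
    return [y for x, r, y in facts if x == term and r == relation]
```

Even this flat list supports simple reasoning, e.g. following "is-a" chains to reach more general concepts.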
Normally an Ontology is developed by a small Group of Experts. However, this Approach does not scale with the ever-increasing amount of Information. Specifically, Experts have difficulty keeping up with Advances in Knowledge in the open, dynamic World Wide Web Environment. Crowdsourcing has the potential to be the most influential way to solve the problem of Ontology Development, by outsourcing a Task traditionally done by Experts to non-Experts (typically a large Group of People) in the form of an open call (the Call of Duty).
An expression is correct if the majority of the users agree on it.
An Ontology is a Data Model that represents a Domain and is utilized to reason about both the Objects in this Domain and the Relations between them. Applications of Ontologies include Artificial Intelligence, the Semantic Web, Software Engineering and Information Architecture, where they are used as a form of Knowledge Representation about the World. An Ontology can also be understood as a set of Definitions of a formal Vocabulary, with huge potential in Information Technology.
An Ontology is the Semantic Skeleton of a Domain, i.e. it defines what we can have in Annotations. It defines Categories, Properties, Rules, etc. We can have many Ontologies to describe one Domain. It is only an Idea. We are not there yet.