Created on Dec. 16, 2012, 1:01 a.m. by Hevok & updated by Hevok on May 2, 2013, 5:01 p.m.
Denigma is constructed to break the genetic code of life and thereby pave the way to effective interventions that make aging negligible. However, the vast amount of biological data is hidden in the scientific literature and inaccessible to computation.
The identification of topics and concepts associated with a document or collection of documents is a common task for Denigma and can help in:
The concept is based on the assumption that it is possible to describe what a scientist has been working on in order to support collaboration. Theoretically this can be achieved by:
The first question that arises is how to map the documents she/he reads to the ontology terms. The solution is to use document-to-data-entry similarity for the mapping.
The second question is how to aggregate the mapped terms into a shorter list. The answer is to use a spreading activation algorithm for aggregation.
Denigma data entries and categories are used as ontology terms. Categories, as generalized concepts, are themselves defined by data entries.
What a certain document is about can be approached in two ways:
The first approach is flexible and does not require creating and maintaining an ontology, while the second approach can tie documents to a rich knowledge base and make it accessible for computation.
Using Denigma's data entries as an ontology offers the best of both approaches.
Each data entry is a concept in the ontology.
Terms are linked via Denigma's tag, category and hierarchy system as well as by inter-data entry links and data entry relations.
It is a consensus ontology created, kept current and maintained by a diverse community. The overall content quality is high. Terms have unique IDs (URLs) and are "self-describing" for people as well as machines. The underlying graphs provide the structure of data entry tags, categories, hierarchy, links and relations.
The Denigma data entry graph is a thesaurus. The graph composed of data entry links is similar to the World Wide Web network, but highly systematic and structured in a unified fashion (i.e. easily accessible for computation).
The goal is, given one or more documents, to compute a ranked list of the top N data entries and/or categories that describe them.
The basic metric is the document similarity between data entries and the document(s).
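As a rough illustration of this metric, the following minimal sketch ranks data entries against a query document by cosine similarity over plain term counts. The tokenizer, the function names and the three example entries are assumptions made for illustration; a real setup would more likely use TF-IDF weighting and the actual Denigma data entry texts.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lower-case bag-of-words term counts (a deliberately simple tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_entries(document, entries, top_n=3):
    """Return the top_n (title, score) pairs, most similar first."""
    doc_vec = term_vector(document)
    scored = [(title, cosine_similarity(doc_vec, term_vector(text)))
              for title, text in entries.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical data entries, invented for illustration only.
entries = {
    "Dietary restriction": "dietary restriction extends lifespan in yeast worms and mice",
    "Telomere": "telomere shortening limits cell division and replicative lifespan",
    "Ontology": "an ontology defines concepts and the relations between them",
}
ranking = rank_entries("Reduced food intake extends lifespan in mice", entries, top_n=2)
```

Here `ranking` holds the two best-matching entry titles with their scores, which corresponds to the "top N" list named in the goal.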
Variants to explore are the following:
Associative retrieval means that it is possible to retrieve relevant documents if they are associated with other documents that have been considered relevant by the user.
Documents can be represented as nodes and their associations as links in a network. At each pulse (iteration), activation spreads to adjacent nodes, so that some nodes end up with higher activation than others.
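The pulse mechanism can be sketched as follows. This is one simple variant of spreading activation, not necessarily the one used in Denigma: the `decay` factor, the equal split of activation among neighbours and the node names are all assumptions chosen for illustration.

```python
from collections import defaultdict


def spread_activation(graph, seeds, pulses=2, decay=0.8):
    """Propagate activation from seed nodes over a number of pulses.

    graph: dict mapping node -> list of adjacent nodes
    seeds: dict mapping node -> initial activation
    decay: fraction of activation passed on per pulse (assumed value)
    """
    activation = defaultdict(float, seeds)
    frontier = dict(seeds)  # activation that has not been passed on yet
    for _ in range(pulses):
        next_frontier = defaultdict(float)
        for node, value in frontier.items():
            neighbours = graph.get(node, [])
            if not neighbours:
                continue
            share = decay * value / len(neighbours)
            for neighbour in neighbours:
                next_frontier[neighbour] += share
        for node, value in next_frontier.items():
            activation[node] += value
        frontier = next_frontier
    return sorted(activation.items(), key=lambda item: item[1], reverse=True)


# Hypothetical network: two documents point to a shared concept,
# which in turn points to a more general one.
graph = {"doc A": ["Longevity"], "doc B": ["Longevity"], "Longevity": ["Aging"]}
ranked = spread_activation(graph, {"doc A": 1.0, "doc B": 0.8}, pulses=2)
```

With two pulses the shared node "Longevity" collects activation from both documents and passes part of it on to "Aging", so both end up ranked above the seed documents.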
The constraints are:
The first method is to use Denigma data entry text and categories to predict concepts:
Input Query doc(s) -similar to (Cosine similarity)-> Similar data entries -> Denigma category graph
The output is a ranked list of categories:
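As a rough illustration of the first method, the sketch below turns the similarity scores of the matching data entries into a ranked category list. Scoring a category by the sum of the similarities of its entries is an assumption, since the text does not state the ranking rule; the entry names, categories and scores are likewise hypothetical.

```python
from collections import defaultdict


def rank_categories(similarities, entry_categories, top_n=5):
    """Sum the similarity of each matching data entry into its categories.

    similarities: dict mapping data entry -> cosine similarity to the query
    entry_categories: dict mapping data entry -> list of its categories
    """
    scores = defaultdict(float)
    for entry, similarity in similarities.items():
        for category in entry_categories.get(entry, ()):
            scores[category] += similarity
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Hypothetical similarity scores and category assignments.
similarities = {"Dietary restriction": 0.6, "Rapamycin": 0.5, "Telomere": 0.2}
entry_categories = {
    "Dietary restriction": ["Intervention", "Nutrition"],
    "Rapamycin": ["Intervention"],
    "Telomere": ["Chromosome"],
}
ranked_categories = rank_categories(similarities, entry_categories)
```

A category shared by several similar entries accumulates their scores and therefore rises in the list.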
The second method is similar to the first but uses spreading activation on the category link graph to obtain aggregated concepts. The output is a list of concepts ranked by final activation score.
It is possible to predict concepts that are NOT present in the category hierarchy by using the data entries themselves as concepts. For this, use spreading activation on the data entry link graph.
As a threshold, activation is not spread to articles with a cosine similarity score below 0.4. The edge weights are the cosine similarities between linked articles. The output is a list of concepts ranked by final activation score.
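The thresholding and edge weighting can be sketched as below. It is assumed here that the 0.4 threshold applies to the similarity between the two linked articles (i.e. to the edge weight); the text could also be read as a threshold on similarity to the query. The function names and the example links are hypothetical.

```python
def build_weighted_graph(links, similarity, threshold=0.4):
    """Keep only links whose cosine similarity reaches the threshold.

    links: iterable of (source, target) data entry pairs
    similarity: dict mapping (source, target) -> cosine similarity
    Returns a dict mapping source -> {target: weight}.
    """
    graph = {}
    for source, target in links:
        weight = similarity.get((source, target), 0.0)
        if weight < threshold:
            continue  # activation is not spread along this link
        graph.setdefault(source, {})[target] = weight
    return graph


def pulse(graph, activation):
    """One pulse: each node passes activation * edge weight to its targets."""
    updated = dict(activation)
    for source, value in activation.items():
        for target, weight in graph.get(source, {}).items():
            updated[target] = updated.get(target, 0.0) + value * weight
    return updated


# Hypothetical links between data entries and their pairwise similarities.
links = [("A", "B"), ("A", "C"), ("B", "C")]
similarity = {("A", "B"): 0.7, ("A", "C"): 0.3, ("B", "C"): 0.5}
weighted = build_weighted_graph(links, similarity)
after_one_pulse = pulse(weighted, {"A": 1.0})
```

The link from A to C falls below the threshold and is dropped, so after one pulse only B receives activation from A.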
In an initial informal evaluation the results are compared against our own judgments. Scientific articles are downloaded from the internet and their concepts are predicted. Both single documents and groups of related documents are used.
For a single document, in general more pulses lead to more generalized concepts.
For the prediction of a set of test documents (e.g. data entries) concepts can be discovered that are not in the category hierarchy.
Select data entries randomly and predict their categories, links, and relations:
Query doc(s) -similar to (Cosine similarity)-> Average Similarity
It is observed that data entries are often linked with both super- and subcategories.
If the system predicts a category three levels higher in the hierarchy than the original category, the prediction is considered to be correct.
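This correctness rule can be sketched as a small check over the category hierarchy. The sketch assumes that "three levels higher" means up to three levels (so an exact match or a closer ancestor also counts) and that the hierarchy is available as a child-to-parents mapping; the category names are invented for illustration.

```python
def ancestors_within(category, parents, max_levels=3):
    """Collect categories reachable by climbing up to max_levels parent links."""
    found = set()
    frontier = {category}
    for _ in range(max_levels):
        frontier = {p for c in frontier for p in parents.get(c, ())} - found
        found |= frontier
    return found


def is_correct(predicted, original, parents, max_levels=3):
    """A prediction counts if it equals the original or is a close ancestor."""
    return predicted == original or predicted in ancestors_within(
        original, parents, max_levels)


# Hypothetical category hierarchy (child -> list of parent categories).
parents = {
    "Sirtuin": ["Deacetylase"],
    "Deacetylase": ["Enzyme"],
    "Enzyme": ["Protein"],
    "Protein": ["Molecule"],
}
results = {
    predicted: is_correct(predicted, "Sirtuin", parents)
    for predicted in ["Sirtuin", "Protein", "Molecule", "Telomere"]
}
```

In this example "Protein" lies three levels above "Sirtuin" and is accepted, while "Molecule" (four levels up) and the unrelated "Telomere" are rejected.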
Spreading activation with two pulses works best. Only considering data entries with similarity > 0.5 is a good threshold.
Spreading activation with one pulse works best, and again only considering data entries with similarity > 0.5 is a good threshold.
The prediction accuracy is affected by three issues:
Therefore two possible solutions are suggested:
There are two immediately obvious applications for the described approach:
The links in Denigma can be classified with machine learning techniques in order to:
To bring the computation down to an execution time of a few seconds, heterogeneous parallel programming on multiple processors or clusters should be exploited.
The data entry corpus and ontology should be redefined.
Lastly, the gap between Denigma and formal ontologies needs to be bridged.
Document expansion with Denigma derived ontology terms.
The fundamental data unit (data entry) can be used to describe documents, with different methods employing the data entry text, tags, categories, links and relations. The average similarity should be used to judge the accuracy of prediction. The method is easily extendable to other data units.
Data Mining Protocol -belongs to-> Ontology