Denigma: Data Mining Protocol

Created on Dec. 16, 2012, 1:01 a.m. by Hevok & updated by Hevok on May 2, 2013, 5:01 p.m.

Denigma is built to decipher the genetic code of life and thereby pave the way to effective interventions that make aging negligible. Much of the relevant biological data, however, is buried in the scientific literature and inaccessible to computation.

Introduction

The identification of topics and concepts associated with a document or collection of documents is a common task for Denigma and can help in:

Annotation and categorization of documents in a corpus of scientific literature
Modelling biological processes related to aging
Artificial intelligence
Selecting effective anti-aging interventions

Concept

The concept is based on the assumption that it is possible to describe what a scientist has been working on, in order to support collaboration. In theory this can be achieved in three steps:

Track the documents she/he reads
Map these documents to terms in an ontology
Aggregate the terms to produce a short list of topics

The first question that arises is how to map the documents she/he reads to ontology terms. The solution is to use the similarity between documents and data entries for the mapping.

The second question is how to aggregate the terms into a shorter list. The answer is to use a spreading activation algorithm for the aggregation.

Approach

Denigma data entries and categories are used as ontology terms. Categories, as generalized concepts, are themselves defined by data entries.

What a certain document is about can be approached in two ways:

Statistically: Select the words and phrases that characterize the document using TF-IDF
Controlled Vocabulary / Ontology: Map a document to a list of terms from a controlled vocabulary

The first approach is flexible and does not require creating and maintaining an ontology, while the second approach can tie documents to a rich knowledge base and make them accessible for computation.
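
The statistical approach can be illustrated with a minimal sketch. It assumes a simple regular-expression tokenizer and one common TF-IDF variant (relative term frequency times log(N / document frequency)); the function names tokenize, tf_idf and top_terms are hypothetical and not part of Denigma.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(docs):
    """Return one {term: weight} dictionary per document in `docs`."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def top_terms(vector, n=3):
    """Return the n highest-weighted terms (ties broken alphabetically)."""
    ranked = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]
```

With this weighting, a term that occurs in every document of the corpus receives a weight of zero, so only terms that distinguish a document from the rest rank highly.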

Using Denigma's data entries as an ontology offers the best of both approaches.

Each data entry is a concept in the ontology.

Terms are linked via Denigma's tag, category and hierarchy system as well as by inter-data entry links and data entry relations.

It is a consensus ontology created, kept current and maintained by a diverse community. The overall content quality is high. Terms have unique IDs (URLs) and are "self-describing" for people as well as machines. The underlying graphs provide the structure of data entry tags, categories, hierarchy, links and relations.

Data Graph

The Denigma data entry graph is a thesaurus. The graph composed of data entry links is similar to the World Wide Web, but it is highly systematic and structured in a unified fashion (i.e. easily accessible for computation).

Methods

The goal is, given one or more documents, to compute a ranked list of the top N data entries and/or categories that describe them.

The basic metric is document similarity between data entries and document(s). Variants to explore are the following:

Role of categories
Eliminating uninteresting data entries
Use of spreading activation
Using similarity scores for weighting links
Number of spreading activation pulses
Individual or set of query documents, etc.
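
The basic metric can be sketched as follows. This is an illustration under simplifying assumptions: documents are represented as raw word counts (weighted vectors could be used instead), and the names bag_of_words, cosine_similarity and rank_entries as well as the `entries` mapping are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a text as a {word: count} vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_entries(query_text, entries, top_n=5):
    """Rank data entries by similarity to the query document.

    `entries` maps an entry title to its text (assumed layout).
    Returns the top_n (title, score) pairs, best first.
    """
    query = bag_of_words(query_text)
    scored = [
        (title, cosine_similarity(query, bag_of_words(text)))
        for title, text in entries.items()
    ]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]
```

For a set of query documents, one option is to concatenate their texts before calling rank_entries; another is to rank each document separately and merge the scores.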

Spreading Activation

Associative retrieval means that it is possible to retrieve relevant documents if they are associated with other documents that have been considered relevant by the user.

Documents can be represented as nodes and their associations as links in a network. At each pulse (iteration), activation spreads to adjacent nodes, so that some nodes end up with higher activation than others.

The constraints are:

Distance
Fan out
Path constraints
Activation threshold
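
A minimal sketch of spreading activation with the four constraints listed above. The concrete choices are assumptions made for illustration: distance is limited by the number of pulses, hub nodes with more than max_fan_out links do not fire, a path constraint is modelled as a set of allowed link types, and a decay factor of 0.5 is applied per pulse. The function name and graph layout are hypothetical.

```python
def spread_activation(graph, seeds, pulses=2, decay=0.5, threshold=0.1,
                      max_fan_out=10, allowed_types=None):
    """Spread activation from seed nodes through a typed link graph.

    graph: {node: [(neighbour, link_type), ...]}
    seeds: {node: initial activation}
    Returns {node: accumulated activation}.
    """
    activation = dict(seeds)
    firing = dict(seeds)  # activation received in the previous pulse
    for _ in range(pulses):  # distance constraint
        incoming = {}
        for node, value in firing.items():
            if value < threshold:  # activation threshold
                continue
            links = [
                neighbour
                for neighbour, link_type in graph.get(node, [])
                if allowed_types is None or link_type in allowed_types
            ]  # path constraint
            if not links or len(links) > max_fan_out:  # fan-out constraint
                continue
            share = value * decay / len(links)
            for neighbour in links:
                incoming[neighbour] = incoming.get(neighbour, 0.0) + share
        for node, value in incoming.items():
            activation[node] = activation.get(node, 0.0) + value
        firing = incoming
    return activation
```

Dividing the outgoing activation by the number of links means that a node with many links passes less activation to each neighbour than a node with few links.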

1. Method: Ranking Categories Directly

The first method is to use Denigma data entry text and categories to predict concepts:

Input Query doc(s) -similar to (Cosine similarity)-> Similar data entries -> Denigma category graph

The output is a list of ranked categories, where the ranking is based on:

Links
Cosine similarity
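
The first method can be sketched as below. The sketch assumes that ranking by "links" means counting how many of the similar data entries belong to a category, and that ranking by cosine similarity means summing their similarity scores per category; both readings, as well as the function name and input layout, are assumptions.

```python
from collections import defaultdict


def rank_categories(similar_entries, entry_categories):
    """Rank categories of the data entries most similar to a query.

    similar_entries: {entry: cosine similarity to the query}
    entry_categories: {entry: [category, ...]}
    Returns two lists of categories, best first:
    one ranked by link count, one ranked by summed similarity.
    """
    link_count = defaultdict(int)
    score_sum = defaultdict(float)
    for entry, score in similar_entries.items():
        for category in entry_categories.get(entry, []):
            link_count[category] += 1
            score_sum[category] += score
    by_links = sorted(link_count, key=lambda c: (-link_count[c], c))
    by_similarity = sorted(score_sum, key=lambda c: (-score_sum[c], c))
    return by_links, by_similarity
```

The two rankings can disagree: a category supported by many weakly similar entries ranks high by link count, while a category supported by one highly similar entry ranks high by summed similarity.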

2. Method: Spreading Activation on Category Links Graph

The second method is similar to the first, but uses spreading activation on the category links graph to obtain aggregated concepts. The output is a list of concepts ranked by their final activation score.

3. Method: Spreading Activation on Entry Links Graph

It is possible to predict concepts that are NOT present in the category hierarchy by using data entries as concepts. For this, use spreading activation on the data entry links graph.

As a threshold, activation is not spread to articles with a cosine similarity score below 0.4. The edge weights are the cosine similarities between linked articles. The output is a list of concepts ranked by their final activation score.
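
A minimal sketch of the third method, assuming the link graph is stored as a nested dictionary of precomputed cosine similarities and that the seed activations are the similarity scores of the entries matched to the query. The function name and data layout are hypothetical; only the 0.4 threshold and the use of similarity as edge weight come from the description above.

```python
def spread_on_entry_links(links, seeds, pulses=1, min_similarity=0.4):
    """Weighted spreading activation on the data entry links graph.

    links: {entry: {linked_entry: cosine similarity between the two}}
    seeds: {entry: initial activation}
    Returns (entry, activation) pairs, highest activation first.
    """
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(pulses):
        incoming = {}
        for entry, value in frontier.items():
            for neighbour, similarity in links.get(entry, {}).items():
                if similarity < min_similarity:
                    continue  # do not spread along weakly similar links
                # The edge weight scales the activation passed on.
                incoming[neighbour] = incoming.get(neighbour, 0.0) + value * similarity
        for entry, value in incoming.items():
            activation[entry] = activation.get(entry, 0.0) + value
        frontier = incoming
    return sorted(activation.items(), key=lambda kv: (-kv[1], kv[0]))
```

The threshold keeps links to weakly related entries, such as announcements or bare term definitions, from receiving activation.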

Evaluation

In an initial informal evaluation, the results are compared against our own judgments: scientific articles are downloaded from the internet and their concepts are predicted, using both single documents and groups of related documents.

For a single document, more pulses in general lead to more generalized concepts.

For the prediction of a set of test documents (e.g. data entries), concepts can be discovered that are not in the category hierarchy.

Select data entries randomly and predict their categories, links, and relations:

Query doc(s) -similar to (Cosine similarity)-> Average Similarity

It is observed that data entries are often linked with both super- and subcategories.

If the system predicts a category up to three levels higher in the hierarchy than the original category, the prediction is considered correct.
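
The correctness criterion can be sketched as follows. The sketch reads the rule as "at most three levels higher" and assumes, for simplicity, that every category has a single parent stored in a hypothetical `parent_of` mapping; a hierarchy with multiple parents would need a breadth-first search instead.

```python
def levels_above(category, predicted, parent_of):
    """Return how many levels `predicted` lies above `category`.

    Returns 0 for an exact match and None if `predicted` is not an
    ancestor of `category`.
    """
    steps = 0
    current = category
    seen = set()  # guards against cycles in the hierarchy
    while current is not None and current not in seen:
        if current == predicted:
            return steps
        seen.add(current)
        current = parent_of.get(current)
        steps += 1
    return None


def is_correct(original, predicted, parent_of, max_levels=3):
    """A prediction counts as correct if it is the original category
    or one of its ancestors at most `max_levels` levels higher."""
    distance = levels_above(original, predicted, parent_of)
    return distance is not None and distance <= max_levels
```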

Category Prediction Evaluation

Spreading activation with two pulses works best. Considering only data entries with a similarity above 0.5 is a good threshold.

Data Entry Prediction Evaluation

Spreading activation with one pulse works best, and again considering only data entries with a similarity above 0.5 is a good threshold.

Prediction Accuracy

The prediction accuracy is affected by three issues:

To what extent the concept is represented in Denigma.
Presence of links between semantically related concepts.
Presence of links between irrelevant data entries (term definitions, announcements)

Therefore two possible solutions are suggested:

Use average similarity score to measure the extent of concept representation within Denigma
Use existing semantic relatedness measures to handle presence or absence of semantically related links
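
The first suggestion can be illustrated with a short sketch. It assumes that the average is taken over the top N similarity scores between the query document and the data entries; the function name and the default of N = 10 are hypothetical.

```python
def average_similarity(scores, top_n=10):
    """Mean of the top_n highest similarity scores.

    A low value suggests that the concept is poorly represented
    among the data entries, so predictions are less reliable.
    """
    top = sorted(scores, reverse=True)[:top_n]
    if not top:
        return 0.0
    return sum(top) / len(top)
```

A prediction could then be accepted only when the average similarity exceeds a chosen cut-off, which would have to be tuned on evaluation data.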

Potential Applications

There are two immediately obvious applications for the described approach:

Recommending categories and links for new data entries.
Automating the process of building a knowledge base from a corpus (scientific literature).

Further Enhancements

The links in Denigma can be classified with machine learning techniques in order to:

Predict semantic type of data entries
Control the flow of spreading activation

To bring the execution time down to a few seconds, heterogeneous parallel programming on multiple processors or clusters should be exploited.

The data entry corpus and ontology should be redefined.

The gap between Denigma and formal ontologies needs to be bridged.

Lastly, documents can be expanded with Denigma-derived ontology terms.

Conclusion

The fundamental data unit (the data entry) can be used to describe documents, with different methods employing the data entry text, tags, categories, links and relations. The average similarity should be used to judge the accuracy of a prediction. The method is easily extendable to other data units.

Tags: documents, text, programming, information, ontology
Categories: Quest
Parent: Quests

Relations

Data Mining Protocol -belongs to-> Ontology
