Collaboration Semantic Web Against Aging

Involved Labs/Organisations (1): Denigma
Participating Members (7): Anton Kulaga, Danila Andreevich Medvedev, Daniel Wuttke, Kir Repnikov, Dmitry Borisoglebsky, Ilja Orlovs, Eugene Zimin

Semantic Web for Longevity Research

Contents

Semantic Web for Longevity Research
- Preface
- General considerations
- Theoretical background or why it may work
- Practical considerations
  - Knowledge management
  - Collaboration
  - -Gangman- Git-style wiki-editing
- Goals and roadmap
- Architecture
  - Semantic web services
  - Interoperability
  - Graphs/hypergraphs and ontologies
  - Collaborative editing
- Specific part

Authors: Daniel Wuttke (Hevok), Anton Kulaga (antonkulaga), Dmitry Borisoglebsky

Preface

Let us start from the original description of the Denigma Project and then move forward.

“By integrating all the heterogeneous types of biological data and applying a robust unification schema as well as utilizing the increasingly computational power for logical inference, it will be possible to solve biological problems such as aging, diseases and suffering due to other reasons” [http://www.denigma.de/data/entry/denigma-description].

General considerations

In order to boost longevity science drastically, a big and complex system comprising many elements must be built.

That is why it is wise to choose a small piece of functionality (a so-called bootstrapping Project) that can be built by a small group of people and can encourage others to join or to provide resources for us.

The Semantic Web may be a good choice for the bootstrapping Project, because it provides a foundation for many other functionalities, can be used for many different purposes and can attract many people.

That is why the system is defined as having at least two parts:

general part that may be used for every Science
specific part that is specific for Biogerontology

For the general part it is easy to attract people from outside (i.e. a lot of folks are interested in the Semantic Web but are indifferent to longevity). At the same time, the specific part must be done by us and other members of the longevity community.

In fact the system may consist of many separate components, many “bricks”. Each of them is an Open Source Application and can have its own team (teams may of course overlap).

Theoretical background or why it may work

In the social sciences there are several theories that may explain why our system can boost scientific progress. This does not matter much from a practical point of view, but it may be interesting to you anyway.

According to Collective Intelligence Theory [http://en.wikipedia.org/wiki/Collective_intelligence] the collective intelligence of a group does not equal the sum of the individual intelligences and depends heavily on relationships, communication channels and other aspects (details are omitted). So what we are doing is enhancing the collective intelligence of the system by merging human and machine intelligence (applying machine learning techniques) and creating tools and procedures that allow users to collaborate in more effective ways.

According to Extended Mind Theory [http://en.wikipedia.org/wiki/The_Extended_Mind] the mind is seen to encompass every level of the cognitive process, which will often include the use of environmental aids. So the system may be a kind of external mind for researchers (and other users) and for the community as a whole.

According to New Institutional Economics [http://en.wikipedia.org/wiki/New_institutional_economics] society is a graph of people modeled as Agents with bounded rationality and limited awareness that interact with each other according to their values, formal and informal rules and rule control mechanisms that are called “Institutions”. Every interaction, every connection and transaction has its cost, the so-called “transaction cost” (time and effort spent on negotiation, analysis, finding partners, coordinating efforts, collaborating etc.). Where transaction costs are high, hierarchies are formed (fewer connections in the social graph mean fewer transactions and lower transaction expenses); otherwise networks are created. So we are making a system that drastically lowers different kinds of transaction costs and provides new rules for interactions and new semi-automatic transactions (done by automatic agents).

So much for the theories; let us now move to the more practical part.

Practical considerations

Knowledge management

In order to utilize the power of machine intelligence and the other sophisticated techniques mentioned above, we have to get the knowledge out of people's heads and transform descriptions into machine-readable formats (for instance reStructuredText as well as database entries) that allow graphs and hypergraphs to be built.

These are in fact the Tasks that the Semantic Web solves:

Automation of information processing by agents (computers, humans, organizations).
Annotation of information sources (transforming an information source into a system of “knowledge atoms” that is readable by both man and machine alike).
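To make the idea of “knowledge atoms” more concrete, here is a minimal sketch that assumes an atom can be modelled as a subject-predicate-object triple. The names `Triple`, `annotate`, `query` and `to_text`, as well as the example identifiers, are hypothetical and invented for illustration only; this is not the Denigma data model.

```python
from collections import namedtuple

# Assumption: a "knowledge atom" is a subject-predicate-object triple.
Triple = namedtuple("Triple", ["subject", "predicate", "obj"])


def annotate(source_id, statements):
    """Turn (predicate, object) pairs about one source into triples."""
    return [Triple(source_id, predicate, obj) for predicate, obj in statements]


def query(triples, subject=None, predicate=None, obj=None):
    """Machine side: return triples matching every field that is not None."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


def to_text(triple):
    """Human side: render a triple as a plain sentence."""
    return f"{triple.subject} {triple.predicate.replace('_', ' ')} {triple.obj}"


# Placeholder identifiers, not real research data.
atoms = annotate(
    "gene:X",
    [("is_associated_with", "process:Y"), ("is_described_in", "paper:Z")],
)
papers = query(atoms, predicate="is_described_in")
sentence = to_text(atoms[0])
```

The same list of atoms serves both readers: `query` lets an agent filter it mechanically, while `to_text` gives a person a sentence to check.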

In the long term, when we have enough high-quality, well-structured information (mainly thanks to the Semantic Web), we may apply advanced machine learning techniques like:

Complex Semantic Querying
Text Mining
Bayesian networks
Logical Inference (where possible)
Markov Chains
Support Vector Machines
Neural Networks
and a lot of other interesting implementations

Collaboration

At the same time we have to admit that the main task of our system is to let Researchers and other users get involved, produce better results and move Science forward faster. For exactly this purpose we need not only better data and tools to work with but also a better workflow, easier Collaboration (with lower transaction costs), better stimuli (to increase motivation and involvement) and better decision making.

A lot of interesting technologies and tools may be used. But for now we will focus on the core features and on those that are easiest to implement.

-Gangman- Git-style wiki-editing

Wiki-wiki (e.g. MediaWiki) provides new rules/procedures for Collaboration through its edit history and Markup. These are new institutions (in terms of New Institutional Economics) that are hard to achieve without IT. That is why Wikipedia/MediaWiki has been so successful in the encyclopedic field. At the same time it is not enough for Science, especially at its leading edge. If it were, we would write our Academic Articles in Wikis instead of academic journals.

The main problem is its overcentralization. It is good for simple and well-researched subjects where there is one dominant theory in Science. But everything is much vaguer at the edge of Science, where there are different alternative Theories, new discoveries and debates, speculations, a lot of information that ought to be checked and refined etc. In such situations a “Central Repository” is not suitable.

The other problem of MediaWiki is its weak semantic features. There are a lot of add-ons like semantic wiki, but they are limited (mainly because the MediaWiki architecture was made for other purposes and many workarounds are needed to make it behave differently) and often difficult to implement.

That is why an alternative is required. Git has a great collaborative editing (as well as collaborative filtering) mechanism that can be borrowed and adapted for our purposes [tech talk video that explains Git development in Linux: http://www.youtube.com/watch?v=4XpnKHJAok8].

In Git there is no central repository. If you want to change something, you fork it, implement your changes and make a pull request. Collaborative filtering works well in Git. There are a lot of different independent repos, and people pull from repos that they trust and consider well written; the owners of trusted repos in turn pull only from repos they find valuable, and so on. So there is a workflow that filters out bad code and accepts the highest-quality code.
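The filtering effect of this workflow can be sketched as a small propagation over a trust graph. This is only an illustrative model under our own assumptions, not how Git itself is implemented: the function `propagate`, the owner names and the `accepts` rule are all hypothetical.

```python
def propagate(trusts, authored, accepts):
    """Spread changes along trust links until nothing new is accepted.

    trusts:   dict owner -> list of owners whose repos that owner pulls from
    authored: dict owner -> set of changes created in that owner's own repo
    accepts:  callable (owner, change) -> bool, the owner's quality filter
    Returns a dict owner -> set of changes present in that owner's repo.
    """
    visible = {owner: set(changes) for owner, changes in authored.items()}
    for owner in trusts:
        visible.setdefault(owner, set())

    changed = True
    while changed:  # repeat until a full pass adds nothing
        changed = False
        for owner, sources in trusts.items():
            for source in sources:
                # Copy to a list so the set is never modified while iterated.
                for change in list(visible.get(source, set())):
                    if change not in visible[owner] and accepts(owner, change):
                        visible[owner].add(change)
                        changed = True
    return visible


# Hypothetical example: carol reviews what she pulls, dave simply trusts carol.
authored = {"alice": {"good-fix"}, "bob": {"sloppy-hack"}}
trusts = {"carol": ["alice", "bob"], "dave": ["carol"]}


def accepts(owner, change):
    return owner != "carol" or not change.startswith("sloppy")


repos = propagate(trusts, authored, accepts)
```

In this example dave never sees the sloppy change even though he applies no filter of his own, because it was already rejected one step earlier by the repo he trusts.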

Another thing that we may implement much better than MediaWiki is information storage and retrieval. We may use a hybrid hypergraph-document database (with schema-less, schema-full and mixed modes) to store and traverse relationships more easily. How this may work in our system is shown in a scheme [http://denigma.de/url/3f].

Goals and roadmap

We are defining clear goals and will provide a Roadmap.

Architecture

The architecture can undergo rapid and drastic changes, but one thing is clear: we must have a kind of Semantic Web ecosystem that consists of different interconnected services (e.g. an Ontology builder, a query builder, a graph/hypergraph storage engine, data visualization, a wiki-like editing system, etc.) that are connected together by means of open protocols.

Semantic web services

Regarding Semantic Web Services, the important issues to consider are:

Annotation - describing what services do in a form that is readable by machines (in addition to the “how” question answered by “normal” web services).
Service discovery - before one can use any service, one must find it.
Composition - a combination of web services into another one.
Interoperability - data from one service may be used by another one.
Invocation - calling an individual web service and making use of the results.
Privacy and security - encryption and digital signatures; information about how inputs/outputs will be passed and stored, where information may be sent and for what purpose.
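Several of these issues can be illustrated with a toy in-process registry. The sketch below is an assumption-laden illustration, not a proposed design: the class `ServiceRegistry`, its methods and the example services are hypothetical, services are plain Python functions, and privacy and security are left out entirely.

```python
class ServiceRegistry:
    """Toy registry covering annotation, discovery, composition and invocation."""

    def __init__(self):
        self._services = {}

    def register(self, name, func, consumes, produces):
        # Annotation: a machine-readable note on what the service takes and gives.
        self._services[name] = {
            "func": func, "consumes": consumes, "produces": produces,
        }

    def discover(self, consumes=None, produces=None):
        # Service discovery: look services up by their annotated data types.
        return sorted(
            name for name, s in self._services.items()
            if (consumes is None or s["consumes"] == consumes)
            and (produces is None or s["produces"] == produces)
        )

    def invoke(self, name, payload):
        # Invocation: call one service and hand back its result.
        return self._services[name]["func"](payload)

    def compose(self, new_name, first, second):
        # Composition: chain two services; interoperability is checked by
        # requiring the first output type to equal the second input type.
        a, b = self._services[first], self._services[second]
        if a["produces"] != b["consumes"]:
            raise ValueError(f"{first} and {second} are not interoperable")
        self.register(
            new_name,
            lambda payload: b["func"](a["func"](payload)),
            a["consumes"],
            b["produces"],
        )


# Hypothetical services, chosen only because they are easy to follow.
registry = ServiceRegistry()
registry.register("tokenize", str.split, consumes="text", produces="tokens")
registry.register("count", len, consumes="tokens", produces="number")
registry.compose("word_count", "tokenize", "count")

text_services = registry.discover(consumes="text")
word_total = registry.invoke("word_count", "aging is a process")
```

Because the annotations are data rather than prose, the registry can refuse a composition whose types do not line up, which is the kind of check a “normal” web service description does not offer.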

Interoperability

For the first attempt we will use JSON REST services and WebSockets.

WebSockets will be used to connect front-ends (there may be several of them) to the backends in a responsive way, so that we can push notifications from the server to the user and offer features like Auto-completion (e.g. when querying) and Search without page reloads.

REST web services may be used for communication between different server-side parts, especially if they are written in different languages (like Python and Scala).
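What makes JSON suitable for such cross-language communication is that both sides only need to agree on the message shape. The following sketch shows one possible request/response exchange with the HTTP transport stripped away; the field names (`resource`, `params`, `status`, `result`), the function `handle_request` and the term list are assumptions made for illustration, not an agreed protocol.

```python
import json


def handle_request(raw_request, handlers):
    """Decode a JSON request, dispatch on its resource name, reply in JSON."""
    try:
        request = json.loads(raw_request)
        handler = handlers[request["resource"]]
        body = {"status": 200, "result": handler(request.get("params", {}))}
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, an unknown resource or missing parameters.
        body = {"status": 400, "result": None}
    return json.dumps(body, sort_keys=True)


# Hypothetical vocabulary for an auto-completion style lookup.
TERMS = ["aging", "ageing", "apoptosis", "autophagy"]


def complete(params):
    """Return all known terms that start with the given prefix."""
    return [term for term in TERMS if term.startswith(params["prefix"])]


handlers = {"complete": complete}

ok_reply = handle_request(
    json.dumps({"resource": "complete", "params": {"prefix": "ag"}}), handlers
)
bad_reply = handle_request("not json at all", handlers)
```

A Scala service could produce or consume exactly the same strings, since nothing in the messages depends on Python.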

Graphs/hypergraphs and ontologies

We will use OrientDB for storage and editing of the Ontology and data. It is Open Source and very flexible. We can create different knowledge structures with it: graphs, hypergraphs, relationships, documents etc. and traverse them. It can work in schema-less and schema-full mode, so that we will be able to create Ontologies using its native form.
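The difference between schema-less and schema-full storage, and the idea of a hyperedge that links more than two vertices, can be shown with a small in-memory model. To be clear, this is not the OrientDB API: the class `TinyHypergraph`, its methods and the example records are hypothetical and only illustrate the concepts.

```python
class TinyHypergraph:
    """In-memory sketch of documents linked by hyperedges (not OrientDB)."""

    def __init__(self):
        self.vertices = {}    # vertex id -> document (dict of fields)
        self.hyperedges = []  # list of (label, frozenset of vertex ids)
        self.schemas = {}     # class name -> set of required field names

    def define_class(self, name, required_fields):
        """Make a class schema-full by declaring its required fields."""
        self.schemas[name] = set(required_fields)

    def add_vertex(self, vid, cls=None, **fields):
        # Classes without a declared schema are accepted as-is (schema-less).
        missing = self.schemas.get(cls, set()) - set(fields)
        if missing:
            raise ValueError(f"{cls} is missing fields: {sorted(missing)}")
        self.vertices[vid] = dict(fields, cls=cls)

    def connect(self, label, *vids):
        """Add one hyperedge joining any number of existing vertices."""
        unknown = [v for v in vids if v not in self.vertices]
        if unknown:
            raise KeyError(f"unknown vertices: {unknown}")
        self.hyperedges.append((label, frozenset(vids)))

    def neighbours(self, vid):
        """Traverse: every vertex sharing at least one hyperedge with vid."""
        result = set()
        for _, members in self.hyperedges:
            if vid in members:
                result |= members - {vid}
        return result


# Placeholder records, not real data.
graph = TinyHypergraph()
graph.define_class("Gene", ["symbol"])                # schema-full class
graph.add_vertex("g1", cls="Gene", symbol="GENE1")
graph.add_vertex("p1", cls="Paper")                   # no schema declared
graph.add_vertex("note1", text="free-form remark")    # no class at all
graph.connect("mentions", "p1", "g1", "note1")        # one edge, three vertices

linked = graph.neighbours("g1")
```

Mixing both modes in one store is what would let a strict Ontology class sit next to loosely structured notes.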

Collaborative editing

How collaborative editing will be implemented is described in the yEd scheme.

Specific part

Here we describe the parts of the Project that are specific to Biogerontology.

The initial goals need to be very clear. The question is whether a full-scale "Digital Decipher Machine" would include:

the majority of Research results that are anyhow related to Aging and longevity Research?
a library of the past, current and future Research Projects related to Aging and longevity Research? Maybe a control or management framework?
a Research data storage?
a collection of software used by an average Researcher in the related fields of Science?
a collection of knowledge management tools and methods that would help to speed up an average Research Project?
a knowledge base for these tools?

If the answer is yes for one or more points: what is the current state, and why/how are academia and business struggling now? What are the trends? What is the future state, and why/how would academia and business benefit from the Project?