Gensim can find similar texts. First, a dictionary needs to be created from the most interesting texts. Second, all Articles need to be indexed using this dictionary. Third, new Articles can be passed in, and Gensim will return all similar Articles ordered by similarity.
Formulate criteria for searching LifespanArticles and create a Dictionary for Gensim.
Different Articles are crawled by the Web Crawler and then checked by Gensim; those whose similarity is higher than some threshold are indexed.
In this way we can automatically find new and more relevant Articles, starting from an existing set of Articles.
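The threshold check on crawled Articles could look roughly like the sketch below. All names here (select_for_indexing, overlap_score, the article ids, the threshold of 0.5) are hypothetical. The scoring function is a deliberately simple stand-in, the fraction of an Article's words found in a reference vocabulary; in the real setup it would be replaced by the similarity score that Gensim returns.

```python
def select_for_indexing(candidates, score, threshold):
    """Return the ids of crawled Articles whose score is higher than the threshold.

    candidates: dict mapping a (hypothetical) article id to its list of tokens
    score: callable that maps a token list to a similarity value in [0, 1]
    """
    return [
        article_id
        for article_id, tokens in candidates.items()
        if score(tokens) > threshold
    ]


# Stand-in for the Gensim check: share of distinct tokens that appear
# in a small reference vocabulary (an assumption for illustration only).
reference_vocabulary = {"lifespan", "aging", "caloric", "restriction"}


def overlap_score(tokens):
    distinct = set(tokens)
    if not distinct:
        return 0.0
    return len(distinct & reference_vocabulary) / len(distinct)


crawled = {
    "a1": ["caloric", "restriction", "lifespan", "study"],
    "a2": ["league", "match", "score"],
}
accepted = select_for_indexing(crawled, overlap_score, threshold=0.5)
print(accepted)
```

Passing the scoring function as a parameter keeps the crawler-side filter independent of how similarity is actually computed.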
1. We have 1 billion papers.
2. We build a dictionary of n phrases and words.
3. We build a two-dimensional matrix model: dictionary words × documents. If a document contains a word from the dictionary, we put 1 in the cell, otherwise 0. So in the end we have 1 billion n-dimensional vectors.
4. Next, we have one document and we want to find similar documents. Following the same steps as in point 3, we build one n-dimensional vector for this Article.
5. Compare the vector from point 4 with all vectors from point 3. The vectors that give the smallest angle are the most similar.
That is how it works in the naive way. Real implementations use a lot of tricks to decrease the amount of computation.
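The naive procedure described above can be written down directly in plain Python. This is only an illustrative sketch with made-up toy data and hypothetical function names; it is not how Gensim is implemented. Since a smaller angle between two vectors means a larger cosine, the documents are ranked by cosine similarity in descending order.

```python
import math


def binary_vector(tokens, dictionary):
    """Put 1 for every dictionary word that the document contains, otherwise 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in dictionary]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query_tokens, documents, dictionary):
    """Return (cosine, document position) pairs, smallest angle first."""
    query = binary_vector(query_tokens, dictionary)
    scored = [
        (cosine(query, binary_vector(doc, dictionary)), position)
        for position, doc in enumerate(documents)
    ]
    return sorted(scored, reverse=True)


# Toy data (assumed for illustration): a dictionary of n = 5 words and 3 documents.
dictionary = ["aging", "lifespan", "mice", "diet", "football"]
documents = [
    ["aging", "lifespan", "mice"],
    ["football", "score"],
    ["diet", "mice", "protein"],
]
ranking = most_similar(["lifespan", "mice", "aging", "diet"], documents, dictionary)
print(ranking)
```

Every query is compared against every document vector here, which is exactly the cost that real implementations try to avoid.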