I have two large files containing paragraphs of English text:
- The first text is about 200 pages long and has about 10 paragraphs per page (each paragraph is 5 sentences long).
- The second text contains almost precisely the same paragraphs and text as the first, and is also 200 pages long with 10 paragraphs per page. However, its paragraphs are in a different, randomized order, and a large percentage of them have small wording changes compared to their counterparts in the first text. For example, a paragraph in the first text might contain the sentence

  "Like Jimmy, I wanted to go to the palace"

  while the corresponding sentence in the second text would read

  "Like Jimmy, I really wanted to go to the castle".
I want to be able to capture changes like these: the insertion of "really" and the replacement of "palace" with "castle". If the paragraphs were roughly aligned, this would be pretty trivial, as there are plenty of ways to diff text. However, since the paragraphs aren't aligned, that isn't the case.
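To illustrate the aligned case: once two matching paragraphs (or sentences) are paired up, a word-level diff recovers exactly the kind of changes described above. This sketch uses Python's standard-library difflib purely as an example; any diff tool would do.

```python
import difflib

# The two example sentences from above, tokenized into words.
a = "Like Jimmy, I wanted to go to the palace".split()
b = "Like Jimmy, I really wanted to go to the castle".split()

# SequenceMatcher yields opcodes describing how to turn `a` into `b`;
# the non-"equal" ones are the changes we care about.
for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
    if tag != "equal":
        print(tag, a[i1:i2], b[j1:j2])
```

On these two sentences this reports the insertion of "really" and the replacement of "palace" with "castle". The hard part is the pairing step, not the diff itself.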
If the files were small (a handful of paragraphs), Levenshtein distance would probably work fine, but because the files are huge (roughly 2,000 paragraphs each), it would be inefficient to compare each paragraph of text 1 to each paragraph of text 2 to find out which paragraphs match.
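To make the cost concrete, here is a minimal sketch of the brute-force approach I want to avoid (a plain dynamic-programming Levenshtein distance; the function names are just illustrative). With ~2,000 paragraphs per file, this is about 4,000,000 pairwise distance computations, each quadratic in paragraph length.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row DP edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def best_matches(paras1, paras2):
    """All-pairs search: for each paragraph in paras1, the index of the
    closest paragraph in paras2. O(n*m) distance computations."""
    return [min(range(len(paras2)), key=lambda j: levenshtein(p, paras2[j]))
            for p in paras1]
```

This is correct but clearly doesn't scale, which is why I'm looking for a way to narrow down candidate pairs first.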
What would be some other approaches to this problem to handle it efficiently?