I have two large files containing paragraphs of English text:
- The first text is about 200 pages long and has about 10 paragraphs per page (each paragraph is 5 sentences long).
- The second text contains almost precisely the same paragraphs and text as the first, and is also 200 pages long with 10 paragraphs per page. However, its paragraphs are in a different, randomized order, and a large percentage of them have small wording changes compared to their counterparts in the first text. For example, a paragraph in the first text might contain the sentence

  "Like Jimmy, I wanted to go to the palace"

  while the corresponding sentence in the second text would read

  "Like Jimmy, I really wanted to go to the castle".
I want to be able to capture changes like these: the insertion of "really" and the replacement of "palace" with "castle". If the paragraphs were roughly aligned, this would be pretty trivial, as there are plenty of ways to diff text. However, since the paragraphs aren't aligned, that isn't the case.
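To illustrate the aligned case: once two matching paragraphs (or sentences) are paired up, a word-level diff recovers exactly the kind of changes described above. This sketch uses Python's standard-library difflib purely as an example; any diff tool would do.

```python
import difflib

# The two example sentences from above, tokenized into words.
a = "Like Jimmy, I wanted to go to the palace".split()
b = "Like Jimmy, I really wanted to go to the castle".split()

# SequenceMatcher yields opcodes describing how to turn `a` into `b`;
# the non-"equal" ones are the changes we care about.
for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
    if tag != "equal":
        print(tag, a[i1:i2], b[j1:j2])
```

On these two sentences this reports the insertion of "really" and the replacement of "palace" with "castle". The hard part is the pairing step, not the diff itself.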
If the files were small (a handful of paragraphs), Levenshtein distance would probably work fine, but because the files are huge (roughly 2,000 paragraphs each), it would be inefficient to compare each paragraph of text 1 to each paragraph of text 2 to find out which paragraphs match.
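To make the cost concrete, here is a minimal sketch of the brute-force approach I want to avoid (a plain dynamic-programming Levenshtein distance; the function names are just illustrative). With ~2,000 paragraphs per file, this is about 4,000,000 pairwise distance computations, each quadratic in paragraph length.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row DP edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def best_matches(paras1, paras2):
    """All-pairs search: for each paragraph in paras1, the index of the
    closest paragraph in paras2. O(n*m) distance computations."""
    return [min(range(len(paras2)), key=lambda j: levenshtein(p, paras2[j]))
            for p in paras1]
```

This is correct but clearly doesn't scale, which is why I'm looking for a way to narrow down candidate pairs first.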
What would be some other approaches to this problem to handle it efficiently?