nlp - How to check for unreadable OCRed text with NLTK
I am using NLTK to analyze a corpus of OCRed text. I'm new to NLTK. Most of the OCR is good, but sometimes I come across lines that are clearly junk. For example: Omphi au ba wmnondmam BE wBwHo < obobm as bowman: ham: 8 ooww om $ 5

I want to identify (and filter out) such lines from my analysis. How do NLP practitioners handle this situation? Something like: if 70% of the words in a sentence are not in the dictionary, discard it? Or if NLTK can't identify the part of speech for 80% of the words, drop it? What algorithms work for this? Is there a "gold standard" way to do this?

Using n-grams is probably your best option. You could use Google n-grams, or you could use the n-grams built into NLTK. The idea is to build a language model and see what probability it assigns to any given sentence. You can define a probability threshold, and all sentences scoring below it are removed. Any reasonable language model will assign very low probabilities to junk like the example sentence. If you think some words may be only slightly corrupted, you could try spell-checking them before testing with the n-grams.

Edit: Here is some sample nltk code:
import math
from nltk import NgramModel
from nltk.corpus import brown
from nltk.util import ngrams
from nltk.probability import LidstoneProbDist

n = 2
est = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(n, brown.words(categories='news'), estimator=est)

def sentenceprob(sentence):
    bigrams = ngrams(sentence.lower().split(), n)
    tot = 0
    for gram in bigrams:
        score = lm.logprob(gram[-1], gram[:-1])
        tot += score
    return tot

sentence1 = "This is a standard English sentence"
sentence2 = "Omphi ow ba wmnondmam BE wBwHo < oBoBm. As Bowman: ham: 8 ooww om $ 5"

print sentenceprob(sentence1)
print sentenceprob(sentence2)
The output looks like this:

>>> python lmtest.py
42.7436688972
158.850086668

Lower is better. (Of course, you can play with the parameters.)
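Note that NgramModel is not available in recent NLTK releases, so the snippet above only runs on old NLTK 2.x installs. The same scoring idea can be sketched without it. The following is a minimal, self-contained character-bigram model with Lidstone smoothing; the tiny TRAIN string and the GAMMA/ALPHABET constants are illustrative stand-ins for a real corpus such as brown.words(), not the answer's exact setup:

```python
import math
from collections import Counter

# Toy training text: a stand-in for a real corpus of clean English.
TRAIN = ("this is a standard english sentence and it reads like "
         "ordinary text from a normal corpus of english writing")

GAMMA = 0.2                               # Lidstone smoothing constant
ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # characters the model scores

# Count character bigrams and their left-context characters.
bigram_counts = Counter(zip(TRAIN, TRAIN[1:]))
context_counts = Counter(TRAIN[:-1])

def neg_logprob(sentence):
    """Average negative log2 probability per character bigram.

    Higher values mean the text looks less like the training data,
    i.e. more likely to be OCR junk.
    """
    s = "".join(c for c in sentence.lower() if c in ALPHABET)
    if len(s) < 2:
        return float("inf")
    total = 0.0
    for a, b in zip(s, s[1:]):
        # Lidstone-smoothed conditional probability P(b | a).
        p = ((bigram_counts[(a, b)] + GAMMA)
             / (context_counts[a] + GAMMA * len(ALPHABET)))
        total += -math.log(p, 2)
    return total / (len(s) - 1)

print(neg_logprob("this is a standard english sentence"))   # low score
print(neg_logprob("Omphi ow ba wmnondmam BE wBwHo oBoBm"))  # higher score
```

A real filter would train on a few megabytes of clean text and pick the cutoff by inspecting scores on known-good and known-junk lines.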
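The question's own 70%-out-of-vocabulary rule is also worth having as a cheap first-pass filter. A minimal sketch, assuming a toy VOCAB set where a real run would load a large word list (for example nltk.corpus.words); the threshold and helper names are illustrative, not a standard API:

```python
# Toy vocabulary standing in for a real English word list.
VOCAB = {"this", "is", "a", "standard", "english", "sentence", "the",
         "corpus", "of", "text", "junk", "line"}

def oov_fraction(sentence, vocab=VOCAB):
    """Return the fraction of tokens not found in the vocabulary."""
    tokens = [t.strip(".,;:!?").lower() for t in sentence.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 1.0  # treat empty/unparseable lines as fully unknown
    unknown = sum(1 for t in tokens if t not in vocab)
    return unknown / len(tokens)

def looks_like_junk(sentence, threshold=0.7):
    """Apply the 70% rule from the question: flag lines whose tokens
    are mostly out of vocabulary."""
    return oov_fraction(sentence) > threshold

print(looks_like_junk("This is a standard English sentence."))  # False
print(looks_like_junk("Omphi au ba wmnondmam BE wBwHo"))        # True
```

This is crude (it punishes proper nouns and rare words), so it pairs well with the language-model score rather than replacing it.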