nlp - How to check for unreadable OCRed text with NLTK
I am using NLTK to analyze a corpus of OCRed text. I'm new to NLTK. Most of the OCR is good, but sometimes I come across lines that are clearly junk. For example: Omphi au ba wmnondmam BE wBwHo < obobm as bowman: ham: 8 ooww om $ 5

I want to identify (and filter out) such lines from my analysis. How do NLP practitioners handle this situation? Something like: if 70% of the words in a sentence are not in the dictionary, discard it? Or if NLTK can't identify the part of speech for 80% of the words, drop it? What algorithms work for this? Is there a "gold standard" way to do this?

Using n-grams is probably your best option. You could use Google n-grams, or you could use the n-grams built into NLTK. The idea is to build a language model and see what probability it assigns to any given sentence. You can define a probability threshold, and all sentences scoring below it are removed. Any reasonable language model will assign very low probabilities to junk like the example sentence. If you think some words may be only slightly corrupted, you could try spell-checking them before testing with the n-grams.

Edit: Here is some sample nltk code:
import math
from nltk import NgramModel
from nltk.corpus import brown
from nltk.util import ngrams
from nltk.probability import LidstoneProbDist

n = 2
est = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(n, brown.words(categories='news'), estimator=est)

def sentenceprob(sentence):
    bigrams = ngrams(sentence.lower().split(), n)
    tot = 0
    for gram in bigrams:
        score = lm.logprob(gram[-1], gram[:-1])
        tot += score
    return tot

sentence1 = "This is a standard English sentence"
sentence2 = "Omphi ow ba wmnondmam BE wBwHo < oBoBm. As Bowman: ham: 8 ooww om $ 5"

print sentenceprob(sentence1)
print sentenceprob(sentence2)
The output looks like this:

>>> python lmtest.py
42.7436688972
158.850086668

Lower is better. (Of course, you can play with the parameters.)
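Note that NgramModel is not available in recent NLTK releases, so the snippet above only runs on old NLTK 2.x installs. The same scoring idea can be sketched without it. The following is a minimal, self-contained character-bigram model with Lidstone smoothing; the tiny TRAIN string and the GAMMA/ALPHABET constants are illustrative stand-ins for a real corpus such as brown.words(), not the answer's exact setup:

```python
import math
from collections import Counter

# Toy training text: a stand-in for a real corpus of clean English.
TRAIN = ("this is a standard english sentence and it reads like "
         "ordinary text from a normal corpus of english writing")

GAMMA = 0.2                               # Lidstone smoothing constant
ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # characters the model scores

# Count character bigrams and their left-context characters.
bigram_counts = Counter(zip(TRAIN, TRAIN[1:]))
context_counts = Counter(TRAIN[:-1])

def neg_logprob(sentence):
    """Average negative log2 probability per character bigram.

    Higher values mean the text looks less like the training data,
    i.e. more likely to be OCR junk.
    """
    s = "".join(c for c in sentence.lower() if c in ALPHABET)
    if len(s) < 2:
        return float("inf")
    total = 0.0
    for a, b in zip(s, s[1:]):
        # Lidstone-smoothed conditional probability P(b | a).
        p = ((bigram_counts[(a, b)] + GAMMA)
             / (context_counts[a] + GAMMA * len(ALPHABET)))
        total += -math.log(p, 2)
    return total / (len(s) - 1)

print(neg_logprob("this is a standard english sentence"))   # low score
print(neg_logprob("Omphi ow ba wmnondmam BE wBwHo oBoBm"))  # higher score
```

A real filter would train on a few megabytes of clean text and pick the cutoff by inspecting scores on known-good and known-junk lines.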
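The question's own 70%-out-of-vocabulary rule is also worth having as a cheap first-pass filter. A minimal sketch, assuming a toy VOCAB set where a real run would load a large word list (for example nltk.corpus.words); the threshold and helper names are illustrative, not a standard API:

```python
# Toy vocabulary standing in for a real English word list.
VOCAB = {"this", "is", "a", "standard", "english", "sentence", "the",
         "corpus", "of", "text", "junk", "line"}

def oov_fraction(sentence, vocab=VOCAB):
    """Return the fraction of tokens not found in the vocabulary."""
    tokens = [t.strip(".,;:!?").lower() for t in sentence.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 1.0  # treat empty/unparseable lines as fully unknown
    unknown = sum(1 for t in tokens if t not in vocab)
    return unknown / len(tokens)

def looks_like_junk(sentence, threshold=0.7):
    """Apply the 70% rule from the question: flag lines whose tokens
    are mostly out of vocabulary."""
    return oov_fraction(sentence) > threshold

print(looks_like_junk("This is a standard English sentence."))  # False
print(looks_like_junk("Omphi au ba wmnondmam BE wBwHo"))        # True
```

This is crude (it punishes proper nouns and rare words), so it pairs well with the language-model score rather than replacing it.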