  • pytesseract code optimization

    When the earlier program runs on Windows, a black console window flashes briefly on screen, which is not very user-friendly.

    I made a small change to pytesseract.py (in the C:\Python27\Lib\site-packages\pytesseract directory) so that the window stays hidden.

    # modified by zhongtang hide console window
    # new code
    IS_WIN32 = 'win32' in str(sys.platform).lower()
    # None is Popen's default, so non-Windows platforms are unaffected
    startupinfo = None
    if IS_WIN32:
       startupinfo = subprocess.STARTUPINFO()
       startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
       startupinfo.wShowWindow = subprocess.SW_HIDE
    proc = subprocess.Popen(command,
         stderr=subprocess.PIPE,startupinfo=startupinfo)
    '''
    # old code
    proc = subprocess.Popen(command,
       stderr=subprocess.PIPE)
    '''
    # modified end
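
The same idea can be packaged as a small helper so that the call site needs no platform branch. This is only a sketch: `hidden_window_kwargs` and `run_quietly` are hypothetical names that are not part of pytesseract, and it assumes a Python version in which `subprocess` exposes `STARTF_USESHOWWINDOW` and `SW_HIDE` on Windows.

```python
import subprocess
import sys


def hidden_window_kwargs():
    """Return extra Popen keyword arguments that suppress the console window.

    Hypothetical helper, not part of pytesseract. On non-Windows platforms
    it returns an empty dict, so callers can always unpack the result.
    """
    if not sys.platform.startswith('win32'):
        return {}
    startupinfo = subprocess.STARTUPINFO()
    startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
    startupinfo.wShowWindow = subprocess.SW_HIDE
    return {'startupinfo': startupinfo}


def run_quietly(command):
    """Run command and return (exit status, stderr bytes)."""
    proc = subprocess.Popen(command, stderr=subprocess.PIPE,
                            **hidden_window_kwargs())
    # communicate() reads stderr while waiting for the process to exit.
    _, stderr = proc.communicate()
    return proc.returncode, stderr
```

Inside `run_tesseract`, a call such as `run_quietly(command)` could then stand in for the `Popen` lines, under the assumption that the caller only needs the exit status and the stderr output.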

    For the benefit of beginners, the full pytesseract.py is included below; experienced readers can skip it.

    #!/usr/bin/env python
    '''
    Python-tesseract is an optical character recognition (OCR) tool for python.
    That is, it will recognize and "read" the text embedded in images.
    
    Python-tesseract is a wrapper for google's Tesseract-OCR
    ( http://code.google.com/p/tesseract-ocr/ ). It is also useful as a
    stand-alone invocation script to tesseract, as it can read all image types
    supported by the Python Imaging Library, including jpeg, png, gif, bmp, tiff,
    and others, whereas tesseract-ocr by default only supports tiff and bmp.
    Additionally, if used as a script, Python-tesseract will print the recognized
    text instead of writing it to a file. Support for confidence estimates and
    bounding box data is planned for future releases.
    
    
    USAGE:
    ```
     > try:
     >   import Image
     > except ImportError:
     >   from PIL import Image
     > import pytesseract
     > print(pytesseract.image_to_string(Image.open('test.png')))
     > print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))
    ```
    
    INSTALLATION:
    
    Prerequisites:
    * Python-tesseract requires python 2.5 or later or python 3.
    * You will need the Python Imaging Library (PIL). Under Debian/Ubuntu, this is
     the package "python-imaging" or "python3-imaging" for python3.
    * Install google tesseract-ocr from http://code.google.com/p/tesseract-ocr/ .
     You must be able to invoke the tesseract command as "tesseract". If this
     isn't the case, for example because tesseract isn't in your PATH, you will
     have to change the "tesseract_cmd" variable at the top of 'tesseract.py'.
     Under Debian/Ubuntu you can use the package "tesseract-ocr".
     
    Installing via pip:  
    See the [pytesseract package page](https://pypi.python.org/pypi/pytesseract)   
    $> sudo pip install pytesseract  
    
    Installing from source:  
    $> git clone git@github.com:madmaze/pytesseract.git  
    $> sudo python setup.py install  
    
    
    LICENSE:
    Python-tesseract is released under the GPL v3.
    
    CONTRIBUTORS:
    - Originally written by [Samuel Hoffstaetter](https://github.com/hoffstaetter) 
    - [Juarez Bochi](https://github.com/jbochi)
    - [Matthias Lee](https://github.com/madmaze)
    - [Lars Kistner](https://github.com/Sr4l)
    
    '''
    
    # CHANGE THIS IF TESSERACT IS NOT IN YOUR PATH, OR IS NAMED DIFFERENTLY
    tesseract_cmd = 'tesseract'
    
    try:
      import Image
    except ImportError:
      from PIL import Image
    import subprocess
    import sys
    import tempfile
    import os
    import shlex
    
    __all__ = ['image_to_string']
    
    def run_tesseract(input_filename, output_filename_base, lang=None, boxes=False, config=None):
      '''
      runs the command:
        `tesseract_cmd` `input_filename` `output_filename_base`
      
      returns the exit status of tesseract, as well as tesseract's stderr output
    
      '''
      command = [tesseract_cmd, input_filename, output_filename_base]
      
      if lang is not None:
        command += ['-l', lang]
    
      if boxes:
        command += ['batch.nochop', 'makebox']
        
      if config:
        command += shlex.split(config)
        
      # modified by zhongtang hide console window
      # new code
      IS_WIN32 = 'win32' in str(sys.platform).lower()
      # None is Popen's default, so non-Windows platforms are unaffected
      startupinfo = None
      if IS_WIN32:
        startupinfo = subprocess.STARTUPINFO()
        startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
        startupinfo.wShowWindow = subprocess.SW_HIDE
      proc = subprocess.Popen(command,
          stderr=subprocess.PIPE,startupinfo=startupinfo)
      '''
      # old code
      proc = subprocess.Popen(command,
          stderr=subprocess.PIPE)
      '''
      # modified end
      
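      # communicate() drains stderr while waiting, so a full pipe cannot block tesseract
      stderr_output = proc.communicate()[1]
      return (proc.returncode, stderr_output)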
    
    def cleanup(filename):
      ''' tries to remove the given filename. Ignores non-existent files '''
      try:
        os.remove(filename)
      except OSError:
        pass
    
    def get_errors(error_string):
      '''
      returns all lines in the error_string that contain the string "Error";
      if there are none, returns the whole error_string, stripped
    
      '''
    
      lines = error_string.splitlines()
      error_lines = tuple(line for line in lines if line.find('Error') >= 0)
      if len(error_lines) > 0:
        return '\n'.join(error_lines)
      else:
        return error_string.strip()
    
    def tempnam():
      ''' returns a temporary file-name '''
      tmpfile = tempfile.NamedTemporaryFile(prefix="tess_")
      return tmpfile.name
    
    class TesseractError(Exception):
      def __init__(self, status, message):
        self.status = status
        self.message = message
        self.args = (status, message)
    
    def image_to_string(image, lang=None, boxes=False, config=None):
      '''
      Runs tesseract on the specified image. First, the image is written to disk,
      and then the tesseract command is run on the image. Tesseract's result is
      read, and the temporary files are erased.
      
      also supports boxes and config.
      
      if boxes=True
        "batch.nochop makebox" gets added to the tesseract call
      if config is set, the config gets appended to the command.
        ex: config="-psm 6"
    
      '''
    
      if len(image.split()) == 4:
        # In case we have 4 channels, lets discard the Alpha.
        # Kind of a hack, should fix in the future some time.
        r, g, b, a = image.split()
        image = Image.merge("RGB", (r, g, b))
      
      input_file_name = '%s.bmp' % tempnam()
      output_file_name_base = tempnam()
      if not boxes:
        output_file_name = '%s.txt' % output_file_name_base
      else:
        output_file_name = '%s.box' % output_file_name_base
      try:
        image.save(input_file_name)
        status, error_string = run_tesseract(input_file_name,
                           output_file_name_base,
                           lang=lang,
                           boxes=boxes,
                           config=config)
        if status:
          #print 'test' , status,error_string
          errors = get_errors(error_string)
          raise TesseractError(status, errors)
        f = open(output_file_name)
        try:
          return f.read().strip()
        finally:
          f.close()
      finally:
        cleanup(input_file_name)
        cleanup(output_file_name)
    
    def main():
      if len(sys.argv) == 2:
        filename = sys.argv[1]
        try:
          image = Image.open(filename)
          if len(image.split()) == 4:
            # In case we have 4 channels, lets discard the Alpha.
            # Kind of a hack, should fix in the future some time.
            r, g, b, a = image.split()
            image = Image.merge("RGB", (r, g, b))
        except IOError:
          sys.stderr.write('ERROR: Could not open file "%s"\n' % filename)
          exit(1)
        print(image_to_string(image))
      elif len(sys.argv) == 4 and sys.argv[1] == '-l':
        lang = sys.argv[2]
        filename = sys.argv[3]
        try:
          image = Image.open(filename)
        except IOError:
          sys.stderr.write('ERROR: Could not open file "%s"\n' % filename)
          exit(1)
        print(image_to_string(image, lang=lang))
      else:
        sys.stderr.write('Usage: python pytesseract.py [-l language] input_file\n')
        exit(2)
    
    if __name__ == '__main__':
      main()
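
The write, run, read, clean-up flow described in the `image_to_string` docstring can be tried without Tesseract installed. The sketch below is an illustration under stated assumptions, not the module's own code: `run_via_temp_files` and `FAKE_TOOL` are hypothetical names, a small Python one-liner stands in for the `tesseract` command, and a private directory from `tempfile.mkdtemp` replaces the listing's `tempnam()` helper.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def run_via_temp_files(command_prefix, payload):
    """Write payload to a temp file, run a tool on it, return the tool's text.

    The input path and an output base name are appended to command_prefix,
    mirroring `tesseract input_filename output_filename_base`. The tool is
    expected to write its result to output_base + '.txt'.
    """
    workdir = tempfile.mkdtemp(prefix='tess_')
    input_path = os.path.join(workdir, 'input.bin')
    output_base = os.path.join(workdir, 'output')
    try:
        with open(input_path, 'wb') as f:
            f.write(payload)
        proc = subprocess.Popen(command_prefix + [input_path, output_base],
                                stderr=subprocess.PIPE)
        _, stderr = proc.communicate()
        if proc.returncode:
            raise RuntimeError(stderr.decode(errors='replace').strip())
        with open(output_base + '.txt') as f:
            return f.read().strip()
    finally:
        # Removes the input file, the output file, and the directory itself.
        shutil.rmtree(workdir, ignore_errors=True)


# Stand-in for tesseract: writes the upper-cased input to <output_base>.txt.
FAKE_TOOL = [sys.executable, '-c',
             'import sys; data = open(sys.argv[1]).read(); '
             'open(sys.argv[2] + ".txt", "w").write(data.upper())']

if __name__ == '__main__':
    print(run_via_temp_files(FAKE_TOOL, b'hello\n'))  # prints HELLO
```

Keeping both files in one private directory means a single `shutil.rmtree` in the `finally` clause removes everything, whether the tool succeeds or fails.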
    

      

  • Original post: https://www.cnblogs.com/pythonClub/p/14903249.html