  • Image text recognition in wxPython with the pytesser module

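    pytesser is a Python module built on Tesseract, the open-source OCR engine sponsored by Google. Import it in Python to convert the text in an image into a string.

    pytesser is a thin wrapper: your Python code calls pytesser, and pytesser in turn runs tesseract to recognize the text in the image.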

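    The steps are as follows:

    1. Set up pytesser
    pytesser needs no installation: unzip it into the \Lib\site-packages\ folder of your Python installation and it is ready to use.

    The pytesser archive bundles tesseract.exe, the English data pack (so only English is recognized by default), and a few sample images, so it works as soon as it is unzipped.
    Test it with the following code: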
    >>> from pytesser import *
    >>> image = Image.open('fnord.tif')  # Open image object using PIL
    >>> print image_to_string(image)     # Run tesseract.exe on image
    fnord
    >>> print image_file_to_string('fnord.tif')
    fnord
    The same test as a script, with the other sample images commented out:
    from pytesser import *
    #im = Image.open('fnord.tif')
    #im = Image.open('phototest.tif')
    #im = Image.open('eurotext.tif')
    im = Image.open('fonts_test.png')
    text = image_to_string(im)
    print text
    Note: this module requires the PIL library.

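    2. Improving a low recognition rate
    Enhancing the image, for example by raising its contrast, or converting it to black and white can improve the recognition rate considerably:

    import ImageEnhance  # classic PIL import; with Pillow use: from PIL import ImageEnhance
    enhancer = ImageEnhance.Contrast(image1)  # image1 is the already opened PIL image
    image2 = enhancer.enhance(4)              # a factor of 1.0 leaves the image unchanged; 4 boosts contrast strongly

    Then call image_to_string on image2.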
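    The black-and-white conversion mentioned above is not shown in the original. One common way to do it, sketched here as an assumption rather than as the author's method, is to convert the image to grayscale and apply a fixed threshold. The helper names threshold_table and binarize and the default cutoff of 128 are illustrative and not part of pytesser; binarize expects a PIL image and uses only its convert and point methods.

```python
def threshold_table(cutoff=128):
    """Build a 256-entry lookup table: values below cutoff map to 0, the rest to 255."""
    return [0 if value < cutoff else 255 for value in range(256)]


def binarize(image, cutoff=128):
    """Return a black-and-white copy of a PIL image.

    The image is first converted to 8-bit grayscale ("L" mode), then every
    pixel value is mapped through the lookup table.
    """
    gray = image.convert("L")
    return gray.point(threshold_table(cutoff))
```

    The result of binarize(image1) can be passed to image_to_string in the same way as image2. The cutoff usually needs tuning for each kind of image.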

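    3. Recognizing other languages
    tesseract is a command-line program with the following usage:

    tesseract  imagename outbase [-l  lang]  [-psm N]  [configfile...]

    imagename is the name of the input image
    outbase is the base name of the output text file; tesseract appends .txt, so the result is written to outbase.txt
    -l  lang  sets the language to recognize; the default is English
    -psm N  sets the page segmentation mode (available from Tesseract 3.01)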
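    The command line above can also be assembled and run from Python, which is essentially what pytesser does. The following is a minimal Python 3 sketch, assuming the tesseract executable is on the PATH. build_tesseract_args and ocr_file are hypothetical helper names, not pytesser functions, and the runner parameter exists only so the external call can be swapped out. Newer Tesseract releases spell the page segmentation flag --psm; the sketch keeps the -psm form used in this article.

```python
import subprocess


def build_tesseract_args(imagename, outbase, lang=None, psm=None, exe="tesseract"):
    """Return the argument list for: tesseract imagename outbase [-l lang] [-psm N]."""
    args = [exe, imagename, outbase]
    if lang:
        args += ["-l", lang]
    if psm is not None:
        args += ["-psm", str(psm)]
    return args


def ocr_file(imagename, outbase, lang=None, psm=None, runner=subprocess.call):
    """Run tesseract on imagename and return the text it wrote to outbase.txt."""
    retcode = runner(build_tesseract_args(imagename, outbase, lang, psm))
    if retcode != 0:
        raise RuntimeError("tesseract exited with status %d" % retcode)
    # tesseract appends ".txt" to the output base name itself
    with open(outbase + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

    For example, ocr_file('fnord.tif', 'temp', lang='eng') runs tesseract on the sample image and reads the result back from temp.txt.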

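    Other languages can be recognized as follows:

    (1) Download the data pack for the language you need
    and put it in pytesser's tessdata folder.
    Next, modify the parameters in pytesser.py. Here is an example: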

    """OCR in Python using the Tesseract engine from Google
    http://code.google.com/p/pytesser/
    by Michael J.T. O'Kelly
    V 0.0.2, 5/26/08"""
    
    import Image
    import subprocess
    import os
    import StringIO
    
    import util
    import errors
    
    
    tesseract_exe_name = 'dlltest' # Name of executable to be called at command line
    scratch_image_name = "temp.bmp" # This file must be .bmp or other Tesseract-compatible format
    scratch_text_name_root = "temp" # Leave out the .txt extension
    _cleanup_scratch_flag = True  # Temporary files cleaned up after OCR operation
    _language = "" # Tesseract uses English if language is not given
    _pagesegmode = "" # Tesseract uses fully automatic page segmentation if psm is not given (psm is available in v3.01)
    
    _working_dir = os.getcwd()
    
    def call_tesseract(input_filename, output_filename, language, pagesegmode):
            """Calls external tesseract.exe on input file (restrictions on types),
            outputting output_filename+'txt'"""
            current_dir = os.getcwd()
            error_stream = StringIO.StringIO()
            try:
                    os.chdir(_working_dir)
                    args = [tesseract_exe_name, input_filename, output_filename]
                    if len(language) > 0:
                            args.append("-l")
                            args.append(language)
                    if len(str(pagesegmode)) > 0:
                            args.append("-psm")
                            args.append(str(pagesegmode))
                    try:
                            proc = subprocess.Popen(args)
                    except (TypeError, AttributeError):
                            proc = subprocess.Popen(args, shell=True)
                    retcode = proc.wait()
                    if retcode!=0:
                            error_text = error_stream.getvalue()
                            errors.check_for_errors(error_stream_text = error_text)
            finally:  # Guarantee that we return to the original directory
                    error_stream.close()
                    os.chdir(current_dir)
    
    def image_to_string(im, lang = _language, psm = _pagesegmode, cleanup = _cleanup_scratch_flag):
            """Converts im to file, applies tesseract, and fetches resulting text.
            If cleanup=True, delete scratch files after operation."""
            try:
                    util.image_to_scratch(im, scratch_image_name)
                    call_tesseract(scratch_image_name, scratch_text_name_root, lang, psm)
                    result = util.retrieve_result(scratch_text_name_root)
            finally:
                    if cleanup:
                            util.perform_cleanup(scratch_image_name, scratch_text_name_root)
            return result
    
    def image_file_to_string(filename, lang = _language, psm = _pagesegmode, cleanup = _cleanup_scratch_flag, graceful_errors=True):
            """Applies tesseract to filename; or, if image is incompatible and graceful_errors=True,
            converts to compatible format and then applies tesseract.  Fetches resulting text.
            If cleanup=True, delete scratch files after operation. Parameter lang specifies used language.
            If lang is empty, English is used. Page segmentation mode parameter psm is available in Tesseract 3.01.
            psm values are:
            0 = Orientation and script detection (OSD) only.
            1 = Automatic page segmentation with OSD.
            2 = Automatic page segmentation, but no OSD, or OCR
            3 = Fully automatic page segmentation, but no OSD. (Default)
            4 = Assume a single column of text of variable sizes.
            5 = Assume a single uniform block of vertically aligned text.
            6 = Assume a single uniform block of text.
            7 = Treat the image as a single text line.
            8 = Treat the image as a single word.
            9 = Treat the image as a single word in a circle.
            10 = Treat the image as a single character."""
            try:
                    try:
                            call_tesseract(filename, scratch_text_name_root, lang, psm)
                            result = util.retrieve_result(scratch_text_name_root)
                    except errors.Tesser_General_Exception:
                            if graceful_errors:
                                    im = Image.open(filename)
                                    result = image_to_string(im, lang, psm, cleanup)
                            else:
                                    raise
            finally:
                    if cleanup:
                            util.perform_cleanup(scratch_image_name, scratch_text_name_root)
            return result
            
    
    if __name__=='__main__':
            im = Image.open('phototest.tif')
            text = image_to_string(im, cleanup=False)
            print text
            text = image_to_string(im, psm=2, cleanup=False)
            print text
            try:
                    text = image_file_to_string('fnord.tif', graceful_errors=False)
            except errors.Tesser_General_Exception, value:
                    print "fnord.tif is incompatible filetype.  Try graceful_errors=True"
                    #print value
            text = image_file_to_string('fnord.tif', graceful_errors=True, cleanup=False)
            print "fnord.tif contents:", text
            text = image_file_to_string('fonts_test.png', graceful_errors=True)
            print text
            text = image_file_to_string('fonts_test.png', lang="eng", psm=4, graceful_errors=True)
            print text
    
    


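    The version above is the one shipped with the source. If all you need is to recognize other languages, adding a single language parameter is enough. Here is my version: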

    """OCR in Python using the Tesseract engine from Google
    http://code.google.com/p/pytesser/
    by Michael J.T. O'Kelly
    V 0.0.1, 3/10/07"""
    
    import Image
    import subprocess
    import util
    import errors
    
    tesseract_exe_name = 'tesseract' # Name of executable to be called at command line
    scratch_image_name = "temp.bmp" # This file must be .bmp or other Tesseract-compatible format
    scratch_text_name_root = "temp" # Leave out the .txt extension
    cleanup_scratch_flag = True  # Temporary files cleaned up after OCR operation
    
    def call_tesseract(input_filename, output_filename, language):
    	"""Calls external tesseract.exe on input file (restrictions on types),
    	outputting output_filename+'txt'"""
    	args = [tesseract_exe_name, input_filename, output_filename, "-l", language]
    	proc = subprocess.Popen(args)
    	retcode = proc.wait()
    	if retcode!=0:
    		errors.check_for_errors()
    
    def image_to_string(im, cleanup = cleanup_scratch_flag, language = "eng"):
    	"""Converts im to file, applies tesseract, and fetches resulting text.
    	If cleanup=True, delete scratch files after operation."""
    	try:
    		util.image_to_scratch(im, scratch_image_name)
    		call_tesseract(scratch_image_name, scratch_text_name_root,language)
    		text = util.retrieve_text(scratch_text_name_root)
    	finally:
    		if cleanup:
    			util.perform_cleanup(scratch_image_name, scratch_text_name_root)
    	return text
    
    def image_file_to_string(filename, cleanup = cleanup_scratch_flag, graceful_errors=True, language = "eng"):
    	"""Applies tesseract to filename; or, if image is incompatible and graceful_errors=True,
    	converts to compatible format and then applies tesseract.  Fetches resulting text.
    	If cleanup=True, delete scratch files after operation."""
    	try:
    		try:
    			call_tesseract(filename, scratch_text_name_root, language)
    			text = util.retrieve_text(scratch_text_name_root)
    		except errors.Tesser_General_Exception:
    			if graceful_errors:
    				im = Image.open(filename)
    				text = image_to_string(im, cleanup, language)
    			else:
    				raise
    	finally:
    		if cleanup:
    			util.perform_cleanup(scratch_image_name, scratch_text_name_root)
    	return text
    	
    
    if __name__=='__main__':
    	im = Image.open('phototest.tif')
    	text = image_to_string(im)
    	print text
    	try:
    		text = image_file_to_string('fnord.tif', graceful_errors=False)
    	except errors.Tesser_General_Exception, value:
    		print "fnord.tif is incompatible filetype.  Try graceful_errors=True"
    		print value
    	text = image_file_to_string('fnord.tif', graceful_errors=True)
    	print "fnord.tif contents:", text
    	text = image_file_to_string('fonts_test.png', graceful_errors=True)
    	print text
    
    

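    When calling image_to_string, just pass the matching language argument: chi_sim for Simplified Chinese, chi_tra for Traditional Chinese.
    The value is the XXX part of the downloaded XXX.traineddata file name. For example, if the Chinese pack you downloaded is chi_sim.traineddata, the argument is chi_sim: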
    text = image_to_string(self.im, language = 'chi_sim')
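    Because the language argument is just the stem of a .traineddata file, the valid values can be read off the tessdata folder. The following is a small sketch, assuming the language packs sit directly in that folder; available_languages is a hypothetical helper, not part of pytesser.

```python
import os


def available_languages(tessdata_dir):
    """List language codes, i.e. the XXX part of each XXX.traineddata file."""
    suffix = ".traineddata"
    return sorted(
        name[: -len(suffix)]
        for name in os.listdir(tessdata_dir)
        if name.endswith(suffix)
    )
```

    With eng.traineddata and chi_sim.traineddata in the folder, the call returns ['chi_sim', 'eng']. Checking the requested language against this list before calling image_to_string catches a missing pack early.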

    That completes the image recognition.

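    One extra note: Chinese text may be recognized correctly yet show up garbled. In that case convert text to the Chinese encoding you are using, for example:
    text.decode("utf8")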
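    The same conversion can be sketched in Python 3, where bytes and text are separate types. This assumes Tesseract writes its output as UTF-8; the helper names and the GBK target encoding are illustrative choices, not part of pytesser.

```python
def decode_ocr_output(raw, encoding="utf-8"):
    """Turn OCR output bytes into text; values that are already text pass through."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


def encode_for_target(text, target_encoding="gbk"):
    """Encode text for a consumer that expects a specific encoding, such as a GBK console.

    Characters the target encoding cannot represent are replaced, not raised.
    """
    return text.encode(target_encoding, errors="replace")
```

    Decoding once right after reading the OCR result and encoding only at the point of output keeps the encoding handling in one place.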

  • Original article: https://www.cnblogs.com/javawebsoa/p/3106857.html