Basic Usage of Tesseract

Tesseract download page: https://digi.bib.uni-mannheim.de/tesseract/

Adding the Chinese language data:

https://github.com/tesseract-ocr/tessdata/find/master

Download chi_sim.traineddata from this page and place it in the Tesseract-OCR\tessdata folder.

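Setting environment variables:

After installation on Windows, add the directory containing tesseract.exe to the PATH environment variable.

There is one more environment variable. I did not set it on my own machine and my programs still ran fine, so the following is for reference only: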
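
To confirm that the PATH change took effect, a quick check from Python can help. This is a minimal sketch that only uses the standard library; the helper name `find_tesseract` is made up for illustration.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(executable)


path = find_tesseract()
if path is None:
    print("tesseract was not found on PATH")
else:
    print("tesseract found at", path)
```

If this prints that tesseract was not found, open a new terminal first, since environment variable changes usually do not reach terminals that were already open.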

**********************************************************************************************************

When testing with the tesseract command line, the following error was reported:

Error opening data file \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

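The error means that the TESSDATA_PREFIX environment variable is missing, so no language can be loaded and tesseract cannot be initialized.

The fix is simple: add an environment variable named TESSDATA_PREFIX whose value is the path of the tessdata directory. Note that the error message above asks for the parent directory of tessdata instead; which of the two is expected depends on the Tesseract version.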
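
A small check like the following can tell whether the language files are where Tesseract will look for them. It is only a sketch: the helper name `missing_languages` is hypothetical, and it assumes TESSDATA_PREFIX points at the tessdata directory itself, which may not match older Tesseract versions.

```python
import os
from pathlib import Path


def missing_languages(tessdata_dir, languages):
    """Return the language codes whose .traineddata file is absent."""
    folder = Path(tessdata_dir)
    return [lang for lang in languages
            if not (folder / f"{lang}.traineddata").is_file()]


# TESSDATA_PREFIX may legitimately be unset, as on the author's machine.
prefix = os.environ.get("TESSDATA_PREFIX")
if prefix:
    # Assumption: the variable holds the tessdata directory itself.
    print("missing:", missing_languages(prefix, ["eng", "chi_sim"]))
else:
    print("TESSDATA_PREFIX is not set")
```

An empty list means both eng.traineddata and chi_sim.traineddata were found in that folder.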

**************************************************************************************************************************

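Recognizing images with tesseract on the command line:

To use the tesseract command in cmd, the directory containing tesseract.exe must be on the PATH environment variable. Then run: tesseract <image path> <output file path>.
Example: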
```
tesseract a.png a
```
This recognizes the text in a.png and writes it to a.txt.

To recognize Chinese, add the -l parameter. The default language is eng; for Chinese, change it to chi_sim:
```
tesseract a.png a -l chi_sim
```
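
The same command can also be driven from a script. The sketch below assumes tesseract is on PATH; the helper names `build_command` and `run_ocr` are made up for illustration and simply wrap the command line shown above.

```python
import subprocess


def build_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract <image> <output base> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run tesseract and return the name of the text file it writes."""
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_command(image_path, output_base, lang), check=True)
    return f"{output_base}.txt"


# Example call (requires tesseract and a.png to exist):
# run_ocr("a.png", "a", lang="chi_sim")
```

Passing the arguments as a list avoids shell quoting problems when the image path contains spaces.
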
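A quick way to open cmd in the current folder: hold Shift and right-click, and the context menu will offer "Open command window here", which starts cmd directly in that folder.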

Recognizing images with tesseract in code:

Install pytesseract with pip install pytesseract.

Example code:
```
from PIL import Image
import pytesseract

# Open the image and run OCR with the Simplified Chinese language data
text = pytesseract.image_to_string(Image.open('captcha.png'), lang='chi_sim')
print(text)
```
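
If tesseract.exe is not on PATH, pytesseract can be pointed at it explicitly through `pytesseract.pytesseract.tesseract_cmd`. The sketch below wraps that together with a cleanup step; the function names `strip_whitespace` and `read_captcha` are hypothetical, and removing all whitespace is an assumption that suits short CAPTCHA strings, not general text.

```python
def strip_whitespace(raw_text):
    """Drop spaces, newlines and form feeds from OCR output."""
    return "".join(raw_text.split())


def read_captcha(image_path, lang="chi_sim", tesseract_cmd=None):
    """OCR an image and return the recognized text without whitespace."""
    # Imported here so strip_whitespace works even without the OCR packages.
    from PIL import Image
    import pytesseract

    if tesseract_cmd:
        # Full path to the tesseract executable, used when it is not on PATH.
        pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
    with Image.open(image_path) as image:
        raw_text = pytesseract.image_to_string(image, lang=lang)
    return strip_whitespace(raw_text)


# Example call (requires pytesseract, Pillow, tesseract and captcha.png):
# print(read_captcha("captcha.png"))
```
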
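Simple code to download an image from a web page:
```
from urllib import request

# URL of the CAPTCHA image to fetch
img_url = 'https://u.baidu.com/ucweb/?module=Reguser&controller=reg&action=image&appid=12&_=1551428462677'
# Save the response body to captcha.png in the current directory
request.urlretrieve(img_url, 'captcha.png')
```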
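
A slightly more defensive variant of the download step is sketched below. It swaps `urlretrieve` for `urlopen` so that a timeout can be set; the function name `download_image` and the 10 second default are assumptions for illustration.

```python
import shutil
from urllib import request


def download_image(url, destination, timeout=10):
    """Fetch url and save the response body to destination."""
    # urlopen raises URLError or HTTPError on network or server failures.
    with request.urlopen(url, timeout=timeout) as response, \
            open(destination, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return destination


# Example call, reusing the URL from above:
# download_image(img_url, "captcha.png")
```

The saved file can then be passed to `pytesseract.image_to_string` as in the earlier example.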

Original article: https://www.cnblogs.com/weiwei2016/p/10457863.html