Summary:
Installing dependencies on CentOS 7
Tesseract configuration
Code example
Installing dependencies on CentOS 7
- Install the CentOS system dependencies
    yum install -y automake autoconf libtool gcc gcc-c++
    yum install -y libpng-devel libjpeg-devel libtiff-devel
- Install leptonica
    wget http://www.leptonica.org/source/leptonica-1.72.tar.gz
    tar xvzf leptonica-1.72.tar.gz
    cd leptonica-1.72/
    ./configure
    make && make install
- Install tesseract-ocr
    wget https://github.com/tesseract-ocr/tesseract/archive/3.04.zip
    unzip 3.04.zip
    cd tesseract-3.04/
    # the GitHub archive has no generated configure script, so create it first
    ./autogen.sh
    ./configure
    make && make install
    sudo ldconfig
- Deploy the model
  - Download the model file for the required language from https://github.com/tesseract-ocr/tessdata
  - Move the model file to /usr/local/share/tessdata
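Before running OCR it can help to confirm that the model file really is where Tesseract will look for it. The following is a minimal sketch, assuming the tessdata directory from the step above and the usual `<lang>.traineddata` file naming; the helper name `missing_models` is hypothetical and not part of any library.

```python
import os

# Directory from the step above; adjust if Tesseract was installed elsewhere.
TESSDATA_DIR = "/usr/local/share/tessdata"


def missing_models(languages, tessdata_dir=TESSDATA_DIR):
    """Return the languages whose <lang>.traineddata file is absent."""
    missing = []
    for lang in languages:
        model_path = os.path.join(tessdata_dir, lang + ".traineddata")
        if not os.path.isfile(model_path):
            missing.append(lang)
    return missing


if __name__ == "__main__":
    # "eng" is the language used by the configuration later in this post.
    print(missing_models(["eng"]))
```

An empty list means every requested model was found; anything else names the languages that still need to be downloaded.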
- Install the Python dependencies listed in requirements.txt
    pip install -r requirements.txt
Tesseract configuration
- Create eng.user-patterns in /usr/local/share/tessdata containing a pattern that matches a 6-character string (letters or digits)
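One pattern that fits the description "six letters or digits" is six repetitions of a single alphanumeric token. The sketch below assumes Tesseract's user-pattern syntax, in which (as far as I know) the two-character token backslash + `n` stands for one letter or digit, `\d` for a digit and `\c` for a letter. Treat the token choice as an assumption, not necessarily the exact pattern used in this setup; `build_pattern` and `write_user_patterns` are hypothetical helper names.

```python
# Assumption: in Tesseract's user-pattern syntax the two-character token
# backslash + "n" matches one letter or digit (it is not a newline here).
ALNUM_TOKEN = "\\n"


def build_pattern(length, token=ALNUM_TOKEN):
    """Repeat a character-class token `length` times."""
    return token * length


def write_user_patterns(path, patterns):
    """Write one pattern per line to a user-patterns file."""
    with open(path, "w") as handle:
        for pattern in patterns:
            handle.write(pattern + "\n")


if __name__ == "__main__":
    # Prints the pattern as literal text: \n\n\n\n\n\n
    print(build_pattern(6))
```

Calling `write_user_patterns("eng.user-patterns", [build_pattern(6)])` produces a file that can then be copied to /usr/local/share/tessdata.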
- Create myconfig in /usr/local/share/tessdata/configs and write the following into it
    # character whitelist
    tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz0123456789
    # user pattern matching: suffix of the patterns file
    user_patterns_suffix user-patterns
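Even with a whitelist in place, it can be worth checking the OCR result in Python before using it. A minimal sketch, assuming the same character set as the whitelist above and the six-character length mentioned earlier; `normalize_code` and `is_valid_code` are hypothetical helper names.

```python
import string

# Same character set as tessedit_char_whitelist in the config above.
WHITELIST = set(string.ascii_lowercase + string.digits)


def normalize_code(raw_text):
    """Drop all whitespace that OCR output often carries (spaces, newlines)."""
    return "".join(raw_text.split())


def is_valid_code(text, length=6):
    """True if `text` has the expected length and only whitelisted characters."""
    return len(text) == length and all(ch in WHITELIST for ch in text)
```

A result that fails `is_valid_code` is a reasonable signal to retry with a different page segmentation mode or a preprocessed image.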
- psm parameter description
    -psm N
        Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:
        0 = Orientation and script detection (OSD) only.
        1 = Automatic page segmentation with OSD.
        2 = Automatic page segmentation, but no OSD, or OCR.
        3 = Fully automatic page segmentation, but no OSD. (Default)
        4 = Assume a single column of text of variable sizes.
        5 = Assume a single uniform block of vertically aligned text.
        6 = Assume a single uniform block of text.
        7 = Treat the image as a single text line.
        8 = Treat the image as a single word.
        9 = Treat the image as a single word in a circle.
        10 = Treat the image as a single character.
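From Python, these options can be passed through pytesseract's `config` argument, which is appended to the tesseract command line. A minimal sketch, assuming the `myconfig` file created above and the `-psm` spelling used by Tesseract 3.x (newer releases spell it `--psm`). The helpers `build_config` and `read_code` are hypothetical, and mode 7 is only a guess that suits a short code printed on one line.

```python
def build_config(psm=7, config_file="myconfig"):
    """Build the extra command-line arguments for the tesseract binary."""
    if not 0 <= psm <= 10:
        raise ValueError("psm must be between 0 and 10")
    # Tesseract 3.x spells the flag "-psm"; 4.x and later use "--psm".
    return "-psm {} {}".format(psm, config_file)


def read_code(image_path, psm=7):
    """OCR one image using the given page segmentation mode."""
    # Imported here so build_config can be used without these packages.
    import pytesseract
    from PIL import Image

    image = Image.open(image_path)
    return pytesseract.image_to_string(image, lang="eng", config=build_config(psm))
```

For example, `read_code('code.png', psm=8)` would treat the whole image as a single word instead of a single line.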
Code example
    import pytesseract
    from PIL import Image

    # Load the image and run OCR with the default settings
    image = Image.open('code.png')
    code = pytesseract.image_to_string(image)
    print(code)