I. Solution:
1. Download the archive from http://www.ddooo.com/softdown/94968.htm, open it, find "tesseract-ocr-setup-3.02.02.exe", and double-click it to run the installer;
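2. The Python error message contains a link to pytesseract.py. Click it to open the file, then edit the Tesseract path in pytesseract.py.
Note: put an r in front of the path (making it a raw string) so that the backslashes are not read as escape characters.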
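A minimal sketch of why the r prefix matters, plus an alternative that avoids editing pytesseract.py: pytesseract reads the executable location from the module attribute pytesseract.pytesseract.tesseract_cmd, which can also be set from your own script. The install path below is a hypothetical example (use whatever directory you chose in the installer), and configure_tesseract is a helper name made up for this illustration.

```python
# Hypothetical install location; substitute the directory chosen in the installer.
plain_path = 'C:\tools\tesseract.exe'   # no r: each "\t" silently becomes a TAB
raw_path = r'C:\tools\tesseract.exe'    # r prefix: backslashes are kept literally


def configure_tesseract(path):
    """Point pytesseract at the Tesseract executable.

    Returns False when pytesseract is not installed, True otherwise.
    """
    try:
        import pytesseract
    except ImportError:
        return False
    pytesseract.pytesseract.tesseract_cmd = path
    return True


if __name__ == "__main__":
    print(repr(plain_path))   # shows the TAB characters that replaced "\t"
    print(repr(raw_path))     # shows the path with both backslashes intact
    print(configure_tesseract(raw_path))
```

Without the r, Python raises no error for this path. It simply stores a different string, which is why the resulting "file not found" failure is easy to misread.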
II. This text-recognition engine ships with some pre-trained data, and you can also fine-tune it yourself.
Usage and training:
https://www.cnblogs.com/Leo_wl/p/5556620.html
http://www.cnblogs.com/cnlian/p/5765871.html
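For orientation, a basic call through pytesseract might look like the sketch below. It assumes Pillow and pytesseract are installed and that Tesseract itself is reachable. ocr_image and "captcha.png" are placeholder names for this illustration. The lang argument selects which trained data file is used, so a model you trained yourself would be chosen by passing its name instead of "eng".

```python
def ocr_image(image_path, lang="eng"):
    """Return the text Tesseract recognizes in image_path using the given language data."""
    # Imported lazily so this module still loads where the OCR stack is not installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


if __name__ == "__main__":
    # "captcha.png" is a placeholder file name.
    print(ocr_image("captcha.png"))
```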
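III. The accuracy would not improve, and training on data I labeled myself was not realistic given the time available, so I switched to Tencent Cloud.
Tencent OCR allows 1,000 free calls per day, which is enough to work with, and its accuracy is, as you would expect, high!
Key management page: https://console.cloud.tencent.com/cam/overview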
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# import docx
import base64
import hashlib
import hmac
import random
import re
import time

import requests

appid = "1257122374"  # replace with your own Tencent Cloud APPID
bucket = "your-bucket"  # optional, can be left out
secret_id = "XXXXXXXXXXXXXXXXXX"  # SecretId from your own account
secret_key = "EXXXXXXXXXXXXXXX"  # SecretKey, from the same page
current = int(time.time())
expired = current + 2592000  # signature stays valid for 30 days
onceExpired = 0  # not used below
rdm = ''.join(random.choice("0123456789") for _ in range(10))
userid = "0"  # not used below
fileid = "tencentyunSignTest"  # not used below
# The bucket value may be dropped from the signed string
info = ("a=" + appid + "&b=" + bucket + "&k=" + secret_id + "&e=" + str(expired)
        + "&t=" + str(current) + "&r=" + rdm + "&u=0&f=")
# HMAC-SHA1 digest; on Python 3 hmac.new needs bytes, not str
signindex = hmac.new(secret_key.encode("utf-8"), info.encode("utf-8"), hashlib.sha1).digest()
sign = base64.b64encode(signindex + info.encode("utf-8")).decode("ascii")  # Base64-encode
url = "http://recognition.image.myqcloud.com/ocr/general"
headers = {'Host': 'recognition.image.myqcloud.com', "Authorization": sign}
# Raw string: without the r, \360 and \15 would be read as octal escapes
with open(r'G:\360Downloads\15.jpg', 'rb') as image:
    files = {'appid': (None, appid),
             'bucket': (None, bucket),
             'image': ('15.jpg', image, 'image/jpeg')}
    r = requests.post(url, files=files, headers=headers)
responseinfo = r.text
# Create an in-memory Word document object
# file = docx.Document()
# r_index = r'itemstring":"(.*?)"'  # regex that matches any recognized text
r_index = r'itemstring":"(\w+)"'  # mine only matches word characters (letters, digits, underscore)
result = re.findall(r_index, responseinfo)
for i in result:
    # file.add_paragraph(i)
    print(i)
# file.save(r"D:\writeResult.docx")
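The signing step and the result extraction in the script above can be pulled into two small functions, which makes them checkable without a network call. This is only a sketch: build_sign and extract_items are names invented here, and SAMPLE_RESPONSE is a made-up reply whose shape (a "data" object holding an "items" list) is an assumption of this example, not captured service output.

```python
import base64
import hashlib
import hmac
import json


def build_sign(appid, bucket, secret_id, secret_key, current, expired, rdm):
    """Build the Authorization value: Base64(HMAC-SHA1(key, info) + info)."""
    info = "a={}&b={}&k={}&e={}&t={}&r={}&u=0&f=".format(
        appid, bucket, secret_id, expired, current, rdm)
    digest = hmac.new(secret_key.encode("utf-8"), info.encode("utf-8"),
                      hashlib.sha1).digest()
    return base64.b64encode(digest + info.encode("utf-8")).decode("ascii")


def extract_items(response_text):
    """Collect every itemstring from a JSON reply shaped like SAMPLE_RESPONSE."""
    payload = json.loads(response_text)
    items = payload.get("data", {}).get("items", [])
    return [item["itemstring"] for item in items]


# Made-up reply in the shape this example assumes; not real service output.
SAMPLE_RESPONSE = (
    '{"code": 0, "message": "OK", "data": {"items": ['
    '{"itemstring": "AB12"}, {"itemstring": "hello world"}]}}'
)

if __name__ == "__main__":
    # Placeholder credentials and fixed timestamps, for illustration only.
    print(build_sign("1257122374", "", "id", "key", 1500000000, 1502592000, "0123456789"))
    print(extract_items(SAMPLE_RESPONSE))
```

Parsing the reply as JSON keeps items that contain spaces or punctuation, which a (\w+) pattern would skip. If you only want letters and digits, the regex in the script remains the shorter option.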