  • Setting up a tesseract5 training environment on CentOS 7.3 in Docker and running a training

    1. Prerequisites:

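    (1) The https://github.com/tesseract-ocr/tesseract project, installed on Linux.

    In other words, this step sets up a tesseract5 environment on Linux. Here I simply start a container from an image that already has tesseract installed and test in it.

    For how to build an image with tesseract installed, see https://www.cnblogs.com/qlqwjy/p/13028194.html

    I upgraded CentOS to 8, so the tesseract environment is deployed on CentOS 8. I started with CentOS 7.3, which failed with an error saying the ICU version was too old. Everything below is done in the CentOS 8 image.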

     

    Start the image, enter the container, and check the kernel and distribution:

    [root@0f76915a8f71 ocrtemplate]# uname -a
    Linux 0f76915a8f71 4.14.154-boot2docker #1 SMP Thu Nov 14 19:19:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
    [root@0f76915a8f71 ocrtemplate]# cat /etc/redhat-release
    CentOS Linux release 8.1.1911 (Core)

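    (2) The font to train on. Taking SimHei (黑体) as the example, the font file simhei.ttf is needed.
    On Linux:
    List installed font files: fc-list
    List Chinese font files: fc-list :lang=zh
    Copy C:\Windows\Fonts\simhei.ttf from Windows to /usr/share/fonts/zh on Linux.

    If the fc-list command is not found, install it: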

    yum install -y fontconfig mkfontscale

    After the font is uploaded it shows up as follows:

    [root@f7744ac1467d zh]# fc-list :lang=zh
    /usr/share/fonts/zh/SIMHEI.TTF: SimHei:style=Normal
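
    Before training it can be handy to check from a script that the font family is really registered. A minimal sketch: the helper name has_font is made up for illustration, and it simply filters fc-list style lines read from standard input.

```shell
#!/bin/sh
# has_font FAMILY: succeed if stdin (fc-list output) lists the given family.
# fc-list prints lines like "<file>: <family>[,<alias>...]:style=<style>".
# has_font is a hypothetical helper, not part of fontconfig.
has_font() {
    family=$1
    # Match the family right after "<file>: " or after a comma between aliases.
    grep -Eq "(: |,)${family}(,|:|\$)"
}
```

    Typical use would be: fc-list :lang=zh | has_font SimHei && echo "SimHei is installed"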

    (3) Clone the langdata_lstm project locally:
    https://github.com/tesseract-ocr/langdata_lstm

    Alternatively, download only the chi_sim and eng directories plus the following five files:
    common.punc
    font_properties
    Latin.unicharset
    Latin.xheights
    radical-stroke.txt
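
    If you go for the partial download, a small check like the one below can tell you whether anything from the list above is missing. This is only a sketch: check_langdata is a hypothetical helper, and the directory passed to it is wherever you put the langdata_lstm files.

```shell
#!/bin/sh
# check_langdata DIR: print every required langdata_lstm entry that is missing
# under DIR and return non-zero if any is missing.
# The entry list is the two directories and five files named in the text.
check_langdata() {
    dir=$1
    missing=0
    for entry in chi_sim eng common.punc font_properties \
                 Latin.unicharset Latin.xheights radical-stroke.txt; do
        if [ ! -e "$dir/$entry" ]; then
            echo "missing: $entry"
            missing=1
        fi
    done
    return "$missing"
}
```

    For the layout used later in this post the call would be: check_langdata /opt/tesstrain/langdata_lstm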

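      Also needed is tessdata, the official pre-trained models. Since we are training Chinese here, download chi_sim.traineddata and eng.traineddata; eng.traineddata is required.

      For recognition you can use the fast models, but for training you must use the best models.

    Model download locations: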

    https://github.com/tesseract-ocr/tessdata_best
    https://github.com/tesseract-ocr/tessdata_fast

    (4) Install the required dependencies:

    Reference: https://tesseract-ocr.github.io/tessdoc/Compiling.html#linux

     Step 1: check for and install cairo

    [root@0f76915a8f71 zh]# yum list cairo
    Last metadata expiration check: 0:45:55 ago on Thu 04 Jun 2020 10:05:43 AM UTC.
    Available Packages
    cairo.i686                                  1.15.12-3.el8                                 AppStream
    cairo.x86_64                                1.15.12-3.el8                                 AppStream
    [root@0f76915a8f71 zh]# yum install cairo.x86_64

    Step 2: check for and install pango (pango-devel must be installed along with pango)

    [root@0f76915a8f71 zh]# yum list pango
    Last metadata expiration check: 0:48:03 ago on Thu 04 Jun 2020 10:05:43 AM UTC.
    Available Packages
    pango.i686                                   1.42.4-6.el8                                 AppStream
    pango.x86_64                                 1.42.4-6.el8                                 AppStream
    [root@0f76915a8f71 zh]# yum install pango.x86_64

    Step 3: check for and install icu:

    [root@0f76915a8f71 zh]# yum list icu
    Last metadata expiration check: 0:49:26 ago on Thu 04 Jun 2020 10:05:43 AM UTC.
    Available Packages
    icu.x86_64                                   60.3-2.el8_1                                    BaseOS
    [root@0f76915a8f71 zh]# yum install icu.x86_64

    Step 4: install the following dependencies

    yum install asciidoc.noarch
    yum install libicu-devel.x86_64
    yum install libtiff
    yum install pango-devel.x86_64

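    (5) Go back into tesseract-master and build:

    cd /opt/tesseract/tesseract-master

     Step 1:

    ./autogen.sh

    Step 2:

    ./configure

    When configure finishes it prints a hint on how to install the Training Tools (if a dependency is missing, it shows up in the "checking for" lines):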

    checking for off_t... yes
    checking for mbstate_t... yes
    checking for pkg-config... /usr/bin/pkg-config
    checking pkg-config is at least version 0.9.0... yes
    checking for libcurl... no
    checking for LEPTONICA... yes
    checking for libarchive... no
    checking for ICU_UC... yes
    checking for ICU_I18N... yes
    checking for pango... yes
    checking for cairo... yes
    checking for pangocairo... yes
    checking for pangoft2... yes
    checking that generated files are newer than configure... done
    configure: creating ./config.status
    config.status: creating include/tesseract/version.h
    config.status: creating Makefile
    config.status: creating tesseract.pc
    config.status: creating tessdata/Makefile
    config.status: creating tessdata/configs/Makefile
    config.status: creating tessdata/tessconfigs/Makefile
    config.status: creating unittest/Makefile
    config.status: creating java/Makefile
    config.status: creating java/com/Makefile
    config.status: creating java/com/google/Makefile
    config.status: creating java/com/google/scrollview/Makefile
    config.status: creating java/com/google/scrollview/events/Makefile
    config.status: creating java/com/google/scrollview/ui/Makefile
    config.status: creating doc/Makefile
    config.status: creating src/training/Makefile
    config.status: creating config_auto.h
    config.status: executing depfiles commands
    config.status: executing libtool commands
    
    Configuration is done.
    You can now build and install tesseract by running:
    
    $ make
    $ sudo make install
    $ sudo ldconfig
    
    Documentation will not be built because asciidoc or xsltproc is missing.
    
    Training tools can be built and installed with:
    
    $ make training
    $ sudo make training-install
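
    The configure output is long, so it can help to pull out just the probes that answered "no". A minimal sketch with a made-up helper name, missing_deps, that reads the configure output from standard input. Note that a "no" is not always fatal: in the output above libcurl and libarchive are reported as "no" and configure still completed.

```shell
#!/bin/sh
# missing_deps: read ./configure output on stdin and print the name of every
# "checking for X... no" probe, one per line.
# missing_deps is a hypothetical helper for illustration.
missing_deps() {
    sed -n 's/^[[:space:]]*checking for \(.*\)\.\.\. no$/\1/p'
}
```

    One way to use it: ./configure 2>&1 | tee configure.log | missing_deps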

    Step 3: build and install the training tools

    make training
    make training-install

    Step 4: test with text2image

    [root@0f76915a8f71 tesseract-master]# text2image -v
    Using CAIRO_FONT_TYPE_FT.
    Pango version: 1.42.3
    5.0.0-alpha

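    2. Start training

    1. Generate the training data

    (1) Move the best versions of the traineddata files into /usr/local/share/tessdata/. eng must be present, and the best versions must be used.

    (2) Generate the training data (the Chinese training text in the official langdata is about 25MB; rendering all of it to images would be too large, so the command below generates at most 5 pages)

      Beyond 1700 pages the run fails with an error whose cause I have not found, so I generate at most 1700 pages per run. Also, the chi_sim.traineddata under /usr/local/share/tessdata must be the best version.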

    /opt/tesseract/tesseract-master/src/training/tesstrain.sh \
      --fonts_dir /usr/share/fonts/zh \
      --lang chi_sim --linedata_only \
      --noextract_font_properties \
      --langdata_dir /opt/tesstrain/langdata_lstm \
      --tessdata_dir /usr/local/share/tessdata \
      --save_box_tiff --maxpages 5 \
      --fontlist "SimHei" \
      --output_dir /opt/tesstrain/imgs

    Parameters:

     --exposures EXPOSURES      # A list of exposure levels to use (e.g. "-1 0 1").
     --fontlist FONTS           # A list of fontnames to train on.
     --fonts_dir FONTS_PATH     # Path to font files.
     --lang LANG_CODE           # ISO 639 code.
     --langdata_dir DATADIR     # Path to tesseract/training/langdata directory.
     --linedata_only            # Only generate training data for lstmtraining.
     --output_dir OUTPUTDIR     # Location of output traineddata file.
     --overwrite                # Safe to overwrite files in output_dir.
     --run_shape_clustering     # Run shape clustering (use for Indic langs).
     --maxpages                 # Specify maximum pages to output (default:0=all)
     --save_box_tiff            # Save box/tiff pairs along with lstmf files.
     --xsize                    # Specify width of output image (default:3600)  

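    Explanation: this step uses text2image to render the training text into tif images. The text in the generated images is known exactly (you can open one and look), so every image comes with its ground truth. That lets us evaluate the official pre-trained data: recognize the generated images with the official model, compare the result with the known text, and you get the model's accuracy on this font. There is no need to do the comparison by hand; a dedicated tool does it.

    The generated training files are all under /opt/tesstrain/imgs. The file list: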

    [root@0f76915a8f71 imgs]# ls -l
    total 4492
    drwxr-x--- 2 root root    4096 Jun  5 01:50 chi_sim
    -rw-r--r-- 1 root root  481134 Jun  5 01:50 chi_sim.SimHei.exp0.box
    -rw-r--r-- 1 root root 3062484 Jun  5 01:50 chi_sim.SimHei.exp0.lstmf
    -rw-r--r-- 1 root root 1041136 Jun  5 01:50 chi_sim.SimHei.exp0.tif
    -rw-r--r-- 1 root root      46 Jun  5 01:50 chi_sim.training_files.txt
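
    lstmeval and lstmtraining read chi_sim.training_files.txt as a list of .lstmf paths, one per line, so a quick sanity check of that list can save a failed run later. A minimal sketch; check_listfile is a hypothetical helper name.

```shell
#!/bin/sh
# check_listfile FILE: verify that every non-empty line of a training list
# file names an existing, non-empty file. Prints the offending paths and
# returns non-zero if there is any.
# check_listfile is a hypothetical helper for illustration.
check_listfile() {
    bad=0
    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue
        if [ ! -s "$path" ]; then
            echo "missing or empty: $path"
            bad=1
        fi
    done < "$1"
    return "$bad"
}
```

    For the files above the call would be: check_listfile /opt/tesstrain/imgs/chi_sim.training_files.txt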

    2. Extract the chi_sim.lstm file:

    [root@0f76915a8f71 imgs]# combine_tessdata -e /usr/local/share/tessdata/chi_sim.traineddata /opt/tesstrain/chi_sim.lstm
    Extracting tessdata components from /usr/local/share/tessdata/chi_sim.traineddata
    Wrote /opt/tesstrain/chi_sim.lstm
    Version string:4.00.00alpha:chi_sim:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
    0:config:size=89, offset=192
    17:lstm:size=1546461, offset=281
    18:lstm-punc-dawg:size=282, offset=1546742
    19:lstm-word-dawg:size=590634, offset=1547024
    20:lstm-number-dawg:size=82, offset=2137658
    21:lstm-unicharset:size=258834, offset=2137740
    22:lstm-recoder:size=72494, offset=2396574
    23:version:size=84, offset=2469068

    3. Evaluate with the extracted lstm file:

    lstmeval --model /opt/tesstrain/chi_sim.lstm \
     --traineddata /usr/local/share/tessdata/chi_sim.traineddata \
     --eval_listfile /opt/tesstrain/imgs/chi_sim.training_files.txt

     The evaluation above is optional. It only shows the current recognition accuracy, so it can be skipped.

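    4. Start training

     OMP_THREAD_LIMIT=8 lstmtraining --model_output /opt/tesstrain/simhei \
      --continue_from /opt/tesstrain/chi_sim.lstm \
      --traineddata /usr/local/share/tessdata/chi_sim.traineddata \
      --train_listfile /opt/tesstrain/imgs/chi_sim.training_files.txt \
      --max_iterations 5

        The --continue_from parameter names the network to start from, here the one extracted earlier, so training builds on the existing model. It accepts not only a .lstm file but also a checkpoint file. Checkpoint files are written during training; with the command above they appear next to the model output, e.g. /opt/tesstrain/simhei_checkpoint.
      If the result is still unsatisfactory after the given number of iterations, you can continue training by running the command above again. If chi_sim.lstm was deleted by accident, point --continue_from at the checkpoint file instead. Note that --max_iterations must be larger than in the previous run, because training resumes from the previous result.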
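
    Resuming can be scripted so that the checkpoint file (simhei_checkpoint, the same file used in the merge step) is preferred when it exists and the extracted chi_sim.lstm is the fallback. A minimal sketch: pick_continue_from is a hypothetical helper, and the --max_iterations value in the commented example is an arbitrary number chosen only to be larger than the 5 used above.

```shell
#!/bin/sh
# pick_continue_from CHECKPOINT BASE_LSTM: print CHECKPOINT if that file
# exists, otherwise fall back to the network extracted from the official
# traineddata. pick_continue_from is a hypothetical helper.
pick_continue_from() {
    if [ -f "$1" ]; then
        printf '%s\n' "$1"
    else
        printf '%s\n' "$2"
    fi
}

# Example, commented out so the snippet has no side effects
# (400 is an assumed value, just larger than the previous run):
# start=$(pick_continue_from /opt/tesstrain/simhei_checkpoint /opt/tesstrain/chi_sim.lstm)
# lstmtraining --model_output /opt/tesstrain/simhei \
#   --continue_from "$start" \
#   --traineddata /usr/local/share/tessdata/chi_sim.traineddata \
#   --train_listfile /opt/tesstrain/imgs/chi_sim.training_files.txt \
#   --max_iterations 400
```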

    ==========Training output format:=======================
    At iteration 14615/695400/698614, Mean rms=0.158%, delta=0.295%, char train=1.882%, word train=2.285%, skip ratio=0.4%,  wrote checkpoint.
    14615: learning_iteration
    695400: training_iteration
    698614: sample_iteration
    
    learning_iteration:
        “Number of iterations that yielded a non-zero delta error and thus provided significant learning. (learning_iteration <= training_iteration). learning_iteration_ is used to measure rate of learning progress.” So it uses the delta value to assess whether the iteration has been useful.
    training_iteration:
        “Number of actual backward training steps used.” It is how many times a training file has been SUCCESSFULLY passed into the learning process. So every time you get an error such as “Image too large to learn!!”, “Encoding of string failed!” or “Deserialize header failed”, the sample_iteration increments but not the training_iteration. Here 1 - (695400 / 698614) ≈ 0.46%, in line with the 0.4% skip ratio in the log line: the proportion of files that have been skipped because of an error.
    sample_iteration:
        “Index into training sample set. (sample_iteration >= training_iteration).” It is how many times a training file has been passed into the learning process.
                                    
    skip ratio: the proportion of files skipped because of an error
        1 - (training_iteration / sample_iteration) = 1 - (695400 / 698614) ≈ 0.46%
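
    The same arithmetic can be pulled straight out of the training log. A minimal sketch, with a made-up helper name skip_ratio, that reads log lines from standard input and applies 1 - (training_iteration / sample_iteration) to each "At iteration" line. For the sample line above it gives about 0.46%, while the log line itself reports 0.4%.

```shell
#!/bin/sh
# skip_ratio: read lstmtraining log lines on stdin; for each line containing
# "At iteration L/T/S" print 100 * (1 - T/S) with two decimals.
# skip_ratio is a hypothetical helper for illustration.
skip_ratio() {
    sed -n 's|.*At iteration \([0-9]*\)/\([0-9]*\)/\([0-9]*\).*|\2 \3|p' |
        awk '$2 > 0 { printf "%.2f\n", 100 * (1 - $1 / $2) }'
}
```

    It could be fed from a saved log, e.g. skip_ratio < training.log, where training.log is whatever file you redirected the lstmtraining output to.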

    5. Merge the training result:

    lstmtraining --stop_training \
      --continue_from /opt/tesstrain/simhei_checkpoint \
      --traineddata /usr/local/share/tessdata/chi_sim.traineddata \
      --model_output /usr/local/share/tessdata/chi_sim_simhei.traineddata

      This produces our trained chi_sim_simhei.traineddata file.

      To get a fast version like the official ones, convert chi_sim_simhei.traineddata with the following command:

    combine_tessdata -c chi_sim_simhei.traineddata

    List the available languages again:

    [root@0f76915a8f71 tessdata_best-master]# tesseract --list-langs
    List of available languages (5):
    chi_sim
    chi_sim_fast
    chi_sim_simhei
    eng
    eng_fast

    6. Test:

    # Language only. Specifying the language requires chi_sim_vert.traineddata
    tesseract ./normal.png chi_sim__simhei_result -l chi_sim_simhei
    tesseract ./normal.png chi_sim__simhei_result -l chi_sim -c preserve_interword_spaces=1
    tesseract ./normal.png chi_sim__simhei_result -l chi_sim -c preserve_interword_spaces=1 --dpi 300
    tesseract ./normal.png chi_sim__simhei_result -l chi_sim -c preserve_interword_spaces=1 --dpi 300 --oem 1
    
    # Using --psm 1 requires osd.traineddata
    tesseract ./normal.png chi_sim__simhei_result -l chi_sim -c preserve_interword_spaces=1 --dpi 300 --oem 1 --psm 1

    3. Commit the container as an image

    Administrator@MicroWin10-1535 MINGW64 /e/tesseractTrain/docker制作tesseract5Train镜像
    $ docker ps -a
    CONTAINER ID        IMAGE                                COMMAND             CREATED             STATUS              PORTS               NAMES
    0f76915a8f71        daocloud.io/library/centos:centos8   "/bin/bash"         16 hours ago        Up 16 hours                             friendly_cannon
    
    Administrator@MicroWin10-1535 MINGW64 /e/tesseractTrain/docker制作tesseract5Train镜像
    $ docker commit 0f76915a8f71 tesseracttraining  # commit the container as an image
    sha256:0c80e881e0d2ba32ff930d2d6ec2645b9d2414c8d4c744e9efacf5be35c5f37a
    
    Administrator@MicroWin10-1535 MINGW64 /e/tesseractTrain/docker制作tesseract5Train镜像
    $ docker images | grep training
    tesseracttraining                                                      latest               0c80e881e0d2        28 seconds ago      2.15GB

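    4. Save the image as a tar archive for easy migration

    docker save -o tesseracttraining.tar tesseracttraining

      The tar archive can then be moved anywhere and installed offline.

      It can of course also be pushed to an image registry.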
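
    For the offline route, the archive is imported on the target host with docker load. A minimal sketch: load_image is a hypothetical wrapper, and the DOCKER variable is only a convention of this snippet that allows a dry run (for example DOCKER=echo); docker itself does not read it.

```shell
#!/bin/sh
# load_image TAR: import an image archive produced by `docker save`.
# DOCKER may be overridden for a dry run; it is an illustrative variable of
# this snippet, not something docker reads. load_image is a hypothetical name.
load_image() {
    "${DOCKER:-docker}" load -i "$1"
}
```

    On the target machine: load_image tesseracttraining.tar, then docker images | grep training to confirm the image is there.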

    Addendum: if the packages above are not available from your yum repositories, switch the yum source to the Aliyun mirror:

    (1) Back up the current repo file:
    mv /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo.backup
    (2) Download the new CentOS-Base.repo into /etc/yum.repos.d/
    CentOS 6
    curl -o /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-6.repo
     
    CentOS 7
    curl -o /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo
     
    CentOS 8
    curl -o /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-8.repo
    (3) Clear the cache and rebuild it
    Clear the cache:
    yum clean all
    
    Rebuild the cache:
    yum makecache
  • Original article: https://www.cnblogs.com/qlqwjy/p/13028200.html