  • Tesseract-OCR Training Procedure (v3.02)

    Software:
    jTessBoxEditor Version 0.9 (30 April 2013)
    Tesseract-OCR win32 v3.02 with Leptonica
     
    Training steps:
     
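    1. In jTessBoxEditor, use Tools -> Merge TIFF to combine the sample images into a single TIF file (eng.arial.01.tif in the commands that follow).
    2. Generate the box file: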
    tesseract.exe eng.arial.01.tif eng.arial.01 batch.nochop makebox
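    3. Open the TIF file in jTessBoxEditor. Use Insert or Delete to add or remove character boxes, and adjust each box's coordinates with the x, y, w, and h fields.
    4. Train. If a character cannot be recognized ("Couldn't find a matching blob"), try moving the box or adjusting its coordinates, then run the command again: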
    tesseract.exe eng.arial.01.tif eng.arial.01 nobatch box.train
    5. Preprocess the font (extract the character set from the box file):
    unicharset_extractor.exe eng.arial.01.box
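    6. Create font_properties.txt with the following content (font name, then the italic, bold, fixed, serif, and fraktur flags): arial 0 0 0 0 0
    7. Process the font (feature training):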
    mftraining.exe -F font_properties.txt -U unicharset eng.arial.01.tr
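    8. Run character-normalization training:
    cntraining.exe eng.arial.01.tr
    9. Add the prefix "eng.arial.01." to the four files unicharset, inttemp, normproto, and pffmtable (for example, unicharset becomes eng.arial.01.unicharset).
    10. Combine the data files:
    combine_tessdata.exe eng.arial.01.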
     
    The command prints:
    Combining tessdata files
    TessdataManager combined tesseract data files.
    Offset for type 0 is -1
    Offset for type 1 is 108
    Offset for type 2 is -1
    Offset for type 3 is 1660
    Offset for type 4 is 327545
    Offset for type 5 is 327781
    Offset for type 6 is -1
    Offset for type 7 is -1
    Offset for type 8 is -1
    Offset for type 9 is -1
    Offset for type 10 is -1
    Offset for type 11 is -1
    Offset for type 12 is -1
     
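    Make sure the values on lines 2, 4, 5, and 6 of the offset listing (types 1, 3, 4, and 5) are not -1. If they are not, the new trained data file has been generated successfully.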
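Steps 6 and 9 and the offset check above are easy to get wrong by hand, so they can be scripted. The following is a minimal Python sketch, not part of Tesseract itself: the function names (write_font_properties, add_prefix, missing_components) are hypothetical helpers, and the regular expression assumes the "Offset for type N is M" wording shown in the output above.

```python
import re
from pathlib import Path

BASE = "eng.arial.01"  # prefix used in the steps above
TRAINING_FILES = ("unicharset", "inttemp", "normproto", "pffmtable")
REQUIRED_TYPES = (1, 3, 4, 5)  # lines 2, 4, 5, 6 of the offset listing
OFFSET_RE = re.compile(r"Offset for type\s+(\d+)\s+is\s+(-?\d+)")


def write_font_properties(directory, font="arial", flags=(0, 0, 0, 0, 0)):
    """Step 6: write '<font> <italic> <bold> <fixed> <serif> <fraktur>'."""
    path = Path(directory) / "font_properties.txt"
    line = f"{font} {' '.join(str(f) for f in flags)}\n"
    path.write_text(line, encoding="ascii")
    return path


def add_prefix(directory, base=BASE):
    """Step 9: rename the four training outputs to '<base>.<name>'."""
    directory = Path(directory)
    renamed = []
    for name in TRAINING_FILES:
        target = directory / f"{base}.{name}"
        # replace() overwrites an older target left over from a previous run
        (directory / name).replace(target)
        renamed.append(target.name)
    return renamed


def missing_components(combine_output):
    """Return the required type numbers whose offset is -1 or not listed."""
    offsets = {int(t): int(v) for t, v in OFFSET_RE.findall(combine_output)}
    return [t for t in REQUIRED_TYPES if offsets.get(t, -1) == -1]
```

To use the check, capture whatever combine_tessdata printed (for example with subprocess.run(..., capture_output=True, text=True)) and pass that text to missing_components; an empty list means all four required entries were written.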
     
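    11. Copy the file "eng.arial.01.traineddata" from the current directory into the "tessdata" directory under the Tesseract program directory.
    12. Run recognition with the new language: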
    #tesseract.exe test.jpg result -l eng.arial.01
    #tesseract.exe a.bmp result2 -l eng.arial.01
     
    Specifying the page layout mode:
    tesseract.exe 42.png result2 -l eng.arial.01 -psm 7
     
     
    Layout parameter description:
     
    -psm N
        Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:
     
        0 = Orientation and script detection (OSD) only.
        1 = Automatic page segmentation with OSD.
        2 = Automatic page segmentation, but no OSD, or OCR.
        3 = Fully automatic page segmentation, but no OSD. (Default)
        4 = Assume a single column of text of variable sizes.
        5 = Assume a single uniform block of vertically aligned text.
        6 = Assume a single uniform block of text.
        7 = Treat the image as a single text line.
        8 = Treat the image as a single word.
        9 = Treat the image as a single word in a circle.
        10 = Treat the image as a single character.
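When recognition is driven from a script, it can help to validate the layout mode before calling Tesseract. The sketch below builds the same command line as the examples above; recognize_command and its parameters are hypothetical names chosen for illustration, and the option is spelled -psm as in v3.02 (newer releases use --psm instead).

```python
import subprocess

VALID_PSM = range(0, 11)  # modes 0 through 10, as listed above


def recognize_command(image, out_base, lang="eng.arial.01", psm=None,
                      exe="tesseract"):
    """Build the argument list for one recognition run."""
    cmd = [exe, image, out_base, "-l", lang]
    if psm is not None:
        if psm not in VALID_PSM:
            raise ValueError(f"psm must be between 0 and 10, got {psm!r}")
        cmd += ["-psm", str(psm)]  # v3.02 spelling of the option
    return cmd


def recognize(image, out_base, **kwargs):
    """Run Tesseract; it writes the result to '<out_base>.txt'."""
    subprocess.run(recognize_command(image, out_base, **kwargs), check=True)
```

For example, recognize("42.png", "result2", psm=7) corresponds to the single-line command shown earlier, assuming the tesseract executable is on the PATH.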
     
  • Original article: https://www.cnblogs.com/waw/p/5495350.html