  • Tesseract font training: reference notes


    1. Create the .box file

    tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] -l yournewlanguage batch.nochop makebox
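
The [lang].[fontname].exp[num] pattern is easy to mistype, so a small helper can assemble it. The following is only a sketch: the function names and the sample values (eng, timesitalic, 0) are made up for illustration, and the command is printed rather than executed.

```shell
#!/usr/bin/env bash
# Build the base name [lang].[fontname].exp[num] used by the training commands.
# All names here are hypothetical helpers, not part of Tesseract.
training_base() {
  local lang="$1" font="$2" num="$3"
  printf '%s.%s.exp%s\n' "$lang" "$font" "$num"
}

# Print (do not run) the makebox command for one training image.
# Optional 4th argument: the language to pass with -l.
makebox_cmd() {
  local base lang_opt=""
  base="$(training_base "$1" "$2" "$3")"
  if [ -n "${4:-}" ]; then
    lang_opt=" -l $4"
  fi
  printf 'tesseract %s.tif %s%s batch.nochop makebox\n' "$base" "$base" "$lang_opt"
}

makebox_cmd eng timesitalic 0
```

Printing the command first makes it possible to check the file names before running Tesseract on real images.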

    2. Run the training

    tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] box.train

    or

    tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] box.train.stderr

     

    set_unicharset_properties 

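    The original author was unsure what this tool is for. According to the Tesseract 3 training wiki, it was added in 3.03 and sets extra properties in the unicharset, mostly sizes obtained from fonts.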

    training/set_unicharset_properties -U input_unicharset -O output_unicharset --script_dir=training/langdata

    font_properties 

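    The font properties file. Each line has the form:

    <fontname> <italic> <bold> <fixed> <serif> <fraktur>

    <fontname> is a string naming the font; <italic>, <bold>, <fixed>, <serif>, and <fraktur> are each a simple 0 or 1 flag indicating whether the font has that property.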

    Example:

    timesitalic 1 0 0 1 0

    Note: in 3.03 there is a default font_properties file covering 3000 fonts (not necessarily accurate) at training/langdata/font_properties.
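
Because every flag must be exactly 0 or 1, it can help to generate font_properties entries through a small checker rather than by hand. This is a minimal sketch; font_properties_line is a hypothetical helper name, not a Tesseract tool.

```shell
#!/usr/bin/env bash
# Emit one font_properties entry:
#   <fontname> <italic> <bold> <fixed> <serif> <fraktur>
# Returns non-zero if the name is empty, contains whitespace,
# or any flag is something other than 0 or 1.
font_properties_line() {
  if [ "$#" -ne 6 ]; then
    echo "usage: font_properties_line <fontname> <italic> <bold> <fixed> <serif> <fraktur>" >&2
    return 1
  fi
  local name="$1"
  shift
  case "$name" in
    ''|*[[:space:]]*)
      echo "invalid font name: '$name'" >&2
      return 1
      ;;
  esac
  local flag
  for flag in "$@"; do
    case "$flag" in
      0|1) ;;
      *)
        echo "flag must be 0 or 1, got: '$flag'" >&2
        return 1
        ;;
    esac
  done
  printf '%s %s %s %s %s %s\n' "$name" "$@"
}

font_properties_line timesitalic 1 0 0 1 0
```

Appending the output to a file (for example with >> font_properties) builds the file one validated line at a time.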

    Clustering

    shapeclustering creates a master shape table by shape clustering and writes it to a file named shapetable.

    shapeclustering -F font_properties -U unicharset lang.fontname.exp0.tr lang.fontname.exp1.tr ...

    Note: if you get an error message such as "index >= 0 && index < size_used_:Error:Assert failed in genericvector.h, line 512", add the shapetable file to your language data files.

    mftraining -F font_properties -U unicharset -O lang.unicharset lang.fontname.exp0.tr lang.fontname.exp1.tr ...

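    The -U file is the unicharset generated earlier by unicharset_extractor, and lang.unicharset is the output unicharset that will be given to combine_tessdata. mftraining also outputs two data files: inttemp (the shape prototypes) and pffmtable (the number of expected features for each character).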

    cntraining outputs the normproto data file:

    cntraining lang.fontname.exp0.tr lang.fontname.exp1.tr ...
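
The three clustering commands above all take the same list of .tr files, so they can be generated from one place. The sketch below only prints the commands in the order they appear in this post; clustering_cmds and the sample file names are hypothetical, and font_properties and unicharset are assumed to be in the current directory.

```shell
#!/usr/bin/env bash
# Print the clustering commands for a set of .tr files.
# $1 = language code, remaining arguments = .tr files.
# Nothing is executed here; review the output before running it.
clustering_cmds() {
  local lang="$1"
  shift
  local tr_files="$*"
  printf 'shapeclustering -F font_properties -U unicharset %s\n' "$tr_files"
  printf 'mftraining -F font_properties -U unicharset -O %s.unicharset %s\n' "$lang" "$tr_files"
  printf 'cntraining %s\n' "$tr_files"
}

clustering_cmds eng eng.timesitalic.exp0.tr eng.timesitalic.exp1.tr
```

Keeping the file list in one variable avoids the common mistake of passing a different set of .tr files to mftraining and cntraining.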

     

    Dictionary data (optional)

    Name | Type | Description
    word-dawg | dawg | A dawg made from dictionary words from the language.
    freq-dawg | dawg | A dawg made from the most frequent words which would have gone into word-dawg.
    punc-dawg | dawg | A dawg made from punctuation patterns found around words. The "word" part is replaced by a single space.
    number-dawg | dawg | A dawg made from tokens which originally contained digits. Each digit is replaced by a space character.
    fixed-length-dawgs | dawg | Several dawgs of different fixed lengths, useful for languages like Chinese.
    bigram-dawg | dawg | A dawg of word bigrams where the words are separated by a space and each digit is replaced by a ?.
    unambig-dawg | dawg | TODO: Describe.
    user-words | text | A list of extra words to add to the dictionary. Usually left empty to be added by users if they require it; see tesseract(1).

    wordlist2dawg frequent_words_list lang.freq-dawg lang.unicharset
    wordlist2dawg words_list lang.word-dawg lang.unicharset
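
Before converting a word list it can be worth removing blank lines and duplicates. The cleanup step below is an optional assumption, not something wordlist2dawg is documented here to require, and both function names are hypothetical. The wordlist2dawg command is printed, not run.

```shell
#!/usr/bin/env bash
# Drop blank lines and duplicate entries from a word list (one word per line).
# LC_ALL=C makes the comparison bytewise, so the result does not depend on locale.
clean_wordlist() {
  grep -v '^[[:space:]]*$' "$1" | LC_ALL=C sort -u
}

# Print the wordlist2dawg command for one word list.
# $1 = word list file, $2 = dawg name (e.g. word-dawg), $3 = language code.
dawg_cmd() {
  printf 'wordlist2dawg %s %s.%s %s.unicharset\n' "$1" "$3" "$2" "$3"
}

dawg_cmd frequent_words_list freq-dawg eng
dawg_cmd words_list word-dawg eng
```

A typical use would be clean_wordlist raw_words > words_list, followed by the printed command once the cleaned list looks right.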

    References:

    WIKI

    https://code.google.com/p/tesseract-ocr/wiki/FAQ

    Introduction

    https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#font_properties_(new_in_3.01)

    WORDLIST2DAWG(1) Manual Page

    http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/wordlist2dawg.1.html

    COMBINE_TESSDATA(1) Manual Page

     http://tesseract-ocr.googlecode.com/svn-history/r800/trunk/doc/combine_tessdata.1.html

  • Original article: https://www.cnblogs.com/mjorcen/p/3818687.html