  • Tessnet2: a .NET 2.0 Open Source OCR assembly using the Tesseract engine

    http://www.pixel-technology.com/freeware/tessnet2/

    Keywords: Open source, OCR, Tesseract, .NET, DOTNET, C#, VB.NET, C++/CLI

    Current version: 2.04.0, 02SEP09 (see Version History)

    The big picture
    Tesseract is an open source OCR engine written in C++. Tessnet2 is a .NET assembly that exposes very simple methods to do OCR.
    Tessnet2 is multithreaded. It uses the engine the same way Tesseract.exe does. Tessdll uses another method (no thresholding).

    License
    Tessnet2 is under the Apache 2 license (like Tesseract), meaning you can use it as you like, including in commercial products. You can read the full license info in the source files.

    Quick Tessnet2 usage

    1. Download the binary here and add a reference to the assembly Tessnet2.dll in your .NET project.

    2. Download the language data definition file here and put it in the tessdata directory. The tessdata directory and your exe must be in the same directory.

    3. Look at the Program.cs sample.

    Note: Tessnet2.dll needs the Visual C++ 2008 runtime. When deploying your application, be sure to install the C++ runtime (x86, x64).
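
    Step 2 can be checked at startup. The following is a minimal sketch and not part of Tessnet2: the TessdataLocator class and its method name are hypothetical. It only builds the path of a tessdata directory next to the executable and reports whether that directory exists, using standard .NET calls.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of Tessnet2): locates the tessdata
// directory that step 2 says must sit next to the executable.
static class TessdataLocator
{
    // Returns the expected tessdata path under the given base directory.
    public static string GetTessdataPath(string baseDirectory)
    {
        return Path.Combine(baseDirectory, "tessdata");
    }

    static void Main()
    {
        // For a normal desktop application, BaseDirectory is usually
        // the directory that contains the exe.
        string path = GetTessdataPath(AppDomain.CurrentDomain.BaseDirectory);
        Console.WriteLine(Directory.Exists(path)
            ? "Found tessdata at " + path
            : "Missing tessdata directory: " + path);
    }
}
```

    Failing early with a clear message like this is one way to separate a missing data directory from other initialization problems.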

    Tessnet2 usage

    using System;
    using System.Collections.Generic;
    using System.Drawing;

    Bitmap image = new Bitmap("eurotext.tif");
    tessnet2.Tesseract ocr = new tessnet2.Tesseract();
    ocr.SetVariable("tessedit_char_whitelist", "0123456789"); // If digit only
    ocr.Init(@"c:\temp", "fra", false); // To use correct tessdata
    List<tessnet2.Word> result = ocr.DoOCR(image, Rectangle.Empty);
    foreach (tessnet2.Word word in result)
        Console.WriteLine("{0} : {1}", word.Confidence, word.Text);

    Tessnet2 source code and recompiling

    1. Download the Tesseract source code here and extract it into a directory.

    2. Download the Tessnet2 source code here and extract it into the Tesseract source code root directory (it should create a dotnet subdirectory).

    3. Open the solution tessnet2.sln. It's a Visual Studio 2008 C++/CLI project.

    Memory leak

    The Tesseract C++ source code is full of memory leaks. Using the tessnet2 assembly repeatedly will eventually exhaust memory. This is not a tessnet2 leak, it is a Tesseract leak, and I spent two days in the Tesseract source code trying to improve this with no success. See what I think about this.

    Tessnet2 demo
    In the Tessnet2 source code you have two C# demo projects. TesseractOCR is a multithreaded WinForms demo with a progress bar. TesseractConsole is a console demo.


    The confidence score is shown in brackets. A value below 160 means the result is not bad.
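
    A minimal sketch of how the threshold above could be applied in code. OcrWord is a hypothetical stand-in for tessnet2.Word (only the Confidence and Text members seen in the usage sample are mirrored) so that the example runs without the assembly. The sample values are made up for illustration, and 160 is the cut-off quoted on this page.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for tessnet2.Word so the sketch runs on its own.
class OcrWord
{
    public double Confidence; // lower is better; 0 = perfect
    public string Text;

    public OcrWord(double confidence, string text)
    {
        Confidence = confidence;
        Text = text;
    }
}

static class ConfidenceFilter
{
    // Cut-off quoted on this page: below 160 is "not bad".
    public const double Threshold = 160;

    // Keeps only the words whose confidence is below the threshold.
    public static List<OcrWord> KeepReliable(IEnumerable<OcrWord> words)
    {
        List<OcrWord> kept = new List<OcrWord>();
        foreach (OcrWord word in words)
        {
            if (word.Confidence < Threshold)
                kept.Add(word);
        }
        return kept;
    }

    static void Main()
    {
        // Made-up sample values, not real OCR output.
        List<OcrWord> words = new List<OcrWord>();
        words.Add(new OcrWord(42, "quick"));
        words.Add(new OcrWord(201, "br0wn"));

        foreach (OcrWord word in KeepReliable(words))
            Console.WriteLine("{0} : {1}", word.Confidence, word.Text);
    }
}
```

    With the real assembly, the same loop could be run over the list returned by DoOCR instead of the stand-in objects.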

    Version History

    07JUN08: First release on Tesseract 2.03

    10JUN08: Version 2.03.1. Changed Confidence behavior: it is now calculated from every letter of the word and not just from the first letter. Its type changed from byte to double. 0 = perfect, 100 = reject

    13JUN08: Version 2.03.2

    After 3 days in the Tesseract code (urgh), here is Tessnet2 version 2.03.2.
    The corrections deal with the following problems:
    * Confidence was not very useful; the value was strange. This has been corrected by setting the variable tessedit_write_ratings=true. After many tests I found this mode is the best for confidence accuracy. Values range from 0 (perfect) to 255 (reject). When the value goes over 160, it really means the OCR was bad.
    * Calling DoOCR twice did not give the same result. It was, as expected, a problem with global variables. The problem is almost fixed; sometimes it still doesn't work, but right now I can't find what is not correctly reinitialized.
     

  • Original post: https://www.cnblogs.com/qqhfeng/p/3629282.html