  • How to extract text from PDF (image) files with OCR

    Background: the example below uses SuiteScript 1.0 (SS1.0) because it comes from a NetSuite email plugin; the same approach works in SS2.0.

    1. Register for an API key at https://ocr.space/OCRAPI

    Note that the free plan has limitations.

    2. Save the email attachment (a PDF file) to the NetSuite File Cabinet, make it available without login, get its full URL, and URL-encode it.

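    // Mark the attachment as available without login and save it.
    var importFile = attachments[indexAtt];
    importFile.setIsOnline(true);
    var intFileId = nlapiSubmitFile(importFile);
    // Reload the saved file so its URL can be read.
    var objInvoiceFileRec = nlapiLoadFile(intFileId);
    var strInvFileUrl = "https://" + nlapiGetContext().getCompany() + ".app.netsuite.com" + objInvoiceFileRec.getURL();
    strInvFileUrl = encodeURIComponent(strInvFileUrl);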
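Why the encoding matters: a File Cabinet URL usually carries its own query string, and if it is appended unencoded, its `&` separators get read as extra parameters of the OCR request. The sketch below illustrates this with a made-up URL (the `id`, `c` and `h` values are invented) and a deliberately naive query-string splitter, `getQueryParamNames`, which is a hypothetical helper and not part of NetSuite or OCR.space.

```javascript
// Hypothetical File Cabinet URL; the id, c and h values are made up.
var strRawUrl = "https://123456.app.netsuite.com/core/media/media.nl?id=789&c=123456&h=abcDEF&_xt=.pdf";
var strEncodedUrl = encodeURIComponent(strRawUrl);

// Naive query-string split, only to show which parameter names a server would see.
function getQueryParamNames(strUrl) {
    var strQuery = strUrl.split("?").slice(1).join("?");
    var arrPairs = strQuery.split("&");
    var arrNames = [];
    for (var i = 0; i < arrPairs.length; i++) {
        arrNames.push(arrPairs[i].split("=")[0]);
    }
    return arrNames;
}

var strBase = "https://api.ocr.space/parse/imageurl?apikey=abcAPIKEYabc&url=";

// Unencoded: c, h and _xt leak out as separate parameters and the url value is cut short.
var arrUnencoded = getQueryParamNames(strBase + strRawUrl);

// Encoded: only apikey and url remain, and url holds the whole file address.
var arrEncoded = getQueryParamNames(strBase + strEncodedUrl);
```

Calling `decodeURIComponent(strEncodedUrl)` gives back the original address, which is what the receiving side does with the `url` value.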

     

    3. Send a GET request to https://api.ocr.space/parse/imageurl?apikey=abcAPIKEYabc&filetype=PDF&isTable=true&url= with the encoded file URL from step 2 appended as the value of url.

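    var strReqUrl = "https://api.ocr.space/parse/imageurl?apikey=abcAPIKEYabc&filetype=PDF&isTable=true&url=" + strInvFileUrl;
    var response = nlapiRequestURL(strReqUrl, null, null); // no POST data (so it is a GET), no custom headers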
    The API accepts a variety of parameters. In my case the invoice is formatted as a table, so I send isTable=true to tell the API that; this makes it easier to locate the expected cells and values in the result.
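When more than a couple of parameters are involved, assembling the query string by hand gets error-prone. Here is a minimal sketch of a builder, assuming the same endpoint and parameter names as above; `buildOcrRequestUrl` is a hypothetical helper name, and `https://example.com/invoice.pdf` is a placeholder address. It expects the unencoded file URL and encodes it itself, so do not pass in a value that was already run through encodeURIComponent.

```javascript
// Builds an OCR.space GET URL from an options object (hypothetical helper).
function buildOcrRequestUrl(strApiKey, strFileUrl, objOptions) {
    var arrParts = ["apikey=" + encodeURIComponent(strApiKey)];
    for (var strName in objOptions) {
        if (Object.prototype.hasOwnProperty.call(objOptions, strName)) {
            arrParts.push(
                encodeURIComponent(strName) + "=" + encodeURIComponent(String(objOptions[strName]))
            );
        }
    }
    // The file URL goes last and is encoded exactly once, here.
    arrParts.push("url=" + encodeURIComponent(strFileUrl));
    return "https://api.ocr.space/parse/imageurl?" + arrParts.join("&");
}

// Same parameters as in step 3, with a placeholder file address.
var strBuiltUrl = buildOcrRequestUrl("abcAPIKEYabc", "https://example.com/invoice.pdf", {
    filetype: "PDF",
    isTable: true
});
```

In a SuiteScript 1.0 script the result would be passed to nlapiRequestURL in place of the hand-built string.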


    4. Read and parse the response to get the text found in the PDF or image.

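    var objOcrRes = JSON.parse(response.getBody()); // the API returns JSON
    var arrParsedLines = (objOcrRes['ParsedResults'] && objOcrRes['ParsedResults'][0]) ? objOcrRes['ParsedResults'][0]['TextOverlay']['Lines'] : null;
    // parseDataFromInvPdf is my own helper (not shown here) that picks the invoice fields out of the lines.
    var objVndBillData = parseDataFromInvPdf(arrParsedLines);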
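Since parseDataFromInvPdf is not shown, here is a rough sketch of what such a helper might do. It is not the original implementation. It assumes each entry of Lines has a Words array whose items carry a WordText string, which is my understanding of the OCR.space overlay format and should be checked against a real response. The function names, the sample data and the labels are all made up. If Lines comes back empty, it may also be worth checking whether the request needs isOverlayRequired=true; that too is an assumption about the API, not something tested here.

```javascript
// Joins the words of one OCR line into a single string.
function getLineText(objLine) {
    var arrRaw = (objLine && objLine.Words) || [];
    var arrWords = [];
    for (var i = 0; i < arrRaw.length; i++) {
        arrWords.push(arrRaw[i].WordText);
    }
    return arrWords.join(" ");
}

// Returns the text that follows strLabel on the first line containing it, or null.
function findValueAfterLabel(arrLines, strLabel) {
    for (var i = 0; arrLines && i < arrLines.length; i++) {
        var strText = getLineText(arrLines[i]);
        var intPos = strText.indexOf(strLabel);
        if (intPos !== -1) {
            // Drop the label itself plus any colon or whitespace right after it.
            return strText.substring(intPos + strLabel.length).replace(/^[\s:]+/, "");
        }
    }
    return null;
}

// Made-up sample in the assumed shape: Lines -> Words -> WordText.
var arrSampleLines = [
    { Words: [{ WordText: "Invoice" }, { WordText: "No:" }, { WordText: "INV-1001" }] },
    { Words: [{ WordText: "Total" }, { WordText: "125.50" }] }
];

var strInvoiceNo = findValueAfterLabel(arrSampleLines, "Invoice No"); // "INV-1001"
var strTotal = findValueAfterLabel(arrSampleLines, "Total");          // "125.50"
var strMissing = findValueAfterLabel(null, "Total");                  // null, no lines to search
```

Matching on labels like this is fragile for real invoices; with table layouts, the word coordinates in the overlay data are probably a better basis for locating cells.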

  • Original article: https://www.cnblogs.com/backuper/p/How_to_extract_text_from_PDF_or_Image_files.html