How to extract text from PDF (Image) files, OCR


Background: the example below uses SuiteScript 1.0 (SS1.0) because it came from a NetSuite email plugin; the same approach works in SS2.0.

1. Register for an API key through https://ocr.space/OCRAPI

Note that the Free Plan has limitations.

2. Save the email attachment (a PDF file) to the NetSuite File Cabinet, make it available without login, get its full URL, and encode it.

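var importFile = attachments[indexAtt];
importFile.setIsOnline(true); // mark the file as available without login
var intFileId = nlapiSubmitFile(importFile);
// The original snippet does not show where objInvoiceFileRec comes from;
// loading the file that was just saved is assumed here.
var objInvoiceFileRec = nlapiLoadFile(intFileId);
var strInvFileUrl = "https://" + nlapiGetContext().getCompany() + ".app.netsuite.com" + objInvoiceFileRec.getURL();
strInvFileUrl = encodeURIComponent(strInvFileUrl);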
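The URL-building part of this step can be sketched as a plain JavaScript function, which makes it easy to test outside NetSuite. The helper name `buildEncodedFileUrl` is hypothetical, and the rule that the host name is the account ID in lowercase with "_" replaced by "-" is an assumption about account-specific domains (it matters for sandbox-style IDs); verify it against your own account.

```javascript
// Hypothetical helper (not part of SuiteScript): builds the absolute,
// URL-encoded address of a File Cabinet file.
//   accountId   - the kind of value nlapiGetContext().getCompany() returns
//   relativeUrl - the kind of value the file object's getURL() returns
function buildEncodedFileUrl(accountId, relativeUrl) {
    // Assumption: the account-specific host is the account ID in lowercase,
    // with "_" replaced by "-" (for example a sandbox-style ID).
    var host = String(accountId).toLowerCase().replace(/_/g, "-") + ".app.netsuite.com";
    var absoluteUrl = "https://" + host + relativeUrl;
    // Encode the whole URL so it can travel as one query-string value.
    return encodeURIComponent(absoluteUrl);
}

// Illustrative values only; the account ID and file URL are made up.
var strEncodedUrl = buildEncodedFileUrl("1234567_SB1", "/core/media/media.nl?id=1&h=abc");
```

Encoding the full address matters because the file URL contains its own "?" and "&" characters, which would otherwise be read as parameters of the OCR request.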

3. Send a request to https://api.ocr.space/parse/imageurl?apikey=abcAPIKEYabc&filetype=PDF&isTable=true&url= with the encoded file URL appended after url=

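// strReqUrl is the request URL above plus the encoded file URL;
// a is the headers argument of nlapiRequestURL (its value is not shown in the original).
var response = nlapiRequestURL(strReqUrl, null, a);
This API accepts a variety of parameters. In my case the invoice is formatted as a table, so I send isTable=true to say so; that helps me locate the expected cells and values.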
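As a minimal sketch, the request URL from step 3 could be assembled as follows. The function name `buildOcrRequestUrl` is hypothetical; the endpoint and the parameters (apikey, filetype, isTable, url) are the ones used in this post, and the API key shown is the same placeholder.

```javascript
// Hypothetical helper: assembles the OCR request URL used in step 3.
//   apiKey         - the key registered in step 1
//   encodedFileUrl - the file URL, already encoded in step 2
//   isTable        - true when the document is laid out as a table
function buildOcrRequestUrl(apiKey, encodedFileUrl, isTable) {
    var params = [
        "apikey=" + encodeURIComponent(apiKey),
        "filetype=PDF",
        "isTable=" + (isTable ? "true" : "false"),
        "url=" + encodedFileUrl // do not encode a second time
    ];
    return "https://api.ocr.space/parse/imageurl?" + params.join("&");
}

// Placeholder key and an illustrative, already-encoded file URL.
var strReqUrl = buildOcrRequestUrl("abcAPIKEYabc", "https%3A%2F%2Fexample.com%2Fa.pdf", true);
```

In the SuiteScript itself, the resulting string is what gets passed as the first argument of nlapiRequestURL.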

4. Get and parse the response; it contains the text found in the PDF or image.

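// Parse the JSON body returned by the OCR API
var objOcrRes = JSON.parse(response.getBody());
var arrParsedLines = (objOcrRes['ParsedResults'] && objOcrRes['ParsedResults'][0] && objOcrRes['ParsedResults'][0]['TextOverlay']) ? objOcrRes['ParsedResults'][0]['TextOverlay']['Lines'] : null;
var objVndBillData = parseDataFromInvPdf(arrParsedLines); // my own helper that picks out the vendor bill fields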
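The body of parseDataFromInvPdf is not shown in this post. As a rough sketch of a first step such a parser might take, the function below turns the parsed lines into plain strings. It assumes each entry in Lines carries a Words array whose items have a WordText field; the name `extractLineTexts` and the sample response are made up for illustration. As far as I can tell from the OCR.space documentation, Lines is only filled in when the request also sets isOverlayRequired=true, so the sketch returns an empty array when the overlay is missing.

```javascript
// Hypothetical helper: returns the text of each OCR line as a string.
// Assumes the shape ParsedResults[0].TextOverlay.Lines[i].Words[j].WordText.
function extractLineTexts(objOcrRes) {
    var results = objOcrRes && objOcrRes.ParsedResults;
    var first = results && results[0];
    var overlay = first && first.TextOverlay;
    var lines = (overlay && overlay.Lines) || [];
    var texts = [];
    for (var i = 0; i < lines.length; i++) {
        var words = lines[i].Words || [];
        var parts = [];
        for (var j = 0; j < words.length; j++) {
            parts.push(words[j].WordText);
        }
        // Join the words of one line with single spaces.
        texts.push(parts.join(" "));
    }
    return texts;
}

// Made-up sample response, trimmed to the fields used above.
var sampleResponse = {
    ParsedResults: [{
        TextOverlay: {
            Lines: [
                { Words: [{ WordText: "Invoice" }, { WordText: "1001" }] },
                { Words: [{ WordText: "Total" }, { WordText: "250.00" }] }
            ]
        }
    }]
};
var arrLineTexts = extractLineTexts(sampleResponse);
```

From here, locating a value is a matter of searching the strings for a known label (such as the invoice number caption) and reading the text that follows it.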


Original article: https://www.cnblogs.com/backuper/p/How_to_extract_text_from_PDF_or_Image_files.html
