5.22.2019

pdf2txt

pdf2txt.py

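pdf2txt.py extracts text content from a PDF file. It extracts all text that is rendered programmatically, i.e. text represented as ASCII or Unicode strings. It cannot recognize text drawn as images, which would require optical character recognition. For each text portion it also extracts the corresponding location, font name, font size, and writing direction (horizontal or vertical). When access is restricted, you need to provide a password for the protected PDF document. You cannot extract any text from a PDF document that does not have extraction permission.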
$ pdf2txt.py -o output.html samples/naacl06-shinyama.pdf
(extract text as an HTML file whose filename is output.html)

$ pdf2txt.py -V -c euc-jp -o output.html samples/jo.pdf
(extract a Japanese HTML file in vertical writing, CMap is required)

$ pdf2txt.py -P mypassword -o output.txt secret.pdf
(extract text from an encrypted PDF file)
For details, see /docs/index.html
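
The three invocations above differ only in their flags, so the command line can be assembled from Python. The following is a minimal sketch, not part of PDFMiner: `build_pdf2txt_command` is a hypothetical helper, and it uses only the flags that appear in the examples above (`-o`, `-V`, `-c`, `-P`).

```python
import shlex


def build_pdf2txt_command(pdf_path, output_path, password=None, codec=None, vertical=False):
    """Return the argument list for a pdf2txt.py call (hypothetical helper)."""
    cmd = ["pdf2txt.py"]
    if vertical:
        cmd.append("-V")            # vertical writing, as in the jo.pdf example
    if codec:
        cmd += ["-c", codec]        # output encoding, e.g. euc-jp
    if password:
        cmd += ["-P", password]     # password for an encrypted PDF
    cmd += ["-o", output_path, pdf_path]
    return cmd


# Rebuild the second example and print it as a shell-safe command line.
cmd = build_pdf2txt_command("samples/jo.pdf", "output.html", codec="euc-jp", vertical=True)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Assuming pdf2txt.py is installed and on the PATH, the returned list could be passed to `subprocess.run(cmd, check=True)`. Building a list rather than a single string avoids quoting problems with passwords or file names that contain spaces.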

PDFMiner.six

PDFMiner.six is a fork of PDFMiner that uses six for Python 2 and 3 compatibility.
PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.


import sys

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

# Force UTF-8 output so non-ASCII text prints safely (requires Python 3.7+).
# Setting PYTHONIOENCODING inside the script has no effect on the interpreter
# that is already running, so only reconfigure() is used here.
sys.stdout.reconfigure(encoding='utf-8')

rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

# Open the PDF in a context manager so the file handle is always closed.
with open('Pfizer1.pdf', 'rb') as document:
    for page in PDFPage.get_pages(document):
        interpreter.process_page(page)
        layout = device.get_result()
        for element in layout:
            if isinstance(element, LTTextBoxHorizontal):
                # Take the first text line of the box by iterating the
                # container instead of reading the private _objs list.
                obj = next(iter(element))
                # bbox is (x0, y0, x1, y1): two corners, not a size,
                # so length and height are the corner differences.
                x0, y0, x1, y1 = obj.bbox
                print("x_cor: %.2f" % x0)
                print("y_cor: %.2f" % y0)
                print("length: %.2f" % (x1 - x0))
                print("height: %.2f" % (y1 - y0))
                text = obj.get_text().replace('\n', '')
                print("text: ", text)
                print("--------------------")
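
A note on the coordinates printed by the script: PDFMiner reports `bbox` as `(x0, y0, x1, y1)`, with the origin at the bottom-left corner of the page and y increasing upward. Image and screen tools usually expect a top-left origin instead. The sketch below shows one way to convert. `Rect` and `bbox_to_top_left_rect` are hypothetical names, not part of PDFMiner, and the page is assumed to start at (0, 0) with its height passed in by the caller (for example from `layout.height`).

```python
from typing import NamedTuple, Tuple


class Rect(NamedTuple):
    """A rectangle measured from the top-left corner of the page."""
    left: float
    top: float
    width: float
    height: float


def bbox_to_top_left_rect(bbox: Tuple[float, float, float, float], page_height: float) -> Rect:
    """Convert a bottom-left-origin (x0, y0, x1, y1) box to a top-left-origin rect."""
    x0, y0, x1, y1 = bbox
    return Rect(
        left=x0,
        top=page_height - y1,   # distance from the top edge of the page down to the box
        width=x1 - x0,
        height=y1 - y0,
    )


# Example values are made up: a 12 pt high line on a 792 pt high (US Letter) page.
rect = bbox_to_top_left_rect((72.0, 700.0, 300.0, 712.0), page_height=792.0)
print(rect)
```

Keeping the conversion in one small function means the extraction loop can stay in PDF coordinates and convert only when a box is handed to another tool.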