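pdf2txt.py
pdf2txt.py extracts text contents from a PDF file. It extracts all text that is rendered programmatically, that is, text represented as ASCII or Unicode strings. It cannot recognize text drawn as images, which would require optical character recognition. For each text portion it also extracts the corresponding position, font name, font size, and writing direction (horizontal or vertical). When access is restricted, you need to supply the password for the protected PDF document. You cannot extract any text from a PDF document that does not have extraction permission.
$ pdf2txt.py -o output.html samples/naacl06-shinyama.pdf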
(extract text as an HTML file whose filename is output.html)
$ pdf2txt.py -V -c euc-jp -o output.html samples/jo.pdf
(extract a Japanese HTML file in vertical writing, CMap is required)
$ pdf2txt.py -P mypassword -o output.txt secret.pdf
(extract text from an encrypted PDF file)
For details, see /docs/index.html.
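When the same conversions have to be run from a script, the command lines above can be assembled programmatically. The following is only a sketch: `build_pdf2txt_command` and its parameter names are hypothetical helpers invented for illustration, and the only flags used are the ones that appear in the examples above (`-o`, `-V`, `-c`, `-P`).

```python
import shlex


def build_pdf2txt_command(pdf_path, output_path, password=None, codec=None, vertical=False):
    """Build an argv list for pdf2txt.py (hypothetical helper, not part of PDFMiner)."""
    argv = ["pdf2txt.py"]
    if vertical:
        argv.append("-V")            # vertical writing, as in the jo.pdf example
    if codec:
        argv += ["-c", codec]        # output codec, e.g. euc-jp
    if password:
        argv += ["-P", password]     # password for an encrypted PDF
    argv += ["-o", output_path, pdf_path]
    return argv


if __name__ == "__main__":
    cmd = build_pdf2txt_command("samples/jo.pdf", "output.html", codec="euc-jp", vertical=True)
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, assuming pdf2txt.py is on the PATH. Building a list rather than a single string avoids shell quoting problems with passwords or file names that contain spaces.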
PDFMiner.six
PDFMiner.six is a fork of PDFMiner that uses six for Python 2+3 compatibility.
PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.
- Webpage: https://github.com/pdfminer/
- Download (PyPI): https://pypi.python.org/pypi/pdfminer.six/
import sys

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

# Force UTF-8 output so non-ASCII text prints correctly (Python 3.7+).
# Setting PYTHONIOENCODING from inside a running script does not change its own stdout.
sys.stdout.reconfigure(encoding='utf-8')

rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

with open('Pfizer1.pdf', 'rb') as document:
    for page in PDFPage.get_pages(document):
        interpreter.process_page(page)
        layout = device.get_result()
        for element in layout:
            if isinstance(element, LTTextBoxHorizontal):
                # First line of the text box; note that _objs is a private attribute.
                obj = element._objs[0]
                # bbox is (x0, y0, x1, y1), so width and height are differences.
                x0, y0, x1, y1 = obj.bbox
                print("x_cor: %.2f" % x0)
                print("y_cor: %.2f" % y0)
                print("length: %.2f" % (x1 - x0))
                print("height: %.2f" % (y1 - y0))
                text = obj.get_text().replace('\n', '')
                print("text: ", text)
                print("--------------------")
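The `bbox` that PDFMiner attaches to each layout object is a tuple `(x0, y0, x1, y1)` measured from the bottom-left corner of the page. For overlaying boxes on an HTML page or an image, which usually place the origin at the top-left, a conversion along these lines may help. This is a minimal sketch: `bbox_to_top_left` is a hypothetical helper, and `page_height` is assumed to be supplied by the caller (for example from the `height` of the page layout object).

```python
def bbox_to_top_left(bbox, page_height):
    """Convert a bottom-left-origin (x0, y0, x1, y1) box to (left, top, width, height).

    Hypothetical helper for illustration; not part of PDFMiner.
    """
    x0, y0, x1, y1 = bbox
    left = x0
    top = page_height - y1   # distance from the top edge of the page
    width = x1 - x0
    height = y1 - y0
    return (left, top, width, height)


if __name__ == "__main__":
    # A box 228 units wide and 20 high, near the top of a 792-unit-tall page.
    print(bbox_to_top_left((72.0, 700.0, 300.0, 720.0), 792.0))
    # prints (72.0, 72.0, 228.0, 20.0)
```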