Working with PDF files in Python

Python is a very flexible language because it offers a huge collection of libraries for many different needs. Most of us work with Portable Document Format (PDF) files, and Python provides several ways to handle them. In this article we use a Python library called PyPDF2 to work with PDF files.

PyPDF2 is a pure-Python PDF library that can split, merge, crop, and transform the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata from PDFs as well as merge entire files together.

Because PyPDF2 supports so many operations on PDF files, it works like a Swiss Army knife.

Getting started

PyPDF2 is a third-party package rather than part of the Python standard library, so we need to install it first. Fortunately this is easy to do with pip. Just run the command below on your command line:

C:\Users\rajesh>pip install pypdf2
Collecting pypdf2
Downloading https://files.pythonhosted.org/packages/b4/01/68fcc0d43daf4c6bdbc6b33cc3f77bda531c86b174cac56ef0ffdb96faab/PyPDF2-1.26.0.tar.gz (77kB)
100% |████████████████████████████████| 81kB 83kB/s
Building wheels for collected packages: pypdf2
Building wheel for pypdf2 (setup.py) ... done
Stored in directory: C:\Users\rajesh\AppData\Local\pip\Cache\wheels\53\84\19\35bc977c8bf5f0c23a8a011aa958acd4da4bbd7a229315c1b7
Successfully built pypdf2
Installing collected packages: pypdf2
Successfully installed pypdf2-1.26.0

To verify the installation, import PyPDF2 from the Python shell (the module name is case-sensitive):

>>> import PyPDF2
>>>
The import returns without an error, so the installation works.
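The same check can be scripted instead of typed into the shell. The sketch below uses only the standard library (importlib.metadata, available from Python 3.8) to look up the installed version of a distribution. The helper name installed_version is hypothetical and is not part of PyPDF2.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing.

    `installed_version` is a hypothetical helper written for this example.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("PyPDF2")
    if version is None:
        print("PyPDF2 is not installed; run: pip install pypdf2")
    else:
        print("PyPDF2 version:", version)
```

With the installation shown above, this should report version 1.26.0.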

Extracting metadata

We can extract some useful information from any PDF. For example, we can read the document's author, title, and subject, and the number of pages in the file.

Below is a Python program that extracts this information from a PDF file using the PyPDF2 package.

from PyPDF2 import PdfFileReader

def extract_pdfMeta(path):
   # Open the file in binary mode; PdfFileReader reads from the file object
   with open(path, 'rb') as f:
      pdf = PdfFileReader(f)
      # Document information dictionary (author, creator, title, ...)
      info = pdf.getDocumentInfo()
      number_of_pages = pdf.getNumPages()
   # Fields that are missing from the PDF are reported as None
   print("Author: \t", info.author)
   print()
   print("Creator: \t", info.creator)
   print()
   print("Producer: \t",info.producer)
   print()
   print("Subject: \t", info.subject)
   print()
   print("title: \t",info.title)
   print()
   print("Number of Pages in pdf: \t",number_of_pages)

if __name__ == '__main__':
   path = 'DeepLearning.pdf'
   extract_pdfMeta(path)

Output

Author: Nikhil Buduma,Nicholas Locascio

Creator: AH CSS Formatter V6.2 MR4 for Linux64 : 6.2.6.18551 (2014/09/24 15:00JST)

Producer: Antenna House PDF Output Library 6.2.609 (Linux64)

Subject: None

title: Fundamentals of Deep Learning

Number of Pages in pdf: 298

So, without even opening the PDF in a viewer, we can get some useful information from it.
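If the metadata is needed by other code rather than just printed, it can be collected into a dictionary. The following is a minimal sketch, assuming the info object exposes the author, creator, producer, subject, and title attributes used in the program above. The helper name metadata_to_dict is hypothetical, and a SimpleNamespace stands in for the real object so the example runs without a PDF file.

```python
from types import SimpleNamespace

# Attribute names taken from the program above
FIELDS = ("author", "creator", "producer", "subject", "title")


def metadata_to_dict(info, number_of_pages):
    """Collect metadata attributes into a plain dict (hypothetical helper).

    Attributes that are absent, or an `info` that is None, yield None values.
    """
    summary = {field: getattr(info, field, None) for field in FIELDS}
    summary["pages"] = number_of_pages
    return summary


# Stand-in for the object returned by pdf.getDocumentInfo(); the values
# are copied from the output shown above.
sample_info = SimpleNamespace(
    author="Nikhil Buduma,Nicholas Locascio",
    title="Fundamentals of Deep Learning",
    subject=None,
)

summary = metadata_to_dict(sample_info, 298)
for key, value in summary.items():
    print("{}: {}".format(key, value))
```

In a real run, the value passed in would be the result of pdf.getDocumentInfo(), read while the file is still open.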

Extracting text from a PDF

We can also extract text from PDF files. PyPDF2 1.26.0 focuses on text here; it has no dedicated high-level method for extracting images.

Let's try extracting the text of a single page from the PDF file used above. The example calls getPage(50); page indexes are zero-based, so this is the 51st page of the file.

# Import PdfFileReader from PyPDF2
from PyPDF2 import PdfFileReader

def extract_pdfText(path):
   with open(path, 'rb') as f:
      pdf = PdfFileReader(f)
      # Get the page at index 50 (indexes are zero-based, so this is the 51st page)
      page = pdf.getPage(50)
      print(page)
      print('Page type: {}'.format(str(type(page))))
      # Extract the text of that page
      text = page.extractText()
      print(text)

if __name__ == '__main__':
   path = 'DeepLearning.pdf'
   extract_pdfText(path)

Output

{'/Annots': IndirectObject(1421, 0),
'/Contents': IndirectObject(179, 0),
'/CropBox': [0, 0, 595.3, 841.9],
'/Group': {'/CS': '/DeviceRGB', '/S': '/Transparency', '/Type': '/Group'},
'/MediaBox': [0, 0, 504, 661.5],
'/Parent': IndirectObject(4863, 0),
'/Resources': IndirectObject(1423, 0),
'/Rotate': 0,
'/Type': '/Page'
}

Page type: <class 'PyPDF2.pdf.PageObject'>
time. In inverted dropout, any neuron whose activation hasn†t been silenced has its
output divided by p before the value is propagated to the next layer. With this
fix, Eoutput=p⁄xp+1ƒ
p⁄0=
x, and we can avoid arbitrarily scaling neuronal
output at test time.

SummaryIn this chapter, we†ve learned all of the basics involved in training feed-forward neural
networks. We†ve talked about gradient descent, the backpropagation algorithm, as
well as various methods we can use to prevent overfitting. In the next chapter, we†ll
put these lessons into practice when we use the TensorFlow library to efficiently
implement our first neural networks. Then in
Chapter 4

, we†ll return to the problem
of optimizing objective functions for training neural networks and design algorithmsto significantly improve performance. These improvements will enable us to process
much more data, which means we†ll be able to build more comprehensive models.
Summary | 37

Although we do get text from the page, it is not clean: apostrophes come out as other characters, some words run together, and the formula is scrambled. Unfortunately, PyPDF2 has very limited support for extracting text from PDFs.
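Some of this damage can be reduced with a little post-processing. The sketch below is one possible cleanup, not something PyPDF2 provides: it replaces mis-decoded characters and rejoins lines that were wrapped inside a paragraph. The helper name clean_extracted_text is hypothetical, and the replacement table is an assumption based only on the output above, where "†" appears in place of an apostrophe. Other PDFs will need other mappings, and scrambled formulas cannot be repaired this way.

```python
import re

# Assumption drawn from the output above: "†" (U+2020) stands for an apostrophe.
REPLACEMENTS = {"\u2020": "'"}


def clean_extracted_text(text, replacements=REPLACEMENTS):
    """Tidy text returned by page.extractText() (hypothetical helper)."""
    for bad, good in replacements.items():
        text = text.replace(bad, good)
    # Blank lines separate paragraphs; within a paragraph, collapse line
    # breaks and repeated spaces into single spaces.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [" ".join(paragraph.split()) for paragraph in paragraphs]
    return "\n\n".join(paragraph for paragraph in cleaned if paragraph)


raw = (
    "time. In inverted dropout, any neuron whose activation hasn\u2020t been silenced has its\n"
    "output divided by p before the value is propagated to the next layer."
)
print(clean_extracted_text(raw))
```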

Rotating a specific page of a PDF file

>>> import PyPDF2
>>> deeplearningFile = open('DeepLearning.pdf', 'rb')
>>> pdfReader = PyPDF2.PdfFileReader(deeplearningFile)
>>> page = pdfReader.getPage(0)
>>> page.rotateClockwise(90)
{
'/Contents': [IndirectObject(4870, 0), IndirectObject(4871, 0), IndirectObject(4872, 0), IndirectObject(4873, 0), IndirectObject(4874, 0), IndirectObject(4875, 0), IndirectObject(4876, 0), IndirectObject(4877, 0)],
'/CropBox': [0, 0, 595.3, 841.9],
'/MediaBox': [0, 0, 504, 661.5], '/Parent': IndirectObject(4862, 0), '/Resources': IndirectObject(4889, 0),
'/Rotate': 90,
'/Type': '/Page'
}
>>> pdfWriter = PyPDF2.PdfFileWriter()
>>> pdfWriter.addPage(page)
>>> resultPdfFile = open('rotatedPage.pdf', 'wb')
>>> pdfWriter.write(resultPdfFile)
>>> resultPdfFile.close()
>>> deeplearningFile.close()

Output

The rotated first page is saved to rotatedPage.pdf.
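The interactive session above can also be written as a small reusable function. This is a sketch under stated assumptions: it uses the PyPDF2 1.26.0 names from the session (PdfFileReader, PdfFileWriter, getPage, rotateClockwise, addPage), which later releases of the library renamed. The function name rotate_page and its parameters are hypothetical. Using with blocks closes both files even if an error occurs.

```python
def rotate_page(src_path, dst_path, page_index=0, angle=90):
    """Save page `page_index` of `src_path`, rotated clockwise, to `dst_path`.

    Hypothetical helper; assumes the PyPDF2 1.26.0 API used in the session above.
    """
    # A page's /Rotate entry must be a multiple of 90 degrees
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90, got {}".format(angle))

    # Imported here so the argument check above runs even where
    # PyPDF2 is not installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    with open(src_path, "rb") as src:
        reader = PdfFileReader(src)
        page = reader.getPage(page_index)  # zero-based index
        page.rotateClockwise(angle)

        writer = PdfFileWriter()
        writer.addPage(page)
        # The source file must still be open while the output is written
        with open(dst_path, "wb") as dst:
            writer.write(dst)
```

Calling rotate_page('DeepLearning.pdf', 'rotatedPage.pdf') reproduces the session above. Passing an angle such as 45 raises ValueError before any file is touched.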