PyPDF2 is a Python library used to perform operations such as splitting, extracting content, cropping, and merging PDF files.
PDF stands for portable document format, and it is one of the most widely used document formats.
PDF files use the .pdf extension. PyPDF2 can also assemble new PDF documents from the pages of existing ones.
It can also add custom data, viewing options, and passwords to PDF files.
Table of Contents
- Installing PyPDF2
- Extracting document details using PyPDF2
- Extracting Text from PDF files
- Rotating the pages of the PDF
- Merging PDF files
- Splitting PDF files
- Encrypting PDF files
- Adding watermark to PDF file
Installing PyPDF2
PyPDF2 can be installed using pip, the Python package installer. PyPDF2 requires Python 3.6+ to run.
If you are using an Anaconda environment instead of regular Python, the conda command shown further down can be used instead. To install with pip, run the following command in your command prompt.
pip install PyPDF2
Furthermore, if you plan to use PyPDF2 for encryption and decryption, some extra dependencies need to be installed.
pip install PyPDF2[crypto]
And for Anaconda users:
conda install -c conda-forge pypdf2
The PyPDF2 library can be used for several purposes. Some of its uses are described below.
Extracting document details using PyPDF2
PyPDF2 can be used to extract metadata and text from a PDF file. Different categories of information can be extracted, such as the author, creator, subject, title, and number of pages.
You need to have a sample PDF file on your machine to perform the data extraction process.
This method can be useful when you want to automate the process of information extraction. To extract information like author, creator, and title, we can run the following code.
# Importing library
# PdfFileReader is the legacy class name; PyPDF2 3.0 and later provide PdfReader instead
from PyPDF2 import PdfFileReader

# Specifying the path to the PDF document
pdf_path = r"c:\Users\tariq.aziz\OneDrive - University of Central Asia\Desktop\1.pdf"

with open(pdf_path, 'rb') as f:
    pdf = PdfFileReader(f)
    # Reading the document information and the page count
    information = pdf.getDocumentInfo()
    number_of_pages = pdf.getNumPages()

print(information)
print(number_of_pages)
The code prints the document information followed by the number of pages.
In the above method, PdfFileReader is imported from the PyPDF2 package.
PdfFileReader is a class with different methods for interacting with PDF files and extracting data from them.
For instance, in the above case, we call the .getDocumentInfo() method, which returns an instance of DocumentInformation.
This object holds important information like the author's name, creation date, creator, etc. We also call the .getNumPages() method, which returns the total number of pages in the document.
We can also use another method, based on the PdfReader class, to extract information about the given PDF document. The code for the second method is given below.
# Importing library
from PyPDF2 import PdfReader

# Specifying the path to the PDF file
info = PdfReader(r"c:\Users\tariq.aziz\OneDrive - University of Central Asia\Desktop\1.pdf")
data = info.metadata

# Printing the page count and selected metadata fields
print(len(info.pages))
print(data.author)
print(data.creator)
print(data.producer)
print(data.subject)
print(data.title)
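The second method above assumes that the document carries metadata. As a minimal sketch, the hypothetical helper below (summarize_metadata is not part of PyPDF2) collects the same fields into a dictionary and tolerates a reader whose metadata attribute is None. It only assumes an object that behaves like PdfReader, exposing .metadata and .pages.

```python
def summarize_metadata(reader):
    """Return selected metadata fields and the page count as a plain dict.

    `reader` is assumed to behave like PyPDF2.PdfReader: it exposes
    `.metadata` (which may be None) and `.pages`.
    """
    fields = ("author", "creator", "producer", "subject", "title")
    meta = reader.metadata
    summary = {}
    for name in fields:
        # Missing metadata or a missing field both end up as None
        summary[name] = getattr(meta, name, None) if meta is not None else None
    summary["pages"] = len(reader.pages)
    return summary


# Possible usage with PyPDF2 installed (the file name is a placeholder):
# from PyPDF2 import PdfReader
# print(summarize_metadata(PdfReader("1.pdf")))
```

Returning a dictionary rather than printing each field makes the result easier to reuse, for example when processing many files in a loop.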
PyPDF2 also has the .extractText() method, which can be used to extract text from the PDF, but it is not always effective.
In some cases you get the text, while in other cases you get an empty string.
Extracting Text from PDF files
PyPDF2 has very limited support for extracting text from PDF files. For this reason, there may be errors in the extracted text.
You may not get the text in the proper format, or there may be other issues caused by this limited support.
# Importing library
from PyPDF2 import PdfFileReader

# Specifying the path to the PDF file
PDF_path = r"c:\Users\tariq.aziz\OneDrive - University of Central Asia\Desktop\1.pdf"

text = ''
# Opening the file in binary mode; it is closed automatically afterwards
with open(PDF_path, 'rb') as PDF_object:
    PDF_reader = PdfFileReader(PDF_object)
    for i in range(PDF_reader.numPages):
        # Creating a page object
        Page_object = PDF_reader.getPage(i)
        # Extracting text from page
        text = text + Page_object.extractText()

print(text)
In the above method, firstly, the PdfFileReader is imported from PyPDF2. After importing the module, we locate the path for the PDF file. We then read the file and create the PDF object of the file.
After that, we create the PDF_reader object and pass PDF_object to it. And finally, we will extract the text content of each page and concatenate the text together.
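The same loop can be sketched with the newer snake_case names. This assumes a PyPDF2 release in which pages expose extract_text() and the reader exposes .pages, as PdfReader does. The helper name collect_text and its separator parameter are illustrative, not part of the library.

```python
def collect_text(reader, separator="\n"):
    """Concatenate the text of every page of a PdfReader-like object."""
    parts = []
    for page in reader.pages:
        # extract_text() may return an empty string, for example on
        # scanned or image-only pages; `or ""` also guards against None
        parts.append(page.extract_text() or "")
    return separator.join(parts)


# Possible usage with PyPDF2 installed (the file name is a placeholder):
# from PyPDF2 import PdfReader
# print(collect_text(PdfReader("1.pdf")))
```

Joining with a separator keeps the page boundaries visible in the result, which the plain concatenation above does not.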
Rotating the pages of the PDF
Sometimes you receive PDF files with pages in landscape mode instead of portrait mode, or even upside down.
This issue can be resolved with PyPDF2, which allows pages to be rotated by multiples of 90 degrees.
In this method, we obtain a new PDF file, which is a modified version of the original file.
import PyPDF2

# Opening the original PDF file
PDF_input = open(r'c:\Users\tariq.aziz\OneDrive - University of Central Asia\Desktop\1.pdf', 'rb')
pdf_reader = PyPDF2.PdfFileReader(PDF_input)
pdf_writer = PyPDF2.PdfFileWriter()

for page_number in range(pdf_reader.numPages):
    page = pdf_reader.getPage(page_number)
    # Rotating each page 90 degrees clockwise
    page.rotateClockwise(90)
    pdf_writer.addPage(page)

# Writing the rotated pages to a new PDF file
PDF_output = open(r'c:\Users\tariq.aziz\OneDrive - University of Central Asia\Desktop\2.pdf', 'wb')
pdf_writer.write(PDF_output)
PDF_output.close()
PDF_input.close()
The above code will result in a new and transformed version of the original PDF file.
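Because only multiples of 90 degrees are allowed, it can help to validate the angle before touching any page. The sketch below assumes a PyPDF2 release whose page objects provide rotate(angle) and whose writer provides add_page(). The helper names normalize_angle and rotate_all are hypothetical.

```python
def normalize_angle(angle):
    """Validate a rotation and reduce it to 0, 90, 180, or 270 degrees."""
    if angle % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    return angle % 360


def rotate_all(reader, writer, angle):
    """Rotate every page of `reader` clockwise and add it to `writer`."""
    angle = normalize_angle(angle)
    for page in reader.pages:
        if angle:
            # Skipped when the normalized angle is 0 (nothing to rotate)
            page.rotate(angle)
        writer.add_page(page)
    return writer


# Possible usage with PyPDF2 installed (file names are placeholders):
# from PyPDF2 import PdfReader, PdfWriter
# writer = rotate_all(PdfReader("1.pdf"), PdfWriter(), 90)
# with open("2.pdf", "wb") as fh:
#     writer.write(fh)
```

A negative angle such as -90 is reduced to 270, which gives the same visual result as a counterclockwise quarter turn.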
Merging PDF files
There are several situations in which you would want to combine two or more PDFs into one PDF file. This task can be achieved using the append method of the PdfFileMerger class.
# Importing library
from PyPDF2 import PdfFileReader, PdfFileMerger

# Specifying path of first PDF file
first_file = PdfFileReader(r"c:\Users\tariq.aziz\OneDrive - University of Central Asia\Desktop\3.pdf")
# Specifying path of second PDF file
second_file = PdfFileReader(r"c:\Users\tariq.aziz\OneDrive - University of Central Asia\Desktop\4.pdf")

# Appending both files in order
output = PdfFileMerger()
output.append(first_file)
output.append(second_file)

# Writing the merged result to a new file
with open(r"c:\Users\tariq.aziz\OneDrive - University of Central Asia\Desktop\merged.pdf", "wb") as output_stream:
    output.write(output_stream)
The above code adds multiple PDF files and writes them into a single file named merged.pdf using PdfFileMerger's append function.
Furthermore, if you want to specify where the pages should go, use the merge function, because the append function always adds new pages to the end of the PDF file.
In contrast, the merge function allows you to insert the pages at the desired position.
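The difference between append and merge can be sketched as follows. The helper insert_pdf is hypothetical. It assumes a merger object that behaves like PdfFileMerger (or PdfMerger in newer releases), where merge takes the zero-based page position first and the file second.

```python
def insert_pdf(merger, base_pdf, extra_pdf, position):
    """Append `base_pdf`, then insert `extra_pdf` before page `position`.

    `merger` is assumed to behave like PyPDF2's PdfFileMerger; both PDFs
    may be given as paths or as reader objects.
    """
    # append() always adds pages at the end of the document being built
    merger.append(base_pdf)
    # merge() inserts pages at the given zero-based position instead
    merger.merge(position, extra_pdf)
    return merger


# Possible usage with PyPDF2 installed (file names are placeholders):
# from PyPDF2 import PdfFileMerger
# merger = insert_pdf(PdfFileMerger(), "3.pdf", "4.pdf", 1)
# with open("merged.pdf", "wb") as fh:
#     merger.write(fh)
```

With position 1, the pages of the second file would land right after the first page of the first file rather than at the end.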
Splitting PDF files
PyPDF2 can also be used to split a given PDF, or to extract a particular page from the PDF file and save it as a separate PDF file.
The PdfFileReader class of the PyPDF2 library can be used to read and access a specific page, and PdfFileWriter can then write that page to a new file.
# Importing library
from PyPDF2 import PdfFileWriter, PdfFileReader

# Applying PdfFileReader and specifying the location of the PDF file
input_pdf = PdfFileReader(r"c:\Users\tariq.aziz\OneDrive - University of Central Asia\Desktop\1.pdf")

# Adding the first page (index 0) to a new writer
output = PdfFileWriter()
output.addPage(input_pdf.getPage(0))

# Specifying the location of the split PDF page
with open(r"c:\Users\tariq.aziz\OneDrive - University of Central Asia\Desktop\page.pdf", "wb") as output_stream:
    output.write(output_stream)
The above code can be used to extract a single PDF page out of a PDF file.
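To split a whole document rather than pull out one page, the same idea can be repeated per page. This is a sketch under assumptions: the helper split_pages and the output naming scheme are hypothetical, and the reader and writer are expected to behave like PdfReader and PdfWriter (.pages and add_page()).

```python
def split_pages(reader, make_writer, stem="page"):
    """Yield (file_name, writer) pairs, one single-page writer per page.

    `make_writer` is a callable returning a fresh writer object,
    for example the PdfWriter class itself.
    """
    for index, page in enumerate(reader.pages, start=1):
        writer = make_writer()
        writer.add_page(page)
        # File names are numbered from 1: page_1.pdf, page_2.pdf, ...
        yield f"{stem}_{index}.pdf", writer


# Possible usage with PyPDF2 installed (the file name is a placeholder):
# from PyPDF2 import PdfReader, PdfWriter
# for name, writer in split_pages(PdfReader("1.pdf"), PdfWriter):
#     with open(name, "wb") as fh:
#         writer.write(fh)
```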
Encrypting PDF files
Encrypting a PDF file means adding a password to it. The password is set when the encrypted copy of the PDF file is written.
Each time someone attempts to open the PDF file, they are prompted for the password. Anyone with the password can open, edit, and print the password-protected PDF file.
# Importing library
from PyPDF2 import PdfReader, PdfWriter

# Specifying the path to the targeted PDF
PDF_reader = PdfReader(r"c:\Users\tariq.aziz\OneDrive - University of Central Asia\Desktop\1.pdf")
PDF_writer = PdfWriter()

# Copying every page into the writer
for page in PDF_reader.pages:
    PDF_writer.add_page(page)

# Adding password to the new PDF file
PDF_writer.encrypt("encryption")

# Saving the new file and specifying the path to it
with open(r"c:\Users\tariq.aziz\OneDrive - University of Central Asia\Desktop\encrypted.pdf", "wb") as f:
    PDF_writer.write(f)
The above method sets a password on a copy of the given PDF file, thus enabling you to secure it.
In the above code, we specify the input and output paths as well as the password we want to set on the PDF.
Moreover, we loop over all the pages and add them to the writer so that the whole PDF document is encrypted.
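To check that the password works, the encrypted file can be opened again. The sketch below assumes a reader that behaves like PdfReader, with an is_encrypted attribute and a decrypt(password) method that returns a falsy value when the password is wrong. The helper name count_pages_protected is hypothetical.

```python
def count_pages_protected(reader, password):
    """Unlock `reader` if needed and return its number of pages."""
    if reader.is_encrypted:
        # decrypt() is expected to return a falsy value on a wrong password
        if not reader.decrypt(password):
            raise ValueError("the password did not unlock the PDF")
    return len(reader.pages)


# Possible usage with PyPDF2 installed (the file name is a placeholder):
# from PyPDF2 import PdfReader
# print(count_pages_protected(PdfReader("encrypted.pdf"), "encryption"))
```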
Adding watermark to PDF file
Watermarks are specific patterns, images, or logos that appear on every page of the document.
Watermarks help protect your intellectual property. The code below shows how we can add a watermark to a PDF file.
# Importing library
from PyPDF2 import PdfFileWriter, PdfFileReader

# Specifying the path to the original PDF file
PDF_file = r"c:\Users\tariq.aziz\OneDrive - University of Central Asia\Desktop\1.pdf"
# Specifying the path to the watermark PDF
watermark_pdf = r"c:\Users\tariq.aziz\OneDrive - University of Central Asia\Desktop\aaa.pdf"
# Specifying the path to the watermarked file to be created
watermarked_file = r"c:\Users\tariq.aziz\OneDrive - University of Central Asia\Desktop\333.pdf"

# Reading the first page of the watermark PDF
watermark_pdf = PdfFileReader(watermark_pdf)
marked_page = watermark_pdf.getPage(0)

PDF_reader = PdfFileReader(PDF_file)
PDF_writer = PdfFileWriter()

# Merging the watermark page onto every page of the original file
for page in range(PDF_reader.getNumPages()):
    pdfpage = PDF_reader.getPage(page)
    pdfpage.mergePage(marked_page)
    PDF_writer.addPage(pdfpage)

with open(watermarked_file, 'wb') as fh:
    PDF_writer.write(fh)