Extract text from PDF using Python
In this post, you will learn how to extract text from PDF using the Python programming language.
PDF stands for Portable Document Format. A PDF is a compact document that is usually shared as read-only and can be protected so that only users with the right credentials can modify it. It is widely used for sharing documents in a manner independent of application software, hardware, and operating systems.
Python PDF Packages
Several Python packages can extract text from PDF files, including PyPDF2, pdfminer, and PyMuPDF. Each package has its own advantages. Here, we will use the PyPDF2 package.
Python PyPDF2 package
PyPDF2 is a pure Python library, mostly used for extracting document information. It can split and merge documents page by page, and it can also merge multiple pages into a single page. It can encrypt and decrypt PDF files as well. This makes it a powerful tool for websites that manage or manipulate PDFs.
Python install PyPDF2
First, open a terminal window and run the following command to install PyPDF2.
pip install PyPDF2
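Because class and method names in PyPDF2 changed between releases, it can help to confirm which version pip installed. The following is a minimal sketch using the standard library's importlib.metadata; the helper name installed_version is hypothetical and not part of PyPDF2.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if PyPDF2 is installed, otherwise None.
print("PyPDF2 version:", installed_version("PyPDF2"))
```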
Step by step process for extracting text from PDF
The steps below explain the procedure for extracting text from a PDF.
Importing PyPDF2
First, we need to import the PyPDF2 module.
import PyPDF2
Creating a PDF object
The following code creates a PDF file object.
pdfFileObj = open(r"C:\data.pdf", 'rb')
Python provides the built-in open() function to open a file. It accepts two main parameters: the file name and the mode. The code above opens the file for reading. The first argument is the file name with its path, and the second is the 'rb' mode (r for read, b for binary), because a PDF is a binary file.
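To see what binary mode gives you, the sketch below reads the first few bytes of a file and checks for the %PDF- marker that PDF files normally start with. The helper name looks_like_pdf and the throwaway sample file are assumptions made for illustration, and a header check like this is only a quick sanity test, not full validation.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file begins with the '%PDF-' marker."""
    # 'rb' returns raw bytes, so the comparison uses a bytes literal.
    with open(path, "rb") as file_obj:
        return file_obj.read(5) == b"%PDF-"


# Demo with a throwaway file so the snippet runs without a real PDF.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n% sample bytes, not a complete document\n")
    sample_path = tmp.name

print(looks_like_pdf(sample_path))
os.remove(sample_path)
```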
Creating a PDF reader object
Here, we create an object of the PdfReader class and pass it the PDF file object. Older PyPDF2 releases named this class PdfFileReader, which was removed in PyPDF2 3.0.
pdfReader = PyPDF2.PdfReader(pdfFileObj)
Creating a page object
The pages attribute of the reader object gives access to the individual pages. Page indexes start at 0, so index 0 is the first page. Older PyPDF2 releases used the getPage() method for this, which was removed in PyPDF2 3.0.
pageObj = pdfReader.pages[0]
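Since page indexes start at 0, the last valid index is one less than the page count. The sketch below wraps that conversion in a hypothetical helper, page_by_number, which takes a human-friendly 1-based page number. It assumes a reader object that exposes a pages sequence, as PdfReader does in recent PyPDF2 releases; a simple stand-in object is used here so the snippet runs without a PDF.

```python
from types import SimpleNamespace


def page_by_number(reader, page_number):
    """Return the page for a 1-based page number, with a bounds check."""
    total = len(reader.pages)
    if not 1 <= page_number <= total:
        raise ValueError(f"page_number must be between 1 and {total}")
    return reader.pages[page_number - 1]  # convert to a 0-based index


# Stand-in for a reader; a real PdfReader would hold page objects instead.
fake_reader = SimpleNamespace(pages=["first page", "second page"])
print(page_by_number(fake_reader, 1))  # first page
```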
Extracting text from page
The extract_text() method of the page object extracts the text from that page. Older PyPDF2 releases called it extractText(), which was removed in PyPDF2 3.0.
print(pageObj.extract_text())
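To pull text from every page rather than only the first, you can loop over the pages and join the results. This is a sketch that assumes a reader with a pages sequence whose items offer extract_text(), which is the PyPDF2 3.x naming. The names extract_all_text and FakePage are hypothetical and are used so the example runs on its own.

```python
from types import SimpleNamespace


def extract_all_text(reader):
    """Join the text of every page, separated by blank lines."""
    texts = []
    for page in reader.pages:
        # The result may be empty, for example on scanned, image-only pages.
        texts.append(page.extract_text() or "")
    return "\n\n".join(texts)


class FakePage:
    """Stand-in for a PDF page object (illustration only)."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


fake_reader = SimpleNamespace(pages=[FakePage("Page one"), FakePage("Page two")])
print(extract_all_text(fake_reader))
```

With a real file, you would pass the reader object built from the opened PDF instead of the stand-in.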
Closing the PDF file object
Finally, we should close the file object.
pdfFileObj.close()
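An alternative to calling close() yourself is a with block, which closes the file automatically, even if an error occurs while reading. The sketch below assumes the PyPDF2 3.x names (PdfReader, pages, extract_text). The function first_page_text and its reader_factory parameter are hypothetical, added so the function can be exercised with a stand-in reader.

```python
def first_page_text(path, reader_factory=None):
    """Return the text of the first page; the file is closed on exit."""
    if reader_factory is None:
        # Imported lazily so the function can also be used with a stand-in reader.
        from PyPDF2 import PdfReader
        reader_factory = PdfReader

    with open(path, "rb") as pdf_file:
        reader = reader_factory(pdf_file)
        # Extract inside the block: the reader needs the file to be open.
        return reader.pages[0].extract_text()
```

With PyPDF2 installed, a call such as first_page_text(r"C:\data.pdf") would return the first page's text without any explicit close() call.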
Complete code to extract text from PDF
Let's merge all the above code and execute it:
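import PyPDF2
# Open the PDF in binary read mode
pdfFileObj = open(r"c:\etp.pdf", 'rb')
# Names below follow PyPDF2 3.x; older releases used PdfFileReader, numPages, getPage() and extractText()
pdfReader = PyPDF2.PdfReader(pdfFileObj)
print("Number of Pages :", len(pdfReader.pages))
# Read the first page (index 0) and print its text
pageObj = pdfReader.pages[0]
print(pageObj.extract_text())
pdfFileObj.close()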
We get the following output:
Number of Pages : 1
Welcome to etutorialspoint
https://www.etutorialspoint.com/
Our mission is to provide good educational resources to technical
students, learner, web developers. In this platform, we have share our knowledge
and experience through development resources, tutorials and multiple interview
questions and exercises.