Python PDF Searcher - Search through your PDF ebook library using Python

For issue 144 of The MagPi magazine, I created a Python program that searches through your PDF library. It's a great time saver! I've used it to search my magazine, ebook and even bank statement PDFs.

What is Python PDF Searcher?

[Photo: a beautiful multi-floor library, viewed from inside]

Like me, you probably have a library of PDF ebooks and magazines downloaded over the years. How can you find old tutorials and reviews?

This Python program does the hard work for you. You tell it which folder you want to search in, and it discovers all your PDFs and looks inside them to find a text match for your search term.
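
As a rough illustration of the discovery step, here is a minimal sketch that lists every PDF in a folder and its subfolders. It uses pathlib's rglob rather than the recursive function in the program further down, and the function name find_pdfs is hypothetical, chosen just for this example.

```python
from pathlib import Path

def find_pdfs(folder):
    """Return a sorted list of every PDF below folder, including subfolders."""
    return sorted(
        path for path in Path(folder).rglob("*")
        # Compare the extension in lower case so .PDF files are found too
        if path.is_file() and path.suffix.lower() == ".pdf"
    )

# Example use, with a folder name you would change to your own:
# for pdf_path in find_pdfs("MagPi"):
#     print(pdf_path)
```

Each item returned is a Path object, which can be passed straight to open().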

The results tell you which document and page your search term was found on. They also include two short snippets: the first 200 characters from the top of the page, and up to 450 characters either side of the last place your search term appears on that page. That gives you a good idea of the context so you can decide which matches to follow up.
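
The snippet logic can be sketched on its own, without any PDF handling. This is a minimal sketch, not the program itself: the function name make_snippets and its parameters are hypothetical, and the default lengths of 200 and 450 characters are taken from the listing further down.

```python
def make_snippets(page_text, search_term, intro_length=200, context=450):
    """Return (intro, context_text) for the last match, or None if no match."""
    # rfind gives the position of the LAST occurrence, or -1 if there is none
    position = page_text.lower().rfind(search_term.lower())
    if position == -1:
        return None
    intro = page_text[:intro_length]
    # Clamp the slice so it never runs off either end of the page text
    start = max(0, position - context)
    end = min(position + context, len(page_text))
    return intro, page_text[start:end]

page = "Build a robot. " * 50 + "Raspberry Pi Pico tutorial. " + "More text. " * 50
intro, context_text = make_snippets(page, "pico")
print(intro)
print("...")
print(context_text)
```

Clamping with max() and min() matters when the match is near the start or end of a page, where a plain subtraction would give a negative index.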

Search results are saved to a text file, so you can easily keyword-search them or edit them using a text editor.

This program shows you how to handle PDFs in Python and how to output to text files.

On this webpage, you can download the code for the project.

For more information on how the code works, get issue 144 of The MagPi.

Photo credit: Niklas Ohlrogge at Unsplash.

Installing the dependencies

You'll need to install the PyPDF2 library.

In the Thonny Python editor:

  • Create a virtual environment: Create an empty folder. In Thonny, click the three-line menu icon in the bottom right. Choose "Configure interpreter" and click "New virtual environment".
  • Install the library: From the Tools menu in Thonny, choose "Manage packages". (If you don't see the Tools menu, click "Switch to regular mode" and restart Thonny.) Search for PyPDF2 and install it.
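
To confirm the installation worked, you could run a quick check like the one below in the same interpreter. It is a minimal sketch using the standard library's importlib; the helper name is_installed is hypothetical.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can find a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None

if is_installed("PyPDF2"):
    print("PyPDF2 is installed and ready to use.")
else:
    print("PyPDF2 was not found. Check that Thonny is using your virtual environment.")
```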

Download the Python PDF Searcher code

I've made one change to this version of the program, compared with the version that appeared in the magazine. Although I tested it extensively before publication, a reader got in touch to tell me it wasn't working for him. I was able to replicate the error, which for me was triggered by non-breaking hyphens in some PDFs. I've fixed the code by adding a parameter that opens the output file for writing with UTF-8 encoding. I've marked the changed line with a comment in the code below.

# PDF search, with output sent to text file
# By Sean McManus - www.sean.co.uk
import os, PyPDF2

def process_directory(path):
    for dir_or_file in os.listdir(path):
        path_plus_dir_or_file = os.path.join(path, dir_or_file)
        if os.path.isdir(path_plus_dir_or_file):
            print("\nProcessing subfolder:", dir_or_file)
            process_directory(path_plus_dir_or_file)
        elif dir_or_file.lower().endswith('.pdf'):  # also matches .PDF
            print("* Searching PDF:", dir_or_file)
            search_in_pdf(path_plus_dir_or_file)

def search_in_pdf(pdf_file):
    # Open the PDF in a with block so it is closed when the search finishes
    with open(pdf_file, 'rb') as opened_file:
        magazine_content = PyPDF2.PdfReader(opened_file)
        for page_number, magazine_page in enumerate(magazine_content.pages):
            page_text = magazine_page.extract_text()
            if search_string.lower() in page_text.lower():
                with open("output.txt", "a", encoding="utf-8") as output_file: # NOTE CHANGE HERE
                    print("\n\n", file=output_file)
                    print("#" * 40, file=output_file)
                    print(f"Text found in {pdf_file} on page {page_number + 1}", file=output_file)
                    print("#" * 40, file=output_file)
                    # First 200 characters from the top of the page
                    print(page_text[0:200], "\n...\n", file=output_file)
                    # Up to 450 characters either side of the last match on the page
                    position_in_text = page_text.lower().rfind(search_string.lower())
                    print(page_text[max(0, position_in_text - 450) :
                          min(position_in_text + 450, len(page_text))
                          ], file=output_file)

search_string = input("What term would you like to search for in the PDFs? ")
# Create or empty the results file, using UTF-8 to match the append step above
with open("output.txt", "w", encoding="utf-8") as output_file:
    print(f"Ok! Searching for {search_string}", file=output_file)
process_directory("MagPi") # Change to your folder name
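
To see why the encoding parameter in the listing above matters, here is a minimal sketch that tests whether a piece of text can be encoded. It assumes the failing default encoding was cp1252, which is a common default on Windows; the original error may have involved a different encoding. The helper name can_encode is hypothetical.

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# \u2011 is the non-breaking hyphen character
sample = "state\u2011of\u2011the\u2011art"

print("cp1252:", can_encode(sample, "cp1252"))  # cp1252 has no non-breaking hyphen
print("utf-8: ", can_encode(sample, "utf-8"))   # UTF-8 can encode any Unicode text
```

Writing text to a file encodes it in the same way, so a character the file's encoding cannot represent stops the program with a UnicodeEncodeError.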

Simpler version of the code that outputs to screen

When I was developing the code, my first step was to create a version that outputs to screen. It's only a few small changes from the version that outputs to a file, but in case it's useful to anyone, here it is:

# PDF search, with output sent to SCREEN
# By Sean McManus - www.sean.co.uk
import os, PyPDF2

def process_directory(path):
    for dir_or_file in os.listdir(path):
        path_plus_dir_or_file = os.path.join(path, dir_or_file)
        if os.path.isdir(path_plus_dir_or_file):
            print("\nProcessing subfolder:", dir_or_file)
            process_directory(path_plus_dir_or_file)
        elif dir_or_file.lower().endswith('.pdf'):  # also matches .PDF
            print("* Searching PDF:", dir_or_file)
            search_in_pdf(path_plus_dir_or_file)

def search_in_pdf(pdf_file):
    # Open the PDF in a with block so it is closed when the search finishes
    with open(pdf_file, 'rb') as opened_file:
        magazine_content = PyPDF2.PdfReader(opened_file)
        for page_number, magazine_page in enumerate(magazine_content.pages):
            page_text = magazine_page.extract_text()
            if search_string.lower() in page_text.lower():
                print("\n\n")
                print("#" * 40)
                print(f"Text found in {pdf_file} on page {page_number + 1}")
                print("#" * 40)
                # First 200 characters from the top of the page
                print(page_text[0:200], "\n...\n")
                # Up to 450 characters either side of the last match on the page
                position_in_text = page_text.lower().rfind(search_string.lower())
                print(page_text[max(0, position_in_text - 450) :
                      min(position_in_text + 450, len(page_text))
                      ])

search_string = input("What term would you like to search for in the PDFs? ")
print(f"Ok! Searching for {search_string}")
process_directory("MagPi") # Change to your folder name
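
The two versions differ only in where print() sends its output. One possible way to combine them, sketched here under that assumption, is a single reporting function that takes the destination as a parameter. The function name report_match and its parameters are hypothetical and not part of the magazine code.

```python
import io

def report_match(pdf_file, page_number, snippet, out=None):
    """Print one search result to out. If out is None, print() uses the screen."""
    print("#" * 40, file=out)
    print(f"Text found in {pdf_file} on page {page_number}", file=out)
    print("#" * 40, file=out)
    print(snippet, file=out)

# Send one result to the screen
report_match("example.pdf", 3, "Sample text around the match")

# Send another to an in-memory text buffer, standing in for an open file
buffer = io.StringIO()
report_match("example.pdf", 4, "More sample text", out=buffer)
```

Any object opened with open(..., "a", encoding="utf-8") could be passed as out in the same way as the buffer.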

More Raspberry Pi projects

Find more Raspberry Pi projects and tutorials here.

Credits

© Sean McManus. All rights reserved.

Visit www.sean.co.uk for free chapters from Sean's coding books (including Mission Python, Scratch Programming in Easy Steps and Coder Academy) and more!
