For issue 144 of The MagPi magazine, I created a Python program that searches through your PDF library. It's a great time saver! I've used it to search my magazine, ebook and even bank statement PDFs.
Like me, you probably have a library of PDF ebooks and magazines downloaded over the years. How can you find old tutorials and reviews?
This Python program does the hard work for you. You tell it which folder you want to search in, and it discovers all your PDFs and looks inside them to find a text match for your search term.
The results tell you which document your search term was found in, and also include a short snippet from the top of the page and from the text around your search term. That gives you a good idea of the context so you can decide which matches to follow up.
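The snippet logic described above can be sketched on its own, without any PDF handling. This is a minimal illustration rather than the program itself: the function name `context_snippet`, the `radius` parameter and the sample text are made up for the example, while the 200-character and 450-character figures match the listings on this page.

```python
def context_snippet(page_text, search_term, radius=450):
    """Return the text within `radius` characters of the last match, or None."""
    # rfind() picks the last occurrence on the page, as the full program does
    position = page_text.lower().rfind(search_term.lower())
    if position == -1:
        return None
    start = max(0, position - radius)
    end = min(position + radius, len(page_text))
    return page_text[start:end]


# Made-up page text standing in for the text extracted from one PDF page
sample_page = "Intro text. " * 5 + "Build a Robot with Raspberry Pi." + " More text." * 5

print(sample_page[0:200], "\n...\n")          # snippet from the top of the page
print(context_snippet(sample_page, "robot"))  # text around the search term
```

Clamping the start with `max(0, ...)` matters: without it, a match near the top of the page would produce a negative slice index, which Python would count from the end of the text instead.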
Search results are saved to a text file, so you can easily keyword-search them or edit them using a text editor.
This program shows you how to handle PDFs in Python and how to output to text files.
On this webpage, you can download the code for the project.
For more information on how the code works, get issue 144 of The MagPi.
Photo credit: Niklas Ohlrogge at Unsplash.
You'll need to install the PyPDF2 library.
In the Thonny Python editor, you can do this from the Tools menu: choose Manage packages, search for PyPDF2, and click Install.
I've made a change to this version of the program, compared to the version that appeared in the magazine. Although I tested it extensively before publication, a reader got in touch to tell me it wasn't working for him. I was able to replicate the error, which for me was triggered by non-breaking hyphens in some PDFs. I've fixed the code by adding a parameter that opens the output file for writing with UTF-8 encoding. I've marked the changed line with a comment in the code below.
# PDF search, with output sent to text file
# By Sean McManus - www.sean.co.uk
import os, PyPDF2

def process_directory(path):
    # Work through the folder, going into subfolders and searching each PDF
    for dir_or_file in os.listdir(path):
        path_plus_dir_or_file = os.path.join(path, dir_or_file)
        if os.path.isdir(path_plus_dir_or_file):
            print("\nProcessing subfolder:", dir_or_file)
            process_directory(path_plus_dir_or_file)
        elif dir_or_file.lower().endswith('.pdf'):  # also matches .PDF
            print("* Searching PDF:", dir_or_file)
            search_in_pdf(path_plus_dir_or_file)

def search_in_pdf(pdf_file):
    # Opening with "with" makes sure each PDF is closed after searching
    with open(pdf_file, 'rb') as opened_file:
        magazine_content = PyPDF2.PdfReader(opened_file)
        for page_number, magazine_page in enumerate(magazine_content.pages):
            page_text = magazine_page.extract_text()
            if search_string.lower() in page_text.lower():
                with open("output.txt", "a", encoding="utf-8") as output_file: # NOTE CHANGE HERE
                    print("\n\n", file=output_file)
                    print("#" * 40, file=output_file)
                    print(f"Text found in {pdf_file} on page {page_number + 1}", file=output_file)
                    print("#" * 40, file=output_file)
                    # Snippet from the top of the page
                    print(page_text[0:200], "\n...\n", file=output_file)
                    # Text around the last match on the page
                    position_in_text = page_text.lower().rfind(search_string.lower())
                    print(page_text[max(0, position_in_text - 450) :
                                    min(position_in_text + 450, len(page_text))
                                    ], file=output_file)

search_string = input("What term would you like to search for in the PDFs? ")
# Start a fresh output file, using the same encoding as the appends above
with open("output.txt", "w", encoding="utf-8") as output_file:
    print(f"Ok! Searching for {search_string}", file=output_file)
process_directory("MagPi") # Change to your folder name
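To see why the `encoding="utf-8"` parameter matters, here is a small standalone sketch. It assumes the original failure came from a platform whose default text encoding cannot represent a non-breaking hyphen (U+2011). The legacy Windows code page cp1252 is used here as an example of such an encoding, but that is a guess at the cause, not something confirmed for the reader's machine. The temporary folder and variable names are made up for the illustration.

```python
import os
import tempfile

text_with_hyphen = "state\u2011of\u2011the\u2011art"  # U+2011 is a non-breaking hyphen

# Write to a temporary folder so the example doesn't touch a real output.txt
output_path = os.path.join(tempfile.mkdtemp(), "output.txt")

# With an explicit encoding, the write behaves the same on every platform
with open(output_path, "a", encoding="utf-8") as output_file:
    print(text_with_hyphen, file=output_file)

with open(output_path, encoding="utf-8") as saved_file:
    saved_text = saved_file.read()

# cp1252 has no character for U+2011, so encoding to it raises an error
try:
    text_with_hyphen.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

print("Round trip worked:", saved_text.strip() == text_with_hyphen)
print("cp1252 can encode it:", cp1252_ok)
```

When `open()` is called without an `encoding` argument, Python uses a platform-dependent default, which is why a program can work on one computer and fail on another.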
When I was developing the code, my first step was to create a version that outputs to screen. It's only a few small changes from the version that outputs to a file, but in case it's useful to anyone, here it is:
# PDF search, with output sent to SCREEN
# By Sean McManus - www.sean.co.uk
import os, PyPDF2

def process_directory(path):
    # Work through the folder, going into subfolders and searching each PDF
    for dir_or_file in os.listdir(path):
        path_plus_dir_or_file = os.path.join(path, dir_or_file)
        if os.path.isdir(path_plus_dir_or_file):
            print("\nProcessing subfolder:", dir_or_file)
            process_directory(path_plus_dir_or_file)
        elif dir_or_file.lower().endswith('.pdf'):  # also matches .PDF
            print("* Searching PDF:", dir_or_file)
            search_in_pdf(path_plus_dir_or_file)

def search_in_pdf(pdf_file):
    # Opening with "with" makes sure each PDF is closed after searching
    with open(pdf_file, 'rb') as opened_file:
        magazine_content = PyPDF2.PdfReader(opened_file)
        for page_number, magazine_page in enumerate(magazine_content.pages):
            page_text = magazine_page.extract_text()
            if search_string.lower() in page_text.lower():
                print("\n\n")
                print("#" * 40)
                print(f"Text found in {pdf_file} on page {page_number + 1}")
                print("#" * 40)
                # Snippet from the top of the page
                print(page_text[0:200], "\n...\n")
                # Text around the last match on the page
                position_in_text = page_text.lower().rfind(search_string.lower())
                print(page_text[max(0, position_in_text - 450) :
                                min(position_in_text + 450, len(page_text))
                                ])

search_string = input("What term would you like to search for in the PDFs? ")
print(f"Ok! Searching for {search_string}")
process_directory("MagPi") # Change to your folder name
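Both listings find PDFs with a recursive function built on `os.listdir()`. As an alternative sketch, and not the approach used in the magazine, Python's standard `os.walk()` can visit the subfolders for you. The function name `find_pdfs` is made up for this example.

```python
import os

def find_pdfs(folder):
    """Yield the path of every PDF in folder and its subfolders."""
    # os.walk() visits the folder and each subfolder in turn
    for current_folder, _subfolders, filenames in os.walk(folder):
        for filename in sorted(filenames):
            if filename.lower().endswith(".pdf"):
                yield os.path.join(current_folder, filename)

if __name__ == "__main__":
    for pdf_path in find_pdfs("MagPi"):  # Change to your folder name
        print(pdf_path)
```

Each path this yields could be passed to `search_in_pdf()` from the listings above. The recursive version has the advantage of printing each subfolder name as it goes, which gives you a simple progress report.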
© Sean McManus. All rights reserved.