Webpage downloader - Use Python to download webpages into a Microsoft Office or LibreOffice document

For issue 144 of The MagPi magazine, I created a productivity tool that downloads webpages for you and puts them into an office document you can easily search or edit.

What is the webpage downloader?

This program can be a real time saver if you need to process a lot of online research. You enter a list of website addresses, and it downloads them all and puts them into a docx file you can open in Microsoft Word or LibreOffice. (I'm sure you're already familiar with one of these tools, but if not, you can find a guide to Word in Microsoft Office for the Older and Wiser, and a guide to LibreOffice in Raspberry Pi For Dummies.)

The results aren't perfect: there are no images, and the document often includes navigation elements you don't need. But you can easily delete anything unwanted, and the real power is how quickly you can skim-read or search across multiple web pages.

The program shows you:

  • How to use the requests module in Python to download web pages;
  • How you can use Beautiful Soup to find content in them; and
  • How you can use the python-docx library to create Word/LibreOffice documents.
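As a small taste of the Beautiful Soup step, here is a minimal sketch that parses a snippet of HTML held in a string, so nothing is downloaded. The HTML and the variable names html, unwanted and found are made up for illustration; only the nav and footer removal mirrors what the project code does.

```python
from bs4 import BeautifulSoup

# A made-up page for illustration only
html = """
<html><head><title>Test page</title></head>
<body>
<nav><p>Home | About</p></nav>
<h1>Welcome</h1>
<p>First paragraph.</p>
<footer><p>Copyright notice</p></footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Take the navigation and footer out of the parsed page
for unwanted in soup.find_all(["nav", "footer"]):
    unwanted.extract()

# Collect the tag name and text of each heading and paragraph left
found = [(part.name, part.text) for part in soup.find_all(["h1", "p"])]

print(soup.title.string)  # Test page
print(found)  # [('h1', 'Welcome'), ('p', 'First paragraph.')]
```

The paragraphs inside nav and footer disappear from the results because their parent elements were removed before the search.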

On this webpage, you can download the code for the project.

For more information on how the code works, get issue 144 of The MagPi.

Installing the dependencies

You'll need to install the bs4 and python-docx libraries. The code also imports requests, so install that in the same way if Python reports it as missing.

In the Thonny Python editor:

  • Create a virtual environment: Create an empty folder. In Thonny, click the three-line menu icon in the bottom right. Choose "Configure interpreter" and click "New virtual environment".
  • Install the libraries: From the Tools menu in Thonny, choose "Manage packages". (If you don't see the Tools menu, click "Switch to regular mode" and restart Thonny.) Search for bs4 and install it, then do the same for python-docx.
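If you want to confirm the installation worked before running the project, a quick check along these lines may help. It is only a sketch using Python's standard library; the function name missing_modules is my own invention for this example. Note that python-docx is imported under the name docx.

```python
import importlib.util


def missing_modules(names):
    """Return the names from the list that Python cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# These are the import names used by the project code.
# The python-docx package is imported as docx.
needed = ["requests", "bs4", "docx"]

missing = missing_modules(needed)
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("All libraries found")
```

Run it with the same interpreter (or virtual environment) you plan to use for the downloader, because each environment has its own set of installed packages.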

Download the Webpage downloader code

# Download web pages into a docx file
# By Sean McManus - www.sean.co.uk

import requests, sys
from bs4 import BeautifulSoup 
from docx import Document

print("Paste in the URLs (Ctrl-D to end input): ")
urls = sys.stdin.readlines()
# Tidy each line and skip blank ones, which requests cannot fetch
urls = [url.strip() for url in urls if url.strip()]
filename = "output.docx"
doc = Document()

for source_number, url in enumerate(urls):
    print(f"Fetching {url}")
    # The timeout stops the program waiting forever on a slow site
    response = requests.get(url, timeout=30)
    content = response.content
    soup = BeautifulSoup(content, "html.parser")
    for remove_me in soup.find_all(["nav", "footer"]):
        remove_me.extract()
        
    doc.add_heading(f"{source_number + 1} - {url}", 1)
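    # Some pages have no usable <title>, so fall back to the URL
    title = soup.title.string if soup.title and soup.title.string else url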
    doc.add_heading(f"{source_number + 1} - {title}", 0)
    for part in soup.find_all(["p", "h1", "h2", "h3", "h4", "h5", "h6", "table", "li", "blockquote"]):
        if part.name in ["h1", "h2", "h3"]:
            doc.add_heading(part.text, 2)
        elif part.name == "li":
            doc.add_paragraph(part.text, style="List Bullet")
        elif part.text:
            doc.add_paragraph(part.text)
    doc.add_page_break()
doc.save(filename)
print(f"Saved as {filename}")
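One thing to watch when pasting addresses: requests raises an error if a URL has no http:// or https:// at the start. Here is a minimal sketch of a clean-up step you could try, using only built-in Python. The function name clean_urls, the decision to assume https:// and the removal of duplicates are my assumptions for illustration, not part of the published code.

```python
def clean_urls(lines):
    """Return tidy URLs from pasted lines, without blanks or repeats."""
    cleaned = []
    for line in lines:
        url = line.strip()
        if not url:
            continue  # ignore blank lines
        # Assumption: an address with no scheme is meant to be https
        if not url.startswith(("http://", "https://")):
            url = "https://" + url
        if url not in cleaned:
            cleaned.append(url)  # keep the first copy only
    return cleaned


pasted = ["www.sean.co.uk\n", "\n", "https://www.sean.co.uk\n"]
print(clean_urls(pasted))  # ['https://www.sean.co.uk']
```

To try it in the downloader, you could pass the result of sys.stdin.readlines() to clean_urls() in place of the line that strips each URL.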

More Raspberry Pi projects

Find more Raspberry Pi projects and tutorials here.

Credits

© Sean McManus. All rights reserved.

Visit www.sean.co.uk for free chapters from Sean's coding books (including Mission Python, Scratch Programming in Easy Steps and Coder Academy) and more!
