Extract comments from a PDF file

Here are step-by-step instructions for extracting the comments from a PDF file. The steps for doing it with Adobe PRO are documented here. This post describes how to do it if you do not have Adobe PRO.

I found two solutions by searching online: one uses the poppler library and the other uses the PyMuPDF (fitz) library. The poppler solution had a lot of upvotes, so I tried it first but couldn’t get it to work. The problem I ran into was installing poppler on a Mac: I installed it using conda and the installation reported no errors (here is the installation log), but the import failed when I tried to use it. Luckily, I was able to get PyMuPDF to work. Below are step-by-step instructions.

Step 0: Install Anaconda

Step 1: Create new Environment

conda create -n py38 python=3.8

Step 2: Activate the Environment

conda activate py38

Step 3: Install PyMuPDF

The command that worked for me is

python -m pip install --upgrade pymupdf

Here is the output when I ran the command:

Collecting pymupdf
  Downloading PyMuPDF-1.19.1-cp38-cp38-macosx_10_9_x86_64.whl (7.6 MB)
     |████████████████████████████████| 7.6 MB 1.5 MB/s
Installing collected packages: pymupdf
Successfully installed pymupdf-1.19.1
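
Since a clean install log did not guarantee a working import in the poppler attempt, it may be worth confirming that the package is visible to the interpreter you are actually running. The sketch below is one way to do that with the standard library only. The helper name is_importable is my own, and I am not claiming a mismatched interpreter was the cause of the poppler problem; it is just a quick check.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if a top-level module can be found by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Shows which Python is running, e.g. the one inside the py38 environment
    print("interpreter:", sys.executable)
    # "fitz" is the name PyMuPDF is imported under in this post
    print("fitz importable:", is_importable("fitz"))
```

If the second line prints False, the package most likely went into a different environment than the one that is active.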

Step 4: Create script to extract the comments

I got it from here.

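#!/usr/bin/env python
import sys
import fitz  # PyMuPDF is imported under the name fitz

doc = fitz.open(sys.argv[1])  # the path to the PDF is the first argument
# Iterate over the pages directly; doc.pageCount is a deprecated name
# that newer PyMuPDF releases no longer accept
for page in doc:
    for annot in page.annots():
        print(annot.info["content"])  # the comment text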

Step 5: Run the script

Left as an exercise.
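
For anyone who wants a starting point for the exercise, here is a sketch of a slightly longer variant of the script that labels each comment with its page number, annotation type, and author, and skips annotations with no text. The function names format_comment and iter_comments and the file name extract_comments.py are my own choices, not something from the original script. It assumes a PyMuPDF version where page.annots(), annot.type, and the "title" key of annot.info (which holds the author) are available.

```python
#!/usr/bin/env python
"""Print each PDF comment with its page number, annotation type, and author."""
import sys


def format_comment(page_number, kind, author, content):
    """Build one output line; a missing author is shown as '?'."""
    return f"p{page_number} [{kind}] {author or '?'}: {content.strip()}"


def iter_comments(path):
    """Yield (page_number, kind, author, content) for each non-empty comment."""
    import fitz  # PyMuPDF; imported here so format_comment works without it

    doc = fitz.open(path)
    for page in doc:
        for annot in page.annots():
            info = annot.info
            content = info.get("content", "")
            if content.strip():  # e.g. highlights often carry no text
                # annot.type is a pair such as (8, 'Highlight');
                # page.number is zero-based, so add 1 for display
                yield page.number + 1, annot.type[1], info.get("title", ""), content


# Only runs when a PDF path is given on the command line
if __name__ == "__main__" and len(sys.argv) == 2:
    for row in iter_comments(sys.argv[1]):
        print(format_comment(*row))
```

Saved as extract_comments.py (a hypothetical name), it would be run from the activated environment as: python extract_comments.py yourfile.pdf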
