(The following text is more a collection of random notes than a tutorial; I will rewrite it at some point.)
In Egyptology, we often have to use out-of-print books and articles. Many of us have a large library of scanned documents; most of the classical works, such as the Urkunden, have been available for a long time. These scans are precious, in particular when we move: we don't travel with a large trunk of books, like Agatha Christie's husband in Come, Tell Me How You Live.
So, this page contains a list of useful tools to convert paper documents into PDF, and to improve them.
Scanning
There are lots of pages about book scanning, and I'm not going to repeat them here. If you have just a smartphone, I have found the application VFlat Scanner quite efficient: it detects pages and dewarps them very well. It's free, but if you want to export directly to PDF, you will need to pay. Often, I simply export individual pages and process them later.
Scan Tailor Improved
This is open-source software that processes pictures of scanned documents and produces dewarped, cleaned-up individual pages. It uses traditional algorithms, not AI, but it's quite efficient. It is available for Linux, Windows and macOS, and the latest versions are packaged in most Linux distributions.
Cleaning pictures
To get black and white pictures (mainly when dealing with old books with yellowed pages), you can use ImageMagick. The following command converts an image to grayscale:
magick input.jpg -colorspace gray output.jpg
To improve the contrast, you can set the black and white points of the histogram with the -level option:
magick input.jpg -level 20%,80% output.jpg
It's best to test your settings on one or two sample pictures first; 20%,70% can be a good starting point.
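To compare several settings at once, you can generate the commands first and run them after a quick review. This is only a sketch: sample.jpg and the level values are placeholders, and the loop merely writes a helper script without invoking ImageMagick itself.

```shell
# Write a helper script that tries several black/white points on a
# sample page.  sample.jpg is a placeholder name; review try_levels.sh,
# then run it (ImageMagick must be installed).
for lvl in "10%,80%" "20%,80%" "20%,70%" "30%,70%"; do
    # Build an output name such as test_20_70.jpg from the level string.
    tag=$(printf '%s' "$lvl" | tr ',' '_' | tr -d '%')
    echo "magick sample.jpg -colorspace gray -level $lvl test_${tag}.jpg"
done > try_levels.sh
```

Then run sh try_levels.sh and compare the test_*.jpg files side by side.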
-lat applies a local adaptive threshold, and works very well:
magick input.jpg -colorspace gray -lat 15x15-2000 output.jpg
Automated dewarping
There are a number of tools dealing with page dewarping. I haven't tried them (I have mainly used Scan Tailor, and VFlat). Prof. Matt Zucker has written a very interesting page about an algorithm for dewarping pages (the rest of his blog is well worth reading too).
PDF Processing and OCR
Once you have your pages, you can create a PDF quite easily on Linux. Here are a number of possible steps:
From pictures to PDF
If your pictures' naming scheme makes them sort in the correct order:
img2pdf *.jpg -o book.pdf
will produce a PDF book.
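One catch: the shell sorts names lexicographically, so page10.jpg comes before page2.jpg. If your scans have unpadded numbers, a small renaming loop fixes the order first (a sketch assuming names like page1.jpg … page123.jpg without leading zeros; pad_pages is a hypothetical helper name):

```shell
# Zero-pad the numeric part of pageN.jpg names (page2.jpg -> page002.jpg)
# so that lexicographic glob order matches page order.  Assumes names
# like page1.jpg ... page123.jpg, with no leading zeros.
pad_pages() {
    for f in page*.jpg; do
        [ -e "$f" ] || continue              # the glob matched nothing
        n=${f#page}; n=${n%.jpg}             # extract the page number
        new=$(printf 'page%03d.jpg' "$n")
        [ "$f" = "$new" ] || mv -- "$f" "$new"
    done
}
```

Run pad_pages in the directory of scans, then img2pdf page*.jpg -o book.pdf.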
OCR (and bonus compression!)
We can use ocrmypdf to produce a searchable PDF.
It can:
- perform OCR (including on handwritten documents, e.g. the text volumes of the Denkmäler);
- optimize the PDF size using JBIG2 compression for black-and-white documents; this is very useful, as a global optimization reduces the size of the book a lot.
ocrmypdf --language deu --optimize 3 --jbig2-lossy book.pdf book_ocr.pdf
The --optimize 3 --jbig2-lossy options are about size reduction. You need a version of ocrmypdf that supports JBIG2 encoding (the format was patent-protected until recently); the snap version of the software for Ubuntu does.
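To OCR a whole directory of books in one go, a loop over *.pdf works. The sketch below is deliberately a dry run (it prints the ocrmypdf commands instead of executing them, and ocr_all is a hypothetical helper name), so you can check the file list before committing to hours of OCR:

```shell
# OCR every PDF in the current directory: book.pdf -> book_ocr.pdf.
# Dry run: the commands are only printed; remove "echo" to run them.
ocr_all() {
    for f in *.pdf; do
        [ -e "$f" ] || continue                   # the glob matched nothing
        case $f in *_ocr.pdf) continue ;; esac    # skip already-processed files
        echo ocrmypdf --language deu --optimize 3 --jbig2-lossy \
            "$f" "${f%.pdf}_ocr.pdf"
    done
}
```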
Table of contents
For this, you can use pdftk, which is very versatile and does a lot of other things too. You can even use it to enrich an existing PDF with a table of contents.
We first need to dump the document information from the PDF:
Extracting document information
pdftk book_ocr.pdf dump_data > book.info
Writing the table of contents
If you want a multi-level table of contents (chapters, sections, etc.), create a book.csv file with one line per entry, of the form:
Title@Level@Page
where Page is the page number in the PDF (not the printed page number). Example:
Préface@1@5
Alexandria@1@11
Gizeh@1@35
Grab des iy-mry@2@49
You can get the page numbers by opening your PDF in a viewer. Then, run the following shell script:
cat book.info > out.info
while IFS='@' read -r label level page; do
    echo "BookmarkBegin"
    echo "BookmarkTitle: $label"
    echo "BookmarkLevel: $level"
    echo "BookmarkPageNumber: $page"
done < book.csv >> out.info
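A malformed line in book.csv produces a broken bookmark without any warning, so it can be worth checking the file before generating out.info. Here is a small awk sketch (check_toc is a hypothetical helper name, assuming the Title@Level@Page format above):

```shell
# Verify that every line of a TOC file has the form Title@Level@Page,
# with numeric Level and Page fields.  Prints each malformed line and
# returns a non-zero status if any is found.
check_toc() {
    awk -F'@' '
        NF != 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
            printf "line %d is malformed: %s\n", NR, $0
            bad = 1
        }
        END { exit bad }
    ' "$1" && echo "$1 looks fine"
}
```

Run check_toc book.csv before generating the bookmark records.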
Adding your table of contents to the PDF
The following command incorporates the information from out.info (including the table of contents) into the PDF:
pdftk book_ocr.pdf update_info out.info output b2.pdf
Extracting pictures from a PDF
Beware: some tools render the pages themselves as pictures, while others extract the individual images embedded in the PDF, which is usually not what we want here. pdftoppm (from poppler-utils) renders each page, here as JPEG at 600 dpi:
pdftoppm -jpeg -r 600 input.pdf output
Other OCR tools
I have had some interesting results with ollama (which can run LLMs locally) and the glm-ocr model on individual pages; as a matter of fact, I used it to extract the text of the table of contents of the Denkmäler. The results depended a lot on the quality of the scan.
Final note
If you want to go as far as producing a table of contents and a light, nice PDF, a warning: it takes time!