(The following text is more a collection of random notes than a tutorial; I will rewrite it at some point.)
In Egyptology, we often have to use out-of-print books and articles. Many of us have a large library of scanned documents; most of the classical works, such as the Urkunden, have been available for a long time. These scans are precious, in particular when we move: we don't travel with a large trunk of books, like Agatha Christie's husband in Come, Tell Me How You Live.
So, this page contains a list of useful tools to convert paper documents into PDF, and to improve them.
Scanning
There are lots of pages about book scanning, and I'm not going to repeat them here. If you have just a smartphone, I have found the application VFlat Scanner quite efficient: it detects pages and dewarps them very well. It's free, but if you want to export directly to PDF, you will need to pay. Often, I simply export individual pages and process them later.
Scan Tailor Improved
This is open-source software that processes pictures of scanned documents and produces dewarped, cleaned-up individual pages. It uses traditional algorithms, not AI, but it's quite efficient. It is available for Linux, Windows and macOS, and the latest versions are packaged in most Linux distributions.
Cleaning pictures
To get black and white pictures (mainly when dealing with old books with yellowed pages), you can use ImageMagick. The following command converts an image to grayscale:
magick input.jpg -colorspace gray output.jpg
To improve the contrast, you can set the black and white points of the histogram with the -level option:
magick input.jpg -level 20%,80% output.jpg
It's best to test your settings on one or two sample pictures first; 20%,70% can be a good starting point.
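To compare several settings at once, you can generate the commands first and run them after a quick review. This is only a sketch: sample.jpg and the level values are placeholders, and the loop merely writes a helper script without invoking ImageMagick itself.

```shell
# Write a helper script that tries several black/white points on a
# sample page.  sample.jpg is a placeholder name; review try_levels.sh,
# then run it (ImageMagick must be installed).
for lvl in "10%,80%" "20%,80%" "20%,70%" "30%,70%"; do
    # Build an output name such as test_20_70.jpg from the level string.
    tag=$(printf '%s' "$lvl" | tr ',' '_' | tr -d '%')
    echo "magick sample.jpg -colorspace gray -level $lvl test_${tag}.jpg"
done > try_levels.sh
```

Then run sh try_levels.sh and compare the test_*.jpg files side by side.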
-lat applies a local adaptive threshold, and works very well:
magick input.jpg -colorspace gray -lat 15x15-2000 output.jpg
Automated dewarping
There are a number of tools dealing with page dewarping. I haven't tried them (I have mainly used Scan Tailor, and VFlat). Prof. Matt Zucker has written a very interesting page about an algorithm for dewarping pages (the rest of his blog is well worth reading too).
PDF Processing and OCR
Once you have your pages, you can create a PDF quite easily on Linux. Here are a number of possible steps:
From pictures to PDF
If your pictures' naming scheme makes them sort in the correct order:
img2pdf *.jpg -o book.pdf
will produce a PDF book.
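One catch: the shell sorts names lexicographically, so page10.jpg comes before page2.jpg. If your scans have unpadded numbers, a small renaming loop fixes the order first (a sketch assuming names like page1.jpg … page123.jpg without leading zeros; pad_pages is a hypothetical helper name):

```shell
# Zero-pad the numeric part of pageN.jpg names (page2.jpg -> page002.jpg)
# so that lexicographic glob order matches page order.  Assumes names
# like page1.jpg ... page123.jpg, with no leading zeros.
pad_pages() {
    for f in page*.jpg; do
        [ -e "$f" ] || continue              # the glob matched nothing
        n=${f#page}; n=${n%.jpg}             # extract the page number
        new=$(printf 'page%03d.jpg' "$n")
        [ "$f" = "$new" ] || mv -- "$f" "$new"
    done
}
```

Run pad_pages in the directory of scans, then img2pdf page*.jpg -o book.pdf.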
OCR (and bonus compression!)
We can use ocrmypdf to produce a searchable PDF.
It can:
- perform OCR (including on handwritten documents, e.g. the text volumes of the Denkmäler);
- optimize the PDF size using JBIG2 compression for black-and-white documents; this is very useful, as a global optimization reduces the size of the book a lot.
ocrmypdf --language deu --optimize 3 --jbig2-lossy book.pdf book_ocr.pdf
The --optimize 3 --jbig2-lossy options are about size reduction. You need a version of ocrmypdf that supports JBIG2 encoding (the format was patent-protected until recently); the snap version of the software for Ubuntu does.
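To OCR a whole directory of books in one go, a loop over *.pdf works. The sketch below is deliberately a dry run (it prints the ocrmypdf commands instead of executing them, and ocr_all is a hypothetical helper name), so you can check the file list before committing to hours of OCR:

```shell
# OCR every PDF in the current directory: book.pdf -> book_ocr.pdf.
# Dry run: the commands are only printed; remove "echo" to run them.
ocr_all() {
    for f in *.pdf; do
        [ -e "$f" ] || continue                   # the glob matched nothing
        case $f in *_ocr.pdf) continue ;; esac    # skip already-processed files
        echo ocrmypdf --language deu --optimize 3 --jbig2-lossy \
            "$f" "${f%.pdf}_ocr.pdf"
    done
}
```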
Table of contents
For this, you can use pdftk, which is very versatile and does a lot of other things too. You can even use it to enrich an existing PDF with a table of contents.
We first need to dump the document information from the PDF:
Extracting document information
pdftk book_ocr.pdf dump_data > book.info
Writing the table of contents
If you want a multi-level table of contents (chapters, sections, etc.), create a book.csv file with one line per entry, of the form:
Title@Level@Page
where Page is the page number in the PDF (not the printed page number). Example:
Préface@1@5
Alexandria@1@11
Gizeh@1@35
Grab des iy-mry@2@49
You can get the page numbers by opening your PDF in a viewer. Then, run the following shell script:
cat book.info > out.info
while IFS='@' read -r label level page; do
    echo "BookmarkBegin"
    echo "BookmarkTitle: $label"
    echo "BookmarkLevel: $level"
    echo "BookmarkPageNumber: $page"
done < book.csv >> out.info
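A malformed line in book.csv produces a broken bookmark without any warning, so it can be worth checking the file before generating out.info. Here is a small awk sketch (check_toc is a hypothetical helper name, assuming the Title@Level@Page format above):

```shell
# Verify that every line of a TOC file has the form Title@Level@Page,
# with numeric Level and Page fields.  Prints each malformed line and
# returns a non-zero status if any is found.
check_toc() {
    awk -F'@' '
        NF != 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
            printf "line %d is malformed: %s\n", NR, $0
            bad = 1
        }
        END { exit bad }
    ' "$1" && echo "$1 looks fine"
}
```

Run check_toc book.csv before generating the bookmark records.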
Adding your table of contents to the PDF
The following command incorporates the information from out.info (including the table of contents) into the PDF:
pdftk book_ocr.pdf update_info out.info output b2.pdf
Extracting pictures from a PDF
Beware: some tools render the pages themselves as pictures, while others extract the individual images embedded in the PDF, which is usually not what we want here. pdftoppm (from poppler-utils) renders each page, here as JPEG at 600 dpi:
pdftoppm -jpeg -r 600 input.pdf output
Other OCR tools
I have had some interesting results with ollama (which can run LLMs locally) and the glm-ocr model on individual pages; as a matter of fact, I used it to extract the text of the table of contents of the Denkmäler. The results depended a lot on the quality of the scan.
Final note
If you want to go as far as producing a table of contents and a light, nice PDF, a warning: it takes time!