Extract text from an image or a PDF

10 March 2021

How can you effectively extract text from a PDF or an image? This is commonly called OCR (optical character recognition). I found 2 extremely powerful tools based on the open source engine Tesseract (Official website).

I am using Windows, and both tools can be used on this OS. The first one converts a scanned PDF into a searchable PDF (whose text can also be copied). The second one takes a screenshot of an area of your screen, converts it to text and stores it in your clipboard.

  • Ocrmypdf
    • you need to use Ubuntu on Windows (more info here)
    • update your package lists: sudo apt-get update
    • install it: sudo apt-get install ocrmypdf
    • check the documentation for the available commands
      • here is an easy example for a French PDF: ocrmypdf -l fra "input.pdf" "output.pdf"
    • To install new languages (on Ubuntu)
      • check which ones exist: apt-cache search tesseract-ocr
      • install the one you need, e.g. French: sudo apt-get install tesseract-ocr-fra
  • normcap
    • easy to install: just run the exe
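
If you have a whole folder of scans, the single ocrmypdf command above can be wrapped in a small loop. This is only a minimal sketch, assuming it runs in the Ubuntu shell where ocrmypdf is installed. The function names ocr_output_name and ocr_all_pdfs and the "_ocr" suffix are my own choices for illustration, not something provided by ocrmypdf.

```shell
# Hypothetical helper: derive the output name, e.g. "scan.pdf" -> "scan_ocr.pdf".
ocr_output_name() {
  printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# Hypothetical helper: run ocrmypdf on every PDF of a folder.
# $1 = Tesseract language code (default: fra), $2 = folder (default: current one).
ocr_all_pdfs() {
  local lang="${1:-fra}"
  local dir="${2:-.}"
  local f out
  for f in "$dir"/*.pdf; do
    [ -e "$f" ] || continue                    # the folder contains no PDF at all
    case "$f" in *_ocr.pdf) continue ;; esac   # already an OCR output, skip it
    out="$(ocr_output_name "$f")"
    [ -e "$out" ] && continue                  # output already exists, do not redo it
    # Same command as in the list above, applied to one file at a time.
    ocrmypdf -l "$lang" "$f" "$out" || echo "failed: $f" >&2
  done
}
```

After pasting these functions into the shell, something like ocr_all_pdfs fra ~/scans should write a searchable copy next to each original, where ~/scans is just an example path. The originals are left untouched, so you can compare both versions before deleting anything.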

Give it a try :)


