The paperless office

Well, here we are once again, this time to conquer the pile of paper lying beside, on and under your desk. My girlfriend was going nuts over the sheer amount of paper lying around everywhere. I’m kind of messy, but once I start organizing things I’m kind of a perfectionist. So I went ahead and put all the papers in one big pile.

After going through the pile I realized that there were a lot of papers I wanted to throw away but that were, at the same time, worth keeping. A dilemma. So I had a good cold beer and started thinking about the situation. The beer worked like oil on the brain and I came up with a good solution: ‘Let’s scan this pile of toilet paper!’ So I plugged in my HP 7310 and put all the paperwork in the automatic document feeder of the device. I inserted an empty SD card and started gaming while the HP did what it was made for.

After a few hours (well, actually the scanner was done long before that… please don’t tell my girlfriend 😉) the scans were complete and I put the files on my Debian server. So far so good. Now at least I had a backup of all the paper versions. I was rather satisfied with that and threw away a large pile of paper I no longer needed, as I had a digital copy now.

But a real geek doesn’t stop here. What I wanted next was to be able to search through my digital paperwork quickly and pull up the papers I need. So I had a look around and found a great open source OCR package called Tesseract. I downloaded some packages and started experimenting. I found out that, at the time, it could only handle TIFF images and that it was best to avoid color in the images. To get the required images I installed ImageMagick to convert the JPGs into uncompressed grayscale TIFFs.
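For a single scan, the two manual steps look roughly like this. It is only a sketch: ocr_one is a helper name I made up for illustration, the options mirror the ones used in the script further down, and `-l nld` assumes the Dutch language data for Tesseract is installed.

```shell
#!/bin/bash
# Sketch: OCR one scan by hand. ocr_one is a hypothetical helper name.
# Assumes ImageMagick (convert) and tesseract are installed.
ocr_one() {
        local jpg_file=$1
        local base_name=${jpg_file%.*}      # strip the extension

        # Convert the jpg into an uncompressed 8-bit grayscale tiff
        convert "$jpg_file" -colorspace gray -depth 8 -compress none "$base_name.tif"

        # tesseract appends .txt to the output base name itself
        tesseract "$base_name.tif" "$base_name" -l nld
}

# Example (hypothetical file name): ocr_one scan0001.jpg
```

The function leaves the intermediate tif next to the jpg, so you can inspect it if the OCR result looks odd.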

So far so good. Running tesseract on such a TIFF produces a txt file containing the text of the scan. Pretty neat. Now I can use grep to look for any matching string and then open the corresponding jpg. Easy as 1, 2, 3. 🙂
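The grep step could be wrapped up like this. Again a sketch under assumptions: find_scans is a hypothetical helper, and it expects every scan to have its txt file sitting next to it with the same base name.

```shell
#!/bin/bash
# Sketch: print the scans whose OCR text contains a search term.
# find_scans is a hypothetical helper name.
# $1 = search term, $2 = directory holding the jpg and txt files
find_scans() {
        local term=$1 dir=$2

        # -l: only print names of matching files, -i: ignore case,
        # -F: treat the term as a plain string, not a regex
        grep -liF -- "$term" "$dir"/*.txt | while IFS= read -r txt_file; do
                # Map the txt file back to its scan (.jpg or .JPG)
                for ext in jpg JPG; do
                        if [ -e "${txt_file%.txt}.$ext" ]; then
                                echo "${txt_file%.txt}.$ext"
                                break
                        fi
                done
        done
}

# Example (hypothetical term): find_scans "invoice" /share/downloads/Scans
```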

After this I created the following script, which automatically runs any new images through the OCR software.

#!/bin/bash
 
basedir="/share/downloads/Scans"
 
# Parses a scan if not already processed
# $1 = JPG file
parse_scan(){
        local jpg_file=$1
        local base_name=${jpg_file%.*}      # file name without extension
        local tif_file=$base_name.TIF
        local txt_file=$base_name.txt

        # Only process scans that have no OCR output (txt file) yet
        if [ ! -e "$txt_file" ]; then
                echo "Converting: $jpg_file into $tif_file"

                # Convert jpg into a grayscale, uncompressed tiff file
                # (the output format follows from the .TIF extension)
                convert "$jpg_file" -colorspace gray -depth 8 -compress none "$tif_file" &> /dev/null

                # Use tesseract for ocr on tiff file, writes $base_name.txt
                tesseract "$tif_file" "$base_name" -l nld &> /dev/null

                # Remove the temporary tiff file
                rm -f "$tif_file"
        fi
}
 
# Run every .jpg / .JPG file in the scan directory through parse_scan
for file in "$basedir"/*; do
        case $file in
                *.jpg|*.JPG) parse_scan "$file" ;;
        esac
done

You can of course make the basedir an argument that is passed to the shell script, but that is a personal choice. Now you can get rid of your pile of paper too. Let’s start recycling that pile of paper. 🙂
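If you do want the directory as an argument, something along these lines should do. It is a sketch, not part of my script: resolve_basedir is a hypothetical helper, and falling back to the hard-coded path when no argument is given is just one possible choice.

```shell
#!/bin/bash
# Sketch: take the scan directory from the first argument,
# falling back to the original hard-coded path.
# resolve_basedir is a hypothetical helper name.
resolve_basedir() {
        local dir=${1:-/share/downloads/Scans}

        if [ ! -d "$dir" ]; then
                echo "Not a directory: $dir" >&2
                return 1
        fi
        echo "$dir"
}

# In the script this would replace the fixed assignment:
# basedir=$(resolve_basedir "$1") || exit 1
```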
