Paper Backup (2) Automation Scripts
This is the second part of a three-part series describing my automatic paper backup system. In this part I explain the automation scripts used to create a searchable PDF from a scan and upload it to the backup locations.
- Automation Scripts ← you are here
The setup is multi-user capable. E.g. in my setup, files will be stored in either Kaddi's or my home directory on both the NAS and Google Drive.
Note: all scripts below are available in a GitHub repository.
Script 1: Scan
The first step simply acquires the raw image data from the previously set-up scanner. The script requires a “job identifier” as its first parameter. This will be the same for all following scripts. It's simply a unique name identifying the working folder, and it will also be the name of the final output PDF.
- 01-scan.sh
#!/bin/bash
BASE="/tmp"

if [ -z "$1" ]; then
    echo "Usage: $0 <jobid>"
    echo
    echo "Please provide unique jobid name as first parameter"
    exit 1
fi

OUTPUT="$BASE/$1"
mkdir -p "$OUTPUT"

echo 'scanning...'
scanimage --resolution 300 \
          --batch="$OUTPUT/scan_%03d.pnm" \
          --format=pnm \
          --mode Gray \
          --source 'ADF Duplex'

echo "Output in $OUTPUT/scan*.pnm"
The script automatically scans all pages in the document feeder as grayscale PNM images.
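A typical invocation then looks like this (the job ID below is just an example, any unique name works; the two status lines are printed by the script itself):

$> ./01-scan.sh 2016-01-15_093000
scanning...
Output in /tmp/2016-01-15_093000/scan*.pnm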
Script 2: Cleanup, OCR, PDF Generation
The next step is the more complicated one. First the input images are cropped, then blank pages are recognized and removed. The remaining images are cleaned up, and finally OCR is applied and a “sandwich” PDF is created: the scanned image with an invisible OCR text layer behind it, which makes the document searchable. This requires a few more utilities to be installed:
$> sudo apt-get install imagemagick bc exactimage pdftk \
     tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng
The following script uses these tools to do all the work, based on the previously scanned images (identified by the job ID).
- 02-createpdf.sh
#!/bin/bash
LANGUAGE="deu" # the tesseract language
BASE="/tmp"

if [ -z "$1" ]; then
    echo "Usage: $0 <jobid>"
    echo
    echo "Please provide existing jobid as first parameter"
    exit 1
fi

OUTPUT="$BASE/$1"
if [ ! -d "$OUTPUT" ]; then
    echo "jobid does not exist"
    exit 1
fi
cd "$OUTPUT"

# cut borders
echo 'cutting borders...'
for i in scan_*.pnm; do
    mogrify -shave 50x5 "${i}"
done

# check if the page is blank
# http://philipp.knechtges.com/?p=190
echo 'checking for blank pages...'
for i in scan_*.pnm; do
    echo "${i}"
    histogram=`convert "${i}" -threshold 50% -format %c histogram:info:-`
    white=`echo "${histogram}" | grep "#FFFFFF" | sed -n 's/^ *\(.*\):.*$/\1/p'`
    black=`echo "${histogram}" | grep "#000000" | sed -n 's/^ *\(.*\):.*$/\1/p'`
    blank=`echo "scale=4; ${black}/${white} < 0.005" | bc`
    if [ ${blank} -eq "1" ]; then
        echo "${i} seems to be blank - removing it..."
        rm "${i}"
    fi
done

# apply text cleaning and convert to tif
echo 'cleaning pages...'
for i in scan_*.pnm; do
    echo "${i}"
    convert "${i}" -contrast-stretch 1% -level 29%,76% "${i}.tif"
done

# do OCR
echo 'doing OCR...'
for i in scan_*.pnm.tif; do
    echo "${i}"
    tesseract "$i" "$i" -l $LANGUAGE hocr
    hocr2pdf -i "$i" -s -o "$i.pdf" < "$i.hocr"
done

# create PDF
echo 'creating PDF...'
pdftk *.tif.pdf cat output "$1.pdf"

echo "created $OUTPUT/$1.pdf"
The language for OCR processing with Tesseract is configured at the top of the script. You might want to change that if you don't speak German; remember to install the matching language package then.
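If you're unsure which languages your Tesseract installation actually supports, it can list them (the languages shown depend on the packages you installed):

$> tesseract --list-langs
List of available languages (2):
deu
eng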
Script 3: Copy to Synology NAS
Our Synology DiskStation NAS is our primary storage and backup server, so the prepared documents should be stored there.
SFTP is used for the transfer. To allow passwordless access, SSH has to be enabled on the NAS and some keys have to be exchanged.
First log into your DiskStation web interface, then enable the needed services:
- Configuration Manager
- Terminal & SNMP → Terminal → Enable SSH service
- File Services → FTP → SFTP → Enable SFTP Service
Now SSH into the NAS, log in as root (the password is the same as for the admin user) and edit /etc/passwd: change the shell of all users you want to give access from /sbin/nologin to /bin/sh.
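If you'd rather not edit the file by hand, a sed one-liner can make the same change. This is just a sketch for a hypothetical user andi, and it assumes the DiskStation's sed supports the -i flag:

# switch andi's login shell from /sbin/nologin to /bin/sh
$> sed -i 's#^\(andi:.*\):/sbin/nologin$#\1:/bin/sh#' /etc/passwd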
Now log back in on the Raspberry Pi and create an SSH key for the pi user. Don't use a passphrase, and make sure you create a DSA key or you will have problems with curl later!
$> ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/home/pi/.ssh/id_dsa):
Created directory '/home/pi/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/pi/.ssh/id_dsa.
Your public key has been saved in /home/pi/.ssh/id_dsa.pub.
Next copy this key to all the users that will use the scanner:
$> ssh-copy-id -i ~/.ssh/id_dsa.pub andi@diskstation
andi@diskstation's password:
Now try logging into the machine, with "ssh 'andi@diskstation'", and check in:

  ~/.ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.
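To verify the key is actually picked up, a login should now work without any password prompt (andi is the example user from above):

$> ssh -i ~/.ssh/id_dsa andi@diskstation 'echo key login works'
key login works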
Now our script can copy over the created PDF files:
- 03-nascopy.sh
#!/bin/bash BASE="/tmp" HOST="diskstation" FOLDER="documents" YEAR=`date '+%Y'` if [ -z "$1" ]; then echo "Usage: $0 <jobid> <user> [<keyword>]" echo echo "Please provide existing jobid as first parameter" exit 1 fi if [ -z "$2" ]; then echo "Usage: $0 <jobid> <user> [<keyword>]" echo echo "Please provide user as second parameter" exit 1 fi OUTPUT="$BASE/$1" REMOTE="sftp://$2@$HOST/home/$FOLDER/$YEAR/$3/$1.pdf" LOCAL="$OUTPUT/$1.pdf" if [ ! -f "$LOCAL" ]; then echo "jobid does not exist" exit 1 fi echo copying to $REMOTE curl --ftp-create-dirs --insecure -T "$LOCAL" "$REMOTE"
This time the script expects two more parameters after the job ID: a user name (one of the users whose access we just set up) and an optional keyword. The keyword is used as a sub folder inside the documents folder configured at the top of the script. This will be our main way of categorizing scans later on: a menu will allow picking a keyword, and the scan will automatically be executed and placed in the right folder. Additionally, a sub folder for the current year is created.
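Put together, a call with the hypothetical keyword invoices would look like this; the echoed target path comes from the script, with the year sub folder filled in from the current date:

$> ./03-nascopy.sh 2016-01-15_093000 andi invoices
copying to sftp://andi@diskstation/home/documents/2016/invoices/2016-01-15_093000.pdf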
Script 4: Copy to Google Drive
A backup on the NAS is good, but a second, off-site backup is better. Having excellent search on top of that is better still. That's why I want a second copy on Google Drive.
For that we make use of the excellent rclone utility. It can copy files to and from various cloud storage services, one of them being Google Drive.
First install the Linux ARM binary:
$> wget http://downloads.rclone.org/rclone-v1.05-linux-arm.zip
$> unzip rclone-v1.05-linux-arm.zip
$> sudo cp rclone-v1.05-linux-arm/rclone /usr/local/bin/
$> sudo chmod 755 /usr/local/bin/rclone
$> sudo mkdir -p /usr/local/man/man1
$> sudo cp rclone-v1.05-linux-arm/rclone.1 /usr/local/man/man1/
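A quick check that the binary is found on the PATH:

$> which rclone
/usr/local/bin/rclone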
Next create a profile for every user that will use the service later. Be sure to open the displayed URL and authorize with the correct Google user! Name the remote profile exactly like the user.
$> rclone --config=$HOME/.rclone.conf config
Just follow the interactive dialog to create a “remote” of type drive, named after your user.
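Afterwards $HOME/.rclone.conf should contain a section named after the user, roughly like the sketch below; the exact fields (especially the token format) depend on the rclone version:

[andi]
type = drive
client_id =
client_secret =
token = {"access_token":"...","expiry":"..."}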
Then the following script will take care of copying the finished PDF to your Google Drive. Since I occasionally got errors from the Google API, it retries the upload up to three times before giving up.
- 04-gdrivecopy.sh
#!/bin/bash BASE="/tmp" FOLDER="documents" YEAR=`date '+%Y'` if [ -z "$1" ]; then echo "Usage: $0 <jobid> <user> [<keyword>]" echo echo "Please provide existing jobid as first parameter" exit 1 fi if [ -z "$2" ]; then echo "Usage: $0 <jobid> <user> [<keyword>]" echo echo "Please provide user as second parameter" exit 1 fi OUTPUT="$BASE/$1" REMOTE="$2://$FOLDER/$YEAR/$3/" LOCAL="$OUTPUT/$1.pdf" if [ ! -f "$LOCAL" ]; then echo "jobid does not exist" exit 1 fi for X in 1 2 3; do echo "uploading to Google Drive (try $X)" if rclone --config=$HOME/.rclone.conf copy "$LOCAL" "$REMOTE"; then exit 0 fi sleep 15 # wait 15 seconds before retrying done exit 1
It takes exactly the same parameters as the NAS copy script above.
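So, continuing the example from the NAS copy step (the keyword invoices is again hypothetical); the status line is printed by the retry loop in the script:

$> ./04-gdrivecopy.sh 2016-01-15_093000 andi invoices
uploading to Google Drive (try 1)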
Script 5: Cleanup
Nothing to see here. Just delete the directory containing all the temporary files:
- 05-cleanup.sh
#!/bin/bash BASE="/tmp" if [ -z "$1" ]; then echo "Usage: $0 <jobid>" echo echo "Please provide existing jobid as first parameter" exit 1 fi OUTPUT="$BASE/$1" if [ ! -d "$OUTPUT" ]; then echo "jobid does not exist" exit 1 fi rm -rf "$OUTPUT"
Executing the whole chain
Finally we need a way to execute each of the steps above in one go. That's where the controller script comes into play.
- scan.sh
#!/bin/bash
DIR=$( cd $( dirname "${BASH_SOURCE[0]}" ) && pwd )
JOBID=`date '+%Y-%m-%d_%H%M%S'`
USER=$1
KEYWORD=$2

if [ -z "$USER" ]; then
    echo "Usage: $0 <user> [<keyword>]"
    echo "please give a user"
    exit 1
fi

# run the scanning in foreground
$DIR/01-scan.sh "$JOBID"

# execute processing in background
(
    # lock processing to make sure only one is running at a time
    (
        flock -x 200 # wait for lock
        $DIR/02-createpdf.sh "$JOBID"
        $DIR/03-nascopy.sh "$JOBID" "$USER" "$KEYWORD"
        $DIR/04-gdrivecopy.sh "$JOBID" "$USER" "$KEYWORD"
        $DIR/05-cleanup.sh "$JOBID"
    ) 200>/tmp/scan.lock
) &
What I did here is execute the scanning process in the foreground and then start the whole time-consuming PDF creation in the background. The background process is locked to make sure only one of them is ever running at a time, even when multiple scans have been started.
This allows me to quickly scan a couple of documents without needing to wait for anything but the scanner. The processing then can take all the time it needs.
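A complete run for user andi with the hypothetical keyword invoices then boils down to a single command. The prompt returns as soon as the scanner is done, while processing continues in the background:

$> ./scan.sh andi invoices
scanning...
Output in /tmp/2016-01-15_093512/scan*.pnm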