This week I've mostly been trying to figure out a way to digitise some of my bank statements to make accountancy and archiving easier.

I figured there must be a way to scan them in using a document scanner, then use OCR software to render the data into Excel.

Of course, nothing does this natively, but there are ways to do it.

There are three parts to this process…

  1. Document scanning
  2. Data format conversion
  3. Optical character recognition (OCR)

For the scanning part I used a Ricoh Aficio MP C4500, which acts not only as a photocopier, but also as a fax machine, network printer and network scanner.

The main attraction is its ability to scan with document processing. This means (once configured) that it takes only a few minutes to scan a few hundred pages, rather than the few hours it would take with a conventional scanner.

This device has the ability to save scanned documents as PDF or TIF. I decided to use the more versatile TIF image format.

I figured it would be easier to convert a standard image format than the proprietary PDF format. I very quickly discovered that it wasn’t that simple.

To begin with I had heard good things about two leading OCR software packages…

  • ABBYY FineReader 9.0
  • Able2Extract Pro 5.0

I found that neither of these software packages would accept TIF files, so I converted each directory of TIF files into a single PDF file using the DreamSys Tiff to PDF Converter, which was a very quick and effective command-line tool for doing the task.

I started by trying out Able2Extract, as I’d not used it before and I had read good things about its PDF to Excel conversion. The problem was that it kept throwing up a “Fatal Internal Error #24”, which I was unable to work around.

I had used ABBYY in the past, and it turned out to be very good OCR software, for converting to Word at least. However, trying to configure it to scan the areas I wanted into Excel seemed almost impossible and took forever.

I decided to go back to the drawing board…

I quickly realised that there were a few “web 2.0” software-as-a-service tools out there that could assist me…

There are many universal “online document converters”; a few include…

My issue with most of these is that they would recognise the PDF, but would not run OCR on the images inside it. They are only designed to take the text already found in a PDF file and extract that to Excel.

I also came across a few other software packages that will convert from PDF files to Excel…

  • deskUNPDF Professional by docudesk
  • PDF2XL Enterprise by cogniview
  • ExpressConversion Server by adlibsoftware
  • PDF To XLS by verypdf
  • Solid Converter PDF by soliddocuments
  • TotalPDFConverter by coolutils

With all this proprietary software available for converting from PDF to Excel (XLS or CSV), it’s obvious that this is a service very much in demand.

So far I’m waiting on the following:

  • Zamzar and others to email me their conversion of the PDF file I uploaded to them.
  • Able2Extract’s support team to get back to me regarding the error.
  • Cogniview PDF2XL Enterprise to download.

Let’s see what gets the job done best, first, if any…

Zamzar is unable to convert from PDF to XLS, directly or indirectly. Meanwhile, finereaderonline will only accept images, not PDF files, which is OK but may take some time, especially at only 10 pages per day.

Able2Extract were unable to help me with the error unless I sent them the PDF, which I can’t do due to its content.

I sent DreamSys an email asking them to send these guys a sample output PDF instead, but I’ve heard nothing back yet.

Able2Extract recommended I use their Sonic PDF Creator product to convert the original scanned TIFF files to a PDF which they claim their Able2Extract product will be able to read.

I had a play with Sonic PDF Creator, but it appeared to be unable to import a directory of TIF files, only each TIF file individually. This seemed like a painstaking process.

I needed to merge the TIF files together into one file to make it easier. PTGui Pro is able to stitch TIF files together; however, it gave me an error saying:

“Error loading TIFF file: Unsupported number of bits per sample (only 8, 16 or 32 are supported) or unsupported sample format.”

Useless.

Back to square one.

I decided to try my luck with PDF2XL which, after a little teething problem getting a working copy to begin with, seemed to be quite a neat package.

PDF2XL was able to detect that the PDF I had given it (the one I created using the DreamSys Tiff to PDF Converter) was a scanned document and began performing OCR on the file.

The results were almost perfect. VERY impressive. It seems that PDF2XL is able to do what nothing else could, not even Able2Extract. Don’t waste your time with anything else; PDF2XL seems to be all you need.

The only thing I will say is that I had to tweak the OCR settings a little to get it to render the page correctly.

Just when I thought everything was going great, I discovered another issue I had completely overlooked. The output was not in the correct order. I took a look at the original PDF and soon discovered that the images in the PDF were not in the correct order either.

First off, I decided that I should rename all the images based on the sheet number, so that I could be sure that they were correctly ordered. I created a batch file to preview and rename the images, called “imgrename.bat”. I was also able to use this to rename an entire directory at a time.
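For anyone who would rather script this than do it by hand, here is a minimal Python sketch of the same idea. It is not the contents of imgrename.bat. It assumes you have already previewed each scan and noted its sheet number, and the “sheet_” prefix, the four-digit padding and the function names are all illustrative choices of mine. The zero-padding is the useful part: tools that sort file names as plain text would otherwise put sheet 10 ahead of sheet 2.

```python
import os


def sheet_filename(sheet_number, prefix="sheet_", width=4, ext=".tif"):
    """Build a zero-padded name such as sheet_0007.tif (naming scheme is an assumption)."""
    return f"{prefix}{sheet_number:0{width}d}{ext}"


def plan_renames(sheet_numbers):
    """Map each current file name to its new zero-padded name.

    sheet_numbers: dict of {current_file_name: sheet_number}, filled in by
    hand after previewing each scan.
    """
    return {old: sheet_filename(num) for old, num in sheet_numbers.items()}


def apply_renames(directory, plan):
    """Rename the files on disk, refusing to overwrite an existing file."""
    for old, new in plan.items():
        src = os.path.join(directory, old)
        dst = os.path.join(directory, new)
        if os.path.exists(dst):
            raise FileExistsError(dst)
        os.rename(src, dst)


if __name__ == "__main__":
    # Hypothetical scanner file names mapped to the sheet number on each page.
    plan = plan_renames({"scan_a.tif": 10, "scan_b.tif": 2, "scan_c.tif": 1})
    # Print the plan in sheet order so it can be checked before renaming.
    for old, new in sorted(plan.items(), key=lambda item: item[1]):
        print(old, "->", new)
```

Printing the plan first and only then calling apply_renames keeps the preview step separate from the actual renaming.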

I tried rebuilding the PDF using the DreamSys Tiff to PDF Converter, which I had used to create the PDF file in the first place, only to find that it was still out of order and I couldn’t influence the sorting. I had to find something else to do the job…

Enter libtiff and its Windows counterpart, Tiff for Windows by GnuWin32. Once I had a copy of this I was able to write another batch file that would not only combine the TIFF files (with pages) into a single TIFF file (using tiffcp.exe) but also convert that into a PDF (using tiff2pdf.exe). I called this script “tiff2pdf.bat”.
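Roughly, the two steps look like this. This is a Python sketch rather than the contents of tiff2pdf.bat. It assumes the libtiff tools tiffcp and tiff2pdf are on the PATH, and the function names and the temporary file name are made up for illustration. Sorting the file list explicitly is what gives you control over the page order.

```python
import glob
import os
import subprocess
import tempfile


def build_commands(tiff_files, combined_tiff, output_pdf):
    """Return the two libtiff command lines as argument lists.

    tiffcp takes the input files followed by the output file;
    tiff2pdf writes to the file named after -o.
    """
    ordered = sorted(tiff_files)  # explicit sort controls the page order
    merge = ["tiffcp", *ordered, combined_tiff]
    convert = ["tiff2pdf", "-o", output_pdf, combined_tiff]
    return merge, convert


def tiffs_to_pdf(directory, output_pdf):
    """Merge every .tif file in a directory into one PDF (needs the libtiff tools)."""
    tiff_files = glob.glob(os.path.join(directory, "*.tif"))
    if not tiff_files:
        raise FileNotFoundError(f"no .tif files found in {directory}")
    # Keep the intermediate multi-page TIFF out of the source directory.
    with tempfile.TemporaryDirectory() as tmp:
        combined = os.path.join(tmp, "combined.tif")  # hypothetical name
        merge, convert = build_commands(tiff_files, combined, output_pdf)
        subprocess.run(merge, check=True)
        subprocess.run(convert, check=True)
```

With a few hundred files the tiffcp command line can get very long, so on Windows you may need to merge in smaller batches.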

Bingo!

All was successful…

The next task is first to clean up anything bad left in the CSV file by OCR (cleancsv.bat), then to “normalise” the data…

But that’s another story for another time. We’ve done what we set out to do, which was to scan to Excel, and all is well on that front.
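To give a rough idea of what that kind of cleanup could involve, here is a small Python sketch. It is not cleancsv.bat, and what counts as “bad” will depend on your own statements. The two rules here (trim stray whitespace from every cell, drop rows with nothing in them) and the function names are assumptions for illustration only.

```python
import csv
import io


def clean_rows(rows):
    """Trim whitespace from every cell and drop rows with no content."""
    for row in rows:
        cells = [cell.strip() for cell in row]
        if any(cells):
            yield cells


def clean_csv_text(text):
    """Clean CSV text and return the cleaned CSV as a string."""
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(clean_rows(reader))
    return out.getvalue()


if __name__ == "__main__":
    # Made-up sample rows, not real statement data.
    sample = " 01/02/2009 , Coffee ,  2.50 \n,,\n02/02/2009,Rent,500.00\n"
    print(clean_csv_text(sample), end="")
```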

Hope you find this useful. Enjoy!