NO: Investigation of Bulk PDF (OCR) Obfuscation of Identity Information

Stephen Miller stephenmiller1958 at gmail.com
Sat Mar 18 23:03:47 UTC 2023


All

A quick summary of my investigation of a specific problem: more than a
thousand monthly bank statements across three accounts, going back to 2001
and in one case to 1993.

I wish to analyse all the data from these accounts and look at my complete
history with the Bank - in my case St George which is an excellent bank
from a service point of view.

I also wish to use these pdf statements to test some online resources. One
is a very powerful accounting pdf product HQ'ed in California but where the
data is processed in India, and you can use machine learning on your own
input to enhance the output. The other is a small operation in SE Asia. In
neither case do I want ANY identity data to be left in the pdfs when I
load them for processing online. The minimum level of processing would
involve removing BSB and Account numbers and, for example,
replacing the true values with "MyBSB01" and "MyAccount01" in every
instance where this data appears in any pdf statement. (It will be easy
for me to patch this back together as I will keep a table of all
references.) Further, I want to obfuscate payments to any account, whether
mine or a third party's, using the same approach. Obviously my Account Names,
my address etc. will receive the same treatment. (My obfuscation will be
unique at a statement level as well, but no further on that here.)
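
A minimal Python sketch of the table-based replacement described above. It
assumes the text has already been extracted from the pdf, that a BSB looks
like "NNN-NNN" and that an account number is exactly 9 digits. The class
name, the patterns and the sample numbers are all hypothetical and would
need adjusting to the real statements.

```python
import re

# Assumed formats (not taken from any real statement): BSB "NNN-NNN",
# account number = exactly 9 digits.
BSB_RE = re.compile(r"\b\d{3}-\d{3}\b")
ACCOUNT_RE = re.compile(r"\b\d{9}\b")


class Obfuscator:
    """Swap identifiers for stable placeholders and remember the mapping."""

    def __init__(self):
        self.table = {}     # true value -> placeholder
        self.counters = {}  # placeholder prefix -> last number issued

    def _placeholder(self, prefix, value):
        # The same true value always maps to the same placeholder.
        if value not in self.table:
            n = self.counters.get(prefix, 0) + 1
            self.counters[prefix] = n
            self.table[value] = f"{prefix}{n:02d}"
        return self.table[value]

    def scrub(self, text):
        text = BSB_RE.sub(lambda m: self._placeholder("MyBSB", m.group()), text)
        text = ACCOUNT_RE.sub(lambda m: self._placeholder("MyAccount", m.group()), text)
        return text

    def restore(self, text):
        # Longest placeholders first, so "MyAccount10" never eats part of
        # "MyAccount100".
        pairs = sorted(self.table.items(), key=lambda kv: len(kv[1]), reverse=True)
        for value, placeholder in pairs:
            text = text.replace(placeholder, value)
        return text


ob = Obfuscator()
line = "BSB 123-456 Account 123456789 transfer to 987654321"
scrubbed = ob.scrub(line)
print(scrubbed)      # BSB MyBSB01 Account MyAccount01 transfer to MyAccount02
print(ob.table)      # the table of references to keep privately
```

The table held in `ob.table` is what allows the results to be patched back
together later with `restore`. Names and addresses would need their own
explicit entries, since no pattern can reliably recognise them.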


The problem is that none of the commercial pdf software that I could find
allows global search and replace without user intervention in one pdf
document, let alone directories of pdfs.

Several solutions were mentioned to me but essentially they all required
the OCR and conversion of documents to other formats such as Word, html or
Excel.

The pdf is essentially a page layout format based originally on PostScript.
The current implementation of Ghostscript has a very powerful pdf module
built into it that has recently been rewritten. Ghostscript DOES support
Global Search and Replace and will work on directories of files. However,
its OCR is not as smart as others. In particular, although NitroPdf gets an
honourable mention, it appears that Acrobat Pro is the clear winner in
accurate OCR. Conversion to Excel seems to be the key to this, along with
the built-in smarts it has for tables in bank statements. I am sure that in
the analysis of the bank statements Adobe is aware it is a bank statement
and is making certain assumptions in conversion to Excel which it doesn't
make in other conversions. For example, let us assume that in some
instances a "1" might be interpreted as an "I". In the Excel conversions it
appears that in table data it knows the importance of a table header titled
balance, debit, credit etc. and knows that in those columns what it thinks
is an "I" is in all probability a "1". This means there are virtually no
errors in the OCR process using Adobe and Convert to Excel. By "no" errors
I mean that in the small number of cases where this could be an issue it
gets it right 999 times in a thousand. (Post-processing, for example a
script to create a table and populate it, can be programmed to catch
this small number of errors.)
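
One way such a post-processing check might look in Python is sketched
below. It assumes each table row has been read into a dict keyed by column
header, that the numeric columns are named debit, credit and balance, and
that amounts look like "1,034.10". The lookalike list and function names
are illustrative assumptions, not anything Acrobat itself provides.

```python
import re

# Letters OCR commonly confuses with digits (an assumed, incomplete list).
LOOKALIKES = str.maketrans({"I": "1", "l": "1", "O": "0", "o": "0", "S": "5", "B": "8"})

# Assumed amount format: optional minus sign, optional thousands commas,
# optional two decimal places.
AMOUNT_RE = re.compile(r"^-?(\d{1,3}(,\d{3})*|\d+)(\.\d{2})?$")

NUMERIC_COLUMNS = {"debit", "credit", "balance"}


def repair_amount(cell):
    """Return the repaired amount, "" for an empty cell, or None if unfixable."""
    cell = cell.strip()
    if not cell:
        return ""
    fixed = cell.translate(LOOKALIKES)
    return fixed if AMOUNT_RE.match(fixed) else None


def repair_row(row):
    """Repair numeric columns in place; return the columns needing manual review."""
    suspects = []
    for column, cell in row.items():
        if column.lower() not in NUMERIC_COLUMNS:
            continue  # never touch description or other text columns
        value = repair_amount(cell)
        if value is None:
            suspects.append(column)
        else:
            row[column] = value
    return suspects


row = {"Description": "BILL PAYMENT", "Debit": "I2.50", "Credit": "", "Balance": "1,O34.I0"}
print(repair_row(row), row["Debit"], row["Balance"])
```

Restricting the substitution to known numeric columns mirrors the
header-aware behaviour described above: an "I" in a description is left
alone, while an "I" under a balance header is treated as a "1".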

So on OCR capacity Acrobat Pro seems to be a clear winner. Another feature
of Acrobat Pro is its capacity to handle bulk files and its amazing
"Actions" module. It also supports chaining actions and programmability
through JavaScript. Pro also has plenty of online examples and a message
board where you will find contributors who write the JavaScript for a fee.

So for ease of use and minimum personal effort I highly recommend Acrobat
Pro. My process will be to use Pro to output OCR'ed Excel files, do a
global search and replace on them, and then load those files back
into Pro for saving again as pdfs.
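
The search-and-replace step over a whole directory could be scripted along
these lines. This is only a sketch: it assumes the Excel sheets have been
saved as CSV text so that the Python standard library is enough, and the
function name, the "*.csv" pattern and the directory layout are
hypothetical.

```python
from pathlib import Path


def replace_in_directory(src_dir, dst_dir, mapping, pattern="*.csv"):
    """Write a copy of each matching file to dst_dir with mapping applied.

    mapping is {true value: placeholder}. Returns {file name: hit count}.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    # Longest keys first, so a short key never clobbers part of a longer one.
    ordered = sorted(mapping.items(), key=lambda kv: len(kv[0]), reverse=True)
    counts = {}
    for path in sorted(src.glob(pattern)):
        text = path.read_text(encoding="utf-8")
        hits = 0
        for true_value, placeholder in ordered:
            hits += text.count(true_value)
            text = text.replace(true_value, placeholder)
        # The originals are left untouched; only the copy is obfuscated.
        (dst / path.name).write_text(text, encoding="utf-8")
        counts[path.name] = hits
    return counts
```

A file that reports zero hits is worth a manual look, since it may mean the
OCR misread an identifier and left it in place under a different spelling.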

(Please note I have looked at many alternative solutions. If time were not
an issue, and if I did not want to check out the online solutions, I
probably would have gone down the Python-based OCR approach and written my
own solution.)

Hope this is of help to someone. Have a good look at the Acrobat Pro
"Tools" section, especially "Customize", and the awesome power available.




Kind Regards,

Stephen Miller

0455461581


More information about the omnisdev-en mailing list