NO: Saturday night and I am curious......

Stephen Miller stephenmiller1958 at gmail.com
Mon Mar 13 11:34:43 UTC 2023


Vik

Yes the cheap scanner I have at home works fine and so do the bulk machines
at Officeworks.

My issue has been finding a tool that can remove the account numbers in
bulk throughout the transaction table in a consistent manner, so that I can
patch the correct account numbers back in after these web-based sites have
processed the pdfs.

You are dead right, processing the pdfs is easy if they are non-image based. My
problem has been finding a tool that will do a global search and replace
automatically on all files. I will look at PDFkit again, thanks.
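
To make the "remove, then patch back" idea concrete, here is a minimal
sketch in Python. It assumes the statement text has already been extracted
from the pdfs, and that account numbers are nine digits with optional spaces
or hyphens between groups of three. That pattern, the ACCT00001 style
placeholder and the function names are all assumptions for illustration,
not anything the bank or these sites define.

```python
import re

# Hypothetical account-number pattern: nine digits, optionally grouped in threes.
ACCOUNT_RE = re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{3}\b")
PLACEHOLDER_RE = re.compile(r"ACCT\d{5}")


def obfuscate(text, mapping):
    """Replace each account number with a stable placeholder.

    `mapping` (real -> placeholder) is shared across all files so the same
    account always gets the same placeholder.
    """
    def swap(match):
        real = match.group(0)
        if real not in mapping:
            mapping[real] = f"ACCT{len(mapping) + 1:05d}"
        return mapping[real]

    return ACCOUNT_RE.sub(swap, text)


def restore(text, mapping):
    """Patch the real account numbers back in after processing."""
    reverse = {placeholder: real for real, placeholder in mapping.items()}
    return PLACEHOLDER_RE.sub(
        lambda m: reverse.get(m.group(0), m.group(0)), text
    )


mapping = {}
masked = obfuscate("Transfer from 123456789 to 987654321", mapping)
original = restore(masked, mapping)
```

The mapping is keyed on the exact text matched, so "123 456 789" and
"123456789" would get different placeholders; normalising first would be
needed if they should be treated as one account.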

The two products I have been looking at on-line are great as well - just
need to get the account identity info out programmatically.

Plenty of products will find a string, but very few will change it to another
string (some allow deletion or password font substitution).
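
As a rough illustration of why replacing is harder than finding, here is a
sketch of a same-length replacement on raw bytes. It assumes the text sits
uncompressed in the file as single-byte Windows-1252 codes; in many pdfs the
content streams are compressed, in which case a byte search like this finds
nothing. The sample bytes and the function name are made up for the example.

```python
def replace_same_length(data: bytes, old: str, new: str) -> bytes:
    """Swap one string for another of identical byte length.

    Keeping the length the same avoids shifting the byte offsets that the
    pdf cross-reference table relies on.
    """
    old_b = old.encode("cp1252")
    new_b = new.encode("cp1252")
    if len(old_b) != len(new_b):
        raise ValueError("replacement must have the same byte length")
    return data.replace(old_b, new_b)


# Hypothetical uncompressed fragment of a page content stream.
sample = b"BT /F1 10 Tf (Acct 123456789) Tj ET"
patched = replace_same_length(sample, "123456789", "000000001")
```

This only covers the simplest case; fonts with custom or subset encodings
would not store the digits as these byte values at all.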

There seems to be another approach worth investigating, that is using
Ghostscript.

Anyway onwards and upwards!




Kind Regards,

Stephen Miller

0455461581








On Mon, 13 Mar 2023 at 21:22, Vik Shah <OmnisList at keys2solutions.com.au>
wrote:

> There are a few ways PDFs are constructed that can be then deconstructed.
>
> If these are image-based PDFs then OCRing is the way; if not, then using
> PDFkit and text extraction is the way.
>
> There are many open source PDF extraction utilities out there, OR buy a
> cheap scanner that comes with powerful OCR software for free; you can
> always return the scanner…. :p
>
> I’ll leave this thread with a link and a few of the OCR tools I’ve used
> https://www.hellosign.com/blog/best-free-and-open-source-optical-character-recognition-ocr-software.
>
>
> PS: if your PDFs are encrypted, there’s an app for that too…. ;-)
>
> Regards,
>
> Vik Shah
> Keys2Solutions
> AU: +61 411 493 495
>
> > On 13 Mar 2023, at 14:41, Stephen Miller <stephenmiller1958 at gmail.com> wrote:
> >
> > Thanks for the feedback.
> >
> > The reason I want to do this is twofold.
> >
> > Firstly I want to analyse all my records with my bank for the complete
> > history of my relationship with the bank. One account goes back to 1993!
> > and the other three commenced in 2001! So I have a lot of pdfs to deal
> > with. (More than 1500!!!!)
> >
> > Essentially, like a lot of people I guess, there are a lot of inter-account
> > transfers as well as the usual cheques (checks) in the earlier years,
> > monthly periodical payments etc.
> >
> > In other words it will be great personally and maybe of some value
> > historically. (I don't imagine this has ever been done before).
> >
> > Thanks to accidentally being with a great bank, the St George, I am able to
> > access online the pdfs back to 2012. These load fine into Acrobat Pro or
> > Nitro and the OCR is perfect, and saving as Excel compatible does a great
> > job of identifying the transaction table in the statements. So extracting
> > the transaction lines for these and loading them into a database is simple.
> > The OCR issues became problematic for the earlier years; remember, here we
> > are talking about doing these in bulk, not one page at a time on a home
> > scanner etc.
> >
> > That led to my research online and I discovered two amazing products.
> >
> > The first is from a Chinese Australian guy with a small startup who is living
> > in Hong Kong and is doing this for a living. His OCR and coding are
> > written for bank statements and are very accurate. The other is from a
> > California-based company which is using machine learning and AI with a
> > user-controllable environment to teach the system how to interpret a bank
> > statement, or any other document for that matter.
> >
> > So I am excited to get some hands-on experience with real AI systems, but in
> > both cases I am not interested in my personal details going to India in one
> > case and Hong Kong in the other.
> >
> > So that led me to the obfuscation problem, and the possibility of a software
> > product as a result, as I would not be the only one with these concerns.
> >
> > So into the rabbit hole.
> >
> > On the internet there is much complaint about Adobe not having a global
> > search and replace. Their answer is correct in my view, that pdf is a
> > global document interchange format and they are not pretending to be MS
> > Word which is commercially a very wise decision.
> >
> > So to test these two online products I will need to create a list of items
> > to be obfuscated, then do a bulk convert to Excel (xml) using Nitro, then use
> > TextEdit Pro to bulk search and replace, then write an Omnis app to exactly
> > reproduce the statements, then print them as pdfs and test these online
> > systems with those. (Of course if I just wanted the data I could just load
> > and insert from the xml.)
> >
> > Still, the little bit of extra work seems worth it to test these online
> > systems, especially the meaty machine learning AI one.
> >
> > Any more suggestions gratefully received.
> >
> >
> > On Sat, 11 Mar 2023 at 22:46, Mike Matthews - Omnis via omnisdev-en <
> > omnisdev-en at lists.omnis-dev.com> wrote:
> >
> >> Interesting.  Does Acrobat Pro help with maybe a scripting dictionary?
> >>
> >> Some good prices on TVs right now, clearing out Christmas stock :)
> >>
> >> Mike Matthews
> >>
> >> Lineal Software Solutions
> >> Commercial House, The Strand, Barnstaple,
> >> Devon, EX31 1EU
> >>
> >> omnis at lineal.co.uk
> >>
> >> www.lineal.co.uk
> >>
> >> www.sqlworks.co.uk
> >>
> >>
> >>
> >> On 11 Mar 2023, at 08:22, Stephen Miller <stephenmiller1958 at gmail.com> wrote:
> >>
> >>
> >> Hi All
> >>
> >> My challenge for this Saturday night is the following...
> >>
> >> I have a whole lot, 100's, of pdfs that all use Times New Roman.
> >>
> >> It appears that the pdf standard does not store an ASCII value for the letter
> >> "A", or for a string using the same font and size like "ABC"; it stores a
> >> pointer to the specific WinAnsi font table position for each character.
> >>
> >> Now I know why Adobe only lets you replace one at a time, no replace all,
> >> as it depends on the font.
> >>
> >> Now in my case they are all Times New Roman, so hopefully, as I know
> >> the size and style of the font from looking in edit mode in Acrobat Pro, I
> >> should be able to use a hex editor such as Neo to do a global find and
> >> replace???
> >>
> >> For those that have some idea what I am talking about, I think the Wikipedia
> >> page on "Windows-1252", and the table on that page, is useful, correct?
> >>
> >> Please note all the documents are the same, bank statements, but there are
> >> hundreds of them and I want to obfuscate the identity data of the account
> >> holder and all accounts this person has transactions with.
> >>
> >> Possible, or should I take up drinking again and buy a television?
> >>
> >> Kind Regards,
> >>
> >> Stephen Miller
> >>
> >> 0455461581
> >> _____________________________________________________________
> >> Manage your list subscriptions at https://lists.omnis-dev.com
> >> Start a new message -> mailto:omnisdev-en at lists.omnis-dev.com
> >>

