NO: Saturday night and I am curious......
Stephen Miller
stephenmiller1958 at gmail.com
Mon Mar 13 12:23:06 UTC 2023
Vik et al
The worked pdfkit solution uses a three-step strategy: convert to HTML, do a
global search and replace, then convert back to PDF.

This is the next approach I will try, but on Windows only at this stage, using
local converters. Initial testing suggests the formatting will stay pretty
close to the original, provided I use a replacement string of the same
length as the original string.
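
As a rough sketch of the search and replace step only (not the PDF to HTML
conversion, which depends on whichever converter is used), something like the
following would keep every replacement the same length as the original and
keep a mapping so the real numbers can be patched back later. The function
names, the "*.html" pattern and the UTF-8 encoding are all assumptions for
illustration, and it will miss any account number that the converter splits
across HTML tags.

```python
import random
import re
from pathlib import Path


def same_length_placeholder(original, rng, used):
    """Swap every digit for a random digit and keep all other characters,
    so the placeholder has the same length and layout as the original."""
    if not any(ch.isdigit() for ch in original):
        raise ValueError("this sketch only handles strings containing digits")
    while True:
        candidate = "".join(
            rng.choice("0123456789") if ch.isdigit() else ch for ch in original
        )
        # Reject collisions so the mapping can be reversed unambiguously.
        if candidate not in used:
            used.add(candidate)
            return candidate


def build_mapping(originals, seed=0):
    """Map each sensitive string to a unique same-length placeholder."""
    rng = random.Random(seed)
    used = set(originals)  # a placeholder must never equal a real value
    return {o: same_length_placeholder(o, rng, used) for o in originals}


def replace_all(text, mapping):
    """Replace every key of mapping in one pass over the text."""
    if not mapping:
        return text
    # Longest keys first, so a short key never matches inside a longer one.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


def restore_all(text, mapping):
    """Patch the original strings back in after processing."""
    return replace_all(text, {v: k for k, v in mapping.items()})


def obfuscate_folder(folder, mapping, pattern="*.html"):
    """Apply the mapping to every matching file in a folder, in place.
    Assumes the converter wrote UTF-8 text files."""
    for path in Path(folder).glob(pattern):
        text = path.read_text(encoding="utf-8")
        path.write_text(replace_all(text, mapping), encoding="utf-8")
```

The mapping would need to be saved somewhere safe (and never uploaded) so that
restore_all can be run on whatever comes back from the online systems.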
Kind Regards,
Stephen Miller
0455461581
On Mon, 13 Mar 2023 at 22:34, Stephen Miller <stephenmiller1958 at gmail.com>
wrote:
> Vik
>
> Yes the cheap scanner I have at home works fine and so do the bulk
> machines at Officeworks.
>
> My issue has been finding a tool where I can remove the account numbers in
> bulk throughout the transaction table in a coherent manner, so that I can
> patch the correct account numbers back after these web based sites have
> processed the pdfs.
>
> You are dead right, processing the pdfs is easy if they are non-image based. My
> problem has been finding a tool that will do a global search and replace
> automatically on all files. I will look at the pdf kit again, thanks.
>
> The two products I have been looking at on-line are great as well - just
> need to get the account identity info out programmatically.
>
> Plenty of products will find a string, very few will change to another
> string (some allow deletion or password font substitution).
>
> There seems to be another approach worth investigating, that is using
> Ghostscript.
>
> Anyway onwards and upwards!
>
> Kind Regards,
>
> Stephen Miller
>
> 0455461581
>
> On Mon, 13 Mar 2023 at 21:22, Vik Shah <OmnisList at keys2solutions.com.au>
> wrote:
>
>> There are a few ways PDFs are constructed that can be then deconstructed.
>>
>> If these are image based PDFs then OCRing is the way; if not, then using
>> PDFkit and text extraction is the way.
>>
>> There are many open source PDF extraction utilities out there, OR buy a
>> cheap scanner that comes with powerful OCR software for free; you can
>> always return the scanner…. :p
>>
>> I’ll leave this thread with a link and a few of the OCR tools I’ve used
>> https://www.hellosign.com/blog/best-free-and-open-source-optical-character-recognition-ocr-software.
>>
>>
>> PS: if your PDFs are encrypted, there’s an app for that too…. ;-)
>>
>> Regards,
>>
>> Vik Shah
>> Keys2Solutions
>> AU: +61 411 493 495
>>
>> > On 13 Mar 2023, at 14:41, Stephen Miller <stephenmiller1958 at gmail.com> wrote:
>> >
>> > Thanks for the feedback.
>> >
>> > The reason I want to do this is twofold.
>> >
>> > Firstly I want to analyse all my records with my bank for the complete
>> > history of my relationship with the bank. One account goes back to 1993!
>> > and the other three commenced in 2001! So I have a lot of pdfs to deal
>> > with. (More than 1500!!!!)
>> >
>> > Essentially, like a lot of people I guess, there are a lot of inter-account
>> > transfers as well as the usual cheques (checks) in the earlier years,
>> > monthly periodical payments etc.
>> >
>> > In other words it will be great personally and maybe of some value
>> > historically. (I don't imagine this has ever been done before).
>> >
>> > Thanks to accidentally being with a great bank, the St George, I am able to
>> > access online the pdfs back to 2012. These load fine into Acrobat Pro or
>> > Nitro and the OCR is perfect and saving as Excel compatible does a great
>> > job of identifying the transaction table in the statements. So extracting
>> > the transaction lines for these and loading them into a database is simple.
>> >
>> > The OCR issues became problematic for the earlier years. Remember, we
>> > are talking about doing these in bulk, not one page at a time on a home
>> > scanner etc.
>> >
>> > That led to my research online and I discovered two amazing products.
>> >
>> > The first is from a Chinese Australian guy with a small startup who is living
>> > in Hong Kong and is doing this for a living. His OCR and coding are
>> > written for bank statements and are very accurate. The other is a
>> > California-based company which is using AI machine learning with a
>> > user controllable environment to teach the system how to interpret a bank
>> > statement, or any other document for that matter.
>> >
>> > So I am excited to get some hands-on experience with real AI systems, but in
>> > both cases I am not interested in my personal details going to India in one
>> > case and Hong Kong in the other.
>> >
>> > So that led me to the obfuscation problem and the possibility of a software
>> > product as a result, as I would not be the only one with these concerns.
>> >
>> > So into the rabbit hole.
>> >
>> > On the internet there is much complaint about Adobe not having a global
>> > search and replace. Their answer is correct in my view, that pdf is a
>> > global document interchange format and they are not pretending to be MS
>> > Word which is commercially a very wise decision.
>> >
>> > So to test these two on-line products I will need to create a list of items
>> > to be obfuscated, then do a bulk convert to Excel (xml) using Nitro, then use
>> > TextEdit Pro to bulk search and replace, then write an Omnis app to exactly
>> > reproduce the Statements, print them as Pdfs and test these online
>> > systems with those. (Of course if I just wanted the data I could just load
>> > and insert from the Xml).
>> >
>> > Still, the little bit of extra work seems worth it to test these online
>> > systems, especially the meaty machine learning AI one.
>> >
>> > Any more suggestions gratefully received.
>> >
>> >
>> > On Sat, 11 Mar 2023 at 22:46, Mike Matthews - Omnis via omnisdev-en <
>> > omnisdev-en at lists.omnis-dev.com> wrote:
>> >
>> >> Interesting. Does Acrobat Pro help with maybe a scripting dictionary?
>> >>
>> >> Some good prices on TVs right now, clearing out Christmas stock :)
>> >>
>> >> Mike Matthews
>> >>
>> >> Lineal Software Solutions
>> >> Commercial House, The Strand, Barnstaple,
>> >> Devon, EX31 1EU
>> >>
>> >> omnis at lineal.co.uk
>> >>
>> >> www.lineal.co.uk
>> >>
>> >> www.sqlworks.co.uk
>> >>
>> >>
>> >>
>> >> On 11 Mar 2023, at 08:22, Stephen Miller <stephenmiller1958 at gmail.com> wrote:
>> >>
>> >> Hi All
>> >>
>> >> My challenge for this Saturday night is the following...
>> >>
>> >> I have a whole lot, 100's, of pdfs that all use Times New Roman.
>> >>
>> >> It appears that the pdf standard does not store an ASCII value for the letter
>> >> "A" or for a string in the same font and size like "ABC"; it stores a pointer
>> >> to the specific WinAnsi font table position for each character.
>> >>
>> >> Now I know why Adobe only lets you replace one at a time, no replace all,
>> >> as it depends on the font.
>> >>
>> >> Now in my case they are all Times New Roman, so hopefully, as I know
>> >> the size and style of the font from looking in edit mode in Acrobat Pro, I
>> >> should be able to use a hex editor such as Neo to do a global find and
>> >> replace???
>> >>
>> >> For those that have some idea what I am talking about, I think the Wikipedia
>> >> page on "Windows-1252", and the table on that page, is useful, correct?
>> >>
>> >> Please note all the documents are the same, bank statements, but there are
>> >> hundreds of them and I want to obfuscate the identity data of the account
>> >> holder and all accounts this person has transactions with.
>> >>
>> >> Possible, or should I take up drinking again and buy a television?
>> >>
>> >> Kind Regards,
>> >>
>> >> Stephen Miller
>> >>
>> >> 0455461581
>> >> _____________________________________________________________
>> >> Manage your list subscriptions at https://lists.omnis-dev.com
>> >> Start a new message -> mailto:omnisdev-en at lists.omnis-dev.com