NO: Saturday night and I am curious......
Vik Shah
OmnisList at Keys2Solutions.com.au
Mon Mar 13 10:20:47 UTC 2023
PDFs are constructed in a few different ways, and how you deconstruct one depends on which way was used.
If these are image-based PDFs, then OCR is the way to go; if not, text extraction with PDFKit is.
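A rough way to sort a big pile of statements into those two cases is to try text extraction first and only send a file to OCR when almost nothing comes out. This is only a sketch under assumptions: it uses the third-party pypdf library rather than PDFKit, and the names `needs_ocr` and `extract_page_texts` and the 20-character threshold are made up for illustration.

```python
def needs_ocr(page_texts, min_chars=20):
    """Guess whether a PDF is image based from its per-page extracted text.

    Returns True when more than half of the pages yield fewer than
    min_chars characters after trimming whitespace. The threshold is an
    arbitrary assumption, not a standard value.
    """
    if not page_texts:
        return True
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars)
    return sparse > len(page_texts) / 2


def extract_page_texts(path):
    """Extract text page by page with pypdf (third party: pip install pypdf)."""
    from pypdf import PdfReader  # imported lazily so needs_ocr works without it

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]
```

In use it would be something like `needs_ocr(extract_page_texts("statement.pdf"))`, where the file name is hypothetical. Scanned statements that already carry a hidden OCR text layer will look like text PDFs to this check.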
There are many open-source PDF extraction utilities out there, or you can buy a cheap scanner that comes with powerful OCR software for free; you can always return the scanner…. :p
I’ll leave this thread with a link that covers a few of the OCR tools I’ve used: https://www.hellosign.com/blog/best-free-and-open-source-optical-character-recognition-ocr-software
PS: if your PDFs are encrypted, there’s an app for that too…. ;-)
Regards,
Vik Shah
Keys2Solutions
AU: +61 411 493 495
> On 13 Mar 2023, at 14:41, Stephen Miller <stephenmiller1958 at gmail.com> wrote:
>
> Thanks for the feedback.
>
> The reason I want to do this is twofold.
>
> Firstly I want to analyse all my records with my bank for the complete
> history of my relationship with the bank. One account goes back to 1993!
> and the other three commenced in 2001! So I have a lot of pdfs to deal
> with. (More than 1500!!!!)
>
> Essentially, like a lot of people I guess, there are a lot of inter account
> transfers as well as the usual cheques (checks) in the earlier years,
> monthly periodical payments etc.
>
> In other words it will be great personally and maybe of some value
> historically. (I don't imagine this has ever been done before).
>
> Thanks to accidentally being with a great bank, the St George, I am able to
> access the pdfs online back to 2012. These load fine into Acrobat Pro or
> Nitro, the OCR is perfect, and saving in an Excel-compatible format does a
> great job of identifying the transaction table in the statements. So
> extracting the transaction lines for these and loading them into a database
> is simple.
>
> The OCR became problematic for the earlier years. Remember, we are talking
> about doing these in bulk, not one page at a time on a home scanner etc.
>
> That led to my research online and I discovered two amazing products.
>
> The first is from a Chinese Australian guy with a small startup who is
> living in Hong Kong and is doing this for a living. His OCR and code are
> written for bank statements and are very accurate. The other is from a
> California-based company which is using machine learning, with a
> user-controllable environment, to teach the system how to interpret a bank
> statement, or any other document for that matter.
>
> So I am excited to get some hands-on experience with real AI systems but in
> both cases I am not interested in my personal details going to India in one
> case and Hong Kong in the other.
>
> So that led me to the obfuscation problem and the possibility of a software
> product as a result as I would not be the only one with these concerns.
>
> So into the rabbit hole.
>
> On the internet there is much complaint about Adobe not having a global
> search and replace. Their answer is correct in my view, that pdf is a
> global document interchange format and they are not pretending to be MS
> Word which is commercially a very wise decision.
>
> So to test these two online products I will need to create a list of items
> to be obfuscated, then do a bulk convert to Excel (XML) using Nitro, then use
> TextEdit Pro to bulk search and replace, then write an Omnis app to exactly
> reproduce the statements, print them as PDFs and test these online
> systems with those. (Of course, if I just wanted the data I could just load
> and insert from the XML.)
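The bulk search-and-replace step over the exported text could also be scripted instead of done in an editor. A minimal sketch, assuming the export has already been read into a string; the function name `build_obfuscator` and every name and account number in the example are invented for illustration.

```python
import re


def build_obfuscator(replacements):
    """Return a function that swaps every sensitive string for its placeholder.

    All replacements happen in one pass, so a placeholder is never itself
    replaced, and longer keys are tried first so "Jane Citizen" wins over
    "Jane" when both are listed.
    """
    if not replacements:
        return lambda text: text
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))

    def obfuscate(text):
        return pattern.sub(lambda match: replacements[match.group(0)], text)

    return obfuscate


# Hypothetical list of items to obfuscate.
obfuscate = build_obfuscator({
    "Jane Citizen": "HOLDER-1",
    "Jane": "PERSON-2",
    "062-000 12345678": "000-000 00000001",
})
print(obfuscate("Transfer from Jane Citizen 062-000 12345678"))
```

Keeping the list as one mapping means the same placeholders are applied consistently across all the statements, so inter-account transfers still line up after obfuscation.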
>
> Still, the little bit of extra work seems worth it to test these online
> systems, especially the meaty machine learning AI one.
>
> Any more suggestions gratefully received.
>
>
> On Sat, 11 Mar 2023 at 22:46, Mike Matthews - Omnis via omnisdev-en <
> omnisdev-en at lists.omnis-dev.com> wrote:
>
>> Interesting. Does Acrobat Pro help with maybe a scripting dictionary?
>>
>> Some good prices on TVs right now, clearing out Christmas stock :)
>>
>> Mike Matthews
>>
>> Lineal Software Solutions
>> Commercial House, The Strand, Barnstaple,
>> Devon, EX31 1EU
>>
>> omnis at lineal.co.uk
>>
>> www.lineal.co.uk
>>
>> www.sqlworks.co.uk
>>
>>
>>
>> On 11 Mar 2023, at 08:22, Stephen Miller <stephenmiller1958 at gmail.com> wrote:
>>
>> Hi All
>>
>> My challenge for this Saturday night is the following...
>>
>> I have a whole lot, hundreds, of pdfs that all use Times New Roman.
>>
>> It appears that the pdf standard does not store an ASCII value for the
>> letter "A", or for a string using the same font and size like "ABC";
>> instead it stores a pointer to the specific WinAnsi font table position
>> for each character.
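For a simple font that uses WinAnsiEncoding, each character in a text string is one byte, and the byte values follow Windows code page 1252 closely. A small sketch of what that looks like, assuming Python's `cp1252` codec is a close enough stand-in for the WinAnsi table; the function name `winansi_codes` is made up for illustration.

```python
def winansi_codes(text):
    """Return the per-character codes and the PDF-style hex string for text.

    Uses Python's cp1252 codec as an approximation of WinAnsiEncoding.
    Raises UnicodeEncodeError for characters that cp1252 cannot represent.
    """
    data = text.encode("cp1252")
    hex_string = "<" + data.hex().upper() + ">"
    return list(data), hex_string


print(winansi_codes("ABC"))  # ([65, 66, 67], '<414243>')
```

For plain letters and digits the codes are the same as ASCII, so "ABC" can appear in a content stream either literally as `(ABC)` or as the hex string `<414243>`. Fonts that are embedded as subsets with their own custom encoding do not follow this table, in which case the codes only make sense together with that font.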
>>
>> Now I know why Adobe only lets you replace one at a time, no replace all,
>> as it depends on the font.
>>
>> Now in my case they are all Times New Roman, so hopefully, as I know the
>> size and style of the font from looking in edit mode in Acrobat Pro, I
>> should be able to use a hex editor such as Neo to do a global find and
>> replace?
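One catch with a raw hex-editor replace is that page content streams are usually Flate-compressed, so the name will generally not be visible in the file bytes at all; the stream has to be decompressed, edited and recompressed. A minimal sketch of that round trip on a single stream, assuming it has already been cut out of the PDF; the function name `replace_in_flate_stream` and the sample content are invented for illustration.

```python
import zlib


def replace_in_flate_stream(compressed, old, new):
    """Decompress a Flate stream, swap old for new, and recompress it.

    Only same-length replacements are accepted here, to keep the text
    layout on the page roughly where it was.
    """
    if len(old) != len(new):
        raise ValueError("replacement must be the same length as the original")
    raw = zlib.decompress(compressed)
    return zlib.compress(raw.replace(old, new))


# Hypothetical page content: three text-showing operations with the same name.
content = b"BT /F1 12 Tf 72 720 Td (JANE CITIZEN) Tj ET\n" * 3
stream = zlib.compress(content)
edited = replace_in_flate_stream(stream, b"JANE CITIZEN", b"XXXX XXXXXXX")
print(zlib.decompress(edited).decode("ascii"))
```

Even with a same-length replacement the recompressed stream can come out a different size, so the stream's /Length entry and the file's cross-reference offsets would need updating, which is what a PDF library does and a hex editor does not. A name can also be split across several pieces inside a `TJ` array for kerning, in which case a plain byte search will miss it.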
>>
>> For those that have some idea what I am talking about, I think the
>> Wikipedia page on "Windows-1252", and the table on that page, is useful.
>> Correct?
>>
>> Please note all the documents are the same kind, bank statements, but there
>> are hundreds of them and I want to obfuscate the identity data of the
>> account holder and of all accounts this person has transactions with.
>>
>> Possible, or should I take up drinking again and buy a television?
>>
>> Kind Regards,
>>
>> Stephen Miller
>>
>> 0455461581
>> _____________________________________________________________
>> Manage your list subscriptions at
>> https://lists.omnis-dev.com
>> Start a new message -> mailto:omnisdev-en at lists.omnis-dev.com