Python Forum
How to Extract Specific Words from PDFs with Python


How to Extract Specific Words from PDFs with Python - danvsv - Jan-17-2019

I need to copy specific strings from a PDF file and paste them into specific tags in an XML file. In the attached picture, the number 3 from the PDF goes into label tag 3 in the XML. A bold string in the PDF (the first string, Melnik BC.) goes into the <collab> tag, a normal string (The Pathogenic Role…) goes into the <article-title> tag, and an italic string (Current Diabetes…) goes into the <source> tag. Then the year 2015 goes into the year tag, 11 into volume, 46 into fpage, and 62 into lpage. Can anybody give me an idea of how to solve this in Python?

Thank you very much,
Dan

[Image: 1.jpg]


RE: How to Extract Specific Words from PDFs with Python - Larz60+ - Jan-17-2019

see: https://pypi.org/search/?q=pdf+image+extract

minecart (https://pypi.org/project/minecart/) looks promising. Sample code:
>>> import minecart
>>> pdffile = open('example.pdf', 'rb')
>>> doc = minecart.Document(pdffile)
>>> page = doc.get_page(3)
>>> # print the path and fill colour of each shape found in the bounding box
>>> for shape in page.shapes.iter_in_bbox((0, 0, 100, 200)):
...     print(shape.path, shape.fill.color.as_rgb())
...
>>> im = page.images[0].as_pil()  # requires pillow
>>> im.show()
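
For the XML side of the question, here is a rough sketch using only the standard library. It assumes you have already pulled (text, font name) pairs out of the PDF with whatever library you settle on, and that bold and italic can be guessed from the font name, which is a heuristic and not guaranteed for every PDF. The names style_of, TAG_FOR_STYLE and build_ref, the <ref> wrapper element, the font names, and the "2015;11:46-62" layout of the trailing text are all assumptions for illustration, not something taken from the attached picture.

```python
import re
import xml.etree.ElementTree as ET


def style_of(fontname):
    """Guess a style from a font name such as 'Times-Bold' (heuristic only)."""
    name = fontname.lower()
    if "bold" in name:  # note: a bold-italic font is treated as bold here
        return "bold"
    if "italic" in name or "oblique" in name:
        return "italic"
    return "normal"


# Style-to-tag mapping as described in the question.
TAG_FOR_STYLE = {
    "bold": "collab",
    "normal": "article-title",
    "italic": "source",
}


def build_ref(label, spans, tail):
    """Build a <ref> element (assumed wrapper) from styled spans.

    spans: list of (text, fontname) pairs, in reading order.
    tail:  trailing text assumed to look like 'year;volume:fpage-lpage'.
    """
    ref = ET.Element("ref")
    ET.SubElement(ref, "label").text = label
    for text, fontname in spans:
        tag = TAG_FOR_STYLE[style_of(fontname)]
        ET.SubElement(ref, tag).text = text.strip()
    # Accept a plain hyphen or an en dash between the page numbers.
    match = re.search(r"(\d{4});\s*(\d+):\s*(\d+)\s*[-\u2013]\s*(\d+)", tail)
    if match:
        for tag, value in zip(("year", "volume", "fpage", "lpage"), match.groups()):
            ET.SubElement(ref, tag).text = value
    return ref


# Example input; the font names are made up, the texts are the ones from the question.
spans = [
    ("Melnik BC.", "Times-Bold"),
    ("The Pathogenic Role…", "Times-Roman"),
    ("Current Diabetes…", "Times-Italic"),
]
ref = build_ref("3", spans, "2015;11:46-62")
print(ET.tostring(ref, encoding="unicode"))
```

Keeping the style detection separate from the XML building means you can swap in a different way of detecting bold and italic later without touching the part that writes the tags.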