How to Extract Specific Words from PDFs with Python - danvsv - Jan-17-2019

I have to copy specific strings from a PDF file and paste them into specific tags in an XML file. In the attached picture, number 3 from the PDF goes to label tag 3 in the XML. A bold string in the PDF (the first string, "Melnik BC.") goes into the <collab> tag, a normal string ("The Pathogenic Role…") goes into the <article-title> tag, and an italic string ("Current Diabetes…") goes into the <source> tag. Then the year 2015 goes into the year tag, 11 goes into volume, 46 goes into fpage, and 62 goes into lpage.

Can anybody help me with an idea of how I can solve this in Python?

Thank you very much,
Dan

RE: How to Extract Specific Words from PDFs with Python - Larz60+ - Jan-17-2019

see: https://pypi.org/search/?q=pdf+image+extract

minecart: https://pypi.org/project/minecart/ looks promising. Sample code:

>>> import minecart
>>> pdffile = open('example.pdf', 'rb')
>>> doc = minecart.Document(pdffile)
>>> page = doc.get_page(3)
>>> for shape in page.shapes.iter_in_bbox((0, 0, 100, 200)):
...     print(shape.path, shape.fill.color.as_rgb())
...
>>> im = page.images[0].as_pil()  # requires pillow
>>> im.show()
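
The XML side of the question can be sketched separately from the PDF extraction. The snippet below is only a rough illustration using the standard library: it assumes the text spans and their font styles (bold, normal, italic) have already been pulled out of the PDF by some library, which is not shown here. The `build_ref` function, the `STYLE_TO_TAG` mapping, the wrapping `<ref>` element, and the `2015;11:46-62` layout of the trailing text are all assumptions made for illustration; only the tag names and example values come from the question.

```python
import re
import xml.etree.ElementTree as ET

# Mapping from font style to XML tag, as described in the question.
STYLE_TO_TAG = {"bold": "collab", "normal": "article-title", "italic": "source"}


def build_ref(label, spans, tail):
    """Build one reference element.

    label: the reference number shown in the PDF.
    spans: list of (text, style) pairs; style is a key of STYLE_TO_TAG.
    tail:  trailing text, ASSUMED here to look like 'year;volume:fpage-lpage'.
    """
    ref = ET.Element("ref")  # wrapper element name is an assumption
    ET.SubElement(ref, "label").text = str(label)

    # Each styled span becomes the tag that matches its font style.
    for text, style in spans:
        ET.SubElement(ref, STYLE_TO_TAG[style]).text = text.strip()

    # Split the trailing numbers into year, volume, first page, last page.
    match = re.search(r"(\d{4})\s*;\s*(\d+)\s*:\s*(\d+)\s*[-–]\s*(\d+)", tail)
    if match:
        for tag, value in zip(("year", "volume", "fpage", "lpage"), match.groups()):
            ET.SubElement(ref, tag).text = value
    return ref


# Example values from the question (the titles are truncated there too).
spans = [
    ("Melnik BC.", "bold"),
    ("The Pathogenic Role…", "normal"),
    ("Current Diabetes…", "italic"),
]
ref = build_ref(3, spans, "2015;11:46-62")
print(ET.tostring(ref, encoding="unicode"))
```

Keeping the style-to-tag mapping in one dictionary means the same function can be reused if the PDF library reports styles under different names; only the keys would need to change.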