Python Forum

I have a list of the bookmarks in a PDF that I want to transform. The list prints out in this form:

[

[2, 'Medical Evidence of Record (MER)  Src.:  HELEN HASKELL HOBBS Tmt. Dt.:  Unknown - Unknown (10 pages)', 88, {'kind': 4, 'xref': 7541, 'page': '88', 'view': 'FitB', 'collapse': False, 'zoom': 0.0}], [2, 'Copy of Evidence Request (CPYEVREQ)  Src.:  DALLAS DIAGNOSTIC ASSOCIATION Tmt. Dt.:  Unknown - Unknown (7 pages)', 98, {'kind': 4, 'xref': 7552, 'page': '98', 'view': 'FitB', 'collapse': False, 'zoom': 0.0}], [2, 'Medical Evidence of Record (MER)  Src.:   Tmt. Dt.:  Unknown - Unknown (7 pages)', 105, {'kind': 4, 'xref': 7560, 'page': '105', 'view': 'FitB', 'collapse': False, 'zoom': 0.0}], [2, 'Medical Evidence of Record (MER)  Src.:   Tmt. Dt.:  Unknown - Unknown (7 pages)', 112, {'kind': 4, 'xref': 7568, 'page': '112', 'view': 'FitB', 'collapse': False, 'zoom': 0.0}], [2, 'Copy of Evidence Request (CPYEVREQ)  Src.:  BAYLOR & SCOTT MEDICAL CENTER Tmt. Dt.:  Unknown - Unknown (3 pages)', 119, {'kind': 4, 'xref': 7576, 'page': '119', 'view': 'FitB', 'collapse': False, 'zoom': 0.0}], [2, 'Medical Evidence of Record (MER)  Src.:  DALLAS DIAGNOSTIC ASSOCIATION Tmt. Dt.:  Unknown - Unknown (119 pages)', 122, {'kind': 4, 'xref': 7580, 'page': '122', 'view': 'FitB', 'collapse': False, 'zoom': 0.0}], [2, 'Copy of Evidence Request (CPYEVREQ)  Src.:  BAYLOR & SCOTT MEDICAL CENTER Tmt. Dt.:  Unknown - Unknown (7 pages)', 241, {'kind': 4, 'xref': 7700, 'page': '241', 'view': 'FitB', 'collapse': False, 'zoom': 0.0}]]

The titles of these items have too much information crammed into them. I want to preserve the full title in my new dictionary so that I can refer back to the bookmark, but I need to parse the title into separate fields: the text that appears before "Src.:" as the "Document Type", and the text between "Src.:" and "Tmt. Dt.:" as the "Source". So, for example, I want the following output for the first two items:

[{'Title': 'Medical Evidence of Record (MER)  Src.:  HELEN HASKELL HOBBS Tmt. Dt.:  Unknown - Unknown (10 pages)', 'Document Type': 'Medical Evidence of Record (MER)', 'Source': 'HELEN HASKELL HOBBS'},{'Title': 'Copy of Evidence Request (CPYEVREQ)  Src.:  DALLAS DIAGNOSTIC ASSOCIATION Tmt. Dt.:  Unknown - Unknown (7 pages)', 'Document Type': 'Copy of Evidence Request (CPYEVREQ)', 'Source': 'DALLAS DIAGNOSTIC ASSOCIATION'}]

This looks interesting. It extracts the bookmarks directly from the PDF file in the form of a dictionary.

https://stackoverflow.com/questions/5430...genumber-a

You could use regex to split your bookmark title into document type and source like this:

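import re

# Capture the text before "Src.:" and the text between "Src.:" and "Tmt. Dt.:".
# Splitting on r' \S+\.: ' would leave a trailing "Tmt." and stray spaces in the source.
TITLE_RE = re.compile(r"(.*?)\s*Src\.:\s*(.*?)\s*Tmt\. Dt\.:")

bookmarks = [[2, 'Medical Evidence of Record (MER)  Src.:  HELEN HASKELL HOBBS Tmt. Dt.:  Unknown - Unknown (10 pages)']]
for bookmark in bookmarks:
    title = bookmark[1]
    match = TITLE_RE.match(title)
    # Fall back to the whole title and an empty source if the labels are missing.
    dt, source = match.groups() if match else (title, "")
    print({"title": title, "document type": dt, "source": source})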
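
For the whole list, a slightly more defensive variant might look like the sketch below. It anchors on the literal labels "Src.:" and "Tmt. Dt.:" seen in the titles rather than on any token ending in ".:", and it builds the 'Title' / 'Document Type' / 'Source' dictionaries asked for in the question. The names TITLE_RE and parse_bookmarks are made up for illustration, and titles that do not follow the layout are assumed to be kept with empty fields.

```python
import re

# Hypothetical names; the labels "Src.:" and "Tmt. Dt.:" are taken from the titles in the thread.
TITLE_RE = re.compile(
    r"^(?P<doc_type>.*?)\s*Src\.:\s*(?P<source>.*?)\s*Tmt\. Dt\.:"
)

def parse_bookmarks(bookmarks):
    """Return one dict per bookmark with the full title, document type and source."""
    rows = []
    for bookmark in bookmarks:
        title = bookmark[1]  # each bookmark is [level, title, page, ...]
        match = TITLE_RE.match(title)
        if match is None:
            # Assumption: keep titles that do not follow the layout, with empty fields.
            rows.append({"Title": title, "Document Type": "", "Source": ""})
            continue
        rows.append({
            "Title": title,
            "Document Type": match.group("doc_type"),
            "Source": match.group("source"),
        })
    return rows

bookmarks = [
    [2, "Medical Evidence of Record (MER)  Src.:  HELEN HASKELL HOBBS Tmt. Dt.:  Unknown - Unknown (10 pages)", 88],
    [2, "Medical Evidence of Record (MER)  Src.:   Tmt. Dt.:  Unknown - Unknown (7 pages)", 105],
]
rows = parse_bookmarks(bookmarks)
for row in rows:
    print(row)
```

A bookmark with nothing between the two labels comes back with 'Source' set to an empty string, which matches the blank-source rows that show up in the real data.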

Yes, thank you, I have done something like that. I now have these Python dictionaries as a result:

{'Title': 'Medical Evidence of Record (MER)  Src.:  HELEN HASKELL HOBBS Tmt. Dt.:  Unknown - Unknown (10 pages)', 'Document Type': 'Medical Evidence of Record (MER)', 'Source': 'HELEN HASKELL HOBBS'}
{'Title': 'Copy of Evidence Request (CPYEVREQ)  Src.:  DALLAS DIAGNOSTIC ASSOCIATION Tmt. Dt.:  Unknown - Unknown (7 pages)', 'Document Type': 'Copy of Evidence Request (CPYEVREQ)', 'Source': 'DALLAS DIAGNOSTIC ASSOCIATION'}
{'Title': 'Medical Evidence of Record (MER)  Src.:   Tmt. Dt.:  Unknown - Unknown (7 pages)', 'Document Type': 'Medical Evidence of Record (MER)', 'Source': ''}
{'Title': 'Medical Evidence of Record (MER)  Src.:   Tmt. Dt.:  Unknown - Unknown (7 pages)', 'Document Type': 'Medical Evidence of Record (MER)', 'Source': ''}
{'Title': 'Copy of Evidence Request (CPYEVREQ)  Src.:  BAYLOR & SCOTT MEDICAL CENTER Tmt. Dt.:  Unknown - Unknown (3 pages)', 'Document Type': 'Copy of Evidence Request (CPYEVREQ)', 'Source': 'BAYLOR & SCOTT MEDICAL CENTER'}
{'Title': 'Medical Evidence of Record (MER)  Src.:  DALLAS DIAGNOSTIC ASSOCIATION Tmt. Dt.:  Unknown - Unknown (119 pages)', 'Document Type': 'Medical Evidence of Record (MER)', 'Source': 'DALLAS DIAGNOSTIC ASSOCIATION'}
{'Title': 'Copy of Evidence Request (CPYEVREQ)  Src.:  BAYLOR & SCOTT MEDICAL CENTER Tmt. Dt.:  Unknown - Unknown (7 pages)', 'Document Type': 'Copy of Evidence Request (CPYEVREQ)', 'Source': 'BAYLOR & SCOTT MEDICAL CENTER'}

I now want to reorganize, or perhaps simply iterate over, these dictionaries. For each 'Source' value that has a row with a 'Document Type' of 'Copy of Evidence Request (CPYEVREQ)', I want to find out whether there is also a row, for that same Source, with a 'Document Type' of 'Medical Evidence of Record (MER)'. The former Document Type represents a request for medical records, and the latter represents compliance with that request. I am trying to identify the records requests that have not been complied with.
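
One possible way to do this, as a rough sketch: collect the set of sources that have at least one MER row, then keep the request rows whose source is not in that set. The function name unanswered_requests is made up for illustration. The sketch assumes an exact string match on 'Source' is good enough, and that MER rows with a blank source cannot be credited to anyone.

```python
REQUEST = "Copy of Evidence Request (CPYEVREQ)"
RESPONSE = "Medical Evidence of Record (MER)"

def unanswered_requests(rows):
    """Return the request rows whose Source never appears on an MER row."""
    # Sources that supplied records. Blank sources are skipped (assumption):
    # they cannot be matched to any particular request.
    answered = {
        row["Source"]
        for row in rows
        if row["Document Type"] == RESPONSE and row["Source"]
    }
    return [
        row
        for row in rows
        if row["Document Type"] == REQUEST and row["Source"] not in answered
    ]

# Same document types and sources as the rows above, titles omitted for brevity.
rows = [
    {"Document Type": RESPONSE, "Source": "HELEN HASKELL HOBBS"},
    {"Document Type": REQUEST, "Source": "DALLAS DIAGNOSTIC ASSOCIATION"},
    {"Document Type": RESPONSE, "Source": ""},
    {"Document Type": RESPONSE, "Source": ""},
    {"Document Type": REQUEST, "Source": "BAYLOR & SCOTT MEDICAL CENTER"},
    {"Document Type": RESPONSE, "Source": "DALLAS DIAGNOSTIC ASSOCIATION"},
    {"Document Type": REQUEST, "Source": "BAYLOR & SCOTT MEDICAL CENTER"},
]
pending = unanswered_requests(rows)
for source in sorted({row["Source"] for row in pending}):
    print(source)
```

On this data the only source left over is BAYLOR & SCOTT MEDICAL CENTER, with two outstanding requests. The two MER rows with a blank source may well be the missing responses, so those are worth checking by hand before treating a request as unanswered.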

standenman

deanhystad

standenman