Python Forum

Full Version: Python: re.findall to find multiple instances don't work but search worked
I want to find ALL instances of asked questions and the persons who asked them in the below text.

++++++++++++++++

26 Mr Kwek Hian Chuan Henry asked the Minister for the Environment and Water Resources whether Singapore will stay the course on fighting climate change and meet our climate change commitments despite the current upheavals in the energy market and the potential long-term economic impact arising from the COVID-19 situation. We agree with the Panel and will instead strengthen regulations to safeguard the safety of path users. With regard to Ms Rahayu Mahzam's suggestion of tapping on the Small Claims Tribunal for personal injury claims up to $20,000, we understand that the Tribunal does not hear personal injury claims. Mr Gan Thiam Poh, Ms Rahayu Mahzam and Mr Melvin Yong have asked about online retailers of PMDs. Mr Melvin Yong asked about the qualifications and training of OEOs.
++++++++++++++++++++++++++

Desired Output:

asked= ["Mr Kwek Hian Chuan Henry asked the Minister for the Environment and Water Resources whether Singapore will stay the course on fighting climate change and meet our climate change commitments despite the current upheavals in the energy market and the potential long-term economic impact arising from the COVID-19 situation." ,
"Mr Gan Thiam Poh, Ms Rahayu Mahzam and Mr Melvin Yong have asked about online retailers of PMDs.",
"Mr Melvin Yong asked about the qualifications and training of OEOs."]

askedperson= ["Mr Kwek Hian Chuan Henry", "Mr Gan Thiam Poh, Ms Rahayu Mahzam and Mr Melvin Yong", "Mr Melvin Yong" ]


My Code which doesn't work:

asked_regex=re.compile(r'(Mr|Miss|Ms|Dr)(.|\n)+?(asked)(.|\n)+?\.')
asked=re.findall(asked_regex, text_list)

askedperson_regex=re.compile(r'(Mr|Miss|Ms|Dr)(.|\n)+?(?=asked)')
askedperson=re.findall(askedperson_regex, text_list)


print(asked)
[('Mr', ' ', 'asked', 'n'), ('Ms', ' ', 'asked', 's'), ('Mr', ' ', 'asked', 's')]

print(askedperson)
[('Mr', ' '), ('Ms', ' '), ('Mr', ' ')]
But the odd thing is that when I used search, I could correctly obtain the 1st match

asked_regex=re.compile(r'(Mr|Miss|Ms|Dr)(.|\n)+?(asked)(.|\n)+?\.')
asked=asked_regex.search(text_list).group().strip()

print(asked)

Mr Kwek Hian Chuan Henry asked the Minister for the Environment and Water Resources whether Singapore will stay the course on fighting climate change and meet our climate change commitments despite the current upheavals in the energy market and the potential long-term economic impact arising from the COVID-19 situation.
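Your problem is your grouping. This expression has 4 capturing groups.
(Mr|Miss|Ms|Dr)(.|\n)+?(asked)(.|\n)+?\.
(      1      )( 2  )  (  3  )( 4  )
When a pattern contains more than one group, findall() returns each match as a tuple of its groups instead of the full matched text. Groups 2 and 4 sit inside a repetition, so each one keeps only the last character it matched. That is why your print shows tuples such as ('Mr', ' ', 'asked', 'n'). search() looked correct because group() with no argument returns the whole match, regardless of the groups inside it.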
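To see this behaviour in isolation, here is a small sketch on a made-up sentence. The sample string and the variable names are my own, not taken from your text. It also shows two other ways to get the full match back: finditer() with group(0), and non-capturing groups (?:...) combined with re.DOTALL.

```python
import re

# Hypothetical sample text, much shorter than the real transcript.
sample = "Mr Tan asked about buses. Ms Lee asked about trains."

# Capturing groups: findall() returns one tuple of groups per match.
# Groups 2 and 4 are repeated, so each holds only its last iteration.
capturing = re.compile(r"(Mr|Ms)(.|\n)+?(asked)(.|\n)+?\.")
tuples = capturing.findall(sample)
print(tuples)
# [('Mr', ' ', 'asked', 's'), ('Ms', ' ', 'asked', 's')]

# finditer() yields match objects, and group(0) is always the full match.
full_matches = [m.group(0) for m in capturing.finditer(sample)]
print(full_matches)
# ['Mr Tan asked about buses.', 'Ms Lee asked about trains.']

# Non-capturing groups (?:...) leave the pattern with no groups, so
# findall() returns the matched strings themselves.
# re.DOTALL lets "." match newlines, replacing the (.|\n) workaround.
non_capturing = re.compile(r"(?:Mr|Ms).+?asked.+?\.", re.DOTALL)
strings = non_capturing.findall(sample)
print(strings)
# ['Mr Tan asked about buses.', 'Ms Lee asked about trains.']
```

Which variant to use is mostly a matter of taste. The non-capturing form keeps findall() simple, while finditer() is handy if you later want match positions as well.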

The fix is a minor change. Wrap a group around all the parts of the pattern you want to keep, and use indexing to get the desired group from the tuple of groups.
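import re
# Group 1 (the outer parentheses) spans the whole match; findall() puts it at index 0 of each tuple.
pattern = re.compile(r"((Mr|Miss|Ms|Dr)(.|\n)+?(asked)(.|\n)+?\.)")
asked = [match[0] for match in re.findall(pattern, text)]
print("\n\n".join(asked))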
Output:
Mr Kwek Hian Chuan Henry asked the Minister for the Environment and Water Resources whether Singapore will stay the course on fighting climate change and meet our climate change commitments despite the current upheavals in the energy market and the potential long-term economic impact arising from the COVID-19 situation.

Ms Rahayu Mahzam's suggestion of tapping on the Small Claims Tribunal for personal injury claims up to $20,000, we understand that the Tribunal does not hear personal injury claims. Mr Gan Thiam Poh, Ms Rahayu Mahzam and Mr Melvin Yong have asked about online retailers of PMDs.

Mr Melvin Yong asked about the qualifications and training of OEOs.
Notice there is an issue with the second match: it starts at "Ms Rahayu Mahzam's suggestion" and runs all the way to "PMDs.". The period following "injury claims" is not treated as a terminating character because the regex is still looking for "asked" at that point. An easy fix is to do some preprocessing. I would replace all newlines with a space and split the text into sentences at each period (lines = text.split(".")). That greatly simplifies the regex, because all a sentence needs is Mr|Miss|Ms|Dr followed by asked. It should also speed up searching.
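import re

# Each piece produced by split(".") is a single sentence, so a greedy .+ cannot run into the next one.
pattern = re.compile(r"((Mr|Miss|Ms|Dr).+asked.+)")

asked = []
# Replace newlines with a space (not "") so words on adjacent lines are not glued together.
for line in text.replace("\n", " ").split("."):
    result = pattern.search(line)
    if result:
        # split() dropped the period, so put it back.
        asked.append(result.group() + ".")
print("\n\n".join(asked))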
Output:
Mr Kwek Hian Chuan Henry asked the Minister for the Environment and Water Resources whether Singapore will stay the course on fighting climate change and meet our climate change commitments despite the current upheavals in the energy market and the potential long-term economic impact arising from the COVID-19 situation.

Mr Gan Thiam Poh, Ms Rahayu Mahzam and Mr Melvin Yong have asked about online retailers of PMDs.

Mr Melvin Yong asked about the qualifications and training of OEOs.
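The question also wanted an askedperson list. One possible way to get it with the same sentence-splitting idea is to capture the name part in its own group. This is only a sketch: find_questions is a helper name I made up, the optional "have" is there to cover "have asked", and the sample is an excerpt of the text from the question. It assumes no sentence contains a period inside it (for example "Mr." or a decimal number), since those would be split in the wrong place.

```python
import re

# Group 1 captures the name(s): from the first title up to "asked" or
# "have asked". The rest of the sentence follows in the full match.
pattern = re.compile(r"((?:Mr|Miss|Ms|Dr)\b.*?)\s+(?:have\s+)?asked\b.*")


def find_questions(text):
    """Return (asked, askedperson) lists, one entry per matching sentence."""
    asked, askedperson = [], []
    for sentence in text.replace("\n", " ").split("."):
        match = pattern.search(sentence)
        if match:
            asked.append(match.group(0) + ".")  # restore the period
            askedperson.append(match.group(1))
    return asked, askedperson


# Excerpt of the text from the question.
sample = (
    "With regard to Ms Rahayu Mahzam's suggestion of tapping on the Small "
    "Claims Tribunal for personal injury claims up to $20,000, we understand "
    "that the Tribunal does not hear personal injury claims. Mr Gan Thiam "
    "Poh, Ms Rahayu Mahzam and Mr Melvin Yong have asked about online "
    "retailers of PMDs. Mr Melvin Yong asked about the qualifications and "
    "training of OEOs."
)

asked, askedperson = find_questions(sample)
print(askedperson)
# ['Mr Gan Thiam Poh, Ms Rahayu Mahzam and Mr Melvin Yong', 'Mr Melvin Yong']
```

The first sentence of the excerpt mentions "Ms Rahayu Mahzam" but contains no "asked", so it is skipped.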