Python Forum
Python: re.findall to find multiple instances doesn't work but search worked
#1
I want to find ALL instances of questions being asked, and the persons who asked them, in the text below.

++++++++++++++++

26 Mr Kwek Hian Chuan Henry asked the Minister for the Environment and Water Resources whether Singapore will stay the course on fighting climate change and meet our climate change commitments despite the current upheavals in the energy market and the potential long-term economic impact arising from the COVID-19 situation. We agree with the Panel and will instead strengthen regulations to safeguard the safety of path users. With regard to Ms Rahayu Mahzam's suggestion of tapping on the Small Claims Tribunal for personal injury claims up to $20,000, we understand that the Tribunal does not hear personal injury claims. Mr Gan Thiam Poh, Ms Rahayu Mahzam and Mr Melvin Yong have asked about online retailers of PMDs. Mr Melvin Yong asked about the qualifications and training of OEOs.
++++++++++++++++++++++++++

Desired Output:

asked= ["Mr Kwek Hian Chuan Henry asked the Minister for the Environment and Water Resources whether Singapore will stay the course on fighting climate change and meet our climate change commitments despite the current upheavals in the energy market and the potential long-term economic impact arising from the COVID-19 situation." ,
"Mr Gan Thiam Poh, Ms Rahayu Mahzam and Mr Melvin Yong have asked about online retailers of PMDs.",
"Mr Melvin Yong asked about the qualifications and training of OEOs."]

askedperson= ["Mr Kwek Hian Chuan Henry", "Mr Gan Thiam Poh, Ms Rahayu Mahzam and Mr Melvin Yong", "Mr Melvin Yong" ]


My code, which doesn't work:

asked_regex=re.compile(r'(Mr|Miss|Ms|Dr)(.|\n)+?(asked)(.|\n)+?\.')
asked=re.findall(asked_regex, text_list)

askedperson_regex=re.compile(r'(Mr|Miss|Ms|Dr)(.|\n)+?(?=asked)')
askedperson=re.findall(askedperson_regex, text_list)


print(asked)
[('Mr', ' ', 'asked', 'n'), ('Ms', ' ', 'asked', 's'), ('Mr', ' ', 'asked', 's')]

print(askedperson)
[('Mr', ' '), ('Ms', ' '), ('Mr', ' ')]
But the odd thing is that when I used search, I could correctly obtain the first match:

asked_regex=re.compile(r'(Mr|Miss|Ms|Dr)(.|\n)+?(asked)(.|\n)+?\.')
asked=asked_regex.search(text_list).group().strip()

print(asked)

Mr Kwek Hian Chuan Henry asked the Minister for the Environment and Water Resources whether Singapore will stay the course on fighting climate change and meet our climate change commitments despite the current upheavals in the energy market and the potential long-term economic impact arising from the COVID-19 situation.
#2
Your problem was your grouping. This expression has 4 groups.
(Mr|Miss|Ms|Dr)(.|\n)+?(asked)(.|\n)+?\.
(      1      )( 2  )  (  3  )( 4  )
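When a pattern contains more than one group, findall() returns each match as a tuple of the groups instead of the full matched text. That is why your print looks as it does. A repeated group such as (.|\n)+? keeps only the last character it matched, which is where the ' ', 'n' and 's' entries come from. search().group() returns the whole match regardless of groups, which is why search appeared to work.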
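
To see the difference in isolation, here is a small sketch on a made-up two-sentence string. The sample text and the variable names are illustrative only and are not taken from your code. It also shows non-capturing groups, (?:...), which are another way to get full matches back from findall().

```python
import re

# Made-up sample text, just to show the shape of the results.
text = "Mr A asked about X. Ms B asked about Y."

# No groups: findall returns the full text of each match.
no_groups = re.findall(r"M[rs] \w+ asked", text)
# One group: findall returns only that group's text.
one_group = re.findall(r"(Mr|Ms) \w+ asked", text)
# Two or more groups: findall returns one tuple of groups per match.
two_groups = re.findall(r"(Mr|Ms) (\w+) asked", text)
# (?:...) groups without capturing, so full matches come back again.
non_capturing = re.findall(r"(?:Mr|Ms) \w+ asked", text)
# search() returns a match object; group() with no argument is the whole match.
first = re.search(r"(Mr|Ms) (\w+) asked", text).group()

print(no_groups)      # ['Mr A asked', 'Ms B asked']
print(one_group)      # ['Mr', 'Ms']
print(two_groups)     # [('Mr', 'A'), ('Ms', 'B')]
print(non_capturing)  # ['Mr A asked', 'Ms B asked']
print(first)          # Mr A asked
```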

The fix is a minor change. Wrap a group around all the parts of the pattern you want to keep, and use indexing to get the desired group from the tuple of groups.
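import re
# text is the passage from the question (text_list in your code)
pattern = re.compile(r"((Mr|Miss|Ms|Dr)(.|\n)+?(asked)(.|\n)+?\.)")
# The first item of each tuple is the outer group, i.e. the whole match.
asked = [match[0] for match in pattern.findall(text)]
print("\n\n".join(asked))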
Output:
Mr Kwek Hian Chuan Henry asked the Minister for the Environment and Water Resources whether Singapore will stay the course on fighting climate change and meet our climate change commitments despite the current upheavals in the energy market and the potential long-term economic impact arising from the COVID-19 situation.

Ms Rahayu Mahzam's suggestion of tapping on the Small Claims Tribunal for personal injury claims up to $20,000, we understand that the Tribunal does not hear personal injury claims. Mr Gan Thiam Poh, Ms Rahayu Mahzam and Mr Melvin Yong have asked about online retailers of PMDs.

Mr Melvin Yong asked about the qualifications and training of OEOs.
Notice there is an issue with the second match: it starts at "Ms Rahayu Mahzam's suggestion" and runs across two sentences. The period following "injury claims" is not treated as a terminating character because the regex is still looking for "asked" at that point. An easy fix is to do some preprocessing. I would replace all newlines with a space and split the text into lines on the period (lines = text.split(".")). That greatly simplifies the regex, because all a line needs is Mr|Miss|Ms|Dr and asked. It should also speed up searching.
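import re
pattern = re.compile(r"((Mr|Miss|Ms|Dr).+asked.+)")

asked = []
# Replace newlines with a space so words on adjacent lines are not glued together.
for line in text.replace("\n", " ").split("."):
    result = pattern.search(line)
    if result:
        # split() removed the period, so put it back.
        asked.append(result.group() + ".")
print("\n\n".join(asked))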
Output:
Mr Kwek Hian Chuan Henry asked the Minister for the Environment and Water Resources whether Singapore will stay the course on fighting climate change and meet our climate change commitments despite the current upheavals in the energy market and the potential long-term economic impact arising from the COVID-19 situation.

Mr Gan Thiam Poh, Ms Rahayu Mahzam and Mr Melvin Yong have asked about online retailers of PMDs.

Mr Melvin Yong asked about the qualifications and training of OEOs.
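
The askedperson list from the question can probably be built with the same split-on-period approach. The following is only a sketch under some assumptions: the sample text is an abbreviated version of the passage in the question, the pattern and the name `sentence` are my own, and it assumes the person is everything from the first title up to "asked" (or "have asked" / "has asked"). Splitting on "." would also break on abbreviations such as "Mr." or on decimal numbers, which do not occur in this text.

```python
import re

# Abbreviated version of the passage from the question (an assumption for the demo).
text = (
    "26 Mr Kwek Hian Chuan Henry asked the Minister whether Singapore will "
    "stay the course. We agree with the Panel. With regard to Ms Rahayu "
    "Mahzam's suggestion, the Tribunal does not hear personal injury claims. "
    "Mr Gan Thiam Poh, Ms Rahayu Mahzam and Mr Melvin Yong have asked about "
    "online retailers of PMDs. Mr Melvin Yong asked about the qualifications "
    "and training of OEOs."
)

# Group 1 captures the person(s); (?:...) groups are non-capturing.
pattern = re.compile(r"((?:Mr|Miss|Ms|Dr)\b.+?)\s+(?:have\s+|has\s+)?asked\b")

asked = []
askedperson = []
for sentence in text.replace("\n", " ").split("."):
    match = pattern.search(sentence)
    if match:
        # Keep the sentence from the first title onward and restore the period.
        asked.append(sentence[match.start():].strip() + ".")
        askedperson.append(match.group(1))

print(askedperson)
print("\n\n".join(asked))
```

Using a single capturing group for the names keeps the two lists in step, since both are filled from the same match.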