Python Forum

Full Version: Use regular expression to return 5 words before and after target word.
Hello,
I've extracted doctors' notes from our organization's database (SQL Server) into a CSV file.

I'd like to use Python to add a new field containing just a few words from the notes (I have Anaconda installed on my PC).

Every row in my CSV contains the word "diagnosis", and the ICD-10 code appears within 5 words of it.

Example of data:

Row1: Hey doctor Who is here - I will add R45.1 as diagnosis, then some other medical terms and stuff about the patient. ffdsas ,dsfd tsdsf
Row2: Some other medical terms and stuff diagnosis of R45.2 was entered for this patient. Where did Doctor Who go? Then xxx feea fdsfd

I want:
Row1: I will add R45.1 as diagnosis, then some other medical terms
Row2: other medical terms and stuff diagnosis of R45.2 was entered for

I'm open to any suggestions if a regular expression is not the best approach.

Thanks
Steve
You could do it like this if each note is a single string and you want the 5 words before "diagnosis" and the 5 words after it.
>>> import re

>>> s1 = 'Hey doctor Who is here - I will add R45.1 as diagnosis, then some other medical terms and stuff about the patient. ffdsas ,dsfd tsdsf'
>>> s2 = 'Some other medical terms and stuff diagnosis of R45.2 was entered for this patient. Where did Doctor Who go? Then xxx feea fdsfd'
>>> # Compile once: up to 5 words before and up to 5 words after "diagnosis"
>>> pattern = re.compile(r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,5}diagnosis(?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,5}")
>>> r1 = pattern.search(s1)
>>> r2 = pattern.search(s2)
>>> r1.group()
'I will add R45.1 as diagnosis, then some other medical terms'
>>> r2.group()
'other medical terms and stuff diagnosis of R45.2 was entered for'
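To run the same pattern over every row of the CSV and store the match in a new field, something like the following may work. It is only a sketch: the column names `id`, `notes` and `context`, the helper `context_around`, and the in-memory sample standing in for the exported file are all assumptions for illustration, since the real layout of the export isn't shown in this thread.

```python
import csv
import io
import re

# Same pattern as above: up to 5 "words" on either side of the target word.
# Digits and punctuation count as separators here, not as word characters.
PATTERN = re.compile(
    r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,5}diagnosis(?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,5}"
)


def context_around(note):
    """Return the text surrounding 'diagnosis', or '' if the word is absent."""
    match = PATTERN.search(note)
    return match.group() if match else ""


# Stand-in for the exported file (assumed columns: id, notes).
# With a real file, open it with open(path, newline="") instead.
sample = io.StringIO(
    "id,notes\n"
    '1,"Hey doctor Who is here - I will add R45.1 as diagnosis, then some other medical terms and stuff about the patient."\n'
    '2,"Some other medical terms and stuff diagnosis of R45.2 was entered for this patient. Where did Doctor Who go?"\n'
)

rows = []
for row in csv.DictReader(sample):
    row["context"] = context_around(row["notes"])  # the new field
    rows.append(row)

# Write the rows back out with the extra column.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["id", "notes", "context"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Rows without the word "diagnosis" get an empty string rather than raising an error, which seems the safer default if some notes turn out not to follow the expected wording.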
I think the word "diagnosis" is unimportant here. Just scrape all the codes from the notes. They follow a well-defined format, so they should be easy to find.

I don't know anything about ICD-10, but your examples all have the form "R[digit][digit].[digit]", which as a regex would be R\d{2}\.\d. Parsing the notes for codes that happen to be near a particular word seems ridiculous when you can easily go straight to the codes.
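Going straight to the codes could look roughly like this. The pattern only covers the shape seen in the two sample rows; whether the real notes contain codes with other letters or longer suffixes isn't known from this thread, so treat it as an assumption to adjust. The helper name `find_codes` is made up for the example.

```python
import re

# Matches codes shaped like the samples: "R", two digits, a dot, one digit.
# The leading \b stops a match from starting in the middle of a longer token.
CODE = re.compile(r"\bR\d{2}\.\d")


def find_codes(note):
    """Return every code-shaped substring in the note, in order of appearance."""
    return CODE.findall(note)


s1 = "Hey doctor Who is here - I will add R45.1 as diagnosis, then some other medical terms"
s2 = "Some other medical terms and stuff diagnosis of R45.2 was entered for this patient."

print(find_codes(s1))
print(find_codes(s2))
```

If a single note mentions several codes, `findall` returns all of them, so deciding which one is the diagnosis may still require looking at the surrounding words.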