Python Forum

Full Version: Identifying keywords in text
Hi,

I'm a teacher and I'd like to create a program with my students which analyses a text file to find two specific words and then outputs all the text between these words.

Could anyone point me in the direction of any articles or guidance that might help us achieve this please?

Or any advice would be much appreciated, thanks.

Jon
The re module is one way that comes to mind.
Haha funny request!

Menator is right, re is probably best, but I find re hard. Need to study more.

This is just an example using only simple tools.

# I suppose the word order matters
firstword = 'peck'
secondword = 'peppers'
mystring = 'Peter Piper peppers picked a peck of pickled 某个东西 peppers. Where\'s peppers the peck of pickled 四川 peppers Peter Piper picked?'

# split the string on firstword, which gives you a list
mylist = mystring.split(firstword)
# after splitting on firstword, the phrases we are interested in begin with a space
# now find phrases which begin with a space and contain secondword
# split on secondword and save the first element of the new list
myphrases = []
for phrase in mylist:
    if phrase.startswith(' ') and secondword in phrase:
        newlist = phrase.split(secondword)
        # get rid of leading and trailing whitespace
        result = newlist[0].strip()
        myphrases.append(result)

print('found text between', '"' + firstword + '"','and ', '"' + secondword + '"', len(myphrases), 'times')
for p in myphrases:
    print(p)
Thank you for your help.
Look here for re help.

It's probably best to use re; I just find it hard to grasp!
Example of using re

import re

mystring = 'Peter Piper peppers picked a peck of pickled peppers. Where\'s peppers the peck of dozens of pickled peppers Peter Piper picked?'

findit = re.search(r'peck(.*?)peppers', mystring).group(1)

print(f'One occurrence -> {findit}')

findit = re.findall(r'(?:peck)(.*?)(?:peppers)', mystring)

print(f'Multiple occurrences -> {findit}')
Output:
One occurrence ->  of pickled
Multiple occurrences -> [' of pickled ', ' of dozens of pickled ']
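Since the original question was about a text file and two arbitrary words, here is a minimal sketch of how the same pattern might be wrapped up for that. The function names text_between and text_between_in_file are made up for illustration, and the file is assumed to be UTF-8 and small enough to read in one go.

```python
import re
from pathlib import Path


def text_between(text, first, second):
    """Return every stretch of text found between first and second."""
    # re.escape keeps characters such as '.' or '?' in the words literal
    pattern = re.escape(first) + r'(.*?)' + re.escape(second)
    # re.DOTALL lets '.' match newlines, so a match can span several lines
    found = re.findall(pattern, text, flags=re.DOTALL)
    return [piece.strip() for piece in found]


def text_between_in_file(path, first, second):
    """Read a whole text file (assumed UTF-8) and search its contents."""
    return text_between(Path(path).read_text(encoding='utf-8'), first, second)


mystring = ("Peter Piper picked a peck of pickled peppers.\n"
            "Where's the peck\nof pickled peppers Peter Piper picked?")
print(text_between(mystring, 'peck', 'peppers'))
```

With a real file you would call something like text_between_in_file('story.txt', 'peck', 'peppers'), where story.txt is only a placeholder name.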
Pedroski55's code works fine.
One piece of advice is to look into f-strings 🧐, as your line 19 is not nice.
It's also easy to make a mistake with that approach, as you did with one whitespace too many.
print('found text between', '"' + firstword + '"','and ', '"' + secondword + '"', len(myphrases), 'times')
# With f-string
print(f'found text between "{firstword}" and "{secondword}" {len(myphrases)} times')
Output:
found text between "brown" and  "lazy" 1 times
found text between "brown" and "lazy" 1 times
The regex works fine, menator01.
You could extend the regex to also remove the whitespace, but just calling strip() is an easier fix.
>>> import re
>>> 
>>> text = 'The quick brown fox jumps over the lazy dog'
>>> result = re.search(r'quick(.*?)jumps', text)
>>> result.group(1)
' brown fox '
>>> # Fix whitespace
>>> result.group(1).strip()
'brown fox'
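For comparison, here is a rough sketch of the other option mentioned above, letting the regex itself drop the whitespace. It assumes the same example sentence; the name trimmed is just for illustration. Note that re.search returns None when nothing matches, so the sketch checks for that before calling group().

```python
import re

text = 'The quick brown fox jumps over the lazy dog'
# \s* on each side of the group consumes the spaces, so they stay out of group(1)
result = re.search(r'quick\s*(.*?)\s*jumps', text)
# re.search returns None when there is no match, so guard before using group()
trimmed = result.group(1) if result else None
print(repr(trimmed))
```

This prints 'brown fox', the same as the strip() version, but the pattern is a little harder to read.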