Python Forum
Regex Help for clubbing similar sentence segments
#1
Hi,
I have a corpus containing sentences that follow a certain pattern, which I would like to change using regex.

The pattern is: [a certain set of words][word1], [a certain set of words][word2], [a certain set of words][word3], etc.
It should be converted to: [a certain set of words][word1], [word2] or [word3]

A few examples:

So whether it is the body, whether it is the mind, whether it is the energy, whether it is the emotions.
Changes to:
So whether it is the body, mind, energy or emotions.

So whether it is the body, whether it is the mind, whether it is the energy.
Changes to:
So whether it is the body, mind or energy.

So he cannot eat, he cannot sleep.
Changes to:
So he cannot eat or sleep.

The regexes I'm using for each of the sentences are:
import re
# Longest case first: four repeated segments, then three, then two.
fulltext = re.sub(r"((\b\S+\b\s){1,4})(\b\S+\b)[,]\s\1(\b\S+\b)[,]\s\1(\b\S+\b)[,]\s\1", r"\1\3, \4, \5 or ", fulltext)
fulltext = re.sub(r"((\b\S+\b\s){1,4})(\b\S+\b)[,]\s\1(\b\S+\b)[,]\s\1", r"\1\3, \4 or ", fulltext)
fulltext = re.sub(r"((\b\S+\b\s){1,4})(\b\S+\b)[,]\s\1", r"\1\3 or ", fulltext)

I was wondering whether there is a single regex I can apply to all of these, and also to the more general case where the pattern repeats any number of times.
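
For illustration, here is a minimal sketch of one way a single pattern might cover any number of repeats. It assumes the repeated prefix is one to four words long (as in the regexes above), that segments are separated by exactly a comma and one space, and that each varying part is a single word. The names REPEAT, _club and club_repeats are made up for this example. The idea is to capture the whole run of repeats in one group and build the replacement in a function passed to re.sub:

```python
import re

# Group 1: prefix of 1-4 words, each followed by one space.
# Group 2: the first varying word.
# Group 3: one or more further segments of the form ", <prefix><word>".
REPEAT = re.compile(r"\b((?:\w+ ){1,4})(\w+)((?:, \1\w+)+)")


def _club(match):
    prefix, first, tail = match.groups()
    # tail looks like ", <prefix>w2, <prefix>w3"; splitting on ", <prefix>"
    # yields ['', 'w2', 'w3'], so the empty first item is dropped.
    words = [first] + tail.split(", " + prefix)[1:]
    return prefix + ", ".join(words[:-1]) + " or " + words[-1]


def club_repeats(text):
    # Rewrite every run of repeated segments found in text.
    return REPEAT.sub(_club, text)


print(club_repeats("So he cannot eat, he cannot sleep."))
```

Because the replacement is computed from the match, the same call should handle two, three or more repeated segments. Prefixes longer than four words, or varying parts longer than one word, would need a different pattern.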
#2
I am not able to follow in my head what these regexes do. My mind is simple and likes code that it can follow.

def my_func(string, pattern):
    string = ' or'.join(string.rsplit(',', 1))    # replace the last comma with ' or'
    pattern = pattern + ' '                       # add a trailing space to the pattern
    splitted = string.split(pattern)              # split the string on the pattern
    splitted.insert(1, pattern)                   # put the pattern back in once
    return ''.join(splitted)                      # return the result as a string

>>> s = 'So whether it is the body, whether it is the mind, whether it is the energy, whether it is the emotions.'
>>> my_func(s, 'whether it is the')                                                        
'So whether it is the body, mind, energy or emotions.'
Observation: the pattern as described in human language is not quite correct. There is a 'So' before [a certain set of words] in the first segment which is not present in the others. If the sentence always starts with 'So', it's easy to get the 'pattern' out with simple string methods.
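
As a rough sketch of that idea, assuming the sentence really does start with 'So' and that exactly one word before the first comma is not part of the pattern (the function name pattern_after_so is made up for this example):

```python
def pattern_after_so(sentence):
    # Everything before the first comma, e.g. 'So whether it is the body'
    first_segment = sentence.split(",", 1)[0]
    words = first_segment.split()
    # Need at least 'So', one pattern word and one varying word.
    if len(words) < 3 or words[0] != "So":
        return ""
    # Drop the leading 'So' and the trailing varying word.
    return " ".join(words[1:-1])


print(pattern_after_so("So he cannot eat, he cannot sleep."))
```

The result could then be passed as the pattern argument of my_func.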
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
#3
Thanks, but this code requires that I know what the pattern is. I used the regex for the more general case, where I don't know what the exact set of words will be.
#4
(Nov-20-2019, 04:06 AM)regstuff Wrote: I don't know what the exact set of words would be.

Now we're getting somewhere. The task is actually not about applying changes. The task is:

- find the repeating [a certain set of words]
- replace all occurrences except the first.

However, some assumptions must be made in order to find the repeating chunk of text: is it always before the first comma, and is there only one word before the comma which is not part of the pattern (always like 'the body' and never 'the beautiful body')?

If the above assumptions are true, finding the pattern is not that hard:

Split on commas and strip the surrounding spaces. Remove the last word from the first item. Check whether the second item starts with what is left of the first item. If not, remove the first word from the first item and check again. Continue removing the first word from the first item until it matches the start of the second item or no words are left.
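
The steps above can be sketched roughly as follows. This is only an illustration under the stated assumptions (the pattern sits before the first comma and only the last word of each segment varies), and the name find_repeated_prefix is made up for this example:

```python
def find_repeated_prefix(sentence):
    # Split on commas and strip surrounding spaces.
    items = [item.strip() for item in sentence.split(",")]
    if len(items) < 2:
        return ""
    # Remove the last (varying) word from the first item.
    words = items[0].split()[:-1]
    while words:
        candidate = " ".join(words)
        # The trailing space makes sure whole words are compared.
        if items[1].startswith(candidate + " "):
            return candidate
        # No match: drop the first word and try again.
        words = words[1:]
    return ""


print(find_repeated_prefix("So he cannot eat, he cannot sleep."))
```

An empty string means no repeating chunk was found under these assumptions.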