Regex Help for clubbing similar sentence segments - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Regex Help for clubbing similar sentence segments (/thread-22600.html)

Regex Help for clubbing similar sentence segments - regstuff - Nov-19-2019

Hi,

I have a corpus which contains sentences in a certain pattern, which I would like to change by applying regex. The pattern is

[a certain set of words][word1], [a certain set of words][word2], [a certain set of words][word3], etc.

to be converted to:

[a certain set of words][word1], [word2] or [word3]

A few examples:

So whether it is the body, whether it is the mind, whether it is the energy, whether it is the emotions.
Changes to: So whether it is the body, mind, energy or emotions.

So whether it is the body, whether it is the mind, whether it is the energy.
Changes to: So whether it is the body, mind or energy.

So he cannot eat, he cannot sleep.
Changes to: So he cannot eat or sleep.

The regexes I'm using, one for each of the sentences, are:

```python
import re

# Four, three, then two repetitions of the same leading words
fulltext = re.sub(r"((\b\S+\b\s){1,4})(\b\S+\b)[,]\s\1(\b\S+\b)[,]\s\1(\b\S+\b)[,]\s\1", r"\1\3, \4, \5 or ", fulltext)
fulltext = re.sub(r"((\b\S+\b\s){1,4})(\b\S+\b)[,]\s\1(\b\S+\b)[,]\s\1", r"\1\3, \4 or ", fulltext)
fulltext = re.sub(r"((\b\S+\b\s){1,4})(\b\S+\b)[,]\s\1", r"\1\3 or ", fulltext)
```

I was wondering if there is a single regex I can apply to all of these, and also to a more general case where the pattern repeats any number of times.

RE: Regex Help for clubbing similar sentence segments - perfringo - Nov-19-2019

I am not able to follow in my head what these regexes do. My mind is simple and likes code that it can follow.

```python
def my_func(string, pattern):
    string = ' or'.join(string.rsplit(',', 1))   # replace last comma with or
    pattern = ''.join([pattern, ' '])            # add space to end of pattern
    splitted = string.split(pattern)             # split string on pattern
    splitted.insert(1, pattern)                  # put pattern back in
    return ''.join(splitted)                     # return result as string
```

```python
>>> s = 'So whether it is the body, whether it is the mind, whether it is the energy, whether it is the emotions.'
>>> my_func(s, 'whether it is the')
'So whether it is the body, mind, energy or emotions.'
```

Observation: the pattern as described in human language is not correct. There is a 'So' in the first [a certain set of words] which is not present in the others. If the sentence always starts with 'So', it's easy to get the 'pattern' out with simple string methods.

RE: Regex Help for clubbing similar sentence segments - regstuff - Nov-20-2019

Thanks, but this code requires that I know what the pattern is. I used the regex as a more general case where I don't know what the exact set of words would be.

RE: Regex Help for clubbing similar sentence segments - perfringo - Nov-20-2019

(Nov-20-2019, 04:06 AM)regstuff Wrote: I don't know what the exact set of words would be.

Now we are getting somewhere. The task is actually not about applying changes. The task is:

- find the repeating [a certain set of words]
- replace all occurrences except the first.

However, some assumptions must be made in order to find the repeating chunk of text: is it always before the first comma, and is there only one word before the comma which is not part of the pattern (always like 'the body' and never 'the beautiful body')?

If the above assumptions are true, finding the pattern is not that hard:

1. Split on commas, stripping the surrounding spaces.
2. Remove the last word of the first item.
3. Check whether the second item starts with what is left of the first item.
4. If not, remove the first word of the first item and check again.
5. Continue removing the first word of the first item until it matches the start of the second item or there are no words left.
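
The steps above could be sketched roughly as follows. This is only a minimal sketch under the same assumptions (the repeated chunk sits before the first comma and exactly one varying word follows it). The function names `find_repeated_prefix` and `collapse` are made up for illustration and do not come from the thread.

```python
def find_repeated_prefix(sentence):
    """Return the repeated leading chunk of words, or '' if none is found."""
    # Split on commas and strip the surrounding spaces
    items = [item.strip() for item in sentence.split(',')]
    if len(items) < 2:
        return ''
    # Remove the last word of the first item (assumed to be the varying word)
    words = items[0].split()[:-1]
    while words:
        candidate = ' '.join(words)
        # Does the second item start with the remaining words?
        if items[1].startswith(candidate + ' '):
            return candidate
        # If not, remove the first word and check again
        words = words[1:]
    return ''


def collapse(sentence):
    """Keep the first occurrence of the repeated chunk and drop the rest."""
    prefix = find_repeated_prefix(sentence)
    if not prefix:
        return sentence
    items = [item.strip() for item in sentence.split(',')]
    first, rest = items[0], items[1:]
    # Leave the sentence alone unless every later item repeats the chunk
    if not all(item.startswith(prefix + ' ') for item in rest):
        return sentence
    parts = [first] + [item[len(prefix) + 1:] for item in rest]
    # Join with commas, using 'or' before the last part
    return ', '.join(parts[:-1]) + ' or ' + parts[-1]
```

With the three example sentences from the first post, `collapse` should give the requested results, e.g. `collapse('So he cannot eat, he cannot sleep.')` returns `'So he cannot eat or sleep.'`. A sentence without a repeated chunk is returned unchanged.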