Thread: Extract text (Python Forum, Python Coding, General Coding Help, /thread-37573.html)
Extract text - rektcol - Jun-27-2022

Hello everyone, I hope this is comprehensible. I have to extract text from files in a repository. The files look like this:

```
titi
titi
titi
yes
ok
ok
ok
no
totot
tototo
tot
```

How can I extract the text between "yes" and "no"? I would like to get all of these "ok" lines. My output should be:

```
yes
ok
ok
ok
no
```

I'm able to read these files with os, but I don't know how to extract the text correctly. Thanks for your help!

RE: Extract text - DeaD_EyE - Jun-27-2022

Example with a generator:
```python
word_start = "yes"
word_end = "no"
words = """yes
ok
ok
ok
no""".splitlines()


def split(sequence, start, end):
    start_found = False
    for element in sequence:
        # compare against the parameters, not the module-level names
        if element == start and not start_found:
            start_found = True
        elif element == end and start_found:
            return  # close generator
        elif start_found:
            yield element

    # this point is reached if the start or end was not found
    if start_found:
        # seen start, but no end
        raise ValueError(f"The end word '{end}' was not found in sequence")
    else:
        # seen no start in the whole sequence
        raise ValueError(f"The start word '{start}' was not found in sequence")


oks = list(split(words, word_start, word_end))
print(oks)
```

Here is the version that includes the start word and the stop word. It differs only slightly from the previous generator function.

```python
word_start = "yes"
word_end = "no"
words = """yes
ok
ok
ok
no""".splitlines()


def split(sequence, start, end):
    start_found = False
    for element in sequence:
        if element == start and not start_found:
            start_found = True
            yield element  # yield the start word
        elif element == end and start_found:
            yield element  # yield the end word
            return  # close generator
        elif start_found:
            yield element

    # this point is reached if the start or end was not found
    if start_found:
        # seen start, but no end
        raise ValueError(f"The end word '{end}' was not found in sequence")
    else:
        # seen no start in the whole sequence
        raise ValueError(f"The start word '{start}' was not found in sequence")


oks = list(split(words, word_start, word_end))
print(oks)
```

RE: Extract text - Gribouillis - Jun-27-2022

You could use the itertools module:

```python
import io
import itertools as itt

# in-memory stand-in for an opened text file
file = io.StringIO("""\
titi
titi
titi
yes
ok
ok
ok
no
totot
tototo
tot
""")


def takeuntil(pred, seq):
    # yield items up to and including the first one satisfying pred
    for elem in seq:
        yield elem
        if pred(elem):
            return


# skip everything before the 'yes' line, then stop after the 'no' line
seq = itt.dropwhile((lambda line: line != 'yes\n'), file)
seq = takeuntil((lambda line: line == 'no\n'), seq)
print(''.join(seq), end='')
```
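To apply either approach to real files rather than an in-memory string, something like the following might work. It is only a sketch: the helper name `extract_block`, the file name `sample.txt`, and the assumption that each marker sits alone on its own line are illustrative and not taken from the posts above.

```python
from pathlib import Path
import tempfile


def extract_block(lines, start="yes", end="no"):
    """Return the lines from `start` to `end` inclusive, or [] if incomplete."""
    block = []
    inside = False
    for line in lines:
        word = line.strip()  # drop the trailing newline before comparing
        if not inside and word == start:
            inside = True
        if inside:
            block.append(word)
            if word == end:
                return block
    return []  # start or end marker missing


# Demo on a throwaway file (hypothetical name); a real script could loop
# over the files of a folder with Path(folder).iterdir() instead.
with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "sample.txt"
    sample.write_text("titi\ntiti\nyes\nok\nok\nok\nno\ntotot\n")
    with sample.open() as handle:
        result = extract_block(handle)

print(result)  # ['yes', 'ok', 'ok', 'ok', 'no']
```

Unlike the generator above, this variant returns an empty list instead of raising when a marker is missing, which may be more convenient when looping over many files.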
RE: Extract text - ibreeden - Jun-27-2022

Also have a look at this thread: Extract a string between 2 words from a text file.

RE: Extract text - Gribouillis - Jun-27-2022

An even more functional version:

```python
from functools import partial
import io
from itertools import dropwhile
import operator
import sys


def equal(x):
    # predicate: item == x
    return partial(operator.eq, x)


def not_equal(x):
    # predicate: item != x
    return partial(operator.ne, x)


def takeuntil(pred, seq):
    # yield items up to and including the first one satisfying pred
    for elem in seq:
        yield elem
        if pred(elem):
            return


file = io.StringIO("""\
titi
titi
titi
yes
ok
ok
ok
no
totot
tototo
tot
""")

seq = takeuntil(equal('no\n'), dropwhile(not_equal('yes\n'), file))
sys.stdout.writelines(seq)
```
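A note on why `takeuntil` is written by hand in these posts: `itertools.takewhile` stops before the first element that fails its predicate, so the closing "no" line would be left out. The short comparison below illustrates the difference; the `lines` list is a made-up sample.

```python
from itertools import dropwhile, takewhile


def takeuntil(pred, seq):
    """Yield items up to and including the first one for which pred is true."""
    for elem in seq:
        yield elem
        if pred(elem):
            return


lines = ["titi\n", "yes\n", "ok\n", "ok\n", "no\n", "tot\n"]  # made-up sample

# Everything from the 'yes' line onwards
after_start = list(dropwhile(lambda line: line != "yes\n", lines))

with_takewhile = list(takewhile(lambda line: line != "no\n", after_start))
with_takeuntil = list(takeuntil(lambda line: line == "no\n", after_start))

print(with_takewhile)  # ['yes\n', 'ok\n', 'ok\n']
print(with_takeuntil)  # ['yes\n', 'ok\n', 'ok\n', 'no\n']
```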
RE: Extract text - rektcol - Jun-28-2022

First, I learned many tricks, and then I was able to debug it myself! Thanks a lot, guys, these solutions work well. Now imagine that my input contains a number, and the number is different in each file. For example, the first file is: """ titi 12 yes ok no eee """ and another file could be: """ titi 14 yes ok no eee """ How can I specify that this number changes? I mean, Python is searching for the exact text. Is it possible to indicate that part of a string may vary?

RE: Extract text - Gribouillis - Jun-28-2022

For varying text, you can use regular expressions:

```python
import re

# reuses dropwhile, takeuntil, equal, file and sys from the previous post
# the negative lookahead matches every line that is NOT '<digits> yes'
lines = dropwhile(re.compile(r'(?!^\d+\s+yes\s*$)').match, file)
lines = takeuntil(equal('no\n'), lines)
sys.stdout.writelines(lines)
```
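Putting the pieces together, a self-contained version of the regex approach might look like this. It assumes, as the pattern above implies, that the number and "yes" share one line (for example "12 yes"); the sample texts and the helper names `not_start` and `extract` are illustrative only.

```python
import io
import re
from itertools import dropwhile

# A number, some whitespace, then "yes" (the number may differ per file)
START = re.compile(r"^\d+\s+yes\s*$")


def not_start(line):
    """True for every line that is not the '<number> yes' marker."""
    return START.match(line) is None


def takeuntil(pred, seq):
    """Yield items up to and including the first one for which pred is true."""
    for elem in seq:
        yield elem
        if pred(elem):
            return


def extract(file):
    """Return the lines from the '<number> yes' line to the 'no' line."""
    lines = dropwhile(not_start, file)
    return list(takeuntil(lambda line: line == "no\n", lines))


# Assumed layouts of the two sample files
first = io.StringIO("titi\n12 yes\nok\nno\neee\n")
second = io.StringIO("titi\n14 yes\nok\nno\neee\n")

print(extract(first))   # ['12 yes\n', 'ok\n', 'no\n']
print(extract(second))  # ['14 yes\n', 'ok\n', 'no\n']
```

Using a positive pattern with an explicit `is None` check is a stylistic choice here; it behaves the same as the negative lookahead for these inputs and may be easier to read.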