Python Forum

Full Version: Extract text
Hello everyone,
I hope this is understandable.

I have to extract text from files in a directory. The files look like this:

"
titi
titi
titi
yes
ok
ok
ok
no
totot
tototo
tot
"

How can I extract the text between yes and no? I would like to get all of these 'ok' lines.

My output should be :

yes
ok
ok
ok
no

I'm able to read these files with os, but I don't know how to extract the text correctly...

Thanks for the help! :)
Example with a generator:
  1. assign False to start_found
  2. iterate over the sequence element by element (here, line by line)
  3. if the element is the start word, set start_found to True
  4. yield the element if start_found is True
  5. return from the generator when the element is the end word; this also leaves the for-loop
  6. optional exceptions:
    • if the for-loop finishes and start_found is still False, the start word was not found
    • if the for-loop finishes and start_found is True, the end word was not found

word_start = "yes"
word_end = "no"

words = """yes
ok
ok
ok
no""".splitlines()

def split(sequence, start, end):
    start_found = False

    for element in sequence:
        # compare against the parameters, not the module-level names
        if element == start and not start_found:
            start_found = True
        elif element == end and start_found:
            return
            # close generator
        elif start_found:
            yield element

    # this point is reached, if the start or end was not found
    if start_found:
        # seen start, but no end
        raise ValueError(f"The end word '{end}' was not found after '{start}' in sequence")
    else:
        # seen no start in the whole sequence
        raise ValueError(f"The start word '{start}' was not found in sequence")

oks = list(split(words, word_start, word_end))
print(oks)
Here is the version that includes the start word and the end word in the result.
It differs only slightly from the previous generator function.
word_start = "yes"
word_end = "no"

words = """yes
ok
ok
ok
no""".splitlines()

def split(sequence, start, end):
    start_found = False

    for element in sequence:
        # compare against the parameters, not the module-level names
        if element == start and not start_found:
            start_found = True
            yield element # yield the start word
        elif element == end and start_found:
            yield element # yield the end word
            return
            # close generator
        elif start_found:
            yield element

    # this point is reached, if the start or end was not found
    if start_found:
        # seen start, but no end
        raise ValueError(f"The end word '{end}' was not found after '{start}' in sequence")
    else:
        # seen no start in the whole sequence
        raise ValueError(f"The start word '{start}' was not found in sequence")

oks = list(split(words, word_start, word_end))
print(oks)
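To get from a list of words back to the original problem (several files in one directory), something like the following sketch might work. The names extract_between and extract_from_dir, the "*.txt" pattern, and the UTF-8 encoding are assumptions made for illustration; they are not from the posts above. Unlike the generator above, this variant does not raise when the start or end word is missing; it simply returns what it found.

```python
from pathlib import Path


def extract_between(lines, start, end):
    """Yield lines from the first `start` up to and including the next `end`."""
    started = False
    for line in lines:
        if not started and line == start:
            started = True
        if started:
            yield line
            if line == end:
                return


def extract_from_dir(directory, start="yes", end="no", pattern="*.txt"):
    """Map each matching file name to its extracted block (hypothetical helper)."""
    result = {}
    for path in sorted(Path(directory).glob(pattern)):
        # splitlines() drops the line endings, so plain words can be compared
        lines = path.read_text(encoding="utf-8").splitlines()
        result[path.name] = list(extract_between(lines, start, end))
    return result
```

Calling extract_from_dir("some_directory") would then return a dict with one list per file; a file without a start word maps to an empty list.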
You could use the itertools module:
import io
import itertools as itt

file = io.StringIO("""\
titi
titi
titi
yes
ok
ok
ok
no
totot
tototo
tot
""")

def takeuntil(pred, seq):
    for elem in seq:
        yield elem
        if pred(elem):
            return

seq = itt.dropwhile((lambda line: line != 'yes\n'), file)
seq = takeuntil((lambda line: line == 'no\n'), seq)

print(''.join(seq), end='')
Output:
yes
ok
ok
ok
no
Also have a look at this thread: Extract a string between 2 words from a text file.
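One detail worth noting: lines read from a file keep their trailing newline, which is why the predicates compare against 'yes\n' and 'no\n'. If the last line of a file has no newline, or the file uses Windows line endings, that exact comparison can fail. A possible variant, sketched under the assumption that trailing whitespace is not significant, strips each line before comparing. The name between is hypothetical.

```python
import io
from itertools import dropwhile


def takeuntil(pred, seq):
    # Yield elements up to and including the first one that satisfies pred
    for elem in seq:
        yield elem
        if pred(elem):
            return


def between(lines, start, end):
    # Strip trailing whitespace (including "\n" and "\r\n") before comparing
    stripped = (line.rstrip() for line in lines)
    seq = dropwhile(lambda line: line != start, stripped)
    return takeuntil(lambda line: line == end, seq)


file = io.StringIO("titi\nyes\nok\nok\nno")  # no newline after the last line
print("\n".join(between(file, "yes", "no")))
```

Because the lines are stripped, the result has to be joined with "\n" again when printing.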
An even more functional version
from functools import partial
import io
from itertools import dropwhile
import operator
import sys

def equal(x):
    return partial(operator.eq, x)

def not_equal(x):
    return partial(operator.ne, x)

def takeuntil(pred, seq):
    for elem in seq:
        yield elem
        if pred(elem):
            return

file = io.StringIO("""\
titi
titi
titi
yes
ok
ok
ok
no
totot
tototo
tot
""")

seq = takeuntil(equal('no\n'), dropwhile(not_equal('yes\n'), file))

sys.stdout.writelines(seq)
Output:
yes
ok
ok
ok
no
First, I learned many tricks, and then I was able to debug it myself!
Thanks a lot, guys, these solutions work well.

Now imagine that my input contains a number, and the number is different for each file.
For example, the first file is:
"""
titi
12 yes
ok
no
eee
"""

and another file could look like:

"""
titi
14 yes
ok
no
eee
"""

How can I specify that these numbers change? I mean, Python is searching for the exact text. Is it possible to indicate that part of the string may differ?
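For varying text, you can use regular expressions:
import re
import sys
from itertools import dropwhile

# takeuntil() and equal() are the helpers from the previous post;
# file is the open text file (or io.StringIO) as before.
# The negative lookahead matches every line that is NOT "<number> yes",
# so dropwhile skips lines until the first "<number> yes" line.
lines = dropwhile(re.compile(r'(?!^\d+\s+yes\s*$)').match, file)
lines = takeuntil(equal('no\n'), lines)
sys.stdout.writelines(lines)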
Output:
12 yes
ok
ok
ok
no
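Put together as a self-contained sketch, assuming the start line always has the form "<number> yes" and the end line is exactly "no". The function name extract_block and the choice of fullmatch on stripped lines (instead of the negative lookahead above) are made for this illustration only.

```python
import io
import re

# One or more digits, whitespace, then "yes" (assumed format of the start line)
START = re.compile(r"\d+\s+yes")


def extract_block(lines, start=START, end="no"):
    """Yield lines from the first one matching `start` through the first `end`."""
    started = False
    for raw in lines:
        line = raw.rstrip()  # drop the trailing newline before matching
        if not started and start.fullmatch(line):
            started = True
        if started:
            yield line
            if line == end:
                return


for text in ("titi\n12 yes\nok\nno\neee\n", "titi\n14 yes\nok\nno\neee\n"):
    print(list(extract_block(io.StringIO(text))))
```

The same compiled pattern handles both files, since only the number differs between them.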