Extract text - Printable Version

Extract text - rektcol - Jun-27-2022

Hello everyone,
I hope this is understandable.

I have to extract text from files in a directory. The files look like this:

"
titi
titi
titi
yes
ok
ok
ok
no
totot
tototo
tot
"

How can I extract the text between yes and no? I would like to get all of these 'ok' lines.

My output should be:

yes
ok
ok
ok
no

I'm able to read these files with os, but I don't know how to extract the text correctly...

Thanks for the help!

RE: Extract text - DeaD_EyE - Jun-27-2022

Example with a generator:

- Assign False to start_found.
- Iterate over the lines one by one (here, each line is a single word).
- When the start word is found, set start_found to True.
- Yield the element if start_found is True.
- Return from the generator when the element is the end word. This also leaves the for-loop.
- Optional exceptions:
  - If the for-loop finishes while start_found is still False, the start word was not found.
  - If the for-loop finishes and start_found is True, the end word was not found.

word_start = "yes"
word_end = "no"

words = """yes
ok
ok
ok
no""".splitlines()

def split(sequence, start, end):
    start_found = False

    for element in sequence:
        if element == start and not start_found:
            # first occurrence of the start word: begin collecting
            start_found = True
        elif element == end and start_found:
            # end word reached: close the generator
            return
        elif start_found:
            yield element

    # this point is reached only if the start or the end was not found
    if start_found:
        # seen start, but no end
        raise ValueError(f"The end word '{end}' was not found in sequence")
    else:
        # seen no start in the whole sequence
        raise ValueError(f"The start word '{start}' was not found in sequence")

oks = list(split(words, word_start, word_end))
print(oks)

Here is the version that includes the start word and the end word.
It differs only slightly from the previous generator function.

word_start = "yes"
word_end = "no"

words = """yes
ok
ok
ok
no""".splitlines()

def split(sequence, start, end):
    start_found = False

    for element in sequence:
        if element == start and not start_found:
            start_found = True
            yield element  # yield the start word
        elif element == end and start_found:
            yield element  # yield the end word
            return
            # close generator
        elif start_found:
            yield element

    # this point is reached only if the start or the end was not found
    if start_found:
        # seen start, but no end
        raise ValueError(f"The end word '{end}' was not found in sequence")
    else:
        # seen no start in the whole sequence
        raise ValueError(f"The start word '{start}' was not found in sequence")

oks = list(split(words, word_start, word_end))
print(oks)
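
To apply the same idea to real files, something like the following might work. This is only a sketch: the names extract_block and extract_from_directory, the "*.txt" pattern, and the UTF-8 encoding are assumptions, not part of the original question. Unlike the generator above, it strips line endings before comparing and returns an empty list instead of raising when the start word is missing.

```python
from pathlib import Path


def extract_block(lines, start, end):
    """Return the lines from `start` through `end`, inclusive.

    Hypothetical helper: returns [] if `start` never appears, and
    everything from `start` onward if `end` never appears.
    """
    block = []
    started = False
    for line in lines:
        text = line.rstrip("\n")
        if not started:
            if text != start:
                continue  # still before the start word
            started = True
        block.append(text)
        # len(block) > 1 keeps this correct even if start == end
        if text == end and len(block) > 1:
            break
    return block


def extract_from_directory(directory, start="yes", end="no", pattern="*.txt"):
    """Map each matching file name to its extracted block."""
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        with path.open(encoding="utf-8") as handle:
            results[path.name] = extract_block(handle, start, end)
    return results
```

A file object can be passed directly to extract_block because iterating over it yields one line at a time.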

RE: Extract text - Gribouillis - Jun-27-2022

You could use the itertools module:

import io
import itertools as itt

file = io.StringIO("""\
titi
titi
titi
yes
ok
ok
ok
no
totot
tototo
tot
""")

def takeuntil(pred, seq):
    for elem in seq:
        yield elem
        if pred(elem):
            return

seq = itt.dropwhile((lambda line: line != 'yes\n'), file)
seq = takeuntil((lambda line: line == 'no\n'), seq)

print(''.join(seq), end='')

Output:
yes
ok
ok
ok
no
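
A custom takeuntil is presumably used here because itertools.takewhile stops before the element that fails its predicate, so the closing 'no' line would be lost. A small sketch of the difference, using a list in place of the file (the names lines, from_yes, exclusive and inclusive are only for illustration):

```python
from itertools import dropwhile, takewhile

lines = ["titi\n", "yes\n", "ok\n", "ok\n", "no\n", "totot\n"]


def takeuntil(pred, seq):
    # Same helper as above: yield elements up to and including the first match.
    for elem in seq:
        yield elem
        if pred(elem):
            return


def from_yes():
    # Skip everything before the first 'yes' line.
    return dropwhile(lambda line: line != "yes\n", lines)


exclusive = list(takewhile(lambda line: line != "no\n", from_yes()))
inclusive = list(takeuntil(lambda line: line == "no\n", from_yes()))

print(exclusive)  # ['yes\n', 'ok\n', 'ok\n']
print(inclusive)  # ['yes\n', 'ok\n', 'ok\n', 'no\n']
```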

RE: Extract text - ibreeden - Jun-27-2022

Also have a look at this thread: Extract a string between 2 words from a text file.

RE: Extract text - Gribouillis - Jun-27-2022

An even more functional version

from functools import partial
import io
from itertools import dropwhile
import operator
import sys

def equal(x):
    return partial(operator.eq, x)

def not_equal(x):
    return partial(operator.ne, x)

def takeuntil(pred, seq):
    for elem in seq:
        yield elem
        if pred(elem):
            return

file = io.StringIO("""\
titi
titi
titi
yes
ok
ok
ok
no
totot
tototo
tot
""")

seq = takeuntil(equal('no\n'), dropwhile(not_equal('yes\n'), file))

sys.stdout.writelines(seq)

Output:
yes
ok
ok
ok
no

RE: Extract text - rektcol - Jun-28-2022

First, I learned many tricks, and then I was able to debug it myself!
Thanks a lot guys, these solutions work well.

Now imagine that my input contains a number, and the number is different for each file.
For example, the first file is:
"""
titi
12 yes
ok
no
eee
"""

and another file could look like:

"""
titi
14 yes
ok
no
eee
"""

How can I specify that this number changes? I mean, Python is searching for the exact text. Is it possible to indicate that part of the string may differ?

RE: Extract text - Gribouillis - Jun-28-2022

For varying text, you can use regular expressions

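import re

# Reuses dropwhile, takeuntil, equal, sys and file from the previous example.
# The negative lookahead matches every line except "<number> yes",
# so dropwhile skips lines until the start line is reached.
lines = dropwhile(re.compile(r'(?!^\d+\s+yes\s*$)').match, file)
lines = takeuntil(equal('no\n'), lines)
sys.stdout.writelines(lines)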

Output:
12 yes
ok
ok
ok
no
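
For reference, here is a self-contained sketch of the regular-expression approach, run against the two sample inputs from the question. It is a variation rather than the exact code above: it uses a positive pattern wrapped in a negating lambda instead of the negative lookahead, and the names START and extract are only illustrative.

```python
import io
import re
from itertools import dropwhile

# A number, some whitespace, then 'yes' at the end of the line.
START = re.compile(r"\d+\s+yes\s*$")


def takeuntil(pred, seq):
    # Yield elements up to and including the first one matching pred.
    for elem in seq:
        yield elem
        if pred(elem):
            return


def extract(lines):
    # Skip lines until one matches START, then keep lines through 'no'.
    lines = dropwhile(lambda line: not START.match(line), lines)
    return list(takeuntil(lambda line: line.strip() == "no", lines))


first = io.StringIO("titi\n12 yes\nok\nno\neee\n")
second = io.StringIO("titi\n14 yes\nok\nno\neee\n")

print(extract(first))   # ['12 yes\n', 'ok\n', 'no\n']
print(extract(second))  # ['14 yes\n', 'ok\n', 'no\n']
```

Because the pattern only requires digits before 'yes', the same code handles both files without knowing the number in advance.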