Python Forum
Extract text
#1
Hello everyone,
I hope this is understandable.

I have to extract text from files in a directory. The files look like this:

"
titi
titi
titi
yes
ok
ok
ok
no
totot
tototo
tot
"

How can I extract the text between yes and no? I would like to get all of the 'ok' lines.

My output should be :

yes
ok
ok
ok
no

I'm able to read these files with os, but I don't know how to extract the text correctly...

Thanks for the help!
#2
Example with a generator:
  1. Assign False to start_found.
  2. Iterate over the sequence element by element (here, line by line).
  3. If the start word is found, set start_found to True.
  4. Yield the element if start_found is True.
  5. Return from the generator when the element is the end word. This also leaves the for-loop.
  6. Optional exceptions:
    • If the for-loop finishes and start_found is still False, the start word was not found.
    • If the for-loop finishes and start_found is True, the end word was not found.

word_start = "yes"
word_end = "no"

words = """yes
ok
ok
ok
no""".splitlines()

def split(sequence, start, end):
    start_found = False

    for element in sequence:
        # compare against the parameters, not the module-level names
        if element == start and not start_found:
            start_found = True
        elif element == end and start_found:
            # close generator
            return
        elif start_found:
            yield element

    # this point is reached only if the start or the end was not found
    if start_found:
        # seen start, but no end
        raise ValueError(f"The end word '{end}' was not found after '{start}'")
    else:
        # seen no start in the whole sequence
        raise ValueError(f"The start word '{start}' was not found in sequence")

oks = list(split(words, word_start, word_end))
print(oks)
Here is the version that includes the start word and the end word.
It differs only slightly from the previous generator function.
word_start = "yes"
word_end = "no"

words = """yes
ok
ok
ok
no""".splitlines()

def split(sequence, start, end):
    start_found = False

    for element in sequence:
        # compare against the parameters, not the module-level names
        if element == start and not start_found:
            start_found = True
            yield element  # yield the start word
        elif element == end and start_found:
            yield element  # yield the end word
            # close generator
            return
        elif start_found:
            yield element

    # this point is reached only if the start or the end was not found
    if start_found:
        # seen start, but no end
        raise ValueError(f"The end word '{end}' was not found after '{start}'")
    else:
        # seen no start in the whole sequence
        raise ValueError(f"The start word '{start}' was not found in sequence")

oks = list(split(words, word_start, word_end))
print(oks)
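To apply the same idea to every file in a directory, something like the following might work. It is only a sketch: the function names extract_block and extract_from_directory are made up for this example, the files are assumed to be UTF-8 text, and unlike the generator above it silently returns whatever it found instead of raising when a marker is missing.

```python
from pathlib import Path
import tempfile

def extract_block(lines, start, end):
    """Yield lines from the first `start` line through the next `end` line, inclusive."""
    found = False
    for line in lines:
        line = line.rstrip("\n")  # file lines keep their trailing newline
        if not found:
            if line == start:
                found = True
                yield line
        else:
            yield line
            if line == end:
                return

def extract_from_directory(directory, start="yes", end="no"):
    """Map each file name in `directory` to the block extracted from it."""
    results = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            with path.open(encoding="utf-8") as handle:
                results[path.name] = list(extract_block(handle, start, end))
    return results

# Demo with a temporary directory holding one sample file (hypothetical name).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "sample.txt").write_text(
        "titi\nyes\nok\nok\nok\nno\ntotot\n", encoding="utf-8"
    )
    demo = extract_from_directory(tmp)

print(demo)
```

Replace the temporary directory with the real path of your files.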
#3
You could use the itertools module:
import io
import itertools as itt

file = io.StringIO("""\
titi
titi
titi
yes
ok
ok
ok
no
totot
tototo
tot
""")

def takeuntil(pred, seq):
    for elem in seq:
        yield elem
        if pred(elem):
            return

seq = itt.dropwhile((lambda line: line != 'yes\n'), file)
seq = takeuntil((lambda line: line == 'no\n'), seq)

print(''.join(seq), end='')
Output:
yes
ok
ok
ok
no
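One thing to watch: comparing against 'yes\n' and 'no\n' only matches lines that end in exactly one "\n". If the lines might end in "\r\n", or the last line has no newline at all, a possible variation is to strip the line endings before comparing. This is a sketch, and the extract helper is a name invented for the example.

```python
from itertools import dropwhile

def takeuntil(pred, seq):
    """Yield elements up to and including the first one for which pred is true."""
    for elem in seq:
        yield elem
        if pred(elem):
            return

def extract(lines, start="yes", end="no"):
    # Strip line endings first so the comparison does not depend on them.
    stripped = (line.rstrip("\r\n") for line in lines)
    tail = dropwhile(lambda line: line != start, stripped)
    return list(takeuntil(lambda line: line == end, tail))

# Lines with Windows-style endings, and a last line without any newline.
lines = ["titi\r\n", "yes\r\n", "ok\r\n", "no\r\n", "tot"]
result = extract(lines)
print(result)  # ['yes', 'ok', 'no']
```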
#4
Also have a look at this thread: Extract a string between 2 words from a text file.
#5
An even more functional version
from functools import partial
import io
from itertools import dropwhile
import operator
import sys

def equal(x):
    return partial(operator.eq, x)

def not_equal(x):
    return partial(operator.ne, x)

def takeuntil(pred, seq):
    for elem in seq:
        yield elem
        if pred(elem):
            return

file = io.StringIO("""\
titi
titi
titi
yes
ok
ok
ok
no
totot
tototo
tot
""")

seq = takeuntil(equal('no\n'), dropwhile(not_equal('yes\n'), file))

sys.stdout.writelines(seq)
Output:
yes
ok
ok
ok
no
#6
First, I learned many tricks, and then I was able to debug my code myself!
Thanks a lot guys, these solutions work well.

Now imagine that my input contains a number, and the number is different for each file.
For example, the first file is:
"""
titi
12 yes
ok
no
eee
"""

and another file could look like this:

"""
titi
14 yes
ok
no
eee
"""

How can I specify that this number changes? I mean, Python searches for an exact match. Is it possible to indicate that part of the string may vary?
#7
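For varying text, you can use regular expressions:
import re

# dropwhile, takeuntil, equal, sys and file are the same as in post #5
# the negative lookahead matches every line that is NOT "<number> yes",
# so dropwhile skips lines until the first "<number> yes" line
lines = dropwhile(re.compile(r'(?!^\d+\s+yes\s*$)').match, file)
lines = takeuntil(equal('no\n'), lines)
sys.stdout.writelines(lines)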
Output:
12 yes
ok
ok
ok
no
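If the negative lookahead is hard to read, a self-contained alternative is to test each line with re.fullmatch and negate the result in plain Python. This is a sketch using the two sample inputs from post #6; the names START and extract are invented for the example, and the pattern assumes the start line is always digits, whitespace, then "yes".

```python
import re
from itertools import dropwhile

# One or more digits, whitespace, then "yes" (trailing whitespace allowed).
START = re.compile(r"\d+\s+yes\s*")

def takeuntil(pred, seq):
    """Yield elements up to and including the first one for which pred is true."""
    for elem in seq:
        yield elem
        if pred(elem):
            return

def extract(text):
    lines = text.splitlines()
    # Skip lines until one matches the start pattern in full.
    tail = dropwhile(lambda line: START.fullmatch(line) is None, lines)
    return list(takeuntil(lambda line: line == "no", tail))

first = extract("titi\n12 yes\nok\nno\neee\n")
second = extract("titi\n14 yes\nok\nno\neee\n")
print(first)   # ['12 yes', 'ok', 'no']
print(second)  # ['14 yes', 'ok', 'no']
```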