Python Forum
.txt return specific lines or strings
#1
I have a text file I would like to read and search. I need to find specific strings that could be URLs. Once they are found, I would like to store them in a list or set.

# Target file to search
target_file = 'randomtextfile.txt'

# Open the target file in read mode; the with block closes it afterwards
with open(target_file, 'r') as target_open:
    # Print one line: the first line
    print(target_open.readline())
Example .txt file:

This is a file:

Sample file that contains random urls. The goal of this
is to extract the urls and place them in a list or set
with python. Here is a random link ESPN.com

Links will have multiple extensions but for the most part
will be one or another.
python.org
mywebsite.net
firstname.wtf
creepy.onion

How to find a link in the middle of line youtube.com for example
#2
You would want a for loop to go through the lines in the file, and the re module to build a regular expression to find the urls.
Craig "Ichabod" O'Brien - xenomind.com
I wish you happiness.
Recommended Tutorials: BBCode, functions, classes, text adventures
#3
I know about the for loop but was hoping for a step in the right direction.
#4
Is this a step?

results = []

with open('randomtextfile.txt', 'r', encoding='UTF-8') as f:
    for line in f:
        # find whether line contains youtube link
        # and append link to results
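One way to fill in that skeleton, as a minimal sketch: split each line into words and keep any word that contains 'youtube.com'. The helper name find_links, the substring test, and the sample lines are assumptions for illustration, not the poster's code.

```python
def find_links(lines, needle='youtube.com'):
    """Return every whitespace-separated word that contains `needle`."""
    results = []
    for line in lines:
        for word in line.split():
            if needle in word:
                results.append(word)
    return results


# Hypothetical sample lines standing in for randomtextfile.txt
sample = [
    'How to find a link in the middle of line youtube.com for example',
    'python.org',
]
print(find_links(sample))
```

With a real file, the open file object can be passed directly, since iterating over it yields lines: find_links(f) inside the with block.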
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
#5
This is what I have so far... I think my order might be messed up because it is printing the line following a line that contains a '.'

# Target file to search
target_file = 'randomtextfile.txt'

# Open the target file in Read mode
target_open = open(target_file, 'r')

# Start loop. Only return possible url links.
for line in target_open:
    if '.' in line:
        print(target_open.readline())
RESULTS:

Output:
/Users/sheaonstott/PycharmProjects/PyCharm_Main/bin/python /Users/sheaonstott/PycharmProjects/PyCharm_Main/dreampy/txt_search.py
is to extract the urls and place them in a list or set
python.org
firstname.wtf
Process finished with exit code 0
#6
Yes, you are messing up your loop. Every time through the loop Python reads a line and associates that value with the line variable. When you do readline on line 10, you are skipping ahead a line. You don't need to do that, you already have line, so just print(line).
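A minimal sketch of that fix, assuming a hypothetical helper named lines_with_dot; io.StringIO stands in for the opened file so the example runs without randomtextfile.txt.

```python
import io


def lines_with_dot(file_obj):
    """Return the lines that contain a '.', without skipping any."""
    matches = []
    for line in file_obj:
        if '.' in line:
            # Use the line the loop already read; calling readline()
            # here would consume the *next* line instead.
            matches.append(line.rstrip('\n'))
    return matches


# Hypothetical stand-in for the opened text file
sample = io.StringIO(
    'with python. Here is a random link ESPN.com\n'
    '\n'
    'python.org\n'
    'no dots here\n'
)
for match in lines_with_dot(sample):
    print(match)
```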

Now, a dot can appear in a URL, but it can also appear elsewhere (at the end of a sentence, for example), so you need to check for more than just a dot. I'm assuming this is homework, and you haven't learned the re module yet. In that case I would suggest creating a list of top-level domains, like '.org' and '.com'. Then for each line, loop through that list, checking whether each top-level domain is in the line.
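The top-level-domain idea might look roughly like this. It is a sketch under assumptions: the TLDS list, the find_urls name, and the choice to test each word's ending (rather than the whole line) are illustrative, not something the thread prescribes.

```python
# Assumed list of top-level domains; extend it to suit the file being searched
TLDS = ['.com', '.org', '.net', '.wtf', '.onion']


def find_urls(lines, tlds=TLDS):
    """Collect words that end with one of the given top-level domains."""
    found = []
    for line in lines:
        for word in line.split():
            # Strip trailing punctuation so 'python.org,' still matches
            candidate = word.rstrip('.,;:!?')
            for tld in tlds:
                if candidate.lower().endswith(tld):
                    found.append(candidate)
                    break  # one match per word is enough
    return found


# Hypothetical sample lines taken from the example file in the first post
sample = [
    'with python. Here is a random link ESPN.com',
    'will be one or another.',
    'python.org',
    'How to find a link in the middle of line youtube.com for example',
]
print(find_urls(sample))
```

Checking word endings avoids treating a sentence-ending dot as a link, at the cost of missing any domain whose extension is not in the list.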
#7
I got it. Thanks

import re
# Target file to search
target_file = 'randomtextfile.txt'

# Open the target file in read mode and read its whole contents;
# the with block closes the file afterwards
with open(target_file, 'r') as target_open:
    read_file = target_open.read()

# No idea what this does.......
urls = re.findall('(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', read_file)

# Print urls in list one at a time. Top to bottom.
for i in urls:
    print(i)
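For readers wondering what that pattern does, here is the same expression spread out with re.VERBOSE and annotated piece by piece. This is a sketch: the name URL_PATTERN and the sample text are assumptions. The raw string avoids invalid-escape warnings, and \/ is written as / because the slash needs no escaping in Python.

```python
import re

URL_PATTERN = re.compile(r"""
    (?:(?:https?|ftp)://)?   # optional scheme: http://, https:// or ftp://
    [\w/\-?=%.]+             # one or more word chars or / - ? = % .
    \.                       # a literal dot
    [\w/\-?=%.]+             # one or more of the same characters again
""", re.VERBOSE)

# Hypothetical sample text
sample = 'Here is a random link ESPN.com\nwill be one or another.\n'
print(URL_PATTERN.findall(sample))
```

Because the pattern only requires "something, a dot, something", it will also match file names and decimal numbers, so the results may need filtering.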
#8
(Feb-08-2019, 02:34 AM)s_o_what Wrote: # No idea what this does.......
urls = re.findall('(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', read_file)

Every time I used regex I got a headache. Every time I returned to code I had written that contained regex, I got a major headache. So I changed my habits a bit and got rid of the major headache, but writing regex is still an un-Pythonic pain.

My brain is not fit to read patterns of any considerable size, so I write them very verbosely, step by step. To give a simple example of how I make things comprehensible for myself (I really don't want to decompose the pattern in the code above): when I write a pattern, I write the logical pieces separately and then compile them into one human-readable pattern (thank you, f-strings):

import re

# Build the pattern from small, named pieces (raw strings keep backslashes literal)
optional_whitespace = r'\s*'
protocol = r'https?://'
address = r'www[.]youtube[.]com/'

youtube_url = re.compile(f'{optional_whitespace}{protocol}{address}')
When I return to this pattern in a couple of months, I still understand what I was trying to accomplish.
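A short usage sketch of a pattern composed this way, written with raw strings; the sample text and the matches name are assumptions for illustration.

```python
import re

# Named pieces, joined into one pattern with an f-string
optional_whitespace = r'\s*'
protocol = r'https?://'
address = r'www[.]youtube[.]com/'
youtube_url = re.compile(f'{optional_whitespace}{protocol}{address}')

# Hypothetical sample text
text = 'intro https://www.youtube.com/watch and http://www.youtube.com/ but not youtube.com'

# finditer yields match objects; strip() drops the optional leading whitespace
matches = [m.group().strip() for m in youtube_url.finditer(text)]
print(matches)
```

The bare youtube.com at the end is not matched, because this pattern requires both a protocol and the www prefix.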
#9
(Feb-08-2019, 02:34 AM)s_o_what Wrote: No idea what this does.......
urls = re.findall('(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', read_file)
It pulls URL-like strings out of text, for example the addresses inside links in HTML source code.
When dealing with HTML/XML, a better way is always to use a parser.
Example:
urls.txt
Output:
<a href="https://python-forum.io/" target="_blank">Visit Python Forum</a>
<li class="tier-2" role="treeitem"><a href="http://docs.python.org/devguide/">Developer Guide</a></li>
<a href="ftp://theftpserver.com/files/acounts.pdf">Download file</a>

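from bs4 import BeautifulSoup

# Parse the file with the lxml parser (requires the lxml package)
with open('urls.txt', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')

# Collect every <a> tag that has an href attribute
links = soup.find_all('a', href=True)
for url in links:
    print(url.get('href'))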
Output:
https://python-forum.io/
http://docs.python.org/devguide/
ftp://theftpserver.com/files/acounts.pdf
Web-Scraping part-1