Python Forum

Full Version: .txt return specific lines or strings
I have a text file I would like to read and search. I need to find specific strings that could possibly be URLs. Once that is found I would like to store them in a list or set.

# Target file to search
target_file = 'randomtextfile.txt'

# Open the target file in Read mode
target_open = open(target_file, 'r')

# Print one line. The first line
print(target_open.readline())
Example .txt file:

This is a file:

Sample file that contains random urls. The goal of this
is to extract the urls and place them in a list or set
with python. Here is a random link ESPN.com

Links will have multiple extensions but for the most part
will be one or another.
python.org
mywebsite.net
firstname.wtf
creepy.onion

How do I find a link in the middle of a line, youtube.com for example?
You would want a for loop to go through the lines in the file, and the re module to build a regular expression to find the urls.
I know about the for loop but was hoping for a step in the right direction.
Is this a step?

import re

results = []

with open('randomtextfile.txt', 'r', encoding='UTF-8') as f:
    for line in f:
        # find whether the line contains a youtube link
        # and append the link to results
        match = re.search(r'\S*youtube\.com\S*', line)
        if match:
            results.append(match.group())
This is what I have so far... I think my order might be messed up because it is printing the line following a line that contains a '.'

# Target file to search
target_file = 'randomtextfile.txt'

# Open the target file in Read mode
target_open = open(target_file, 'r')

# Start loop. Only return possible url links.
for line in target_open:
    if '.' in line:
        print(target_open.readline())
RESULTS:

Output:
/Users/sheaonstott/PycharmProjects/PyCharm_Main/bin/python /Users/sheaonstott/PycharmProjects/PyCharm_Main/dreampy/txt_search.py
is to extract the urls and place them in a list or set
python.org
firstname.wtf

Process finished with exit code 0
Yes, you are messing up your loop. Each time through the loop, Python reads a line and assigns that value to the line variable. When you then call readline() inside the loop, you skip ahead a line. You don't need to do that; you already have line, so just print(line).

Now, a dot can appear in a url or elsewhere, so you need to check for more than just a dot. I'm assuming this is homework and you haven't learned the re module yet. In that case I would suggest creating a list of top-level domains, like '.org' and '.com'. Then for each line, loop through that list, checking whether each top-level domain is in the line.
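The top-level-domain approach described above might look something like this. This is just a sketch: the sample text stands in for randomtextfile.txt, and the tlds list is an example subset you would extend yourself.

```python
# Sample text standing in for randomtextfile.txt
text = """Here is a random link ESPN.com
python.org
mywebsite.net
How to find a link in the middle of line youtube.com for example
"""

# An example subset of top-level domains to check for
tlds = ['.com', '.org', '.net']

results = []
for line in text.splitlines():
    for tld in tlds:
        if tld in line:
            # keep only the word that actually ends with the TLD
            for word in line.split():
                if word.endswith(tld):
                    results.append(word)

print(results)
```

This avoids regular expressions entirely, at the cost of missing urls whose TLD is not in the list (like the .wtf and .onion examples above).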
I got it. Thanks

import re
# Target file to search
target_file = 'randomtextfile.txt'

# Open the target file in Read mode
target_open = open(target_file, 'r')

# Read the file
read_file = target_open.read()

# No idea what this does.......
urls = re.findall('(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', read_file)

# Print urls in list one at a time. Top to bottom.
for i in urls:
    print(i)
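Since the original goal was a list or set, the findall result can also be de-duplicated with a set. A small self-contained sketch, using the same pattern written as a raw string (which avoids invalid-escape warnings) and a made-up sample string:

```python
import re

# Sample text standing in for the file contents
text = 'Random link ESPN.com and python.org and ESPN.com again'

# Same pattern as above, as a raw string: an optional http(s)/ftp scheme,
# then url-ish characters, a dot, and more url-ish characters
pattern = r'(?:(?:https?|ftp)://)?[\w/\-?=%.]+\.[\w/\-?=%.]+'
urls = re.findall(pattern, text)

# a set removes duplicates; sorted() gives a stable order for display
unique_urls = sorted(set(urls))
print(unique_urls)
```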
(Feb-08-2019, 02:34 AM)s_o_what Wrote: # No idea what this does.......
urls = re.findall('(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', read_file)

Every time I used regex I got a headache. Every time I returned to code I had written containing regex, I got a major headache. So I changed my habits a bit and got rid of the major headache, but writing regex is still an un-pythonic pain.

My brain is not fit to read patterns of any considerable size, so I do it very verbosely, step by step. Just to give a simple example (I really don't want to decompose the pattern in the code above) of how I make things comprehensible for myself: when I write a pattern, I write the logical pieces and then compile them into one human-readable pattern (thank you, f-strings):

import re

optional_whitespace = r'[\s]*'
protocol = r'http[s]*://'
address = r'www[.]youtube[.]com/'

youtube_url = re.compile(f'{optional_whitespace}{protocol}{address}')
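A quick check of that compiled pattern against a sample line (the pieces are repeated here, with the protocol variable name spelled out, so the snippet runs on its own; the sample line is made up):

```python
import re

# Logical pieces, each a raw string
optional_whitespace = r'[\s]*'
protocol = r'http[s]*://'
address = r'www[.]youtube[.]com/'

# Compile the pieces into one pattern
youtube_url = re.compile(f'{optional_whitespace}{protocol}{address}')

# Search a sample line; the match includes any leading whitespace
line = 'watch this: https://www.youtube.com/watch?v=abc123'
match = youtube_url.search(line)
print(match.group().strip())
```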
When I return to this pattern in a couple of months, I understand what I was trying to accomplish.
(Feb-08-2019, 02:34 AM)s_o_what Wrote: No idea what this does.......
urls = re.findall('(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', read_file)
It pulls url addresses out of links in HTML source code.
When dealing with HTML/XML, a better way is always to use a parser.
Example:
urls.txt
Output:
<a href="https://python-forum.io/" target="_blank">Visit Python Forum</a>
<li class="tier-2" role="treeitem"><a href="http://docs.python.org/devguide/">Developer Guide</a></li>
<a href="ftp://theftpserver.com/files/acounts.pdf">Download file</a>

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('urls.txt', encoding='utf-8'), 'lxml')
links = soup.find_all('a', href=True)
for url in links:
    print(url.get('href'))
Output:
https://python-forum.io/
http://docs.python.org/devguide/
ftp://theftpserver.com/files/acounts.pdf
Web-Scraping part-1