Posts: 4
Threads: 1
Joined: Feb 2019
I have a text file I would like to read and search. I need to find specific strings that could possibly be URLs. Once those are found I would like to store them in a list or set.
# Target file to search
target_file = 'randomtextfile.txt'
# Open the target file in Read mode
target_open = open(target_file, 'r')
# Print one line. The first line
print(target_open.readline())
Example .txt file:
This is a file:
Sample file that contains random urls. The goal of this
is to extract the urls and place them in a list or set
with python. Here is a random link ESPN.com
Links will have multiple extensions but for the most part
will be one or another.
python.org
mywebsite.net
firstname.wtf
creepy.onion
How to find a link in the middle of line youtube.com for example
Posts: 4,220
Threads: 97
Joined: Sep 2016
You would want a for loop to go through the lines in the file, and the re module to build a regular expression to find the urls.
Posts: 4
Threads: 1
Joined: Feb 2019
I know about the for loop but was hoping for a step in the right direction.
Posts: 1,950
Threads: 8
Joined: Jun 2018
Feb-06-2019, 04:00 AM
(This post was last modified: Feb-06-2019, 04:00 AM by perfringo.)
Is this a step?
results = []
with open('randomtextfile.txt', 'r', encoding='UTF-8') as f:
    for line in f:
        # find whether line contains youtube link
        # and append link to results
        if 'youtube.com' in line:
            results.append(line.strip())
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy
Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Posts: 4
Threads: 1
Joined: Feb 2019
This is what I have so far... I think my order might be messed up because it is printing the line following a line that contains a '.'
# Target file to search
target_file = 'randomtextfile.txt'
# Open the target file in Read mode
target_open = open(target_file, 'r')
# Start loop. Only return possible url links.
for line in target_open:
    if '.' in line:
        print(target_open.readline())
RESULTS:
Output: /Users/sheaonstott/PycharmProjects/PyCharm_Main/bin/python /Users/sheaonstott/PycharmProjects/PyCharm_Main/dreampy/txt_search.py
is to extract the urls and place them in a list or set
python.org
firstname.wtf
Process finished with exit code 0
Posts: 4,220
Threads: 97
Joined: Sep 2016
Yes, you are messing up your loop. Every time through the loop Python reads a line and assigns that value to the line variable. When you then call readline() inside the loop, you skip ahead an extra line. You don't need to do that; you already have line, so just print(line).
Now, a dot can be part of a URL or not, so you need to check for more than just a dot. I'm assuming this is homework, and you haven't learned the re module yet. In that case I would suggest creating a list of top-level domains, like '.org' and '.com'. Then for each line, loop through that list, checking whether each top-level domain is in the line, as in the sketch below.
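Roughly something like this (untested sketch; the tlds list and the possible_urls name are just placeholders to show the shape):
# list of top-level domains to look for (example set, extend it for your file)
tlds = ['.com', '.org', '.net', '.wtf', '.onion']

possible_urls = []
with open('randomtextfile.txt', 'r') as f:
    for line in f:
        for tld in tlds:
            if tld in line:
                possible_urls.append(line.strip())
                break  # one hit is enough for this line

print(possible_urls)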
Posts: 4
Threads: 1
Joined: Feb 2019
I got it. Thanks
import re
# Target file to search
target_file = 'randomtextfile.txt'
# Open the target file in Read mode
target_open = open(target_file, 'r')
# Read the file
read_file = target_open.read()
# No idea what this does.......
urls = re.findall('(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', read_file)
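# (roughly: an optional http/https/ftp scheme followed by '://', then a run of
#  word/url characters, a literal dot, and another run of word/url characters)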
# Print urls in list one at a time. Top to bottom.
for i in urls:
    print(i)
Posts: 1,950
Threads: 8
Joined: Jun 2018
(Feb-08-2019, 02:34 AM)s_o_what Wrote: # No idea what this does.......
urls = re.findall('(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', read_file)
Every time I used regex I got a headache. Every time I returned to code I had written containing regex, I got a major headache. So I changed my habits a bit and got rid of the major headache, but writing regex is still an un-pythonic pain.
My brain is not fit to read patterns of any considerable size, so I do it very verbosely, step by step. Just to give a simple example (I really don't want to decompose the pattern in the code above) of how I make things comprehensible for myself: when I write a pattern, I write the logical pieces and then compile them into one human-readable pattern (thank you, f-strings):
import re

optional_whitespace = r'[\s]*'
protocoll = r'http[s]*://'
address = r'www[.]youtube[.]com/'
youtube_url = re.compile(f'{optional_whitespace}{protocoll}{address}')
When I return to this pattern in a couple of months, I still understand what I was trying to accomplish.
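Checking a single line is then just a search on the compiled pattern; something like this (the sample line is made up):
line = 'random text with a link https://www.youtube.com/watch?v=abc123 in it'
if youtube_url.search(line):
    print('found a youtube link')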
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy
Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Posts: 7,319
Threads: 123
Joined: Sep 2016
Feb-08-2019, 11:49 AM
(This post was last modified: Feb-08-2019, 11:49 AM by snippsat.)
(Feb-08-2019, 02:34 AM)s_o_what Wrote: No idea what this does.......
urls = re.findall('(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', read_file)
It takes the URL address out of a link in HTML source code.
When dealing with HTML/XML, it's always better to use a parser.
Example:
urls.txt
Output: <a href="https://python-forum.io/" target="_blank">Visit Python Forum</a>
<li class="tier-2" role="treeitem"><a href="http://docs.python.org/devguide/">Developer Guide</a></li>
<a href="ftp://theftpserver.com/files/acounts.pdf">Download file</a>
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('urls.txt', encoding='utf-8'), 'lxml')
links = soup.find_all('a', href=True)
for url in links:
    print(url.get('href'))
Output: https://python-forum.io/
http://docs.python.org/devguide/
ftp://theftpserver.com/files/acounts.pdf
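lxml is a third-party package (pip install lxml); if it isn't installed, the same thing should work with the parser that ships with Python, e.g. (a small, untested variation on the code above):
from bs4 import BeautifulSoup

# 'html.parser' is in the standard library, so no extra install is needed
soup = BeautifulSoup(open('urls.txt', encoding='utf-8'), 'html.parser')
for url in soup.find_all('a', href=True):
    print(url.get('href'))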
Web-Scraping part-1