Python Forum (https://python-forum.io) - Forum: Web Scraping & Web Development - Thread: .txt return specific lines or strings (/thread-15909.html)
.txt return specific lines or strings - s_o_what - Feb-06-2019

I have a text file I would like to read and search. I need to find specific strings that could possibly be URLs. Once that is found I would like to store them in a list or set.

    # Target file to search
    target_file = 'randomtextfile.txt'

    # Open the target file in Read mode
    target_open = open(target_file, 'r')

    # Print one line. The first line
    print(target_open.readline())

Example .txt file:

    This is a file:
    Sample file that contains random urls.
    The goal of this is to extract the urls and place them in a list or set with python.
    Here is a random link ESPN.com
    Links will have multiple extensions but for the most part will be one or another.
    python.org
    mywebsite.net
    firstname.wtf
    creepy.onion
    How to find a link in the middle of line youtube.com for example

RE: .txt return specific lines or strings - ichabod801 - Feb-06-2019

You would want a for loop to go through the lines in the file, and the re module to build a regular expression to find the urls.

RE: .txt return specific lines or strings - s_o_what - Feb-06-2019

I know about the for loop but was hoping for a step in the right direction.

RE: .txt return specific lines or strings - perfringo - Feb-06-2019

Is this a step?

    results = []
    with open('randomtextfile.txt', 'r', encoding='UTF-8') as f:
        for line in f:
            # find whether line contains youtube link
            # and append link to results

RE: .txt return specific lines or strings - s_o_what - Feb-07-2019

This is what I have so far... I think my order might be messed up because it is printing the line following a line that contains a '.'

    # Target file to search
    target_file = 'randomtextfile.txt'

    # Open the target file in Read mode
    target_open = open(target_file, 'r')

    # Start loop. Only return possible url links.
    for line in target_open:
        if '.' in line:
            print(target_open.readline())

RESULTS:
RE: .txt return specific lines or strings - ichabod801 - Feb-07-2019

Yes, you are messing up your loop. Every time through the loop, Python reads a line and assigns it to the line variable. When you call readline on line 10, you skip ahead a line. You don't need to do that; you already have line, so just print(line).

A dot may or may not be part of a url, so you need to check for more than just a dot. I'm assuming this is homework and you haven't learned the re module yet. In that case I would suggest creating a list of top level domains, like '.org' and '.com'. Then for each line, loop through that list and check whether each top level domain is in the line.

RE: .txt return specific lines or strings - s_o_what - Feb-08-2019

I got it. Thanks

    import re

    # Target file to search
    target_file = 'randomtextfile.txt'

    # Open the target file in Read mode
    target_open = open(target_file, 'r')

    # Read the file
    read_file = target_open.read()

    # No idea what this does.......
    urls = re.findall(r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', read_file)

    # Print urls in list one at a time. Top to bottom.
    for i in urls:
        print(i)

RE: .txt return specific lines or strings - perfringo - Feb-08-2019

(Feb-08-2019, 02:34 AM)s_o_what Wrote: # No idea what this does.......

Every time I used regex I got a headache. Every time I returned to code I had written that contained regex, I got a major headache. So I changed my habits a bit and got rid of the major headache, but writing regex is still an un-pythonic pain. My brain is not fit to read patterns of any considerable size, so I write them very verbosely, step by step.
Just to give a simple example of how I make things comprehensible for myself (I really don't want to decompose the pattern in the code above): when I write a pattern, I write logical pieces and then compile them into one human-readable pattern (thank you, f-strings):

    import re

    optional_whitespace = r'[\s]*'
    protocol = r'http[s]*://'
    address = r'www[.]youtube[.]com/'
    youtube_url = re.compile(f'{optional_whitespace}{protocol}{address}')

When I return to this pattern in a couple of months, I still understand what I was trying to accomplish.

RE: .txt return specific lines or strings - snippsat - Feb-08-2019

(Feb-08-2019, 02:34 AM)s_o_what Wrote: No idea what this does.......

It takes the url address out of a link in HTML source code. A better way, always, when dealing with HTML/XML is to use a parser. Example:

urls.txt
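    from bs4 import BeautifulSoup

    # Parse the saved HTML with the lxml parser (lxml must be installed)
    soup = BeautifulSoup(open('urls.txt', encoding='utf-8'), 'lxml')
    # Collect every <a> tag that has an href attribute
    links = soup.find_all('a', href=True)
    for url in links:
        print(url.get('href'))

Web-Scraping part-1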
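
For readers who want to try the non-regex route suggested earlier in the thread (use the loop variable instead of calling readline again, and check each line against a list of top level domains), here is a minimal sketch. The function name find_candidates, the TLDS tuple, and the sample lines are assumptions made for illustration; they are not code posted by anyone in the thread, and the TLD list is nowhere near complete.

```python
# Hypothetical list of top level domains to look for; extend as needed.
TLDS = ('.com', '.org', '.net', '.wtf', '.onion')


def find_candidates(lines, tlds=TLDS):
    """Return the set of whitespace-separated words that end with a known TLD."""
    tlds = tuple(tlds)                     # str.endswith needs a str or a tuple
    found = set()
    for line in lines:                     # 'line' already holds the current line,
        for word in line.split():          # so no extra readline() call is needed
            cleaned = word.strip('.,;:!?()"\'')   # drop surrounding punctuation
            if cleaned.lower().endswith(tlds):
                found.add(cleaned)
    return found


# A few lines taken from the example file in the first post
sample = [
    'Here is a random link ESPN.com',
    'Links will have multiple extensions but for the most part will be one or another.',
    'python.org',
    'How to find a link in the middle of line youtube.com for example',
]
print(sorted(find_candidates(sample)))
# prints ['ESPN.com', 'python.org', 'youtube.com']
```

Because an open file object yields its lines when iterated, the same function should also accept the file directly, for example inside a with open('randomtextfile.txt', encoding='utf-8') as f: block, calling find_candidates(f).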
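
As for the "No idea what this does" comment on the regular expression in s_o_what's solution, one way to read it is to spell the same pattern out with re.VERBOSE so each piece carries its own comment. This is only a sketch: the name URL_PATTERN and the sample text are assumptions for illustration, and the `\/` escapes from the original are written as plain `/`, which matches the same thing.

```python
import re

# The pattern from the thread, split into commented pieces.
# In VERBOSE mode, whitespace and '#' comments outside character classes are ignored.
URL_PATTERN = re.compile(r"""
    (?:(?:https?|ftp)://)?   # optional scheme: http://, https:// or ftp://
    [\w/\-?=%.]+             # one or more word characters or any of / - ? = % .
    \.                       # a literal dot
    [\w/\-?=%.]+             # one or more of the same characters again
""", re.VERBOSE)

# A few lines taken from the example file in the first post
text = (
    "Here is a random link ESPN.com\n"
    "python.org\n"
    "creepy.onion\n"
    "How to find a link in the middle of line youtube.com for example\n"
)

# Only non-capturing groups are used, so findall returns the full matches.
urls = set(URL_PATTERN.findall(text))
print(sorted(urls))
# prints ['ESPN.com', 'creepy.onion', 'python.org', 'youtube.com']
```

Note that the pattern accepts anything shaped like "characters, dot, characters", so text such as 3.14 is matched as well. If that matters, filtering the results against a list of known top level domains is one possible follow-up.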