Python Forum

Full Version: Read input file and print hyperlinks
Hello everybody, sorry about my last post; it did not show the picture.
Edit admin:
No problem, just use the "Insert Python tag" button.

I am new to Python and I am trying to write a program that prompts for an input file, reads it, and prints the hyperlinks it contains
together with the text that follows each hyperlink. For example, if the file contains the link:

"<a href="http://python-forum.io/search.php?action=unreads">Unread Posts</a>" 
The printed output should be:
Output:
http://python-forum.io/search.php?action=unreads     Unread Posts
You can take a look at my tutorial here: Web-Scraping part-1.
Thank you for the reply,

I just have difficulty making it work for files that are stored on my computer.
What have you tried so far? Please post the code you've written. I would suggest starting with a small file, perhaps 3 or 4 lines. To make it easy, make sure the file is in the same location as your script. Your script should start off simple as well: open the file, read a line, write it to the screen, go back and read the next line, write it to the screen, and so on. Once that runs without errors, start refining your script.
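As a rough sketch of that first step, something like the following reads a file one line at a time and prints each line. The function name print_lines, the file name sample.txt and its three lines are made up for illustration; the sample file is created in a temporary folder only so the snippet runs on its own.

```python
import os
import tempfile


def print_lines(path):
    """Print each line of the file at `path` and return how many were read."""
    count = 0
    with open(path) as f:
        for line in f:  # iterating over a file object yields one line at a time
            print(line.rstrip("\n"))
            count += 1
    return count


# Build a small 3-line sample file; the name and contents are made up.
sample_dir = tempfile.mkdtemp()
sample_path = os.path.join(sample_dir, "sample.txt")
with open(sample_path, "w") as f:
    f.write("first line\n")
    f.write('<a href="http://python-forum.io/search.php?action=unreads">Unread Posts</a>\n')
    f.write("last line\n")

lines_read = print_lines(sample_path)
```

With your own file you would only need the function and a call such as print_lines('your_file.txt').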
Here is an example using the line you posted.
from bs4 import BeautifulSoup

# Read the whole HTML file from disk
with open('html_from_disk.txt') as f:
    html = f.read()

soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a')  # first <a> tag in the file
text = link.text
print(text) #--> Unread Posts
print(link.get('href')) #--> http://python-forum.io/search.php?action=unreads
Hello, and thank you for the valuable help,

With this code I managed to print all the hyperlinks on separate lines, but I still can't find how to also print the text that follows each hyperlink.
Could I add a prompt to the above code so that the user can give me the input file?

I tried to add this:
test=raw_input('Enter a filename: ')

with open('test') as f: 
but it does not work.
You cannot put quotes around test.
With quotes, 'test' is just the string test, so open() looks for a file literally named test instead of using your variable.

Here it is with a better variable name.
file_name = raw_input('Enter a filename: ')

with open(file_name) as f:
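To make the prompt idea concrete, here is a minimal sketch. One thing to be aware of: raw_input only exists in Python 2, and in Python 3 the same function is called input, so the snippet picks whichever is available. The function name read_prompted_file, its ask_fn parameter and the sample file are made up for illustration; the lambda simply stands in for a user typing the path.

```python
import os
import tempfile

try:
    ask = raw_input  # Python 2
except NameError:
    ask = input      # Python 3: raw_input was renamed to input


def read_prompted_file(ask_fn=ask):
    """Ask for a file name and return the contents of that file."""
    file_name = ask_fn('Enter a filename: ')
    # No quotes here: open() receives the value the user typed.
    with open(file_name) as f:
        return f.read()


# Demonstration without typing: a made-up sample file and a stand-in
# for the prompt that "answers" with the path of that file.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.html')
with open(sample_path, 'w') as f:
    f.write('<a href="http://python-forum.io">Forum</a>')

html = read_prompted_file(lambda prompt: sample_path)
print(html)
```

In a real script you would just call read_prompted_file() and type the file name when asked.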
I managed to make it work with this code

from bs4 import BeautifulSoup
file = raw_input('Type file path: ')
with open(file) as f:
    html = f.read()
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
    print(link.get_text())
but I still get links that I do not want, like the links from img tags. Is there any way to exclude them from the printed output?
Quote:<img src="http://www.ekdd.gr/ekdda/custom/seminars/bullet_green.png"><a>test1</a></div>
<img src="http://www.ekdd.gr/ekdda/custom/seminars/bullet_red.png"><a>test2</a></div>
From the above I get:
None test1
None test2
You must learn not to use the quote tag for code;
I have fixed all your posts.
In the editor there is an "Insert python tag" button to the right of the "Insert quote" button.

This works, but it can be shortened:
print(link.get_text())
# Can be written as
print(link.text)
Quote:but I still get links that I do not want, like the links from img tags, is there any way to exclude them from print?
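# The None values come from <a> tags that have no href attribute;
# href=True keeps only the <a> tags that have one.
for link in soup.find_all('a', href=True):
    print(link.get('href'))
    print(link.text)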
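Putting the pieces of this thread together, a possible sketch that prints each link and its text on one line, as asked in the first post, could look like this. The function name extract_links and the sample string are made up for illustration; the sample is assembled from the HTML snippets posted above. The None values in the earlier output come from <a> tags without an href attribute, so the sketch skips those.

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return (href, text) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, 'html.parser')
    # find_all('a', ...) only looks at <a> tags, and href=True keeps only
    # those that carry an href attribute, so <a>test1</a> is skipped.
    return [(a['href'], a.get_text(strip=True))
            for a in soup.find_all('a', href=True)]


# Sample HTML assembled from the snippets posted in this thread.
sample = (
    '<img src="http://www.ekdd.gr/ekdda/custom/seminars/bullet_green.png">'
    '<a>test1</a>'
    '<a href="http://python-forum.io/search.php?action=unreads">Unread Posts</a>'
)

links = extract_links(sample)
for href, text in links:
    # One line per link: the URL, some spacing, then the link text.
    print(href + '     ' + text)
```

For a file on disk you would read it first, as in the code above, and pass the resulting string to extract_links.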