Python Forum
Please Help With Syntax - New To Python 3
#1
I am new to Python 3, old to BASIC and VB. I have a text list of URLs and I want to increment through the list and simply save the HTML of each URL as a text file and NOT open the html in a browser.

I wrote this script at work and will test it when I get home, but I wanted to have the script syntax checked before that.

Thank you in advance.

import urllib2
file_path = 'C:\Users\<username>\Desktop\myurls.txt'
with open(file_path) as file:
	myurl = file.readline()
    urllib.urlretrieve(myurl, myurl & ".txt")
#2
Backslashes in a string can be problematic because they may be interpreted as the start of an escape sequence. Here the problem is \U, which starts a Unicode escape sequence, so Python 3 raises a SyntaxError on '\Users'. Use a raw string to prevent this.

file_path = r'C:\Users\<username>\Desktop\myurls.txt'
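As a small sketch of the alternatives to a plain string: a raw string, doubled backslashes and forward slashes all describe the same Windows path. The user name "me" below is only a placeholder, not a real account.

```python
from pathlib import PureWindowsPath

# "me" is a placeholder user name, assumed for illustration only.
raw = r'C:\Users\me\Desktop\myurls.txt'         # raw string: backslashes kept literally
escaped = 'C:\\Users\\me\\Desktop\\myurls.txt'  # each backslash escaped by hand
forward = 'C:/Users/me/Desktop/myurls.txt'      # forward slashes also work on Windows

# The raw and the escaped spelling are the same string.
same_string = (raw == escaped)

# PureWindowsPath normalises the separators, so both spellings name the same path.
same_path = (PureWindowsPath(raw) == PureWindowsPath(forward))

print(same_string, same_path)
```

Any of the three spellings can be passed to open(); the raw string is usually the least typing.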

When you open a file you should specify whether you want to read, write or append. open() defaults to read mode ('r'), but stating the mode makes the intent clear.

There is no module named urllib2 in Python 3. It was split into urllib.request and urllib.error. Read about how to use urllib.request.

https://docs.python.org/3/library/urllib.request.html
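For what it's worth, here is a rough Python 3 sketch of what the original script seems to be aiming for: read every URL from the list and save each page to its own text file. The helper names url_to_filename and save_pages are made up for this example, and the file-naming rule is just one assumption (a raw URL contains characters such as ':' and '/' that are not allowed in file names).

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def url_to_filename(url):
    """Build a file name from a URL (hypothetical helper, naming rule is an assumption)."""
    parsed = urlparse(url)
    name = (parsed.netloc + parsed.path).strip("/")
    # Replace anything that is not a letter, digit, '-', '_' or '.' with '_'.
    safe = "".join(c if c.isalnum() or c in "-_." else "_" for c in name)
    return (safe or "page") + ".txt"


def save_pages(list_path, out_dir="."):
    """Download every URL listed in list_path and save the raw bytes in out_dir."""
    out_dir = Path(out_dir)
    saved = []
    with open(list_path, "r", encoding="utf-8") as url_file:
        for line in url_file:
            url = line.strip()
            if not url:
                continue  # skip blank lines
            with urllib.request.urlopen(url) as response:
                data = response.read()  # bytes, exactly as the server sent them
            target = out_dir / url_to_filename(url)
            target.write_bytes(data)
            saved.append(target)
    return saved
```

Calling save_pages(file_path) with the raw-string path from above would then write one .txt file per URL into the current directory. There is no error handling here, so a single unreachable URL stops the whole run.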
#3
(Jun-03-2024, 03:56 PM)deanhystad Wrote: Backslashes in a string can be problematic because they might be confused as the start of an escape sequence. You'll have a problem with \U which is the start of a unicode escape sequence. Use a raw string to prevent this.

Thank you.
#4
Do you only want the first line from your text?

with open(urltext, 'r') as infile:
    text = infile.readline()  # or readlines()?
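To make the difference concrete, here is a minimal sketch comparing readline(), readlines() and plain iteration. The temporary file and its two URLs only stand in for the real list.

```python
import os
import tempfile

# Hypothetical sample file standing in for the real URL list.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("https://python-forum.io/\nhttps://books.toscrape.com/\n")
    sample_path = tmp.name

with open(sample_path, "r", encoding="utf-8") as infile:
    first_line = infile.readline()   # only the first line, trailing newline included

with open(sample_path, "r", encoding="utf-8") as infile:
    all_lines = infile.readlines()   # list of all lines, newlines included

with open(sample_path, "r", encoding="utf-8") as infile:
    # Iterating over the file yields one line at a time; strip() drops the newline.
    urls = [line.strip() for line in infile if line.strip()]

os.remove(sample_path)
print(urls)
```

So readline() on its own only ever sees the first URL; looping over the file is what walks through the whole list.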

Lately, I've been looking at re. You can do what you want very simply using re, like this:

import re

urltext = '/home/pedro/tmp/some_urls.txt' # urls mixed with text
with open(urltext, 'r') as infile:
    text = infile.read()

# the thing about urls, they can't/shouldn't contain spaces
# spaces cause problems
# \S finds anything that is not whitespace
e = re.compile(r'(https?://\S+)') # https? finds http or https
res = e.findall(text)
for r in res:
    print(r)
You can change the regular expression to specialise it for certain words.
(The URLs come from a previous question about getting URLs from a webpage.)

Output:
https://tree-diffusion.github.io/
https://github.com/Nike-Inc/koheesio
https://phys.org/news/2024-05-glimpses-volcanic-world-telescope-images.html
https://chipsandcheese.com/2024/06/03/intels-lion-cove-architecture-preview/
https://www.anandtech.com/show/21425/intel-lunar-lake-architecture-deep-dive-lion-cove-xe2-and-npu4
https://www.merriam-webster.com/wordplay/top-10-rare-and-amusing-insults-vol-2
https://asteriskmag.com/issues/06/how-to-make-a-great-government-website
https://www.ycombinator.com/blog/why-yc-went-to-dc/
https://toaster.llc/photon/
https://samcurry.net/hacking-millions-of-modems
https://arxiv.org/abs/2405.20233
https://physics.stackexchange.com/questions/816698/how-many-photons-are-received-per-bit-transmitted-from-voyager-1
https://www.belfercenter.org/publication/seeing-data-structure
https://kk.org/thetechnium/files/2023/12/howtowalkandtalk.pdf
https://danlark.org/2020/06/14/128-bit-division/
https://rootsofprogress.org/robert-allen-british-industrial-revolution
https://www.youtube.com/watch?v=EKWGGDXe5MA
https://github.com/fiddyschmitt/File-Tunnel
https://spectrum.ieee.org/geothermal-energy-gyrotron-quaise
https://technicalwriting.dev/a11y/skip.html
https://tridao.me/blog/2024/mamba2-part1-model/
https://careers.snapmagic.com/o/technical-project-manager
https://zompist.com/yingzi/yingzi.htm
https://github.com/Dicklesworthstone/grassman_article
https://spritely.institute/news/cirkoban-sokoban-meets-cellular-automata-written-in-scheme.html
https://awesomekling.substack.com/p/forking-ladybird-and-stepping-down-serenityos
https://bgr.com/science/new-theory-suggests-time-is-an-illusion-created-by-quantum-entanglement/
https://github.com/Jana-Marie/ligra
#5
To give some advice: the code you have is for Python 2.
Also, do not use urllib; use Requests and BeautifulSoup.
Example:
myurls.txt:
https://python-forum.io/
https://books.toscrape.com/

import requests
from bs4 import BeautifulSoup

file_path = r'G:\div_code\myurls.txt'
with open(file_path) as fp:
    for url in fp:
        url = url.strip()
        print(url)
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        print(soup.find(['title', 'a']))
Output:
https://python-forum.io/
<title>Python Forum</title>
https://books.toscrape.com/
<title> All products | Books to Scrape - Sandbox </title>
This gets the title tag of each URL; to get all the HTML, just save the soup object.
Saving all the HTML of a webpage is usually not useful, as it contains a lot of garbage you don't need.
For the basics of web scraping, have a look at Web-Scraping part-1.