Please Help With Syntax - New To Python 3

jgbarber65 · Jun-03-2024, 03:07 PM

I am new to Python 3, old to BASIC and VB. I have a text list of URLs and I want to increment through the list and simply save the HTML of each URL as a text file and NOT open the html in a browser.

I wrote this script at work and will test it when I get home, but I wanted to have the script syntax checked before that.

Thank you in advance.

import urllib2
file_path = 'C:\Users\<username>\Desktop\myurls.txt'
with open(file_path) as file:
	myurl = file.readline()
    urllib.urlretrieve(myurl, myurl & ".txt")

**deanhystad** · Jun-03-2024, 03:56 PM

Backslashes in a string can be problematic because they might be confused as the start of an escape sequence. You'll have a problem with \U which is the start of a unicode escape sequence. Use a raw string to prevent this.

file_path = r'C:\Users\<username>\Desktop\myurls.txt'
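As a small illustration of the point above, this sketch spells the same Windows path three ways and checks that the unprefixed literal is rejected by Python 3. The variable names are made up for the demo, and `<username>` is kept as the placeholder from the original post.

```python
# Three spellings of the same Windows path (names are illustrative only).
raw_path = r'C:\Users\<username>\Desktop\myurls.txt'         # raw string
escaped_path = 'C:\\Users\\<username>\\Desktop\\myurls.txt'  # doubled backslashes
forward_path = 'C:/Users/<username>/Desktop/myurls.txt'      # forward slashes

# The raw and the escaped spelling produce the identical string.
print(raw_path == escaped_path)

# Source text containing the unprefixed literal. It is compiled here instead
# of being written directly, so that this demo itself stays valid Python.
bad_source = r"file_path = 'C:\Users'"
try:
    compile(bad_source, '<demo>', 'exec')
    literal_compiles = True
except SyntaxError:
    # \U must be followed by eight hex digits, and "sers" is not hex.
    literal_compiles = False
print(literal_compiles)
```

Forward slashes are accepted by Windows file APIs as well, so any of the three working spellings can be passed to open().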

When you open a file you should specify if you want to read, write or append the file.

There is no module named urllib2 in Python 3. It was split into urllib.request and urllib.error; see the note at the top of the urllib2 documentation.

https://docs.python.org/2/library/urllib2.html
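For reference, a minimal Python 3 sketch of what the original script seems to be aiming at, using only the standard library. The helper names (`url_to_filename`, `fetch_html`, `save_pages`) and the output folder are assumptions made for this example, not anything from the posts above. Two things differ from the BASIC habit: strings are joined with `+` rather than `&`, and a URL cannot be used directly as a file name because of characters such as `:` and `/`.

```python
import re
import urllib.request
from pathlib import Path


def url_to_filename(url):
    """Turn a URL into a file name that is safe on Windows (hypothetical helper)."""
    safe = re.sub(r'[^A-Za-z0-9._-]+', '_', url).strip('_')
    return safe + '.txt'


def fetch_html(url):
    """Download a page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def save_pages(list_path, out_dir, fetch=fetch_html):
    """Save the HTML of every URL listed in list_path, one file per URL."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    with open(list_path, 'r', encoding='utf-8') as url_file:
        for line in url_file:          # loop over all lines, not just the first
            url = line.strip()         # drop the trailing newline
            if not url:
                continue               # skip blank lines
            target = out_dir / url_to_filename(url)
            target.write_bytes(fetch(url))
            saved.append(target)
    return saved
```

It could be called as `save_pages(r'C:\Users\<username>\Desktop\myurls.txt', 'pages')`. The `fetch` parameter exists only so the function can be tried without a network connection.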

jgbarber65 · Jun-03-2024, 04:37 PM

Thank you.

Pedroski55 · Jun-04-2024, 03:47 PM

Do you only want the first line from your text?

with open(urltext, 'r') as infile:
    text = infile.readline() # or readlines ??
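To make the readline/readlines question concrete, here is a small sketch using an in-memory file so it runs anywhere. The two sample URLs are borrowed from later in this thread; `io.StringIO` stands in for the real text file.

```python
import io

sample = 'https://python-forum.io/\nhttps://books.toscrape.com/\n'

# readline() returns only the first line, newline included.
first_line = io.StringIO(sample).readline()

# readlines() returns a list with every line, newlines included.
all_lines = io.StringIO(sample).readlines()

# Iterating over the file object visits each line in turn;
# strip() removes the trailing newline.
stripped = [line.strip() for line in io.StringIO(sample)]

print(repr(first_line))
print(all_lines)
print(stripped)
```

For a list of URLs, looping over the file object is usually the most convenient of the three.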

Lately, I've been looking at re. You can do what you want very simply using re, like this:

import re

urltext = '/home/pedro/tmp/some_urls.txt' # urls mixed with text
with open(urltext, 'r') as infile:
    text = infile.read()

# the thing about urls, they can't/shouldn't contain spaces
# spaces cause problems
# \S finds anything that is not whitespace
e = re.compile(r'(https?://\S+)') # https? finds http or https
res = e.findall(text)
for r in res:
    print(r)

And you can change the regular expression to specialise it for certain words.
(The URLs come from a previous question about getting URLs from a webpage.)

Output:
https://tree-diffusion.github.io/
https://github.com/Nike-Inc/koheesio
https://phys.org/news/2024-05-glimpses-volcanic-world-telescope-images.html
https://chipsandcheese.com/2024/06/03/intels-lion-cove-architecture-preview/
https://www.anandtech.com/show/21425/intel-lunar-lake-architecture-deep-dive-lion-cove-xe2-and-npu4
https://www.merriam-webster.com/wordplay/top-10-rare-and-amusing-insults-vol-2
https://asteriskmag.com/issues/06/how-to-make-a-great-government-website
https://www.ycombinator.com/blog/why-yc-went-to-dc/
https://toaster.llc/photon/
https://samcurry.net/hacking-millions-of-modems
https://arxiv.org/abs/2405.20233
https://physics.stackexchange.com/questions/816698/how-many-photons-are-received-per-bit-transmitted-from-voyager-1
https://www.belfercenter.org/publication/seeing-data-structure
https://kk.org/thetechnium/files/2023/12/howtowalkandtalk.pdf
https://danlark.org/2020/06/14/128-bit-division/
https://rootsofprogress.org/robert-allen-british-industrial-revolution
https://www.youtube.com/watch?v=EKWGGDXe5MA
https://github.com/fiddyschmitt/File-Tunnel
https://spectrum.ieee.org/geothermal-energy-gyrotron-quaise
https://technicalwriting.dev/a11y/skip.html
https://tridao.me/blog/2024/mamba2-part1-model/
https://careers.snapmagic.com/o/technical-project-manager
https://zompist.com/yingzi/yingzi.htm
https://github.com/Dicklesworthstone/grassman_article
https://spritely.institute/news/cirkoban-sokoban-meets-cellular-automata-written-in-scheme.html
https://awesomekling.substack.com/p/forking-ladybird-and-stepping-down-serenityos
https://bgr.com/science/new-theory-suggests-time-is-an-illusion-created-by-quantum-entanglement/
https://github.com/Jana-Marie/ligra
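One way the expression might be specialised, as a sketch: keep only the GitHub links. The URLs are taken from the output above, while the labels in front of them are invented sample text for the demo.

```python
import re

# Sample text: URLs from the list above, labels invented for the demo.
text = """
Tree diffusion: https://tree-diffusion.github.io/
Koheesio: https://github.com/Nike-Inc/koheesio
Paper: https://arxiv.org/abs/2405.20233
Tunnel: https://github.com/fiddyschmitt/File-Tunnel
"""

# Same idea as before, but the host is fixed to github.com.
# The dot is escaped so it matches a literal dot only.
github_only = re.compile(r'https?://github\.com/\S+')
matches = github_only.findall(text)
for url in matches:
    print(url)
```

The tree-diffusion.github.io link is skipped because the pattern requires `github.com/` directly after the scheme.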

***snippsat*** · (This post was last modified: Jun-04-2024, 07:00 PM by snippsat.)

Some advice: the code you have is for Python 2.
Also, don't use urllib; use Requests and BeautifulSoup.
Example:
myurls.txt:

https://python-forum.io/
https://books.toscrape.com/

import requests
from bs4 import BeautifulSoup

file_path = r'G:\div_code\myurls.txt'
with open(file_path) as fp:
    for url in fp:
        url = url.strip()
        print(url)
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        print(soup.find(['title', 'a']))

Output:
https://python-forum.io/
<title>Python Forum</title>
https://books.toscrape.com/
<title>
    All products | Books to Scrape - Sandbox
</title>

That gets the title tag of each URL; to get all the HTML, just save the soup object.
It's not really useful to keep all the HTML of a web page, as it contains a lot of garbage you don't need.
For the basics of web scraping, have a look at Web-Scraping part-1.
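If the whole page is wanted after all, saving it might look roughly like the sketch below. `save_html` and its `session` parameter are hypothetical names for this example. It writes `response.text`; writing `str(soup)` instead would save the parsed version of the same page.

```python
from pathlib import Path


def save_html(url, out_path, session=None):
    """Fetch url and write the page source to out_path as UTF-8 text."""
    if session is None:
        # Requests is third-party (pip install requests); it is imported
        # here so the helper can be tried offline with a stand-in session.
        import requests
        session = requests.Session()
    response = session.get(url, timeout=10)
    response.raise_for_status()   # stop on 4xx/5xx instead of saving an error page
    Path(out_path).write_text(response.text, encoding='utf-8')
    return out_path
```

For example, `save_html('https://python-forum.io/', 'python_forum.txt')` would write the page source next to the script.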
