Posts: 4
Threads: 1
Joined: Jun 2019
Hi,
I have this script:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://en.wikipedia.org/wiki/Python')
bs = BeautifulSoup(html, "html.parser")
titles = bs.find_all(['title', 'h1', 'h2','h3','h4','h5','h6','p','img','alt'])
print('List all the header tags :', *titles, sep='\n\n')
This script opens one URL, but I want to open 2, 3, or 4 URLs,
or open all the URLs from the first 100 results of a Google search.
How do I make these changes?
Thanks
Max
Posts: 360
Threads: 5
Joined: Jun 2019
urls = ['url1', 'url2', 'url3', 'etc']
for url in urls:
    html = urlopen(url)
    # your code ...
Posts: 4
Threads: 1
Joined: Jun 2019
Jun-20-2019, 12:52 PM
(This post was last modified: Jun-20-2019, 12:52 PM by MaxwellCosta.)
I wrote:
from urllib.request import urlopen
from bs4 import BeautifulSoup
urls = ['https://en.wikipedia.org/wiki/Python', 'https://en.wikipedia.org/wiki/JavaScript']
for url in urls:
    html = urlopen(url)
    bs = BeautifulSoup(html, "html.parser")
    titles = bs.find_all(['title', 'h1', 'h2','h3','h4','h5','h6','p','img','alt'])
    print('List all the header tags :', *titles, sep='\n\n')
But this error is generated:
TypeError: POST data should be bytes, an iterable of bytes, or a file object. It cannot be of type str.
How do I write the code?
Thanks for your help
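For context: urllib raises this exact TypeError when the data argument passed to urlopen is a str instead of bytes, so the script that actually produced the traceback most likely passed a string as POST data somewhere; the loop as posted above should not raise it. A minimal sketch that reproduces the error and the fix, with httpbin.org used only as a stand-in endpoint:
from urllib.request import urlopen
from urllib.parse import urlencode

# Passing a str as POST data raises:
#   TypeError: POST data should be bytes, an iterable of bytes,
#   or a file object. It cannot be of type str.
# urlopen('https://httpbin.org/post', data='key=value')

# Encoding the payload to bytes avoids the error:
data = urlencode({'key': 'value'}).encode('utf-8')
with urlopen('https://httpbin.org/post', data=data) as resp:
    print(resp.status)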
Posts: 7,315
Threads: 123
Joined: Sep 2016
Like this.
from urllib.request import urlopen
from bs4 import BeautifulSoup
urls = ['https://en.wikipedia.org/wiki/Python', 'https://en.wikipedia.org/wiki/JavaScript']
for url in urls:
    bs = BeautifulSoup(urlopen(url), "html.parser")
    titles = bs.find_all(['title', 'h1', 'h2','h3','h4','h5','h6','p','img','alt'])
    print('List all the header tags :', *titles, sep='\n\n')
The output can get messy quickly with this many tags; maybe you have a plan for handling it.
My advice is to use Requests, with lxml (which is faster) as the parser:
import requests
from bs4 import BeautifulSoup
urls = ['https://en.wikipedia.org/wiki/Python', 'https://en.wikipedia.org/wiki/JavaScript']
for url in urls:
    bs = BeautifulSoup(requests.get(url).content, "lxml")
    titles = bs.find_all(['title', 'h1', 'h2','h3','h4','h5','h6','p','img','alt'])
    print('List all the header tags :', *titles, sep='\n\n')
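If the goal is a readable per-page summary rather than a raw dump of tag markup, one option (a sketch, not from the thread; the tag list is trimmed to headings purely for illustration) is to print each tag's name and text, grouped per URL. As an aside, 'alt' is an HTML attribute, not a tag, so it never matches anything in find_all.
import requests
from bs4 import BeautifulSoup

urls = ['https://en.wikipedia.org/wiki/Python', 'https://en.wikipedia.org/wiki/JavaScript']
for url in urls:
    bs = BeautifulSoup(requests.get(url).content, "lxml")
    print(f'--- {url} ---')
    # Print "tag: text" for each heading instead of dumping raw tag markup
    for tag in bs.find_all(['title', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
        print(f'{tag.name}: {tag.get_text(strip=True)}')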
Posts: 4
Threads: 1
Joined: Jun 2019
Jun-20-2019, 01:44 PM
(This post was last modified: Jun-20-2019, 01:54 PM by MaxwellCosta.)
Thanks for your help.
I will share the plan with you later, or in another topic, as I have to leave. Thank you very much for your help.
Posts: 4
Threads: 1
Joined: Jun 2019
Hi,
Excuse me, but if I may, I'd like to come back to the subject.
For the script to work, I need to add the URLs one by one.
(If I have 50 URLs, I must add all 50 by hand.)
For example:
urls = ['https://fr.wikipedia.org/wiki/JavaScript', 'https://fr.wikipedia.org/wiki/Python_(langage)']
Could we make the script work from a Google search request instead, such as:
www.google.fr/search?q=wikipedia+javascript+python
Thank you for your help.
Maxwell
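A closing note on the Google part of the question: Google actively blocks automated scraping of its result pages, so fetching www.google.fr/search?q=... with urlopen or requests will usually return a consent page or CAPTCHA rather than real results; the official Custom Search API or a dedicated search package is the usual route. The underlying idea of building the urls list programmatically instead of typing 50 entries by hand can still be sketched with an ordinary page as the link source (the seed URL and the 5-page cap below are chosen purely for illustration):
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Seed page used as a link source; any page with links works.
seed = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
bs = BeautifulSoup(requests.get(seed).content, 'lxml')

# Build the urls list programmatically instead of typing entries by hand.
urls = []
for a in bs.find_all('a', href=True):
    href = urljoin(seed, a['href'])  # resolve relative links against the seed
    if href.startswith('http') and href not in urls:
        urls.append(href)

# Same loop as before, capped at 5 pages for the demo.
for url in urls[:5]:
    bs = BeautifulSoup(requests.get(url).content, 'lxml')
    title = bs.find('title')
    print(url, '->', title.get_text(strip=True) if title else 'no title')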