Python Forum

my question here

import urllib
import re
urls=[]
i=0
regex='<title>(.+?)</title>'
pattern=re.compile(regex)
while i<len(urls):
   htmlfile=urllib.urlopen(urls[i])
   a=htmlfile.read()
   titles=re.findall(pattern,a)
   print titles
   i=i+1

Hi ekansh,

I cannot see exactly what you are asking.

If you are trying to run this script under Python 3, it will fail on the print statement, because Python 3 requires parentheses around whatever you ask it to print, e.g. print(titles).

But this is a wild guess as I am unsure of the exact question that you would like answered.

Kindly let us know what you would like us to look at.

Good Luck,

Bass

Also note that the urllib package changed between Python 2.x and 3.x. I think you would need to use urllib.request.urlopen().
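
As a rough sketch of that change: in Python 3 the function is reachable as urllib.request.urlopen, and the data it returns from read() is bytes rather than text. The decode_body helper and the UTF-8 default below are assumptions made for illustration, not something from the original script.

```python
try:
    # Python 3: urlopen lives in the urllib.request submodule
    from urllib.request import urlopen
except ImportError:
    # Python 2: urlopen was a top-level function of urllib
    from urllib import urlopen


def decode_body(raw, encoding="utf-8"):
    """Turn the bytes returned by response.read() into text.

    Hypothetical helper; UTF-8 is an assumed default, and undecodable
    bytes are replaced instead of raising an error.
    """
    return raw.decode(encoding, errors="replace")


# No network access is needed to see what decoding does:
print(decode_body(b"<title>Example</title>"))
```

With a real page you would pass urlopen(url).read() to decode_body before handing the text to re.findall.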

2to3 comes with Python.

C:\python36\Tools\scripts
λ 2to3 -w url_con.py

After conversion, with PEP 8 fixes as well:

import urllib.request, urllib.parse, urllib.error
import re

urls = []
i = 0
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)
while i < len(urls):
   htmlfile = urllib.request.urlopen(urls[i])
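   # read() returns bytes in Python 3; decode (UTF-8 assumed) so the str pattern can match
   a = htmlfile.read().decode('utf-8', errors='replace')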
   titles = re.findall(pattern, a)
   print(titles)
   i = i + 1

Now to the bad part: using regex with HTML.

Funny answer.

Better, take a look at Web-Scraping-part-1.

from bs4 import BeautifulSoup
import requests

urls = ['https://www.python.org/',
        'https://python-forum.io/',
        'http://cnn.com/']
for url in urls:
   url_get = requests.get(url)
   soup = BeautifulSoup(url_get.content, 'lxml')
   print(soup.select('head > title')[0].text)

Output:
Welcome to Python.org
Python Forum
CNN - Breaking News, U.S., World, Weather, Entertainment & Video News
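
If installing BeautifulSoup is not an option, the standard library's html.parser can also pull out the title without regex. This is only a minimal sketch, not the approach from the tutorial: the TitleParser class, the get_title helper, and the sample HTML are made up for illustration.

```python
import re
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def get_title(html):
    """Return the stripped title text, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts).strip()


# Made-up sample: the title tag is uppercase, has an attribute, and spans lines.
sample = '<html><head><TITLE lang="en">\n  Python Forum\n</TITLE></head></html>'

print(re.findall('<title>(.+?)</title>', sample))  # the thread's regex finds nothing here
print(get_title(sample))
```

The regex misses this title because of the uppercase tag, the attribute, and the line breaks, while the parser handles all three.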

(Jul-17-2017, 07:41 PM)snippsat Wrote: 2to3 comes with Python.

What was the point of simply quoting @snippsat's entire post?

(Jul-17-2017, 05:35 PM)ekansh Wrote: my question here

It's a syntax error. If you run it, Python will tell you what the problem is.
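
To see the message without editing the script, you can hand the offending line to compile() as a string. A small sketch, assuming you run it under Python 3; the file name url_con.py is only used as a label in the error, and the exact wording of the message depends on the Python version.

```python
import sys

# The Python 2 line from the question, kept as a string so this file stays valid.
py2_source = "print titles\n"

try:
    compile(py2_source, "url_con.py", "exec")
except SyntaxError as err:
    # err.lineno and err.msg say where the problem is and what it is
    error_message = "line {}: {}".format(err.lineno, err.msg)
else:
    error_message = None  # only reached on Python 2, where the line is valid

print(sys.version_info.major, error_message)
```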

I think you should try Selenium. It is better suited to pages that build their content with JavaScript, because it drives a real browser.

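from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get("https://www.python.org/")
# find_element_by_id() was removed in Selenium 4; use find_element(By.ID, ...)
nav = browser.find_element(By.ID, "mainnav")
print(nav.text)
browser.quit()  # close the browser when done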

Check these examples: https://likegeeks.com/python-web-scraping/

ekansh

Bass

ichabod801

snippsat

ekansh

sparkz_alot

nilamo

seco