Python Forum
Looping with Beautifulsoup - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: Looping with Beautifulsoup (/thread-15564.html)



Looping with Beautifulsoup - CaptainCsaba - Jan-22-2019

Hi!

This is my very first Python script, so I will probably make some noob mistakes; I apologise for that in advance.
I have a txt file full of URLs, one per line. I created a script which basically goes to one of these sites, gets the 7 things we need, and puts them into an Excel-readable CSV file. It works perfectly if I remove the looping and use, for example:

my_url = 'https://www.londonstockexchange.com/exchange/prices-and-markets/stocks/summary/company-summary/GB00BH4HKS39GBGBXSET1.html?lang=en'
However, if I create a loop for it (going into the "input.txt" file, reading a line, scraping it, writing the result to the CSV, and repeating until it reaches the end of the list), then simply nothing gets created.

Where does the loop go wrong?

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

#counts the number of lines in input.txt (how many times the loop has to run)
filename = "input.txt"
myfile= open(filename)
linescounted = len(myfile.readlines())
numberoflines = linescounted + 1

#creates the output excel file
filename = "products.csv"
f= open(filename, "w")
headers = "name, special_cond, sector, subsector, index, marketcond, isin\n"
f.write(headers)

urlcount = 0
while urlcount < numberoflines:
	my_url = myfile.readline()

#my_url = 'https://www.londonstockexchange.com/exchange/prices-and-markets/stocks/summary/company-summary/GB00BH4HKS39GBGBXSET1.html?lang=en'
# Opens up the connection and grabs the page
	uClient = uReq(my_url)
# Offloads content into a variable
	page_html = uClient.read()
# Closes the client
	uClient.close()
# html parsing
	page_soup = soup(page_html, "html.parser")

	
	name = page_soup.find('h1', attrs={'class','tesummary'}).text.replace("\n", "") 
	spec = page_soup.find('div', attrs={'class':'commonTable table-responsive'}).find('tr', attrs={'class':'even'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text.replace("\r", "")
	sect = page_soup.find('div', attrs={'id':'pi-colonna2'}).find('div', attrs={'class':'table-responsive'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text
	subsect = page_soup.find('div', attrs={'id':'pi-colonna2'}).find('div', attrs={'class':'table-responsive'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text
	index = page_soup.find('div', attrs={'id':'pi-colonna2'}).find('div', attrs={'class':'table-responsive'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text
	mainmarket = page_soup.find('div', attrs={'id':'pi-colonna2'}).find('div', attrs={'class':'table-responsive'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text
	isin = page_soup.find('div', attrs={'id':'pi-colonna2'}).find('div', attrs={'class':'table-responsive'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text


	f.write(name.replace("\r", " ") + "," + spec.replace("\n", "") + "," + sect + "," + subsect + "," + index.replace(",", "|") + "," + mainmarket + "," + isin + "\n"
	
	
	numberoflines = numberoflines + 1


f.close()

	



RE: Looping with Beautifulsoup - ichabod801 - Jan-22-2019

You can only go through a file object once. You do that on line 7 when you use the readlines method. So then on line 18 you read nothing, because you've already read the file.

I would suggest getting rid of urlcount and numberoflines, and looping over the file directly.

with open('input.txt') as url_file:
    for my_url in url_file:
        uClient = uReq(my_url.strip())
        ...
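
To see the problem in isolation, here is a minimal sketch (using a hypothetical throwaway file) of how readlines() exhausts the file object, so that a later readline() returns an empty string:

# demo.txt is a hypothetical two-line file created just for this sketch
with open('demo.txt', 'w') as f:
    f.write('first line\nsecond line\n')

myfile = open('demo.txt')
print(len(myfile.readlines()))   # 2 -- readlines() consumes the whole file
print(repr(myfile.readline()))   # '' -- the file position is already at the end
myfile.seek(0)                   # rewinding makes the file readable again
print(repr(myfile.readline()))   # 'first line\n'
myfile.close()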



RE: Looping with Beautifulsoup - CaptainCsaba - Jan-23-2019

(Jan-22-2019, 04:20 PM)ichabod801 Wrote: You can only go through a file object once. You do that on line 7 when you use the readlines method. So then on line 18 you read nothing, because you've already read the file.

I would suggest getting rid of urlcount and numberoflines, and looping over the file directly.

with open('input.txt') as url_file:
    for my_url in url_file:
        uClient = uReq(my_url.strip())
        ...

The idea is great, it's much shorter and easier this way. However, I get a syntax error at the "as" in while open('input.txt') as url_file:, which I don't really understand. I am using Python 3.7.


RE: Looping with Beautifulsoup - buran - Jan-23-2019

(Jan-23-2019, 07:17 AM)CaptainCsaba Wrote: The idea is great, it's much shorter and easier this way. However, I get a syntax error at the "as" in while open('input.txt') as url_file:, which I don't really understand. I am using Python 3.7.
Post your actual code that produces the error. ichabod801's code is fine, so the problem is in how you changed your code.


RE: Looping with Beautifulsoup - CaptainCsaba - Jan-23-2019

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

filename = "products.csv"
f= open(filename, "w")
headers = "name, special_cond, sector, subsector, index, marketcond, isin\n"
f.write(headers)

while open('input.txt') as url_file:
	for my_url in url_file:
		uClient = uReq(my_url.strip())
		page_html = uClient.read()
		uClient.close()
		page_soup = soup(page_html, "html.parser")

	
		name = page_soup.find('h1', attrs={'class','tesummary'}).text.replace("\n", "") 
		spec = page_soup.find('div', attrs={'class':'commonTable table-responsive'}).find('tr', attrs={'class':'even'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text.replace("\r", "")
		sect = page_soup.find('div', attrs={'id':'pi-colonna2'}).find('div', attrs={'class':'table-responsive'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text
		subsect = page_soup.find('div', attrs={'id':'pi-colonna2'}).find('div', attrs={'class':'table-responsive'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text
		index = page_soup.find('div', attrs={'id':'pi-colonna2'}).find('div', attrs={'class':'table-responsive'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text
		mainmarket = page_soup.find('div', attrs={'id':'pi-colonna2'}).find('div', attrs={'class':'table-responsive'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text
		isin = page_soup.find('div', attrs={'id':'pi-colonna2'}).find('div', attrs={'class':'table-responsive'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text


		f.write(name.replace("\r", " ") + "," + spec.replace("\n", "") + "," + sect + "," + subsect + "," + index.replace(",", "|") + "," + mainmarket + "," + isin + "\n"
	
f.close()

	
This is the error:

>>> while open('input.txt') as url_file:
  File "<stdin>", line 1
    while open('input.txt') as url_file:
                             ^
SyntaxError: invalid syntax
>>> for my_url in url_file:
...
  File "<stdin>", line 2

    ^
IndentationError: expected an indented block
Also, for some reason it gives me the same error at the end:
... f.close()
  File "<stdin>", line 3
    f.close()
    ^
SyntaxError: invalid syntax



RE: Looping with Beautifulsoup - snippsat - Jan-23-2019

You also have a line that's 557 characters long.
Instead of chaining something like 20 find_next('td') calls, you can go straight to a value with CSS selectors.
BeautifulSoup supports this through select() and select_one().
Example:
from bs4 import BeautifulSoup
import requests

url = 'https://www.londonstockexchange.com/exchange/prices-and-markets/stocks/summary/company-summary/GB00BH4HKS39GBGBXSET1.html?lang=en'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
print(soup.select('div.commonTable.table-responsive table > tbody > tr:nth-of-type(4) > td:nth-of-type(2)')[0].text) 
Output:
148.80
You can get the selector by right-clicking in the browser and choosing Copy --> Copy selector.
Also note that the nth-child(4) that comes from the browser copy has to be renamed to nth-of-type(4) in BeautifulSoup.
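
Building on this, the other fields could be pulled the same way with select_one(), which returns the first match or None. This is only a sketch: the row and column positions in the selectors below are assumptions about the page layout and would need to be verified against the real markup in the browser.

from bs4 import BeautifulSoup
import requests

url = 'https://www.londonstockexchange.com/exchange/prices-and-markets/stocks/summary/company-summary/GB00BH4HKS39GBGBXSET1.html?lang=en'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')

# Hypothetical selectors -- the tr/td positions must be checked in the browser.
fields = {
    'name': 'h1.tesummary',
    'sector': 'div#pi-colonna2 div.table-responsive tr:nth-of-type(2) > td:nth-of-type(2)',
    'isin': 'div#pi-colonna2 div.table-responsive tr:nth-of-type(1) > td:nth-of-type(2)',
}
for label, selector in fields.items():
    tag = soup.select_one(selector)  # first match, or None if the selector misses
    print(label, ':', tag.text.strip() if tag else 'not found')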


RE: Looping with Beautifulsoup - buran - Jan-23-2019

Please read https://python-forum.io/Thread-How-to-Execute-python-code
It looks like you are confused about how to run your code. The second snippet (the so-called error) was executed in Python's interactive mode.
You must save the first block as a .py file and then run it.
As to the last snippet/error: there is a missing closing parenthesis at the end of the previous line (#26).


RE: Looping with Beautifulsoup - CaptainCsaba - Jan-23-2019

Never mind, I realised I wrote "while" instead of "with". Also, I checked the link and added the missing parenthesis at the end. Now everything works perfectly. I made quite a few noob mistakes; thank you for the patience and for the help!
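
For reference, here is roughly what the working skeleton looks like (the long parsing lines are elided as "...", same as in the code above):

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

f = open("products.csv", "w")
f.write("name, special_cond, sector, subsector, index, marketcond, isin\n")

with open('input.txt') as url_file:              # "with", not "while"
    for my_url in url_file:
        uClient = uReq(my_url.strip())
        page_html = uClient.read()
        uClient.close()
        page_soup = soup(page_html, "html.parser")
        # ... the same name/spec/sect/subsect/index/mainmarket/isin lines as above ...
        f.write(name.replace("\r", " ") + "," + spec.replace("\n", "") + "," + sect
                + "," + subsect + "," + index.replace(",", "|") + "," + mainmarket
                + "," + isin + "\n")             # note the closing parenthesis

f.close()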


RE: Looping with Beautifulsoup - buran - Jan-23-2019

If it comforts you, I also didn't notice the while/with error.