Python Forum
Looping with Beautifulsoup
#1
Hi!

This is my very first Python script, so I will probably make some noob mistakes; I apologise for that in advance.
I have a txt file full of URLs, one per line. I created a script which goes to each of these sites, gets the 7 things we need, and puts them into a CSV file. It works perfectly if I remove the looping and use, for example:

my_url = 'https://www.londonstockexchange.com/exchange/prices-and-markets/stocks/summary/company-summary/GB00BH4HKS39GBGBXSET1.html?lang=en'
However, if I create a loop for it (read a line from the "input.txt" file, scrape that URL, write the results to the CSV, and repeat until it reaches the end of the list), there is simply nothing that gets created.

Where does the loop go wrong?

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

#counts the number of lines in input.txt (how many times the loop has to run)
filename = "input.txt"
myfile = open(filename)
linescounted = len(myfile.readlines())
numberoflines = linescounted + 1

#creates the output excel file
filename = "products.csv"
f = open(filename, "w")
headers = "name, special_cond, sector, subsector, index, marketcond, isin\n"
f.write(headers)

urlcount = 0
while urlcount < numberoflines:
	my_url = myfile.readline()

#my_url = 'https://www.londonstockexchange.com/exchange/prices-and-markets/stocks/summary/company-summary/GB00BH4HKS39GBGBXSET1.html?lang=en'
# Opens up the connection and grabs the page
	uClient = uReq(my_url)
# Offloads content into a variable
	page_html = uClient.read()
# Closes the client
	uClient.close()
# html parsing
	page_soup = soup(page_html, "html.parser")

	
	name = page_soup.find('h1', attrs={'class','tesummary'}).text.replace("\n", "") 
	spec = page_soup.find('div', attrs={'class':'commonTable table-responsive'}).find('tr', attrs={'class':'even'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text.replace("\r", "")
	sect = page_soup.find('div', attrs={'id':'pi-colonna2'}).find('div', attrs={'class':'table-responsive'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text
	subsect = page_soup.find('div', attrs={'id':'pi-colonna2'}).find('div', attrs={'class':'table-responsive'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text
	index = page_soup.find('div', attrs={'id':'pi-colonna2'}).find('div', attrs={'class':'table-responsive'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text
	mainmarket = page_soup.find('div', attrs={'id':'pi-colonna2'}).find('div', attrs={'class':'table-responsive'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text
	isin = page_soup.find('div', attrs={'id':'pi-colonna2'}).find('div', attrs={'class':'table-responsive'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text


	f.write(name.replace("\r", " ") + "," + spec.replace("\n", "") + "," + sect + "," + subsect + "," + index.replace(",", "|") + "," + mainmarket + "," + isin + "\n"
	
	
	numberoflines = numberoflines + 1


f.close()

	
#2
You can only go through a file object once. You do that on line 7 when you use the readlines method. So then on line 18 you read nothing, because you've already read the file.
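A quick illustration of that exhaustion (the filename here is just an example):

myfile = open('input.txt')
lines = myfile.readlines()    # consumes the whole file
print(len(lines))             # however many lines the file has
print(myfile.readline())      # empty string: nothing left to read
myfile.close()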

I would suggest getting rid of urlcount and numberoflines, and looping over the file directly.

with open('input.txt') as url_file:
    for my_url in url_file:
        uClient = uReq(my_url.strip())
        ...
Craig "Ichabod" O'Brien - xenomind.com
I wish you happiness.
Recommended Tutorials: BBCode, functions, classes, text adventures
Reply
#3
(Jan-22-2019, 04:20 PM)ichabod801 Wrote: You can only go through a file object once. [...] I would suggest getting rid of urlcount and numberoflines, and looping over the file directly.

The idea is great; it's much shorter and easier this way. However, I get a syntax error at the "as" in while open('input.txt') as url_file:, which I don't really understand. I am using Python 3.7.
#4
(Jan-23-2019, 07:17 AM)CaptainCsaba Wrote: [...] I get a syntax error at the "as" in while open('input.txt') as url_file:, which I don't really understand.
Post your actual code that produces the error. Ichabod's code is fine, so the problem is in how you changed it.
#5
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

filename = "products.csv"
f = open(filename, "w")
headers = "name, special_cond, sector, subsector, index, marketcond, isin\n"
f.write(headers)

while open('input.txt') as url_file:
	for my_url in url_file:
		uClient = uReq(my_url.strip())
		page_html = uClient.read()
		uClient.close()
		page_soup = soup(page_html, "html.parser")

	
		name = page_soup.find('h1', attrs={'class','tesummary'}).text.replace("\n", "") 
		spec = page_soup.find('div', attrs={'class':'commonTable table-responsive'}).find('tr', attrs={'class':'even'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text.replace("\r", "")
		sect = page_soup.find('div', attrs={'id':'pi-colonna2'}).find('div', attrs={'class':'table-responsive'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text
		subsect = page_soup.find('div', attrs={'id':'pi-colonna2'}).find('div', attrs={'class':'table-responsive'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text
		index = page_soup.find('div', attrs={'id':'pi-colonna2'}).find('div', attrs={'class':'table-responsive'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text
		mainmarket = page_soup.find('div', attrs={'id':'pi-colonna2'}).find('div', attrs={'class':'table-responsive'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text
		isin = page_soup.find('div', attrs={'id':'pi-colonna2'}).find('div', attrs={'class':'table-responsive'}).find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').find_next('td').text


		f.write(name.replace("\r", " ") + "," + spec.replace("\n", "") + "," + sect + "," + subsect + "," + index.replace(",", "|") + "," + mainmarket + "," + isin + "\n"
	
f.close()

	
This is the error:

>>> while open('input.txt') as url_file:
  File "<stdin>", line 1
    while open('input.txt') as url_file:
                             ^
SyntaxError: invalid syntax
>>> for my_url in url_file:
...
  File "<stdin>", line 2

    ^
IndentationError: expected an indented block
Also, for some reason it gives me the same error at the end:
... f.close()
  File "<stdin>", line 3
    f.close()
    ^
SyntaxError: invalid syntax
#6
You also have a line that's 557 characters long.
Instead of chaining something like 20 find_next('td') calls, you can go straight to a value with CSS selectors.
BeautifulSoup supports this through select() and select_one().
Example:
from bs4 import BeautifulSoup
import requests

url = 'https://www.londonstockexchange.com/exchange/prices-and-markets/stocks/summary/company-summary/GB00BH4HKS39GBGBXSET1.html?lang=en'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
print(soup.select('div.commonTable.table-responsive table > tbody > tr:nth-of-type(4) > td:nth-of-type(2)')[0].text) 
Output:
148.80
You can get the selector in the browser by right-clicking the element and choosing Copy --> Copy selector.
Also, the nth-child(4) that comes from the browser copy has to be renamed to nth-of-type(4) in BeautifulSoup.
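As a side note, select_one() returns the first match directly (or None if nothing matches), so the [0] indexing isn't needed; reusing the selector from the example above:

cell = soup.select_one('div.commonTable.table-responsive table > tbody > tr:nth-of-type(4) > td:nth-of-type(2)')
if cell is not None:
    print(cell.text)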
#7
Please read https://python-forum.io/Thread-How-to-Ex...ython-code
It looks like you are confused about how to run your code. The second snippet (the so-called error) was executed in Python interactive mode.
You must save the first block as a .py file and then run it.
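For example, assuming the script is saved as scraper.py (the name is arbitrary), run it from the command line:

python scraper.py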
As to the last snippet/error -> there is a missing closing parenthesis at the end of the previous line (#26).
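That is, the f.write(...) call on that line needs its closing parenthesis:

f.write(name.replace("\r", " ") + "," + spec.replace("\n", "") + "," + sect + "," + subsect + "," + index.replace(",", "|") + "," + mainmarket + "," + isin + "\n")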
#8
Never mind, I realised I wrote "while" instead of "with". I also checked the link and added the parenthesis at the end. Now everything works perfectly. I made quite a few noob mistakes; thank you for the patience and the help!
#9
If it comforts you, I also didn't notice the while/with error.