Jan-22-2019, 11:55 AM
(This post was last modified: Jan-22-2019, 11:56 AM by CaptainCsaba.)
Hi!
This is my very first Python script, so I will probably make some noob mistakes; I apologise for that.
So I have a txt file that is full of URLs, each on its own line. I created a script which basically goes to one of these sites, gets the 7 things we need, and puts them into an Excel file. It works perfectly if I use, for example (and remove the looping):
my_url = 'https://www.londonstockexchange.com/exchange/prices-and-markets/stocks/summary/company-summary/GB00BH4HKS39GBGBXSET1.html?lang=en'

However, if I make it a loop, meaning it goes into the "input.txt" file, reads a line, webscrapes it, puts the result into Excel, and repeats until it reaches the end of the list, then simply nothing gets created.
Where does the loop go wrong?
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

# counts the number of lines in input.txt (how many times the loop has to run)
filename = "input.txt"
myfile = open(filename)
linescounted = len(myfile.readlines())
numberoflines = linescounted + 1

# creates the output excel file
filename = "products.csv"
f = open(filename, "w")
headers = "name, special_cond, sector, subsector, index, marketcond, isin\n"
f.write(headers)

# returns the n-th <td> after a starting tag (replaces the original long
# chains of .find_next('td') calls; n matches each chain's length)
def nth_td(start, n):
    tag = start
    for _ in range(n):
        tag = tag.find_next('td')
    return tag

urlcount = 0
while urlcount < numberoflines:
    my_url = myfile.readline()
    #my_url = 'https://www.londonstockexchange.com/exchange/prices-and-markets/stocks/summary/company-summary/GB00BH4HKS39GBGBXSET1.html?lang=en'

    # Opens up the connection and grabs the page
    uClient = uReq(my_url)
    # Offloads content into a variable
    page_html = uClient.read()
    # Closes the client
    uClient.close()
    # html parsing
    page_soup = soup(page_html, "html.parser")

    name = page_soup.find('h1', attrs={'class': 'tesummary'}).text.replace("\n", "")
    spec_row = page_soup.find('div', attrs={'class': 'commonTable table-responsive'}).find('tr', attrs={'class': 'even'})
    spec = nth_td(spec_row, 20).text.replace("\r", "")
    col2 = page_soup.find('div', attrs={'id': 'pi-colonna2'}).find('div', attrs={'class': 'table-responsive'})
    sect = nth_td(col2, 6).text
    subsect = nth_td(col2, 8).text
    index = nth_td(col2, 14).text
    mainmarket = nth_td(col2, 16).text
    isin = nth_td(col2, 28).text

    f.write(name.replace("\r", " ") + "," + spec.replace("\n", "") + "," + sect + "," + subsect + "," + index.replace(",", "|") + "," + mainmarket + "," + isin + "\n")
    numberoflines = numberoflines + 1
f.close()
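For reference, the overall shape I am trying to get is something like the sketch below: read one URL per loop iteration and write one CSV row for it. The scrape() stub and the build_csv() name are just placeholders I made up so the loop structure can be seen on its own; the real version would use urlopen and the BeautifulSoup code instead of the stub.

```python
import csv

def scrape(url):
    # placeholder: the real version would urlopen(url) and parse the page
    return [url, "sector", "subsector"]

def build_csv(input_path, output_path):
    with open(input_path) as urls, open(output_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["name", "sector", "subsector"])
        for line in urls:          # iterating the file yields each line exactly once
            url = line.strip()     # drop the trailing newline from readline
            if url:                # skip blank lines
                writer.writerow(scrape(url))
```

Iterating over the file object directly gives each line once and stops at the end of the file, so no separate line count or manual counter is needed.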