Posts: 6
Threads: 2
Joined: Jan 2021
Jan-24-2021, 03:40 AM
Hi All,
The below code works exactly how I want it to work for 'title' but just not working at all for 'address'.
path = "C:\\Users\\mpeter\\Downloads\\lksd\\"
titleList = []
for infile in glob.glob(os.path.join(path, "*.html")):
markup = (infile)
soup = BeautifulSoup(open(markup, "r").read(), 'lxml')
title = soup.find_all("title")
title = soup.title.string
titleList.append(title)
streetAddressList = []
for infile in glob.glob(os.path.join(path, "*.html")):
markup = (infile)
soup = BeautifulSoup(open(markup, "r").read(), 'lxml')
address = soup.find_all("address", class_={"styles_address__zrPvy"})
address = soup.address.string
streetAddressList.append(address)
with open('output2.csv', 'w') as myfile:
writer = csv.writer(myfile)
writer.writerows((titleList, streetAddressList)) Here is the HTML for the address element.
[<address class="styles_address__zrPvy"><svg class="styles_addressIcon__3Pu3L" height="42" viewbox="0 0 32 42" width="32" xmlns="http://www.w3.org/2000/svg"><path d="M14.381 41.153C2.462 23.873.25 22.1.25 15.75.25 7.051 7.301 0 16 0s15.75 7.051 15.75 15.75c0 6.35-2.212 8.124-14.131 25.403a1.97 1.97 0 01-3.238 0zM16 22.313a6.562 6.562 0 100-13.125 6.562 6.562 0 000 13.124z"></path></svg>Level 1 44 Market Street<!-- -->, <!-- -->Sydney</address>]
All I want is the title and address elements as plain strings. The address extraction works if I leave out the .string line, but then it returns the whole HTML tag. Please help.
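(Editorial aside: a minimal sketch, assuming BeautifulSoup is installed, of why .string comes back empty here. The .string attribute only yields a value when a tag has exactly one child; the address tag above contains an svg tag, comments, and several text nodes, so .string is None, while get_text() concatenates all the text inside the tag.)

```python
from bs4 import BeautifulSoup

# A trimmed-down copy of the address markup from the post above.
html = ('<address class="styles_address__zrPvy"><svg></svg>'
        'Level 1 44 Market Street<!-- -->, <!-- -->Sydney</address>')
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('address')

print(tag.string)      # None -- the tag has several children, not a single string
print(tag.get_text())  # all the text inside the tag, concatenated
```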
Posts: 8,167
Threads: 160
Joined: Sep 2016
Jan-24-2021, 06:35 AM
(This post was last modified: Jan-24-2021, 06:35 AM by buran.)
We are not able to run your code, but soup.find_all returns a list, and a list has no string attribute.
You need to iterate over the list and append the string attribute of every element, something like

streetAddressList = [item.string for item in address]

This replaces lines 12, 18, 19.
Posts: 6
Threads: 2
Joined: Jan 2021
(Jan-24-2021, 06:35 AM)buran Wrote: We are not able to run your code, but soup.find_all returns a list, and a list has no string attribute.
You need to iterate over the list and append the string attribute of every element, something like

streetAddressList = [item.string for item in address]

This replaces lines 12, 18, 19.
Hi Buran, thanks for your comment. It looks like a very simple solution, which I like. For some reason, though, it is not returning any data.
Here is the new code with your suggestion:
path = "C:\\Users\\mzoljan\\Downloads\\lksd\\"
titleList = []
for infile in glob.glob(os.path.join(path, "*.html")):
markup = (infile)
soup = BeautifulSoup(open(markup, "r").read(), 'lxml')
title = soup.find_all("title")
title = soup.title.string
titleList.append(title)
for infile in glob.glob(os.path.join(path, "*.html")):
markup = (infile)
soup = BeautifulSoup(open(markup, "r").read(), 'lxml')
address = soup.find_all("address", class_={"styles_address__zrPvy"})
streetAddressList = [item.string for item in address]
with open('output2.csv', 'w') as myfile:
writer = csv.writer(myfile)
writer.writerows((titleList, streetAddressList)) Am I missing something?
Posts: 8,167
Threads: 160
Joined: Sep 2016
Jan-24-2021, 10:16 AM
(This post was last modified: Jan-24-2021, 10:17 AM by buran.)
I didn't look at your code in depth. Now I see it's a bit weird. You iterate over the bunch of files twice: first to read the titles into a separate list, then again to read the address(es).
There is no need to use find_all for the title - a document is expected to have only one title tag, right? Just use soup.find().
Then, is there one address or multiple in each file?
You write to the file after you have exited the second loop, but the list comprehension gives you only the data from the last file, not all files (i.e., unlike when you append to a single list) - this is something I overlooked.
Finally, you write two lists, but I don't think that will give you what you expect anyway.
import csv

path = "C:\\Users\\mzoljan\\Downloads\\lksd\\"
for infile in glob.glob(os.path.join(path, "*.html")):
    with open(infile, "r") as f, open('output2.csv', 'a') as myfile:
        writer = csv.writer(myfile)
        soup = BeautifulSoup(f.read(), 'lxml')
        title = soup.find("title")
        if title:
            title = soup.title.string
        else:
            title = ''  # just in case there is no title tag
        address = soup.find_all("address", class_={"styles_address__zrPvy"})  # do you really need find_all?
        for item in address:
            writer.writerow([title, item.string])

Note, the code is not tested as I don't have your html files.
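(Editorial aside: the list-comprehension pitfall described above can be seen in a tiny standalone example. Assigning a comprehension inside a loop rebinds the name on every pass, so only the last iteration's items survive; appending or extending accumulates instead.)

```python
# Assigning inside the loop replaces the list on every pass,
# so only the last batch's items survive.
results = []
for batch in [[1, 2], [3], [4, 5]]:
    results = [x * 10 for x in batch]   # rebinds the name each iteration
print(results)  # [40, 50] -- earlier batches are lost

# Extending instead accumulates across iterations.
results = []
for batch in [[1, 2], [3], [4, 5]]:
    results.extend(x * 10 for x in batch)
print(results)  # [10, 20, 30, 40, 50]
```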
Posts: 6
Threads: 2
Joined: Jan 2021
Jan-24-2021, 11:26 PM
(This post was last modified: Jan-24-2021, 11:26 PM by Dredd.)
(Jan-24-2021, 10:16 AM)buran Wrote: I didn't look at your code in depth. Now I see it's a bit weird. You iterate over the bunch of files twice: first to read the titles into a separate list, then again to read the address(es).
There is no need to use find_all for the title - a document is expected to have only one title tag, right? Just use soup.find().
Then, is there one address or multiple in each file?
You write to the file after you have exited the second loop, but the list comprehension gives you only the data from the last file, not all files (i.e., unlike when you append to a single list) - this is something I overlooked.
Finally, you write two lists, but I don't think that will give you what you expect anyway.

import csv

path = "C:\\Users\\mzoljan\\Downloads\\lksd\\"
for infile in glob.glob(os.path.join(path, "*.html")):
    with open(infile, "r") as f, open('output2.csv', 'a') as myfile:
        writer = csv.writer(myfile)
        soup = BeautifulSoup(f.read(), 'lxml')
        title = soup.find("title")
        if title:
            title = soup.title.string
        else:
            title = ''  # just in case there is no title tag
        address = soup.find_all("address", class_={"styles_address__zrPvy"})  # do you really need find_all?
        for item in address:
            writer.writerow([title, item.string])

Note, the code is not tested as I don't have your html files.
Hi Buran, thanks for your help here. It is still not working as expected. To simplify: I have 4 locally downloaded HTML files from which I am trying to return the title and address in string format. **CORRECTION** The method is returning 6x of the titles and only 1x address.
This is an example of one of the HTML files (I have multiple): https://toddle.com.au/centres/tobeme-ear...-five-dock. I am basically just trying to scrape the name (title) and address of the school from locally downloaded HTML files.
Again, your assistance is greatly appreciated.
Posts: 7,324
Threads: 123
Joined: Sep 2016
Quick test, see if this helps.

import requests
from bs4 import BeautifulSoup

url = 'https://toddle.com.au/centres/tobeme-early-learning-five-dock'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
title = soup.select_one('div.styles_headingBox__1BiMj > h1')
print(title.text)
adress = soup.select_one('.styles_address__zrPvy')
print(adress.text)

Output:
ToBeMe Early Learning - Five Dock
25-27 Spencer Street, Five Dock
Posts: 6
Threads: 2
Joined: Jan 2021
(Jan-24-2021, 11:49 PM)snippsat Wrote: Quick test, see if this helps.

import requests
from bs4 import BeautifulSoup

url = 'https://toddle.com.au/centres/tobeme-early-learning-five-dock'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
title = soup.select_one('div.styles_headingBox__1BiMj > h1')
print(title.text)
adress = soup.select_one('.styles_address__zrPvy')
print(adress.text)

Output:
ToBeMe Early Learning - Five Dock
25-27 Spencer Street, Five Dock
Hi Snip, thanks for your reply. As you can see from my code above, I am pulling the data from multiple locally saved HTML files using glob. Any suggestions how to do this?
Posts: 8,167
Threads: 160
Joined: Sep 2016
(Jan-24-2021, 11:26 PM)Dredd Wrote: The method is returning 6x of the titles and only 1x address.
I don't see how it would return 6 titles and one address, but anyway:
import csv
import glob
import os
from bs4 import BeautifulSoup

def parse(fname):
    with open(fname) as f:
        soup = BeautifulSoup(f.read(), 'lxml')
    title = soup.find("title")
    address = soup.find("address", class_={"styles_address__zrPvy"})
    return [title.text, address.text]

path = "C:\\Users\\mzoljan\\Downloads\\lksd\\"
with open('output2.csv', 'w') as myfile:
    writer = csv.writer(myfile)
    for infile in glob.glob(os.path.join(path, "*.html")):
        writer.writerow(parse(infile))

With a single html file in the folder this produces:
Output: "ToBeMe Early Learning - Five Dock, Five Dock | Toddle","25-27 Spencer Street, Five Dock"
There are a bunch of <script> tags with JSON inside, and it is possible to extract the above info from them as well.
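(Editorial aside: a hedged sketch of the alternative buran mentions, pulling the data from an embedded JSON <script> tag rather than the visible HTML. The JSON structure shown here is hypothetical; the real page's script tags would need to be inspected to find the actual keys. A script tag has a single string child, so .string works on it.)

```python
import json
from bs4 import BeautifulSoup

# Stand-in HTML with an embedded JSON-LD script tag; the keys are
# illustrative, not taken from the actual toddle.com.au page.
html = '''<html><head>
<script type="application/ld+json">
{"name": "ToBeMe Early Learning - Five Dock",
 "address": {"streetAddress": "25-27 Spencer Street"}}
</script>
</head></html>'''

soup = BeautifulSoup(html, 'html.parser')
data = json.loads(soup.find('script', type='application/ld+json').string)
print(data['name'])                      # school name
print(data['address']['streetAddress'])  # street address
```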
Posts: 6
Threads: 2
Joined: Jan 2021
(Jan-25-2021, 06:52 AM)buran Wrote: (Jan-24-2021, 11:26 PM)Dredd Wrote: The method is returning 6x of the titles and only 1x address.
I don't see how it would return 6 titles and one address, but anyway:

import csv
import glob
import os
from bs4 import BeautifulSoup

def parse(fname):
    with open(fname) as f:
        soup = BeautifulSoup(f.read(), 'lxml')
    title = soup.find("title")
    address = soup.find("address", class_={"styles_address__zrPvy"})
    return [title.text, address.text]

path = "C:\\Users\\mzoljan\\Downloads\\lksd\\"
with open('output2.csv', 'w') as myfile:
    writer = csv.writer(myfile)
    for infile in glob.glob(os.path.join(path, "*.html")):
        writer.writerow(parse(infile))

With a single html file in the folder this produces:
Output: "ToBeMe Early Learning - Five Dock, Five Dock | Toddle","25-27 Spencer Street, Five Dock"
There are a bunch of <script> tags with JSON inside, and it is possible to extract the above info from them as well.
You da man Buran!