Posts: 6
Threads: 2
Joined: Jan 2021
Jan-24-2021, 03:40 AM
Hi All,
The code below works exactly how I want it to for 'title', but it isn't working at all for 'address'.
path = "C:\\Users\\mpeter\\Downloads\\lksd\\"

titleList = []
for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(open(markup, "r").read(), 'lxml')
    title = soup.find_all("title")
    title = soup.title.string
    titleList.append(title)

streetAddressList = []
for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(open(markup, "r").read(), 'lxml')
    address = soup.find_all("address", class_={"styles_address__zrPvy"})
    address = soup.address.string
    streetAddressList.append(address)

with open('output2.csv', 'w') as myfile:
    writer = csv.writer(myfile)
    writer.writerows((titleList, streetAddressList))
Here is the HTML for the address element.
[<address class="styles_address__zrPvy"><svg class="styles_addressIcon__3Pu3L" height="42" viewbox="0 0 32 42" width="32" xmlns="http://www.w3.org/2000/svg"><path d="M14.381 41.153C2.462 23.873.25 22.1.25 15.75.25 7.051 7.301 0 16 0s15.75 7.051 15.75 15.75c0 6.35-2.212 8.124-14.131 25.403a1.97 1.97 0 01-3.238 0zM16 22.313a6.562 6.562 0 100-13.125 6.562 6.562 0 000 13.124z"></path></svg>Level 1 44 Market Street<!-- -->, <!-- -->Sydney</address>]
All I want is the title and address elements in string format. The address works if I don't insert the .string line, but then it just gives the whole HTML element. Please help.
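For example, here is a minimal standalone repro using a trimmed copy of the address snippet above (just bs4 and lxml), showing exactly this behaviour:

from bs4 import BeautifulSoup

# Minimal repro with a trimmed copy of the address element quoted above,
# not the actual downloaded files.
html = ('<address class="styles_address__zrPvy"><svg></svg>'
        'Level 1 44 Market Street<!-- -->, <!-- -->Sydney</address>')
soup = BeautifulSoup(html, 'lxml')

address = soup.find_all("address", class_={"styles_address__zrPvy"})
print(address)              # the whole tag as raw HTML - this is what lands in the CSV
print(soup.address.string)  # None - so nothing useful gets appended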
Posts: 8,167
Threads: 160
Joined: Sep 2016
Jan-24-2021, 06:35 AM
(This post was last modified: Jan-24-2021, 06:35 AM by buran.)
We are not able to run your code, but soup.find_all() returns a list, and a list has no string attribute.
You need to iterate over the list and take the string attribute of every element,
something like:
streetAddressList = [item.string for item in address]
This replaces the streetAddressList = [], address = soup.address.string and streetAddressList.append(address) lines in your code.
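A self-contained illustration of the point, with made-up HTML rather than your actual pages:

from bs4 import BeautifulSoup

# Made-up HTML for illustration only - not the real downloaded pages.
html = '<address class="x">1 First St</address><address class="x">2 Second St</address>'
soup = BeautifulSoup(html, 'lxml')

address = soup.find_all("address", class_="x")
print(type(address))  # <class 'bs4.element.ResultSet'> - a list subclass, so no .string attribute

streetAddressList = [item.string for item in address]
print(streetAddressList)  # ['1 First St', '2 Second St']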
Posts: 6
Threads: 2
Joined: Jan 2021
(Jan-24-2021, 06:35 AM)buran Wrote: We are not able to run your code, but soup.find_all() returns a list, and a list has no string attribute.
You need to iterate over the list and take the string attribute of every element,
something like:
streetAddressList = [item.string for item in address]
This replaces the streetAddressList = [], address = soup.address.string and streetAddressList.append(address) lines in your code.
Hi Buran, thanks for your comment. It looks like a very simple solution, which I like. For some reason, though, it is not returning any data.
Here is the code with your suggestion applied:
path = "C:\\Users\\mzoljan\\Downloads\\lksd\\"

titleList = []
for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(open(markup, "r").read(), 'lxml')
    title = soup.find_all("title")
    title = soup.title.string
    titleList.append(title)

for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(open(markup, "r").read(), 'lxml')
    address = soup.find_all("address", class_={"styles_address__zrPvy"})
    streetAddressList = [item.string for item in address]

with open('output2.csv', 'w') as myfile:
    writer = csv.writer(myfile)
    writer.writerows((titleList, streetAddressList))
Am I missing something?
Posts: 8,167
Threads: 160
Joined: Sep 2016
Jan-24-2021, 10:16 AM
(This post was last modified: Jan-24-2021, 10:17 AM by buran.)
I didn't look at your code in depth. Now I see it's a bit weird. You iterate over the bunch of files once to read the titles into a separate list, then iterate again to read the address(es).
There is no need to use find_all for the title - each file is expected to have only one title tag, right? Just use soup.find().
Then, is it one address or multiple addresses in each file?
You write to the file only after you have exited the second loop. But the list comprehension gives you only the data from the last file, not all files (as opposed to appending to a single list, which accumulates across files) - this is something I overlooked.
Finally, you write the 2 lists, but I don't think that will give you what you expect anyway.
import csv
import glob
import os

from bs4 import BeautifulSoup

path = "C:\\Users\\mzoljan\\Downloads\\lksd\\"

for infile in glob.glob(os.path.join(path, "*.html")):
    # open the csv in append mode so rows from every file accumulate
    with open(infile, "r") as f, open('output2.csv', 'a') as myfile:
        writer = csv.writer(myfile)
        soup = BeautifulSoup(f.read(), 'lxml')
        title = soup.find("title")
        if title:
            title = soup.title.string
        else:
            title = ''
        address = soup.find_all("address", class_={"styles_address__zrPvy"})
        for item in address:
            writer.writerow([title, item.string])
Note, the code is not tested as I don't have your html files.
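To make the point about writing the 2 lists concrete, here is a tiny standalone demo with made-up data, showing why your original writer.writerows((titleList, streetAddressList)) gives two long rows, and how zip would pair them instead:

import csv
import io

# Made-up data, standing in for one title and one address per file.
titleList = ['Title A', 'Title B']
streetAddressList = ['Address A', 'Address B']

# Passing the two lists writes one row of titles and one row of addresses.
buf = io.StringIO()
csv.writer(buf).writerows((titleList, streetAddressList))
print(buf.getvalue())
# Title A,Title B
# Address A,Address B

# Zipping them writes one row per file, title next to its address.
buf = io.StringIO()
csv.writer(buf).writerows(zip(titleList, streetAddressList))
print(buf.getvalue())
# Title A,Address A
# Title B,Address B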
Posts: 6
Threads: 2
Joined: Jan 2021
Jan-24-2021, 11:26 PM
(This post was last modified: Jan-24-2021, 11:26 PM by Dredd.)
(Jan-24-2021, 10:16 AM)buran Wrote: I didn't look at your code in depth. Now I see it's a bit weird. You iterate over the bunch of files once to read the titles into a separate list, then iterate again to read the address(es).
There is no need to use find_all for the title - each file is expected to have only one title tag, right? Just use soup.find().
Then, is it one address or multiple addresses in each file?
You write to the file only after you have exited the second loop. But the list comprehension gives you only the data from the last file, not all files (as opposed to appending to a single list, which accumulates across files) - this is something I overlooked.
Finally, you write the 2 lists, but I don't think that will give you what you expect anyway.
import csv
import glob
import os

from bs4 import BeautifulSoup

path = "C:\\Users\\mzoljan\\Downloads\\lksd\\"

for infile in glob.glob(os.path.join(path, "*.html")):
    # open the csv in append mode so rows from every file accumulate
    with open(infile, "r") as f, open('output2.csv', 'a') as myfile:
        writer = csv.writer(myfile)
        soup = BeautifulSoup(f.read(), 'lxml')
        title = soup.find("title")
        if title:
            title = soup.title.string
        else:
            title = ''
        address = soup.find_all("address", class_={"styles_address__zrPvy"})
        for item in address:
            writer.writerow([title, item.string])
Note, the code is not tested as I don't have your html files.
Hi Buran, thanks for your help here. It is still not working as expected. To simplify: I have 4 locally downloaded HTML files from which I am trying to return the title and address in string format. **CORRECTION** The method is returning 6 titles and only 1 address.
This is an example of one of the HTML files (I have multiple): https://toddle.com.au/centres/tobeme-ear...-five-dock. I am basically just trying to scrape the name (title) and address of the school from locally downloaded HTML files.
Again, your assistance is greatly appreciated.
Posts: 7,324
Threads: 123
Joined: Sep 2016
Quick test, see if this helps.
import requests
from bs4 import BeautifulSoup

# url = the centre page linked in the post above
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
title = soup.select_one('div.styles_headingBox__1BiMj > h1')
print(title.text)
adress = soup.select_one('.styles_address__zrPvy')
print(adress.text)
Output: ToBeMe Early Learning - Five Dock
25-27 Spencer Street, Five Dock
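If you are working from a locally saved copy of the page instead, the same selectors should work on the file contents - a sketch, assuming the saved HTML keeps those class names (saved_page.html is just a placeholder name):

from bs4 import BeautifulSoup

# Sketch only: same selectors as above, applied to a locally saved copy of the page.
with open("saved_page.html", encoding="utf-8") as f:  # placeholder file name
    soup = BeautifulSoup(f.read(), 'lxml')

title = soup.select_one('div.styles_headingBox__1BiMj > h1')
adress = soup.select_one('.styles_address__zrPvy')
if title and adress:
    print(title.text)
    print(adress.text)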
Posts: 6
Threads: 2
Joined: Jan 2021
(Jan-24-2021, 11:49 PM)snippsat Wrote: Quick test, see if this helps.
import requests
from bs4 import BeautifulSoup

# url = the centre page linked in the post above
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
title = soup.select_one('div.styles_headingBox__1BiMj > h1')
print(title.text)
adress = soup.select_one('.styles_address__zrPvy')
print(adress.text)
Output: ToBeMe Early Learning - Five Dock
25-27 Spencer Street, Five Dock
Hi Snip, thanks for your reply. As you can see from my code above, I am pulling the data from multiple locally saved HTML files using glob. Any suggestions on how to do this?
Posts: 8,167
Threads: 160
Joined: Sep 2016
(Jan-24-2021, 11:26 PM)Dredd Wrote: The method is returning 6 titles and only 1 address. I don't see how it would return 6 titles and one address, but anyway:
import csv
import glob
import os

from bs4 import BeautifulSoup


def parse(fname):
    with open(fname) as f:
        soup = BeautifulSoup(f.read(), 'lxml')
        title = soup.find("title")
        address = soup.find("address", class_={"styles_address__zrPvy"})
        return [title.text, address.text]


path = "C:\\Users\\mzoljan\\Downloads\\lksd\\"
with open('output2.csv', 'w') as myfile:
    writer = csv.writer(myfile)
    for infile in glob.glob(os.path.join(path, "*.html")):
        writer.writerow(parse(infile))
With a single html file in the folder this produces:
Output: "ToBeMe Early Learning - Five Dock, Five Dock | Toddle","25-27 Spencer Street, Five Dock"
There are a bunch of <script> tags with JSON inside, and it is possible to extract the above info from them as well.
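A rough sketch of that JSON route, assuming the page embeds standard application/ld+json blocks that carry name/address fields - the exact keys are an assumption and would need to be checked against the real files:

import json

from bs4 import BeautifulSoup

# Rough sketch only - the JSON keys below are assumptions, not verified against the real pages.
with open("saved_page.html", encoding="utf-8") as f:  # placeholder file name
    soup = BeautifulSoup(f.read(), 'lxml')

for script in soup.find_all("script", type="application/ld+json"):
    try:
        data = json.loads(script.string or "")
    except json.JSONDecodeError:
        continue
    if not isinstance(data, dict):
        continue
    name = data.get("name")        # hypothetical key
    address = data.get("address")  # hypothetical key
    if name or address:
        print(name, address)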
Posts: 6
Threads: 2
Joined: Jan 2021
(Jan-25-2021, 06:52 AM)buran Wrote: (Jan-24-2021, 11:26 PM)Dredd Wrote: The method is returning 6 titles and only 1 address. I don't see how it would return 6 titles and one address, but anyway:
import csv
import glob
import os

from bs4 import BeautifulSoup


def parse(fname):
    with open(fname) as f:
        soup = BeautifulSoup(f.read(), 'lxml')
        title = soup.find("title")
        address = soup.find("address", class_={"styles_address__zrPvy"})
        return [title.text, address.text]


path = "C:\\Users\\mzoljan\\Downloads\\lksd\\"
with open('output2.csv', 'w') as myfile:
    writer = csv.writer(myfile)
    for infile in glob.glob(os.path.join(path, "*.html")):
        writer.writerow(parse(infile))
With a single html file in the folder this produces:
Output: "ToBeMe Early Learning - Five Dock, Five Dock | Toddle","25-27 Spencer Street, Five Dock"
There are a bunch of <script> tags with JSON inside, and it is possible to extract the above info from them as well.
You da man Buran!