Python Forum
Extracting the Address tag from multiple HTML files using BeautifulSoup
#1
Hi All,

The code below works exactly how I want it to for 'title', but it is not working at all for 'address'.

import csv
import glob
import os
from bs4 import BeautifulSoup

path = "C:\\Users\\mpeter\\Downloads\\lksd\\"

titleList = []

for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(open(markup, "r").read(), 'lxml')
    title = soup.find_all("title")
    title = soup.title.string
    titleList.append(title)
    
streetAddressList = []

for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(open(markup, "r").read(), 'lxml')
    address = soup.find_all("address", class_={"styles_address__zrPvy"})
    address = soup.address.string
    streetAddressList.append(address)
  
with open('output2.csv', 'w') as myfile:
   writer = csv.writer(myfile)
   writer.writerows((titleList, streetAddressList))
Here is the HTML for the address element.

[<address class="styles_address__zrPvy"><svg class="styles_addressIcon__3Pu3L" height="42" viewbox="0 0 32 42" width="32" xmlns="http://www.w3.org/2000/svg"><path d="M14.381 41.153C2.462 23.873.25 22.1.25 15.75.25 7.051 7.301 0 16 0s15.75 7.051 15.75 15.75c0 6.35-2.212 8.124-14.131 25.403a1.97 1.97 0 01-3.238 0zM16 22.313a6.562 6.562 0 100-13.125 6.562 6.562 0 000 13.124z"></path></svg>Level 1 44 Market Street<!-- -->, <!-- -->Sydney</address>]

All I want is the Title and Address elements in string format. Address works if I leave out the .string line, but then it gives the whole HTML tag. Please help.
Reply
#2
We are not able to run your code, but soup.find_all returns a list, and a list has no string attribute.
You need to iterate over the list and append the string attribute of every element,
something like
streetAddressList = [item.string for item in address]
This replaces lines 12, 18, 19.
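A minimal sketch you can run (using the address markup you posted, the stdlib html.parser instead of lxml, and assuming bs4 is installed). It shows that find_all returns a list-like ResultSet, that .string on the tag itself is None because the tag has several children (the svg, comments and text pieces), and that get_text() flattens them:

```python
from bs4 import BeautifulSoup

html = ('<address class="styles_address__zrPvy"><svg></svg>'
        'Level 1 44 Market Street<!-- -->, <!-- -->Sydney</address>')
soup = BeautifulSoup(html, 'html.parser')

addresses = soup.find_all('address', class_='styles_address__zrPvy')
print(type(addresses))   # a list-like ResultSet, not a single tag

tag = addresses[0]
print(tag.string)        # None -- the tag has more than one child
print(tag.get_text())    # the flattened text of all children

# iterate over the ResultSet, one entry per matching tag
streetAddressList = [a.get_text() for a in addresses]
print(streetAddressList)
```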
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

Reply
#3
(Jan-24-2021, 06:35 AM)buran Wrote: We are not able to run your code, but soup.find_all returns a list, and a list has no string attribute.
You need to iterate over the list and append the string attribute of every element,
something like
streetAddressList = [item.string for item in address]
This replaces lines 12, 18, 19.

Hi Buran, thanks for your comment. It looks like a very simple solution, which I like. For some reason though, it is not returning any data.

Here is the new code you suggested:

path = "C:\\Users\\mzoljan\\Downloads\\lksd\\"

titleList = []

for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(open(markup, "r").read(), 'lxml')
    title = soup.find_all("title")
    title = soup.title.string
    titleList.append(title)

for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(open(markup, "r").read(), 'lxml')
    address = soup.find_all("address", class_={"styles_address__zrPvy"})
streetAddressList = [item.string for item in address]


with open('output2.csv', 'w') as myfile:
   writer = csv.writer(myfile)
   writer.writerows((titleList, streetAddressList))
Am I missing something?
Reply
#4
I didn't look at your code in depth. Now I see it's a bit weird. You iterate over a bunch of files to read each title into a separate list, then iterate again to read the address(es).

There is no need to use find_all for title - a file is expected to have only one title tag, right? Just use soup.find().
Then, is it one address or multiple in each file?
You write to a file after you have exited the second loop. But using a list comprehension will give you only the data from the last file, not all files (i.e. unlike when you append to a single list) - this is something I overlooked.
Finally, you write 2 lists, but I don't think that will give you what you expect anyway.
import csv
import glob
import os
from bs4 import BeautifulSoup
path = "C:\\Users\\mzoljan\\Downloads\\lksd\\"
 
for infile in glob.glob(os.path.join(path, "*.html")):
    with open(infile, "r") as f, open('output2.csv', 'a') as myfile:  # 'a' = append across files
        writer = csv.writer(myfile)
        soup = BeautifulSoup(f.read(), 'lxml')
        title = soup.find("title")
        if title:
           title = soup.title.string
        else:
            title = '' # just in case there is no title tag
        address = soup.find_all("address", class_={"styles_address__zrPvy"}) # do you really need find_all?
        for item in address:
            writer.writerow([title, item.string])
Note, the code is not tested as I don't have your html files.

Reply
#5
(Jan-24-2021, 10:16 AM)buran Wrote: I didn't look at your code in depth. Now I see it's a bit weird. You iterate over a bunch of files to read each title into a separate list, then iterate again to read the address(es).

There is no need to use find_all for title - a file is expected to have only one title tag, right? Just use soup.find().
Then, is it one address or multiple in each file?
You write to a file after you have exited the second loop. But using a list comprehension will give you only the data from the last file, not all files (i.e. unlike when you append to a single list) - this is something I overlooked.
Finally, you write 2 lists, but I don't think that will give you what you expect anyway.
import csv
path = "C:\\Users\\mzoljan\\Downloads\\lksd\\"
 
for infile in glob.glob(os.path.join(path, "*.html")):
    with open(infile, "r") as f, open('output2.csv', 'a') as myfile:
        writer = csv.writer(myfile)
        soup = BeautifulSoup(f.read(), 'lxml')
        title = soup.find("title")
        if title:
           title = soup.title.string
        else:
            title = '' # just in case there is no title tag
        address = soup.find_all("address", class_={"styles_address__zrPvy"}) # do you really need find_all?
        for item in address:
            writer.writerow([title, item.string])
Note, the code is not tested as I don't have your html files.

Hi Buran, thanks for your help here. It is still not working as expected. To simplify: I have 4 locally downloaded HTML files from which I am trying to return the title and address in string format. **CORRECTION** The method is returning 6 of the titles and only 1 address.

This is an example of one of the HTML files, of which I have multiple: https://toddle.com.au/centres/tobeme-ear...-five-dock. I am basically just trying to scrape the name (title) and address of the school from locally downloaded HTML files.

Again, your assistance is greatly appreciated.
Reply
#6
Quick test, see if this helps.
import requests
from bs4 import BeautifulSoup

url = 'https://toddle.com.au/centres/tobeme-early-learning-five-dock'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
title = soup.select_one('div.styles_headingBox__1BiMj > h1')
print(title.text)
address = soup.select_one('.styles_address__zrPvy')
print(address.text)
Output:
ToBeMe Early Learning - Five Dock 25-27 Spencer Street, Five Dock
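The same selectors should work on a locally saved copy of the page. A quick sketch with a made-up sample file (the markup here is an assumption, modelled on the live page's structure):

```python
from bs4 import BeautifulSoup

# A stand-in for one of the saved pages; the class names are taken
# from the live site, the rest of the markup is assumed.
sample = '''<html><body>
<div class="styles_headingBox__1BiMj"><h1>ToBeMe Early Learning - Five Dock</h1></div>
<address class="styles_address__zrPvy">25-27 Spencer Street, Five Dock</address>
</body></html>'''

with open('sample.html', 'w', encoding='utf-8') as f:
    f.write(sample)

# Parse the local file instead of fetching with requests
with open('sample.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

title = soup.select_one('div.styles_headingBox__1BiMj > h1')
address = soup.select_one('.styles_address__zrPvy')
print(title.text)
print(address.text)
```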
Reply
#7
(Jan-24-2021, 11:49 PM)snippsat Wrote: Quick test,see if this helps.
import requests
from bs4 import BeautifulSoup

url = 'https://toddle.com.au/centres/tobeme-early-learning-five-dock'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
title = soup.select_one('div.styles_headingBox__1BiMj > h1')
print(title.text)
address = soup.select_one('.styles_address__zrPvy')
print(address.text)
Output:
ToBeMe Early Learning - Five Dock 25-27 Spencer Street, Five Dock

Hi Snip, thanks for your reply. As you can see from my code above it is pulling it from multiple locally saved HTML files using glob. Any suggestions how to do this?
Reply
#8
(Jan-24-2021, 11:26 PM)Dredd Wrote: The method is returning 6x of the titles and only 1x address.
I don't see how it will return 6 titles and one address, but anyway

import csv
import glob
import os
from bs4 import BeautifulSoup

def parse(fname):
    with open(fname) as f:
        soup = BeautifulSoup(f.read(), 'lxml')
        title = soup.find("title")
        address = soup.find("address", class_="styles_address__zrPvy")  # find returns the first matching tag
        return [title.text, address.text]


path = "C:\\Users\\mzoljan\\Downloads\\lksd\\"
with open('output2.csv', 'w') as myfile:
    writer = csv.writer(myfile)
    for infile in glob.glob(os.path.join(path, "*.html")):
        writer.writerow(parse(infile))
With a single html file in the folder this produces:

Output:
"ToBeMe Early Learning - Five Dock, Five Dock | Toddle","25-27 Spencer Street, Five Dock"
There are a bunch of <script> tags with JSON inside, and it is possible to extract the above info from them as well.
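For example (a sketch only - the JSON keys here are an assumption, check what is actually inside your files). Note that .string works here because a script tag has a single text child:

```python
import json
from bs4 import BeautifulSoup

# Hypothetical JSON-LD block; the real keys on the site may differ.
html = '''<script type="application/ld+json">
{"name": "ToBeMe Early Learning - Five Dock",
 "address": {"streetAddress": "25-27 Spencer Street", "addressLocality": "Five Dock"}}
</script>'''

soup = BeautifulSoup(html, 'html.parser')
# a script tag has one child, so .string returns its text
raw = soup.find('script', type='application/ld+json').string
data = json.loads(raw)
name = data['name']
street = data['address']['streetAddress']
print(name, '|', street)
```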

Reply
#9
(Jan-25-2021, 06:52 AM)buran Wrote:
(Jan-24-2021, 11:26 PM)Dredd Wrote: The method is returning 6x of the titles and only 1x address.
I don't see how it will return 6 titles and one address, but anyway

import csv
import glob
import os
from bs4 import BeautifulSoup

def parse(fname):
    with open(fname) as f:
        soup = BeautifulSoup(f.read(), 'lxml')
        title = soup.find("title")
        address = soup.find("address", class_="styles_address__zrPvy")  # find returns the first matching tag
        return [title.text, address.text]


path = "C:\\Users\\mzoljan\\Downloads\\lksd\\"
with open('output2.csv', 'w') as myfile:
    writer = csv.writer(myfile)
    for infile in glob.glob(os.path.join(path, "*.html")):
        writer.writerow(parse(infile))
With a single html file in the folder this produces:

Output:
"ToBeMe Early Learning - Five Dock, Five Dock | Toddle","25-27 Spencer Street, Five Dock"
There are a bunch of <script> tags with JSON inside, and it is possible to extract the above info from them as well.

You da man Buran!
Reply

