 BeautifulSoup Parsing Error
#1
Hi all,

Novice here, so go easy on me. I'm in the process of writing a web scraping script to automate a public records database search. For some reason, the HTML that comes back once I've parsed it with BeautifulSoup contains a formatting error that results in most of the data being omitted from the table it's in: the table gets closed after the first row instead of after the last row. Where should I be looking to fix this? I've included the code I'm using below:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://public.hcad.org/records/Real/AdvancedResults.asp?name=&desc=&stname=westheimer&bstnum=&estnum=&zip=&kmap=&facet=&isd=&StateCategory=F1&BSC=&LUC=&nbhd=&val=&valrange=.10&sqft=&sqftrange=.10&Sort=Account&bstep=0&taxyear=2017&Search=Search')
print(r.status_code)  # confirm the request succeeded (200)
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.prettify())
buran wrote Jun-07-2017, 05:18 PM:
Please use proper tags when posting code, tracebacks, output, etc. This time I added code tags for you. See BBcode help for more info.
#2
What are you trying to get out of the page?
If you ignore the tables, and just go straight for the rows, you can get the data. I was playing around, and this seems to work fine: rows = soup.find_all("tr"). You'll need to skip the first 7 or so rows, since those are headers, the search criteria, generic info, etc.
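A minimal sketch of that row-based approach, using a stand-in table instead of the live page (the real page's markup is messier, and the exact number of header/criteria rows to skip may differ):

```python
from bs4 import BeautifulSoup

# Stand-in for r.text; on the real page, roughly the first 7 rows
# are headers, search criteria, and generic info that need skipping.
html = """
<table>
  <tr><th>Account</th><th>Address</th></tr>
  <tr><td>0660640130020</td><td>9999 WESTHEIMER RD</td></tr>
  <tr><td>0660640130021</td><td>9998 WESTHEIMER RD</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = soup.find_all("tr")                # grab every row, ignoring table nesting
data = [[td.get_text(strip=True) for td in row.find_all("td")]
        for row in rows[1:]]              # skip the header row(s)
for cells in data:
    print(cells)
```

Since `find_all("tr")` ignores which table a row belongs to, this sidesteps the broken `</table>` entirely.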
#3
Thanks for the reply! It worked, but I also tried downloading the lxml parser and using that, which worked and allowed me to select the specific table I'm interested in:

soup = BeautifulSoup(r.text, 'lxml')
property_table = soup.find_all('table')[3]
The issue I'm stuck with now is that I'm not sure how to extract the text from it. Using

property_table.get_text()
returns mostly text but with a lot of newline ("\n") characters and without formatting. Basically, my question now is this: How do I extract this data nicely into an Excel table or .csv format?

Thanks again for your help!
#4
(Jun-07-2017, 09:11 PM)slinkplink Wrote: returns mostly text but with a lot of newline ("\n") tags and without formatting. Basically, my question now is this: How do I extract this data nicely into an excel table or .csv format?
That's because it's getting all the text from the table, and whitespace is considered text. You can use str.strip() to remove the whitespace from each row's text as you extract it row by row.
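For example, the stripping can be done per cell with get_text(strip=True) and the rows written out with the stdlib csv module. A sketch on a stand-in table (in the thread, property_table would come from soup.find_all('table')[3]; the io.StringIO buffer here just stands in for a real file):

```python
import csv
import io

from bs4 import BeautifulSoup

# Stand-in for the parsed property table, with the kind of stray
# whitespace get_text() would otherwise carry through.
html = """
<table>
  <tr><td> 0660640130020 </td><td>  9999 WESTHEIMER RD  </td></tr>
  <tr><td> 0660640130021 </td><td>  9998 WESTHEIMER RD  </td></tr>
</table>
"""
property_table = BeautifulSoup(html, "html.parser")

buf = io.StringIO()  # swap in open('results.csv', 'w', newline='') for a real file
writer = csv.writer(buf)
for row in property_table.find_all("tr"):
    # get_text(strip=True) per cell drops the surrounding newlines/whitespace
    writer.writerow(td.get_text(strip=True) for td in row.find_all("td"))

print(buf.getvalue())
```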

I'm sure there are already pre-made HTML-to-CSV scripts if you search for them; this is the first result from Google:
https://gist.github.com/n8henrie/08a31f02fd1282d12b75

test.py
#!/usr/bin/env python3
"""html_to_csv.py
Prompts for a URL, displays HTML tables from that page, then converts
the selected table to a csv file.
"""

import sys

import pandas

if sys.version_info[0] == 2:
    input = raw_input

url = input("Enter the URL: ")
tables = pandas.read_html(url)
'''
for index, table in enumerate(tables):
    print("Table {}:".format(index + 1))
    print(table.head(), '\n')
    print('-' * 60)
    print('\n')
'''
choice = int(input("Enter the number of the table you want: ")) - 1
filename = input("Enter a filename (.csv extension assumed): ") + '.csv'

with open(filename, 'w') as outfile:
    tables[choice].to_csv(outfile, index=False, header=False)
Output:
metulburr@ubuntu:~$ python test.py
Enter the URL: https://public.hcad.org/records/Real/AdvancedResults.asp?name=&desc=&stname=westheimer&bstnum=&estnum=&zip=&kmap=&facet=&isd=&StateCategory=F1&BSC=&LUC=&nbhd=&val=&valrange=.10&sqft=&sqftrange=.10&Sort=Account&bstep=0&taxyear=2017&Search=Search
Enter the number of the table you want: 3
Enter a filename (.csv extension assumed): stuff
#5
Thanks! That's exactly what I was looking for. It's got me thinking now: is there a way I could build a data set that consists of aggregated search results from the page I'm searching? What I mean is: what would you recommend I look into for automatically stringing together several sets of search results from the same page? (They would all have the same format since they would be coming from the same source.)
For example, if I want to search all the properties along a given street - "westheimer", for example - but within that, search only properties that also have the State Category F1, F2, C1, and C2, what should I look into to automatically conduct those four searches and export that data into one .csv file?
I don't mean to have anyone do this for me, I'm trying to learn how to do it. As little or as much direction as anyone is willing to provide would be much appreciated though!
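One way to sketch the aggregation, assuming the query parameter names from the search URL earlier in the thread (stname, StateCategory, taxyear) and the table index [3] found above — both are assumptions about the site, not verified here:

```python
from urllib.parse import urlencode

BASE = "https://public.hcad.org/records/Real/AdvancedResults.asp"

def search_url(stname, state_category, taxyear="2017"):
    # Only the fields being varied are filled in; the site's many other
    # query fields are left out for brevity (an assumption -- the real
    # form may require them to be present, even if empty).
    params = {
        "stname": stname,
        "StateCategory": state_category,
        "taxyear": taxyear,
        "Sort": "Account",
        "Search": "Search",
    }
    return BASE + "?" + urlencode(params)

# One search per state category, stitched into a single CSV
# (needs network access, so it is left commented out here):
# import pandas as pd
# frames = [pd.read_html(search_url("westheimer", cat))[3]
#           for cat in ("F1", "F2", "C1", "C2")]
# pd.concat(frames, ignore_index=True).to_csv("westheimer.csv", index=False)

print(search_url("westheimer", "F1"))
```

The idea is just to loop over the category codes, run the same read_html extraction for each, and concatenate the resulting DataFrames before writing one CSV.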
#6
start with snippsat's tutorials on web scraping
here and here
#7
To scrape specific elements using the class attribute as an example:

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup

try:
    html = urlopen("https://likegeeks.com/")
except HTTPError as e:
    print(e)
except URLError:
    print("Server down or incorrect domain")
else:
    res = BeautifulSoup(html.read(), "html5lib")
    tags = res.findAll("h3", {"class": "post-title"})
    for tag in tags:
        print(tag.getText())
Also, you can pass findAll a list of tag names to match any of them:

tags = res.findAll(["span", "a", "img"])
Or by inner text:

tags = res.findAll(text="Python Programming Basics with Examples")
Check this tutorial: Python web scraping
