Python Forum

Full Version: help with selecting from html
I am trying to select just the "One rate" label and the data values from an HTML string similar to the one below.


<tr class="bluebar">
<th headers="abcd"><p class="sub3">One rate</p></th>
<td><span class="datavalue">5</span></td>
<td><span class="datavalue">6</span></td>
<td><span class="datavalue">7</span></td>
<td><span class="datavalue">8</span></td>
<td><span class="datavalue">9</span></td>
</tr>


I used BeautifulSoup (variable page_soup):



Rates = page_soup.findAll("tr",{"class":"bluebar"})
#from the results selected the one at index 3 (which gave me the above string)
my_rate = Rates[3]
my_rate_txt = my_rate.p
my_rate_values = my_rate.findAll("td")
my_rate.text
Out[240]: '\nOne rate\n5\n6\n7\n8\n9\n'

is there an easier way to just select/print the "One rate" and data values 5, 6, 7, 8, 9 ?
thank you
from bs4 import BeautifulSoup

html = '''\
<tr class="bluebar">
 <th headers="abcd">
   <p class="sub3">One rate</p>
 </th>
 <td><span class="datavalue">5</span></td>
 <td><span class="datavalue">6</span></td>
 <td><span class="datavalue">7</span></td>
 <td><span class="datavalue">8</span></td>
 <td><span class="datavalue">9</span></td>
</tr>'''

soup = BeautifulSoup(html, 'lxml')
print(soup.select('.sub3')[0].text)
print([int(n.text) for n in soup.select('.datavalue')])
Output:
One rate
[5, 6, 7, 8, 9]
This uses CSS selectors; for more about this, look at Web-Scraping part-1.
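As a further illustration (my own minimal sketch, reusing the row markup from the question), a CSS selector can be scoped so that only cells inside the bluebar row match:

```python
from bs4 import BeautifulSoup

# Minimal sketch reusing the row markup from the question above.
html = '''<tr class="bluebar">
<th headers="abcd"><p class="sub3">One rate</p></th>
<td><span class="datavalue">5</span></td>
<td><span class="datavalue">6</span></td>
</tr>'''

soup = BeautifulSoup(html, 'html.parser')
# Descendant selectors: only match .sub3/.datavalue inside a tr.bluebar row.
print(soup.select_one('tr.bluebar .sub3').text)
print([int(s.text) for s in soup.select('tr.bluebar .datavalue')])
```

Note that html.parser is used here because it keeps a bare <tr> fragment as-is; stricter parsers may drop table tags that appear outside a <table>.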
Finally I decided to export the entire table to CSV... and this worked.
But when I open the export.csv file, some lines have a "b" flag (it looks like a binary flag).
Even when I remove the encode('utf8'), I still get the b flags.
How can I remove these b flags and get a clean CSV file?

import csv

table = page_soup.find("table", { "id" : "xyz" })
for row in table.findAll("tr"):
   cells = row.findAll("td")
headers = [header.text for header in table.find_all('th')]
rows = []
for row in table.find_all('tr'):
    rows.append([val.text.encode('utf8') for val in row.find_all('td')])

with open('export.csv', 'w') as f:
       writer = csv.writer(f)
       writer.writerow(headers)
       writer.writerows(row for row in rows if row)
encode('utf8')
How are you reading this in?
If it is read in with Requests (which gives back the site's encoding) and BeautifulSoup on Python 3, it will be Unicode.
Encoding:
Quote:Any HTML or XML document is written in a specific encoding like ASCII or UTF-8.
But when you load that document into Beautiful Soup, you’ll discover it’s been converted to Unicode

If you use encode('utf8') (going from Unicode --> bytes), then that is the wrong way.
>>> s = 'hello'
>>> s
'hello'
>>> s.encode('utf-8')
b'hello'
>>> type(s.encode('utf-8'))
<class 'bytes'>
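To see where the b prefixes in the file come from (a minimal sketch of my own, using an in-memory buffer instead of a real file): csv.writer converts non-string fields with str(), so a bytes value gets written as the literal text b'...':

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
# csv.writer calls str() on non-string fields, so the bytes value comes out
# as the literal text b'hello' -- exactly the "b flag" seen in export.csv.
writer.writerow([b'hello', 'world'])
print(buf.getvalue())  # b'hello',world
```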
An example where there is no encoding before writing to CSV.
Then tell the file to use utf-8 when writing out.
from bs4 import BeautifulSoup
import requests
import csv

url = 'https://www.python.org/'
url_get = requests.get(url)
print(url_get.encoding) #--> utf-8
soup = BeautifulSoup(url_get.content, 'lxml')
title_tag = soup.select('head > title')

with open('some.csv', 'w', encoding='utf-8') as f:
   writer = csv.writer(f)
   writer.writerow(title_tag) #--> <title>Welcome to Python.org</title>
I tried, but it doesn't work... the results still have the binary flags:

Category,June2016,Apr.2017,May2017,June2017,Change from:May2017-June2017,Estatus,CN pop,Clf,Prate,Em,Ep ratio,Unem,Un rate
b''
"b'253,397'","b'254,588'","b'254,767'","b'254,957'",b'190'
"b'158,889'","b'160,213'","b'159,784'","b'160,145'",b'361'



here is the entire code:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv
my_url = 'http://www.igobychad.com/test_table.html'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
page_soup.find("table", { "id" : "Emp_sum" })
table = page_soup.find("table", { "id" : "Emp_sum" })
for row in table.findAll("tr"):
   cells = row.findAll("td")
headers = [header.text for header in table.find_all('th')]
rows = []
for row in table.find_all('tr'):
    rows.append([val.text.encode('utf8') for val in row.find_all('td')])
with open('employment_situation.csv', 'w') as f:
       writer = csv.writer(f)
       writer.writerow(headers)
       writer.writerows(row for row in rows if row)

OK, I found it: I changed val.text.encode('utf8') to val.text.

But the results are still not formatted correctly like the HTML table on the page. Any idea how to fix that?
page_html = uClient.read()
Here you read into Python 3 without decoding, so it will be bytes.
There is no need to use read() at all; let BeautifulSoup convert to Unicode.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv

my_url = 'http://www.igobychad.com/test_table.html'
page_html = uReq(my_url)
page_soup = soup(page_html, "html.parser")
If you look at page_soup, you see that it is correct, no bytes.
Here is the advisable way to read it in:
from bs4 import BeautifulSoup
import requests
import csv

url = 'http://www.igobychad.com/test_table.html'
url_get = requests.get(url)
page_soup = BeautifulSoup(url_get.content, 'lxml')
So this uses Requests and lxml as the parser.
pip install requests lxml

This line does nothing:
page_soup.find("table", { "id" : "Emp_sum" })
You have to store the result in a variable before continuing:
page_soup = page_soup.find("table", { "id" : "Emp_sum" })
Yes, I am with you on that. Thank you.

Now the CSV formatting doesn't match with the table on the page. How can I fix that?
Write the fix yourself, or use one of the several Python packages for converting an HTML table to CSV.
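A sketch of the manual fix (my assumption about the mismatch: the loop only collects <td> cells, so the <th> row labels get dropped) is to grab both cell types per row. The table below is hypothetical, standing in for the Emp_sum table on the page:

```python
import csv
import io
from bs4 import BeautifulSoup

# Hypothetical table standing in for the real Emp_sum table.
html = '''<table id="Emp_sum">
<tr><th>Category</th><th>June2016</th></tr>
<tr><th>One rate</th><td>5</td></tr>
</table>'''

table = BeautifulSoup(html, 'html.parser').find('table', {'id': 'Emp_sum'})
# Collect both <th> and <td> per row so row labels are not dropped.
rows = [[cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        for tr in table.find_all('tr')]

buf = io.StringIO()  # in-memory stand-in for the real CSV file
csv.writer(buf).writerows(rows)
print(buf.getvalue())
```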
The easy way is to use Pandas.
import pandas as pd

url = 'http://www.igobychad.com/test_table.html'
for i, df in enumerate(pd.read_html(url)):
   df.to_csv('myfile_{}.csv'.format(i))
So it will parse all tables on the page; here there is only one.
Here is a Notebook showing how it looks.
That looks pretty good... but why isn't it portable? For example, if I just change the URL to any other HTML page with a table, why does it give many errors like this one:

TypeError                                 Traceback (most recent call last)
<ipython-input-26-2fbd83981590> in <module>()

----> 3 for i, df in enumerate(pd.read_html(url)):


.../anaconda/lib/python3.6/site-packages/pandas/io/html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, tupleize_cols, thousands, encoding, decimal, converters, na_values, keep_default_na)
    904                   thousands=thousands, attrs=attrs, encoding=encoding,
    905                   decimal=decimal, converters=converters, na_values=na_values,
--> 906                   keep_default_na=keep_default_na)

.../anaconda/lib/python3.6/site-packages/pandas/io/html.py in _parse(flavor, io, match, attrs, encoding, **kwargs)
    746     for table in tables:
    747         try:
--> 748             ret.append(_data_to_frame(data=table, **kwargs))
    749         except EmptyDataError:  # empty table
    750             continue

.../anaconda/lib/python3.6/site-packages/pandas/io/html.py in _data_to_frame(**kwargs)
    638 
    639     # fill out elements of body that are "ragged"
--> 640     _expand_elements(body)
    641     tp = TextParser(body, header=header, **kwargs)
    642     df = tp.read()

.../anaconda/lib/python3.6/site-packages/pandas/io/html.py in _expand_elements(body)
    621     empty = ['']
    622     for ind, length in iteritems(not_max):
--> 623         body[ind] += empty * (lens_max - length)
    624 
    625 

TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U424') dtype('<U424') dtype('<U424')
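One possible workaround (my own sketch, not a guaranteed fix for that page): feed each <table> to pd.read_html separately, so one ragged table that pandas cannot expand does not abort parsing of all the others. Note that pd.read_html needs lxml or html5lib installed as a parser backend.

```python
import io
import pandas as pd
from bs4 import BeautifulSoup

def tables_to_csv(html, prefix='myfile'):
    """Parse each <table> on its own; skip the ones pandas rejects."""
    frames = []
    soup = BeautifulSoup(html, 'html.parser')
    for i, table in enumerate(soup.find_all('table')):
        try:
            df = pd.read_html(io.StringIO(str(table)))[0]
        except Exception:
            continue  # a ragged or empty table: skip it instead of crashing
        df.to_csv('{}_{}.csv'.format(prefix, i), index=False)
        frames.append(df)
    return frames
```

Fetching the page and calling tables_to_csv(url_get.text) would then write one CSV per parseable table instead of failing outright.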