Posts: 5
Threads: 1
Joined: Jul 2017
I am trying to select just the "One rate" text and the data values from an HTML string similar to the one below.
<tr class="bluebar">
<th headers="abcd"><p class="sub3">One rate</p></th>
<td><span class="datavalue">5</span></td>
<td><span class="datavalue">6</span></td>
<td><span class="datavalue">7</span></td>
<td><span class="datavalue">8</span></td>
<td><span class="datavalue">9</span></td>
</tr>
I used BeautifulSoup (the soup is in the variable page_soup):
Rates = page_soup.findAll("tr",{"class":"bluebar"})
# from the results, selected the element at index 3 (which gave me the above string)
my_rate = Rates[3]
my_rate_txt = my_rate.p
my_rate_values = my_rate.findAll("td")
my_rate.text
Out[240]: '\nOne rate\n5\n6\n7\n8\n9\n'
Is there an easier way to select/print just the "One rate" and the data values 5, 6, 7, 8, 9?
Thank you
Posts: 7,319
Threads: 123
Joined: Sep 2016
from bs4 import BeautifulSoup
html = '''\
<tr class="bluebar">
<th headers="abcd">
<p class="sub3">One rate</p>
</th>
<td><span class="datavalue">5</span></td>
<td><span class="datavalue">6</span></td>
<td><span class="datavalue">7</span></td>
<td><span class="datavalue">8</span></td>
<td><span class="datavalue">9</span></td>
</tr>'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.sub3')[0].text)
print([int(n.text) for n in soup.select('.datavalue')])
Output:
One rate
[5, 6, 7, 8, 9]
This uses CSS selectors; for more about this, look at Web-Scraping part-1.
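On the full page there are several bluebar rows, so the same idea can be scoped to the one row wanted. A minimal sketch (untested against the real page), assuming the page is already parsed into page_soup as in the first post:
# Sketch: find the bluebar row whose label is "One rate",
# then pull its data values, without relying on a fixed index.
for row in page_soup.select('tr.bluebar'):
    label = row.select_one('p.sub3')
    if label is not None and label.text == 'One rate':
        print(label.text)
        print([int(n.text) for n in row.select('.datavalue')])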
Posts: 5
Threads: 1
Joined: Jul 2017
Finally, I decided to export the entire table to CSV, and this worked.
But when I open the export.csv file, some lines have a b prefix (they look like bytes literals).
Even when I remove the encode('utf8'), I still get the b prefixes.
How can I remove these b prefixes and get a clean CSV file?
table = page_soup.find("table", {"id": "xyz"})
for row in table.findAll("tr"):
    cells = row.findAll("td")
headers = [header.text for header in table.find_all('th')]
rows = []
for row in table.find_all('tr'):
    rows.append([val.text.encode('utf8') for val in row.find_all('td')])
with open('export.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(row for row in rows if row)
Posts: 7,319
Threads: 123
Joined: Sep 2016
How are you reading this in?
If you read it in with Requests (which gives back the site encoding) and BeautifulSoup on Python 3, it will be Unicode.
Encoding:
Quote:Any HTML or XML document is written in a specific encoding like ASCII or UTF-8.
But when you load that document into Beautiful Soup, you’ll discover it’s been converted to Unicode
If you use encode('utf8') (Unicode --> bytes), then it's going the wrong way.
>>> s = 'hello'
>>> s
'hello'
>>> s.encode('utf-8')
b'hello'
>>> type(s.encode('utf-8'))
<class 'bytes'>
An example where there is no encoding before I write to CSV.
Then tell it to use utf-8 when writing out.
from bs4 import BeautifulSoup
import requests
import csv
url = 'https://www.python.org/'
url_get = requests.get(url)
print(url_get.encoding) #--> utf-8
soup = BeautifulSoup(url_get.content, 'lxml')
title_tag = soup.select('head > title')
with open('some.csv', 'w', encoding='utf-8') as f:
writer = csv.writer(f)
writer.writerow(title_tag) #--> <title>Welcome to Python.org</title>
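If bytes have already been produced (e.g. from a raw read()), decode() is the reverse operation. A quick illustration, with a made-up value just for demonstration:
# Hypothetical bytes value, for illustration only
raw = b'Caf\xc3\xa9'          # UTF-8 encoded bytes
text = raw.decode('utf-8')    # bytes --> str ('Café')
print(type(text), text)       # --> <class 'str'> Café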
Posts: 5
Threads: 1
Joined: Jul 2017
Jul-24-2017, 08:42 AM
(This post was last modified: Jul-24-2017, 09:00 AM by chadonline.)
I tried it, but it doesn't work; the results still have the b prefixes:
Category,June2016,Apr.2017,May2017,June2017,Change from:May2017-June2017,Estatus,CN pop,Clf,Prate,Em,Ep ratio,Unem,Un rate
b''
"b'253,397'","b'254,588'","b'254,767'","b'254,957'",b'190'
"b'158,889'","b'160,213'","b'159,784'","b'160,145'",b'361'
Here is the entire code:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv
my_url = 'http://www.igobychad.com/test_table.html'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
page_soup.find("table", { "id" : "Emp_sum" })
table = page_soup.find("table", { "id" : "Emp_sum" })
for row in table.findAll("tr"):
    cells = row.findAll("td")
headers = [header.text for header in table.find_all('th')]
rows = []
for row in table.find_all('tr'):
    rows.append([val.text.encode('utf8') for val in row.find_all('td')])
with open('employment_situation.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(row for row in rows if row)
OK, I found it: I changed val.text.encode('utf8') to val.text.
But the results are still not formatted like the HTML table on the page. Any idea how to fix that?
Posts: 7,319
Threads: 123
Joined: Sep 2016
Jul-24-2017, 09:24 AM
(This post was last modified: Jul-24-2017, 09:24 AM by snippsat.)
page_html = uClient.read()
Here you read it into Python 3 without decoding, so it will be bytes.
There is no need to use read() at all; let BeautifulSoup convert to Unicode.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv
my_url = 'http://www.igobychad.com/test_table.html'
page_html = uReq(my_url)
page_soup = soup(page_html, "html.parser")
If you look at page_soup, you see that it's correct, no bytes.
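A quick way to confirm, assuming the page_soup from above:
# The soup's text is already str (Unicode), not bytes
print(type(page_soup.text))  # --> <class 'str'>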
Here is the advisable way to read it in.
from bs4 import BeautifulSoup
import requests
import csv
url = 'http://www.igobychad.com/test_table.html'
url_get = requests.get(url)
page_soup = BeautifulSoup(url_get.content, 'lxml')
So this uses Requests and lxml as the parser:
pip install requests lxml
This line does nothing:
page_soup.find("table", { "id" : "Emp_sum" })
You have to store the result in a variable before continuing:
page_soup = page_soup.find("table", { "id" : "Emp_sum" })
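Putting the pieces together, a minimal sketch of the corrected flow (untested against the live page), assuming the table id is still Emp_sum, no encode() call, and the find() result stored before use:
from bs4 import BeautifulSoup
import requests
import csv

url = 'http://www.igobychad.com/test_table.html'
url_get = requests.get(url)
page_soup = BeautifulSoup(url_get.content, 'lxml')
table = page_soup.find("table", {"id": "Emp_sum"})  # store the result
headers = [header.text for header in table.find_all('th')]
rows = [[val.text for val in row.find_all('td')] for row in table.find_all('tr')]
with open('employment_situation.csv', 'w', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(row for row in rows if row)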
Posts: 5
Threads: 1
Joined: Jul 2017
Yes, I am with you on that. Thank you.
Now the CSV formatting doesn't match the table on the page. How can I fix that?
Posts: 7,319
Threads: 123
Joined: Sep 2016
Jul-24-2017, 12:08 PM
(This post was last modified: Jul-24-2017, 12:08 PM by snippsat.)
Write the fix yourself, or use one of the several Python packages that convert an HTML table to CSV.
The easy way is to use Pandas.
import pandas as pd
url = 'http://www.igobychad.com/test_table.html'
for i, df in enumerate(pd.read_html(url)):
    df.to_csv('myfile_{}.csv'.format(i))
So it will parse all tables on the page; here it's only one.
Here is a Notebook showing how it looks.
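If only that one table is wanted, and the extra index column should be dropped, read_html also takes an attrs filter. A variation, assuming the table id Emp_sum from the earlier posts:
import pandas as pd

url = 'http://www.igobychad.com/test_table.html'
# attrs filters to tables whose HTML attributes match (id assumed from earlier posts)
df = pd.read_html(url, attrs={'id': 'Emp_sum'})[0]
df.to_csv('employment_situation.csv', index=False)  # index=False drops the row index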
Posts: 5
Threads: 1
Joined: Jul 2017
That looks pretty good, but why is it not portable? For example, if I just change the URL to any other HTML page with a table, why does it give many errors like this one:
TypeError Traceback (most recent call last)
<ipython-input-26-2fbd83981590> in <module>()
----> 3 for i, df in enumerate(pd.read_html(url)):
.../anaconda/lib/python3.6/site-packages/pandas/io/html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, tupleize_cols, thousands, encoding, decimal, converters, na_values, keep_default_na)
904 thousands=thousands, attrs=attrs, encoding=encoding,
905 decimal=decimal, converters=converters, na_values=na_values,
--> 906 keep_default_na=keep_default_na)
.../anaconda/lib/python3.6/site-packages/pandas/io/html.py in _parse(flavor, io, match, attrs, encoding, **kwargs)
746 for table in tables:
747 try:
--> 748 ret.append(_data_to_frame(data=table, **kwargs))
749 except EmptyDataError: # empty table
750 continue
.../anaconda/lib/python3.6/site-packages/pandas/io/html.py in _data_to_frame(**kwargs)
638
639 # fill out elements of body that are "ragged"
--> 640 _expand_elements(body)
641 tp = TextParser(body, header=header, **kwargs)
642 df = tp.read()
.../anaconda/lib/python3.6/site-packages/pandas/io/html.py in _expand_elements(body)
621 empty = ['']
622 for ind, length in iteritems(not_max):
--> 623 body[ind] += empty * (lens_max - length)
624
625
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U424') dtype('<U424') dtype('<U424')