help with selecting from html - Python Forum (https://python-forum.io), Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html), Thread: /thread-3996.html

help with selecting from html - chadonline - Jul-14-2017

I am trying to select just the "One rate" label and the data values from an HTML string similar to the one below.

    <tr class="bluebar">
    <th headers="abcd"><p class="sub3">One rate</p></th>
    <td><span class="datavalue">5</span></td>
    <td><span class="datavalue">6</span></td>
    <td><span class="datavalue">7</span></td>
    <td><span class="datavalue">8</span></td>
    <td><span class="datavalue">9</span></td>
    </tr>

I used BeautifulSoup (the variable page_soup):

    Rates = page_soup.findAll("tr", {"class": "bluebar"})
    # from the results, selected Rates[3] (which gave me the string above)
    my_rate = Rates[3]
    my_rate_txt = my_rate.p
    my_rate_values = my_rate.findAll("td")
    my_rate.text
    Out[240]: '\nOne rate\n5\n6\n7\n8\n9\n'

Is there an easier way to select/print just "One rate" and the data values 5, 6, 7, 8, 9? Thank you.

RE: help with selecting from html - snippsat - Jul-15-2017

    from bs4 import BeautifulSoup

    html = '''\
    <tr class="bluebar">
      <th headers="abcd">
        <p class="sub3">One rate</p>
      </th>
      <td><span class="datavalue">5</span></td>
      <td><span class="datavalue">6</span></td>
      <td><span class="datavalue">7</span></td>
      <td><span class="datavalue">8</span></td>
      <td><span class="datavalue">9</span></td>
    </tr>'''

    soup = BeautifulSoup(html, 'lxml')
    # The label is the text of the first element with class "sub3"
    print(soup.select('.sub3')[0].text)
    # The values are the texts of all elements with class "datavalue"
    print([int(n.text) for n in soup.select('.datavalue')])

This uses CSS selectors; for more about them, look at Web-Scraping part-1.
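
Another way to get the same two pieces is to walk the text nodes of the row instead of selecting by class. This is only a minimal sketch, assuming the row looks like the sample in the question and that the label always comes first; it uses the built-in "html.parser" so lxml is not required, and the names `row`, `parts`, `label` and `values` are just illustrative.

```python
from bs4 import BeautifulSoup

html = """<tr class="bluebar">
<th headers="abcd"><p class="sub3">One rate</p></th>
<td><span class="datavalue">5</span></td>
<td><span class="datavalue">6</span></td>
<td><span class="datavalue">7</span></td>
<td><span class="datavalue">8</span></td>
<td><span class="datavalue">9</span></td>
</tr>"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr", class_="bluebar")

# stripped_strings yields every text node in the row with surrounding
# whitespace removed, skipping nodes that are whitespace only
parts = list(row.stripped_strings)

# Assumption: the first string is the label, the rest are integer values
label = parts[0]
values = [int(p) for p in parts[1:]]

print(label)
print(values)
```

The CSS selector version is more robust when the row contains other text; this one is shorter when the row holds nothing but the label and the values.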

RE: help with selecting from html - chadonline - Jul-23-2017

Finally I decided to export the entire table to CSV, and this worked. But when I open the export.csv file, some lines have a "b" flag (they look like binary flags). Even when I remove the encode('utf8'), I still get the b flags. How can I remove these b flags and have a clean CSV file?

    table = page_soup.find("table", {"id": "xyz"})
    for row in table.findAll("tr"):
        cells = row.findAll("td")
    headers = [header.text for header in table.find_all('th')]
    rows = []
    for row in table.find_all('tr'):
        rows.append([val.text.encode('utf8') for val in row.find_all('td')])
    with open('export.csv', 'w') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(row for row in rows if row)

RE: help with selecting from html - snippsat - Jul-23-2017

How are you reading this in? If you read it in with Requests (which gives back the site's encoding) and BeautifulSoup on Python 3, the text will be Unicode.

Encoding:
Quote: Any HTML or XML document is written in a specific encoding like ASCII or UTF-8.

Using encode('utf8') converts from Unicode to bytes, which is the wrong direction here.

    >>> s = 'hello'
    >>> s
    'hello'
    >>> s.encode('utf-8')
    b'hello'
    >>> type(s.encode('utf-8'))
    <class 'bytes'>

Here is an example with no encoding step before writing to CSV. Instead, tell open() to use utf-8 when writing out.

    from bs4 import BeautifulSoup
    import requests
    import csv

    url = 'https://www.python.org/'
    url_get = requests.get(url)
    print(url_get.encoding) #--> utf-8
    soup = BeautifulSoup(url_get.content, 'lxml')
    title_tag = soup.select('head > title')
    with open('some.csv', 'w', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(title_tag) #--> <title>Welcome to Python.org</title>

RE: help with selecting from html - chadonline - Jul-24-2017

I tried, but it doesn't work...
The results still have binary flags:

    Category,June2016,Apr.2017,May2017,June2017,Change from:May2017-June2017,Estatus,CN pop,Clf,Prate,Em,Ep ratio,Unem,Un rate
    b''
    "b'253,397'","b'254,588'","b'254,767'","b'254,957'",b'190'
    "b'158,889'","b'160,213'","b'159,784'","b'160,145'",b'361'

Here is the entire code:

    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as soup
    import csv

    my_url = 'http://www.igobychad.com/test_table.html'
    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")
    page_soup.find("table", {"id": "Emp_sum"})
    table = page_soup.find("table", {"id": "Emp_sum"})
    for row in table.findAll("tr"):
        cells = row.findAll("td")
    headers = [header.text for header in table.find_all('th')]
    rows = []
    for row in table.find_all('tr'):
        rows.append([val.text.encode('utf8') for val in row.find_all('td')])
    with open('employment_situation.csv', 'w') as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(row for row in rows if row)

OK, I found it. I changed val.text.encode('utf8') to val.text, but the results are still not formatted like the HTML table on the page. Any idea how to fix that?

RE: help with selecting from html - snippsat - Jul-24-2017

    page_html = uClient.read()

Here you read into Python 3 without an encoding, so the result is bytes. There is no need to use read() at all; let BeautifulSoup convert to Unicode.

    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as soup
    import csv

    my_url = 'http://www.igobychad.com/test_table.html'
    page_html = uReq(my_url)
    page_soup = soup(page_html, "html.parser")

If you look at page_soup, you will see that it is correct, with no bytes. Here is the advisable way to read it in.

    from bs4 import BeautifulSoup
    import requests
    import csv

    url = 'http://www.igobychad.com/test_table.html'
    url_get = requests.get(url)
    page_soup = BeautifulSoup(url_get.content, 'lxml')

So this uses Requests, with lxml as the parser.

    pip install requests lxml

This line does nothing:

    page_soup.find("table", { "id" : "Emp_sum" })

You have to store the result in a variable before continuing.

    page_soup = page_soup.find("table", { "id" : "Emp_sum" })
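
Where the b'...' prefixes in the CSV come from can be reproduced without any scraping. A minimal sketch, assuming Python 3 and only the standard library: csv.writer turns every non-string cell into text with str(), and str() of a bytes object is its repr, b prefix and quotes included. The two cell values are copied from the output posted above; `write_rows` is a hypothetical helper used only for this illustration, and it writes to an in-memory buffer rather than a file.

```python
import csv
import io


def write_rows(rows):
    """Write rows to an in-memory CSV and return the resulting text."""
    buf = io.StringIO(newline="")
    csv.writer(buf).writerows(rows)
    return buf.getvalue()


cells = ["253,397", "190"]

# Encoded cells are bytes; csv.writer stringifies them as "b'...'"
as_bytes = write_rows([[c.encode("utf8") for c in cells]])

# Plain str cells are written as-is (quoted only because of the comma)
as_text = write_rows([cells])

print(as_bytes)
print(as_text)
```

The first print shows the same "b'253,397'" pattern as the broken file, the second the clean values. When writing to a real file, the encoding belongs on open(), for example open('export.csv', 'w', encoding='utf-8', newline=''), not on the individual cells.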

RE: help with selecting from html - chadonline - Jul-24-2017

Yes, I am with you on that. Thank you. Now the CSV formatting doesn't match the table on the page. How can I fix that?

RE: help with selecting from html - snippsat - Jul-24-2017

Write the fix yourself, or use one of the several Python packages for converting an HTML table to CSV. The easy way is to use Pandas.

    import pandas as pd

    url = 'http://www.igobychad.com/test_table.html'
    for i, df in enumerate(pd.read_html(url)):
        df.to_csv('myfile_{}.csv'.format(i))

This parses all tables on the page; here there is only one. Here is a Notebook showing how it looks.

RE: help with selecting from html - chadonline - Jul-25-2017

That looks pretty good, but why is it not portable? For example, if I just change the URL to any other HTML page with a table, why does it give many errors like this one:

    TypeError                                 Traceback (most recent call last)
    <ipython-input-26-2fbd83981590> in <module>()
    ----> 3 for i, df in enumerate(pd.read_html(url)):

    .../anaconda/lib/python3.6/site-packages/pandas/io/html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, tupleize_cols, thousands, encoding, decimal, converters, na_values, keep_default_na)
        904         thousands=thousands, attrs=attrs, encoding=encoding,
        905         decimal=decimal, converters=converters, na_values=na_values,
    --> 906         keep_default_na=keep_default_na)

    .../anaconda/lib/python3.6/site-packages/pandas/io/html.py in _parse(flavor, io, match, attrs, encoding, **kwargs)
        746     for table in tables:
        747         try:
    --> 748             ret.append(_data_to_frame(data=table, **kwargs))
        749         except EmptyDataError:  # empty table
        750             continue

    .../anaconda/lib/python3.6/site-packages/pandas/io/html.py in _data_to_frame(**kwargs)
        638
        639     # fill out elements of body that are "ragged"
    --> 640     _expand_elements(body)
        641     tp = TextParser(body, header=header, **kwargs)
        642     df = tp.read()

    .../anaconda/lib/python3.6/site-packages/pandas/io/html.py in _expand_elements(body)
        621     empty = ['']
        622     for ind, length in iteritems(not_max):
    --> 623         body[ind] += empty * (lens_max - length)
        624
        625

    TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U424') dtype('<U424') dtype('<U424')
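
The traceback ends in the pandas step that pads "ragged" rows, that is, rows with fewer cells than the widest row in the table. If the rows are collected by hand instead (for example with BeautifulSoup, as earlier in the thread), the same padding can be done before writing the CSV. This is only a minimal sketch under assumptions: rows are plain Python lists of strings, `pad_rows` is a hypothetical helper that is not part of pandas or BeautifulSoup, and the sample rows are placeholders. It does not explain the TypeError above; it only shows the padding idea on ordinary lists.

```python
import csv
import io


def pad_rows(rows, fill=""):
    """Return a copy of rows where every row is as wide as the longest one."""
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [fill] * (width - len(row)) for row in rows]


# Placeholder rows of unequal length, standing in for scraped table cells
rows = [["a", "b", "c"], ["1", "2"], ["3"]]
padded = pad_rows(rows)

# Write to an in-memory buffer; a real script would open a file with
# encoding='utf-8' and newline='' instead
buf = io.StringIO(newline="")
csv.writer(buf).writerows(padded)
csv_text = buf.getvalue()

print(padded)
print(csv_text)
```

Padding with empty strings keeps every CSV line at the same number of columns, which is usually what spreadsheet programs expect.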