 convert html table to json
#1
I am trying web scraping for the first time. I am able to get the HTML table, which is defined like this:
<table>
  <tbody>
    <tr></tr>
    <tr>
      <th>1 abc</th>
      <td>good</td>
      <td><a href="/good">John (Nick)</a></td>
      <td>Lincoln</td>
    </tr>
    <tr>
        <th>20 xyz</th>
        <td>bad</td>
        <td><a href="/bad">Emma</a></td>
        <td>Smith</td>
      </tr>
      <tr></tr>
      ...
  </tbody>
</table>
I have omitted thead for brevity. I just want to convert the rows to JSON like this:
{
  "collections": [
    {
      "size": "1", # from 1 abc
      "identity": "abc", # from 1 abc
      "name": "John (Nick)" # from <a href="/good">John (Nick)</a>
    },
    ...
  ]
}
I have followed [this](https://stackoverflow.com/questions/1854...le-to-json), but I'm having trouble producing JSON like the example above.

Please note:
There are empty tr elements inside the tbody, and the first element in each non-empty tr is a th.
#2
In general, your steps could be something like these: 1) get the HTML source (already done); 2) parse the HTML document (take a look at the packages BeautifulSoup and lxml); 3) build a dict or a list of dicts; 4) convert the resulting Python object(s) to JSON, e.g. using json.dumps.

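from bs4 import BeautifulSoup

# astr is the HTML source as a string
parsed = BeautifulSoup(astr, 'html.parser')
collections = list()
for row in parsed.find_all('tr'):
    res = dict()
    th = row.find('th')
    if th:
        # "1 abc" -> size "1", identity "abc"
        a, b = th.text.split()
        res.update({'size': a, 'identity': b})
    td = row.find_all('td')
    if len(td) > 1:
        # the second td holds the link with the name
        res.update({'name': td[1].text})
    if res:
        collections.append(res)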
Then use `json.dumps()` to convert the result to a JSON string:
import json
json.dumps({'collections': collections})
You can easily fit the snippet to your needs.
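As a minimal sketch, the four steps can be put together into one self-contained script. The function name `rows_to_collections` and the shortened sample table are made up for illustration, and `html.parser` is assumed as the parser so that nothing beyond BeautifulSoup has to be installed.

```python
import json

from bs4 import BeautifulSoup

# Shortened copy of the table from the question (illustrative data).
SAMPLE_HTML = """
<table>
  <tbody>
    <tr></tr>
    <tr>
      <th>1 abc</th>
      <td>good</td>
      <td><a href="/good">John (Nick)</a></td>
      <td>Lincoln</td>
    </tr>
    <tr>
      <th>20 xyz</th>
      <td>bad</td>
      <td><a href="/bad">Emma</a></td>
      <td>Smith</td>
    </tr>
    <tr></tr>
  </tbody>
</table>
"""


def rows_to_collections(html):
    """Return a list of dicts, one per non-empty table row."""
    parsed = BeautifulSoup(html, 'html.parser')
    collections = []
    for row in parsed.find_all('tr'):
        th = row.find('th')
        tds = row.find_all('td')
        if th is None or len(tds) < 2:
            continue  # skips the empty <tr></tr> rows
        size, identity = th.get_text(strip=True).split()
        collections.append({
            'size': size,
            'identity': identity,
            'name': tds[1].get_text(strip=True),
        })
    return collections


if __name__ == '__main__':
    print(json.dumps({'collections': rows_to_collections(SAMPLE_HTML)}, indent=2))
```

Rows without a th or with fewer than two td cells are skipped, which covers the empty tr elements mentioned in the question.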
#3
That looks awesome!

I just tried but got the error:

Error:
    for row in parsed.find_all('tr')
                                    ^
SyntaxError: invalid syntax
Maybe I'm using parsed incorrectly. I have:

html_data = soup.find(id='collections')
parsed = BeautifulSoup(html_data) # this line I think is incorrect?
I also tried without the parsed line, calling find_all directly on html_data:
for row in html_data.find_all('tr')
Please guide me. Thanks a lot.
#4
The colon is missing.
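For reference, a minimal sketch of the corrected loop. It assumes `soup` is the BeautifulSoup object for the whole page and that the table carries `id="collections"`, as in the snippet from post #3; the tiny HTML string is only stand-in data. `soup.find()` already returns a Tag, which has its own `find_all()`, so it does not need to be passed to `BeautifulSoup` again.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page (illustrative data only).
page = '<table id="collections"><tr><th>1 abc</th><td>good</td></tr></table>'
soup = BeautifulSoup(page, 'html.parser')

html_data = soup.find(id='collections')  # a Tag, already parsed

rows = []
for row in html_data.find_all('tr'):  # note the colon ending the for statement
    rows.append(row.th.text)

print(rows)
```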
My code examples are always for Python >=3.6.0
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
#5
Ah, how silly of me. I wasn't able to see that. Thanks.
#6
By the way, I extended the code a bit, just for fun.

#!/usr/bin/env python3
"""
Specialized Html2Json converter.
It only works with the given format.

Html2Json reads from stdin by default.
If stdin is a terminal, the filename is
given with the -f option.
"""
 
import json
import sys
from argparse import ArgumentParser
from collections import deque
from pathlib import Path
from typing import Union
 
from bs4 import BeautifulSoup
 
 
example_html = """<table>
 <tbody>
   <tr></tr>
   <tr>
     <th>1 abc</th>
     <td>good</td>
     <td><a href="/good">John (Nick)</a></td>
     <td>Lincoln</td>
   </tr>
   <tr>
       <th>20 xyz</th>
       <td>bad</td>
       <td><a href="/bad">Emma</a></td>
       <td>Smith</td>
     </tr>
     <tr></tr>
     ...
 </tbody>
</table>"""
 
 
def html2json(html: str, debug: bool = False):
    """
    Converts html to a json str.
    """
    collections = {'collections': []}
    result = collections['collections']
    # result refers to the same list object
    # that is stored inside the dict
    fields = 'size identity state name nick'.split()
    skip_fields = 'state nick'.split()
    bs = BeautifulSoup(html, features='html.parser')
    for tr in bs.find_all('tr'):
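        state = deque()
        # Collect only the th/td cells; iterating over tr.children
        # can also yield the whitespace strings between the tags.
        for th_td in tr.find_all(['th', 'td']):
            state.append(th_td.get_text(strip=True))
        if not state:
            continue
        if debug:
            print(state, file=sys.stderr)
        size, identity = state.popleft().split()
        size = int(size)
        # extendleft() prepends items one at a time, which reverses them,
        # so pass them reversed to end up with size, identity, ...
        state.extendleft((identity, size))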
        dataset = {
            field: value for (field, value) in
            zip(fields, state)
            if field not in skip_fields
        }
        result.append(dataset)
    return json.dumps(collections, indent=4)
 
 
def main(
    inputfile: Union[None, Path],
    debug: bool, *,
    example: bool
    ) -> str:
    """
    Convert the html inputfile to a json string
    and return it.

    If sys.stdin is a pipe, it is preferred as the source.
    """
    if example:
        return html2json(example_html, debug)
    if inputfile and sys.stdin.isatty():
        html_source = inputfile.read_text()
    elif not inputfile and not sys.stdin.isatty():
        html_source = sys.stdin.read()
        # reads until the pipe is closed by
        # the previous process: cat for example
    else:
        # either both a file and a pipe were given, or neither;
        # in both cases stdin is preferred
        html_source = sys.stdin.read()
    return html2json(html_source, debug)
 
 
if __name__ == '__main__':
    parser = ArgumentParser(description=__doc__)
    parser.add_argument('-f', dest='inputfile', default=None, type=Path, help='A path to the inputfile, if stdin is not used.')
    parser.add_argument('-d', dest='debug', action='store_true', help='Debug')
    parser.add_argument('-e', dest='example', action='store_true', help='Example with example data')
    args = parser.parse_args()
    json_str = main(**vars(args))
    print(json_str)
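One detail worth checking in `html2json` is the order in which `deque.extendleft()` puts values back: it prepends its items one at a time, so they end up reversed relative to the argument. A small standard-library sketch with made-up values shows the effect and one way to keep the intended size, identity order:

```python
from collections import deque

state = deque(['good', 'John (Nick)', 'Lincoln'])
size, identity = 1, 'abc'

# Passing (size, identity) puts identity in front of size.
reversed_order = deque(state)
reversed_order.extendleft((size, identity))

# Passing (identity, size) leaves size first, then identity.
kept_order = deque(state)
kept_order.extendleft((identity, size))

print(list(reversed_order))
print(list(kept_order))
```

Because the field names are matched to the values by position with `zip`, the order at the front of the deque decides which value lands under `size` and which under `identity`.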