convert html table to json

bhojendra · (This post was last modified: Jul-28-2019, 06:40 PM by bhojendra.)

I am trying web scraping for the first time. I am able to get the HTML table, which looks like this:

<table>
  <tbody>
    <tr></tr>
    <tr>
      <th>1 abc</th>
      <td>good</td>
      <td><a href="/good">John (Nick)</a></td>
      <td>Lincoln</td>
    </tr>
    <tr>
      <th>20 xyz</th>
      <td>bad</td>
      <td><a href="/bad">Emma</a></td>
      <td>Smith</td>
    </tr>
    <tr></tr>
      ...
  </tbody>
</table>

I have omitted thead for brevity. I just want to convert the rows to JSON like this:

{
  "collections": [
    {
      "size": "1", # from 1 abc
      "identity": "abc", # from 1 abc
      "name": "John (Nick)" # from <a href="/good">John (Nick)</a>
    },
    ...
  ]
}

I have followed [this](https://stackoverflow.com/questions/1854...le-to-json), but I'm having trouble producing JSON like in the example above.

Please note: there are empty tr elements inside the tbody, and the first tag in each non-empty tr is a th.

**scidam** · Jul-29-2019, 12:41 AM

In general, your steps could be something like these: 1) getting html-source (already done); 2) parsing html document (take a look at packages: BeautifulSoup, lxml); 3) forming a dict or a list of dicts; 4) converting obtained python object(s) to json, e.g. using json.dumps.

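from bs4 import BeautifulSoup

# astr holds the HTML source as a string
parsed = BeautifulSoup(astr, 'html.parser')
collections = list()
for row in parsed.find_all('tr'):
    res = dict()
    th = row.find('th')
    if th:
        # '1 abc' -> size '1', identity 'abc'
        a, b = th.text.split()
        res.update({'size': a, 'identity': b})
    td = row.find_all('td')
    if td:
        # the second <td> holds the link with the name
        res.update({'name': td[1].text})
    if res:
        # empty <tr></tr> rows produce an empty dict and are skipped
        collections.append(res)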

Then convert the resulting list to JSON with `json.dumps()`:
import json
json.dumps(collections)

You can easily fit the snippet to your needs.
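
To get exactly the shape asked for in the question, the list can be wrapped in a dict under a `"collections"` key before dumping. The following is only a sketch run against the sample markup from the first post; the helper name `table_to_json` is made up for illustration, and `html.parser` is just one possible parser choice.

```python
import json

from bs4 import BeautifulSoup

# Sample markup from the question, trimmed to the two data rows.
html = """
<table>
  <tbody>
    <tr></tr>
    <tr><th>1 abc</th><td>good</td><td><a href="/good">John (Nick)</a></td><td>Lincoln</td></tr>
    <tr><th>20 xyz</th><td>bad</td><td><a href="/bad">Emma</a></td><td>Smith</td></tr>
    <tr></tr>
  </tbody>
</table>
"""


def table_to_json(html):  # hypothetical helper name
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        th = tr.find("th")
        cells = tr.find_all("td")
        if th is None or len(cells) < 2:
            continue  # skip the empty <tr></tr> rows
        # "1 abc" -> size "1", identity "abc"
        size, identity = th.get_text(strip=True).split(maxsplit=1)
        rows.append({
            "size": size,
            "identity": identity,
            "name": cells[1].get_text(strip=True),
        })
    # wrap the list so the output matches {"collections": [...]}
    return json.dumps({"collections": rows}, indent=2)


print(table_to_json(html))
```

Note that `size` stays a string here, as in the question's example; wrap it in `int()` if you need a number.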

bhojendra · Jul-29-2019, 04:03 PM

That looks awesome!

I just tried but got the error:

Error:
for row in parsed.find_all('tr')
                                ^
SyntaxError: invalid syntax

Maybe I'm using parsed incorrectly. I have:

html_data = soup.find(id='collections')
parsed = BeautifulSoup(html_data) # this line I think is incorrect?

I also tried simply without the line of parsed:

for row in html_data.find_all('tr')

Please guide me. Thanks a lot.

DeaD_EyE · Jul-29-2019, 09:18 PM

The colon is missing at the end of the `for` line.
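
For reference, a sketch of the corrected loop. As far as I can tell, `soup.find(id='collections')` already returns a parsed tag, so it should not need to go through `BeautifulSoup()` a second time; you can call `find_all` on it directly. The sample markup and the `id` on the table are assumptions based on the snippets above.

```python
from bs4 import BeautifulSoup

# Assumed markup: the table from the first post, with id="collections".
html = """<table id="collections"><tbody>
<tr></tr>
<tr><th>1 abc</th><td>good</td><td><a href="/good">John (Nick)</a></td><td>Lincoln</td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")
html_data = soup.find(id="collections")  # already a Tag, no re-parsing needed

names = []
for row in html_data.find_all("tr"):  # note the colon at the end of this line
    cells = row.find_all("td")
    if cells:  # the empty <tr></tr> has no cells
        names.append(cells[1].text)

print(names)
```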

bhojendra · Jul-30-2019, 05:26 AM

Ah, how silly of me. I wasn't able to see that. Thanks.

DeaD_EyE · Jul-30-2019, 07:53 PM

By the way, I extended the code a bit, just for fun.

#!/usr/bin/env python3
"""
Specialized Html2Json converter.
Works only with the given format.

Html2Json reads from stdin by default.
If a terminal is detected, the filename must be
given with the -f option.
"""
 
import json
import sys
from argparse import ArgumentParser
from collections import deque
from pathlib import Path
from typing import Union
 
from bs4 import BeautifulSoup
 
 
example_html = """<table>
 <tbody>
   <tr></tr>
   <tr>
     <th>1 abc</th>
     <td>good</td>
     <td><a href="/good">John (Nick)</a></td>
     <td>Lincoln</td>
   </tr>
   <tr>
     <th>20 xyz</th>
     <td>bad</td>
     <td><a href="/bad">Emma</a></td>
     <td>Smith</td>
   </tr>
   <tr></tr>
     ...
 </tbody>
</table>"""
 
 
def html2json(html: str, debug: bool = False):
    """
   Converts html to a json str.
   """
    collections = {'collections': []}
    result = collections['collections']
    # result refers to the same list object
    # that is stored inside the dict
    fields = 'size identity state name nick'.split()
    skip_fields = 'state nick'.split()
    bs = BeautifulSoup(html, features='html.parser')
    for tr in bs.find_all('tr'):
        state = deque()
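        # take only the direct th/td cells, so whitespace text nodes
        # between the tags never end up in state
        for th_td in tr.find_all(['th', 'td'], recursive=False):
            state.append(th_td.text)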
        if not state:
            continue
        if debug:
            print(state, file=sys.stderr)
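        size, identity = state.popleft().strip().split()
        size = int(size)
        # extendleft() inserts its items in reverse order, so pass
        # identity first to end up with [size, identity, ...]
        state.extendleft((identity, size))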
        dataset = {
            field: value for (field, value) in
            zip(fields, state)
            if field not in skip_fields
        }
        result.append(dataset)
    return json.dumps(collections, indent=4)
 
 
def main(
    inputfile: Union[None, Path],
    debug: bool, *,
    example: bool
    ) -> str:
    """
   Convert the html inputfile to a json string
   and return it.
 
   If sys.stdin is a pipe, it is preferred as the source.
   """
    if example:
        return html2json(example_html, debug)
    if inputfile and sys.stdin.isatty():
        html_source = inputfile.read_text()
    elif not inputfile and not sys.stdin.isatty():
        html_source = sys.stdin.read()
        # reads until the pipe is closed by
        # the previous process: cat for example
    else:
        # should not normally happen;
        # in this case stdin is preferred
        html_source = sys.stdin.read()
    return html2json(html_source, debug)
 
 
if __name__ == '__main__':
    parser = ArgumentParser(description=__doc__)
    parser.add_argument('-f', dest='inputfile', default=None, type=Path, help='A path to the inputfile, if stdin is not used.')
    parser.add_argument('-d', dest='debug', action='store_true', help='Debug')
    parser.add_argument('-e', dest='example', action='store_true', help='Example with example data')
    args = parser.parse_args()
    json_str = main(**vars(args))
    print(json_str)
