Python Forum
convert html table to json
#1
I am trying web scraping for the first time. I am able to fetch the HTML table, which looks like this:
<table>
  <tbody>
    <tr></tr>
    <tr>
      <th>1 abc</th>
      <td>good</td>
      <td><a href="/good">John (Nick)</a></td>
      <td>Lincoln</td>
    </tr>
    <tr>
      <th>20 xyz</th>
      <td>bad</td>
      <td><a href="/bad">Emma</a></td>
      <td>Smith</td>
    </tr>
    <tr></tr>
    ...
  </tbody>
</table>
I have omitted thead for brevity. I just want to convert the rows to JSON like this:
{
  "collections": [
    {
      "size": "1", # from 1 abc
      "identity": "abc", # from 1 abc
      "name": "John (Nick)" # from <a href="/good">John (Nick)</a>
    },
    ...
  ]
}
I have followed [this](https://stackoverflow.com/questions/1854...le-to-json), but I'm having trouble producing JSON like the example above.

Please note:
There are empty tr elements inside the tbody, and the first cell in each tr is a th.
#2
In general, your steps could be something like this: 1) get the HTML source (already done); 2) parse the HTML document (take a look at the packages BeautifulSoup and lxml); 3) build a dict or a list of dicts; 4) convert the resulting Python object(s) to JSON, e.g. using json.dumps.

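from bs4 import BeautifulSoup

# astr holds the HTML source as a string
parsed = BeautifulSoup(astr, 'html.parser')
collections = []
for row in parsed.find_all('tr'):
    res = {}
    th = row.find('th')
    if th:
        # '1 abc' -> size '1', identity 'abc'
        size, identity = th.text.split()
        res.update({'size': size, 'identity': identity})
    td = row.find_all('td')
    if len(td) > 1:
        # the second td holds the link text
        res['name'] = td[1].text
    if res:
        # empty tr elements produce no dict and are skipped
        collections.append(res)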
Then serialize the result with `json.dumps()`:
import json
json.dumps(collections)
You can easily fit the snippet to your needs.
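For example, here is a rough sketch of that last step, wrapping the list under a "collections" key so that it matches the layout asked for in the question. It assumes the rows were already collected by the loop above; the sample values are copied from the table in the question.

```python
import json

# Sample rows, as the loop above would collect them for the table in the question.
collections = [
    {"size": "1", "identity": "abc", "name": "John (Nick)"},
    {"size": "20", "identity": "xyz", "name": "Emma"},
]

# Wrap the list under the "collections" key requested in the question.
# indent=2 only affects readability, not the data.
payload = json.dumps({"collections": collections}, indent=2)
print(payload)
```

If size should be a number rather than a string, convert it with int() before appending the dict.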
#3
That looks awesome!

I just tried but got the error:

Error:
for row in parsed.find_all('tr')
                                ^
SyntaxError: invalid syntax
Maybe I'm using parsed incorrectly. I have:

html_data = soup.find(id='collections')
parsed = BeautifulSoup(html_data) # this line I think is incorrect?
I also tried without the parsed line, calling find_all directly on html_data:
for row in html_data.find_all('tr')
Please guide me. Thanks a lot.
#4
The colon is missing at the end of the for line.
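A minimal sketch of the corrected loop, for reference. It also touches on the question about the parsed line: `find` returns a `Tag`, which already has `find_all`, so it should not need to go through `BeautifulSoup` a second time. The sample HTML and the `id` on the table are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page source; the id matches the one used in post #3.
html = """
<table id="collections">
  <tbody>
    <tr></tr>
    <tr><th>1 abc</th><td>good</td><td><a href="/good">John (Nick)</a></td><td>Lincoln</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
html_data = soup.find(id="collections")  # a Tag, no second parse needed

headers = []
for row in html_data.find_all("tr"):  # note the trailing colon
    th = row.find("th")
    if th:  # empty <tr></tr> rows have no <th>
        headers.append(th.text)

print(headers)
```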
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
#5
Ah, how silly of me. I wasn't able to see that. Thanks.
#6
By the way, I extended the code a bit, just for fun.

#!/usr/bin/env python3
"""
Specialized Html2Json converter.
Only works with the given table format.

By default, Html2Json reads from stdin.
If stdin is a terminal, pass the file name
with the -f option instead.
"""
 
import json
import sys
from argparse import ArgumentParser
from collections import deque
from pathlib import Path
from typing import Union
 
from bs4 import BeautifulSoup
 
 
example_html = """<table>
 <tbody>
   <tr></tr>
   <tr>
     <th>1 abc</th>
     <td>good</td>
     <td><a href="/good">John (Nick)</a></td>
     <td>Lincoln</td>
   </tr>
   <tr>
       <th>20 xyz</th>
       <td>bad</td>
       <td><a href="/bad">Emma</a></td>
       <td>Smith</td>
     </tr>
     <tr></tr>
     ...
 </tbody>
</table>"""
 
 
def html2json(html: str, debug: bool = False) -> str:
    """
    Convert the html table to a json str.
    """
    collections = {'collections': []}
    # result refers to the same list object
    # that is stored inside the dict
    result = collections['collections']
    fields = 'size identity state name nick'.split()
    skip_fields = 'state nick'.split()
    bs = BeautifulSoup(html, features='html.parser')
    for tr in bs.find_all('tr'):
        # take only the th/td cells, not the whitespace between them
        state = deque(
            cell.text for cell in tr.find_all(['th', 'td'], recursive=False)
        )
        if not state:
            # empty <tr></tr>
            continue
        if debug:
            print(state, file=sys.stderr)
        size, identity = state.popleft().strip().split()
        size = int(size)
        # extendleft inserts in reverse order,
        # so identity goes in first to end up second
        state.extendleft((identity, size))
        dataset = {
            field: value for (field, value) in
            zip(fields, state)
            if field not in skip_fields
        }
        result.append(dataset)
    return json.dumps(collections, indent=4)
 
 
def main(
    inputfile: Union[None, Path],
    debug: bool, *,
    example: bool
    ) -> str:
    """
    Convert the html inputfile to a json string
    and return it.

    If sys.stdin is a pipe, it is preferred as the source.
    """
    if example:
        return html2json(example_html, debug)
    if inputfile and sys.stdin.isatty():
        html_source = inputfile.read_text()
    elif not inputfile and not sys.stdin.isatty():
        html_source = sys.stdin.read()
        # reads until the pipe is closed by
        # the previous process: cat for example
    else:
        # either both a file and a pipe were given, or neither;
        # stdin is preferred in both cases
        html_source = sys.stdin.read()
    return html2json(html_source, debug)
 
 
if __name__ == '__main__':
    parser = ArgumentParser(description=__doc__)
    parser.add_argument('-f', dest='inputfile', default=None, type=Path, help='A path to the inputfile, if stdin is not used.')
    parser.add_argument('-d', dest='debug', action='store_true', help='Debug')
    parser.add_argument('-e', dest='example', action='store_true', help='Example with example data')
    args = parser.parse_args()
    json_str = main(**vars(args))
    print(json_str)
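One detail worth checking when adapting the script: html2json uses `deque.extendleft` to put the parsed size and identity back in front of the remaining cells. `extendleft` pushes the items onto the left one at a time, so they end up in reverse order. A small sketch of that behaviour, with values taken from the example table:

```python
from collections import deque

state = deque(["good", "John (Nick)", "Lincoln"])

# extendleft pushes each item onto the left in turn, so the last item
# of the iterable ends up first in the deque.
state.extendleft((1, "abc"))
print(list(state))

# To end up with size first and identity second, pass them reversed.
ordered = deque(["good", "John (Nick)", "Lincoln"])
ordered.extendleft(("abc", 1))
print(list(ordered))
```

Since the fields are matched to the values by position with zip, the order of the deque decides which value lands under "size" and which under "identity".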