Bottom Page

Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
 convert html table to json
#1
I am trying web scraping for the first time. I am able to get the html table which is defined like:
<table>
  <tbody>
    <tr></tr>
    <tr>
      <th>1 abc</th>
      <td>good</td>
      <td><a href="/good">John (Nick)</a></td>
      <td>Lincoln</td>
    </tr>
    <tr>
        <th>20 xyz</th>
        <td>bad</td>
        <td><a href="/bad">Emma</a></td>
        <td>Smith</td>
      </tr>
      <tr></tr>
      ...
  </tbody>
</table>
I have omitted thead for ease. I just want them to json like:
{
  "collections": [
    {
      "size": "1", # from 1 abc
      "identity": "abc", # from 1 abc
      "name": "John (Nick)" # from <a href="/good">John (Nick)</a>
    },
    ...
  ]
}
I have followed [this](https://stackoverflow.com/questions/1854...le-to-json). But I'm having trouble to out json like in the given example.

Please note:
There are empty tr elements between the tbody. And first tag in tr is th.
Quote
#2
In general, your steps could be something like these: 1) getting html-source (already done); 2) parsing html document (take a look at packages: BeautifulSoup, lxml); 3) forming a dict or a list of dicts; 4) converting obtained python object(s) to json, e.g. using json.dumps.

parsed = BeautifulSoup(astr)
collections = list()
for row in parsed.find_all('tr'):
    values = list()
    res = dict()
    th = row.find('th')
    if th:
        a, b = th.text.split()
        res.update({'size':a, 'identity':b})
    td = row.find_all('td')
    if td:
        res.update({'name': td[1].text})
    if res:
        collections.append(res)
and use `json.dumps()`
import json
json.dumps(collections)
You can easily fit the snippet to your needs.
Quote
#3
That looks awesome!

I just tried but got the error:

Error:
for row in parsed.find_all('tr') ^ SyntaxError: invalid syntax
Maybe I'm using parsed incorrectly. I have:

html_data = soup.find(id='collections')
parsed = BeautifulSoup(html_data) # this line I think is incorrect?
I also tried simply without the line of parsed:
for row in html_data.find_all('tr')
Please guide me. Thanks a lot.
Quote
#4
The colon is missing.
scidam and bhojendra like this post
My code examples are always for Python >=3.6.0
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
Quote
#5
Ah, how silly I'm. Was not able to see that. Thanks.
Quote
#6
By the way, I extended the code a bit, just for fun.

#!/usr/bin/env python3
"""
Specialized Html2Json converter.
Does only work with the given format.
 
Html2Json reads by default from stdin
If a terminal is detected, the filename be
the  first argument.
"""
 
import json
import sys
from argparse import ArgumentParser
from collections import deque
from pathlib import Path
from typing import Union
 
from bs4 import BeautifulSoup
 
 
example_html = """<table>
 <tbody>
   <tr></tr>
   <tr>
     <th>1 abc</th>
     <td>good</td>
     <td><a href="/good">John (Nick)</a></td>
     <td>Lincoln</td>
   </tr>
   <tr>
       <th>20 xyz</th>
       <td>bad</td>
       <td><a href="/bad">Emma</a></td>
       <td>Smith</td>
     </tr>
     <tr></tr>
     ...
 </tbody>
</table>"""
 
 
def html2json(html: str, debug: bool = False):
    """
   Converts html to a json str.
   """
    collections = {'collections': []}
    result = collections['collections']
    # result has the same reference of the list object
    # inside the list
    fields = 'size identity state name nick'.split()
    skip_fields = 'state nick'.split()
    bs = BeautifulSoup(html, features='html.parser')
    for tr in bs.find_all('tr'):
        state = deque()
        for th_td in tr.children:
            if hasattr(th_td, 'text'):
                state.append(th_td.text)
        if not state:
            continue
        if debug:
            print(state, file=sys.stderr)
        size, identity = state.popleft().strip().split()
        size = int(size)
        state.extendleft((size, identity))
        dataset = {
            field: value for (field, value) in
            zip(fields, state)
            if field not in skip_fields
        }
        result.append(dataset)
    return json.dumps(collections, indent=4)
 
 
def main(
    inputfile: Union[None, Path],
    debug: bool, *,
    example: bool
    ) -> str:
    """
   Convert the html inputfile to a json string
   and return it.
 
   If sys.stdin is a pipe, then prefering this as source.
   """
    if example:
        return html2json(example_html, debug)
    if inputfile and sys.stdin.isatty():
        html_source = inputfile.read_text()
    elif not inputfile and not sys.stdin.isatty():
        html_source = sys.stdin.read()
        # reads until the pipe is closed by
        # the previous process: cat for example
    else:
        # should be impossible
        # prefering in this case the stdin
        html_source = sys.stdin.read()
    return html2json(html_source, debug)
 
 
if __name__ == '__main__':
    parser = ArgumentParser(description=__doc__)
    parser.add_argument('-f', dest='inputfile', default=None, type=Path, help='A path to the inputfile, if stdin is not used.')
    parser.add_argument('-d', dest='debug', action='store_true', help='Debug')
    parser.add_argument('-e', dest='example', action='store_true', help='Example with example data')
    args = parser.parse_args()
    json_str = main(**vars(args))
    print(json_str)
My code examples are always for Python >=3.6.0
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
Quote

Top Page

Possibly Related Threads...
Thread Author Replies Views Last Post
  BeautifulSoup: Error while extracting a value from an HTML table kawasso 3 270 Aug-25-2019, 01:13 AM
Last Post: kawasso
  How to capture Single Column from Web Html Table? ahmedwaqas92 5 443 Jul-29-2019, 02:17 AM
Last Post: ahmedwaqas92
  sqlalchemy DataTables::"No data available in table" when using self-joined table Asma 0 596 Nov-22-2018, 02:46 PM
Last Post: Asma
  convert html to pdf in django site shahpy 4 2,867 Aug-17-2018, 11:10 AM
Last Post: Larz60+
  Unable to convert XML to JSON priyanka 1 1,131 Jun-29-2018, 09:23 AM
Last Post: snippsat

Forum Jump:


Users browsing this thread: 1 Guest(s)