Python Forum

I am trying web scraping for the first time. I can already get the HTML table, which looks like this:

<table>
  <tbody>
    <tr></tr>
    <tr>
      <th>1 abc</th>
      <td>good</td>
      <td><a href="/good">John (Nick)</a></td>
      <td>Lincoln</td>
    </tr>
    <tr>
        <th>20 xyz</th>
        <td>bad</td>
        <td><a href="/bad">Emma</a></td>
        <td>Smith</td>
      </tr>
      <tr></tr>
      ...
  </tbody>
</table>

I have omitted thead for brevity. I just want to convert the rows to JSON like this:

{
  "collections": [
    {
      "size": "1", # from 1 abc
      "identity": "abc", # from 1 abc
      "name": "John (Nick)" # from <a href="/good">John (Nick)</a>
    },
    ...
  ]
}

I have followed [this](https://stackoverflow.com/questions/1854...le-to-json), but I'm having trouble producing JSON like the example above.

Please note: there are empty tr elements inside the tbody, and the first tag in each tr is a th.

In general, your steps could look like this: 1) get the HTML source (already done); 2) parse the HTML document (take a look at the BeautifulSoup and lxml packages); 3) build a dict or a list of dicts; 4) convert the resulting Python object(s) to JSON, e.g. using json.dumps.

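from bs4 import BeautifulSoup

# astr holds the HTML source as a string
parsed = BeautifulSoup(astr, 'html.parser')
collections = []
for row in parsed.find_all('tr'):
    res = {}
    th = row.find('th')
    if th:
        # '1 abc' -> size '1', identity 'abc'
        a, b = th.text.split()
        res.update({'size': a, 'identity': b})
    td = row.find_all('td')
    if len(td) > 1:
        # the second td holds the link with the name
        res.update({'name': td[1].text})
    if res:
        # empty tr elements produce an empty dict and are skipped
        collections.append(res)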

Then use `json.dumps()` to convert the result:
import json
json.dumps(collections)

You can easily fit the snippet to your needs.
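
To get the exact shape from the question (a top-level "collections" key), the list can be wrapped in a dict before dumping. This is only a minimal sketch, assuming bs4 is installed and the HTML looks like the sample in the question; the function name table_to_json and the shortened sample string are made up for illustration.

```python
import json

from bs4 import BeautifulSoup


def table_to_json(html):
    """Return a JSON string shaped like {"collections": [...]}."""
    soup = BeautifulSoup(html, "html.parser")
    collections = []
    for row in soup.find_all("tr"):
        th = row.find("th")
        cells = row.find_all("td")
        # Skip empty rows and rows without a second td.
        if th is None or len(cells) < 2:
            continue
        # "1 abc" -> size "1", identity "abc"
        size, identity = th.get_text(strip=True).split(maxsplit=1)
        collections.append({
            "size": size,
            "identity": identity,
            "name": cells[1].get_text(strip=True),
        })
    return json.dumps({"collections": collections}, indent=2)


# Shortened, made-up sample in the same format as the question.
sample = (
    "<table><tbody><tr></tr>"
    '<tr><th>1 abc</th><td>good</td><td><a href="/good">John (Nick)</a></td>'
    "<td>Lincoln</td></tr>"
    "<tr></tr></tbody></table>"
)
print(table_to_json(sample))
```

The sizes stay strings here, as in the wanted output; wrap size in int() if numbers are preferred.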

That looks awesome!

I just tried but got the error:

Error:
    for row in parsed.find_all('tr')
                                    ^
SyntaxError: invalid syntax

Maybe I'm using parsed incorrectly. I have:

html_data = soup.find(id='collections')
parsed = BeautifulSoup(html_data) # this line I think is incorrect?

I also tried it without the parsed line:

for row in html_data.find_all('tr')

Please guide me. Thanks a lot.

The colon is missing.
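
As for the parsed line: soup.find(id='collections') already returns a bs4 Tag (or None if nothing matches), so there should be no need to pass it through BeautifulSoup() again; find_all can be called on it directly. A small sketch, assuming the table carries id="collections" as in your snippet (the sample HTML here is made up):

```python
from bs4 import BeautifulSoup

# Made-up sample; only the id="collections" attribute comes from the question.
html = (
    '<div><table id="collections">'
    "<tr></tr><tr><th>1 abc</th><td>good</td></tr>"
    "</table></div>"
)
soup = BeautifulSoup(html, "html.parser")

html_data = soup.find(id="collections")  # a Tag, or None if not found
rows = html_data.find_all("tr") if html_data is not None else []

for row in rows:  # note the colon at the end of the for line
    if row.th is not None:
        print(row.th.get_text())
```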

Ah, how silly of me. I didn't see that. Thanks.

By the way, I extended the code a bit, just for fun.

#!/usr/bin/env python3
"""
Specialized Html2Json converter.
Only works with the given format.

Html2Json reads from stdin by default.
If a terminal is detected, the filename must be
given with the -f option.
"""
 
import json
import sys
from argparse import ArgumentParser
from collections import deque
from pathlib import Path
from typing import Union
 
from bs4 import BeautifulSoup
 
 
example_html = """<table>
 <tbody>
   <tr></tr>
   <tr>
     <th>1 abc</th>
     <td>good</td>
     <td><a href="/good">John (Nick)</a></td>
     <td>Lincoln</td>
   </tr>
   <tr>
       <th>20 xyz</th>
       <td>bad</td>
       <td><a href="/bad">Emma</a></td>
       <td>Smith</td>
     </tr>
     <tr></tr>
     ...
 </tbody>
</table>"""
 
 
def html2json(html: str, debug: bool = False) -> str:
    """
    Convert html to a json str.
    """
    collections = {'collections': []}
    result = collections['collections']
    # result refers to the same list object
    # that is stored inside the dict
    fields = 'size identity state name nick'.split()
    skip_fields = 'state nick'.split()
    bs = BeautifulSoup(html, features='html.parser')
    for tr in bs.find_all('tr'):
        state = deque()
        # only the th/td cells; whitespace between the tags is skipped
        for th_td in tr.find_all(['th', 'td'], recursive=False):
            state.append(th_td.text)
        if not state:
            # empty tr
            continue
        if debug:
            print(state, file=sys.stderr)
        size, identity = state.popleft().strip().split()
        size = int(size)
        # extendleft inserts in reverse order,
        # so identity goes first to end up after size
        state.extendleft((identity, size))
        dataset = {
            field: value for (field, value) in
            zip(fields, state)
            if field not in skip_fields
        }
        result.append(dataset)
    return json.dumps(collections, indent=4)
 
 
def main(
    inputfile: Union[None, Path],
    debug: bool, *,
    example: bool
    ) -> str:
    """
    Convert the html inputfile to a json string
    and return it.

    If sys.stdin is a pipe, it is preferred as the source.
    """
    if example:
        return html2json(example_html, debug)
    if inputfile and sys.stdin.isatty():
        html_source = inputfile.read_text()
    elif not inputfile and not sys.stdin.isatty():
        html_source = sys.stdin.read()
        # reads until the pipe is closed by
        # the previous process: cat for example
    else:
        # inputfile given while stdin is a pipe, or neither given:
        # prefer stdin in this case
        html_source = sys.stdin.read()
    return html2json(html_source, debug)
 
 
if __name__ == '__main__':
    parser = ArgumentParser(description=__doc__)
    parser.add_argument('-f', dest='inputfile', default=None, type=Path, help='A path to the inputfile, if stdin is not used.')
    parser.add_argument('-d', dest='debug', action='store_true', help='Debug')
    parser.add_argument('-e', dest='example', action='store_true', help='Example with example data')
    args = parser.parse_args()
    json_str = main(**vars(args))
    print(json_str)
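
One thing to watch in html2json: deque.extendleft pushes the items one at a time, so they end up at the front in reverse order. To get size before identity, the tuple has to be passed as (identity, size). A tiny demo with made-up values, using only the standard library:

```python
from collections import deque

rest = ["good", "John (Nick)", "Lincoln"]

# Items are pushed to the left one by one: first 1, then "abc".
size_first = deque(rest)
size_first.extendleft((1, "abc"))
print(list(size_first))

# Passing identity first leaves size at the very front.
identity_first = deque(rest)
identity_first.extendleft(("abc", 1))
print(list(identity_first))
```

The first print shows 'abc' before 1, the second shows 1 before 'abc', which is the order the fields list expects.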

bhojendra

scidam

bhojendra

DeaD_EyE

bhojendra

DeaD_EyE