Python Forum
Same Data Showing Several Times With Beautifulsoup Query
#1
Hi there,

I have the following Python code:

import pandas as pd
import requests
import re
from bs4 import BeautifulSoup

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

res3 = requests.get("https://web.archive.org/web/20220521203053/https://www.military-airshows.co.uk/press22/bbmfschedule2022.htm")
soup3 = BeautifulSoup(res3.content, 'lxml')

BBMF_2022 = []

for item in soup3.find_all('a', string=re.compile(r'between|Flypast')):
    li1 = item.find_parent().text
    print(li1)


The issue I have is that when I run the code, the data for 15 entries, from May 28th to May 29th, is printed several times over. I am not sure why that is the case. Could someone suggest the reason, and tell me what I need to change in the code so that the data is printed only once? I am trying to scrape data from a website, keeping the entries that contain the word 'between' or 'Flypast'.
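One common cause of this symptom is that several of the matching <a> tags share the same parent element, so item.find_parent().text returns the identical block of text once per link. A minimal sketch that prints each link's own text instead, assuming that is the structure of this page (not confirmed against it):

import re
import requests
from bs4 import BeautifulSoup

url = ("https://web.archive.org/web/20220521203053/"
       "https://www.military-airshows.co.uk/press22/bbmfschedule2022.htm")
soup = BeautifulSoup(requests.get(url).content, 'lxml')

# If every matching link sits inside one shared parent, find_parent().text
# repeats that whole block once per link; printing each link's own text
# yields one line per matching entry instead.
for item in soup.find_all('a', string=re.compile(r'between|Flypast')):
    print(item.text.strip())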

When I use the following piece of code instead:

for item in soup3.find_all('a', string=re.compile(r'between|Flypast')):
    li1 = item.find_parent().text
    BBMF_2022.append(li1)

df = pd.DataFrame(BBMF_2022, columns=['BBMF_2022'])

df


The first entry, for the 28th May, is printed out in the DataFrame 15 times, instead of the 15 separate entries I mentioned before.
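If the repeated rows are identical strings, one order-preserving workaround (an assumption about the cause, not a confirmed diagnosis) is to de-duplicate the list before building the DataFrame:

import pandas as pd

# Hypothetical stand-in for the scraped list; the real values come from the
# BBMF_2022 loop above. dict.fromkeys() keeps the first occurrence of each
# distinct string while preserving order.
BBMF_2022 = ['28th May block text'] * 15
df = pd.DataFrame(list(dict.fromkeys(BBMF_2022)), columns=['BBMF_2022'])
print(df)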

Any help would be much appreciated.

Best Regards

Eddie Winch ))
#2
You are using a redirected URL; instead use: https://www.military-airshows.co.uk/press22/bbmfschedule2022.htm

This code will get all of the data and save it as a JSON file, without any filtering. You can add filters, and any other data you need:
import requests
from bs4 import BeautifulSoup
import os
import json
import sys

class airshowdata:
    def __init__(self):
        self.airshow_details = {}
        self.cd = CreateDict()
        self.jsonfile = 'airshow.json'

    def get_links(self):
        url = 'https://www.military-airshows.co.uk/press22/bbmfschedule2022.htm'

        res3 = requests.get(url)
        if res3.status_code == 200:
            soup3 = BeautifulSoup(res3.content,'lxml')
        else:
            print(f"Cannot load page {url}")
            sys.exit(-1)

        links = soup3.find_all('a')
        for link in links:
            anode = self.cd.add_node(self.airshow_details, link.text.strip())
            self.cd.add_cell(anode, 'url', link.get('href'))

        with open(self.jsonfile, 'w') as fp:
            json.dump(self.airshow_details, fp)

        # following not needed and can be removed (displays dictionary contents)
        self.cd.display_dict(self.airshow_details)


class CreateDict:
    """
    CreateDict.py - Contains methods to simplify node and cell creation within
                    a dictionary

    Usage:     
        new_dict(dictname) - Creates a new dictionary instance with the name
            contained in dictname

        add_node(parent, nodename) - Creates a new node (nested dictionary)
            named in nodename, in parent dictionary.

        add_cell(nodename, cellname, value) - Creates a leaf node within node
            named in nodename, with a cell name of cellname, and value of value.

        display_dict(dictname) - Recursively displays a nested dictionary.

    Requirements:
        Python standard library:
            os
    
    Author: Larz60+  -- May 2019.
    """
    def __init__(self):
        os.chdir(os.path.abspath(os.path.dirname(__file__)))

    def new_dict(self, dictname):
        setattr(self, dictname, {})

    def add_node(self, parent, nodename):
        node = parent[nodename] = {}
        return node

    def add_cell(self, nodename, cellname, value):
        cell =  nodename[cellname] = value
        return cell

    def display_dict(self, dictname, level=0):
        indent = " " * (4 * level)
        for key, value in dictname.items():
            if isinstance(value, dict):
                print(f'\n{indent}{key}')
                level += 1
                self.display_dict(value, level)
            else:
                print(f'{indent}{key}: {value}')
            if level > 0:
                level -= 1


def main():
    airs = airshowdata()
    airs.get_links()


if __name__ == '__main__':
    main()
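For example, a standalone sketch of one such filter, reusing the regex from the first post (this is an illustration, not part of the class above):

import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.military-airshows.co.uk/press22/bbmfschedule2022.htm'
soup = BeautifulSoup(requests.get(url).content, 'lxml')

# Keep only the anchors whose text matches the 'between|Flypast' pattern,
# then map each link's text to its href, mirroring get_links() above.
pattern = re.compile(r'between|Flypast')
filtered = {link.text.strip(): link.get('href')
            for link in soup.find_all('a', string=pattern)}
print(filtered)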
#3
Many thanks for that code Larz60+, it's very much appreciated; thank you for taking the time to type it. I chose the web.archive link because the data is from a week ago: the 21st May data was removed from the website the other day.

Does anyone have any idea how I can change my code to solve the issue I am having with it?

Any help would be very much appreciated.

Regards

Eddie Winch ))