Python Forum
Web scraper not populating .txt with scraped data
#1
Hey everyone,

I was wondering if I could get you guys to help me out a little. I'm attempting to make a web scraper to scrape a site for some strings of numbers. I'm first trying to scrape a list of links and then join those URLs and scrape the strings of numbers that I'm looking for. At the end of that I'm just trying to save the scraped strings of numbers to a .txt file for later use.

I'm surprisingly not getting any errors, but my .txt file is not being populated with the scraped data. I'm thinking that maybe I'm not targeting the right HTML tags/classes?

Here is my code:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Scrape the links to each monthly results page
url = "http://www.calotteryx.com/Fantasy-5/drawing-results-calendar.htm"
response = requests.get(url)
content = response.content
soup = BeautifulSoup(content, 'html.parser')
links = []
for link in soup.find_all('a', class_='noline'):
    anchor = link.find('a')
    if anchor and (href := anchor.get('href')):
        links.append(urljoin(url, href))

# Scrape the winning numbers for each monthly results page
winning_numbers = []
for link in links:
    response = requests.get(link)
    content = response.content
    soup = BeautifulSoup(content, 'html.parser')
    for tag in soup.find_all('div', class_='ball blue5 fcblack1'):
        numbers = tag.text.strip().split()
        winning_numbers.append(numbers)

# Write the winning numbers to a file
with open('winning_numbers.txt', 'w') as f:
    for numbers in winning_numbers:
        f.write(''.join(numbers) + '\n')
I joined a while ago, but I'm still not good at coding lol, so please forgive my ignorance or my bad-looking code. Any help pointing out my mistakes and how to fix them would be greatly appreciated.
#2
To help with the first part:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Scrape the links to each monthly results page
url = "http://www.calotteryx.com/Fantasy-5/drawing-results-calendar.htm"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
links = []
base_url = 'http://www.calotteryx.com'
for link in soup.find_all('a', class_='noline'):
    if 'Fantasy' in link.get('href'):
        #print(f"{base_url}{link.get('href')}")
        links.append(f"{base_url}{link.get('href')}")

print(links)
Tip: try to test code incrementally at all stages as you write it; e.g. after line 12 of your code (anchor = link.find('a')), nothing will work.
print() works fine for testing this out.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Scrape the links to each monthly results page
url = "http://www.calotteryx.com/Fantasy-5/drawing-results-calendar.htm"
response = requests.get(url)
content = response.content
soup = BeautifulSoup(content, 'html.parser')
links = []
for link in soup.find_all('a', class_='noline'):
    anchor = link.find('a')
    print(anchor)
Output:
None None None ...
soup.find_all('a', class_='noline') already returns the <a> tags themselves, so link.find('a') looks for another <a> nested inside each anchor and finds nothing.
#3

Thank you! That helped a lot, and that was such a great tip. I actually got it to work. Here is my code:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Scrape the links to each monthly results page
url = "http://www.calotteryx.com/Fantasy-5/drawing-results-calendar.htm"
response = requests.get(url)
content = response.content
soup = BeautifulSoup(content, 'html.parser')
links = []
base_url = 'http://www.calotteryx.com'
for link in soup.find_all('a', class_='noline'):
    if 'Fantasy' in link.get('href'):
        links.append(f"{base_url}{link.get('href')}")

# Scrape the winning numbers for each monthly results page
winning_numbers = []
for link in links:
    response = requests.get(link)
    content = response.content
    soup = BeautifulSoup(content, 'html.parser')
    for tag in soup.find_all('div', class_='ball blue5 fcblack1'):
        numbers = tag.text.strip().split()
        winning_numbers.append(numbers)

# Write the winning numbers to a file
with open('winning_numbers.txt', 'w') as Nums:
    for winners in winning_numbers:
        Nums.write("%s\n" % winners)
    print('Done')
The last thing I need to figure out is how to format the data that's been scraped. Currently every number (01, 15, 02, 10, etc.) is considered its own string, I think? So it's being written to the document via
 Nums.write("%s\n" % winners)
and its format is vertical. After every string it creates a new line, which I think I understand why: it's the %s\n. But I want it to create a new line every 5 numbers, or I guess every 10 digits because each string has two digits in it. My solution was to try something like
Nums.write('%s %s %s %s %s\n' % winners)
but I get a TypeError: not enough arguments for format string, since % needs one value per %s placeholder and each winners here is only a single-element list.

I'd appreciate being pointed in the right direction. I'm thinking I may possibly use f-strings or format(), but I'm unsure if those will work the way I want them to.
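As a minimal sketch (untested against the live site), assuming each entry in winning_numbers is a single-element list like ['09'] as described above, one way is to flatten the lists and then write them out five per line with join():

flat = [num for group in winning_numbers for num in group]  # flatten the single-element lists
with open('winning_numbers.txt', 'w') as f:
    for i in range(0, len(flat), 5):
        f.write(' '.join(flat[i:i + 5]) + '\n')  # five numbers per line

The % version would also work here as '%s %s %s %s %s\n' % tuple(flat[i:i + 5]), since % needs a tuple with exactly one item per placeholder.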

Anyway, thanks for all the help.
#4
So I was able to figure it out. This is how I accomplished it:

# Write the winning numbers to a file
with open('winning_numbers.txt', 'w') as f:
    count = 0
    for winner in winning_numbers:
        f.write('{:<3}'.format(str(winner)))
        count += 1
        if count % 5 == 0:
            f.write('\n')
I'll leave this here in case someone else is having the same issue.
#5
Some tips: the structure that you save is not so useful if you want to do something with the numbers later.
When you save it, the text file looks like this, with each number still inside its own list:
Output:
['09']['10']['18']['29']['32'] ['08']['16']['17']['20']['37'] ['02']['10']['15']['30']['36'] ['01']['13']['29']['33']['36'] ['03']['16']['24']['31']['36'] .....
So here are my changes: this is another way to get the numbers, taking all the numbers that are in class_="idx".
The format is then like this: 06-10-16-25-32.
I also save the data structure (a list of lists) to JSON, so the same data structure can be deserialized back when reading the file.
Example:
import requests
from bs4 import BeautifulSoup
from itertools import islice
import re
import json

def link_monthly():
    '''Scrape the links to each monthly results page'''
    url = "http://www.calotteryx.com/Fantasy-5/drawing-results-calendar.htm"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'lxml')
    base_url = 'http://www.calotteryx.com'
    for link in soup.find_all('a', class_='noline'):
        if 'by-month' in link.get('href'):
            yield f"{base_url}{link.get('href')}"

if __name__ == '__main__':
    # Scrape the links to each monthly results page
    first_2 = islice(link_monthly(), 2)
    winning_numbers = []
    #for link in link_monthly(): # all months
    for link in first_2:
        response = requests.get(link)
        soup = BeautifulSoup(response.content, 'lxml')
        title_tag = soup.find_all(class_="idx")
        for numb in title_tag:
            res = re.search(r'are\s(.*)', numb.get('title'))
            winning_numbers.append(res.group(1).split('-'))
    # Save
    with open('winning_numbs.json', 'w') as js:
        json.dump(winning_numbers, js)
    # Read
    with open('winning_numbs.json') as j_read:
        win_numbers = json.load(j_read)
Output:
>>> win_numbers
[['08', '16', '23', '24', '26'], ['02', '13', '25', '32', '37'], ['13', '19', '33', '36', '39'], ['08', '10', '22', '26', '30'], ['08', '13', '28', '30', '35'], ['01', '26', '27', '30', '39'], ['12', '15', '17', '26', '39'], ['09', '19', '21', '33', '36'], ['01', '14', '21', '32', '39'], ['10', '15', '20', '32', '39'], ['02', '16', '22', '30', '36'], ['01', '07', '18', '37', '38'], ['06', '18', '28', '32', '39'], ['01', '12', '18', '25', '35'], ['11', '12', '14', '21', '26'], ['02', '16', '31', '34', '38'], ['19', '29', '35', '37', '39'], ['08', '21', '22', '28', '38'], ['09', '23', '24', '30', '38'], ['12', '16', '32', '38', '39'], ['02', '06', '22', '23', '27'], ['07', '09', '16', '27', '32'], ['13', '17', '21', '24', '28'], ['05', '17', '18', '28', '39'], ['10', '16', '22', '31', '34'], ['03', '19', '26', '28', '32'], ['13', '14', '23', '25', '37'], ['17', '19', '26', '30', '39'], ['02', '12', '14', '17', '38'], ['08', '10', '12', '15', '33'], ['01', '03', '08', '10', '26'], ['12', '22', '27', '37', '39'], ['13', '18', '22', '25', '34'], ['02', '10', '14', '31', '33'], ['24', '25', '26', '28', '39'], ['02', '10', '11', '29', '34'], ['06', '07', '08', '34', '35'], ['08', '11', '33', '34', '38'], ['02', '07', '09', '27', '34'], ['07', '10', '22', '23', '31'], ['03', '10', '25', '27', '39'], ['02', '05', '11', '20', '23'], ['07', '11', '15', '36', '38'], ['05', '14', '25', '32', '34'], ['25', '28', '30', '37', '39'], ['11', '12', '34', '37', '38'], ['03', '09', '11', '17', '28'], ['02', '11', '12', '15', '22'], ['02', '07', '09', '10', '39'], ['07', '08', '23', '31', '36'], ['09', '18', '30', '35', '37'], ['04', '16', '19', '28', '33'], ['02', '22', '29', '30', '39'], ['06', '10', '16', '25', '32'], ['03', '08', '18', '22', '35'], ['09', '22', '27', '30', '33'], ['21', '22', '33', '36', '39'], ['03', '04', '13', '15', '25'], ['11', '16', '25', '30', '32']]
As you see, the format is better: you get the list of lists back just as it was saved.
So, for example, the first 3 draws and a single number:
>>> win_numbers[0:3]
[['08', '16', '23', '24', '26'],
 ['02', '13', '25', '32', '37'],
 ['13', '19', '33', '36', '39']]
>>> win_numbers[0]
['08', '16', '23', '24', '26']
>>> win_numbers[0][-1]
'26'
#6
Here is a Notebook with the frequency of the lottery numbers.
The number 02๐Ÿ‘€ is in the lead with 1323, so do have that one in your line,
and 38 is at the bottom with 1139. Or maybe it doesn't matter at all๐Ÿฆ„
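The notebook itself isn't reproduced here, but a minimal sketch of the same frequency count, assuming the winning_numbs.json file produced in the previous post:

import json
from collections import Counter

# Read back the list of lists saved earlier
with open('winning_numbs.json') as j_read:
    win_numbers = json.load(j_read)

# Count how often each number appears across all draws
counts = Counter(num for draw in win_numbers for num in draw)
for number, freq in counts.most_common():
    print(number, freq)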

