Python Forum
Web scraper not populating .txt with scraped data
#1
Hey everyone,

I was wondering if I could get you guys to help me out a little. I'm attempting to make a web scraper to scrape a site for some strings of numbers. I'm first trying to scrape a list of links and then join those URLs and scrape the strings of numbers that I'm looking for. At the end of that I'm just trying to save the scraped strings of numbers to a .txt file for later use.

I'm surprisingly not getting any errors, but my .txt file is not being populated with the scraped data. I'm thinking that maybe I'm not targeting the right HTML tags/classes?

Here is my code:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Scrape the links to each monthly results page
url = "http://www.calotteryx.com/Fantasy-5/drawing-results-calendar.htm"
response = requests.get(url)
content = response.content
soup = BeautifulSoup(content, 'html.parser')
links = []
for link in soup.find_all('a', class_='noline'):
    anchor = link.find('a')
    if anchor and (href := anchor.get('href')):
        links.append(urljoin(url, href))

# Scrape the winning numbers for each monthly results page
winning_numbers = []
for link in links:
    response = requests.get(link)
    content = response.content
    soup = BeautifulSoup(content, 'html.parser')
    for tag in soup.find_all('div', class_='ball blue5 fcblack1'):
        numbers = tag.text.strip().split()
        winning_numbers.append(numbers)

# Write the winning numbers to a file
with open('winning_numbers.txt', 'w') as f:
    for numbers in winning_numbers:
        f.write(''.join(numbers) + '\n')
I joined a while ago, but I'm still not good at coding lol, so please forgive my ignorance or my bad-looking code. Any help pointing out my mistakes and how to fix them would be greatly appreciated.
#2
To help with the first part:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Scrape the links to each monthly results page
url = "http://www.calotteryx.com/Fantasy-5/drawing-results-calendar.htm"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
links = []
base_url = 'http://www.calotteryx.com'
for link in soup.find_all('a', class_='noline'):
    if 'Fantasy' in link.get('href'):
        #print(f"{base_url}{link.get('href')}")
        links.append(f"{base_url}{link.get('href')}")

print(links)
Tip: try to test code incrementally at all stages as you write it; e.g. after line 12 of your code (anchor = link.find('a')), nothing will work.
print() works fine for testing this out.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Scrape the links to each monthly results page
url = "http://www.calotteryx.com/Fantasy-5/drawing-results-calendar.htm"
response = requests.get(url)
content = response.content
soup = BeautifulSoup(content, 'html.parser')
links = []
for link in soup.find_all('a', class_='noline'):
    anchor = link.find('a')
    print(anchor)
Output:
None None None ...
soup.find_all('a', class_='noline') already returns the <a> tags themselves, so link.find('a') looks for another <a> nested inside each anchor and finds nothing.
#3

Thank you! That helped a lot, and that was such a great tip. I actually got it to work. Here is my code:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Scrape the links to each monthly results page
url = "http://www.calotteryx.com/Fantasy-5/drawing-results-calendar.htm"
response = requests.get(url)
content = response.content
soup = BeautifulSoup(content, 'html.parser')
links = []
base_url = 'http://www.calotteryx.com'
for link in soup.find_all('a', class_='noline'):
    if 'Fantasy' in link.get('href'):
        links.append(f"{base_url}{link.get('href')}")

# Scrape the winning numbers for each monthly results page
winning_numbers = []
for link in links:
    response = requests.get(link)
    content = response.content
    soup = BeautifulSoup(content, 'html.parser')
    for tag in soup.find_all('div', class_='ball blue5 fcblack1'):
        numbers = tag.text.strip().split()
        winning_numbers.append(numbers)

# Write the winning numbers to a file
with open('winning_numbers.txt', 'w') as Nums:
    for winners in winning_numbers:
        Nums.write("%s\n" % winners)
    print('Done')
The last thing I need to figure out is how to format the data that's been scraped. Currently every number (01, 15, 02, 10, etc.) is considered its own string, I think? So it's being written to the document via
 Nums.write("%s\n" % winners)
and its format is vertical. After every string it creates a new line, which I think I understand why: it's the %s\n. But I want it to create a new line every 5 numbers, or I guess every 10 digits because each string has two digits in it. My solution was to try something like
Nums.write('%s %s %s %s %s\n' % winners)
but I get a TypeError: not enough arguments for format string, since % needs one value per %s placeholder and each winners here is only a single-element list.

I'd appreciate being pointed in the right direction. I'm thinking I may possibly use f-strings or format(), but I'm unsure if those will work the way I want them to.
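As a minimal sketch (untested against the live site), assuming each entry in winning_numbers is a single-element list like ['09'] as described above, one way is to flatten the lists and then write them out five per line with join():

flat = [num for group in winning_numbers for num in group]  # flatten the single-element lists
with open('winning_numbers.txt', 'w') as f:
    for i in range(0, len(flat), 5):
        f.write(' '.join(flat[i:i + 5]) + '\n')  # five numbers per line

The % version would also work here as '%s %s %s %s %s\n' % tuple(flat[i:i + 5]), since % needs a tuple with exactly one item per placeholder.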

Anyway, thanks for all the help.
#4
So I was able to figure it out. This is how I accomplished it:

# Write the winning numbers to a file
with open('winning_numbers.txt', 'w') as f:
    count = 0
    for winner in winning_numbers:
        f.write('{:<3}'.format(str(winner)))
        count += 1
        if count % 5 == 0:
            f.write('\n')
I'll leave this here in case someone else is having the same issue.
#5
Some tips: the structure that you save is not so useful if you want to do something with the numbers later.
When you save it, the text file looks like this, with each number still inside its own list:
Output:
['09']['10']['18']['29']['32'] ['08']['16']['17']['20']['37'] ['02']['10']['15']['30']['36'] ['01']['13']['29']['33']['36'] ['03']['16']['24']['31']['36'] .....
So here are my changes: this is another way to get the numbers, taking all the numbers that are in class_="idx".
The format is then like this: 06-10-16-25-32.
I also save the data structure (a list of lists) to JSON, so the same data structure can be deserialized back when reading the file.
Example:
import requests
from bs4 import BeautifulSoup
from itertools import islice
import re
import json

def link_monthly():
    '''Scrape the links to each monthly results page'''
    url = "http://www.calotteryx.com/Fantasy-5/drawing-results-calendar.htm"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'lxml')
    base_url = 'http://www.calotteryx.com'
    for link in soup.find_all('a', class_='noline'):
        if 'by-month' in link.get('href'):
            yield f"{base_url}{link.get('href')}"

if __name__ == '__main__':
    # Scrape the links to each monthly results page
    first_2 = islice(link_monthly(), 2)
    winning_numbers = []
    #for link in link_monthly(): # all months
    for link in first_2:
        response = requests.get(link)
        soup = BeautifulSoup(response.content, 'lxml')
        title_tag = soup.find_all(class_="idx")
        for numb in title_tag:
            res = re.search(r'are\s(.*)', numb.get('title'))
            winning_numbers.append(res.group(1).split('-'))
    # Save
    with open('winning_numbs.json', 'w') as js:
        json.dump(winning_numbers, js)
    # Read
    with open('winning_numbs.json') as j_read:
        win_numbers = json.load(j_read)
Output:
>>> win_numbers
[['08', '16', '23', '24', '26'], ['02', '13', '25', '32', '37'], ['13', '19', '33', '36', '39'], ['08', '10', '22', '26', '30'], ['08', '13', '28', '30', '35'], ['01', '26', '27', '30', '39'], ['12', '15', '17', '26', '39'], ['09', '19', '21', '33', '36'], ['01', '14', '21', '32', '39'], ['10', '15', '20', '32', '39'], ['02', '16', '22', '30', '36'], ['01', '07', '18', '37', '38'], ['06', '18', '28', '32', '39'], ['01', '12', '18', '25', '35'], ['11', '12', '14', '21', '26'], ['02', '16', '31', '34', '38'], ['19', '29', '35', '37', '39'], ['08', '21', '22', '28', '38'], ['09', '23', '24', '30', '38'], ['12', '16', '32', '38', '39'], ['02', '06', '22', '23', '27'], ['07', '09', '16', '27', '32'], ['13', '17', '21', '24', '28'], ['05', '17', '18', '28', '39'], ['10', '16', '22', '31', '34'], ['03', '19', '26', '28', '32'], ['13', '14', '23', '25', '37'], ['17', '19', '26', '30', '39'], ['02', '12', '14', '17', '38'], ['08', '10', '12', '15', '33'], ['01', '03', '08', '10', '26'], ['12', '22', '27', '37', '39'], ['13', '18', '22', '25', '34'], ['02', '10', '14', '31', '33'], ['24', '25', '26', '28', '39'], ['02', '10', '11', '29', '34'], ['06', '07', '08', '34', '35'], ['08', '11', '33', '34', '38'], ['02', '07', '09', '27', '34'], ['07', '10', '22', '23', '31'], ['03', '10', '25', '27', '39'], ['02', '05', '11', '20', '23'], ['07', '11', '15', '36', '38'], ['05', '14', '25', '32', '34'], ['25', '28', '30', '37', '39'], ['11', '12', '34', '37', '38'], ['03', '09', '11', '17', '28'], ['02', '11', '12', '15', '22'], ['02', '07', '09', '10', '39'], ['07', '08', '23', '31', '36'], ['09', '18', '30', '35', '37'], ['04', '16', '19', '28', '33'], ['02', '22', '29', '30', '39'], ['06', '10', '16', '25', '32'], ['03', '08', '18', '22', '35'], ['09', '22', '27', '30', '33'], ['21', '22', '33', '36', '39'], ['03', '04', '13', '15', '25'], ['11', '16', '25', '30', '32']]
As you see, the format is better: you get the list of lists back just as it was saved.
So, for example, the first 3 draws and a single number:
>>> win_numbers[0:3]
[['08', '16', '23', '24', '26'],
 ['02', '13', '25', '32', '37'],
 ['13', '19', '33', '36', '39']]
>>> win_numbers[0]
['08', '16', '23', '24', '26']
>>> win_numbers[0][-1]
'26'
#6
Here is a Notebook with the frequency of the lottery numbers.
The number 02๐Ÿ‘€ is in the lead with 1323, so do have that one in your line,
and 38 is at the bottom with 1139. Or maybe it doesn't matter at all๐Ÿฆ„
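The notebook itself isn't reproduced here, but a minimal sketch of the same frequency count, assuming the winning_numbs.json file produced in the previous post:

import json
from collections import Counter

# Read back the list of lists saved earlier
with open('winning_numbs.json') as j_read:
    win_numbers = json.load(j_read)

# Count how often each number appears across all draws
counts = Counter(num for draw in win_numbers for num in draw)
for number, freq in counts.most_common():
    print(number, freq)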

