Using Python to search through a list of urls
#1
I want to be able to extract data from multiple pages. The pages are in the following format:

Output:
https://www.trademe.co.nz/browse/categoryattributesearchresults.aspx?cid=5748&search=1&134=9&135=2&rptpath=350-5748-&rsqid=d4360a620e944164b321dc2498f327b9-002&nofilters=1&originalsidebar=1&key=1227701521&page=1&sort_order=price_asc
https://www.trademe.co.nz/browse/categoryattributesearchresults.aspx?cid=5748&search=1&134=9&135=2&rptpath=350-5748-&rsqid=d4360a620e944164b321dc2498f327b9-002&nofilters=1&originalsidebar=1&key=1227701521&page=2&sort_order=price_asc
https://www.trademe.co.nz/browse/categoryattributesearchresults.aspx?cid=5748&search=1&134=9&135=2&rptpath=350-5748-&rsqid=d4360a620e944164b321dc2498f327b9-002&nofilters=1&originalsidebar=1&key=1227701521&page=3&sort_order=price_asc
In these links, the only thing that changes in the URL is the number following page=.

I have created code so far that exports the results into a CSV file. However, this only works for one URL:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.trademe.co.nz/browse/categoryattributesearchresults.aspx?cid=5748&search=1&134=9&135=2&rptpath=350-5748-&rsqid=d4360a620e944164b321dc2498f327b9-002&nofilters=1&originalsidebar=1&key=1227701521&page=1&sort_order=price_asc'

# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# html parser
page_soup = soup(page_html, "html.parser")


# grabs each property
listings = page_soup.findAll("div",{"class":"tmp-search-card-list-view__card-content"})

filename = "trademe.csv"
f = open(filename, "w")

headers = "title, price, area\n"

f.write(headers)

for listing in listings:

	title_listing = listing.findAll("div", {"class":"tmp-search-card-list-view__title"})
	price_listing = listing.findAll("div", {"class":"tmp-search-card-list-view__price"})
	area_listing = listing.findAll("div", {"class":"tmp-search-card-list-view__subtitle"})
	title = title_listing[0].text.strip()
	price = price_listing[0].text.strip()
	area = area_listing[0].text.strip()

	print("title: " + title)
	print("price: " + price)
	print("area: " + area)

	f.write(title.replace(",", "^") + "," + price.replace(",", "") + "," + area.replace(",", "^") + "\n")

f.close()
How would I get this working so that it loops through all of the page numbers in the URLs?

I could create a text file with all of the possible links, but I'm still not sure how to get this to work.

I'm new to Python.
#2
You can do this with a generator.
Note that I have given it default values for start and end, so the defaults will be used if it is called without arguments:
def get_next_url(start_page_no=1, end_page_no=5):
    for pgno in range(start_page_no, end_page_no + 1):
        # only the page number changes from one URL to the next
        yield (f"https://www.trademe.co.nz/browse/"
               f"categoryattributesearchresults.aspx"
               f"?cid=5748&search=1&134=9&135=2&rptpath=350-5748-&"
               f"rsqid=d4360a620e944164b321dc2498f327b9-002"
               f"&nofilters=1&originalsidebar=1&key=1227701521"
               f"&page={pgno}&sort_order=price_asc")

def main():
    # called, using default:
    print('\nUsing default:')
    for url in get_next_url():
        print(f"\nurl: {url}")

    # called providing start and end values
    start_page = 7
    end_page = 10
    print(f"\n\nProviding start and end pages")
    for url in get_next_url(start_page, end_page):
        print(f"\nurl: {url}")

if __name__ == '__main__':
    main()
Results of running the above:
Output:
Using default:
url: https://www.trademe.co.nz/browse/categoryattributesearchresults.aspx?cid=5748&search=1&134=9&135=2&rptpath=350-5748-&rsqid=d4360a620e944164b321dc2498f327b9-002&nofilters=1&originalsidebar=1&key=1227701521&page=1&sort_order=price_asc
url: https://www.trademe.co.nz/browse/categoryattributesearchresults.aspx?cid=5748&search=1&134=9&135=2&rptpath=350-5748-&rsqid=d4360a620e944164b321dc2498f327b9-002&nofilters=1&originalsidebar=1&key=1227701521&page=2&sort_order=price_asc
url: https://www.trademe.co.nz/browse/categoryattributesearchresults.aspx?cid=5748&search=1&134=9&135=2&rptpath=350-5748-&rsqid=d4360a620e944164b321dc2498f327b9-002&nofilters=1&originalsidebar=1&key=1227701521&page=3&sort_order=price_asc
url: https://www.trademe.co.nz/browse/categoryattributesearchresults.aspx?cid=5748&search=1&134=9&135=2&rptpath=350-5748-&rsqid=d4360a620e944164b321dc2498f327b9-002&nofilters=1&originalsidebar=1&key=1227701521&page=4&sort_order=price_asc
url: https://www.trademe.co.nz/browse/categoryattributesearchresults.aspx?cid=5748&search=1&134=9&135=2&rptpath=350-5748-&rsqid=d4360a620e944164b321dc2498f327b9-002&nofilters=1&originalsidebar=1&key=1227701521&page=5&sort_order=price_asc

Providing start and end pages
url: https://www.trademe.co.nz/browse/categoryattributesearchresults.aspx?cid=5748&search=1&134=9&135=2&rptpath=350-5748-&rsqid=d4360a620e944164b321dc2498f327b9-002&nofilters=1&originalsidebar=1&key=1227701521&page=7&sort_order=price_asc
url: https://www.trademe.co.nz/browse/categoryattributesearchresults.aspx?cid=5748&search=1&134=9&135=2&rptpath=350-5748-&rsqid=d4360a620e944164b321dc2498f327b9-002&nofilters=1&originalsidebar=1&key=1227701521&page=8&sort_order=price_asc
url: https://www.trademe.co.nz/browse/categoryattributesearchresults.aspx?cid=5748&search=1&134=9&135=2&rptpath=350-5748-&rsqid=d4360a620e944164b321dc2498f327b9-002&nofilters=1&originalsidebar=1&key=1227701521&page=9&sort_order=price_asc
url: https://www.trademe.co.nz/browse/categoryattributesearchresults.aspx?cid=5748&search=1&134=9&135=2&rptpath=350-5748-&rsqid=d4360a620e944164b321dc2498f327b9-002&nofilters=1&originalsidebar=1&key=1227701521&page=10&sort_order=price_asc
#3
Thanks so much for your help, Larz! I'm still not sure how to incorporate my code into this. How do I get my scrape to cover all of the pages and give results in Excel such as:



title | price | area
Project Sleepout/extra room for relocation | $28,750 | Saleyards Road^ Kauri^ Whangarei
Leisurebuilt Cabin | $56,000 | Waipu Cove^ Whangarei
Large Two Bedroom Open Plan Relocatable | $74,000 | Kamo^ Whangarei
3 Bedroom Family Home - 1 Piece Move | $82,000 | Otangarei^ Whangarei
Solid 3 Bedroom 1 Piece Move | $82,000 | Kamo^ Whangarei
Three bedroom house for relocation | $85,000 | Saleyards Road^ Kauri^ Whangarei
BUILD YOUR EASY CARE RURAL HOME HERE! | $95,000 | Titoki^ Whangarei
Vendors Say Present All Offers | Price by negotiation | Whangarei Heads^ Whangarei
Relocatable homes - New | $105,900 | 34 Lakeside Park Road^ Ruakaka^ Whangarei
Lovely Character Villa Relocated To Your Site | $110,000 | Waipu^ Whangarei
Lovely Character Villa | $110,000 | Kamo^ Whangarei
Solid 3 Bedroom Bungalow | $135,000 | Maungakaramea^ Whangarei
Create Your Dream Right Here! | Price by negotiation | Raumanga^ Whangarei
Let your imagination run wild ... | $149,000 | Maunu^ Whangarei
WHANGAREI'S CHEAPEST SECTION | $150,000 | Raumanga^ Whangarei
Price reduced - now only $159^000 | $159,000 | 1/891 Cove Road^ Waipu^ Whangarei
Your holiday awaits! | $167,000 | 3/891 Cove Road^ Waipu^ Whangarei
Best Value Around? | $169,000 | Kamo^ Whangarei
Look^ the price is not a mistake!! | $169,000 | Kamo^ Whangarei
The perfect beach getaway and investment | $179,500 | 5/891 Cove Road^ Waipu^ Whangarei
Elevated with 180 degree views | $185,000 | Kamo^ Whangarei
Prime Sections - Kotata Heights Morningside | Enquiries over $190000 | Morningside^ Whangarei
Dream it - Build It - Live Here | Price by negotiation | Horahora^ Whangarei
Price Reduced - Pebble Beach Boulevard Section | $190,000 | Kamo^ Whangarei



My current script does this, but only for one page.
#4
You need to put it all in a loop:
for url in get_next_url(start_page, end_page):
    # code for each page goes here
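Putting it together, here is a minimal sketch that drops your post #1 code into that loop, reusing the get_next_url generator from post #2 (so it assumes that function is defined in the same file). The CSV file is opened once, before the loop, so the rows from every page end up in the same file:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

# get_next_url is the generator defined in post #2
start_page = 1
end_page = 5

with open("trademe.csv", "w") as f:
    f.write("title, price, area\n")  # header row, written once
    for url in get_next_url(start_page, end_page):
        # fetch and parse one results page
        uClient = uReq(url)
        page_html = uClient.read()
        uClient.close()
        page_soup = soup(page_html, "html.parser")

        # grab each property card on this page
        listings = page_soup.findAll("div", {"class": "tmp-search-card-list-view__card-content"})
        for listing in listings:
            title = listing.find("div", {"class": "tmp-search-card-list-view__title"}).text.strip()
            price = listing.find("div", {"class": "tmp-search-card-list-view__price"}).text.strip()
            area = listing.find("div", {"class": "tmp-search-card-list-view__subtitle"}).text.strip()
            f.write(title.replace(",", "^") + "," + price.replace(",", "") + "," + area.replace(",", "^") + "\n")
Opening the file with a with block also means it is closed automatically, even if one of the pages fails to load.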
#5
You are getting parsed values from the for loop, right? You need to write those to the CSV file inside that same loop.

For example:

with open('data.csv', 'w') as f:
    for url in get_next_url(start_page, end_page):
        print(f"\nurl: {url}")
        f.write(url + "\n")  # one URL per line
And if you need more specific headers, you can use the csv module to create all the headers you want; you then have to parse each URL accordingly to get the details you are looking for. The details need to be populated under the respective headers in the CSV file.
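A rough sketch of that approach, assuming the same three fields as in the earlier posts (get_next_url is again the generator from post #2); csv.writer quotes any field that contains a comma, so the ^ replacements are no longer necessary:
import csv
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

with open("data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "price", "area"])  # header row
    for url in get_next_url(1, 5):  # generator from post #2
        page_soup = soup(uReq(url).read(), "html.parser")
        cards = page_soup.findAll("div", {"class": "tmp-search-card-list-view__card-content"})
        for listing in cards:
            title = listing.find("div", {"class": "tmp-search-card-list-view__title"}).text.strip()
            price = listing.find("div", {"class": "tmp-search-card-list-view__price"}).text.strip()
            area = listing.find("div", {"class": "tmp-search-card-list-view__subtitle"}).text.strip()
            writer.writerow([title, price, area])  # csv module handles quoting
Opening the file with newline="" is the documented way to use the csv module, so it controls line endings itself on every platform.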