Dec-14-2022, 12:33 PM
Good evening,
I hope someone can help me with this very basic algorithm I'm working on. This is my first post, so please excuse me if I miss something.
The scenario is this: there is a job ads website I should scrape. It is not dynamic.
Ads are grouped into pages, whose URLs look like this:
https://www.jobintourism.it/search-offerte/?sf_paged=K
where K is the page number.
When a page does not have any job ads in it, it does not return a 404 error or anything like that, but instead a blank webpage with a specific message in Italian.
I already have a method called page_exists(no: int) that checks for that: it returns True if the page is not empty and prints a message to the console.
That being said, the first step of my web scraping routine is to find the last non-empty page.
I came up with this simple algorithm based on a binary search:
    def get_last_page_available(self) -> int:
        # Gets last non-empty ads page from JIT with a modified binary search algorithm.
        # Begins from an arbitrary value of 100, as it's the most common scenario for JIT.
        MIN = 0
        max = 100
        test = max / 2
        while test != 0:
            if self.page_exists(test):
                MIN = test
                test = (max + MIN) / 2
            else:
                max = test
                test = (max + MIN) / 2
        else:
            return test

I think it is pretty clear how it works, but just to clarify: it begins with a possible range of page numbers (MIN to max), then checks whether the page at the test value exists. If it does, the lower bound MIN is raised to test and test moves to the midpoint (max + MIN) / 2 of the new range, so the search continues upwards; if it does not, the upper bound max is lowered to test instead. This works because I have already verified that if page number k exists, then pages k-1, k-2, ..., 1 exist too.* In short, this is a divide et impera algorithm.
* Update: It has also been verified that page numbers are always contiguous integers.
It works as expected at first, with one big problem: since the division produces floats, the test value is never rounded to a whole number and the search always ends up in an infinite loop, clearly because there are infinitely many decimals between two integers. To clarify, this is an example of output:
Output:
Page 50.0 exists.
Page 75.0 does not exist.
Page 62.5 exists.
Page 68.75 exists.
Page 71.875 exists.
Page 73.4375 exists.
Page 74.21875 does not exist.
Page 73.828125 exists.
Page 74.0234375 does not exist.
Page 73.92578125 exists.
Page 73.974609375 exists.
Page 73.9990234375 exists.
Page 74.01123046875 does not exist.
Page 74.005126953125 does not exist.
Page 74.0020751953125 does not exist.
Page 74.00054931640625 does not exist.
Page 73.99978637695312 exists.
Page 74.00016784667969 does not exist.
Page 73.9999771118164 exists.
Page 74.00007247924805 does not exist.
Page 74.00002479553223 does not exist.
Page 74.00000095367432 does not exist.
Page 73.99998903274536 exists.
Page 73.99999499320984 exists.
Page 73.99999797344208 exists.
Page 73.9999994635582 exists.
Page 74.00000020861626 does not exist.
Page 73.99999983608723 exists.
Page 74.00000002235174 does not exist.
Page 73.99999992921948 exists.
Page 73.99999997578561 exists.
Page 73.99999999906868 exists.
Page 74.00000001071021 does not exist.
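The trace above can be reproduced offline with a small sketch. Everything here is an assumption made for illustration: page_exists is replaced by a stub that drops the fractional part of the page number (which is how the site appears to behave, judging by the output), LAST_PAGE is hard-coded to 73, and float_probes is a hypothetical helper that yields each value the loop would test.

```python
from itertools import islice

LAST_PAGE = 73  # hypothetical: the last non-empty page in the trace above


def page_exists(no) -> bool:
    # Stub standing in for the real method. Assumption: a fractional page
    # number behaves like its integer part, as the trace suggests.
    return int(no) <= LAST_PAGE


def float_probes():
    # Same update rule as the original method, written as a generator so
    # the otherwise endless loop can be cut off by the caller.
    low, high = 0, 100
    test = high / 2
    while test != 0:
        yield test
        if page_exists(test):
            low = test
        else:
            high = test
        test = (high + low) / 2


first_probes = list(islice(float_probes(), 6))
print(first_probes)  # [50.0, 75.0, 62.5, 68.75, 71.875, 73.4375]
```

The probes keep closing in on 74 from both sides, so test never becomes 0 and the while condition never fails.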
As a matter of fact, page 73 was the last one available, but the search ended up in an infinite loop.
How I tried to fix it:
- Tried floor division instead of ordinary division inside the method, but it messes up the results.
- Tried keeping track of the test value by declaring test_history = [] at the beginning (the idea was to check each time whether the difference between one result and the next is less than 1), but it doesn't work because at the beginning the list is empty.
- Tried the less elegant solution of checking each page one by one until page_exists() returns False, but since this is web scraping, it is very time- and bandwidth-consuming and generally looks like an error-prone and bad solution.
Any ideas?
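One possible direction, as a minimal sketch rather than a definitive fix: keep both bounds as integers, use floor division for the midpoint, and stop when the bounds are adjacent instead of when the test value reaches 0. The function name last_existing_page, the start_high parameter and the doubling step are assumptions made for this example; page_exists is passed in as a callable so the sketch can run without the site. It relies on the property stated above, that if page k exists then so do pages 1 to k-1.

```python
from typing import Callable


def last_existing_page(page_exists: Callable[[int], bool], start_high: int = 100) -> int:
    """Return the last non-empty page number, or 0 if page 1 is already empty.

    Invariant: page `low` exists (or low == 0) and page `high` does not.
    """
    if start_high < 1:
        raise ValueError("start_high must be at least 1")

    low, high = 0, start_high
    # If the initial guess still exists, double it until an empty page is found.
    while page_exists(high):
        low, high = high, high * 2

    # Integer bisection: the gap shrinks on every pass, so the loop terminates.
    while high - low > 1:
        mid = (low + high) // 2
        if page_exists(mid):
            low = mid
        else:
            high = mid
    return low


# Usage with a stub in place of the real method (73 is taken from the trace above).
print(last_existing_page(lambda no: no <= 73))  # 73
```

With a starting upper bound of 100 and 73 as the last page, this takes 8 calls to page_exists, compared with 74 for the page-by-page approach.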