Python Forum

Good evening,

I hope someone will be able to help me with this very basic algorithm I'm working on, this is my first post so please apologise if I miss something.

The scenario is this: there is a job ads website I should scrape. It is not dynamic.
Ads are grouped into pages, whose URL would be like this:
https://www.jobintourism.it/search-offerte/?sf_paged=K
where K is the page number.
When a page does not have any job ads in it, it won't retrieve a 404 error or something like that, but instead a blank webpage with a specific message in Italian.
I already have a method called page_exists(no: int) that checks that, and returns True if the page is not empty and print a message in the console.

That being said, the first step of my web scraping routine would be to know what is the last non-empty page.
I came up with this simple algorithm based on a binary search:

 def get_last_page_available(self) -> int:
        # Gets last non-empty ads page from JIT with a modified binary search algorithm.
        # Begins from an arbitrary value of 100, as it's the most common scenario for JIT.
        MIN = 0
        max = 100
        test = max / 2

        while test != 0:
            if self.page_exists(test):
                MIN = test
                test = (max + MIN) / 2
            else:
                max = test
                test = (max + MIN) / 2
        else:
            return test

I think it is pretty clear how it works but just to clarify: it begins with a possible range of page numbers (min >>> max), then tests if test value exists: if YES , it ADDS (max+min)/2 to the test value to widen the search, if NOT does the opposite. This because I have already verified that if page number k exists, also k-1, k-2 ... 1 do.* In short, this is a divide et impera algorithm.
[i]* Update: It's also been verified that page numbers are always contiguous, integer numbers.

It works exactly as expected, with one big problem: since it works with real numbers, it doesn't approximate to decimals and ends up always in an infinite loop, clearly because there are infinite decimals between two integers. To clarify, this is an example of output:

Output:Page 50.0 exists.
Page 75.0 does not exist.
Page 62.5 exists.
Page 68.75 exists.
Page 71.875 exists.
Page 73.4375 exists.
Page 74.21875 does not exist.
Page 73.828125 exists.
Page 74.0234375 does not exist.
Page 73.92578125 exists.
Page 73.974609375 exists.
Page 73.9990234375 exists.
Page 74.01123046875 does not exist.
Page 74.005126953125 does not exist.
Page 74.0020751953125 does not exist.
Page 74.00054931640625 does not exist.
Page 73.99978637695312 exists.
Page 74.00016784667969 does not exist.
Page 73.9999771118164 exists.
Page 74.00007247924805 does not exist.
Page 74.00002479553223 does not exist.
Page 74.00000095367432 does not exist.
Page 73.99998903274536 exists.
Page 73.99999499320984 exists.
Page 73.99999797344208 exists.
Page 73.9999994635582 exists.
Page 74.00000020861626 does not exist.
Page 73.99999983608723 exists.
Page 74.00000002235174 does not exist.
Page 73.99999992921948 exists.
Page 73.99999997578561 exists.
Page 73.99999999906868 exists.
Page 74.00000001071021 does not exist.

As a matter of fact, page 73 was the last one available, but it ended up in an infinite loop.

How I tried to fix it:

Tried with floor division instead of ordinary division inside the method, but it messes up results.
Tried with keeping track of test value by declaring test_history = [] at the beginning (the idea was to check each time whether the difference between one result and the other one after is less than 1), but it doesn't work because at the beginning the list is empty.
Tried the less-elegant solution of checking each page one by one until page_exists() returns False, but since we are talking of web scraping it is very time and bandwidth consuming and generally looks like a very error prone and bad solution.

Any ideas?

Please wrap posted code in Python tags.

I don't see how a binary search is applicable. A binary search would work if page_exists(k) returned False when k >= number of pages, but it doesn't sound like that is what it does. Instead, if I understand you correctly, page_exists(k) returns False if k >= number of pages or if page[k] doesn't have any adds.

The code below demonstrates the problem. last_page() finds the last page when pages are contiguous, but when I make a list of pages that has holes, it can grab the last page before one of the holes.

import random

def page_exists(page):
    return page in pages

def last_page():
    lower = 0
    upper = 100
    mid  = upper
    while upper > lower+1:
        if page_exists(mid):
            lower = mid
        else:
            upper = mid
        mid = (upper + lower) // 2
    return mid

# This always finds the last page == 50
pages = list(range(51))
print(last_page())

# This always find a page, but not the always the last page
pages = sorted(random.sample(range(51), 15))
print(pages)
print(last_page())

Output:50
[0, 5, 6, 9, 10, 14, 15, 18, 21, 22, 23, 29, 39, 41, 47]
10

In the non-contiguous data set, last page found the hole between 10 and 14 and decide that was the end of pages.

Your implementation of a binary search is also incorrect. The end condition is not test == 0, it is test == last page. The way to know that you found the last page is max = min + 1.

(Dec-14-2022, 02:40 PM)deanhystad Wrote: [ -> ]Please wrap posted code in Python tags.

Thanks for the head up, corrected.

(Dec-14-2022, 02:40 PM)deanhystad Wrote: [ -> ]I don't see how a binary search is applicable. A binary search would work if page_exists(k) returned False when k >= number of pages, but it doesn't sound like that is what it does. Instead, if I understand you correctly, page_exists(k) returns False if k >= number of pages or if page[k] doesn't have any adds.

Thanks again, I understand the reference to binary search algorithm was wrong.

(Dec-14-2022, 02:40 PM)deanhystad Wrote: [ -> ]The code below demonstrates the problem. last_page() finds the last page when pages are contiguous, but when I make a list of pages that has holes, it can grab the last page before one of the holes.
In the non-contiguous data set, last page found the hole between 10 and 14 and decide that was the end of pages.

Your implementation of a binary search is also incorrect. The end condition is not test == 0, it is test == last page. The way to know that you found the last page is max = min + 1.

Pages are always contiguous (update the main post, I forgot to report that in the first place); so no problem with the "holes", as there are any in this case.
Algorithms are still a new thing for me, so this helps a lot for the future. Thank you, I understand where the problem was.

You could base yourself on the bisection algorithm implemented in module bisect, for exemple the function bisect_left returns the insertion point of a value x in a sorted array just before any occurrence of x in the array

def bisect_left(a, x, lo=0, hi=None, *, key=None):
    """Return the index where to insert item x in list a, assuming a is sorted.

    The return value i is such that all e in a[:i] have e < x, and all e in
    a[i:] have e >= x.  So if x already appears in the list, a.insert(i, x) will
    insert just before the leftmost x already there.

    Optional args lo (default 0) and hi (default len(a)) bound the
    slice of a to be searched.
    """

    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    # Note, the comparison uses "<" to match the
    # __lt__() logic in list.sort() and in heapq.
    if key is None:
        while lo < hi:
            mid = (lo + hi) // 2
            if a[mid] < x:
                lo = mid + 1
            else:
                hi = mid
    else:
        while lo < hi:
            mid = (lo + hi) // 2
            if key(a[mid]) < x:
                lo = mid + 1
            else:
                hi = mid
    return lo

In your case you can imagine that you have an array with value 0 if page_exists(k) and value 1 otherwise. Then the it is as if you wanted to insert the value 1 in the array, hence the modified code would be

def last_page(lo=0, hi=100):
    if lo < 0:
        raise ValueError('lo must be non-negative')
    while lo < hi:
        mid = (lo + hi) // 2
        if page_exists(mid):
            lo = mid + 1
        else:
            hi = mid
    return lo - 1

You could also NOT implement your own algorithm and use bisect_left directly

import bisect

class A:
    def __getitem__(self, k):
        return 0 if page_exists(k) else 1

def last_page(lo=0, hi=100):
    return bisect.bisect_left(A(), 1, lo=lo, hi=hi) - 1

gradlon93

deanhystad

gradlon93

Gribouillis