(Apr-19-2018, 04:41 PM)gentoobob Wrote: [ -> ]Because the url at the end has a page number. I need a loop that starts at page one then goes to page two, three, etc until no more pages are left.
That setup works fine with a for loop to generate URLs.
start = 1
stop = 5
for page in range(start, stop):
    url = 'https://10.10.10.0/vmrest/users?rowsPerPage=2000&pageNumber={}'.format(page)
    print(url)
Output:
https://10.10.10.0/vmrest/users?rowsPerPage=2000&pageNumber=1
https://10.10.10.0/vmrest/users?rowsPerPage=2000&pageNumber=2
https://10.10.10.0/vmrest/users?rowsPerPage=2000&pageNumber=3
https://10.10.10.0/vmrest/users?rowsPerPage=2000&pageNumber=4
There is a similar example in Web-scraping part-2, and also a demo with
concurrent.futures to speed it up.
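To give an idea of the concurrent.futures approach without reproducing the whole tutorial, here is a minimal sketch. The actual HTTP call is stubbed out (the worker just returns the url it builds) so the sketch runs without network access; in real use the worker would call requests.get on that url.

```python
from concurrent.futures import ThreadPoolExecutor

BASE = 'https://10.10.10.0/vmrest/users?rowsPerPage=2000&pageNumber={}'

def fetch(page):
    """Build the url for one page; a real version would also do the request."""
    url = BASE.format(page)
    # In real use: return requests.get(url, verify=False).text
    return url  # placeholder so the sketch runs offline

# Threads fetch several pages at the same time; map() keeps the page order
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(fetch, range(1, 5)))

for r in results:
    print(r)
```

Threads suit this job because each page request mostly waits on the network rather than the CPU.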
Ok great! Thank you.
Let me ask you: will I have to have a separate "soup" variable for each URL, or can I cram all the pages into one variable? When it visits each page, I want it to add on to what the last page put into the soup variable.
(Apr-19-2018, 05:43 PM)gentoobob Wrote: [ -> ]Let me ask you, will I have to have a separate "soup" variable for each URL or can I cram all the pages into one variable?
You pass the url to Requests and BeautifulSoup in the same loop.
Example:
import requests
from bs4 import BeautifulSoup

start = 1
stop = 5
for page in range(start, stop):
    url = 'https://10.10.10.0/vmrest/users?rowsPerPage=2000&pageNumber={}'.format(page)
    url_get = requests.get(url)
    soup = BeautifulSoup(url_get.content, 'lxml')
    foo = soup.find('do scraping')
    # save foo
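To "add on to the last page", keep one list outside the loop and extend it on every pass instead of reassigning a per-page soup variable. A minimal sketch of the idea, with the scraping step replaced by stand-in strings so it runs offline:

```python
# One list holds the parsed rows from every page; each pass through the
# loop extends it rather than overwriting a per-page variable.
all_rows = []

for page in range(1, 5):
    # In real use:
    #   soup = BeautifulSoup(requests.get(url).content, 'lxml')
    #   rows = [tag.get_text() for tag in soup.find_all('alias')]
    rows = ['user{}-{}'.format(page, i) for i in range(2)]  # stand-in data
    all_rows.extend(rows)

print(len(all_rows))  # rows from all four pages in one place
```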
Quote: or can I cram all the pages into one variable?
Not into a single variable, but into a data structure like e.g. a list, which you can use to collect the urls.
from pprint import pprint

urls = []
start = 1
stop = 5
for page in range(start, stop):
    url = 'https://10.10.10.0/vmrest/users?rowsPerPage=2000&pageNumber={}'.format(page)
    urls.append(url)
pprint(urls)
Output:
['https://10.10.10.0/vmrest/users?rowsPerPage=2000&pageNumber=1',
'https://10.10.10.0/vmrest/users?rowsPerPage=2000&pageNumber=2',
'https://10.10.10.0/vmrest/users?rowsPerPage=2000&pageNumber=3',
'https://10.10.10.0/vmrest/users?rowsPerPage=2000&pageNumber=4']
Ok... so far I have this (below). I need to take my soup parsing data (alias and dtmf) and write it to a CSV file with a date-time stamp in the file name, instead of printing it to the screen.
import requests
import urllib3
urllib3.disable_warnings()
from bs4 import BeautifulSoup

page = 1
while page < 6:
    url = 'https://10.10.10.1/vmrest/users?rowsPerPage=2000&pageNumber=' + str(page)
    request_page = requests.get(url, verify=False, auth=('User', 'Pass'))
    soup = BeautifulSoup(request_page.text, 'lxml')
    alias = soup.find_all('alias')
    dtmf = soup.find_all('dtmfaccessid')
    for i in range(len(alias)):
        print(alias[i].get_text(), end=' ')
        print(dtmf[i].get_text())
    page += 1
Will check out your tutorial too. Thanks for the help.
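The CSV-with-timestamp part of the question can be sketched with the stdlib csv and datetime modules. The rows here are hypothetical stand-ins; in the loop above they would come from alias[i].get_text() and dtmf[i].get_text():

```python
import csv
from datetime import datetime

# Hypothetical scraped pairs; in the real loop these would be
# (alias[i].get_text(), dtmf[i].get_text()) collected across all pages
rows = [('jdoe', '1001'), ('asmith', '1002')]

# Timestamped filename, e.g. users_2018-04-19_17-43-00.csv
filename = 'users_{}.csv'.format(datetime.now().strftime('%Y-%m-%d_%H-%M-%S'))

# newline='' is the documented way to open a file for the csv module,
# so it controls line endings itself
with open(filename, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['alias', 'dtmfaccessid'])  # header row
    writer.writerows(rows)
```

Open the file once before the while loop (or collect all rows first and write at the end) so each page appends to the same CSV rather than overwriting it.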