Python Forum

Full Version: web scraping HTML - :(
Pages: 1 2 3
Hi,

I need some help to scrape HTML code from multiple sites instead of manually going to 'view source'.
- Everything I have found online about this seems to focus only on the design, different tables, etc., but I need the HTML code combined into one file/page.

A guy sent me this to try out, but it didn't work.

pip install requests

import requests

domains = [
"https://domain1.dk",
"https://domain2.dk",
]

for domain in domains:
  response = requests.get(domain, verify=False)
  response.raise_for_status()
  # Save response.text
For reference here is one of the sites.

view-source:https://www.webdesigner.dk/
I keep getting the answer from a guy that I should just put the following in Python and then it should work.
pip install requests
import requests
domains = ['http://webdesigner.dk', 'http://www.jubii.dk', 'http://www.yahoo.dk']

for domain in domains:
    print(domain)
    response = requests.get(domain)
    print(f"Response data length: {len(response.text)}")
    # Remove # from line below to see source
    # print(response.text)
But it does not work. Does anyone know why?

Image of my screen : https://imgur.com/y7pFRdf
You must install Requests from the command line (cmd), not in the interactive interpreter, which shows the >>> prompt.
# Test that pip works
C:\>pip -V
pip 21.0.1 from c:\python39\lib\site-packages\pip (python 3.9)

# Install Requests
C:\>pip install requests --upgrade
.....
Requirement already satisfied: idna<3,>=2.5 in c:\python39\lib\site-packages (from requests) (2.10)

C:\>
Raw source like this is often not very useful, as CSS and JavaScript can be mixed in.
Look at Web-Scraping part-1 and part-2.
Thank you for your response.

I found out I needed pip, BS4 and Requests installed, so that part is done.


I couldn't get the code further above to work, so I found something different. It works for scraping one website, but not multiple. What is wrong here?

import requests
import bs4
res = requests.get('https://webdesigner.dk','https://www.dk4.dk/item/4128-persondatapolitik')
type(res)
res.text
(Apr-05-2021, 12:46 AM) Kingoman wrote: What is wrong here?
You cannot do it like that: requests.get() takes one URL per call. The code you were given works.
You could add some code so it saves the result as a .html file.
import requests

domains = ['https://webdesigner.dk', 'https://www.dk4.dk/item/4128-persondatapolitik']
for domain in domains:
    print(domain)
    response = requests.get(domain)
    print(f"Response data length: {len(response.text)}")
    # Remove # from line below to see source
    # print(response.text)

with open('source.html', 'w') as f:
    f.write(response.text.strip())
Thank you for your reply, but this is not working either.

I only get one of the two sites scraped when I type response.text,
and I get this error after entering the last part of your code:

Traceback (most recent call last):
  File "<pyshell#14>", line 2, in <module>
    f.write(response.text.strip())
  File "C:\Users\Kim\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2190' in position 677608: character maps to <undefined>
Change to this.
with open('source.html', 'w', encoding='utf-8') as f:
    f.write(response.text.strip())
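As a small illustration of why the encoding argument matters: on Windows, open() without an encoding often falls back to cp1252, which has no mapping for the '\u2190' character named in the traceback, while UTF-8 can encode it. The variable names below are only for illustration.

```python
# The character reported in the traceback (a left-pointing arrow).
arrow = "\u2190"

# cp1252 has no code point for this character, so encoding fails.
try:
    arrow.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# UTF-8 can represent any Unicode character, so this succeeds.
utf8_bytes = arrow.encode("utf-8")

print(cp1252_ok)
print(utf8_bytes)
```

Passing encoding='utf-8' to open() makes the file use the second behaviour regardless of the system default.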
I get two data lengths, see below:

Response data length: 852038
Response data length: 851385

But I still only see one site when looking at the actual HTML code: the last one listed.
Yes, I forgot that it must be append mode (a), so change from w to a.
Same result.

It still only gets the HTML code from one site, the last one listed.
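For anyone following along: two lengths printed but only the last site in the file is what you would expect if the with open(...) block runs after the for loop has finished, because by then response only holds the last page. Below is a minimal sketch of one way around it, opening the file once and writing inside the loop. The names save_combined and fetch_with_requests, the timeout value and the HTML comment separator are my own assumptions for illustration, not something from the posts above.

```python
def fetch_with_requests(url):
    """Download one page and return its HTML as text (needs the Requests package)."""
    import requests  # imported here so the rest of the sketch runs without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def save_combined(urls, out_path, fetch=fetch_with_requests):
    """Write the HTML of every URL into a single file, one page after another."""
    # Open the file once, in write mode with UTF-8, before the loop starts.
    with open(out_path, "w", encoding="utf-8") as f:
        for url in urls:
            html = fetch(url)
            # Hypothetical separator so each page can be found in the file.
            f.write(f"<!-- source: {url} -->\n")
            f.write(html.strip())
            f.write("\n")


# Example call (not executed here, since it needs network access):
# save_combined(
#     ["https://webdesigner.dk", "https://www.dk4.dk/item/4128-persondatapolitik"],
#     "source.html",
# )
```

The write happens inside the loop, so every page lands in the file. Opening in 'w' mode once before the loop also avoids the file growing on every run, which append mode would do.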