Python Forum
web scraping HTML - :(
#1
Hi,

I need some help scraping the HTML code from multiple sites, instead of manually opening 'view source' on each one.
Everything I have found online about this seems to focus only on the 'design' side (different tables etc.), but I need the HTML code combined into one file/page.

A guy sent me this to try out, but it didn't work.

pip install requests

import requests

domains = [
"https://domain1.dk",
"https://domain2.dk",
]

for domain in domains:
  response = requests.get(domain, verify=False)
  response.raise_for_status()
  # Save response.text
For reference here is one of the sites.

view-source:https://www.webdesigner.dk/
#2
A guy keeps telling me that I should just put the following into Python and then it should work.
pip install requests
import requests
domains = ['http://webdesigner.dk', 'http://www.jubii.dk', 'http://www.yahoo.dk']

for domain in domains:
    print(domain)
    response = requests.get(domain)
    print(f"Response data length: {len(response.text)}")
    # Remove # from line below to see  source
    # print(response.text)
But it does not. Does anyone know why?

Image of my screen : https://imgur.com/y7pFRdf
snippsat wrote Apr-04-2021, 11:26 PM:
Added code tags to your post; look at BBCode for how to use them.
#3
You must install Requests from the command line (cmd), not in the interactive interpreter, which shows the >>> prompt.
# Test that pip works
C:\>pip -V
pip 21.0.1 from c:\python39\lib\site-packages\pip (python 3.9)

# Install Requests
C:\>pip install requests --upgrade
.....
Requirement already satisfied: idna<3,>=2.5 in c:\python39\lib\site-packages (from requests) (2.10)

C:\>
Raw source like this is often not very useful, as CSS and JavaScript can be mixed in.
Look at Web-Scraping part-1 and part-2.
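
To illustrate the point about CSS and JavaScript being mixed into the raw source, here is a minimal sketch that keeps only the visible text. It uses Python's standard-library html.parser, which is not necessarily the approach the linked tutorials take; the VisibleText class and the visible_text helper are names made up for this example.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Hypothetical helper: return the page text with CSS and JS removed."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling visible_text(response.text) on a fetched page should then give the readable text without the style and script blocks. For real pages a dedicated parser such as BeautifulSoup is likely to be more robust.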
#4
Thank you for your response.

I found out I needed pip, BS4 and Requests installed, so that part is done.

I couldn't get the code further up to work, so I found something different, which works for scraping one website but not multiple. What is wrong here?

import requests
import bs4
res = requests.get('https://webdesigner.dk','https://www.dk4.dk/item/4128-persondatapolitik')
type(res)
res.text
#5
(Apr-05-2021, 12:46 AM)Kingoman Wrote: What is wrong here?
You cannot do it like that, and the code you were given does work.
You could add some code so it saves the result as an .html file.
import requests

domains = ['https://webdesigner.dk', 'https://www.dk4.dk/item/4128-persondatapolitik']
for domain in domains:
    print(domain)
    response = requests.get(domain)
    print(f"Response data length: {len(response.text)}")
    # Remove # from line below to see source
    # print(response.text)

with open('source.html', 'w') as f:
    f.write(response.text.strip())
#6
Thank you for your reply, but this is not working either.

I only get one of the two sites 'scraped' when I type response.text,
and I get this error after entering the last part of your code:

Traceback (most recent call last):
  File "<pyshell#14>", line 2, in <module>
    f.write(response.text.strip())
  File "C:\Users\Kim\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2190' in position 677608: character maps to <undefined>
#7
Change to this.
with open('source.html', 'w', encoding='utf-8') as f:
    f.write(response.text.strip())
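
As background on why the encoding argument matters: the traceback shows the file being written through the cp1252 codec, which has no mapping for the character '\u2190' (a left arrow), while UTF-8 can encode any Unicode character. A small sketch of the difference, using a made-up one-line page and a temporary file instead of the real site:

```python
import os
import tempfile

# Made-up sample; U+2190 is the character named in the traceback.
html = "<p>back \u2190</p>"

# cp1252 cannot represent U+2190, so encoding fails.
try:
    html.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# With encoding='utf-8' the same text is written without error.
path = os.path.join(tempfile.mkdtemp(), "source.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(html)

# Read it back with the same encoding to confirm nothing was lost.
with open(path, encoding="utf-8") as f:
    round_trip = f.read()

print(cp1252_ok, round_trip == html)  # prints: False True
```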
#8
I get two data lengths, see below:

Response data length: 852038
Response data length: 851385

But I still only see one site when looking at the actual HTML code: the last one mentioned.
#9
Yes, I forgot that it must be append mode (a), so change from w to a.
#10
Same result.

It only gets the HTML code from one site, the last one mentioned.
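
A likely explanation, going by the code as posted in #5: the with open(...) block sits after the for loop rather than inside it, so by the time it runs, response holds only the last page. Switching from w to a does not change that, because there is still just one write. Below is a minimal sketch with the write moved inside the loop. The function names save_combined and fetch_with_requests, the timeout value, and the HTML comment separator are assumptions made for this example, not something from the thread.

```python
def fetch_with_requests(url):
    """Download one page and return its HTML as text."""
    # Imported here so the rest of the sketch runs without Requests installed.
    import requests
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def save_combined(urls, out_path, fetch=fetch_with_requests):
    """Fetch each URL and write all pages into one UTF-8 file."""
    # Open the file once, then write inside the loop so every page is saved.
    with open(out_path, "w", encoding="utf-8") as f:
        for url in urls:
            html = fetch(url)
            f.write(f"<!-- source: {url} -->\n")  # marks where each page starts
            f.write(html.strip())
            f.write("\n\n")
```

With the two URLs from the thread this would be called as save_combined(['https://webdesigner.dk', 'https://www.dk4.dk/item/4128-persondatapolitik'], 'source.html'). The fetch parameter is only there so the function can be tried with a stand-in instead of a real download.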

