web scraping HTML - :(

Kingoman · (This post was last modified: Apr-05-2021, 09:47 AM by buran.)

Hi,

I need some help to scrape the HTML code from multiple sites instead of manually going to 'view source' on each one.
Everything I have found online about this seems to focus only on extracting parts of the page design (different tables etc.), but I need the full HTML code combined into one file/page.

A guy sent me this to try out, but it didn't work out.

pip install requests

import requests

domains = [
"https://domain1.dk",
"https://domain2.dk",
]

for domain in domains:
  response = requests.get(domain, verify=False)
  response.raise_for_status()
  # Save response.text

For reference here is one of the sites.

view-source:https://www.webdesigner.dk/

Kingoman

A guy keeps telling me that I should just put the following in Python and then it should work.

Pip install request
import requests
domains = ['http://webdesigner.dk', 'http://www.jubii.dk', 'http://www.yahoo.dk']

for domain in domains:
    print(domain)
    response = requests.get(domain)
    print(f"Response data length: {len(response.text)}")
    # Remove # from line below to see  source
    # print(response.text)

But it does not work. Does anyone know why?

Image of my screen : https://imgur.com/y7pFRdf

snippsat wrote Apr-04-2021, 11:26 PM:
Added code tags to your post; look at BBCode to see how to use them.

***snippsat*** · (This post was last modified: Apr-04-2021, 11:36 PM by snippsat.)

You must install Requests from the command line (cmd), not in the interactive interpreter, which shows the >>> prompt.

# Test that pip works
C:\>pip -V
pip 21.0.1 from c:\python39\lib\site-packages\pip (python 3.9)

# Install Requests
C:\>pip install requests --upgrade
.....
Requirement already satisfied: idna<3,>=2.5 in c:\python39\lib\site-packages (from requests) (2.10)

C:\>
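
To confirm from inside Python that the install landed in the interpreter you are actually running, a small check like the one below can help. It is only a sketch: `is_installed` is a hypothetical helper name, and it relies on the standard-library `importlib.util.find_spec`.

```python
import importlib.util
import sys


def is_installed(module_name: str) -> bool:
    """Return True if the running interpreter can find the top-level module."""
    return importlib.util.find_spec(module_name) is not None


print(sys.executable)            # which Python is actually running
print(is_installed("requests"))  # False means Requests is missing for this Python
```

If the second line prints False, running `python -m pip install requests` in cmd (not at the >>> prompt) should install the package for that same interpreter.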

The raw source like this is often not very useful, as CSS and JavaScript can be mixed in.
Look at Web-Scraping part-1 and part-2.

Kingoman · (This post was last modified: Apr-05-2021, 12:46 AM by Kingoman.)

Thank you for your response.

I found out I needed pip, bs4 and requests installed, so that part is done.

I couldn't get the code further up to work, so I found something different. It works for scraping one website, but not multiple. What is wrong here?

import requests
import bs4
res = requests.get('https://webdesigner.dk','https://www.dk4.dk/item/4128-persondatapolitik')
type(res)
res.text

***snippsat*** · Apr-05-2021, 12:58 AM

(Apr-05-2021, 12:46 AM)Kingoman Wrote: What is wrong here?

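You cannot do it like that: requests.get() takes a single URL, and a second positional argument is treated as query parameters, not as another URL. The code you were given does work.
You could add some code so that it saves the result as a .html file.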

import requests

domains = ['https://webdesigner.dk', 'https://www.dk4.dk/item/4128-persondatapolitik']
for domain in domains:
    print(domain)
    response = requests.get(domain)
    print(f"Response data length: {len(response.text)}")
    # Remove # from line below to see source
    # print(response.text)

with open('source.html', 'w') as f:
    f.write(response.text.strip())

Kingoman · Apr-05-2021, 01:30 AM

Thank you for your reply, but this is not working either.

I only get one of the two sites scraped when I type response.text,
and I get this error after I enter the last part of your code:

Traceback (most recent call last):
  File "<pyshell#14>", line 2, in <module>
    f.write(response.text.strip())
  File "C:\Users\Kim\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2190' in position 677608: character maps to <undefined>

***snippsat*** · Apr-05-2021, 01:34 AM

Change to this.

with open('source.html', 'w', encoding='utf-8') as f:
    f.write(response.text.strip())
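
The traceback points at cp1252, the encoding that open() fell back to on that Windows machine, which has no byte for the character '\u2190' (a left arrow). A minimal sketch of the difference, using only the standard library; `can_encode` is a made-up helper name for illustration:

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text can be written with encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


arrow = "\u2190"  # the character named in the traceback

print(can_encode(arrow, "cp1252"))  # cp1252 has no mapping for this character
print(can_encode(arrow, "utf-8"))   # UTF-8 can represent any Unicode character
```

Passing encoding='utf-8' to open() makes the file use the second case regardless of the system default.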

Kingoman · (This post was last modified: Apr-05-2021, 01:49 AM by Kingoman.)

I get two data lengths, see below:

Response data length: 852038
Response data length: 851385

But I still see only one site when looking at the actual HTML code: the last one in the list.

***snippsat*** · (This post was last modified: Apr-05-2021, 02:06 AM by snippsat.)

Yes, I forgot that it must be append mode (a), so change from w to a.

Kingoman · Apr-05-2021, 02:12 AM

Same result.

It only gets the HTML code from one site: the last one in the list.
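
A likely explanation, assuming the snippet was run exactly as posted: the `with open(...)` block sits after the `for` loop, so by the time it runs, `response` holds only the last page, and switching `w` to `a` does not change that. A minimal sketch that writes inside the loop instead; `fetch_html` and `save_sources` are hypothetical helper names, and the `timeout=10` value and the HTML comment separating the pages are additions for illustration:

```python
from typing import Callable, Iterable


def fetch_html(url: str) -> str:
    """Download one page and return its HTML as text (needs network access)."""
    import requests  # third-party; imported here so the rest runs without it

    response = requests.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()
    return response.text


def save_sources(urls: Iterable[str], out_path: str,
                 fetch: Callable[[str], str] = fetch_html) -> int:
    """Write the HTML of every URL into one file; return the number of pages."""
    count = 0
    # Open the file once, before the loop, with an explicit encoding.
    with open(out_path, "w", encoding="utf-8") as f:
        for url in urls:
            html = fetch(url)
            # The write happens inside the loop, once per page.
            f.write(f"<!-- source: {url} -->\n")
            f.write(html.strip() + "\n\n")
            count += 1
    return count
```

With the two URLs from this thread, the call would be `save_sources(['https://webdesigner.dk', 'https://www.dk4.dk/item/4128-persondatapolitik'], 'source.html')`. The `fetch` parameter exists only so the function can be tried with a stand-in instead of real downloads.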
