web scraping HTML - :(
Python Forum (https://python-forum.io), Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html), Thread: /thread-33182.html
web scraping HTML - :( - Kingoman - Apr-04-2021

Hi, I need some help to scrape the HTML code from multiple sites instead of manually going to 'view source'. Everything I have found online about this seems to focus only on the 'design', different tables etc., but I need the HTML code combined into one file/page.

A guy sent me this to try out, but it didn't work out.

    pip install requests

    import requests

    domains = [
        "https://domain1.dk",
        "https://domain2.dk",
    ]

    for domain in domains:
        response = requests.get(domain, verify=False)
        response.raise_for_status()
        # Save response.text

For reference here is one of the sites. view-source:https://www.webdesigner.dk/

RE: web scraping HTML - :( - Kingoman - Apr-04-2021

I keep getting the answer from a guy that I should just put the following in Python and then it should work.

    Pip install request

    import requests

    domains = ['http://webdesigner.dk', 'http://www.jubii.dk', 'http://www.yahoo.dk']

    for domain in domains:
        print(domain)
        response = requests.get(domain)
        print(f"Response data length: {len(response.text)}")
        # Remove # from line below to see source
        # print(response.text)

But it does not. Anyone know why? Image of my screen: https://imgur.com/y7pFRdf

RE: web scraping HTML - :( - snippsat - Apr-04-2021

You must install Requests from the command line (cmd), and not in the interactive interpreter, which has the >>> prompt.

    # Test that pip works
    C:\>pip -V
    pip 21.0.1 from c:\python39\lib\site-packages\pip (python 3.9)

    # Install Requests
    C:\>pip install requests --upgrade
    .....
    Requirement already satisfied: idna<3,>=2.5 in c:\python39\lib\site-packages (from requests) (2.10)

    C:\>

Often raw source like this is not very useful, as CSS and JavaScript can be mixed in. Look at Web-Scraping part-1 and part-2.

RE: web scraping HTML - :( - Kingoman - Apr-05-2021

Thank you for your response. I found out I needed pip, BS4 and Requests installed, so that part is done.
I couldn't get my code further above to work, so I found something different, which works for scraping one website, but not multiple. What is wrong here?

    import requests
    import bs4

    res = requests.get('https://webdesigner.dk','https://www.dk4.dk/item/4128-persondatapolitik')
    type(res)
    res.text

RE: web scraping HTML - :( - snippsat - Apr-05-2021

(Apr-05-2021, 12:46 AM)Kingoman Wrote: What is wrong here?

You cannot do it like that, and the code you have gotten works. I could add some code so it saves as an .html file.

    import requests

    domains = ['https://webdesigner.dk', 'https://www.dk4.dk/item/4128-persondatapolitik']
    for domain in domains:
        print(domain)
        response = requests.get(domain)
        print(f"Response data length: {len(response.text)}")
        # Remove # from line below to see source
        # print(response.text)
        with open('source.html', 'w') as f:
            f.write(response.text.strip())

RE: web scraping HTML - :( - Kingoman - Apr-05-2021

Thank you for your reply, but this is not working either. I only get one of the two sites 'scraped' when I type response.text, and I get this text after I enter the last part of your code:

    Traceback (most recent call last):
      File "<pyshell#14>", line 2, in <module>
        f.write(response.text.strip())
      File "C:\Users\Kim\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u2190' in position 677608: character maps to <undefined>

RE: web scraping HTML - :( - snippsat - Apr-05-2021

Change to this.

    with open('source.html', 'w', encoding='utf-8') as f:
        f.write(response.text.strip())

RE: web scraping HTML - :( - Kingoman - Apr-05-2021

I get two data lengths, see below:

    Response data length: 852038
    Response data length: 851385

But I still only get one when looking at the actual HTML code, the last mentioned.

RE: web scraping HTML - :( - snippsat - Apr-05-2021

Yes, I forgot that it must be append mode (a), so change from w to a.
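To see what the two file modes and the encoding argument do without downloading anything, here is a small sketch. The strings in `pages` are made-up stand-ins for downloaded HTML, and `write_each` is a hypothetical helper that reopens the file once per page, the way a `with open(...)` inside the loop would.

```python
import tempfile
from pathlib import Path

# Stand-in HTML, not real downloads; the first page contains '\u2190',
# the character from the traceback above.
pages = ["<html>first \u2190</html>", "<html>second</html>"]


def write_each(path, mode):
    """Reopen the file for every page, then return what ended up in it."""
    for page in pages:
        with open(path, mode, encoding="utf-8") as f:
            f.write(page + "\n")
    return Path(path).read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    overwritten = write_each(Path(tmp) / "w.html", "w")  # 'w' truncates on each open
    appended = write_each(Path(tmp) / "a.html", "a")     # 'a' keeps earlier content

print(repr(overwritten))  # only the second page survives
print(repr(appended))     # both pages, in order

# The Windows default codec from the traceback has no mapping for '\u2190'.
try:
    "\u2190".encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False
print("cp1252 can encode the arrow:", cp1252_ok)
```

One thing to keep in mind with append mode: running the script a second time adds to whatever is already in the file, so an old source.html may need deleting first.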
RE: web scraping HTML - :( - Kingoman - Apr-05-2021

Same result. It will only get the HTML code from one, the last mentioned.
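A possible explanation, offered as a guess from the traceback rather than anything confirmed in the thread: the error points at `File "<pyshell#14>", line 2`, which suggests the `with open(...)` block was typed into the IDLE shell as its own statement after the loop had already finished. At that point `response` only holds the last site, so only that one is written, whichever mode is used. Typing `response.text` after the loop shows only the last site for the same reason.

Below is a minimal sketch of one way to get everything into a single file, assuming the write happens once per site inside the loop. The names `fetch_with_requests` and `save_sources`, the `timeout=30` value and the HTML comment used as a separator are assumptions made for illustration, not part of the code posted above.

```python
from pathlib import Path


def fetch_with_requests(url):
    """Download one page and return its HTML as text."""
    # Third-party package; install it from cmd with `pip install requests`.
    # Imported here so the file-writing part also runs without Requests.
    import requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def save_sources(urls, out_path, fetch=fetch_with_requests):
    """Fetch every URL and write all the HTML into one UTF-8 file."""
    out = Path(out_path)
    # Open the file once, before the loop, so 'w' cannot wipe earlier pages.
    with out.open("w", encoding="utf-8") as f:
        for url in urls:
            html = fetch(url)
            # The writes sit inside the loop, so they run once per site.
            f.write(f"<!-- source: {url} -->\n")
            f.write(html.strip())
            f.write("\n\n")
            print(f"{url}: {len(html)} characters")
    return out
```

Saved as a .py file and run from cmd instead of being pasted piece by piece into the shell, it could be called as `save_sources(domains, "source.html")` with the `domains` list from the earlier post. The combined file is several pages back to back rather than one valid HTML document, which is fine for reading the source but not for opening in a browser.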