Thread: Encoding Problem? (/thread-33081.html), Python Forum (https://python-forum.io), Forum: Web Scraping & Web Development
Encoding Problem? - Konlork - Mar-27-2021

Hi all,
I have an issue crawling results from Google. The first results are extracted correctly, but the 8th result throws an exception. This is my spider:

    # GoogleSpider
    import scrapy
    from scrapy.linkextractors import LinkExtractor


    class GoogleSpider(scrapy.Spider):
        name = "GoogleSpider"
        start_urls = ["https://www.google.com/search?q=journal+dev"]

        def parse(self, response):
            xlink = LinkExtractor()
            link_list = []
            link_text = []
            divs = response.xpath('//div')
            text_list = []
            # Keep only long text nodes found directly inside <div> elements
            for span in divs.xpath('text()'):
                if len(str(span.get())) > 100:
                    text_list.append(span.get())
            # Keep links with a long representation or with 'Journal' in their text
            for link in xlink.extract_links(response):
                if len(str(link)) > 200 or 'Journal' in link.text:
                    print(len(str(link)), link.text, link, "\n")
                    link_list.append(link)
                    link_text.append(link.text)
            # Pad text_list with blanks so it is as long as link_text
            for i in range(len(link_text) - len(text_list)):
                text_list.append(" ")

The link that causes the error has link.text = "_ElementUnicodeResult: Pankaj Kumar (@JournalDev) | টুইটার - Twittertwitter.com › journaldev"

The error says:

    UnicodeEncodeError: 'charmap' codec can't encode characters in position 29-34: character maps to <undefined>

Can anyone help me with that?
Thx, kon

RE: Encoding Problem? - snippsat - Mar-29-2021

If I run your spider I don't get any error (Python 3.9, Scrapy 2.4.1). Your response object gets the wrong encoding (charmap). That happens for me too, but I don't get an error. Best, of course, is if the response is utf-8. If I start the Scrapy shell:

    scrapy shell -L INFO https://python-forum.io/
    >>> response.encoding
    'utf-8'

Testing your site:

    scrapy shell -L INFO https://www.google.com/search?q=journal+dev
    >>>
    >>> response.encoding
    'cp1252'

So the site probably doesn't declare charmap or cp1252 as its encoding. Scrapy can't find an encoding, so it takes one from the OS. It still works for me if I test with a link that contains Unicode.
    >>> from scrapy.linkextractors import LinkExtractor
    >>>
    >>> response.encoding
    'cp1252'
    >>> xlink = LinkExtractor()
    >>> link = xlink.extract_links(response)
    >>>
    >>> link[8]
    Link(url='https://www.google.com/search?q=journal+dev&ie=UTF-8&source=lnms&tbm=bks&sa=X&ved=0ahUKEwic3rvJsdbvAhWllosKHQNUCWEQ_AUIDSgG', text='Bøker', fragment='', nofollow=False)
    >>> link[8].text
    'Bøker'

If you run an older version of Python, try upgrading.

RE: Encoding Problem? - Konlork - Mar-30-2021

Thanks for your reply. My versions are Python 3.9 and Scrapy 2.4.1, too. I will try to solve the problem in a totally different way and hope that the problems won't appear again.
Thx
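
A note for anyone landing here with the same traceback. The 'charmap' wording is what Python's cp1252 codec reports, and positions 29-34 line up with the six Bengali characters in the link text. So one plausible cause (an assumption, since the full traceback was not posted) is that the exception comes from print() writing to a console that uses cp1252, rather than from Scrapy itself. The sketch below reproduces the error with the standard library only. SAMPLE_TEXT, find_encode_error and console_safe are names made up for this illustration.

```python
# Assumption: the console (sys.stdout) encodes text as cp1252, as on many
# Windows setups. The names below are hypothetical helpers, not Scrapy APIs.

# Start of the link text from the post; the escapes are its six Bengali characters.
SAMPLE_TEXT = (
    "Pankaj Kumar (@JournalDev) | "
    "\u099f\u09c1\u0987\u099f\u09be\u09b0"
    " - Twitter"
)


def find_encode_error(text, encoding="cp1252"):
    """Return the UnicodeEncodeError raised when encoding text, or None."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc
    return None


def console_safe(text, encoding="cp1252"):
    """Replace characters the encoding cannot represent with \\uXXXX escapes."""
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# The error starts at index 29, the first Bengali character.
error = find_encode_error(SAMPLE_TEXT)

# The escaped version contains only characters that cp1252 can encode.
safe = console_safe(SAMPLE_TEXT)
```

If that is indeed the cause, printing console_safe(link.text) instead of link.text, or calling sys.stdout.reconfigure(encoding="utf-8") (Python 3.7+) before printing, should avoid the exception. Setting the environment variable PYTHONIOENCODING=utf-8 before starting the spider is another option.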