Python Forum

Full Version: Encoding Problem?
Hi all,

I have an issue crawling results from Google. The first results are extracted correctly, but the 8th result throws an exception.
This is my Spider:

# GoogleSpider
import scrapy
from scrapy.linkextractors import LinkExtractor


class GoogleSpider(scrapy.Spider):
    name = "GoogleSpider"
    start_urls = ["https://www.google.com/search?q=journal+dev"]

    def parse(self, response):
        xlink = LinkExtractor()
        link_list = []
        link_text = []
        divs = response.xpath('//div')
        text_list = []
        # Collect longer text nodes, then pair them with the extracted links
        for span in divs.xpath('text()'):
            if len(str(span.get())) > 100:
                text_list.append(span.get())
                for link in xlink.extract_links(response):
                    if len(str(link)) > 200 or 'Journal' in link.text:
                        # the UnicodeEncodeError is raised by this print()
                        # when the console cannot encode the link text
                        print(len(str(link)), link.text, link, "\n")
                        link_list.append(link)
                        link_text.append(link.text)
                        # pad text_list so both lists stay the same length
                        for i in range(len(link_text) - len(text_list)):
                            text_list.append(" ")
The link that causes the error has link.text = "_ElementUnicodeResult: Pankaj Kumar (@JournalDev) | টুইটার - Twittertwitter.com › journaldev"

The error says: UnicodeEncodeError: 'charmap' codec can't encode characters in position 29-34: character maps to <undefined>

Can anyone help me with that?

thx,
kon
If I run your spider I don't get any error; Python 3.9, Scrapy 2.4.1.
Your response object gets the wrong encoding (charmap); that happens for me too, but I don't get an error.
Best is of course if the response is utf-8. If I start the Scrapy shell,
scrapy shell -L INFO https://python-forum.io/

>>> response.encoding
'utf-8'
Testing your site.
scrapy shell -L INFO https://www.google.com/search?q=journal+dev
>>>
>>> response.encoding
'cp1252'
So the site probably doesn't declare charmap or cp1252 as its encoding;
Scrapy can't detect the encoding, so it takes one from the OS.
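If you want to force utf-8 yourself, a minimal sketch in the Scrapy shell (an assumption on my part that the body really is utf-8, which Google's pages normally are) is to rebuild the response with an explicit encoding via replace():
>>> response.encoding
'cp1252'
>>> # build a new response object that is decoded as utf-8
>>> response = response.replace(encoding='utf-8')
>>> response.encoding
'utf-8'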

It still works for me if I test with links that contain Unicode.
>>> from scrapy.linkextractors import LinkExtractor
>>>
>>> response.encoding
'cp1252'

>>> xlink = LinkExtractor()
>>> link = xlink.extract_links(response)
>>>
>>> link[8]
Link(url='https://www.google.com/search?q=journal+dev&ie=UTF-8&source=lnms&tbm=bks&sa=X&ved=0ahUKEwic3rvJsdbvAhWllosKHQNUCWEQ_AUIDSgG', text='Bøker', fragment='', nofollow=False)
>>> link[8].text
'Bøker'
If you run an older Python version, try to upgrade.
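The UnicodeEncodeError itself usually comes from print() writing to a console that uses a legacy code page (cp1252 on Windows). A minimal workaround sketch, not specific to Scrapy, is to reconfigure stdout before printing; sys.stdout.reconfigure() exists in Python 3.7+:
# Sketch: make print() tolerate characters the console code page can't encode.
import sys

# errors='replace' substitutes unencodable characters
# instead of raising UnicodeEncodeError
sys.stdout.reconfigure(encoding='utf-8', errors='replace')

print('Pankaj Kumar (@JournalDev) | টুইটার - Twitter')  # no longer crashes
Setting the environment variable PYTHONIOENCODING=utf-8, or running with python -X utf8, achieves the same thing without touching the code.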
Thanks for your reply.

My versions are Python 3.9 and Scrapy 2.4.1, too. I will try to solve the problem in a totally different way and hope the problem won't appear again.

Thx