Python Forum

Full Version: Encoding Problem?
Hi all,

I have an issue crawling results from Google. The first results are extracted correctly, but the 8th result throws an exception.
This is my spider:

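import scrapy
from scrapy.linkextractors import LinkExtractor


class GoogleSpider(scrapy.Spider):
    name = "GoogleSpider"
    start_urls = [""]

    def parse(self, response):
        xlink = LinkExtractor()
        divs = response.xpath('//div')
        # Look at the direct text nodes of every <div> on the page.
        for span in divs.xpath('text()'):
            if len(str(span.get())) > 100:
                for link in xlink.extract_links(response):
                    if len(str(link)) > 200 or 'Journal' in link.text:
                        # link_text and text_list are not defined in this snippet;
                        # presumably they are created elsewhere in the spider.
                        # Pad text_list with blanks until it is as long as link_text.
                        for i in range(len(link_text) - len(text_list)):
                            text_list.append(" ")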
The link that causes the error has link.text = "_ElementUnicodeResult: Pankaj Kumar (@JournalDev) | টুইটার - › journaldev"

The error says: UnicodeEncodeError: 'charmap' codec can't encode characters in position 29-34: character maps to <undefined>

Can anyone help me with that?

If I run your spider I don't get any error (Python 3.9, Scrapy 2.4.1).
Your response object gets the wrong encoding, charmap. That happens for me too, but I don't get an error.
It is of course best if the response is UTF-8. If I start the Scrapy shell:
scrapy shell -L INFO

>>> response.encoding
Testing your site:
scrapy shell -L INFO
>>> response.encoding
So the site probably doesn't declare charmap or cp1252 as its encoding.
Scrapy can't find an encoding, so it takes one from the OS.
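
The error message from the question can be reproduced without Scrapy, which helps to confirm that the text itself is fine and only the target encoding is the problem. The following is a minimal sketch, not code from the thread: `try_encode` is a hypothetical helper, and the string is the link text from the question without the `_ElementUnicodeResult: ` prefix. It assumes the OS default is cp1252, which Python implements with its 'charmap' codec.

```python
# Link text from the question (assumed to be the value that was being encoded).
text = "Pankaj Kumar (@JournalDev) | টুইটার - › journaldev"


def try_encode(value, encoding):
    """Hypothetical helper: return (bytes, None) on success, (None, error) on failure."""
    try:
        return value.encode(encoding), None
    except UnicodeEncodeError as exc:
        return None, exc


# cp1252 has no mapping for the Bengali characters, so this attempt fails.
cp1252_bytes, cp1252_error = try_encode(text, "cp1252")

# UTF-8 can represent every Unicode character, so this attempt succeeds.
utf8_bytes, utf8_error = try_encode(text, "utf-8")

print("cp1252:", cp1252_error)
print("utf-8 :", utf8_error)
```

If the cp1252 attempt fails and the UTF-8 attempt succeeds, the fix is to make sure the text is encoded as UTF-8 wherever it is printed or written, rather than changing the spider logic.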

It still works for me if I test with links that contain Unicode.
>>> from scrapy.linkextractors import LinkExtractor
>>> response.encoding

>>> xlink = LinkExtractor()
>>> link = xlink.extract_links(response)
>>> link[8]
Link(url='', text='Bøker', fragment='', nofollow=False)
>>> link[8].text
If you are running an older version of Python, try upgrading.
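
The thread does not show where the exception is raised. Assuming it happens when the extracted link text is written to a file opened with the OS default encoding, one possible workaround is to pass the encoding explicitly. This is only a sketch under that assumption: `write_text`, the file name `links.txt`, and the sample string are made up for illustration.

```python
import os
import tempfile

# Sample text with non-ASCII characters (illustrative, combined from the thread's examples).
text = "Bøker | টুইটার"


def write_text(path, value, encoding="utf-8"):
    """Hypothetical helper: write text with an explicit encoding instead of the OS default."""
    with open(path, "w", encoding=encoding) as handle:
        handle.write(value)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "links.txt")
    write_text(path, text)
    # Read the file back with the same encoding to check nothing was lost.
    with open(path, encoding="utf-8") as handle:
        round_trip = handle.read()

print(round_trip == text)
```

If the error comes from `print()` on a Windows console instead, setting the environment variable `PYTHONIOENCODING=utf-8` before starting Python may be worth trying.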
Thanks for your reply.

My versions are Python 3.9 and Scrapy 2.4.1, too. I will try to solve the problem in a totally different way and hope that the problem won't appear again.
