Python Forum
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd2 in position 16: invalid cont
#1
Hello. I have a lot of ANSI docx files. I wrote this Python code, and I get a codec error:

import re
import os
from pathlib import Path
from docx import Document
from docx.shared import Inches
import sys
from docx2pdf import convert

# The location where the files are located
input_path = r'c:\Folder7\input'
# The location where we will write the PDF files
output_path = r'c:\Folder7\output'
# Create the folder structure if it does not exist
os.makedirs(output_path, exist_ok=True)

# Check that the folder exists
directory_path = Path(input_path)
if directory_path.exists() and directory_path.is_dir():
    print(directory_path, "exists")
else:
    print(directory_path, "is invalid")
    sys.exit(1)

for file_path in directory_path.glob("*"):
    # file_path is a Path object

    print("Processing file:", file_path)
    document = Document()
    # file_path.name is the name of the file as str without the Path
    document.add_heading(file_path.name, 0)

    file_content = file_path.read_text(encoding='UTF-8')
    document.add_paragraph(file_content)

    # build the new path where we store the files
    output_file_path = os.path.join(output_path, file_path.name + ".pdf")

    document.save(output_file_path)
    print("Converted file:", file_path, "to:", output_file_path)
This is the error:

Traceback (most recent call last):
  File "D:\Convert docx to pdf.py", line 32, in <module>
    file_content = file_path.read_text(encoding='UTF-8')
  File "C:\Program Files\Python39\lib\pathlib.py", line 1133, in read_text
    return f.read()
  File "C:\Program Files\Python39\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd2 in position 16: invalid continuation byte
Can anyone update my code, so that it can read ANSI and save as UTF-8 into PDF?
#2
You tried to read the file with encoding='UTF-8' and it failed, which indicates that the document is not utf8-encoded. You could try another encoding, such as 'ascii' or 'latin-1' or 'cp1252'. You could also install the chardet module and use it to detect your file's encoding.

ANSI is not an encoding, it is a non-profit organization.
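
The suggestion above, trying other encodings, can be sketched as a small helper that attempts each candidate in turn. This is only a minimal sketch under stated assumptions: the names `decode_with_fallback`, `read_text_with_fallback` and `CANDIDATE_ENCODINGS` are hypothetical, and the candidate list is a guess, not something known about the files in question. Note that 'latin-1' maps every possible byte to a character, so it never raises; it has to come last, and a successful decode does not prove the guess was right.

```python
from pathlib import Path

# Hypothetical candidate list (an assumption, adjust to your data).
# Order matters: strict encodings first, the catch-all 'latin-1' last.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(data: bytes, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding) for the first encoding that decodes `data`."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


def read_text_with_fallback(path: Path, encodings=CANDIDATE_ENCODINGS):
    """Read a file as bytes, then decode it with the first encoding that works."""
    return decode_with_fallback(path.read_bytes(), encodings)
```

In the loop from the first post, `file_path.read_text(encoding='UTF-8')` could then be replaced by something like `file_content, used = read_text_with_fallback(file_path)`, with `used` telling you which encoding was picked.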
#3
(Mar-26-2023, 09:05 AM)Gribouillis Wrote: You tried to read the file with encoding='UTF-8' and it failed, which indicates that the document is not utf8-encoded. You could try another encoding, such as 'ascii' or 'latin-1' or 'cp1252'. You could also install the chardet module and use it to detect your file's encoding.

ANSI is not an encoding, it is a non-profit organization.

Thanks. Can you update my code so that it works with the chardet module? It doesn't work if I simply change the encoding name.
#4
(Mar-26-2023, 10:37 AM)Melcu54 Wrote: Can you update my code so that it works with the chardet module?
No, but you could run the following in a terminal or command prompt to guess your file's encoding:
python -m chardet.cli.chardetect FILENAME
For example, I have a file named 'ham', and here is the output of this command:
Output:
λ python -m chardet.cli.chardetect ham
ham: ascii with confidence 1.0
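
For reference, the same guess can be made from inside a script. The sketch below assumes the third-party chardet package is installed (`pip install chardet`) and relies on its `chardet.detect` function, which takes bytes and returns a dict containing 'encoding' and 'confidence' keys. The wrapper name `guess_encoding` and its `detect` parameter are hypothetical, added only so that a different detector can be swapped in.

```python
from pathlib import Path


def guess_encoding(path, detect=None):
    """Guess a file's encoding from its raw bytes.

    `detect` is any callable that takes bytes and returns a dict with
    'encoding' and 'confidence' keys. It defaults to chardet.detect.
    """
    if detect is None:
        import chardet  # third-party package, assumed to be installed
        detect = chardet.detect
    result = detect(Path(path).read_bytes())
    # 'encoding' may be None when the detector cannot make a guess.
    return result["encoding"], result["confidence"]
```

A call such as `encoding, confidence = guess_encoding(file_path)` gives a value that could be passed on to `file_path.read_text(encoding=encoding)`. The result is a statistical guess, so a low confidence or an encoding of `None` should be treated as "unknown" rather than trusted.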