UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd2 in position 16: invalid cont

Melcu54 · Mar-26-2023, 07:41 AM

Hello. I have a lot of ANSI docx files. I wrote this Python code, and I get a codec error:

import re
import os
from pathlib import Path
from docx import Document
from docx.shared import Inches
import sys
from docx2pdf import convert
 
# The location where the files are located
input_path = r'c:\Folder7\input'
# The location where we will write the PDF files
output_path = r'c:\Folder7\output'
# Create the folder structure if it does not exist
os.makedirs(output_path, exist_ok=True)

# Check that the folder exists
directory_path = Path(input_path)
if directory_path.exists() and directory_path.is_dir():
    print(directory_path, "exists")
else:
    print(directory_path, "is invalid")
    sys.exit(1)
 
for file_path in directory_path.glob("*"):
    # file_path is a Path object
 
    print("Processing file:", file_path)
    document = Document()
    # file_path.name is the name of the file as str without the Path
    document.add_heading(file_path.name, 0)
 
    file_content = file_path.read_text(encoding='UTF-8')
    document.add_paragraph(file_content)
 
    # build the new path where we store the files
    output_file_path = os.path.join(output_path, file_path.name + ".pdf")
 
    document.save(output_file_path)
    print("Converted file:", file_path, "to:", output_file_path)

This is the error:

Traceback (most recent call last):
  File "D:\Convert docx to pdf.py", line 32, in <module>
    file_content = file_path.read_text(encoding='UTF-8')
  File "C:\Program Files\Python39\lib\pathlib.py", line 1133, in read_text
    return f.read()
  File "C:\Program Files\Python39\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd2 in position 16: invalid continuation byte

Can anyone update my code so that it can read ANSI files and save them as UTF-8 in a PDF?

**Gribouillis** · (This post was last modified: Mar-26-2023, 09:11 AM by Gribouillis.)

You tried to read the file with encoding='UTF-8' and it failed, which indicates that the document is not utf8-encoded. You could try another encoding, such as 'ascii' or 'latin-1' or 'cp1252'. You could also install the chardet module and use it to detect your file's encoding.

ANSI is not an encoding; it is a non-profit standards organization. On Windows, "ANSI" usually means the system's legacy code page, for example cp1252.
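
A minimal sketch of the "try another encoding" idea, using only the standard library. The helper name `decode_with_fallback` and the candidate list are assumptions made for illustration, not part of the original script. 'ascii' is left out on purpose: ASCII only covers bytes 0x00 to 0x7F, so it cannot decode byte 0xd2 either. 'latin-1' maps every byte to a character and never fails, which is why it comes last; a successful decode does not prove the guess is right.

```python
from typing import Iterable, Tuple


def decode_with_fallback(
    data: bytes,
    encodings: Iterable[str] = ("utf-8", "cp1252", "latin-1"),
) -> Tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes `data`."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Byte 0xd2 followed by a space is invalid UTF-8 but valid cp1252.
sample = b"\xd2 text"
text, used = decode_with_fallback(sample)
print(used, repr(text))
```

In the script above, this would mean reading the raw bytes with `file_path.read_bytes()` and passing them to the helper. That only makes sense for plain text files.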

Melcu54 · Mar-26-2023, 10:37 AM

(Mar-26-2023, 09:05 AM)Gribouillis Wrote: You tried to read the file with encoding='UTF-8' and it failed, which indicates that the document is not utf8-encoded. You could try another encoding, such as 'ascii' or 'latin-1' or 'cp1252'. You could also install the chardet module and use it to detect your file's encoding.

ANSI is not an encoding, it is a non-profit organization.

Thanks. Can you update my code so that it works with the chardet module? It doesn't work if I simply change the encoding name.
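
One possible reason why changing the encoding name alone does not help: a real .docx file is a ZIP archive of XML parts rather than plain text, so `read_text()` cannot return meaningful content under any encoding. The sketch below checks for this using only the standard library. The helper name `looks_like_docx` is hypothetical, and the code assumes the main document part is stored at `word/document.xml`, which is the usual location in Word-generated files.

```python
import zipfile
from pathlib import Path


def looks_like_docx(path: Path) -> bool:
    """Return True if `path` is a ZIP archive containing word/document.xml."""
    if not zipfile.is_zipfile(path):
        # Plain text (or anything else that is not a ZIP container).
        return False
    with zipfile.ZipFile(path) as archive:
        return "word/document.xml" in archive.namelist()
```

If the check returns True, python-docx can open the file with `Document(path)` instead of reading it as text. Note also that saving a python-docx `Document` under a name ending in `.pdf` still writes .docx content; the `convert` function the script already imports from docx2pdf is meant for producing the PDF, and as far as I know it relies on Microsoft Word being installed.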

**Gribouillis** · (This post was last modified: Mar-26-2023, 12:12 PM by Gribouillis.)

(Mar-26-2023, 10:37 AM)Melcu54 Wrote: Can you update my code so that it works with the chardet module?

No, but you could run the following in a terminal or cmd window to guess your file's encoding:

python -m chardet.cli.chardetect FILENAME

For example, I have a file named 'ham', and here is the output of this command:

Output:
λ python -m chardet.cli.chardetect ham
ham: ascii with confidence 1.0
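
The same guess can be made from inside Python. This is a rough sketch, assuming chardet has been installed with `pip install chardet`; the wrapper name `guess_encoding` is made up for illustration. It returns None when chardet is missing or cannot make a guess, and the result is only a statistical estimate, not a guarantee.

```python
from typing import Optional, Tuple

try:
    import chardet  # third-party package, not part of the standard library
except ImportError:
    chardet = None


def guess_encoding(data: bytes) -> Optional[Tuple[str, float]]:
    """Return (encoding, confidence) guessed by chardet, or None."""
    if chardet is None:
        return None
    result = chardet.detect(data)  # dict with 'encoding' and 'confidence'
    if result["encoding"] is None:
        return None
    return result["encoding"], result["confidence"]


print(guess_encoding(b"plain ascii text"))
```

The guessed name can then be passed as the `encoding` argument when decoding the bytes. This only helps for text files; it will not make a binary .docx readable as text.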
