Python Forum
Encoding problems on multiple files in one folder
#1
Hello, I'm pretty new to Python, so please bear with me if I ask any beginner's questions.

I'm trying to work on multiple files in a folder, and my code seems to be working, but I have an encoding problem: "\ufeff" is included in some of my printed text.

I have saved all the files in UTF-8 format and used encoding = "utf-8" in my code. It always worked before.

Can anyone identify why it doesn't work, or how I could alter my code (the placement of encoding = "utf-8", for example)?



import os

path = "/Users/nikolajkorsgaard/Documents/Eksamenen"

for filename in os.listdir(path):
    with open(filename, "r", encoding = "utf-8") as f:
        if filename.endswith(".txt"):
            for line in f:
                print(line.split())
Reply
#2
If there are problems with the encoding, you can 'ignore' or 'replace' the offending characters. There are
other options as well. They are described in the documentation for open() under Built-in Functions.
Look for the "errors" argument.
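
To make the "errors" argument concrete, here is a minimal sketch of what 'ignore' and 'replace' do when a byte cannot be decoded. The sample bytes and the helper name decode_with are made up for illustration; they are not from the original post.

```python
# Bytes that are not valid UTF-8: b"\xe6" is the Latin-1 encoding of "æ".
raw = b"Eksamen \xe6 test"


def decode_with(errors):
    """Decode the sample bytes as UTF-8 using the given error handler."""
    return raw.decode("utf-8", errors=errors)


# 'ignore' silently drops the undecodable byte.
ignored = decode_with("ignore")
# 'replace' substitutes the Unicode replacement character U+FFFD.
replaced = decode_with("replace")

print(repr(ignored))
print(repr(replaced))
```

Neither handler recovers the lost character. They only keep the decode from raising UnicodeDecodeError, which is what the default handler 'strict' would do here.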

But this won't fix broken encodings. To fix a broken encoding, you can use tools like ftfy.
This is one approach to fixing errors:

from pathlib import Path
import ftfy


path = Path.home() / "Documents/Eksamenen"


def read_text(file):
    # read the raw bytes of the file
    binary_data = file.read_bytes()
    try:
        # try to decode the bytes as UTF-8 (the default for bytes.decode)
        text = binary_data.decode()
        print(file, 'has right encoding')
    except UnicodeDecodeError:
        # decoding failed: read the file as UTF-8 text, ignoring
        # undecodable bytes, then let ftfy try to repair the result
        text = ftfy.fix_encoding(file.read_text(encoding='utf-8', errors='ignore'))
        print(file, 'has wrong encoding. Fixed the encoding with ftfy.')
    return text


def read_text_files(root):
    root = Path(root)
    results = []
    for text_file in root.glob('*.txt'):
        text = read_text(text_file)
        results.append(text)
    return results
If you are not interested in fixing broken encodings, you can ignore or replace them and then ftfy is not needed.
With your original loop:
for filename in os.listdir(path):
    if filename.endswith(".txt"):
        # os.listdir returns bare names, so join them with the folder
        with open(os.path.join(path, filename), encoding="utf-8", errors='ignore') as f:
            for line in f:
                print(line.split())
In my example I use Path objects, because they are easier to handle.
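
As a rough illustration of why Path objects are easier here: os.listdir returns bare file names, so open(filename) looks in the current working directory unless the name is joined with the folder, whereas Path.glob yields full paths and filters by pattern in one step. The temporary folder and file names below are placeholders for the sketch, not the real Eksamenen folder.

```python
import tempfile
from pathlib import Path

# Build a throwaway folder with one .txt file and one other file.
folder = Path(tempfile.mkdtemp())
(folder / "a.txt").write_text("first file\n", encoding="utf-8")
(folder / "notes.md").write_text("not a text file\n", encoding="utf-8")

lines = []
# glob('*.txt') yields full Path objects, so no manual joining or
# endswith() check is needed before opening each file.
for text_file in sorted(folder.glob("*.txt")):
    with text_file.open(encoding="utf-8") as f:
        lines.extend(line.split() for line in f)

print(lines)
```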
Reply
#3
Is there a simpler way to target just "\ufeff"? When I look at the printed output in my shell, it's only the title of each document that has the encoding problem.

I've removed unwanted punctuation and symbols with:

for char in '•,.?!)(][%$#&\/=;:':
   word = word.replace(char,'')
Can I do it like that, in a simple way with only a few lines of code?

import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

import os
import string

path = "/Users/nikolajkorsgaard/Documents/Eksamenen"


d={}
for filename in os.listdir(path):
    with open(filename, encoding = "utf-8") as f:
        if filename.endswith(".txt"):
            for word in f:
                word=word.lower()
                for char in '•,.?!)(][%$#&\/=;:':
                    word = word.replace(char,'')
                tokens = nltk.word_tokenize(word)
                print(tokens)
Reply
#4
It looks like you have files with different encodings.
see https://stackoverflow.com/questions/1791...hon-string
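
One likely explanation, offered as an assumption about these files rather than something confirmed in the thread: "\ufeff" is the byte order mark (BOM) that some editors write at the very start of a UTF-8 file, which would fit it showing up only on the first (title) line. A minimal sketch with a made-up sample file shows that Python's "utf-8-sig" codec strips a leading BOM, while plain "utf-8" keeps it as a character:

```python
import os
import tempfile

# Write a small file that starts with the UTF-8 byte order mark.
tmp_dir = tempfile.mkdtemp()
sample = os.path.join(tmp_dir, "sample.txt")
with open(sample, "wb") as f:
    f.write(b"\xef\xbb\xbfTitle line\nsecond line\n")

# Plain "utf-8" decodes the BOM as the character "\ufeff".
with open(sample, encoding="utf-8") as f:
    plain = f.readline()

# "utf-8-sig" removes a leading BOM if one is present.
with open(sample, encoding="utf-8-sig") as f:
    stripped = f.readline()

print(repr(plain))
print(repr(stripped))
```

"utf-8-sig" also reads files without a BOM normally, so it can be used for a folder that mixes both kinds. Alternatively, adding "\ufeff" to the characters removed with word.replace(char, '') would drop it after reading.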

Reply
#5
Never mind, will a moderator please delete this thread? This is an exam project, so I probably shouldn't be writing this, now that I think about it.
Reply
#6
(Jun-05-2019, 04:05 PM)NikolajKorsgaard Wrote: Never mind, will a moderator please delete this thread? This is an exam project, so I probably shouldn't be writing this, now that I think about it.
That's one of the exact reasons we don't delete posts - if you shouldn't have posted, then it was academic dishonesty, and we want professors to be able to discover that.
Reply

