Hello, I'm pretty new to Python, so please bear with me if I ask any beginner's questions.
I'm trying to work on multiple files in a folder, and my code seems to be working, but I have an encoding problem: I get "\ufeff" included in some of my printed text.
I have saved all the files in UTF-8 format and used encoding="utf-8" in my code. It always worked before.
Can anyone identify why it doesn't work, or how I could maybe alter my code (the placement of encoding="utf-8", for example)?
import os

path = "/Users/nikolajkorsgaard/Documents/Eksamenen"

for filename in os.listdir(path):
    with open(filename, "r", encoding="utf-8") as f:
        if filename.endswith(".txt"):
            for line in f:
                print(line.split())
You can 'ignore' or 'replace' characters if there are problems with the encoding. You also have some other options. It's described here:
Built-in Functions
Look for the argument "errors".
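As a hedged illustration of what the errors argument does (the byte string below is made up to simulate a file containing an invalid UTF-8 byte, it's not from your files):

```python
# Sketch: behaviour of the errors argument when decoding.
# 0xff is not valid anywhere in UTF-8, so plain .decode() would raise.
data = b"caf\xff"

# 'ignore' silently drops the undecodable byte
print(data.decode("utf-8", errors="ignore"))   # caf

# 'replace' substitutes the replacement character U+FFFD
print(data.decode("utf-8", errors="replace"))  # caf�
```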
But this won't fix broken encodings. To fix a broken encoding, you can use tools like
ftfy.
This could be an approach to fix the errors:
from pathlib import Path

import ftfy

path = Path.home() / "Documents/Eksamenen"


def read_text(file):
    # read the raw bytes of the file
    binary_data = file.read_bytes()
    try:
        # try to decode the binary data as utf8
        text = binary_data.decode()  # utf8 is implicitly the default
        print(file, 'has the right encoding')
    except UnicodeDecodeError:
        # if decoding raised a UnicodeDecodeError,
        # read the file as text, interpreting it as
        # utf8 and ignoring all errors,
        # then try to fix the broken text with ftfy,
        # a module that can repair broken encodings
        text = ftfy.fix_encoding(file.read_text(errors='ignore'))
        print(file, 'has the wrong encoding. Fixed the encoding with ftfy.')
    return text


def read_text_files(root):
    root = Path(root)
    results = []
    for text_file in root.glob('*.txt'):
        text = read_text(text_file)
        results.append(text)
    return results
If you are not interested in fixing broken encodings, you can ignore or replace the offending bytes, and then ftfy is not needed.
With your original function:
for filename in os.listdir(path):
    with open(filename, "r", encoding="utf-8", errors='ignore') as f:
        if filename.endswith(".txt"):
            for line in f:
                print(line.split())
In my example I use Path objects, because they are easier to handle.
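A quick sketch of why (the folder name here is only illustrative): a Path object carries its full location, so opening it does not depend on the current working directory the way bare names from os.listdir() do:

```python
from pathlib import Path

# illustrative folder name, not necessarily yours
root = Path.home() / "Documents" / "Eksamenen"

# each text_file is a full path, so open()/read_text() work
# regardless of the current working directory
for text_file in root.glob("*.txt"):
    print(text_file.name)

# joining and inspecting paths is also simpler than with os.path
demo = root / "example.txt"
print(demo.suffix)  # .txt
```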
Is there a simpler way to target just "\ufeff"? When I look at my printed shell output, it's only the titles of the documents that have an encoding problem.
I've removed unwanted punctuation and symbols with:
for char in '•,.?!)(][%$#&\/=;:':
    word = word.replace(char, '')
Can I do it like that in a simple way, with just a few lines of code?
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

import os
import string

path = "/Users/nikolajkorsgaard/Documents/Eksamenen"

d = {}

for filename in os.listdir(path):
    with open(filename, encoding="utf-8") as f:
        if filename.endswith(".txt"):
            for word in f:
                word = word.lower()
                for char in '•,.?!)(][%$#&\/=;:':
                    word = word.replace(char, '')
                tokens = nltk.word_tokenize(word)
                print(tokens)
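For what it's worth, "\ufeff" is the byte-order mark (BOM) that some editors write at the start of a UTF-8 file, which is why it only shows up in the first word of each document. Opening the files with encoding="utf-8-sig" instead of "utf-8" strips it automatically. A small sketch (the bytes literal stands in for a real file on disk):

```python
# \xef\xbb\xbf is the UTF-8 encoding of the BOM, U+FEFF
bom_bytes = b"\xef\xbb\xbfTitle of document"

# plain utf-8 keeps the BOM as a \ufeff character
print(repr(bom_bytes.decode("utf-8")))      # '\ufeffTitle of document'

# utf-8-sig strips the BOM on decode
print(repr(bom_bytes.decode("utf-8-sig")))  # 'Title of document'
```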
Never mind, will a moderator please delete this thread? This is an exam project, so I probably shouldn't be writing this, now that I think about it.
(Jun-05-2019, 04:05 PM)NikolajKorsgaard Wrote: [ -> ]Never mind, will a moderator please delete this thread? This is an exam project, so I probably shouldn't be writing this, now that I think about it.
That's one of the exact reasons we don't delete posts - if you shouldn't have posted, then it was academic dishonesty, and we want professors to be able to discover that.